User Documentation¶
Getting Started¶
The Big Query system's main features are described in detail below. Before that, let's get familiar with the flow and nomenclature of the system.
Login¶
A CarWale associate can log in to the system using their CarWale account credentials. Google authentication is securely integrated with the system.
Note
By default, a new user's access is blocked for security reasons, so please contact the system admin to be granted access.
Overview¶
The general flow of using the system is
- Submit a query
- Query is sent to server and an ID is returned
- Query is queued and then processed
- Result is returned to the user
- This result can be viewed any time from the job history
Beyond ad-hoc queries, if a user wants to generate a recurring report, they can schedule the job; this schedule request will be reviewed by the Admin.
Nomenclature¶
To get familiar with the UI, you need to understand some terms used in the system.
- Job
- Any query submitted by the user to the system is called a Job
- Job Statuses
- Queued: The job is submitted to the server and waiting to get processed in spark
- Processing: The job is being processed in spark
- Completed: The job is successfully processed and the result is generated
- Failed: The job failed during execution
- Killed: The user has killed the job after submission
- Cleaned: The job's result file stored on the server has been cleaned up (deleted)
- Schedule Job Statuses
- Requested: The job is submitted successfully and waiting for admin’s approval
- Approved: The job is approved by admin and is ready to be executed on a schedule
- Rejected: The job is rejected because something is wrong with it
- Cancelled: The job has been cancelled from scheduled execution
- Resubmit
- The job will not be resubmitted instantly; instead, this opens the Submit Job Page with the previous job's data populated
- Execution Time
- The time taken for the job's execution, measured from the time of submission
Features¶
In this section we’ll discuss in detail the features of the Big Query system.
Submit Job¶
The Submit Job Page allows a user to get quick results for ad-hoc jobs. The result of a job is stored in a file.
Procedure to submit a job
- Choose the date-time range
- Select the Cassandra table name
- Select the number of rows to fetch
- Select the Categories on which you want to query (applicable only for the eventdata table)
- Enter the query
- You can also use the advanced join functionality to join with a whitelisted MySQL table.
- (Optional) Subscribe to an email notification on completion of the job
- Submit the job
- Currently, at most 100,000 rows can be fetched by any job. If your output contains more than 100,000 rows, you can save the data to one of the databases given in the dropdown; the table into which you are dumping the rows must already exist in that database. A sample ad-hoc query is shown below.
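As an illustration, a typical ad-hoc job on the eventdata table might use a query like the one below. The getpagetype function and the pageinfo and category columns are described in the Custom Functions section; the particular aggregation is only an assumed example of what a user might want:
select getpagetype(pageinfo) as pagetype, count(*) as views from eventdatanew where category='PageViews' group by getpagetype(pageinfo)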
See also
See the restrictions
Job Result¶
This is the page where the result of the job is displayed.
For Completed jobs, the result is shown with pagination of 20 rows per page. The full job result can be downloaded as a CSV file. For Failed jobs, the page shows the error message so that the user can correct the earlier job. A Resubmit option is also available, which populates the job details in the Submit Job Page.
Job History¶
This page shows the history of all the jobs submitted by the user.
The user can filter the jobs based on the category (applicable to queries on the eventdata table) and status, and can also sort them by submit time or execution time. Any job's result can be viewed by clicking on the Result button corresponding to the job.
Scheduling¶
The Job Schedule Page allows a user to schedule recurring jobs; their results will be stored in a MySQL database.
Procedure to schedule a job
- Select the frequency of the job
- Select the date for the initial execution of the job
- Select the relative date range for the data
- Select the Cassandra table name
- Select the Categories on which you want to run the job (applicable only for the eventdata table)
- Enter the query
- You can also use the advanced join functionality to join with a whitelisted MySQL table.
- (Optional) Select the DataStore in which you want to store the output of the job
- Select the database from the dropdown.
- Enter the table name (it must be an existing table)
- Submit the job (a sample report query is shown after this list)
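For example, a recurring daily report could count page views per website. The getwebsiteinfo function and the pageinfo and category columns come from the Custom Functions section below; the aggregation itself is just an illustrative assumption:
select getwebsiteinfo(pageinfo) as website, count(*) as pageviews from eventdatanew where category='PageViews' group by getwebsiteinfo(pageinfo)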
Users can view their scheduled job history and can filter the jobs based on status. An approved job's execution history can be viewed by clicking on the Execution History button corresponding to the job. A sample result of 10 rows is displayed on the execution history result page.
The user can view the remarks made by the admin on any job by expanding the job tile on the page. They can also cancel a scheduled job.
Note
- The user will get an alert E-Mail if the job execution fails on any day
- The approved scheduled jobs run at 02:15 AM every day
Important
Credentials for Read Only access to the Default MySQL Data store are:
- DB Host: 10.10.3.70
- DB Port: 3306
- Database: bigquery_datastore
- UserName: big.query
- Password: carwale
You need AWS VPN access to connect to this MySQL database. For other Data Stores' credentials, please contact the admin.
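Once connected over the AWS VPN, stored results can be read with ordinary MySQL queries. The table name below is hypothetical; it stands for whichever table your scheduled job writes to:
select * from bigquery_datastore.my_report_table limit 20;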
Admin Panel¶
The admin panel is for system admins to manage users, view all jobs submitted by the users and review the schedule job requests by all the users.
Querying¶
Job queries can be written in standard SQL syntax and will run as Spark jobs; the user does not need to worry about the complexity involved in creating, running and managing Spark jobs.
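For instance, a query as simple as the following (on the eventdatanew table used throughout this documentation; the grouping is just an illustrative choice) is submitted as-is and executed as a Spark job:
select category, count(*) as events from eventdatanew group by category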
Supported SQL¶
Spark SQL supports the vast majority of SQL features, such as:
- SQL query statements, including:
- SELECT
- GROUP BY
- ORDER BY
- CLUSTER BY
- SORT BY
- All SQL operators, including:
- Relational operators (=, ==, <>, <, >, >=, <=, etc.)
- Arithmetic operators (+, -, *, /, %, etc.)
- Logical operators (AND, OR, etc.)
- Mathematical functions (SIGN, LN, COS, etc.)
- String functions (INSTR, LEN, LOWER, REPLACE, etc.)
- Aggregate functions (COUNT, AVG, SUM, MIN, etc.)
- Joins and Unions
- JOIN
- {LEFT|RIGHT|FULL} OUTER JOIN
- LEFT SEMI JOIN
- CROSS JOIN
- UNION
- Sub-queries
- SELECT col FROM (SELECT a + b AS col FROM t1) t2
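As a combined sketch of these features, using hypothetical table and column names rather than anything from the system:
select lower(name) as name, count(*) as cnt, avg(price) as avg_price from t1 where price >= 1000 and category <> 'NA' group by lower(name) order by cnt desc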
See also
We could not cover all the available SQL functions here; please refer to the Spark SQL documentation for the full list.
Custom Functions¶
Apart from the functions SQL supports, we have defined and registered our own custom functions to ease the extraction of data and the operations performed on it. The custom functions are:
getpagetype
This function takes a page URL as its input argument and returns the page type (ex: 'ModelPage', 'QuotationPage', etc.)
Example :
select getpagetype(pageinfo) from eventdatanew where category='PageViews'
getvaluefromlabel
This function takes three input arguments: a label string (ex: key1=value1|key2=value2), a key string, and a default value (this string is returned if the specified key is not found). It returns the value corresponding to the key.
Example :
select getvaluefromlabel(label,'make','NA') from eventdatanew where category='QuotationPage'
getvaluefromcookie
This function takes three input arguments: a cookie string (ex: key1:value1|key2:value2), a key string, and a default value (this string is returned if the specified key is not found). It returns the value corresponding to the key.
Example :
select getvaluefromcookie(cookie,'_cwutmz','NA') from eventdatanew
getwebsiteinfo
This function takes a page URL as its input argument and returns website information (ex: 'CW-MSite', 'BW-DesktopSite', etc.)
Example :
select getwebsiteinfo(pageinfo) from eventdatanew
datetime
This function is useful for converting the logdatetime column, which is Unix time, to a SQL Timestamp.
Example :
select date(datetime(logdatetime)) from eventdatanew
calculatetimespent
This function is useful for getting the time spent by the user on the previous event/page. The previous page can be accessed from the referrer field in the eventdatanew table.
Example :
select referrer, calculatetimespent(cookie) from eventdatanew where category = 'PageViews'
getmapfromlabel
This function can be used to parse a label into a key-value map.
Example :
select getmapfromlabel(label) from eventdatanew where category='QuotationPage'
getvaluefrombigmap
This function retrieves the value for a given key and subkey from a BigMap (Map<String,Map<String,Int>>). This data structure is used in the userprofiling tables.
Example :
select getvaluefrombigmap(userdata,'models','Nano',0) from userprofilesdaily
getstringfrommap
This function retrieves a string value for a given key from a Map<String,String>. It is similar to getvaluefromlabel.
Example :
select getstringfrommap(userpreference,'budgetsegment','NA') from userprofilesagg
getintfrommap
This function retrieves an integer value for a given key from a Map<String,Int>.
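A usage sketch, assuming this function follows the same three-argument shape as getstringfrommap (map, key, default) and using a hypothetical Map<String,Int> column named somecounts:
select getintfrommap(somecounts,'somekey',0) from userprofilesagg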
Query from S3¶
You can also query files stored on S3 that are of type orc, parquet or csv. Two types of queries are supported on S3: getschema and getdata.
- getschema : Gives you the schema of the data stored in the S3 file(s).
- getdata : Gives you the result of the query applied on the data.
Let's look at each of them in a little more detail.
getschema
To get the schema of data stored on S3, you need to provide a list of comma-separated URLs and the type of the files. If you provide more than one URL, the result will be the combined schema of the files you specify. Suppose you provide two comma-separated URLs s3a://path/to/file1,s3a://path/to/file2 with columns [a,b] and [c,d] respectively, and the file type is orc. The resultant schema is the union of the two schemas, i.e. [a,b] U [c,d] = [a,b,c,d].
Note
You can get the schema of files of the same type only, i.e. you cannot get the schema of file types orc and csv at the same time. If you provide multiple URLs in comma-separated format, the type of all the files must be the same; otherwise an error will be thrown.
Get schema input page
Get schema output page
getdata
To perform a SQL query on data stored on S3, you need to provide:
- urls : a list of comma-separated URLs, e.g. s3a://path/to/file1,s3a://path/to/file2,s3a://path/to/file3
- file type : orc, parquet or csv (the type of the files)
- table name : e.g. table01. A temporary table is created with the provided table name.
- Query : e.g. select * from table01. The query must contain the provided table name; otherwise you will get an error.
Point two of getschema applies here as well, i.e. the query is performed on the union of the schemas.
If you want to query files of different types, provide each type as a separate input source (from the UI you need to add a new source):
- Files of type orc (source1)
- urls : a list of comma-separated URLs, e.g. s3a://path/to/file1,s3a://path/to/file2,s3a://path/to/file3
- file type : orc
- table name : table01. A temporary table named table01 will be created.
- Files of type csv (source2)
- urls : a list of comma-separated URLs, e.g. s3a://path/to/file1,s3a://path/to/file2,s3a://path/to/file3
- file type : csv
- table name : table02. A temporary table named table02 will be created.
- Query : select * from table01 where somecolumn in (select column1 from table02)
- The query should only contain the provided table names; otherwise you will get an error.
The type of the files provided in each source must be the same, i.e. you cannot have files of type orc and csv in the same source.
Note
You can also apply custom functions, i.e. the SQL query below is valid.
select getvaluefromlabel(label,'somekey','NA') from table01 where category='somecategory'
Get data input page
Get data output page
Note
- The URL of an S3 file must start with s3a://, i.e. s3a://path/to/file
- When querying from multiple URLs, provide them in comma-separated format without quotes, i.e. s3a://path/to/file1,s3a://path/to/file2,s3a://path/to/file3
Restrictions¶
There are some restrictions, as this system is not a relational database that supports pure SQL, and the volume of data is huge:
- You can query only 7 days of event data in a single job (there are approx. 20 million rows per day, and fetching such a huge amount of data for querying may crash the system)
- You cannot query data older than 90 days (we archive data older than 3 months and save it to S3); for data older than 90 days, you need to query from S3
- Querying may take a significant amount of time (approx. 5 minutes for a query on one day's data)
- The offset keyword doesn't work
- If querying from Cassandra, you cannot apply joins with other Cassandra tables, only with the whitelisted MySQL tables; however, Cassandra table self-joins are feasible
- If querying from S3, you can apply joins with other tables
- You cannot create tables or views
- You cannot index any columns