Getting started in BigQuery can be daunting. Even with experience writing SQL, there are some platform-specific nuances you need to learn along the way. But if you stick to it, you can unlock some really great analysis opportunities.
BigQuery enables you to interrogate raw, hit-based, user level data from Google Analytics. You can see every interaction throughout a session using page views, events and custom dimensions. You can aggregate the data across all sessions for each user to understand how often they return, what channels they come through and how frequently they do certain actions.
If you’re just getting started using SQL and/or BigQuery, or you’re looking for a scalable solution for writing long SQL codes, this blog post is for you.
A common use case for BigQuery is flagging if certain behaviours happen within a session. Think about your own website and how many different actions you can perform. Whether it be visiting key pages, clicking various buttons or trigging goal conversions, I’m sure there is a lot.
Our Data Runs Deep BigQuery Statement Builder enables you to flag all the behaviours you’re interested in at scale (so long as you’ve got the event tracking set up in Google Analytics).
Our statement builder is a Google Sheets document that consists of two parts.
This sheet gives you a handy SQL base template to extract hit-level data from BigQuery. There are 'choose-your-path' options such as creating a saved table, querying a dynamic date range and filtering on aggregated columns. It’s a great starting point for interrogating Google Analytics data so paste this code into BigQuery and save it because you will be using it a lot.
Let’s break down the components of the code.
CREATE OR REPLACE TABLE `projectname.datasetname.newtablename` AS
This statement will save the resulting data as a table in BigQuery that you can then reference within Data Studio. Just be mindful that BigQuery does charge a small amount for storing data.
SELECT fullVisitorId ,clientid ,date ,CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS uniqueVisitId
These are the common columns you’ll use with the majority of your queries. Fullvisitorid is the unique user ID for each customer. Clientid is similar to the fullvisitorid but is in the form that you can match against user explorer in the Google Analytics interface.
To get a unique session id, you can concatenate the fullvisitorid and the visit start time. This will match the Google Analytics interface when aggregating session counts. If you want to avoid the session breakage at midnight, you should concatenate fullvisitorid and visitid.
If you are wondering why we’ve put our commas at the front of each line, it’s simply to make it easier to add and remove columns without having to mess around with a trailing comma.
FROM `projectname.datasetname.ga_sessions_20200101`, UNNEST(hits) AS hits
This FROM clause pulls a single date of data from your ga_sessions table. We need to unnest hits so that we are able to use event information and page paths.
FROM `projectname.datasetname.ga_sessions_*`, UNNEST(hits) AS hits
Google Analytics data in BigQuery is stored as a partitioned table for every date. To query a date range, you need to replace the date at the end of the statement with an asterisk which in SQL means ‘select all’. To only select data within a given range, you should use _TABLE_SUFFIX in your WHERE clause.
We have given you three ready made options for this.
AND _TABLE_SUFFIX BETWEEN '20200101' AND '20200131'
This queries a fixed date range of your choosing.
AND _TABLE_SUFFIX BETWEEN '20200101' AND FORMAT_DATE( '%Y%m%d', DATE_SUB( CURRENT_DATE ('Australia/Melbourne'), INTERVAL 1 DAY))
In this statement you can select a fixed start date and a rolling, dynamic end date (such as yesterday). To get a rolling end date, the statement subtracts the current date by a set number of days and then formats it into a friendly format to match the other date. If you do not include a timezone inside the brackets, it will default to UTC timezone. You can get a list of the supported timezones here.
AND _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB( CURRENT_DATE( 'Australia/Melbourne'), INTERVAL 30 DAY)) AND FORMAT_DATE('%Y%m%d', DATE_SUB( CURRENT_DATE ('Australia/Melbourne'), INTERVAL 1 DAY))
This last option can be used to query a rolling time frame, in this example, a rolling 30-day window. This statement follows the same format as the above but replaces the fixed start date with a dynamic formula.
If it is crucial for your sessions data to match the sessions reporting in the Google Analytics interface, you should always have totals.visits=1 in your WHERE clause. Google Analytics automatically filters out any non-interactive sessions but BigQuery doesn’t unless you use this statement.
GROUP BY fullVisitorId ,clientid ,date ,uniqueVisitId
GROUP BY just means that you want to group all rows with the same value together so you have one row for every unique combination of these columns. In other words, it gets rid of duplication.
HAVING behaviourcolumn > 0
Once you have flagged the behaviours you want, you can use this HAVING clause to shrink your results to when one of them is true. I use this mostly for testing if the code is working accurately.
ORDER BY fullVisitorId ASC ,date ASC ,uniqueVisitId ASC
This is simply sorts the columns based on the user first, then date and then session in ascending order. You can use DESC to make it descending order.
Now let’s get to your inputs. To build out your specific behaviours you will want the second sheet of the tool.
This sheet builds SQL statements that flag whether certain events are triggered or pages are viewed within a session.
What you need to do is fill in what you want the output column to be called and add any Event Category, Action, Label and Page Path criteria you want to identify. You can do it in any combination e.g. just Event Label or just Page Path or all four criteria.
Then the spreadsheet formulates the entries in a SQL statement that uses the following logic.
MAX( IF( eventInfo.eventCategory = 'Event Category' AND eventInfo.eventAction = 'Event Action' AND eventInfo.eventLabel = 'Event Label' AND page.pagePath = 'Page Path', 1, 0)) AS OutputColumnName
If the criteria are met (e.g. if the specific event is fired), then give a 1 otherwise give a 0. Then because we only want to know if the behaviour is performed at least once, we take the MAX. If you are interested in seeing the total number of times the event is fired within a session, change this to SUM.
There are two versions of this statement. Exact match with an ‘=’ and a REGEXP_CONTAINS.
REGEXP_CONTAINS lets you get a bit fancier with the regular expression options such as ‘^’ starts with and ‘$’ ends with. You could also do a variable with ‘.*’ with the use case of having a variable promo code in the label. Bear in mind, you will need to escape any special characters like /|*^$ using a backslash ‘\’ before them.
When you have filled in all the behaviours into the sheet, copy the relevant column into the SELECT statement of the your base query and run!
This is an example of what your output will look like with binary 1’s for yes and 0’s for no. This is a perfect output for machines when you want to do predictive modelling. However, for us humans, you’ll want to aggregate this output to user level or total sessions level to derive insights from it.
Once you’ve done that you can start to answer some interesting questions for example:
- What products or categories of products are commonly being bought together
- What are the common behaviours of the people who transact.
- What are the behaviours that happen early in the purchase journey vs later.
The options are endless!
Start building out your behavioural analyses now with our BigQuery Statement Builder.