Lately, I’ve found myself using a number of window functions in BigQuery. They are really handy functions but can be difficult to wrap your head around. In this post, I will take you through the basic principles of window functions and then in subsequent posts, I will share some examples of how you can utilise them with Google Analytics data.
SQL window functions enable you to query either a subset or full set of rows within your data set and return a value on each row of the results. You can use them to calculate running totals, rolling averages or simply to remove duplicate rows from your data set.
There are many types of functions that can be used with a window and they fall into three main categories:
- Numbering: RANK, DENSE_RANK, ROW_NUMBER
- Aggregating: SUM, AVG, COUNT, MIN, MAX
- Navigational: LEAD, LAG, FIRST_VALUE, LAST_VALUE
A window function follows this general format though you won’t necessarily use each clause in every statement:
1. Function_name: This is the analytics function of your choice e.g. SUM, RANK, LEAD
2. Expression: This is the column you are querying or can be a CASE WHEN logical statement. It is left blank for some numbering functions like RANK and ROW_NUMBER
3. OVER: This determines the window or the set of rows the function will operate within. If you leave it blank between the parentheses, it will query all rows in the data set.
4. PARTITION BY: This clause is essentially like a group by. If you partition by fullvisitorid, the function will query all the rows for each user at a time. If you partition by fullvisitorid and visitstarttime, the function will query all the rows within each session for each user.
5. ORDER BY: Sometimes you will need to order the rows in your partition. This is important for navigational and numbering functions so the query knows where to start and finish. You can order by multiple columns and order in ascending (ASC) or descending (DESC) order.
6. Frame_clause: This section lets you choose a subset of rows within your partition. If you wanted to query a running total or a rolling average you specify here.
Still with me? Here’s some simple examples to illustrate it in action.
Example 1 - Find the total sales across the full catalogue
If you want to query all rows in the data set, all you need to do is omit the PARTITION BY/ORDER BY and leave between the parentheses of the OVER clause as blank (). The example below is summing the sales for all products in the data set.
Example 2 - Find the total sales within each category
A partition is like a GROUP BY function except this time you get the aggregated sum on each row of the data set. The statement below sums the sales for each category.
Example 3 - Find the total sales within each subcategory
You can partition by multiple columns with simply a comma between them and it will sum all the values within each partitioned column combination, in this case each category and subcategory combo.
Now that you have the total / subtotals on each row, you can do calculations across the columns. For example, what percent of total sales came from each product.
Example 4 - Find the top selling products in each category
Another window function you could use is RANK. This requires an ORDER BY clause. The example below will rank each product within each category from highest sales to lowest (ORDER BY Sales DESC).
Rank can be useful for limiting your results to the top X products.
Example 5 - Find the running total
A frame clause follows this format: ROWS BETWEEN X AND Y (where X is the starting point and Y is the end point of the window).
You can either designate a set number of rows (e.g. 5 PRECEDING) or you can use UNBOUNDED means unlimited number of rows. If you wanted to query the full set of rows you would say UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING which is simply all rows before and after the current row.
Below is an example of a running total. In this example, I have used ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, because I want to add all the sales before and up to the current row. On the 3rd August, the running total of $25,400 is the result of adding $9,000 + $15,000 (UNBOUNDED PRECEDING) and $1,400 (CURRENT ROW).
The frame clause gives you flexibility depending on your use case. You could add the sales from the current row and the next row using ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING. Alternatively, if you want to add 5 rows before and after the current row, then your statement will look like this ROWS BETWEEN 5 PRECEDING and 5 FOLLOWING.
Example 6 - Rolling Average
Below is an example using a frame clause to calculate a 3 day rolling average of sales. The function changes to an AVG (instead of SUM) and the frame clause looks at ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.
Note: the first two rows are not really a 3 day average because there are not 2 preceding rows in the data set. Another thing to be mindful of is the number of rows in your partitions because this could have unexpected results if they are uneven. For example, if you had a date with no sales, there could be a missing row in the data set. When the statement looks back 2 rows, it will actually be looking back 3 dates resulting in a 4 day rolling average.
So these are the basic principles of window functions. You can find the full list of the compatible functions that can have an OVER clause in BigQuery on these pages (Numbering, Aggregating, Navigational).
Stay tuned for some specific examples using Google Analytics data in BigQuery that you can use to up your BigQuery game.