We previously did a blog post about identifying session breakage using BigQuery. Today we are going to show how you can correct historical data and re-attribute your conversions to the correct traffic source.
Let’s say your third party payment gateway is breaking your sessions because it takes your user to a different domain. You find sessions with your payment provider as the source (Afterpay, PayPal etc) having extremely high conversion rates.
The first thing you want to do is prevent this from happening in the future. Add any payment domains to your Referral Exclusion List. This will stop Google Analytics from restarting the user’s session when one of these domains is found as the referrer.
This is not a retroactive solution so you can use the following method to re-attribute the session to the last non-payment traffic source. This attributes the conversion to the genuine source rather than your payment gateway.
If you only need to backtrack one session, you can use a LAG function with a fixed offset (we will do another post on LAG). However, what we’re showing you today is when you need to dynamically offset the data based on the value found. You might need to skip three consecutive “sessions” for User A and only one for User B.
We are going to use a navigational window function to achieve this. If you need an intro to window functions as a concept read this blog first.
Navigational window functionsallow you to find a value on a different row and return it to your current row. There are several functions but today we’ll cover:
- FIRST_VALUE - returns the value on the first row of your ordered partition
- LAST_VALUE - returns the value on the last row of your ordered partition. Note, if you want to find the last value of the entire partition, you will need a frame clause like UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, it will give you the final value up until the current row.
Let’s take the example of landing and exit pages. These are session based metrics in Google Analytics but in BigQuery, you need to calculate them based on the entrance / exit hits. The landing page is the hits.page.pagePath when hits.isEntrance is true and the exit page is the hits.page.pagePath when hits.isExit is true.
This, however, falls down if you were wanting to look at various hits in the session. As you can see, you don’t get the landing page or exit page on each row. If you were to filter on sessions that landed on /blog, you would be filtering out all subsequent hits of those sessions, making your data less useful to you.
Window functions, by contrast, can add the value to each row. We can take the same CASE WHEN statement and put it into a FIRST_VALUE and LAST_VALUE window function. We need to partition by each user and session so that we are treating each session individually. Then we need to order by hitnumber so we find the first and last hit of the session sequentially.
To find the last row of the partition, we need to use the frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, it will give you the last value up until the current row.
Now we have the landing page on every row of the session. But you can see that LAST_VALUE function is not working for the exit page. This is because the last hit of the session is not the exit hit. It is common for a user to have events fire as the last hit because they are interacting with the final page. Sometimes your first hit of the session might not be a pageview hit due to your tags firing in a different order than expected.
To get around this, we are going to use the IGNORE NULLS clause. This is a great addition to your FIRST_ and LAST_VALUE functions and will be crucial for our re-attribution solution that we’re slowly getting to.
You can specify to IGNORE or RESPECT NULLS after your field you’re assessing. In our exitPage function, we are using a CASE WHEN so it goes after the END and before the closed bracket.
Now the function will find the last non-NULL value in the partition which in this case is the page “/blog/bigquery” and we now have the landing page and exit page on every row of the session.
So now that we’ve covered the basics. Let’s get back to our re-attribution solution. We are going to use LAST_VALUE to give us the last non-payment traffic source on each relevant row.
First, let’s set up a simple query with the source/medium and transaction data for each session. If you don’t recognise the “||”, this is a concatenation operator that you can use in place of the CONCAT function.
SELECT fullvisitorid ,visitstarttime ,trafficSource.source||' / '||trafficSource.medium AS sourceMedium ,totals.transactions ,totals.totalTransactionRevenue / 1000000 AS revenue FROM `project_name.dataset_name.ga_sessions_202011*`
Next, we are going to replace every unwanted traffic source with NULL and then ignore said NULL to give us the last real traffic source for the user.
Our CASE statement will look like:
CASE WHEN trafficsource.source = 'afterpay' THEN NULL ELSE trafficsource.source END
You can see that for every session where Afterpay is the traffic source, the CASE WHEN statement is generating a NULL.
Now let’s insert that CASE statement into our LAST_VALUE window function, ignoring the NULLs. We need our partition to be at the user level (fullvisitorid) and we are going to order by visitstarttime so we have the sessions in chronological order.
The last part of the window function is the frame clause. This time we only want to consider the traffic sources prior to and including the current row. If the current row has a NULL we want to take the previous row to that. LAST_VALUE defaults to this frame, but just for completeness, we will specify the frame clause (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) so that others reading our code know what’s going on.
,LAST_VALUE( CASE WHEN trafficsource.source = 'afterpay' THEN NULL ELSE trafficSource.source||' / '||trafficSource.medium END IGNORE NULLS) OVER ( PARTITION BY fullvisitorId ORDER BY visitStartTime ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS reattributed_source
Now any time the query finds a NULL (an unwanted traffic source), it will look to the preceding row for the value. If that is a NULL it will keep going backwards until it finds a non-NULL traffic source.
There are some things you might want to think about with this method:
- How you treat true direct sessions
- How long has passed since the previous session that you’re getting the re-attributed channel from (e.g. you might want to cap it at 24hours or 7 days)
- What other metrics need correcting (excluding the broken sessions from session counts, totalling timeOnSite across the sessions etc).
This is another example of the flexibility that BigQuery enables you over and above what is possible in Google Analytics. We have more examples linked below and stay tuned for more practical window function solutions to come.
SELECT fullvisitorid ,visitstarttime ,trafficSource.source||' / '||trafficSource.medium AS sourceMedium ,totals.transactions ,totals.totalTransactionRevenue / 1000000 AS revenue ,LAST_VALUE( CASE WHEN trafficsource.source = 'afterpay' THEN NULL ELSE trafficSource.source||' / '||trafficSource.medium END IGNORE NULLS) OVER ( PARTITION BY fullvisitorId ORDER BY visitStartTime ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS reattributed_source FROM `project_name.dataset_name.ga_sessions_202011*`