Skip to main content
Data Runs Deep Menu

Main navigation

  • Home
  • Who We Are
    • Careers
    • What We're Up To
    • What We Believe In
    • Ethical Policy
  • What We Do
  • Google Marketing Platform
    • Analytics 360
    • Tag Manager 360
    • Salesforce and GA360
    • Google Analytics App+Web
    • Data Studio
    • Optimize
  • Google Cloud Platform
    • Google BigQuery
      • Google BigQuery Training
      • Google Analytics Integration
  • Consulting
    • Strategy
    • Implementation
      • Google Analytics Audit
      • Google Tag Manager Audit
      • Enhanced Ecommerce Implementation
      • Google Tag Manager
    • Analysis
      • Google Data Studio
        • Google Data Studio Case Study - Mona
      • UX Analysis
  • Training
    • Five Day Google Analytics Online Workshop
    • Advanced Google Analytics Training
    • Google Tag Manager Training
    • Google Analytics 360 Training
    • Advanced Analytics Bootcamp
  • Partnerships
    • Conductrics
  • Blog
  • The Flying Beagle
  • Merch
  • Contact

How to re-attribute your conversions to the correct traffic source in BigQuery

Use LAST_VALUE to dynamically offset data in BigQuery
By: Corinne Brooker
Posted in
  • Google BigQuery
Article Date
December 04, 2020

We previously did a blog post about identifying session breakage using BigQuery. Today we are going to show how you can correct historical data and re-attribute your conversions to the correct traffic source. 

Let’s say your third party payment gateway is breaking your sessions because it takes your user to a different domain. You find sessions with your payment provider as the source (Afterpay, PayPal etc) having extremely high conversion rates. 

The first thing you want to do is prevent this from happening in the future. Add any payment domains to your Referral Exclusion List. This will stop Google Analytics from restarting the user’s session when one of these domains is found as the referrer. 

This is not a retroactive solution so you can use the following method to re-attribute the session to the last non-payment traffic source. This attributes the conversion to the genuine source rather than your payment gateway. 

If you only need to backtrack one session, you can use a LAG function with a fixed offset (we will do another post on LAG). However, what we’re showing you today is when you need to dynamically offset the data based on the value found. You might need to skip three consecutive “sessions” for User A and only one for User B.

We are going to use a navigational window function to achieve this. If you need an intro to window functions as a concept read this blog first.

Navigational window functionsallow you to find a value on a different row and return it to your current row. There are several functions but today we’ll cover:

  • FIRST_VALUE - returns the value on the first row of your ordered partition
  • LAST_VALUE - returns the value on the last row of your ordered partition. Note, if you want to find the last value of the entire partition, you will need a frame clause like UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, it will give you the final value up until the current row. 

 

Let’s take the example of landing and exit pages. These are session based metrics in Google Analytics but in BigQuery, you need to calculate them based on the entrance / exit hits. The landing page is the hits.page.pagePath when hits.isEntrance is true and the exit page is the hits.page.pagePath when hits.isExit is true. 

CASE WHEN statement to calculate landing and exit pages in BigQuery

This, however, falls down if you were wanting to look at various hits in the session. As you can see, you don’t get the landing page or exit page on each row. If you were to filter on sessions that landed on /blog, you would be filtering out all subsequent hits of those sessions, making your data less useful to you. 

 

 

Window functions, by contrast, can add the value to each row. We can take the same CASE WHEN statement and put it into a FIRST_VALUE and LAST_VALUE window function. We need to partition by each user and session so that we are treating each session individually. Then we need to order by hitnumber so we find the first and last hit of the session sequentially. 

To find the last row of the partition, we need to use the frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, it will give you the last value up until the current row. 

Using FIRST_VALUE and LAST_VALUE to calculate landing and exit pages in BigQuery

Now we have the landing page on every row of the session. But you can see that LAST_VALUE function is not working for the exit page. This is because the last hit of the session is not the exit hit. It is common for a user to have events fire as the last hit because they are interacting with the final page. Sometimes your first hit of the session might not be a pageview hit due to your tags firing in a different order than expected. 

 

To get around this, we are going to use the IGNORE NULLS clause. This is a great addition to your FIRST_ and LAST_VALUE functions and will be crucial for our re-attribution solution that we’re slowly getting to. 

You can specify to IGNORE or RESPECT NULLS after your field you’re assessing. In our exitPage function, we are using a CASE WHEN so it goes after the END and before the closed bracket. 

IGNORE NULLS is a good addition to LAST_VALUE when calculating exit page in BigQuery

Now the function will find the last non-NULL value in the partition which in this case is the page “/blog/bigquery” and we now have the landing page and exit page on every row of the session. 

 

So now that we’ve covered the basics. Let’s get back to our re-attribution solution. We are going to use LAST_VALUE to give us the last non-payment traffic source on each relevant row. 

First, let’s set up a simple query with the source/medium and transaction data for each session. If you don’t recognise the “||”, this is a concatenation operator that you can use in place of the CONCAT function.

SELECT
fullvisitorid
,visitstarttime
,trafficSource.source||' / '||trafficSource.medium AS sourceMedium
,totals.transactions
,totals.totalTransactionRevenue / 1000000 AS revenue

FROM `project_name.dataset_name.ga_sessions_202011*`

Next, we are going to replace every unwanted traffic source with NULL and then ignore said NULL to give us the last real traffic source for the user.  

Our CASE statement will look like: 

CASE WHEN trafficsource.source = 'afterpay' THEN NULL 
     ELSE trafficsource.source 
END 
Using CASE WHEN to replace unwanted traffic sources with NULL

You can see that for every session where Afterpay is the traffic source, the CASE WHEN statement is generating a NULL. 

 

Now let’s insert that CASE statement into our LAST_VALUE window function, ignoring the NULLs. We need our partition to be at the user level (fullvisitorid) and we are going to order by visitstarttime so we have the sessions in chronological order. 

The last part of the window function is the frame clause. This time we only want to consider the traffic sources prior to and including the current row. If the current row has a NULL we want to take the previous row to that. LAST_VALUE defaults to this frame, but just for completeness, we will specify the frame clause (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) so that others reading our code know what’s going on. 

,LAST_VALUE(
            CASE WHEN trafficsource.source = 'afterpay'  THEN NULL
                 ELSE trafficSource.source||' / '||trafficSource.medium 
            END 
IGNORE NULLS)
OVER (
      PARTITION BY fullvisitorId 
      ORDER BY visitStartTime ASC 
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS reattributed_source
Using LAST_VALUE to dynamically look back to find a previous non-payment gateway source

Now any time the query finds a NULL (an unwanted traffic source), it will look to the preceding row for the value. If that is a NULL it will keep going backwards until it finds a non-NULL traffic source. 

 

There are some things you might want to think about with this method:

  • How you treat true direct sessions
  • How long has passed since the previous session that you’re getting the re-attributed channel from (e.g. you might want to cap it at 24hours or 7 days)
  • What other metrics need correcting (excluding the broken sessions from session counts, totalling timeOnSite across the sessions etc). 

 

This is another example of the flexibility that BigQuery enables you over and above what is possible in Google Analytics. We have more examples linked below and stay tuned for more practical window function solutions to come.

 

Full code:

SELECT
fullvisitorid
,visitstarttime
,trafficSource.source||' / '||trafficSource.medium AS sourceMedium
,totals.transactions
,totals.totalTransactionRevenue / 1000000 AS revenue
,LAST_VALUE( 
            CASE WHEN trafficsource.source = 'afterpay' THEN NULL 
                 ELSE trafficSource.source||' / '||trafficSource.medium END 
IGNORE NULLS) 
OVER ( 
      PARTITION BY fullvisitorId 
      ORDER BY visitStartTime ASC 
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS reattributed_source

FROM `project_name.dataset_name.ga_sessions_202011*`

Related Posts

  • Identify Potential Session Breakage Using BigQuery

    Article Date
    May 18, 2020
  • Remove Duplicate Transactions From Your Reporting in BigQuery

    Article Date
    November 20, 2020
  • Cool things you can do using window functions in BigQuery

    Article Date
    September 11, 2020
  • Use our BigQuery Statement Builder To Better Understand The Behaviours Of Your Customers

    Article Date
    February 22, 2020

Get in touch

To find out more, email us on hello@datarunsdeep.com.au or complete a quick contact form.

Get in touch

Talk to us: 1300 737 885

08:30 - 16:30, Monday to Friday. We love to chat

Visit us, we’re at:

Level 8, 313 Little Collins Street Melbourne, VIC 3000

Footer

  • Home
    • Who We Are
      • What We Do
      • What We're Up To
      • Careers
      • Contact
  • Training
    • Google Analytics Training Courses
    • Advanced Google Analytics Training
    • Google Tag Manager Training
  • Consulting
    • Analysis
    • Implementation
    • Strategy
    • Google Analytics Audit
    • Google Tag Manager Audit
  • Google Cloud Partner
  • Google Marketing Platform Sales Partner
  • Bitcoin accepted here
  • Registered Looker Consulting Partner Melbourne
  • Twitter
  • LinkedIn
  • Google Plus
  • Blog RSS feed
Copyright © 2009-2021 Data Runs Deep Pty Ltd. All Rights Reserved. Google Analytics is a trademark of Google Inc.
Website by Marin Sekesan