Skip to main content
Data Runs Deep Menu

Main navigation

  • Home
  • Who We Are
    • Careers
    • What We're Up To
    • What We Believe In
    • Ethical Policy
  • What We Do
  • Google Marketing Platform
    • Analytics 360
    • Tag Manager 360
    • Salesforce and GA360
    • Google Analytics App+Web
    • Data Studio
    • Optimize
  • Google Cloud Platform
    • Google BigQuery
      • Google BigQuery Training
      • Google Analytics Integration
  • Consulting
    • Strategy
    • Implementation
      • Google Analytics Audit
      • Google Tag Manager Audit
      • Enhanced Ecommerce Implementation
      • Google Tag Manager
    • Analysis
      • Google Data Studio
        • Google Data Studio Case Study - Mona
      • UX Analysis
  • Training
    • Five Day Google Analytics Online Workshop
    • Advanced Google Analytics Training
    • Google Tag Manager Training
    • Google Analytics 360 Training
    • Advanced Analytics Bootcamp
  • Partnerships
    • Conductrics
  • Blog
  • The Flying Beagle
  • Merch
  • Contact

Remove Duplicate Transactions From Your Reporting in BigQuery

Use ROW_NUMBER SQL window function to identify duplicate values in BigQuery
By: Corinne Brooker
Posted in
  • Google BigQuery
Article Date
November 20, 2020

A lot of companies report duplicate transactions (and therefore increased revenue) in Google Analytics. From our experience with eCommerce websites, we have found that roughly 1% of revenue is double counted when there are no custom fixes in place. The most common cause is when a customer reloads the order confirmation page in a new session, potentially on a new day and even sometimes as a new user (when the cookie has expired). 

There are technical ways you can prevent this from happening going forward (read our blog here) but if you need to correct historical reporting, you can use the following solution in BigQuery utilising a SQL numbering window function.

If you are new to window functions as a concept, read our introduction blog first.

Numbering window functions will label each row in your data set with a number based on the order of a column you specify. The three most common numbering functions differ solely on how they handle duplicate values. 

Let’s visualise a sporting event where athletes Tim and Eva tie in first place and Leo finishes next. 

The RANK function will rank Tim and Eva both in first place and Leo will come in third (as typically seen in sporting competitions). RANK will give duplicates the same number and then skip the next number or numbers (depending on how many duplicates). E.g. if it was a three way tie, there would be no 3rd place only 4th. 

How RANK() OVER window functions handle duplicate values

DENSE_RANK will also give Tim and Eva first place medals but this time Leo happily takes home a silver medal. DENSE_RANK will give duplicates the same value and will not skip the next number. 

How DENSE_RANK() OVER window functions handle duplicate values

ROW_NUMBER does not double up at all and gives a unique number to each row. Unfortunately for Tim, that means he gets bumped to second place. BigQuery will decide randomly between Eva and Tim who gets first place and this could change each time you run the query. If there were other criteria you could rank the athletes by such as the results of previous races (using multiple fields in your ORDER BY clause) you might be able to make this a fairer distinction.

How ROW_NUMBER() OVER window functions handle duplicate values

Below is an example of these functions where you are ranking customers based on totalTransactionRevenue. 

BigQuery example of RANK, DENSE_RANK and ROW_NUMBER

RANK and DENSE_RANK are good if you want to return the top selling products or top searched terms where you want to include those that generate the same amount of sales/searches (top 10 products could be 11 because two have the same revenue). But for deduping datasets, you want a unique number on each row. Therefore, we are going to use ROW_NUMBER to clean up our duplicate transactions. 

We are going to be user, session and date agnostic. If the user reloads the confirmation page on a different day or after their cookie has expired, we want to disregard the transaction. For this to work, you will need to query more than one day of ga_sessions data. 

First let’s create a simple data set of all the transactionids in August. 

WITH base AS (
SELECT
h.transaction.transactionid
,fullvisitorid
,visitstarttime
,date

FROM `project_name.dataset_name.ga_sessions_202008*` , UNNEST(hits) AS h

WHERE h.transaction.transactionid IS NOT NULL

GROUP BY
h.transaction.transactionid
,fullvisitorid
,visitstarttime
,date
)

This would be the subquery results:

Subquery results with a row per transactionid

Now in our next subquery, we will add our ROW_NUMBER counter. 

We need to define our ‘partition’ for the window function. In the case of ROW_NUMBER, the partition essentially signals where the numbering will restart. We want to restart the counter for each new transactionid so we will partition by transactionid.

Next, we need to define the order of the rows in the partition. To label the rows in chronological order, we will order by visitstarttime in ascending order.

,transactioncounter AS (
SELECT 
base.*
,ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY visitstarttime ASC) AS transaction_counter

FROM base
)

This is an example of the function in action:

Partition by transactionid to find duplicates with ROW_NUMBER

Now you will have a counter on each row of your data set, whereby the counter increases by 1 for each transactionid repeat.

Lastly, we want to filter out any transactionids with a transaction_counter greater than 1, meaning the transactionid has already appeared. You cannot filter on a window function field in the same query it was created in so we need to do a subsequent query. This will filter the results to show only the first time the transaction appears.

SELECT
transactioncounter.*

FROM transactioncounter

WHERE transaction_counter = 1 

Et voila! A nice, deduped data set with no double counting of transactions. 

Deduped transactionids with ROW_NUMBER = 1

This is just one example of what you might want to dedupe in your data set. You can use this technique to identify duplicates across any kind of field, just remember to PARTITION BY the field in question to restart the counter. 

Good luck!

 

Full code:

WITH base AS (
SELECT
h.transaction.transactionid
,fullvisitorid
,visitstarttime
,date

FROM `project_name.dataset_name.ga_sessions_202008*` , UNNEST(hits) AS h

WHERE h.transaction.transactionid IS NOT NULL

GROUP BY
h.transaction.transactionid
,fullvisitorid
,visitstarttime
,date
)

,transactioncounter AS (
SELECT 
base.*
,ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY visitstarttime ASC) AS transaction_counter

FROM base
)

SELECT
transactioncounter.*

FROM transactioncounter

WHERE transaction_counter = 1

Related Posts

  • Cool things you can do using window functions in BigQuery

    Article Date
    September 11, 2020
  • Identify Potential Session Breakage Using BigQuery

    Article Date
    May 18, 2020
  • How to re-attribute your conversions to the correct traffic source in BigQuery

    Article Date
    December 04, 2020
  • How To Calculate Table Level Percentages In BigQuery

    Article Date
    June 26, 2019
  • Use our BigQuery Statement Builder To Better Understand The Behaviours Of Your Customers

    Article Date
    February 22, 2020

Get in touch

To find out more, email us on hello@datarunsdeep.com.au or complete a quick contact form.

Get in touch

Talk to us: 1300 737 885

08:30 - 16:30, Monday to Friday. We love to chat

Visit us, we’re at:

Level 8, 313 Little Collins Street Melbourne, VIC 3000

Footer

  • Home
    • Who We Are
      • What We Do
      • What We're Up To
      • Careers
      • Contact
  • Training
    • Google Analytics Training Courses
    • Advanced Google Analytics Training
    • Google Tag Manager Training
  • Consulting
    • Analysis
    • Implementation
    • Strategy
    • Google Analytics Audit
    • Google Tag Manager Audit
  • Google Cloud Partner
  • Google Marketing Platform Sales Partner
  • Bitcoin accepted here
  • Registered Looker Consulting Partner Melbourne
  • Twitter
  • LinkedIn
  • Google Plus
  • Blog RSS feed
Copyright © 2009-2021 Data Runs Deep Pty Ltd. All Rights Reserved. Google Analytics is a trademark of Google Inc.
Website by Marin Sekesan