A lot of companies report duplicate transactions (and therefore increased revenue) in Google Analytics. From our experience with eCommerce websites, we have found that roughly 1% of revenue is double counted when there are no custom fixes in place. The most common cause is when a customer reloads the order confirmation page in a new session, potentially on a new day and even sometimes as a new user (when the cookie has expired).
There are technical ways you can prevent this from happening going forward (read our blog here) but if you need to correct historical reporting, you can use the following solution in BigQuery utilising a SQL numbering window function.
If you are new to window functions as a concept, read our introduction blog first.
Numbering window functions will label each row in your data set with a number based on the order of a column you specify. The three most common numbering functions differ solely on how they handle duplicate values.
Let’s visualise a sporting event where athletes Tim and Eva tie in first place and Leo finishes next.
The RANK function will rank Tim and Eva both in first place and Leo will come in third (as typically seen in sporting competitions). RANK will give duplicates the same number and then skip the next number or numbers (depending on how many duplicates). E.g. if it was a three way tie, there would be no 3rd place only 4th.
DENSE_RANK will also give Tim and Eva first place medals but this time Leo happily takes home a silver medal. DENSE_RANK will give duplicates the same value and will not skip the next number.
ROW_NUMBER does not double up at all and gives a unique number to each row. Unfortunately for Tim, that means he gets bumped to second place. BigQuery will decide randomly between Eva and Tim who gets first place and this could change each time you run the query. If there were other criteria you could rank the athletes by such as the results of previous races (using multiple fields in your ORDER BY clause) you might be able to make this a fairer distinction.
Below is an example of these functions where you are ranking customers based on totalTransactionRevenue.
RANK and DENSE_RANK are good if you want to return the top selling products or top searched terms where you want to include those that generate the same amount of sales/searches (top 10 products could be 11 because two have the same revenue). But for deduping datasets, you want a unique number on each row. Therefore, we are going to use ROW_NUMBER to clean up our duplicate transactions.
We are going to be user, session and date agnostic. If the user reloads the confirmation page on a different day or after their cookie has expired, we want to disregard the transaction. For this to work, you will need to query more than one day of ga_sessions data.
First let’s create a simple data set of all the transactionids in August.
WITH base AS ( SELECT h.transaction.transactionid ,fullvisitorid ,visitstarttime ,date FROM `project_name.dataset_name.ga_sessions_202008*` , UNNEST(hits) AS h WHERE h.transaction.transactionid IS NOT NULL GROUP BY h.transaction.transactionid ,fullvisitorid ,visitstarttime ,date )
This would be the subquery results:
Now in our next subquery, we will add our ROW_NUMBER counter.
We need to define our ‘partition’ for the window function. In the case of ROW_NUMBER, the partition essentially signals where the numbering will restart. We want to restart the counter for each new transactionid so we will partition by transactionid.
Next, we need to define the order of the rows in the partition. To label the rows in chronological order, we will order by visitstarttime in ascending order.
,transactioncounter AS ( SELECT base.* ,ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY visitstarttime ASC) AS transaction_counter FROM base )
This is an example of the function in action:
Now you will have a counter on each row of your data set, whereby the counter increases by 1 for each transactionid repeat.
Lastly, we want to filter out any transactionids with a transaction_counter greater than 1, meaning the transactionid has already appeared. You cannot filter on a window function field in the same query it was created in so we need to do a subsequent query. This will filter the results to show only the first time the transaction appears.
SELECT transactioncounter.* FROM transactioncounter WHERE transaction_counter = 1
Et voila! A nice, deduped data set with no double counting of transactions.
This is just one example of what you might want to dedupe in your data set. You can use this technique to identify duplicates across any kind of field, just remember to PARTITION BY the field in question to restart the counter.
WITH base AS ( SELECT h.transaction.transactionid ,fullvisitorid ,visitstarttime ,date FROM `project_name.dataset_name.ga_sessions_202008*` , UNNEST(hits) AS h WHERE h.transaction.transactionid IS NOT NULL GROUP BY h.transaction.transactionid ,fullvisitorid ,visitstarttime ,date ) ,transactioncounter AS ( SELECT base.* ,ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY visitstarttime ASC) AS transaction_counter FROM base ) SELECT transactioncounter.* FROM transactioncounter WHERE transaction_counter = 1