There are occasions where you would like to randomly select rows from your BigQuery tables; but at the same time like to ensure you keep getting the same set of rows every time you work with the dataset. For example, you might be looking to undersample a BigQuery table to handle a class imbalance problem for your machine learning model. You want to feed your model with a random sample of records. As you troubleshoot your model, you want to ensure you keep feeding it the same sample of records.
The solution is to use FARM_FINGERPRINT() to provide a “random” sample of rows from our dataset. We say “random” in inverted commas because each time you run the function, you should get the same set of results (i.e. not random because the results are repeatable). Mind blown!
In this scenario, we are looking to randomly select around 20,000 complete rows from a table with 10 million rows. Making sure the selection is well-distributed is very important, as we do not want to compromise our machine learning model through systematically biased sampling.
If you have a unique identifier field then the only thing you need to do is append the below WHERE predicate to your SELECT * query. For this example, our unique identifier for each row is the session_id column.
If you don’t have a ‘session_id’ field, you can concatenate a few other fields to make a unique identifier to feed into the FARM_FINGERPRINT() function.
For example, if you had fields for name, address and telephone, you can concatenate them and append it to the existing table using the below query.
Otherwise, you can use each record’s row number as a unique identifier - the Farm hash functions should give uniformly distributed output despite being given sequential input. As with any good hashing function, the FARM_FINGERPRINT aims to do two things.
First, for any input it should return output that has no direct correlation to the input or relative correlation to other inputs. For example, if I hash the string ‘A’ the result is ‘-1531450017087491052’. If I hash the string ‘B’ the result is ‘8703560047562643528’. Both of these hash results have no obvious relationship to either their inputs or to each other, despite being sequential values.
Secondly, the hash function consistently returns the same output for a particular input. In other words, every time I hash the string ‘B’ the result will always be ‘8703560047562643528’.
The FARM_FINGERPRINT function returns an integer result of a hashing function, which can be either positive or negative.
The ABS (or absolute) function is there to flip around any negative results returned by the farm hash into positive integers.
The MOD (or modulo) function takes an argument to determine how many equal portions you will end up with.
500 is chosen because there are 500 lots of 20,000 rows in our table of 10 million rows. We want to break the 10 million rows into 500 equally sized lots, of which we will pick one lot of 20,000 random rows. The way we pick that lot is to choose a random number between 1 and 500 - in this case 313 (our street address, a sufficiently random number) - and filter out every row that is not in this ‘lot number’.
The result is 20,000 randomly chosen rows out of the original 10 million.
There are many ways to undersample a dataset, but we have found this to be a rather simple and clean method. Best of all, we keep getting the same set of rows every time we run the above solution, making it easier to troubleshoot and iterate our models.