Happy Valentine’s Day 2021! Let’s make some love connections.
So you want to do your data science-y stuff using R. Fine. But gee whiz, the memory limitations on your computer / laptop / RStudio server really stink. You’ve got terabytes, maybe hundreds of terabytes (or petabytes?) of data, and you’re stuck downsampling to a few measly gigabytes or hundreds of megabytes because that’s all the data your system can process before it bursts into flames. What you really need is a way to munge through all of your data in a super-powerful system that can create relevant summaries or produce aggregate insights, and *THEN* download that processed data to your local system and use all of that R coolness to generate your patented, stylistic time series visualizations or whatever. Preferably, you don’t want to pay through the nose for this capability either — in fact, if you could avoid having to do anything other than load your data and run your queries…
Enter Google BigQuery. BigQuery is a serverless, global, petabyte-scale data warehouse platform that also happens to be ridiculously easy to use and laughably inexpensive. Serverless means you are not configuring and deploying servers, so you're not making decisions like how many machines, how many CPUs, how much RAM, or any of that. You literally just log into a SQL interface, point to data, and run a SQL query. That data can be KB, MB, GB, TB, or PB (and heck, if you have an EB? Let's chat…). Since you're not paying for server capacity, Google makes money by charging you for data stored (at around 2 cents per GB per month, depending on where you put it) and for the amount of data processed during a query, to the tune of about $5 per TB.
Say, for example, you have a dataset that has 200 GB of data — millions and millions of rows — and you decide to run a SELECT * FROM on that big bad. Oops. Your total charge for that massive blunder? Roughly a dollar. When I say laughably inexpensive, I really do mean laughably inexpensive.
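If you want to sanity-check that claim, the back-of-the-envelope math is just terabytes scanned times the on-demand rate (using the round $5-per-TB figure from above; real bills vary slightly by region):

```r
# BigQuery on-demand pricing: roughly $5 per TB of data processed
price_per_tb <- 5

# Our 200 GB "oops" query, expressed in TB (1 TB = 1,000 GB)
tb_scanned <- 200 / 1000

cost <- tb_scanned * price_per_tb
cost  # 1 -- about a dollar for the whole blunder
```

Scale the same formula up or down and you get the rest of the numbers in this post: 200 PB would be 200,000 TB, or about a million dollars.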
Most of the queries I run are on tables with between 50 MB and 500 MB of data. That means even if I'm sloppy and process the entire table every time I use it, never get to reuse a cached query (which is free), never follow best practices, never partition data, etc. etc. etc., my charge per query is going to run me somewhere between…
(…math in my head: 1 TB = 1,000 GB = 1,000,000 MB, so $5 per 1,000,000 MB = $0.50 per 100,000 MB = $0.05 per 10,000 MB = $0.01 per 2,000 MB, so <fade back to reality>…)
… well, let’s just say on average I have to run somewhere between 4 and 40 queries to generate a full penny’s worth of data processing charges. When you add that to the roughly 2 cents per GB I’m paying every month to store it, I’m pretty sure Google loses money on the credit card processing fees if they decide to bill me for my typical usage. And then to complicate things, they give you the first TB of query data processing and 10 GB of data storage for free each month. It’s like they’re not even trying to make money off of most of us.
(That’s not to say you can’t spend real money on queries. For example, if that oops were done on a data set that was 200 PB instead of 200 GB, then you’re looking at a million-dollar query and should have been buying dedicated capacity instead of using the on-demand pricing. Do be careful when playing with really, really, really large data! Of course, that’s what we in the business refer to as a quality problem to have…)
It’s a global platform, so you can use it with data all over the world on GCP with low latency, and even with data located on another cloud provider. But the real reason you want to use it is that you can do a whole lot of number crunching and data munging for next to nothing, cost-wise, on an enterprise-grade cloud service at scales you’d never dream of on any laptop or even most RStudio servers. It lets you extend the power of your R programming to full cloud power without costing an arm and a leg in the process. The trick, then, is making this love connection happen.
Which I have handily illustrated with an R script for you here.
There’s some setup you need to do: create a GCP account and project, and enable the BigQuery and Cloud Storage APIs. You’ll also note that I started by copying a publicly available dataset (there are hundreds of them just waiting to be explored; I’ll probably write a whole blog post on that at some point) into my own project, but then in the final code switched to querying the public dataset directly. Why pay to store it if you don’t need to?
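If you want a taste of the handshake before opening the full script, the core R-to-BigQuery round trip looks roughly like this. This is a minimal sketch using the bigrquery package; the billing project ID and service-account key path are placeholders you'd swap for your own, and I'm pointing it at the public USA names dataset as an example table:

```r
library(bigrquery)  # install.packages("bigrquery") if you don't have it

# Authenticate with a service-account key file (path is a placeholder;
# an interactive bq_auth() with your Google login works too)
bq_auth(path = "~/keys/my-service-account.json")

# Whatever GCP project should pick up the (tiny) processing bill
billing <- "my-billing-project"

# Query a public dataset directly -- no need to copy or store it yourself
sql <- "
  SELECT name, SUM(number) AS total
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE name = 'Valentina'
  GROUP BY name
"

tb <- bq_project_query(billing, sql)  # the heavy lifting runs in BigQuery
df <- bq_table_download(tb)           # only the small summary comes down to R
df
```

That last line is the whole point: BigQuery chews through the big table, and all your laptop ever sees is the aggregated result, ready for ggplot2 or whatever else you fancy.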
Odds are if you’re new to GCP there’s a lot of stuff in that script you don’t understand — maybe things like service accounts, keys, scopes, and so on are new to you. Just leave me a comment about the thing that you get hung up on when trying to do this on your own and, if enough folks have the same issue, I’ll write another post that explains how to do that particular thing or set of things. If you’ll be explicit and brave in asking your questions, I’ll try to be responsive and helpful in return. Deal?
So there you go. The first, tiny, itty-bitty baby step towards extending the power of your R code into the cloud with Google BigQuery. You’re on the precipice of doing some really amazing stuff, and I can’t wait to see what you end up doing with it!