Simple, Powerful ML at Massive Scale

Jason Eden
4 min read · Feb 16, 2021

This is going to spruce up that resume quite nicely.

One of the cool things about having connected your R and/or Python environments to BigQuery is that you don't just get access to vast quantities of public data and the ability to process them at massive scale for pennies per GB; you also inherit the ability to do machine learning right from the same interface.

What? You mean you can use a petabyte-scale data warehouse platform and build Machine Learning models right on top of it, using SQL? It’s like… it’s like…

Some things were just meant to be together.

Enter BigQuery ML.

With BigQuery ML you can leverage vast amounts of data and build both simple and complex machine learning models — linear regression, logistic regression, deep neural networks, time series, and more. I’m assuming I will run across a number of these as part of my Master’s in Health Data Science program, but the one I want to highlight at this point is the ability to interact with AutoML Tables.
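For context, each of those maps to a model_type option in BigQuery ML's CREATE MODEL statement. As a minimal sketch of an ordinary linear regression, with hypothetical project, dataset, table, and column names standing in for your own:

```sql
-- Minimal BigQuery ML linear regression sketch.
-- All names here are hypothetical placeholders.
CREATE OR REPLACE MODEL `my_project.my_dataset.simple_model`
OPTIONS (
  model_type = 'LINEAR_REG',        -- 'LOGISTIC_REG', 'DNN_REGRESSOR', etc. also exist
  input_label_cols = ['label_col']  -- the column you want to predict
) AS
SELECT feature_1, feature_2, label_col
FROM `my_project.my_dataset.training_table`;
```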

With AutoML Tables in BigQuery ML, you can generate amazingly powerful ML models without knowing much about what you’re doing. Not to oversimplify things, but feature engineering, splitting data into training, validation, and test sets, and figuring out which machine learning algorithm is best suited to solve your problem? You’re behind the times, like…

…like you’re living in an Amish paradise. (Ugh, sorry. Even I felt that one.)

OK, so Weird Al Yankovic aside, AutoML really is bleeding-edge cool. It’s been explained to me as artificial intelligence that uses artificial intelligence to do artificial intelligence, and extending it into BigQuery opens up entire universes for those of you who have gotten only as far as I have in your R and Python journeys. To use it, you craft a basic SELECT statement that pulls in the data you want to analyze. Say you’ve copied the NYT COVID cases and deaths data by county into your own project and dataset.
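A basic query might look something like the following sketch, where the project, dataset, and column names are hypothetical placeholders for wherever you copied the data:

```sql
-- Hypothetical table and column names; adjust to match your own copy
-- of the NYT COVID-19 county-level data.
SELECT
  date,
  county,
  state_name,
  confirmed_cases,
  deaths
FROM
  `my_project.covid_data.us_counties`;
```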

Pretty basic SQL statement on a publicly available dataset

Then, you simply prepend this SQL statement with the information AutoML needs to know: what do you want to call your shiny new model? Of the columns you’re pulling in, which one are you trying to predict? Is this a classification or a regression problem? And how long do you want to let AutoML keep trying to optimize the result?
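The final code looks something like the following sketch, reusing the hypothetical table from above and assuming we want a regression model that predicts the deaths column with a one-hour training budget:

```sql
-- BigQuery ML / AutoML Tables sketch. The project, dataset, table,
-- and column names are hypothetical placeholders.
CREATE OR REPLACE MODEL `my_project.covid_data.covid_deaths_model`
OPTIONS (
  model_type = 'AUTOML_REGRESSOR',  -- or 'AUTOML_CLASSIFIER' for classification
  input_label_cols = ['deaths'],    -- the column we want to predict
  budget_hours = 1.0                -- how long AutoML gets to optimize
) AS
SELECT
  date,
  county,
  state_name,
  confirmed_cases,
  deaths
FROM
  `my_project.covid_data.us_counties`;
```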

This is literally all you need to do to generate a machine learning model. Pretty slick…

This is super easy, so I guess all of those in-demand data scientists should probably go learn how to pump gas or something useful now, right? Weeeeeellllll… it’s not quite that simple. It turns out that knowing what kinds of data to feed into your model can have a pretty dramatic impact on overall accuracy. For example, I was really surprised at how good the results were when I fed AutoML just the basic data on location and number of cases over time. However, when I joined that table with another table containing county-level survey data about mask usage, AutoML generated significantly better results, and in less time too. I suspect the more I learn about how to build out models, the better I will get at making sure the data I feed AutoML generates ever-improving outcomes. That said, the fact that I, as a complete newbie, was able to generate anything resembling a predictive ML model in a few lines of SQL is nothing short of amazing if you ask me. And if you’re reading this, you have implicitly asked me. If you don’t like it, you have only yourself to blame.
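To give a flavor of what that join might look like, here’s a sketch along those lines. The mask-usage table, its columns, and the FIPS join key are all hypothetical placeholders:

```sql
-- Sketch of enriching the training data with mask-usage survey features.
-- Both tables and all column names are hypothetical placeholders.
CREATE OR REPLACE MODEL `my_project.covid_data.covid_deaths_model_v2`
OPTIONS (
  model_type = 'AUTOML_REGRESSOR',
  input_label_cols = ['deaths'],
  budget_hours = 1.0
) AS
SELECT
  c.date,
  c.county,
  c.state_name,
  c.confirmed_cases,
  m.never,        -- fraction of survey respondents who never wear a mask
  m.rarely,
  m.sometimes,
  m.frequently,
  m.always,       -- fraction who always wear a mask
  c.deaths
FROM
  `my_project.covid_data.us_counties` AS c
JOIN
  `my_project.covid_data.mask_use_by_county` AS m
ON
  c.county_fips_code = m.county_fips_code;
```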

A note about costs: AutoML consumes a lot of resources. You are essentially getting access to a cluster of more than 90 servers to churn through as many model permutations as possible in the time you have allotted, and at the time of this writing the cost was just over $19 per hour (let’s round it up to $20 to keep the forthcoming math simpler). You can generate a model in as little as one hour (in my experience, not a great model, but a model nonetheless), or you can let it run for up to 72 hours. If AutoML completes all possible tests before that time is up, you won’t get charged for the full 72 hours, but if you have a complex and large enough data set, you could get billed the full $1,440. If you run multiple iterations of models to compare performance and aren’t careful, you can pretty quickly rack up an extra few tens of thousands of dollars on your GCP bill for the month.

You might want to consider A) knowing what you’re doing before you go playing with the massive cluster of hardware, and B) doing some low-budget tests on different data sets to see which combinations perform better relative to each other, then choosing the better combinations to run for longer time periods. Again, compared to what it would cost you to replicate this in a non-cloud environment, the money we’re talking about here is pretty insignificant. But if you’re watching your pennies, don’t get carried away. Just because you **can** do something doesn’t mean you **should** do it.
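As a sketch of that second suggestion: you could train candidate models at the minimum one-hour budget, then compare them with BigQuery ML’s ML.EVALUATE function before paying for a long run. The model names here are the hypothetical ones from the sketches above:

```sql
-- Compare evaluation metrics from two short, cheap training runs
-- before committing to a longer (and pricier) budget.
SELECT 'cases_only' AS model_name, *
FROM ML.EVALUATE(MODEL `my_project.covid_data.covid_deaths_model`)
UNION ALL
SELECT 'cases_plus_mask_use' AS model_name, *
FROM ML.EVALUATE(MODEL `my_project.covid_data.covid_deaths_model_v2`);
```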

With great power… comes potential for greatly inflated cloud bills.

So there you have it — simple ML models generated from SQL code and publicly available data, without requiring you to understand much at all about machine learning in the first place. Welcome to the future, folks!

I, for one, welcome our new AI robot overlords!


Jason Eden

Data Science & Cloud nerd with a passion for making complex topics easier to understand. All writings and associated errors are my own doing, not work-related.