AutoML and Cloud AI/ML APIs
Was I Wrong About the Death of Data Science?
Well, that didn’t take long.
In response to my previous post, two rebuttals came back almost immediately: AutoML and Cloud AI/ML APIs. I thought I had addressed both indirectly, but apparently I could have been clearer about these specific topics and why I’m not even remotely concerned. So here you go.
…is Just Another Automation Tool
When I spoke about automation in my previous post, AutoML was one of the tools I specifically had in mind. If you’re not familiar, AutoML can be summarized as “AI that builds AI models.” You can feed AutoML a data set, ask it to build the best predictive model possible, and then come back later to see what the best model is.
You can see why this causes some people some angst, especially if their entire concept of data science was “building a predictive model on a data set.”
But stop for just a second and think about this. If you viewed data science as nothing more than building predictive models and picking the best one for the data in its current state, then you were **not** thinking about the science: the experimentation, and the part of the process that actually affects reality. And if you think AutoML makes it unnecessary to understand how the algorithms it uses actually work, then by the same token a chemist needs no real knowledge of chemistry or of molecular structures and interactions to run experiments with, say, a mass spectrometer. Which, of course, is a ridiculous notion. And yet, it persists.
As the name implies, AutoML automatically applies a bunch of machine learning techniques to a data set. Things that a good data scientist used to have to do by hand are now scripted and intelligently evaluated. But like I mentioned in the previous blog post, coming up with the “best” model given the data provided, in the form it was provided, may or may not actually mean anything in the real world. All of the results that AutoML spits out still have to be vetted and tested, and they will most likely spur retooling of the data and another run through the engine rather than being a “one and done” scenario. Furthermore, understanding which model worked best, and being able to develop and test a hypothesis as to why, is absolutely critical. If you can’t do that, you’re not a data scientist. You’re the data science equivalent of a script kiddie.
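To make the “scripted and evaluated” part concrete, here is a toy sketch of the kind of loop an AutoML engine runs. It is not how any particular AutoML product works; the three candidate models, the `toy_automl` function, the 25% hold-out split, and the synthetic data are all assumptions I made up for illustration, using only the Python standard library.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Candidate: ordinary least squares on a single feature."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_nearest(xs, ys):
    """Candidate: 1-nearest-neighbour, copy the closest training target."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Hypothetical candidate pool; a real engine would try far more.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}

def toy_automl(xs, ys, holdout=0.25, seed=0):
    """Fit every candidate, score it on a held-out split, return a leaderboard."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - holdout))
    train, test = idx[:cut], idx[cut:]
    leaderboard = []
    for name, fit in CANDIDATES.items():
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        mse = statistics.fmean((model(xs[i]) - ys[i]) ** 2 for i in test)
        leaderboard.append((mse, name))
    return sorted(leaderboard)  # lowest hold-out error first

# Synthetic data (assumed for the example): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

leaderboard = toy_automl(xs, ys)
for mse, name in leaderboard:
    print(f"{name:8s} hold-out MSE = {mse:.2f}")
```

Notice what the leaderboard does not tell you: why the winner won, whether the hold-out split resembles the data you will see next month, or whether the target was worth predicting in the first place. That part is still on you.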
…Doesn’t Scale Well
AutoML is a really great tool on relatively narrow data sets, meaning those with a small number of variables. However, as the number of features grows, the brute-force trial-and-error approach that AutoML by necessity takes becomes unworkable. Think about it: every algorithm, every feature combination, every abstraction, and every low-level data science task has to be tried, and the number of combinations to try, along with the compute power needed to try them, grows exponentially with every new feature. Up to a certain point, this isn’t a problem, but beyond that point, AutoML is unable to converge on the “best” solution in anything resembling a reasonable time period at a reasonable cost. So for more complex, wider data sets, AutoML just isn’t going to work.
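A quick back-of-the-envelope count shows how fast this gets out of hand. The sketch below assumes an exhaustive search over every non-empty feature subset, crossed with a made-up pool of 10 algorithms and 20 configurations each; `search_space` and both of those numbers are illustrative assumptions, and real engines prune and sample rather than enumerate everything, so treat this as a worst-case picture of the growth rather than a measurement.

```python
def search_space(n_features, n_algorithms=10, configs_per_algorithm=20):
    """Count candidate models in a naive exhaustive search.

    Assumes every non-empty subset of features is paired with every
    algorithm and every configuration of that algorithm.
    """
    feature_subsets = 2 ** n_features - 1  # all non-empty subsets
    return feature_subsets * n_algorithms * configs_per_algorithm

for n in (5, 10, 20, 40):
    print(f"{n:2d} features -> {search_space(n):,} candidate models")

# Prints:
#  5 features -> 6,200 candidate models
# 10 features -> 204,600 candidate models
# 20 features -> 209,715,000 candidate models
# 40 features -> 219,902,325,555,000 candidate models
```

Each added feature roughly doubles the count, and that is before any engineered features or interactions are thrown into the mix.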
AutoML for the smaller, simpler problems is an amazing automation tool that can dramatically speed up the process of building predictive models. But if you define data science as only the scenarios in which AutoML is a good fit, you are either woefully overestimating AutoML’s capabilities or underestimating the complexity of real-world challenges that data could potentially address in the next century.
Cloud AI/ML APIs
The other strong response was that all of the big data science problems have already been solved, not from a math standpoint but from a real-world model standpoint, with publicly available Cloud APIs offered as the proof. If you’re unfamiliar, major cloud vendors like Google, Amazon, and Microsoft all let you simply upload a generic image, text, video, or sound file and get back highly tuned and accurate results for identification, NLP, translation, and so forth. The argument goes that you, as a data scientist working in a company of nearly any size, will never have access to the amount of data and compute power these companies used to build these models, so you’re significantly better off simply using what they have and not even trying on your own.
I grant you all of that. You’re still being myopic if you think this is anything but fantastic for the field and career opportunities within it.
In just about any other field, science is not defined as “re-solving the same problems over and over again.” How dumb would that be? We’d still be celebrating the yet-again invention of porcelain nearly 300 years later if that were the case. And yet, people claim that because some types of data science problems have been mostly solved, the opportunity for data science careers is gone. Balderdash.
What the existence of Cloud APIs means is that you don’t have to tackle the generic part of a data science problem. You don’t have to build an algorithm that can identify a dog vs. a tree vs. a bicycle vs. a person. But these public APIs absolutely, unequivocally, and in no uncertain terms do not encompass all possible image / video / NLP / translation / etc. possibilities. In fact, by their very nature, they have to be “general purpose” and as such lack access to any non-public data, and are generally not deeply trained on domain-specific data.
Any guesses as to what percentage of the world’s data is non-public and domain specific?
The fear of Cloud APIs in terms of data science careers reflects either a drastic misunderstanding of how they’re built and how they work, or an abject laziness on the part of a “data scientist” (the quotes are intended to convey that I’m using the term loosely) who simply wants to get paid for doing the same kind of thing over and over again. For those who fall into the former category: Cloud APIs take away the generic gruntwork of <insert AI practice here>, but you are still going to be very, very employable if you can perform science on data that was not part of their corpus and, for your employer, leverage the generic capabilities while extending them with your employer’s private, domain-specific data.
Again, folks, as long as you’re looking at data science as a scientific pursuit and not just fancy script writing, neither of these should be cause for concern. Indeed, both of them should excite you because of the size and scope of the challenges they allow you to pursue as you push the field further.
Gauntlet thrown again. Next!