Data Science Layer Cake

Defining “Real” Data Science and AI/ML Expertise

What is a data scientist? I have been reading for years about how hard it is to define exactly what qualifies someone as one. In broad strokes, you need appropriate degrees of statistical knowledge, coding skills, and domain expertise. Quantifying those areas in any meaningful way (i.e., with a definition that can be generally agreed on) in the current data science landscape turns out to be next to impossible.

What I think of when I read a “real” data science post.

In some cases, a data scientist needs a Ph.D. in math or statistics but may have only a few years of industry experience, and maybe couldn’t code their way out of a wet paper bag. In other cases, a data scientist needs an M.S. in Computer Science (or equivalent) and a general understanding of intermediate math and statistics concepts, and can be relatively new to a given industry. Yet another permutation might be a person with deep industry experience who can grasp (but maybe just barely) statistical concepts and write basic code or scripts. All three of those profiles match real-world data scientists I have seen in the wild, and the possible permutations don’t stop there.

And that’s OK! The religious debate about what makes a “real” data scientist is meaningless. It is normally egged on by folks who fit a certain profile and then demand that everyone who shares their job title meet their standards in their specific area of expertise.

Meh.

I personally could not care less whether you look at any of the three profiles I described and consider them “true” data scientists. What matters is context: what is the job they need to do, and does their specific skill set align with it? Furthermore, since in most cases you can’t find a single person who hits the pinnacle of all three types of expertise (and if you do, you probably can’t afford them), you should probably build data science teams that include multiple permutations in order to round out the skill set. Leaning too heavily on one type is likely to leave you with unsatisfactory results.

This doesn’t get any clearer when we look at what “doing” data science actually means in the real world. It turns out there are at least three distinct areas of AI/ML expertise here as well.

The first, which I have been able to go relatively deep in through the Cornell Machine Learning Certificate program, is the development of algorithms. Someone has to create concepts like SVMs, deep neural networks, decision trees, and so forth. Usually those folks come from a heavy math and statistics background, and they may have built out an entire approach in theory before writing a single line of code to implement it.

The second, which I am going deep in through my M.S. in Health Data Science program, is the practical application of algorithms: using tools like scikit-learn or PyCaret in Python, or various packages and tools in R. In this domain, a person learns to use machine learning algorithms and tools to solve practical problems. They may or may not completely understand the theory behind the algorithms, but in most cases they can mitigate this by simply trying different ones and finding the one that produces the best results.
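To make "trying different ones" concrete, here is a minimal sketch of what that can look like in scikit-learn. The dataset (scikit-learn's bundled breast cancer data), the three candidate models, the helper name `compare_models`, and the use of 5-fold cross-validated accuracy as the yardstick are all my own illustrative assumptions, not a prescribed workflow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy of a few candidate models.

    compare_models is a hypothetical helper written for this example.
    """
    candidates = {
        # Scaling is bundled with the models that are sensitive to feature scale.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in candidates.items()
    }


# Small example dataset that ships with scikit-learn (an assumption for the demo).
X, y = load_breast_cancer(return_X_y=True)
scores = compare_models(X, y)
best_name = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```

Scoring with cross-validation rather than on the training data matters here: without it, "the one that produces the best results" tends to be whichever model memorized the data best.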

The third is the in-depth application of AI/ML in production environments. One might take something that was prototyped in R or Python and convert it into something more scalable, say Spark ML or MLlib, or, in more modern approaches, a plethora of powerful Google Cloud-based tools like BigQuery ML, Cloud AutoML, or TensorFlow on TPUs. This skill set requires an understanding of how to build an ML pipeline from end to end, maintain and update ML models, and interact with APIs and distributed systems, plus a solid understanding of the goals of the model and how to evaluate its effectiveness.
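As a rough, single-machine stand-in for that end-to-end idea, the sketch below uses scikit-learn and joblib rather than Spark or any Google Cloud service. It bundles preprocessing and model into one artifact, evaluates it on held-out data, then saves and reloads it the way a serving process might. The dataset, the choice of ROC AUC as the metric, and the file name `model.joblib` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hold out a test set so the evaluation reflects unseen data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing and model travel together as a single deployable artifact.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# Evaluate against a metric tied to the model's goal (ROC AUC is assumed here).
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"held-out ROC AUC: {auc:.3f}")

# Persist the fitted pipeline and load it back, as a serving process would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"  # hypothetical artifact name
    joblib.dump(pipeline, model_path)
    reloaded = joblib.load(model_path)

# The reloaded artifact should behave exactly like the original.
same_predictions = np.array_equal(
    pipeline.predict(X_test), reloaded.predict(X_test)
)
print("reloaded model matches:", same_predictions)
```

A real production setup adds the parts this sketch leaves out: scheduled retraining, monitoring, versioned artifacts, and an API in front of the model.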

In my observation, data scientists have varying levels of expertise across the three AI/ML domains. And again, that’s OK! The important thing isn’t that you master every possible tool (theoretical, R&D, or practical application) but rather that you have enough knowledge in each area to perform the task at hand. Again, individuals with deep expertise across all three are going to be hard (and expensive) to find, so building a team whose members balance one another’s skill sets is usually going to be your best bet.

</rant>

Data Science, Big Data, & Cloud nerd with a focus on healthcare & a passion for making complex topics easier to understand. All thoughts are mine & mine alone.