Understanding Results vs. Predicting the Future
I recently completed a summer course as part of my Master’s in Health Data Science program that focused on Inferential Modeling. It was a really informative course that opened my eyes to a whole side of data science that I had never really been exposed to in my previous work. Since I assume there are readers in the same boat, I will try to explain it as I understand it today, where it fits, and some of the cool capabilities it enables.
In predictive analytics and machine learning, "inference" is often used as another term for prediction (running a trained model on new data). So when I registered for the class, I thought I was signing up for a class on predictive modeling. While there is significant overlap between the two areas, their focus is quite different. Briefly, inferential modeling is a data science approach that seeks to infer properties of the relationships between a response (predicted) variable and the input variables, rather than simply to make the most accurate prediction possible.
For me, this clicked best when I walked through an example. Inferential analysis operates under constraints that don't exist for predictive analysis. In predictive analysis, understanding why a model achieves a certain accuracy is important, but getting to the most accurate answer possible is the goal (assuming the reasons you got there are sound). This allows us to use the entire universe of machine learning algorithms (including decision trees and deep learning), regardless of how well we understand how they arrive at their answers. We don't have to be able to tightly explain how each predictor variable affects the overall prediction; we just need to know that our basket of variables gives us what we need to be as accurate as possible. The downside, of course, is that our ability to attribute the relative importance of one variable versus another is limited. If you are trying to do something like decide where to target investment dollars for the biggest return, that limitation can matter a great deal, depending on the modeling approach taken.
Inferential modeling, then, takes a data set and generates a predictive model, but the model itself is somewhat secondary to the information it provides about the impact of the individual variables on the values in the data. Every model we built in class was some sort of regression model, where the output included a coefficient for each variable, indicating the impact of a one-unit change in that variable on the predicted value, along with a measure of the overall "fit" of the model with that set of variables as predictors. In addition, because certain variables can interact with each other, detecting and mitigating any negative impact of those interactions was a significant focus.
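To make that concrete, here is a minimal sketch of the kind of output a regression gives you, using NumPy on made-up data (all the numbers, variable counts, and "true" effects below are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Three made-up predictors; by construction, only the first two
# actually drive the response.
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Each coefficient estimates the change in y per one-unit change in
# that predictor, holding the others fixed.
print("coefficients:", beta[1:])

# R^2: the share of the response's variance the model explains,
# i.e. the "overall fit" number.
resid = y - Xd @ beta
r_squared = 1.0 - resid.var() / y.var()
print("R^2:", r_squared)
```

The fitted coefficients land close to the true effects (about 2, -1, and 0), and the third coefficient being near zero is exactly the kind of per-variable information a pure black-box predictor wouldn't surface.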
Say you have a data set with 1,000 candidate predictor variables for a given response variable. For predictive modeling purposes, assuming you had the compute power available, you might decide to feed all 1,000 of them into a deep learning engine and see how good the predictions get. Let's assume the accuracy of this final model is 93%. Now, take that same data set and apply inferential analysis. For starters, 1,000 variables is far too many for humans to evaluate and prioritize, so you might start with LASSO regression to eliminate variables whose predictive power is minimal compared to the others. Once the count has been reduced to something reasonable (say, the top 30 variables), you run them through a regression algorithm, not deep learning, and run some experiments to check whether your variables meet all of the required modeling assumptions. After this analysis, you decide to drop six of the variables from the model and then compare the fit of the resulting model against the 30-variable original. The 30-variable model is about 83% accurate and the 24-variable model is 81%, but the six dropped variables did not contribute significantly to the fit, so we keep the simpler 24-variable version for further analysis.
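The LASSO winnowing step might look like the sketch below, using scikit-learn on synthetic data. I've scaled it down (50 candidate variables instead of 1,000, with only 5 that matter by construction); the penalty strength `alpha=0.1` is an illustrative choice, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 300, 50  # scaled-down stand-in for the 1,000-variable case

X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 variables matter
y = X @ true_coef + rng.normal(size=n)

# The L1 penalty drives the coefficients of weak predictors to exactly
# zero, which is what makes LASSO usable for variable selection.
lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print(f"kept {kept.size} of {p} variables:", kept)
```

The surviving variables (which include all five real ones, plus perhaps a few stragglers) are what you'd then hand to an ordinary regression and to the assumption checks described above. In practice the penalty strength would be chosen by cross-validation (e.g. scikit-learn's `LassoCV`) rather than fixed by hand.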
Don't get me wrong: if prediction is your primary goal, a 12-point drop in predictive performance (and even the 2-point gap between the two regression models) is absolutely material, so you'd never select the 24-variable inferential model to deploy for prediction in this situation. The deep learning model wins that award hands-down. However, if you were trying to make an investment decision, the deep learning model probably wouldn't be that helpful. There's potentially an awful lot of noise in those 1,000 variables (given that you only needed 24 of them to get 87% of the predictive capability), so choosing the top three to focus on for the best ROI would be extremely difficult with the first approach, whereas the second gives you a relatively straightforward way to inform such a decision.
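For completeness, the "did the dropped variables matter?" question from the 30- vs. 24-variable comparison is typically answered with a partial F-test on the nested models. A minimal sketch with NumPy and SciPy (the data, variable counts, and effect sizes here are all made up; in this toy case the extra variables genuinely matter, so the test rejects):

```python
import numpy as np
from scipy import stats

def partial_f_test(X_full, X_reduced, y):
    """F-test for whether the extra variables in the full model
    significantly reduce the residual sum of squares."""
    def fit_rss(X):
        Xd = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        r = y - Xd @ beta
        return r @ r, Xd.shape[1]

    rss_full, p_full = fit_rss(X_full)
    rss_red, p_red = fit_rss(X_reduced)
    df1 = p_full - p_red   # number of extra parameters
    df2 = len(y) - p_full  # residual degrees of freedom
    f = ((rss_red - rss_full) / df1) / (rss_full / df2)
    return f, stats.f.sf(f, df1, df2)  # survival function gives the p-value

rng = np.random.default_rng(2)
n = 200
X_red = rng.normal(size=(n, 4))
X_extra = rng.normal(size=(n, 2))
# The extra variables have real effects here, so the full model
# should fit significantly better than the reduced one.
y = (X_red @ np.array([1.0, 2.0, -1.0, 0.5])
     + X_extra @ np.array([1.0, -1.0])
     + rng.normal(size=n))

f_stat, p_value = partial_f_test(np.column_stack([X_red, X_extra]), X_red, y)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
```

If the p-value were large instead, the extra variables would not be earning their keep, and you'd prefer the smaller model for interpretation, exactly the call made in the 24-variable example above.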
There’s a lot more to inferential modeling that I found fascinating — for example, using observational studies and refactoring them to mimic double-blind research. And there’s also an ever-increasing (and important!) push towards “Explainable AI” to help take some of the mystery out of those deep learning models to help explain why they make the predictions they make and what they value in the process. But at a high level, this is the way I think about the two fields of data science, at least at the moment. Based on past analysis, I can infer with great confidence that I will be smarter about this in the future.