Get Real. On Second Thought, Don’t.

Jason Eden
Nov 18, 2021


The Promise and Challenge of Synthetic Health Data

It is hard to overstate the potential for machine learning and data science to improve healthcare outcomes. The potential value is vast, both socially and financially: better treatments, lives saved, higher quality of life, less expensive care, reduced load on clinical staff, and more. And yet healthcare lags behind other industries such as finance and retail. There are several reasons for this, including the need for special care because the cost of being wrong can be measured in lives lost as well as dollars. In addition, a tremendous blocker to the advancement of AI/ML in healthcare is the heavy regulation around PHI (Protected Health Information) and the severe penalties that accompany missteps.

One approach to unlocking some of that potential has been to de-identify data sets before feeding them into machine learning models. This poses several challenges and risks. Records for the same patient need to stay associated in some way, sometimes across disparate data sets, which creates the potential for re-identification and violation of privacy (again, with associated penalties should the breach be discovered). The need to be 100% sure these risks have been mitigated often requires so much redaction that the data isn’t really useful for predictive modeling. As a result, the industry is by and large still trying to figure out how to make this work.
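The tension between linkage and re-identification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and secret key are made up, and keyed hashing (HMAC) is just one common way to pseudonymize an identifier, not a method described in this article or a recipe for regulatory compliance.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret held by the data owner (an assumption for this sketch).
SECRET_KEY = b"hypothetical-secret"

def pseudonymize(patient_id: str) -> str:
    """Replace a real identifier with a stable keyed hash."""
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

def deidentify(record: dict) -> dict:
    """Drop the real ID but keep a key so records can still be linked."""
    out = dict(record)
    out["patient_key"] = pseudonymize(out.pop("patient_id"))
    return out

# Two made-up data sets that mention the same fictional patient.
clinic_visits = [
    {"patient_id": "MRN-001", "zip": "63101", "birth_year": 1950, "dx": "disease 1"},
    {"patient_id": "MRN-002", "zip": "63101", "birth_year": 1984, "dx": "check-up"},
]
pharmacy_fills = [
    {"patient_id": "MRN-001", "zip": "63101", "birth_year": 1950, "drug": "drug X"},
]

visits = [deidentify(r) for r in clinic_visits]
fills = [deidentify(r) for r in pharmacy_fills]

# Linkage survives: the same patient gets the same key in both data sets.
linked = visits[0]["patient_key"] == fills[0]["patient_key"]

# So does the risk: any (zip, birth_year) combination that appears only once
# points at exactly one person for anyone who already knows those two facts.
combo_counts = Counter((r["zip"], r["birth_year"]) for r in visits)
unique_combos = [combo for combo, n in combo_counts.items() if n == 1]

print("records still linkable:", linked)
print("uniquely identifying combinations:", unique_combos)
```

Removing or coarsening the zip code and birth year would lower that risk, but those are often exactly the fields a predictive model would want, which is the redaction trade-off described above.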

Another proposed approach is to use synthetic data rather than real data to train machine learning models. This data is based entirely on fictional individuals, so there is no PHI involved. Indeed, one open source project, Synthea, was a significant step toward making this idea real. Synthea generates synthetic patient data from public data sets (things like Census data and disease prevalence and progression statistics collected by various government agencies), and its output can statistically match the entire population of the United States, or just a particular state, zip code, or even town.

When I first started exploring, this was an exciting prospect. I explored how you can modify the tool’s data and templates to generate a data set that is statistically similar to an existing population / cohort of actual patients by customizing demographic information, disease pathways, and so on. From a standpoint of “what might we be able to do with ML” and “what kind of results might we expect,” this capability might be valuable. The real-world value beyond POC, however, is where you start to run into the devil in the details.

When you start by generating data that follows a set of rules to get to a predefined outcome, what value does building a predictive model actually provide? Let’s take an extreme hypothetical example to prove the point. Say your data is made up of just two patients. Patient A is perfectly healthy and only comes in for annual check-ups and the like. Patient B has numerous health issues of varying severity; let’s call them diseases 1, 2, 3, 4, and 5. We create rules that represent this cohort statistically, and we end up with two synthetic patients. SynthPatient A is generated with diseases 1, 3, and 4, and SynthPatient B is generated with diseases 2, 3, and 5. Both patients are tagged with appropriate outcomes given their respective diseases.

If you look at core statistics, the two data sets are very close. Each has two patients and five distinct diseases, with a small fluctuation because disease 3 is present in both synthetic patients instead of just one. The question, then: Can you draw any real-world conclusions from a predictive model based on the SynthPatient data that will apply to the real patients in our example?

Nope. Not even close.

In order to make predictions meaningful you can’t just have statistical similarity between the synthetic and real data at a macro level. You need to be able to generate synthetic versions of patients that actually match the real-world combinations and progression of diseases on a person-by-person basis, since this is the level of detail at which patterns (and thus prediction) can become meaningful.
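The two-patient example can be made concrete with a minimal Python sketch. The function names (`macro_stats`, `combination_profile`) and the particular statistics compared are illustrative assumptions, not anything Synthea produces; the patients and diseases are the hypothetical ones described above.

```python
from collections import Counter

# The hypothetical cohorts: each patient maps to a set of diseases.
real = {
    "Patient A": set(),               # perfectly healthy
    "Patient B": {1, 2, 3, 4, 5},
}
synthetic = {
    "SynthPatient A": {1, 3, 4},
    "SynthPatient B": {2, 3, 5},
}

def macro_stats(cohort: dict) -> dict:
    """Cohort-level summary: the kind of numbers that look 'very close'."""
    all_diseases = set().union(*cohort.values())
    return {
        "patients": len(cohort),
        "distinct_diseases": len(all_diseases),
        "total_diagnoses": sum(len(d) for d in cohort.values()),
    }

def combination_profile(cohort: dict) -> Counter:
    """Person-level view: how often each exact disease combination occurs."""
    return Counter(frozenset(d) for d in cohort.values())

real_macro = macro_stats(real)
synth_macro = macro_stats(synthetic)

# Counter intersection keeps only combinations found in both cohorts.
shared = combination_profile(real) & combination_profile(synthetic)

print("real macro stats:     ", real_macro)
print("synthetic macro stats:", synth_macro)
print("disease combinations shared by both cohorts:", len(shared))
```

The cohort-level numbers differ only in the total diagnosis count (5 versus 6, because disease 3 appears twice in the synthetic cohort), yet not a single synthetic patient carries a disease combination that exists in the real cohort. A model trained on the synthetic patients would be learning patterns that no real patient exhibits.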

Figuring out how to do this correctly and in a way that meets regulatory requirements is a hard thing to think through, certainly a lot harder than I thought it was going to be when I first started exploring. While Synthea is a good start and has some value for proofs of concept and for exploring machine learning approaches, it is not currently a tool capable of providing meaningful outputs for predictive analytics on real-world patients.



Jason Eden

Data Science & Cloud nerd with a passion for making complex topics easier to understand. All writings and associated errors are my own doing, not work-related.