Generating Useful Synthetic Patient Data for Machine Learning

A Theoretical Approach Moving from Impossible to Plausible


I have previously written about the need for and potential value of synthetic patient data for building machine learning models in healthcare. My exploration into ways to generate it from a common data format, such as FHIR, has taken a few twists and turns, and this will likely continue. As part of this exploration, I have decided to build and test theories by first generating synthetic patient records using Synthea and treating these as my base-level “real” patient data. The approaches described below therefore create synthetic patient records from data that was itself synthetic in the first place. My presumption is that the approaches would perform similarly on actual patient data, but this would need to be validated. With those caveats in mind, let’s begin!


The broad strokes of my data-creation theory are three-fold. First, generate small cohorts of n similar patients using real world (or in my case, “real world”) data. Second, gather relevant statistics about those cohorts. Third, generate an equal number of synthetic patient records that statistically match each cohort. Assuming this worked, one should be able to create a machine learning model using either real or synthetic data and produce very similar results.

Initially I had theorized that converting FHIR data into a graph database format and then performing something akin to k-means clustering would produce cohorts of patients that shared statistically similar characteristics. Comorbidities were the big target, but disease progression, age at onset vs. present, and other characteristics would likely be valuable for evaluation. The n values per cohort would be TBD, depending on the initial number of patient records and the allowable variance between patients that still generates predictive accuracy. Low-information patient cohorts may number in the thousands or tens of thousands (or more), whereas complex patient records might group in the dozens or low hundreds even when starting with millions of records. However, as I dove deeper into FHIR-formatted data generated by Synthea, it quickly became clear that the curse of dimensionality was going to drown any attempts at this kind of clustering. There are simply too many possible variables, and too many records and events per patient, to make this approach viable. The graph database conversion might still prove a useful step down the road, but some kind of rationalization of possible features, along with related transformations into usable formats, was going to be required before any clustering attempts.

In addition, upon further research I decided to swap out k-means clustering for a faster, probabilistic approach that uses MinHash and locality-sensitive hashing (LSH) to perform the patient groupings. High dimensionality is still going to be an issue even with rationalization, and for such spaces this approach is often preferred: it gives up exact answers in exchange for reasonably high accuracy and dramatically lower computational cost.
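To make the grouping step concrete, here is a minimal sketch of MinHash signatures plus banded LSH, written with only the Python standard library. Everything in it is an assumption for illustration: the function names, the choice of 64 hash functions and 16 bands, and the code strings standing in for a patient's diagnosis and procedure codes. It also assumes every patient has at least one code.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as one hash function."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over a non-empty set."""
    return tuple(min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; patients sharing any whole band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for patient_id, sig in signatures.items():
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(patient_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical patients, each reduced to a set of illustrative code strings.
patients = {
    "p1": {"44054006", "38341003", "73211009"},
    "p2": {"44054006", "38341003", "73211009"},
    "p3": {"195662009"},
}
signatures = {pid: minhash_signature(codes) for pid, codes in patients.items()}
pairs = lsh_candidate_pairs(signatures)
print(pairs)
```

Patients with identical code sets always land in the same buckets, near-duplicates do so with high probability, and unrelated patients are very unlikely to be compared at all, which is where the speed-up over pairwise comparison comes from. The number of bands trades recall against the number of candidate pairs to check.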

Rationalize the Scope of Synthetic Data

As of right now, this is a purely theoretical exercise. Step one is going to be creating the similar patient cohorts, based on some definition of similar that ends up being relevant for predictive modeling. This definition of similarity is going to be one of the major points of exploration. The things we care about will change based on the outcome we are being asked to predict, and it may not be obvious from the start which view of a patient record will produce the best results with the least amount of overhead.

Furthermore, the scope of the data problem requires us to further refine the approach. FHIR data (and EMR/EHR records in general) are extraordinarily interconnected, complex objects that have more in common with JSON documents and NoSQL data stores than they do with fully structured, tabular data. Updating cohorts on a near real-time basis is computationally expensive, if it’s even possible, and perhaps more importantly, probably unnecessary. Rather than try to keep a running record of virtual patients and perform all the back-end work on an ongoing basis to keep it up to date, it will be easier and possibly more effective to recreate synthetic records that contain the specific data needed for a given question on an as-needed basis. Each model creation exercise, then, would begin with a distinct virtual data set whose contents were the result of something akin to a probabilistic ETL process. The features to be collected and synthesized would be part of the overall exploration in the process.

Defining Similarity

The features that matter from a similarity standpoint may differ significantly based on the type of question being asked. For example, if you want to predict whether a high-risk person is likely to be hospitalized or need to visit the emergency room in the next 30 days, their health history almost certainly matters, so having complete records of procedure and diagnosis codes (for example) is going to be relevant. Finding the patient view that predicts well with the least overhead will require experimentation. It may be that not all procedure and diagnosis codes are created equal, so the number of times a patient has been seen for a well visit or general check-up (physical, etc.) may contain less signal than other variables. The number of bills they have paid on time in the past two years may or may not be indicative of propensity to use the emergency room as a primary care substitute, and so on.

Furthermore, the number of times a specific diagnosis has been given or procedure performed may be less important, from a predictive standpoint, than the fact that the diagnosis or procedure code exists at all in the patient record. Or, it may be that only repeated codes matter, but only if they have been given in the last 14 days. Depending on the hypothesis being approached, one could envision asking a single question and generating multiple similarity definitions, each capturing one of the ways in which we think patients might need to be grouped. These varying data sets, rather than the modeling approach itself, would then become the starting point for predictive modeling: we would first determine which view of the patients was most predictive, before launching into more complex, computationally expensive approaches to build the more advanced production-level models.

Initial Hypothesis Testing

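In our theoretical example, we might start with an examination of just procedure and diagnosis codes, and take two different views on them. We might initially experiment with a Bag of Words approach borrowed from NLP (a count-based representation, simpler than learned embeddings such as word2vec) and a one-hot encoding approach, both of which would produce vectors that could be used to compute Jaccard distance between patients. For the count-valued Bag of Words vectors, this would mean the weighted generalization of Jaccard, since the standard form is defined on sets.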
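As a small illustration of the distance itself, the sketch below computes Jaccard distance on sets of codes and the weighted (min/max) generalization on code counts. The function names and the single-letter codes are placeholders of my own choosing, not part of any library.

```python
from collections import Counter


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets of codes."""
    union = a | b
    if not union:
        return 0.0  # two empty records are treated as identical
    return 1.0 - len(a & b) / len(union)


def weighted_jaccard_distance(a, b):
    """1 - sum(min counts) / sum(max counts) for two Counters of code counts."""
    keys = set(a) | set(b)
    numerator = sum(min(a[k], b[k]) for k in keys)
    denominator = sum(max(a[k], b[k]) for k in keys)
    if denominator == 0:
        return 0.0
    return 1.0 - numerator / denominator


# Presence-only view: two of four distinct codes are shared.
set_distance = jaccard_distance({"A", "B", "C"}, {"B", "C", "D"})

# Count view: code "B" appears three times for one patient, once for the other.
count_distance = weighted_jaccard_distance(Counter(A=1, B=3), Counter(B=1, C=1))

print(set_distance, count_distance)
```

When every count is 0 or 1, the weighted form reduces to the ordinary set-based distance, so the two views can be compared on the same scale.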

In the Bag of Words approach, the FHIR data values (the data after the colon in key/value pairs) would be broken down into single-“word” tokens, with the number of times each token appears being tracked in a vector. Only the subset of data that contained unique information would be included. The theory here is that patient similarity in both coded procedures and diagnoses, and in the frequency with which these codes and relevant data appear, would naturally group patients according to comorbidities and their relative strength / influence on overall patient health outcomes, and would be predictive in terms of near-term hospitalization risk.

In the one-hot encoding approach, we would ignore the number of times a relevant code appeared and simply check for it ever being included in the patient record. The idea here is that the frequency of codes matters less than their actual existence, and if we can build similar cohorts using this approach as compared to the Bag of Words approach, then the computational requirements are significantly reduced and thus this would be a preferred approach. This is a case where we are intentionally generating a lower-fidelity data set to see how much signal is being lost.
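The two views can be sketched side by side. This assumes each patient has already been reduced to a plain list of code strings (repeats included); the helper names and the short codes in the example are illustrative only.

```python
from collections import Counter


def build_vocabulary(patient_codes):
    """Sorted list of every distinct code seen across all patients."""
    return sorted({code for codes in patient_codes.values() for code in codes})


def count_vector(codes, vocabulary):
    """Bag of Words view: how many times each code appears in the record."""
    counts = Counter(codes)
    return [counts[code] for code in vocabulary]


def presence_vector(codes, vocabulary):
    """One-hot view: 1 if the code ever appears in the record, else 0."""
    present = set(codes)
    return [1 if code in present else 0 for code in vocabulary]


# Hypothetical records; repeated entries represent repeated diagnoses.
patient_codes = {
    "p1": ["E11", "E11", "I10"],
    "p2": ["I10", "J45"],
}
vocabulary = build_vocabulary(patient_codes)
bow = {pid: count_vector(codes, vocabulary) for pid, codes in patient_codes.items()}
one_hot = {pid: presence_vector(codes, vocabulary) for pid, codes in patient_codes.items()}

print(vocabulary)
print(bow["p1"], one_hot["p1"])
```

The presence view can also be held as a plain set per patient, which is the form that set-based similarity measures work on directly.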

In both cases, the data could be enriched in a third/fourth approach by something such as “days since last recorded” for each code. In the Bag of Words view this value could be an integer, or both views could use a one-hot encoding over categories of days (<5, 5–30, and >30, for example). Numerous additional enrichment properties exist in FHIR data, but the idea would be to start from a minimalist perspective, enrich, and compare results, perhaps through many iterations.
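A rough sketch of that enrichment, using the example buckets from the text. The input shape (a dict from code to the dates it was recorded), the reference date, and the "code|bucket" feature naming are all assumptions made for illustration.

```python
from datetime import date


def days_since_last(code_dates, as_of):
    """Integer days between the reference date and each code's most recent entry."""
    return {code: (as_of - max(dates)).days for code, dates in code_dates.items()}


def recency_bucket(days):
    """Map a day count onto the example categories: <5, 5-30, >30."""
    if days < 5:
        return "<5"
    if days <= 30:
        return "5-30"
    return ">30"


def bucket_features(code_dates, as_of):
    """One-hot style features: one 'code|bucket' token per code in the record."""
    recency = days_since_last(code_dates, as_of)
    return {f"{code}|{recency_bucket(days)}" for code, days in recency.items()}


# Hypothetical record with illustrative codes and dates.
record = {
    "I10": [date(2024, 1, 10), date(2024, 3, 29)],
    "E11": [date(2024, 3, 11)],
    "J45": [date(2023, 6, 1)],
}
as_of = date(2024, 3, 31)
recency = days_since_last(record, as_of)
features = bucket_features(record, as_of)

print(recency)
print(sorted(features))
```

Because the bucketed features are just more tokens in a set, they can be fed to the same similarity machinery as the plain codes, which makes it easy to compare the enriched and minimalist views.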

Synthetic Data Generation

One of the benefits of this approach is that this rationalized reduction in the patient record also simplifies the process of generating the synthetic data. Rather than recreate an entire patient record, the synthetic equivalents simply need to replicate in a statistically accurate way their aggregate cohort in the rationalized data sets. This moves the problem of creating synthetic patient data for machine learning purposes from the nearly intractable to doable, while acknowledging the loss of data fidelity these approaches would necessarily require.

For example, if we take the semi-structured FHIR data output as JSON and parse its contents into a vector that simply indicates the existence of every code that exists in the overall patient population, we have a d-dimensional representation of every patient from that point of view that can now be sorted and grouped using MinHash and locality-sensitive hashing. If we take the Bag of Words approach, the results are similar, except that each value is a count rather than a flag, which makes the values more informationally dense (and can mean fewer variables than a one-hot view that splits a code into several categories). Once we let the grouping algorithm do its thing, we have x cohorts, each containing n patients. Again, you can imagine n ranging from 20–30 for relatively unique diagnosis groupings to something into the hundreds or thousands+ for low-information patients, depending on the size and scale of the original patient data you are basing the synthesized records on.
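The parsing step might look something like the sketch below. It assumes a FHIR Bundle in JSON form, where each entry holds a resource and coded resources carry their codes under code.coding, and it only looks at Condition and Procedure resources. The tiny inline bundle is a hand-written stand-in, not real Synthea output, and the "ResourceType:code" token format is my own convention.

```python
import json

# Resource types treated as diagnosis / procedure sources in this sketch.
CODED_RESOURCE_TYPES = {"Condition", "Procedure"}


def extract_codes(bundle):
    """Return 'ResourceType:code' tokens for every coding on the selected resources."""
    tokens = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        resource_type = resource.get("resourceType")
        if resource_type not in CODED_RESOURCE_TYPES:
            continue
        for coding in resource.get("code", {}).get("coding", []):
            code = coding.get("code")
            if code:
                tokens.append(f"{resource_type}:{code}")
    return tokens


# Hand-written stand-in for one patient's bundle (values are illustrative).
sample_bundle = json.loads("""
{
  "resourceType": "Bundle",
  "entry": [
    {"resource": {"resourceType": "Condition",
                  "code": {"coding": [{"system": "http://snomed.info/sct", "code": "44054006"}]}}},
    {"resource": {"resourceType": "Procedure",
                  "code": {"coding": [{"system": "http://snomed.info/sct", "code": "430193006"}]}}},
    {"resource": {"resourceType": "Observation",
                  "code": {"coding": [{"system": "http://loinc.org", "code": "8302-2"}]}}}
  ]
}
""")
codes = extract_codes(sample_bundle)
print(codes)
```

The resulting token list keeps repeats, so it can feed either a count-based or a presence-only view; which resource types to include is exactly the kind of choice the rationalization step is meant to explore.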

We then take each patient sub-group and gather appropriate statistics for each variable: n, min, max, mean, median, mode, etc. Then we write an algorithm that randomly generates n records that statistically align to the sub-group across all variables. Like any untested process, this seems relatively straightforward. :) But time will tell.
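One very simple way to sketch this step is to summarize each variable and then draw every synthetic value independently from the values observed for that variable in the cohort. This is an assumption on my part rather than a settled design: it reproduces each variable's marginal distribution in expectation and never leaves the observed min/max range, but it ignores correlations between variables, which a real implementation would likely need to address. The cohort below is made-up count data.

```python
import random
import statistics


def cohort_statistics(rows):
    """Per-variable summary statistics for a cohort given as a list of equal-length rows."""
    summaries = []
    for column in zip(*rows):
        summaries.append({
            "n": len(column),
            "min": min(column),
            "max": max(column),
            "mean": statistics.mean(column),
            "median": statistics.median(column),
            "mode": statistics.mode(column),  # first mode on ties (Python 3.8+)
        })
    return summaries


def generate_synthetic(rows, n=None, seed=0):
    """Draw n synthetic rows, sampling each variable independently from its observed values."""
    rng = random.Random(seed)
    columns = list(zip(*rows))
    n = len(rows) if n is None else n
    return [[rng.choice(column) for column in columns] for _ in range(n)]


# Hypothetical cohort: four patients, three count-valued variables.
cohort = [
    [2, 1, 0],
    [3, 1, 0],
    [2, 0, 1],
    [4, 1, 0],
]
stats = cohort_statistics(cohort)
synthetic = generate_synthetic(cohort, seed=42)

print(stats[0])
print(synthetic)
```

Comparing cohort_statistics on the real and synthetic rows gives a first check on how closely the two align, though small cohorts will show noticeable sampling noise.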

Practical Application and Validation

The first validation point would probably be done using relatively simple or quick-to-train modeling approaches like MLR, GLM, and GBM. Once the synthetic patient records were created from a certain patient perspective (one-hot vs. BoW, for example), we would train a simple predictive model and measure its accuracy on test data, which would necessarily be based on real patient records. We would repeat this approach for every patient record view we were hypothesizing, from simplest to most complex, until we found the model, based **completely** on synthetic records, that did the best job at predicting real-world patient outcomes on our test data. We would also want to train the same model on the real patient data vectors we used to generate our synthetic data, to determine the relative performance hit from a predictive standpoint.
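The shape of that comparison can be sketched with a tiny logistic regression (one member of the GLM family) written in plain Python. All of the data here is a toy stand-in: the "real" and "synthetic" sets are both produced by the same made-up rule purely so the harness has something to run on, and none of the numbers it prints say anything about actual patient data. The function names and hyperparameters are likewise assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=50):
    """Fit weights and bias with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def accuracy(model, X, y):
    """Share of records where the predicted class matches the label."""
    w, b = model
    hits = 0
    for xi, yi in zip(X, y):
        predicted = 1 if b + sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0
        hits += predicted == yi
    return hits / len(y)


def toy_records(n, seed):
    """Toy binary feature vectors; the outcome simply copies the first feature."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(n)]
    return X, [x[0] for x in X]


real_train = toy_records(200, seed=1)       # stand-in for real patient vectors
synthetic_train = toy_records(200, seed=2)  # stand-in for cohort-matched synthetic vectors
real_test = toy_records(100, seed=3)        # held-out "real" records

synth_acc = accuracy(train_logistic(*synthetic_train), *real_test)
real_acc = accuracy(train_logistic(*real_train), *real_test)
hit = real_acc - synth_acc  # relative performance hit from training on synthetic data

print(synth_acc, real_acc, hit)
```

The same harness would be rerun for each patient view under consideration, keeping the real test set fixed so the accuracy figures stay comparable across views.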

Once the best data format had been determined, we would then go through and implement more complex AI/ML modeling approaches and practices (grid search, deep learning, etc.) to fine-tune the model intended for production. Once again, we would test this more robust model on actual patient data to determine improvement, if any, and upon finding the most effective approach and parameters, deploy it into production.


There’s a sense of “moving the goalposts” here (experimenting with subsets of data vs. creating full synthetic patient records), but I believe it’s inevitable if the final desired output is predictive modeling with a repeatable process that can deal with the realities of the complexity of patient medical records while still potentially producing usable results. Starting from the search for a useful view of the patient data makes the problem more extensive and exploratory in the early stages, but vastly simplifies the creation and ongoing upkeep / management of potentially useful synthetic patient data. In fact, it may be the only practical way to approach the problem given the limitations of current technology.

As with all machine learning in the healthcare space, and particularly when the intended use would be diagnostic in nature, significant regulatory hurdles would have to be overcome in order to deploy models into production. That said, if this approach ended up being able to improve patient outcomes with models built using synthetic data, one could imagine those hurdles being potentially lower from the “how it’s built” perspective, with more scrutiny appropriately placed on how and why the approach works in the first place.

Special thanks to Vivian Neilley for thorough review and suggested areas for improvement.


