Or… Why Good Data Engineers Will Always Have Work
Before publishing the midterm exam, the professor of my Programming for Health Data Scientists class gave us all the option to do some extra work on a pre-midterm assignment. The payoff, assuming you did well enough, was getting to skip one of the problems on the midterm the following week. The challenge assignment: take two files, one smallish JSON file and one much larger XML file, pulled from two different hospitals, and combine their data in such a way that you could do something useful with it. For example, given a particular procedure code, view the range of charges by hospital.
So I took the bait. The files are real data files, the result of the Hospital Price Transparency rule that went into effect January 1, 2021. Hospitals are required to publish cost data for the procedures they offer. For consumers, the data has to be available in a consumer-friendly format. For data folks, the only requirement is that the complete data be available in a “machine readable” file. There are no structure or format requirements, so how each hospital chooses to publish the data file is pretty much discretionary. And as you might suspect, some hospital systems are less enthusiastic than others about the new rule, meaning the amount of work required to harmonize and merge the data varies pretty widely as well. As I would soon find out first-hand.
The whiteriver.json file contained data on about 100 procedure codes, and it turned out the data formatting was consistent throughout the file.
The saline.xml file was a different story. It contained more than 1,500 procedure codes, was broken into three different sections, and one of those sections was structured differently than the other two. (Ask me how I learned that one…)
I noticed fairly quickly that the way the hospitals reported costs by insurance company (which was part of the original ask for the assignment…) was vastly different. The whiteriver file broke insurance companies down into subsidiaries with separate charges listed for each, whereas saline just contained a single price per company in most cases. My first pivot was to ask my professor if, instead of showing costs per hospital per insurance company, I could show minimum, maximum, and cash price. He agreed that this sounded reasonable.
I decided to start by creating a combined dictionary out of the two files. Because the JSON file was so well formatted, I was able to read it in and convert the data I needed into a dictionary pretty easily:
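A minimal sketch of that step (field names like `code` and `cash_price` are assumptions, since the real whiteriver.json schema isn't reproduced here, and the shape of the combo dictionary is likewise a guess):

```python
import json

# A tiny stand-in for whiteriver.json -- the real field names are assumptions.
raw = """[
  {"code": "470", "description": "Major joint replacement",
   "cash_price": 30000.0, "min_charge": 25000.0, "max_charge": 45000.0},
  {"code": "216", "description": "Cardiac valve procedure",
   "cash_price": 90000.0, "min_charge": 80000.0, "max_charge": 110000.0}
]"""

def build_combo_dict(records):
    """Key each procedure by its code; nest the prices under the hospital name."""
    combo = {}
    for rec in records:
        combo[rec["code"]] = {
            "description": rec["description"],
            "whiteriver": {
                "cash": rec["cash_price"],
                "min": rec["min_charge"],
                "max": rec["max_charge"],
            },
        }
    return combo

combo_dict = build_combo_dict(json.loads(raw))
```

Keying everything by procedure code is what makes the later merge straightforward: the second hospital's prices can just be hung off the same entry.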
For the XML file, I decided to import it directly as a dictionary using the xmltodict package, as I believed this would make merging the data with my combo dictionary easier down the road.
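Something along these lines, with stand-in tag names (the real saline.xml element names aren't shown here); note that xmltodict hands every leaf value back as a string:

```python
import xmltodict  # third-party: pip install xmltodict

# A tiny stand-in for saline.xml; the tag names are assumptions.
raw = """
<StandardCharges>
  <Item>
    <Code>470</Code>
    <Description>Major joint replacement</Description>
    <CashPrice>28000.00</CashPrice>
  </Item>
</StandardCharges>
"""

# parse() returns a nested dict mirroring the element tree
doc = xmltodict.parse(raw)
item = doc["StandardCharges"]["Item"]
```

Since the result is just nested dictionaries, it slots in next to the JSON-derived data with the same access patterns.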
I performed some tests to spelunk the format of the dictionary, and once I felt like I had it down, I ran the code to merge the two files. The code ran without errors; however, only a few hundred new procedure codes had been added to my combo dictionary. So I created a function to check whether a given procedure code existed in the dictionary and produced the expected output.
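A sketch of that kind of spot-check helper, run against a toy combo dictionary (the printed format is illustrative):

```python
def check_code(combo, code):
    """Spot-check helper: report whether a procedure code made it into combo."""
    if code in combo:
        print(f"{code}: {combo[code]['description']}")
        return True
    print(f"{code}: NOT FOUND")
    return False

# Quick demo against a toy combo dictionary (contents are illustrative):
combo = {"470": {"description": "Major joint replacement"}}
found = check_code(combo, "470")    # prints the entry
missing = check_code(combo, "999")  # prints NOT FOUND
```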
I then used this function to spot-check entries from the top, middle, and bottom of the saline.xml file. As I suspected, there were a *lot* of missing values. While I won’t take you through the entire troubleshooting process, I will share the commented-out code I was able to generate when I finally figured out what was going on:
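One xmltodict trap that produces exactly this symptom is worth spelling out: a parent with several child elements parses to a list, but a parent with exactly one child parses to a plain dict, so a loop written for the common case quietly misbehaves on the odd section. A sketch with assumed tag names:

```python
import xmltodict

# Assumed tag names. Note the second <Section> has only one <Item>:
raw = """
<Sections>
  <Section>
    <Item><Code>100</Code></Item>
    <Item><Code>200</Code></Item>
  </Section>
  <Section>
    <Item><Code>300</Code></Item>
  </Section>
</Sections>
"""

def extract_codes(doc):
    codes = []
    for section in doc["Sections"]["Section"]:
        items = section["Item"]
        # xmltodict returns a list for repeated elements but a plain
        # dict for a single one -- normalize before iterating.
        if isinstance(items, dict):
            items = [items]
        for item in items:
            codes.append(item["Code"])
    return codes

codes = extract_codes(xmltodict.parse(raw))
```

Without the `isinstance` check, the inner loop would iterate over the lone dict's *keys* for the second section, and that section's codes would simply never land in the output.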
And so, after a number of hours I’m too embarrassed to admit in public, I was finally able to write the code to pull the data out of the saline.xml file that I wanted and add it to my combo dictionary.
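In outline, the merge looked something like this (the field names and the string-to-float conversions are assumptions):

```python
def merge_saline(combo, saline_items):
    """Fold saline charge records into combo, creating entries for new codes."""
    for item in saline_items:
        code = item["Code"]
        entry = combo.setdefault(code, {"description": item.get("Description", "")})
        entry["saline"] = {
            "cash": float(item["CashPrice"]),
            "min": float(item["MinCharge"]),
            "max": float(item["MaxCharge"]),
        }
    return combo

# Toy inputs with assumed field names:
combo = {"470": {"description": "Major joint replacement",
                 "whiteriver": {"cash": 30000.0, "min": 25000.0, "max": 45000.0}}}
saline = [{"Code": "470", "Description": "Major joint replacement",
           "CashPrice": "28000", "MinCharge": "21000", "MaxCharge": "39000"},
          {"Code": "882", "Description": "Chest pain",
           "CashPrice": "4100", "MinCharge": "2900", "MaxCharge": "6800"}]
combo = merge_saline(combo, saline)
```

`setdefault` does the heavy lifting here: codes that both hospitals report get a second nested price block, while saline-only codes get fresh entries of their own.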
The new combo dictionary had more than 1,500 procedure codes and was formatted correctly, so I was good to go. As a finishing touch, I decided to make the data I had worked so hard to generate more easily accessible for the next person, so I wrote it out to a CSV file, which could then be easily re-imported via Pandas or similar:
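A sketch of that export step, flattening the nested dictionary into one row per code per hospital (the column names and file name are my choices):

```python
import csv

def write_combo_csv(combo, path):
    """Flatten the nested combo dict into one CSV row per code per hospital."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["code", "description", "hospital", "cash", "min", "max"])
        for code in sorted(combo):
            entry = combo[code]
            for hospital in ("whiteriver", "saline"):
                prices = entry.get(hospital)
                if prices:  # skip hospitals with no data for this code
                    writer.writerow([code, entry["description"], hospital,
                                     prices["cash"], prices["min"], prices["max"]])

# Toy example with both hospitals reporting the same code:
combo = {"470": {"description": "Major joint replacement",
                 "whiteriver": {"cash": 30000.0, "min": 25000.0, "max": 45000.0},
                 "saline": {"cash": 28000.0, "min": 21000.0, "max": 39000.0}}}
write_combo_csv(combo, "combo_prices.csv")
```

The long, tidy layout (one hospital per row rather than one column per hospital) keeps the file easy to filter and group after a `pandas.read_csv`.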
You can take a look at the complete code from beginning to end here. If there’s one thing this experience taught me, it is the value of data engineering. I had often heard that it represented about 80% of the work in the field of data science, and believed it, but walking through this experience on a couple of relatively small, already semi-structured data sets really drove that point home. To those unsung heroes who format data like this into things that folks can easily plug into machine learning models: I salute you!