Why Good Data Engineers Will Always Have Work, Part Deux

Over the summer I’m taking an Inferential Modeling course using R. Our first assignment was to do some basic regression modeling on a public data set (link). I thought I would be all clever and code my R script to ingest the file as a first step, rather than download the static file and read it in locally — teacher’s pet and all, showing off my mad R skills.

I should have known…

If you download the file today, you’ll note that the column names are all nice and human readable — things like “# of Adverse Events”, “Performance Measure” and such. However, when I originally built my program (as in, twelve hours before this post), that same basic data with the same file name at the same URL? Those two columns were named “X..of.Adverse.Events” and “Performance.Measure”. Most of the multi-word column names had been similarly formatted. What this meant was, while my code was working smoothly when I went to bed last night, when I woke up this morning nothing worked past the basic download and data frame creation.
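A small guard right after the download would have made this failure loud and obvious instead of mysterious. What follows is only a sketch: `check_columns` is a hypothetical helper, and the two tiny data frames are made-up stand-ins that reuse the column names mentioned above.

```r
# Columns the rest of the script relies on (names taken from the post).
expected_cols <- c("# of Adverse Events", "Performance Measure")

# Hypothetical helper: stop with a clear message if any expected column is missing.
check_columns <- function(df, expected) {
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up stand-ins for the two versions of the file.
old_style <- data.frame(X..of.Adverse.Events = 1, Performance.Measure = "a")
new_style <- data.frame(`# of Adverse Events` = 1, `Performance Measure` = "a",
                        check.names = FALSE)

check_columns(new_style, expected_cols)   # passes silently

# The other naming scheme fails immediately, with the missing names spelled out.
tryCatch(
  check_columns(old_style, expected_cols),
  error = function(e) message(conditionMessage(e))
)
```

Whichever naming scheme your code expects, the point is the same: fail at the top of the script, not forty lines into a model call.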

That was easy enough to fix once I figured out what was going on, but it did require some tedious work: the column names now contained spaces, so every reference to them in my code needed quoting where none had been needed before. And then further tedium, because quotation marks don’t work inside the formula you pass to R’s lm() function; there you have to wrap the name in backticks (`) instead.
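For anyone who hasn't hit the backtick thing yet, here is a minimal sketch. The data frame is invented for illustration, and "Eligible Patients" is a hypothetical column that is not in the real file; only "# of Adverse Events" comes from the data set described above.

```r
# Invented data. check.names = FALSE keeps the human-readable names as they are.
df <- data.frame(
  `# of Adverse Events` = c(3, 5, 2, 8, 6),
  `Eligible Patients`   = c(100, 150, 90, 220, 180),  # hypothetical column
  check.names = FALSE
)

# Quotation marks are fine for ordinary column access...
mean(df[["# of Adverse Events"]])

# ...but inside a formula, names with spaces or symbols need backticks.
fit <- lm(`# of Adverse Events` ~ `Eligible Patients`, data = df)
coef(fit)
```

The same rule applies to `$` access, so `` df$`Eligible Patients` `` works where `df$Eligible Patients` is a syntax error.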

The (I’m sure well-intentioned) person who updated the file also went in and cleaned up some of the data. For example, “Performance Measure” used to contain otherwise identical values that differed only by trailing spaces, which I had fixed as part of my code. That fix was no longer necessary. However, whoever did this missed several similar issues with capitalization and overlap between measure names, and apparently added a brand-new variant for one measure, so figuring out what was left for me to fix vs. what they had already done was… fun…
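One way to take some of the sting out of this is to make the cleanup idempotent, so it gives the same answer whether or not someone upstream has already done part of it. A rough sketch follows, with made-up measure names and a hypothetical `normalize_measure` helper; it handles stray whitespace and capitalization only, not genuinely different spellings of the same measure.

```r
# Hypothetical helper: trim the ends, collapse repeated spaces, lower-case.
normalize_measure <- function(x) {
  x <- trimws(x)
  x <- gsub("\\s+", " ", x)
  tolower(x)
}

# Made-up values showing the kinds of variants described above.
measures <- c("Readmission Rate", "Readmission Rate ", "readmission  rate",
              " Mortality Rate")

clean <- normalize_measure(measures)
unique(clean)   # two distinct measures remain
table(clean)    # counts per cleaned value

# Running it again changes nothing, so upstream cleanup can't break it.
identical(normalize_measure(clean), clean)
```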

Then I noticed that the data itself had been slightly altered, because my now-working-again models were generating slightly different results. My overall analysis was still the same, but I had to go back in and replace any values I had manually typed out with the new, slightly different values.

Lesson reinforced: Good Data Engineers will always have work. If you don’t control the upstream data, you don’t control the upstream data.

Lesson learned: Go ahead and show off, but make sure you download and save a copy of the file your program thinks it should be working with, just in case something changes while you sleep.
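In practice that can be as simple as the sketch below. `fetch_snapshot` and `snapshot_md5` are hypothetical helper names, and the URL and file path in the usage comments are placeholders, not the real data set.

```r
# Download the file only if no local snapshot exists yet; always return the
# path to the local copy so the rest of the script reads from disk.
fetch_snapshot <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# MD5 fingerprint of the snapshot, handy for checking later whether the
# upstream file has changed.
snapshot_md5 <- function(path) unname(tools::md5sum(path))

# Example usage (placeholder URL and path):
# path <- fetch_snapshot("https://example.com/measures.csv",
#                        "data/measures_snapshot.csv")
# df   <- read.csv(path, check.names = FALSE)
# snapshot_md5(path)
```

You still get to show off the programmatic download, but once the snapshot exists the script never touches the URL again. Delete the local copy when you actually want fresh data.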

This has been a public service announcement.

Data Science, Big Data, & Cloud nerd with a focus on healthcare & a passion for making complex topics easier to understand. All thoughts are mine & mine alone.