Political Leanings and Health Risks
Health Data Science Python Programming for Data Scientists Final Project
For the final project in my recently completed Health Data Science Master’s degree course, our instructions were to find data from a variety of sources, in a variety of formats, and do something interesting with it that could relate to a broader healthcare initiative in the real world. We needed to demonstrate the ability to read in and work with multiple types of files, transform and manipulate data, produce basic visualizations, and generally show some level of mastery over the course concepts covered throughout the semester. This blog walks you through my final project. If you want to get straight to the code, feel free (link). For those that enjoy a good story… well, a story, anyway… read on.
In a previous blog, I had done some work using R to build out a Covid-19 data set starting from public data available in BigQuery. For this project, I had to simplify a few things as one of the requirements was the instructor needed to be able to run the entire notebook without manual intervention. This meant no data that required credentials to get access to, etc. So to start, I did some spelunking in some public data buckets on Google Cloud Storage to see if I could find something resembling the New York Times data I had worked with on BigQuery. As luck would have it, I found a ‘cloud-samples-data’ bucket with a lot of interesting data, including a subset of the New York Times data. Score!
The public data on BigQuery is updated on a regular basis. This sample file only covered a subset of the dates and had been generated several months before. To make this more like a real-world workflow, I wrote a function to extract the latest date’s worth of data and format the columns appropriately. That way, you could take the logic I wrote, generate an updated file using BigQuery, export it, and use it with almost no code changes.
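The "latest date" step can be sketched roughly as follows. This is a minimal illustration, not the project's actual function: the name `latest_snapshot`, the column names `date`, `fips`, and `cases`, and the sample rows are all assumptions made for the example.

```python
import pandas as pd


def latest_snapshot(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Return only the rows for the most recent date in the data.

    Hypothetical helper: column names are assumed, not taken from the project.
    """
    out = df.copy()
    # Parse the date strings so that max() compares real dates, not text
    out[date_col] = pd.to_datetime(out[date_col])
    newest = out[date_col].max()
    out = out[out[date_col] == newest].reset_index(drop=True)
    # Store FIPS codes as zero-padded 5-character strings so later merges line up
    out["fips"] = out["fips"].astype(str).str.zfill(5)
    return out


# Made-up rows purely for illustration
sample = pd.DataFrame({
    "date": ["2020-05-01", "2020-05-02", "2020-05-02"],
    "county": ["County A", "County A", "County B"],
    "fips": [1001, 1001, 29003],
    "cases": [90, 95, 50],
})

latest = latest_snapshot(sample)
print(latest)
```

Because the function only depends on the newest date present in the file, a freshly exported file with the same columns should work without changes.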
In order to provide some context for the numbers, I decided to add in county-level population data, which I had already generated and made publicly available via Google Sheets. Again, though, accessing Google Sheets via API requires credentials of some sort, so for this project I exported that data to Excel format and loaded it on GitHub, which I could then have my instructor download without any credentials needed.
Next, the program merges this population data with the New York Times latest data and creates some additional columns that will be useful for analysis.
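As a rough sketch of that merge, assuming both tables share a `fips` key and that one of the derived columns is a per-100,000 infection rate (the column names and the numbers below are made up for illustration):

```python
import pandas as pd

# Made-up values; the real project loads these from the NYT sample and an Excel file
covid = pd.DataFrame({
    "fips": ["29001", "29003"],
    "county": ["County A", "County B"],
    "cases": [1000, 500],
})
population = pd.DataFrame({
    "fips": ["29001", "29003"],
    "population": [200_000, 50_000],
})

# A left join keeps every COVID row even if a population figure is missing
merged = covid.merge(population, on="fips", how="left")

# Normalize by population so large and small counties can be compared
merged["cases_per_100k"] = merged["cases"] * 100_000 / merged["population"]
print(merged)
```

Normalizing by population matters here because raw case counts would mostly reflect how many people live in a county.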
Now that we have some interesting data to work with, the next step is to add something that might help explain why infection rates were higher in some counties than in others. As part of an earlier assignment, I had already written code to scrape Missouri voting records on a county-by-county basis from a Wikipedia page. With permission from my instructor, I decided to reuse that code here to see if the way a county voted was related to its Covid-19 infection rate.
The next step I took was to create three labels for the data based on whether a county voted heavily Republican or Democrat. If neither party attained at least a 60% majority, I labeled it as a Swing county.
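The labeling rule could look something like the sketch below. The function name `lean_label`, the column names `rep_pct` and `dem_pct`, and the sample percentages are assumptions for the example; only the 60% threshold comes from the description above.

```python
import pandas as pd


def lean_label(rep_pct: float, dem_pct: float, threshold: float = 60.0) -> str:
    """Label a county by vote share; anything under the threshold is a Swing county."""
    if rep_pct >= threshold:
        return "Republican"
    if dem_pct >= threshold:
        return "Democrat"
    return "Swing"


# Made-up vote shares purely for illustration
votes = pd.DataFrame({
    "county": ["County A", "County B", "County C"],
    "rep_pct": [72.0, 31.0, 55.0],
    "dem_pct": [25.0, 66.0, 42.0],
})

# Apply the rule row by row to build the label column
votes["lean"] = [lean_label(r, d) for r, d in zip(votes["rep_pct"], votes["dem_pct"])]
print(votes)
```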
Next, I stored this data in a local file for future use if needed, and then simplified it to the three columns I needed: county, margin percentage, and the lean label.
Finally, I subset (subsetted?) the New York Times data to just Missouri counties and then merged that subset with the Missouri voting data.
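A minimal sketch of the subset-and-merge step, assuming the two tables can be joined on a shared `county` name column (the column names and values here are illustrative, and real county names may need cleanup before they match):

```python
import pandas as pd

# Made-up rows standing in for the nationwide COVID table
covid = pd.DataFrame({
    "county": ["County A", "County B", "County C"],
    "state": ["Missouri", "Missouri", "Illinois"],
    "cases_per_100k": [500.0, 1000.0, 750.0],
})

# Made-up rows standing in for the simplified three-column voting table
votes = pd.DataFrame({
    "county": ["County A", "County B"],
    "margin_pct": [12.0, 35.0],
    "lean": ["Swing", "Republican"],
})

# Keep only Missouri rows, then attach the voting columns by county name
missouri = covid[covid["state"] == "Missouri"]
mo_votes = missouri.merge(votes, on="county", how="inner")
print(mo_votes)
```

Filtering to one state first avoids accidental matches with same-named counties in other states.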
Now it was time to start generating some visualizations. I won’t post them all here, but here are a couple of the more interesting ones, which showed that Republican-leaning counties tended to have higher infection rates than Democrat or Swing counties.
My initial thought here was “Missouri is a Republican-leaning state, so this may not be entirely fair.” So I decided to compare the total population of Republican-leaning counties with that of Democrat and Swing counties combined. It turned out the two were close enough to equal that, at least at a glance, I didn’t need to worry too much about a total population bias.
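That population check might be sketched like this, assuming a `lean` label column and a `population` column (both names, and all of the numbers, are invented for the example):

```python
import numpy as np
import pandas as pd

# Made-up county populations purely for illustration
df = pd.DataFrame({
    "county": ["County A", "County B", "County C", "County D"],
    "lean": ["Republican", "Republican", "Democrat", "Swing"],
    "population": [100_000, 150_000, 200_000, 60_000],
})

# Collapse the three labels into the two sides being compared
df["group"] = np.where(df["lean"] == "Republican", "Republican", "Democrat or Swing")

# Total population on each side, and each side's share of the whole
totals = df.groupby("group")["population"].sum()
shares = totals / totals.sum()
print(totals)
print(shares.round(3))
```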
Finally then, I decided to look at just how much riskier it could be to live in a certain type of county. What did the worst-case scenario look like for each group?
In the worst case, a Republican county in Missouri ended up with an infection rate nearly twice as high as that of the highest Democrat or Swing county. Admittedly, there were a lot more Republican counties than Democrat or Swing counties, so there could be some numeric bias in there, but I still found this interesting and something that I’d want to look at across all 50 states for a more robust analysis.
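One way to put the worst case next to the group sizes is a grouped aggregation. The sketch below assumes `lean` and `cases_per_100k` columns and uses invented numbers, so it shows the shape of the comparison rather than the project's results.

```python
import pandas as pd

# Made-up rates purely for illustration
df = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "lean": ["Republican", "Republican", "Republican", "Democrat", "Swing"],
    "cases_per_100k": [1800.0, 700.0, 900.0, 950.0, 800.0],
})

# Count of counties and highest rate per group, worst group first
summary = (
    df.groupby("lean")["cases_per_100k"]
    .agg(counties="count", max_rate="max")
    .sort_values("max_rate", ascending=False)
)
print(summary)
```

Showing the county count alongside the maximum makes the caveat visible: a group with more counties has more chances to contain an extreme value.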
Analysis complete, I decided to create some additional labels based on infection rate (above vs. below mean rate).
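A small sketch of that above/below-mean labeling, assuming a `cases_per_100k` column (the column names, label text, and numbers are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up rates purely for illustration
df = pd.DataFrame({
    "county": ["County A", "County B", "County C"],
    "cases_per_100k": [400.0, 600.0, 1100.0],
})

mean_rate = df["cases_per_100k"].mean()

# Counties exactly at the mean fall into the "Below mean" bucket in this sketch
df["rate_label"] = np.where(df["cases_per_100k"] > mean_rate, "Above mean", "Below mean")
print(df)
```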
I finished up the assignment with some random data gathering and some thoughts on future uses and ways to extend the project, and called it good.
This was a very satisfying project to complete and I felt like it conceptually demonstrated skills needed to work with data in a real-world data science project.