Population Density and Health Risks

Jason Eden
May 25, 2021

Health Data Science Statistics and Analytical Programming Final Project

For my Statistics and Analytical Programming final project, the goals were similar to those of my Python course final, but with different caveats and requirements. For starters, there were no requirements about reading in data from a certain number of different file formats. The only requirement on the data side was to find an interesting data set and do some analysis on it. Therefore, I was able to take the work I had started here using public data on BigQuery, export it to GitHub, and just start chipping away at it. On the flip side, however, the requirements for the types of analysis were deeper and required demonstrating both R programming techniques and an actual understanding of the statistical tools being deployed. Once again, if you want to get straight to the code, you can find it here. Since I created the project using RMarkdown, I was also able to export it to HTML, which you can get here. Most of the screenshots I will be using will be from the HTML view.

Picking up where I left off, I had created a file called localdata, which I subsequently stored on GitHub. Thus, the first part of the program actually just grabs and reloads that data.

Google “Another Dysfunctional Family Dinner — SNL.” You’re welcome.

The next several code blocks analyze the data structure — number of rows, column names, data types, etc. — to lay the groundwork, as well as demonstrate various techniques and tools we learned in the course.

“…and so on, and so forth, and what have you.” — Sue Heck, The Middle

This project looks at similar (but more up-to-date) New York Times Covid-19 data, but compares it against population density rather than political leaning. It is not lost on me that the two things might be related, but for purposes of the school project, I met the requirements with just this and did not need to add additional data here. (I was going for a good grade, not trying to prove anything specific…) To that end, I began with a visualization to see whether this was a promising direction.

Notice how the higher, darker dots are all on the left side of the chart — this does not bode well for small populations and death rates from Covid (cue dark music…)

Next, I compared the population density label against raw population numbers (the two are related, of course) and generated a kernel density estimate (KDE) plot.

If you think packing people together more tightly makes them more likely to die from Covid-19, this visualization challenges that assumption.

The next section of the project does a “What-If” data transformation exercise which, while interesting, doesn’t relate directly to the output here. Some additional tables and sums were generated, which I had also previously done (check out the “something meaty” blog or the code if you want to revisit that…) but then just to be a teacher’s pet, I decided to throw in an example of using tapply to get to the same thing in a more sophisticated fashion.

Look at me all hoity toity. I’m now required to drink tea with my pinky finger sticking out.
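For readers who have not met tapply: in R, `tapply(values, groups, sum)` applies a function to `values` within each level of `groups`. The sketch below is a rough Python analogue of that idea, not the project's actual R code. The rows, the labels, and the `group_sum` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (density_label, deaths). These values are made up
# for illustration and are NOT taken from the project's data.
rows = [
    ("Low", 12),
    ("High", 30),
    ("Low", 8),
    ("Medium", 15),
    ("High", 10),
]


def group_sum(pairs):
    """Sum the numeric value within each group label.

    Roughly what tapply(values, groups, sum) does in R.
    """
    totals = defaultdict(int)
    for label, value in pairs:
        totals[label] += value
    return dict(totals)


totals = group_sum(rows)
print(totals)
```

Swapping the addition for a different aggregation (a count, a maximum, a mean) gives the other summaries you would otherwise build with separate tables.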

That done, the next step was to check the “normality” of the data — or, in other words, check to see what kinds of analysis I could do based on whether the data followed a normal distribution or not. I performed a few operations — visual inspection via histograms, and then various statistical checks — to discover that indeed, my data were **not** normally distributed.
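As an illustration of what a statistical normality check can look like, here is a minimal sketch of a Jarque-Bera style test, which measures how far a sample's skewness and kurtosis are from those of a normal distribution. This is an assumption on my part about a reasonable check, written in Python rather than R; the project itself may well have used different tests. The sample values are invented.

```python
import math


def jarque_bera(values):
    """Return (skewness, excess_kurtosis, jb_statistic, p_value).

    Under normality the JB statistic is approximately chi-square with
    2 degrees of freedom, whose survival function is exp(-x / 2).
    The approximation is rough for small samples.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments (population form, dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)
    return skew, excess_kurt, jb, p_value


# Hypothetical right-skewed counts (many small values, a few huge ones).
sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14, 22, 40, 95]
skew, excess_kurt, jb, p_value = jarque_bera(sample)
print(f"skew={skew:.2f} excess_kurtosis={excess_kurt:.2f} p={p_value:.3g}")
```

A small p-value here is evidence against normality, which is the signal to stay away from tests that assume it.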

Again, playing the role of teacher’s pet, I included several additional normality tests and plots/charts just for fun. One of the reasons for this normality check is to determine what kind of tests / analysis can be performed on the data. For the project, I intentionally picked a test that makes the assumption that data are normally distributed and examined the output just to drive the point home — you *have* to understand your data and the tests you are running before you are going to be able to draw any real conclusions. Otherwise, the numbers will deceive you.

Ah, yes. The devil is in the details, is it not?

Having shown that running the wrong test on the data can lead to false conclusions, I then pivoted to a Negative Binomial model, which is built for count data and does not assume that the data are normally distributed.

(Sips tea, pinky finger straight out as per requirements.)
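One common reason to reach for a negative binomial model is overdispersion: count data whose variance is much larger than its mean. The sketch below is not the regression from the project, just a quick method-of-moments check in Python, assuming the usual parameterization where variance = mean + mean² / size. The counts and the `nb_moments` helper are hypothetical.

```python
import statistics


def nb_moments(counts):
    """Return (mean, sample_variance, size) for a list of counts.

    size is the method-of-moments dispersion estimate from
    variance = mean + mean**2 / size. It is None when the data are
    not overdispersed (variance <= mean), where the estimate is undefined.
    """
    mu = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    if var <= mu:
        return mu, var, None
    size = mu ** 2 / (var - mu)
    return mu, var, size


# Hypothetical death counts per county, invented for illustration.
counts = [0, 1, 1, 2, 3, 5, 8, 13, 21, 46]
mu, var, size = nb_moments(counts)
print(f"mean={mu} variance={var:.1f} size={size:.3f}")
```

A variance far above the mean (and a small `size`) suggests that a negative binomial model is a better fit for the counts than a plain Poisson one.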

As it turned out, the wrong test led us to the opposite of reality, a very, *very* important lesson about the destructive potential of statistical tools in the wrong hands. Sparsely populated counties experienced higher Covid infection rates, and **sharply** higher death rates from Covid. Yikes!

To finish up, I did some Confidence Interval work.

Confidentially, I find your confidence in your confidence confounding.
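To make the idea concrete, here is a minimal sketch of a confidence interval for a mean using the normal approximation, in Python rather than the project's R. The rates and the `mean_confidence_interval` helper are invented for illustration. For small or heavily skewed samples, a t-based or bootstrap interval would be a safer choice than this one.

```python
import math
import statistics


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of values."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * sem, mean + z * sem


# Hypothetical deaths per 100,000 residents in a handful of counties.
rates = [112.0, 95.5, 140.2, 88.0, 131.7, 104.3, 99.9, 150.1]
low, high = mean_confidence_interval(rates)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

Raising `level` widens the interval: more confidence that the true mean is inside costs you precision about where it is.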

At the end, I felt pretty good about the project and what I had been able to glean from the course and demonstrate. There are some interesting future possibilities as to how to extend this, as well as merge it with some of the work from my Python class and potentially other data sets. But for the class, this was a pretty satisfying place to end up.

Hope you enjoyed the walk through!

Jason Eden

Data Science & Cloud nerd with a passion for making complex topics easier to understand. All writings and associated errors are my own doing, not work-related.