A nerdy Clara Peller would have enjoyed this one.
Up to this point, a lot of my effort has been in figuring out how to do certain things that are pretty much baseline capabilities for integrating the power of Google Cloud Bigquery and AI/ML tools into data analysis / data science work on local systems running R or Python. We’ve reached a point, now, where we can start to dig in and actually do something resembling a real project, or at least the beginnings of one. So if you’ve been wondering
…your wait is over. I’m going to walk through a recent effort where I start from public data readily available on BigQuery, analyze it from beginning to end in an R Markdown file, and do some basic data joins and transformations. I’ll do some simple analysis on the entire set of data, followed by figuring out how I want to subset the data for local analysis, and then pull the smaller subset to my local system to finish off. I’ll include some screenshots from RStudio along the way, but if you want to check out the code and follow along at home, you can grab the final file here.
RIP Clara. Now let’s get some beef.
Once I got all the libraries loaded and authentication set up, the first thing I did was to pull in a list of all of the public datasets that Google makes available for BigQuery, and then subset that list to produce only those datasets that contained the word “covid” in them.
Out of that list, I thought it might be interesting to look at the New York Times data set. There were four tables, one named “us_counties”, and upon exploration I discovered it had a daily running total on a county by county basis for confirmed cases and deaths.
My first thought was to try to give these numbers some frame of reference — for example, by comparing confirmed cases and deaths to the total population estimate for a given county. That data was not captured in the NYT data set, so I needed to look elsewhere. My first instinct was to see what census data was available, and it turned out there were several data sets.
Looking at the list, I thought “census_bureau_usa” might be promising, so I looked at the tables available.
The data here were at best 10+ years old and sorted by zip code rather than by county. In other words
If that was the best available, I either could have tried to make it work or pulled more updated data using the tidycensus package that I had described in my last blog post, but given there were several other census datasets, so I was undeterred. I happened to get lucky in my next exploration.
I checked out the “county_2018_5yr” table, and sure enough it had more than 3,000 rows of counties with lots and lots of data points. Critically present was a field called geo_id, which lined up with the fips_code field in my NYT data, and a field called total_pop.
There were a bunch of other interesting data in there as well (242 columns total), and I may go back and add some or all of them into my master table at some point. For now, total population was all I wanted to throw in for the first round of analysis.
To do this, it was time to bring some data home into my own GCP project. I started off by creating a dataset, and then after a quick test to make sure my credentials and methodology were still good, I performed a join of the NYT table and just the population column from the census data by the county fips code / geo id.
Note that this was the first time I was actually storing data in my project, as up to that point everything I had done had been on public tables. Now, I had all of my data of interest in a single table, in my own GCP project.
Some quick analysis to determine data size, number of rows, and list out the fields and their data types, and I was ready for the next step. This turns out to be a really hard thing in some cases: you’ve got all this cool data at your fingertips, so what question do you want to answer? After giving it some thought, I came up with the question “Does county size correlate to number of cases or deaths?” This led to the next question — how to determine what “size” meant in this context. A quick Google search brought me to a census blog that had some neat statistics that broke down big vs. small counties based on population density, as opposed to overall population, so I decided to just adopt their definition. According to their data, the largest “small” county population was under 480,000, and the smallest “large” county population was over 480,000. This made it pretty easy to divide my counties into two groups — dense population vs. sparse population. Back to some SQL to create the factors, and we were almost ready for prime time.
The last thing I needed to do was normalize (I’m pretty sure that’s the wrong word — please feel free to drop a comment correcting my terminology…) the case and death counts by dividing them by the population, and then dividing that by 1,000 to get total deaths per 1,000 people whether your county had a population of 2,000 or 2,000,000. Then I had a single, equivalent variable (two of them, actually) that I could use to determine if population density had a correlation with deaths or case count.
The final step before the “real” stats work was to segment out just a single day’s worth of data into a separate table. This data ended up being just over 3,000 rows and under 300 KB, so I decided to pull it back into the local system and store it as an R data table for further analysis.
Here’s where things got interesting. I started out by looking up totals for population, separated by whether you lived in a dense or sparse county, and then the total cases and deaths for each category.
At a quick glance, it looked like the populations were roughly even between the two groups, which was a surprise in and of itself for me. The case numbers and death numbers also looked very similar. I had gone into this thinking that the more densely populated an area was, the more likely you would be to catch and die from COVID-19 (hypothesis: more people in smaller space = more transmission), but this data looked like it was roughly the same.
Or did it?
One of the things I learned in my MBA program years ago was that human beings — even most science-y and engineering-minded ones — are just naturally terrible at interpreting meaning from numbers in data, and especially large numbers and data.
I have learned since then to almost never trust my initial instinct to large number comparisons like this without more thorough analysis, and specifically applied statistics. As such, I decided to do a couple of basic stats tests and test my interpretation. Both of these tests — the t-test and the anova — examine data when looking at the effect that a variable (in our case, dense vs. sparse county population) has on an outcome (in the notebook, I decided to look at deaths per 1,000 population).
Before you correct me here — I know, it’s not an effect, per se, but rather a correlation, since the things being tested may actually both be affected in the same way by some other variable. Thus, as the saying goes, correlation does not necessarily equal causation. BUT — it’s a good indicator that there’s something worth exploring further, if a correlation (effect) is indicated.
I performed the t-test first:
The important value for now in this output is the p-value. A p-value of less that 0.05 has significance — debated, I know, but let’s treat this like an intro to stats class, shall we? — in that you effectively reject the “null hypothesis.” The null hypothesis states that there is ***not*** a significant difference in the correlation (effect) of your two factor variables. Thus, if your p-value is greater than 0.05, you are said to have accepted the null hypothesis — your variable doesn’t correlate to a significant difference in the outcome. However, in our case, the p-value turned out to be far smaller than 0.05, which we then interpret as a rejection of the null hypothesis. This means there is reason to believe that there ***is*** a significant correlation between population density of a county basis and overall COVID-19 death rate.
What really got me here was the mean difference between the two groups, listed at the bottom of the output. It turned out not only was I wrong with my first-glance interpretation of the data (remember, I thought it looked like there wasn’t much difference between dense and sparse populations,) but I was also guessing wrong about the direction that difference would go if there was one. It turned out the mean/average deaths per 1,000 population in the sparsely populated counties was actually **higher** than in the densely populated ones. That was fascinating, as it could indicate either behavioral or environmental differences between folks who lived in one type of area vs. the other are likely to blame for these results. Further analysis required!
The anova test produced a similar result:
Now before you go sending this to some journal for publication, keep in mind that on the statistics side of things, I’m still warming up. There are things that I know I needed to do and didn’t — for example, check for the distribution of my data and adjust if it’s not normal — and probably things I’m not aware of I needed to do and didn’t (are my deaths per 1,000 numbers small enough I needed to use log somehow to correct for vanishingly small number calculations?) And I’m sure there’s more nuance than that. That said, my main goal was to try to find a way to take what I’ve learned up to this point and produce something that let me practice real-world skills, and develop practices that could transfer once I **do** know better how to do the things I’m learning. To that end, I think this was a pretty cool exercise to walk through. What do you think?