Hello and welcome to my blog! I hope I am able to provide you with something of interest or of use to you.
I plan to use this as a place to write about my interests and my studies. I will use it to give updates and information about my projects and to write about tools and topics I find useful, interesting, or helpful. Generally, I will be writing about data science topics, but I also hope to cover other subjects from time to time.
Here in this first blog post, I am going to introduce a small, beginner project I started making when my wife wanted me to dig into the available data on the novel coronavirus (aka COVID-19) that has taken over the news cycles in this first part of 2020.
Since this is my first post on data topics (in addition to being my first overall!), I will give a quick overview of how I plan to handle code snippets before jumping into the actual analysis.
Generally, I will open-source my projects through my GitHub account and will link them in the post so you can access the code and data. Of course, when open-sourcing is not possible, I will say so at the beginning of the post and explain why.
For code snippets, I will be commenting profusely. My goal is to make the code (or at least its purpose) understandable to anyone reading this blog, even if they have never written or read a line of code before.
Since I am far from being an expert, I welcome and encourage any feedback on my code so I can improve!
The goal of this project is to create a one-day prediction model for new cases of COVID-19 in the United States broken down by state. Before beginning, there are a few caveats.
First, I am fairly new in my data science journey, and I am a busy dad. I may not have the skills or the time to create a model as detailed as I would like.
Second, the “novel” part of “novel coronavirus” is very true. Many unknowns still exist in the rate of spread and in detection.
Third, the official numbers cannot always be trusted due to differences in testing availability and symptom presentation.
Fourth, I have no expertise in disease spread or public health. I can certainly take the numbers and predict as best as I can, but I will not necessarily know which factors tend to be most important.
Fifth (though I may think of more later), different states are taking slightly different approaches and are implementing different restrictions at different times.
All of these caveats lead to errors and difficulties in the predictions. Of course, I am creating this project mostly as a means to play around with the tools and as a means to explore the data. If I were a professional trying to create predictions driving policy, I would do more to offset these caveats to create a better model.
Finally, we are at the point of starting the project! Naturally, the place to start with any data science project is with the data. In this case, it is conveniently collected already.
Thanks to the team at Johns Hopkins for collating COVID-19 case data from the various sources around the world! You can check out their data here: https://github.com/CSSEGISandData/COVID-19
In the rest of this post, I will cover how I obtained the data from Johns Hopkins and cleaned it to have just the data points relevant to this project.
You can find the following code and all code related to this project here: https://github.com/acthacker/covid19
In this initial code block, I pulled in the data from the Johns Hopkins repository. Rather than downloading and referencing my local copy of the data, I used the links to point to their data so this model can run on the newest data (which they update every day).
```python
# Imports Pandas and NumPy, which add features for data science work not included in base Python.
# I will wait to import other tools until the cells where I need them so I can
# write the explanations in those posts.
import numpy as np
import pandas as pd

# Reads data from the source and imports it into Pandas dataframes
data_confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
data_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")
data_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")

# Sample data
print(data_confirmed.head())
```
The result of the final print statement above is shown in the screenshot below. The “.head()” method displays the first five rows of the data_confirmed dataframe. From this sample, we can see that we have some work to do to narrow down to the United States data we are interested in.
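To make the behavior of “.head()” concrete, here is a quick sketch on a tiny, made-up dataframe (the column names and numbers are purely illustrative). With no argument, it returns the first five rows; passing a number returns that many instead:

```python
import pandas as pd

# A small illustrative dataframe with seven rows
toy = pd.DataFrame({"day": range(7), "cases": [1, 1, 2, 3, 5, 8, 13]})

# .head() defaults to the first 5 rows; .head(n) returns the first n
print(len(toy.head()))   # 5
print(len(toy.head(3)))  # 3
```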
In the next code block, I filtered out all countries except the United States. The code is a little dense, so here is a short explanation: the “=” reassigns new values to the variables we were already using, and everything after the equals sign creates a subset of the overall dataframe.
```python
# Filters each of the three dataframes to contain only US data
data_confirmed = data_confirmed[data_confirmed["Country/Region"].str.contains("US")]
data_deaths = data_deaths[data_deaths["Country/Region"].str.contains("US")]
data_recovered = data_recovered[data_recovered["Country/Region"].str.contains("US")]

# Sample data again
print(data_confirmed.head())
```
Basically, that code is saying “For each row, look in the ‘Country/Region’ column of the dataframe and tell me True if it contains ‘US’ and False if it does not. Then, give me only the rows where the answer is True.” You can view the results in the next screenshot.
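This two-step pattern (build a True/False mask, then use it to select rows) can be sketched on a tiny, made-up dataframe; the country names and case counts here are purely illustrative:

```python
import pandas as pd

# A toy dataframe standing in for the real COVID-19 data
toy = pd.DataFrame({
    "Country/Region": ["US", "Italy", "US", "France"],
    "Cases": [10, 20, 30, 40],
})

# Step 1: build a True/False mask, one value per row
mask = toy["Country/Region"].str.contains("US")
print(mask.tolist())              # [True, False, True, False]

# Step 2: keep only the rows where the mask is True
us_only = toy[mask]
print(us_only["Cases"].tolist())  # [10, 30]
```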
Oops, the data from the cruise ship “Diamond Princess” was included. While it may be valuable for the broader community of people investigating COVID-19, it was not terribly relevant to my analysis. Not shown in the screenshot is also a breakdown of COVID-19 cases by city. Since that level of granularity was not spread evenly around the country, I decided to stick to state-level analysis only. The next code block shows the removal of those extraneous pieces of information.
```python
# Filters out rows containing "Princess" (na=False treats any missing
# Province/State values as non-matches instead of raising an error)
data_confirmed = data_confirmed[~data_confirmed["Province/State"].str.contains("Princess", na=False)]
data_deaths = data_deaths[~data_deaths["Province/State"].str.contains("Princess", na=False)]
data_recovered = data_recovered[~data_recovered["Province/State"].str.contains("Princess", na=False)]

# Filters out rows containing ",", which catches the city-level entries
data_confirmed = data_confirmed[~data_confirmed["Province/State"].str.contains(",", na=False)]
data_deaths = data_deaths[~data_deaths["Province/State"].str.contains(",", na=False)]
data_recovered = data_recovered[~data_recovered["Province/State"].str.contains(",", na=False)]

# Sampling again
print(data_confirmed.head())
```
As you may notice, the code here is slightly different. Instead of keeping the rows where the match returned True, I wanted to keep the rows where it returned False so the matches get filtered out. The “~” in front of each expression inverts the True/False values that the “.contains()” method returns. Using this, I was able to filter down to state-level information only.
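The “~” inversion trick can also be sketched on a hypothetical mini-frame (the state names and counts are made up for illustration):

```python
import pandas as pd

# A toy dataframe including a cruise-ship row we want to drop
toy = pd.DataFrame({
    "Province/State": ["Washington", "Diamond Princess", "New York"],
    "Cases": [5, 7, 9],
})

mask = toy["Province/State"].str.contains("Princess")
print(mask.tolist())     # [False, True, False]

# "~" flips every True to False and vice versa
print((~mask).tolist())  # [True, False, True]

# Keeping the rows where the inverted mask is True drops the cruise ship
kept = toy[~mask]
print(kept["Province/State"].tolist())  # ['Washington', 'New York']
```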
There it is! The data filtered and adjusted down to what I needed to start the analysis. I plan to write at least two more posts in this series. For the second, I will cover the data manipulation as I create the necessary factors, and for the third, I will cover the prediction model and some preliminary accuracy results.
Thanks for reading! Please feel free to reach out to me with any feedback!