Hello and welcome to my blog! I hope I am able to provide you with something of interest or of use to you.
I plan to use this as a place to write about my interests and my studies. I will use it to give updates and information about my projects and to write about tools and topics I find useful, interesting, or helpful. Generally, I will be writing about data science topics, but I also hope to cover other subjects from time to time.
Here in this first blog post, I am going to introduce a small, beginner project I started making when my wife wanted me to dig into the available data on the novel coronavirus (aka COVID-19) that has taken over the news cycles in this first part of 2020.
Since this is my first post on data topics (in addition to being my first overall!), I will give a quick overview of how I plan to handle code snippets before jumping into the actual analysis.
Generally, I will open-source my projects through my GitHub account and will link them in the post so you can access the code and data. Of course, when open-sourcing is not possible, I will say so at the beginning of the post and explain why.
For code snippets, I will be commenting profusely. My goal is to make the code (or at least its purpose) understandable to anyone reading this blog, even if they have never written or read a line of code before.
Since I am far from being an expert, I welcome and encourage any feedback on my code so I can improve!
The goal of this project is to create a one-day prediction model for new cases of COVID-19 in the United States broken down by state. Before beginning, there are a few caveats.
First, I am fairly new in my data science journey, and I am a busy dad. I may not have the skills or the time to create a model as detailed as I would like.
Second, the “novel” part of “novel coronavirus” is very true. Many unknowns still exist in the rate of spread and in detection.
Third, the official numbers cannot always be trusted due to differences in testing availability and symptom presentation.
Fourth, I have no expertise in disease spread or public health. I can certainly take the numbers and predict as best as I can, but I will not necessarily know which factors tend to be most important.
Fifth (though I may think of more later), different states are taking slightly different approaches and are implementing different restrictions at different times.
All of these caveats lead to errors and difficulties in the predictions. Of course, I am creating this project mostly as a means to play around with the tools and as a means to explore the data. If I were a professional trying to create predictions driving policy, I would do more to offset these caveats to create a better model.
Finally, we are at the point of starting the project! Naturally, the place to start with any data science project is with the data. In this case, it is conveniently collected already.
Thanks to the team at Johns Hopkins for collating COVID-19 case data from the various sources around the world! You can check out their data here: https://github.com/CSSEGISandData/COVID-19
In the rest of this post, I will cover how I obtained the data from Johns Hopkins and cleaned it to have just the data points relevant to this project.
You can find the following code and all code related to this project here: https://github.com/acthacker/covid19
In this initial code block, I pulled in the data from the Johns Hopkins repository. Rather than downloading and referencing my local copy of the data, I used the links to point to their data so this model can run on the newest data (which they update every day).
```python
# Imports Pandas and NumPy, which add features for data science work not included in base Python.
# I will wait to import other tools until the cells where I need them so I can
# write the explanations in those posts.
import numpy as np
import pandas as pd

# Reads data from the source and imports it into Pandas dataframes
data_confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
data_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")
data_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")

# Sample data
print(data_confirmed.head())
```
The result of the final print statement above is shown in the screenshot below. The “.head()” method displays the first five rows of the data_confirmed dataframe. From this sample, we can see that we have some work to do to narrow down to the United States data we are interested in.
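To make the behavior of “.head()” concrete, here is a quick sketch on a tiny, made-up dataframe (the column names and numbers are purely illustrative). With no argument, it returns the first five rows; passing a number returns that many instead:

```python
import pandas as pd

# A small illustrative dataframe with seven rows
toy = pd.DataFrame({"day": range(7), "cases": [1, 1, 2, 3, 5, 8, 13]})

# .head() defaults to the first 5 rows; .head(n) returns the first n
print(len(toy.head()))   # 5
print(len(toy.head(3)))  # 3
```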
In the next code block, I filtered out all countries except the United States. The code is a little dense, so here is a short explanation: the “=” reassigns new values to the variables we were already using, and everything after the equals sign creates a subset of the overall dataframe.
```python
# Filters each of the three dataframes to contain only US data
data_confirmed = data_confirmed[data_confirmed["Country/Region"].str.contains("US")]
data_deaths = data_deaths[data_deaths["Country/Region"].str.contains("US")]
data_recovered = data_recovered[data_recovered["Country/Region"].str.contains("US")]

# Sample data again
print(data_confirmed.head())
```
Basically, that code is saying “For each row, look in the ‘Country/Region’ column of the dataframe and tell me True if it contains ‘US’ and False if it does not. Then, give me only the rows where the answer is True.” You can view the results in the next screenshot.
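This two-step pattern (build a True/False mask, then use it to select rows) can be sketched on a tiny, made-up dataframe; the country names and case counts here are purely illustrative:

```python
import pandas as pd

# A toy dataframe standing in for the real COVID-19 data
toy = pd.DataFrame({
    "Country/Region": ["US", "Italy", "US", "France"],
    "Cases": [10, 20, 30, 40],
})

# Step 1: build a True/False mask, one value per row
mask = toy["Country/Region"].str.contains("US")
print(mask.tolist())              # [True, False, True, False]

# Step 2: keep only the rows where the mask is True
us_only = toy[mask]
print(us_only["Cases"].tolist())  # [10, 30]
```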
Oops, the data from the cruise ship “Diamond Princess” was included. While it may be valuable for the broader community of people investigating COVID-19, it was not terribly relevant to my analysis. Not shown in the screenshot is also a breakdown of COVID-19 cases by city. Since that level of granularity was not spread evenly around the country, I decided to stick to state-level analysis only. The next code block shows the removal of those extraneous pieces of information.
```python
# Filters out rows containing "Princess" (na=False treats any missing
# Province/State values as non-matches instead of raising an error)
data_confirmed = data_confirmed[~data_confirmed["Province/State"].str.contains("Princess", na=False)]
data_deaths = data_deaths[~data_deaths["Province/State"].str.contains("Princess", na=False)]
data_recovered = data_recovered[~data_recovered["Province/State"].str.contains("Princess", na=False)]

# Filters out rows containing ",", which catches the city-level entries
data_confirmed = data_confirmed[~data_confirmed["Province/State"].str.contains(",", na=False)]
data_deaths = data_deaths[~data_deaths["Province/State"].str.contains(",", na=False)]
data_recovered = data_recovered[~data_recovered["Province/State"].str.contains(",", na=False)]

# Sampling again
print(data_confirmed.head())
```
As you may notice, the code here is slightly different. Instead of keeping the rows where the match returned True, I wanted to keep the rows where it returned False so the matches get filtered out. The “~” in front of each expression inverts the True/False values that the “.contains()” method returns. Using this, I was able to filter down to state-level information only.
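The “~” inversion trick can also be sketched on a hypothetical mini-frame (the state names and counts are made up for illustration):

```python
import pandas as pd

# A toy dataframe including a cruise-ship row we want to drop
toy = pd.DataFrame({
    "Province/State": ["Washington", "Diamond Princess", "New York"],
    "Cases": [5, 7, 9],
})

mask = toy["Province/State"].str.contains("Princess")
print(mask.tolist())     # [False, True, False]

# "~" flips every True to False and vice versa
print((~mask).tolist())  # [True, False, True]

# Keeping the rows where the inverted mask is True drops the cruise ship
kept = toy[~mask]
print(kept["Province/State"].tolist())  # ['Washington', 'New York']
```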
There it is! The data filtered and adjusted down to what I needed to start the analysis. I plan to write at least two more posts in this series. For the second, I will cover the data manipulation as I create the necessary factors, and for the third, I will cover the prediction model and some preliminary accuracy results.
Thanks for reading! Please feel free to reach out to me with any feedback!