Predicting COVID-19 Spread Pt. 2: Data Preparation

Written by Chase Thacker


March 16, 2020

This is part 2 in my series about a small analytics project I am throwing together. The goal of the project is to create a one-day prediction model for new cases of the novel coronavirus. Check out part 1 for an overview of the data collection process. In this part of the series, I will cover preparing the data for input into the model.


Before I get into the data preparation, I have one more bit of data collection to cover. After writing my previous post, I realized that my data could be well supplemented by population numbers for each state. While I do not expect these numbers to matter much for the short-term models, the population counts may prove useful for later editions of the model.
I obtained the 2019 population estimates from the US Census Bureau here. Once I had the file downloaded, I did some basic manipulation in Excel to reshape it into a format that better suits my analysis. You can find the resulting CSV on my GitHub page for this project.
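If you would rather do that reshaping in pandas instead of Excel, a minimal sketch might look like the following. The raw filename and the SUMLEV, NAME, and POPESTIMATE2019 column names are my reading of the Census download, so verify them against your copy of the file:

import pandas as pd

#Load the raw Census estimates file (assumed filename from the Census download)
raw = pd.read_csv("nst-est2019-alldata.csv")

#Keep only the state-level rows (SUMLEV 40), dropping the national and regional totals
states = raw[raw["SUMLEV"] == 40]

#Keep just the state name and the 2019 estimate, with friendlier column names
state_pop = states[["NAME", "POPESTIMATE2019"]].rename(columns={"NAME": "state", "POPESTIMATE2019": "population"})

#Write out the same CSV the next code block loads
state_pop.to_csv("state_pop.csv", index=False)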

#pandas was imported back in part 1; repeated here so this block runs on its own
import pandas as pd

#State population estimates found at https://www.census.gov/data/datasets/time-series/demo/popest/2010s-national-total.html#par_textimage_401631162
data_pop = pd.read_csv("state_pop.csv")

print(data_pop.head())

The code above just loads in the data and then shows the first five entries to confirm the load worked correctly.


With all the data loaded, my next step was to shape it into appropriate dataframes for my analysis. The goal here was to build two dataframes: one containing the columns needed to train the model, and the other containing the columns I needed to predict on.

Basically, the prediction dataframe holds the most recent day's data (Day T), while the training dataframe holds the data for Day T-1. For Day T-1, the number of new cases that followed was known. For the current day, the number of new cases between today and Day T+1 is unknown, and that is what I hoped to predict.
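To make that alignment concrete, here is a toy example with hypothetical numbers for a single state:

#Toy example (hypothetical numbers): cumulative confirmed cases for one state
#Day:        T-2   T-1    T
#confirmed:   14    20    27

#Training row (features built from Day T-1):
#  1_day  = 20 - 14 = 6    #increase from T-2 to T-1
#  target = 27 - 20 = 7    #new cases from T-1 to T (known, so usable for training)

#Prediction row (features built from Day T):
#  1_day  = 27 - 20 = 7    #increase from T-1 to T
#  target = new cases from T to T+1 (unknown; this is what the model predicts)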

I will get more into the model details, including my thoughts on variable selection, in my next post. For the rest of this post, I will just be covering the data manipulation needed to create the variables.


This step was much bigger overall than I originally expected. For some reason, I did not think about how long it could take to wrangle four separate dataframes into the exact shapes I needed. The first step was simply alphabetizing everything for easier integration and creating a new index.

#Sort each dataframe alphabetically by state name
data_confirmed.sort_values(by=["Province/State"], ascending=True, inplace=True)
data_deaths.sort_values(by=["Province/State"], ascending=True, inplace=True)
data_recovered.sort_values(by=["Province/State"], ascending=True, inplace=True)

#Resets the indexing just to make everything a little cleaner to work with
data_confirmed = data_confirmed.reset_index(drop=True)
data_deaths = data_deaths.reset_index(drop=True)
data_recovered = data_recovered.reset_index(drop=True)
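Since everything downstream relies on the three dataframes lining up row for row, a quick sanity check like this one (an optional addition, not strictly necessary) is cheap insurance:

#Optional sanity check: confirm the three frames list the states in the same order
assert data_confirmed["Province/State"].equals(data_deaths["Province/State"])
assert data_confirmed["Province/State"].equals(data_recovered["Province/State"])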

Next, I needed to grab the state names, latitude, and longitude. Each of those needed to be added to the two new dataframes: the training data and the data for the prediction. I also noticed at this point that I still needed to drop the US territories, since I was focusing on the states and DC.

#Grabs the state, latitude, and longitude columns to initiate the dataframe
train_df = data_confirmed[["Province/State", "Lat", "Long"]]
#Drops the rows for the US territories
#.copy() avoids pandas' SettingWithCopyWarning when columns are added later
train_df = train_df[~train_df["Province/State"].isin(["Guam", "Puerto Rico", "Virgin Islands"])].copy()
#Same for the second dataframe
predict_df = data_confirmed[["Province/State", "Lat", "Long"]]
predict_df = predict_df[~predict_df["Province/State"].isin(["Guam", "Puerto Rico", "Virgin Islands"])].copy()

print(train_df.head())

You can see in the printed output that the data is loaded and ready for further additions.

Next, I created the variables for 1-day increase in cases, 3-day increase in cases, and currently active cases.
#Creating variables for currently active cases
#The time-series frames have one column per day, with the most recent day in the last column,
#so .iloc[:, -1] is today and .iloc[:, -2] is yesterday
#Subtracts the COVID deaths and the COVID recoveries from the confirmed cases to give a current, active count
pred_curr_day_active = data_confirmed.iloc[:, -1].subtract(data_deaths.iloc[:, -1].add(data_recovered.iloc[:, -1]))
train_curr_day_active = data_confirmed.iloc[:, -2].subtract(data_deaths.iloc[:, -2].add(data_recovered.iloc[:, -2]))

#Setting the result we are trying to predict in the training model
#The result to predict is the number of new cases between yesterday and today, using data leading up to today
train_target = data_confirmed.iloc[:, -1].subtract(data_confirmed.iloc[:, -2])

#Creating variables for the past day's increases in cases
train_increase = data_confirmed.iloc[:, -2].subtract(data_confirmed.iloc[:, -3])
train_3day_increase = data_confirmed.iloc[:, -2].subtract(data_confirmed.iloc[:, -5])

#Creating variables for the current day's increases in cases
pred_increase = data_confirmed.iloc[:, -1].subtract(data_confirmed.iloc[:, -2])
pred_3day_increase = data_confirmed.iloc[:, -1].subtract(data_confirmed.iloc[:, -4])

#Adds the columns to the training dataframe
train_df["population"] = data_pop["population"]
train_df["active"] = train_curr_day_active
train_df["1_day"] = train_increase
train_df["3_day"] = train_3day_increase

#Adds the columns to the prediction dataframe
predict_df["population"] = data_pop["population"]
predict_df["active"] = pred_curr_day_active
predict_df["1_day"] = pred_increase
predict_df["3_day"] = pred_3day_increase

display(train_df.head())
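One caveat on the population column: pandas aligns data_pop["population"] by row index, so it only lines up if state_pop.csv lists the same rows in the same alphabetical order as the filtered dataframes. A safer pattern, assuming state_pop.csv names its state column "state", would be to merge on the state name instead of assigning by index:

#Safer alternative: align population by state name rather than by row index
#Assumes state_pop.csv has "state" and "population" columns
train_df = train_df.merge(data_pop.rename(columns={"state": "Province/State"}), on="Province/State", how="left")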

With these columns in place, I had only one other column that I wanted to add: the number of cases in nearby states. For this column, I needed to create a function in Python to calculate the great circle distance between the provided latitudes and longitudes. Of course, there are issues with this approach, but I will discuss those more in the modelling section of this project.

#Importing needed libraries
from math import sin, cos, sqrt, atan2, radians

#Initiate function which takes the lat and lon of two different points as inputs
def haversine(lat1, lon1, lat2, lon2):

    #Estimated radius of the Earth in miles
    R = 3958.8

    #Transforms the lat and lon into radians for the calculation
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    #Programmed version of the Haversine equation
    #Found here: https://en.wikipedia.org/wiki/Haversine_formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    #Returns the great circle distance between the two points, in miles
    return distance
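A quick spot check against a known distance confirms the function behaves as expected; the great-circle distance from New York City to Los Angeles is roughly 2,450 miles:

#Spot check: New York City (40.7128, -74.0060) to Los Angeles (34.0522, -118.2437)
#Should print a value of roughly 2,450 (miles)
print(haversine(40.7128, -74.0060, 34.0522, -118.2437))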

The function above basically just takes the Haversine equation and converts it from math notation into working code. Once that function was in place, I looped through each state, found every other state whose center point was within 250 miles of that state's center point, and summed the active cases in those nearby states. That sum was added to the dataframe as that state's "nearby" cases count.

#Initializing the lists to populate with nearby cases
nearby_cases_pred = []
nearby_cases_train = []

#The outer loop iterates over each row (state) in the Pandas dataframe
for index, row in predict_df.iterrows():
    #This line gives a baseline of 0 for the nearby case count
    case_count = 0
    #This is a second, internal loop over every state again to compare against the selected state
    for i, r in predict_df.iterrows():
        #Calls the Haversine function and returns the distance
        hav = haversine(row["Lat"], row["Long"], r["Lat"], r["Long"])
        #If that distance is less than 250 and more than 0 (so it does not catch the same state)...
        #For example: comparing WV to WV in the inner loop would return 0, and we want nearby, not in-state cases here
        if hav <= 250 and hav > 0:
            #Add the case count for the nearby state to the running case count total
            case_count += r["active"]
    #Appends the total nearby cases to the list initiated above
    nearby_cases_pred.append(case_count)

#Same code again for the training data
for index, row in train_df.iterrows():
    case_count = 0
    for i, r in train_df.iterrows():
        hav = haversine(row["Lat"], row["Long"], r["Lat"], r["Long"])
        if hav <= 250 and hav > 0:
            case_count += r["active"]
    nearby_cases_train.append(case_count)

#Takes the lists created and adds them as new columns in the dataframes.
train_df["nearby"] = nearby_cases_train
predict_df["nearby"] = nearby_cases_pred

display(predict_df.head())

As you can see in the output, all of the desired columns were present and ready to begin prediction!
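As an aside, with only 51 rows the nested iterrows() loops above run instantly, but they do O(n²) Python-level work. For larger datasets, a vectorized sketch along these lines (a version I have not battle-tested, not the code I actually ran) computes the whole distance matrix with NumPy broadcasting and replaces both loops:

import numpy as np

def nearby_cases(df, radius=250.0):
    #Convert the coordinates to radians as NumPy arrays
    lat = np.radians(df["Lat"].to_numpy())
    lon = np.radians(df["Long"].to_numpy())

    #Pairwise differences via broadcasting, giving (n, n) matrices
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]

    #Haversine formula applied to every pair of states at once (Earth radius in miles)
    a = np.sin(dlat / 2)**2 + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2)**2
    dist = 2 * 3958.8 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    #True where another state's center is within the radius, excluding the state itself
    mask = (dist <= radius) & (dist > 0)

    #For each state, sum the active cases of every nearby state
    return (mask * df["active"].to_numpy()).sum(axis=1)

predict_df["nearby"] = nearby_cases(predict_df)
train_df["nearby"] = nearby_cases(train_df)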

With everything in place and formatted correctly, I was able to start the actual predictions. Check out part 3 where I will wrap up this project!


If you enjoy my writing, you can also check out my other blog where I review books!