If you stuck with me through the first two posts, thanks! If you are just joining me in this one, be sure to check out parts 1 and 2 for more background about the project, the data, and the modifications that have been made to it.
In this post, I will cover the creation of the prediction model and the results it produced. I hope it is at least somewhat helpful and informative!
If you are interested in the source code, you can find everything in my GitHub at https://github.com/acthacker/covid19
Now, I can finally write about the fun part! (Or, at least, what people usually consider fun. I actually think the data prep is interesting since it requires so much care, planning, and thought).
First, I want to cover the variable I created but ended up not using. I went through the effort of finding and bringing in the population data for each state. However, I decided against using it in the models since the number of cases is so minuscule right now relative to the overall population. I could see a metric like % of population confirmed being relevant once numbers rise, but the percentage is too low right now to be of much use.
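For reference, the metric would have looked something like the line below (the column names here are placeholders, not necessarily what is in the project data):

#Hypothetical columns: share of each state's population confirmed, as a percentage
state_df["pct_confirmed"] = state_df["confirmed"] / state_df["population"] * 100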
Here are the variables I did use:
1. Total number of active confirmed: I wanted to include this and to make sure I used active confirmed rather than total confirmed since anyone actively sick could be a disease vector. Subtracting out deaths and recovered cases to look only at active cases may give a more accurate picture of current disease vectors.
2. 1-day and 3-day new cases: I wanted to be able to see the number of new cases to gauge the velocity of growth. I recognize that these two variables likely have a high correlation (multicollinearity, which is a faux pas in linear models), but they do not seem to affect the overall accuracy of the model at this time.
3. Nearby: Since diseases know no human borders, I figured looking at the number of cases in nearby states would be worthwhile. This metric is a little skewed since it looks at the distance between the center of one state and the centers of the surrounding states. There may be a case where two states have lots of cases in border counties without having their geographic centers near enough to be included in each other’s count. However, I decided to keep this one since it helps particularly in areas like New England, where low-population states like Rhode Island are closely connected to high-population states like New York. (A rough sketch of how all three variables can be computed follows this list.)
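Here is that rough sketch of how variables like these can be derived. Everything below (the column names, the toy numbers, the 200-mile radius) is a placeholder for illustration rather than the exact project code:

import numpy as np
import pandas as pd

#Toy per-state snapshot; the real values come from the prepared time series data
df = pd.DataFrame({
    "state": ["New York", "Rhode Island", "Connecticut"],
    "lat": [42.9, 41.7, 41.6],
    "lon": [-75.5, -71.5, -72.7],
    "confirmed": [1000, 25, 50],
    "deaths": [10, 0, 1],
    "recovered": [40, 5, 9],
    "confirmed_1_day_ago": [950, 21, 44],
    "confirmed_3_days_ago": [700, 15, 30],
})

#1. Active confirmed: strip deaths and recoveries out of the total
df["active"] = df["confirmed"] - df["deaths"] - df["recovered"]

#2. 1-day and 3-day new cases: differences against earlier snapshots
df["1_day"] = df["confirmed"] - df["confirmed_1_day_ago"]
df["3_day"] = df["confirmed"] - df["confirmed_3_days_ago"]

#3. Nearby: active cases in other states whose centers fall within a radius
def haversine_miles(lat1, lon1, lat2, lon2):
    #Great-circle distance between points, in miles
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def nearby_cases(row, radius_miles=200):
    dists = haversine_miles(row["lat"], row["lon"], df["lat"], df["lon"])
    mask = (dists <= radius_miles) & (df["state"] != row["state"])
    return df.loc[mask, "active"].sum()

df["nearby"] = df.apply(nearby_cases, axis=1)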
For the model, I used linear regression. I plan to write a longer post about linear regression someday. For now, if you are unfamiliar with it, think of it as trying to draw a line that most closely matches the data points you have with known responses. When you then add new data (the data you are trying to predict), the model places those points on the fitted line and returns the number predicted by the path of that line.
You can check out this Wikipedia article for more detailed info: https://en.wikipedia.org/wiki/Simple_linear_regression
In linear regression, the result (Y) is equal to an intercept term (B0) plus a regression coefficient created by the model multiplied by each variable: Y = B0 + B1*X1 + B2*X2 + … + Bn*Xn
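To make that concrete, here is a toy example with a single variable (the numbers are made up purely for illustration):

from sklearn.linear_model import LinearRegression
import numpy as np

#Made-up points that roughly follow the line y = 2x + 1
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

toy = LinearRegression().fit(x, y)
print(toy.intercept_)      #roughly 1
print(toy.coef_)           #roughly [2]
print(toy.predict([[6]]))  #roughly [13]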
Here is my model:
#Importing necessary modules
import pandas as pd
from sklearn.linear_model import LinearRegression

#Creating Y (target) variable and X (prediction) variables
Y = train_target
X = train_df[["nearby", "active", "1_day", "3_day"]]

#Creating and training regression model
regressor = LinearRegression().fit(X, Y)
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])

#Displaying model stats
display(coeff_df)
print("Intercept: " + str(regressor.intercept_))
print("R^2 value: " + str(round(regressor.score(X, Y), 2)))
The screenshot below shows the intercept and coefficients created for the model. It also shows the R-squared (R^2) value. I will talk about it more when I write about linear regression, but it is basically a measure of goodness of fit: R^2 measures the amount of the variation in the results that is explained by the model, and it is given as a value between 0 and 1.
The intercept is a baseline value. If all the X values (the variables like 1-day, 3-day, etc.) are at 0, then the resulting prediction will just be the intercept of -1.35. The coefficients mean that a 1-unit increase in a variable raises the prediction by that coefficient.
For example, the .45 for Active cases means that 100 Active cases in a state would raise the estimate by 45. Add that to the Intercept, and you get a prediction of roughly 43.65 with all other variables kept steady.
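To see the arithmetic, here it is by hand, using the intercept and Active coefficient from above with the other variables held at 0:

#Prediction by hand for a state with 100 active cases and all other variables at 0
intercept = -1.35
active_coef = 0.45
prediction = intercept + active_coef * 100
print(prediction)  #43.65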
The R^2 for this model is a very high .95, which means the model successfully explains almost all of the variation. However, this could be due to overfitting given the small number of data points (51). Also, running the model on data from a different date (this was from the 3/16/20 data) will produce entirely different results.
You can see this model at work in the code and following screenshot. In that next code block, I predicted results for the 3/16 data and then compared them to the actual, known results.
#Creating predictions for the known data
y_train_pred = regressor.predict(X)

#Table to compare the predicted results against the actual, known results
training_predictions = pd.DataFrame({"Predicted": y_train_pred, "Actual": Y})
display(training_predictions.head(10))
As you can see, the predictions are not quite right, but they are pretty close. Now that the model was trained on the known data, I was ready to predict the results for the unknown: how many new confirmed cases would be announced per state on 3/17.
#Creating the X (predictor) variables for the unknown (current) data
x_prediction = predict_df[["nearby", "active", "1_day", "3_day"]]

#Predicting and storing in a Y result variable
y_pred = regressor.predict(x_prediction)

#Initiating list for modifying the above results
adjusted_predictions = []

#For each item in the results list
for i in y_pred:
    #If it is less than or equal to 0, set it to 0
    if i <= 0:
        adjusted_predictions.append(0)
    #Otherwise, round it to the nearest integer (think: whole, non-decimal number)
    else:
        adjusted_predictions.append(int(i.round()))

#Put the adjusted results in a dataframe and display predictions for next-day new cases
predicted_by_state = pd.DataFrame({"State": predict_df["Province/State"], "Predicted new cases": adjusted_predictions})
display(predicted_by_state.head(10))
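As a side note, the loop above could be replaced by a single vectorized step with NumPy that does the same clip-at-zero and rounding:

#Equivalent one-liner: round, clip negatives to 0, and cast to integers
import numpy as np
adjusted_predictions = np.clip(np.round(y_pred), 0, None).astype(int)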
Here are the first 10 as shown in my IDE:
And the total results after exporting to Excel and formatting a bit for clarity:
Here at the end, I want to caution everyone again to take these results and predictions with a grain of salt. Many factors go into disease spread, and the 50 states have wildly different cultures and geographies that could contribute to or hinder the spread of COVID-19.
I could easily have missed relevant factors. I have high correlation between several of my factors. I did not do any kind of cross-validation or train/test split on my data. Those are just a few of the areas that could be improved.
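For example, a basic train/test split with scikit-learn would look something like this (a sketch of the improvement, not something I ran for these results):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#Hold out 20% of the states to check how the model does on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
holdout_model = LinearRegression().fit(X_train, y_train)
print("Train R^2: " + str(round(holdout_model.score(X_train, y_train), 2)))
print("Test R^2: " + str(round(holdout_model.score(X_test, y_test), 2)))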
Regardless, I had fun creating this, and I would like to see how it performs each day throughout this outbreak. For a first personal data science project, it was fun and instructive for me (I had not done many of these steps in Python before, since I had done most of my data science work in R and kept Python for programming).
I hope it can be somewhat fun and instructive for you too!
P.S. In the future, I may revisit this and try to create some predictive models that go beyond a single day.