If you stuck with me through the first two posts, thanks! If you are just joining me in this one, be sure to check out parts 1 and 2 for more background about the project, the data, and the modifications that have been made to it.
In this post, I will cover the creation of the prediction model and the results it produced. I hope it is at least somewhat helpful and informative!
If you are interested in the source code, you can find everything in my GitHub at https://github.com/acthacker/covid19
Now, I can finally write about the fun part! (Or, at least, what people usually consider fun. I actually think the data prep is interesting since it requires so much care, planning, and thought).
First, I want to cover the variable I created but ended up not using. I went through the effort of finding and bringing in the population data for each state. However, I decided against using it in the models since the number of cases is so minuscule right now relative to the overall population. I could see a metric like % of population confirmed being relevant once numbers rise, but the percentage is too low right now to be of much use.
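For reference, the metric would have looked something like the line below (the column names here are placeholders, not necessarily what is in the project data):

#Hypothetical columns: share of each state's population confirmed, as a percentage
state_df["pct_confirmed"] = state_df["confirmed"] / state_df["population"] * 100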
Here are the variables I did use:
1. Total number of active confirmed: I wanted to include this and to make sure I used active confirmed rather than total confirmed since anyone actively sick could be a disease vector. Subtracting out deaths and recovered cases to look only at active cases may give a more accurate picture of current disease vectors.
2. 1-day and 3-day new cases: I wanted to be able to see the number of new cases to gauge the velocity of growth. I recognize that these two variables likely have a high correlation (multicollinearity, which is a faux pas in linear models), but they do not seem to affect the overall accuracy of the model at this time.
3. Nearby: Since diseases know no human borders, I figured looking at the number of cases in nearby states would be worthwhile. This metric is a little skewed since it looks at the distance between the center of one state and the centers of the surrounding states. There may be a case where two states have lots of cases in border counties without having their geographic centers near enough to be included in each other’s count. However, I decided to keep this one since it helps particularly in areas like New England, where low-population states like Rhode Island are closely connected to high-population states like New York. (A rough sketch of how all three variables can be computed follows this list.)
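Here is that rough sketch of how variables like these can be derived. Everything below (the column names, the toy numbers, the 200-mile radius) is a placeholder for illustration rather than the exact project code:

import numpy as np
import pandas as pd

#Toy per-state snapshot; the real values come from the prepared time series data
df = pd.DataFrame({
    "state": ["New York", "Rhode Island", "Connecticut"],
    "lat": [42.9, 41.7, 41.6],
    "lon": [-75.5, -71.5, -72.7],
    "confirmed": [1000, 25, 50],
    "deaths": [10, 0, 1],
    "recovered": [40, 5, 9],
    "confirmed_1_day_ago": [950, 21, 44],
    "confirmed_3_days_ago": [700, 15, 30],
})

#1. Active confirmed: strip deaths and recoveries out of the total
df["active"] = df["confirmed"] - df["deaths"] - df["recovered"]

#2. 1-day and 3-day new cases: differences against earlier snapshots
df["1_day"] = df["confirmed"] - df["confirmed_1_day_ago"]
df["3_day"] = df["confirmed"] - df["confirmed_3_days_ago"]

#3. Nearby: active cases in other states whose centers fall within a radius
def haversine_miles(lat1, lon1, lat2, lon2):
    #Great-circle distance between points, in miles
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def nearby_cases(row, radius_miles=200):
    dists = haversine_miles(row["lat"], row["lon"], df["lat"], df["lon"])
    mask = (dists <= radius_miles) & (df["state"] != row["state"])
    return df.loc[mask, "active"].sum()

df["nearby"] = df.apply(nearby_cases, axis=1)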
For the model, I used linear regression. I plan to write a longer post about linear regression someday. For now, if you are unfamiliar with it, think of it as trying to draw a line that most closely matches the data points you have with known responses. When you then add new data (the data you are trying to predict), the model places those points on the fitted line and returns the number predicted by the path of that line.
You can check out this Wikipedia article for more detailed info: https://en.wikipedia.org/wiki/Simple_linear_regression
In linear regression, the result (Y) is equal to an intercept term (B0) plus a regression coefficient created by the model multiplied by each variable: Y = B0 + B1*X1 + B2*X2 + … + Bn*Xn
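To make that concrete, here is a toy example with a single variable (the numbers are made up purely for illustration):

from sklearn.linear_model import LinearRegression
import numpy as np

#Made-up points that roughly follow the line y = 2x + 1
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

toy = LinearRegression().fit(x, y)
print(toy.intercept_)      #roughly 1
print(toy.coef_)           #roughly [2]
print(toy.predict([[6]]))  #roughly [13]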
Here is my model:
#Importing necessary modules
import pandas as pd
from sklearn.linear_model import LinearRegression

#Creating Y (target) variable and X (prediction) variables
Y = train_target
X = train_df[["nearby", "active", "1_day", "3_day"]]

#Creating and training regression model
regressor = LinearRegression().fit(X, Y)
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])

#Displaying model stats
display(coeff_df)
print("Intercept: " + str(regressor.intercept_))
print("R^2 value: " + str(round(regressor.score(X, Y), 2)))
The screenshot below shows the intercept and coefficients created for the model. It also shows the R-squared (R^2) value. I will talk about it more when I write about linear regression, but it is basically a measure of goodness of fit: R^2 measures the amount of the variation in the results that is explained by the model, and it is given as a value between 0 and 1.
The intercept is a baseline value. If all the X values (the variables like 1-day, 3-day, etc.) are at 0, then the resulting prediction will just be the intercept of -1.35. The coefficients mean that a 1-unit increase in a variable raises the prediction by that coefficient.
For example, the .45 for Active cases means that 100 Active cases in a state would raise the estimate by 45. Add that to the Intercept, and you get a prediction of roughly 43.65 with all other variables kept steady.
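To see the arithmetic, here it is by hand, using the intercept and Active coefficient from above with the other variables held at 0:

#Prediction by hand for a state with 100 active cases and all other variables at 0
intercept = -1.35
active_coef = 0.45
prediction = intercept + active_coef * 100
print(prediction)  #43.65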
The R^2 for this model is a very high .95, which means the model successfully explains almost all of the variation. However, this could be due to overfitting given the small number of data points (51). Also, running the model on data from a different date (this was from the 3/16/20 data) will produce entirely different results.
You can see this model at work in the code and following screenshot. In that next code block, I predicted results for the 3/16 data and then compared them to the actual, known results.
#Creating predictions for the known data
y_train_pred = regressor.predict(X)

#Table to compare the predicted results against the actual, known results
training_predictions = pd.DataFrame({"Predicted": y_train_pred, "Actual": Y})
display(training_predictions.head(10))
As you can see, the predictions are not quite right, but they are pretty close. Now that the model was trained on the known data, I was ready to predict the results for the unknown: how many new confirmed cases would be announced per state on 3/17.
#Creating the X (predictor) variables for the unknown (current) data
x_prediction = predict_df[["nearby", "active", "1_day", "3_day"]]

#Predicting and storing in a Y result variable
y_pred = regressor.predict(x_prediction)

#Initiating list for modifying the above results
adjusted_predictions = []

#For each item in the results list
for i in y_pred:
    #If it is less than or equal to 0, set it to 0
    if i <= 0:
        adjusted_predictions.append(0)
    #Otherwise, round it to the nearest integer (think: whole, non-decimal number)
    else:
        adjusted_predictions.append(int(i.round()))

#Put the adjusted results in a dataframe and display predictions for next-day new cases
predicted_by_state = pd.DataFrame({"State": predict_df["Province/State"], "Predicted new cases": adjusted_predictions})
display(predicted_by_state.head(10))
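As a side note, the loop above could be replaced by a single vectorized step with NumPy that does the same clip-at-zero and rounding:

#Equivalent one-liner: round, clip negatives to 0, and cast to integers
import numpy as np
adjusted_predictions = np.clip(np.round(y_pred), 0, None).astype(int)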
Here are the first 10 as shown in my IDE:
And the total results after exporting to Excel and formatting a bit for clarity:
Here at the end, I want to caution everyone again to take these results and predictions with a grain of salt. Many factors go into disease spread, and the 50 states have wildly different cultures and geographies that could contribute to or hinder the spread of COVID-19.
I could easily have missed relevant factors. I have high correlation between several of my factors. I did not do any kind of cross-validation or train/test split on my data. Those are just a few of the areas that could be improved.
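For example, a basic train/test split with scikit-learn would look something like this (a sketch of the improvement, not something I ran for these results):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#Hold out 20% of the states to check how the model does on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
holdout_model = LinearRegression().fit(X_train, y_train)
print("Train R^2: " + str(round(holdout_model.score(X_train, y_train), 2)))
print("Test R^2: " + str(round(holdout_model.score(X_test, y_test), 2)))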
Regardless, I had fun creating this, and I would like to see how it performs each day throughout this outbreak. For a first personal data science project, it was fun and instructive for me (I had not done many of these steps in Python before, since I had done most of my data science work in R and kept Python for programming).
I hope it can be somewhat fun and instructive for you too!
P.S. In the future, I may revisit this and try to create some predictive models that go beyond a single day.