As promised in the previous post, I have found some real-world datasets to test out our Birthday Problem predictions on! The more advanced visualizations will have to wait until another weekend when I have some time.
For this post, I will be looking at four datasets and checking how they compare to our statistics. Do the rather remarkable predictions of the Birthday Problem equation hold up when compared to actual data rather than to programmatically generated datasets?
Side note: like all of my posts, the source code can be found in my public GitHub repository at https://github.com/acthacker/birthday_problem
The four datasets I gathered were all sourced from Kaggle, a website for data science learning, competitions, and dataset sharing. In the future, I may try scraping my own data from the web for an extension of this project, but I will save that for some time when I want to write about web crawling. Keep an eye out also for projects where I will show how to interface with an API to gather data.
Here are the links to the data:
- https://www.kaggle.com/kendallgillies/nflstatistics
- https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset
- https://www.kaggle.com/drgilermo/nba-players-stats-20142015
- https://www.kaggle.com/fivethirtyeight/fivethirtyeight-congress-age-dataset
The first dataset is a collection of stats about NFL players that the author scraped from NFL.com. Since this data included every player from the start of the league until the time it was scraped (2016), it included many retired players that I had to remove to narrow the focus to current players. I also had to remove the small handful of players for whom no birthday was included.
The second dataset is a complete collection of worldwide soccer player information scraped from FIFA 20, so the data should be relatively accurate for current teams. Very little data cleaning was needed here. The third dataset also needed little cleaning. It was a collection of data about NBA teams in 2014-2015 and was relatively complete.
The final dataset required a little more manipulation than the others. It is a collection of information about each Congress starting in 1947 and running up to the present. It was collected and hosted by FiveThirtyEight as part of a project looking at the age of members of Congress. For this set, I looked only at the Senate, and I looked at the Democrats and Republicans separately. Therefore, I needed to filter out the House members and split the data along party lines. Doing so gave me datasets with an average of nearly 50 members in each group.
If you are interested in figuring out how many NBA teams have to plan big parties or how often senators should celebrate together, you can jump past the next section into the results. If you want to see how I got to those results, read the next section with the code breakdown and explanation!
As always, I am sure someone out there (or many someones) could have written more concise, quicker code. However, I found a way that worked for me. I will be sharing only the NFL code here rather than stepping through the same thing several times. For all four, I used that basic code structure with just some minor changes for the peculiarities of those individual datasets.
The basic approach was to start with data cleaning and making a list of the teams. At this point, I needed to convert the birthdays from full dates into just month/day info so we were not considering the year. Then, I looped through the teams and made a list of the birthdays on each one. With that list, I could check for duplicated birthdays and increase my counts. Finally, I called my code from the first post to calculate the statistical prediction and compared it to the actual observed percentage of teams with duplicated birthdays.
    #Importing in the function from the previous post
    from birthday_problem import factorial_equation
    #Importing in Pandas for dataframe management
    import pandas as pd

    #Read in data sets
    congress = pd.read_csv("congress_data.csv")
    nba = pd.read_csv("nba_data.csv")
    nfl = pd.read_csv("nfl_data.csv")
    fifa = pd.read_csv("fifa_data.csv")

    #Filter the dataset to include only active players and exclude retired
    nfl = nfl[nfl["Current Status"] == "Active"]
    #Cut out the small number with Null birthday data
    nfl = nfl[nfl["Birthday"].notnull()]
    #Find the team names by making a list of the unique values in the "Current Team" column
    nfl_teams = nfl["Current Team"].unique()

    #Converts birthdays to datetime element
    nfl["Birthday"] = pd.to_datetime(nfl["Birthday"])
    #Converts datetime back to string, but in a month/day only format without the year
    nfl["Birthday"] = nfl["Birthday"].dt.strftime("%m/%d")

    #Initialize match and count numbers
    matches = 0
    count = 0

    #Loop through every team
    for i in nfl_teams:
        #Filter to the current team
        curr = nfl[nfl["Current Team"] == i]
        #Make a list of the current birthdays
        birthdays = list(curr["Birthday"])
        #Check whether the length of the list matches the length of the set of values in the list
        #Unique values are only counted once in a set, so if any values are duplicated,
        #the set will be shorter than the list
        if len(birthdays) != len(set(birthdays)):
            #Increase our match counter by 1 if a duplicate is found
            matches += 1
        #Increase our team counter regardless
        #Could also be found by just using len(nfl_teams)
        count += 1

    #Find mathematically predicted percentage using function created in last post
    #Multiply result by 100 and round to show percentage in a way more friendly to most people
    nfl_pred_perc = round(factorial_equation(95) * 100, 2)
    #Find percentage of teams with a duplicated birthday by dividing the match count by the team count
    nfl_dupe_perc = round((matches / count) * 100, 2)

    #Print results
    print("NFL Predicted Percentage: " + str(nfl_pred_perc))
    print("NFL Duplicated Percentage: " + str(nfl_dupe_perc))
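For anyone who prefers a more compact version, the per-team duplicate check can also be sketched with a pandas groupby. This is only an illustration under my own assumptions, not the code from the repository: the helper name `percent_groups_with_shared_birthday` and the toy data are made up, and only the "Current Team" and "Birthday" column names come from the NFL code.

```python
import pandas as pd


def percent_groups_with_shared_birthday(df, group_col, birthday_col):
    """Return the percentage of groups containing at least one repeated birthday.

    Hypothetical helper for illustration; it expects birthdays already
    reduced to month/day strings such as "03/14".
    """
    # duplicated() marks every repeat of a value inside a group,
    # so any() is True when the group holds a shared birthday
    has_dupe = df.groupby(group_col)[birthday_col].apply(
        lambda s: s.duplicated().any()
    )
    # The mean of a boolean Series is the fraction of True values
    return round(float(has_dupe.mean()) * 100, 2)


# Tiny made-up example, not the real NFL data
toy = pd.DataFrame(
    {
        "Current Team": ["A", "A", "A", "B", "B", "C", "C"],
        "Birthday": ["01/02", "03/04", "01/02", "05/06", "07/08", "09/10", "09/10"],
    }
)

toy_result = percent_groups_with_shared_birthday(toy, "Current Team", "Birthday")
print("Toy Duplicated Percentage: " + str(toy_result))
```

In the toy data, teams A and C each contain a repeated birthday and team B does not, so two of the three groups count as matches.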
Ultimately, the observed results ended up really close to the predicted results!
NFL Predicted Percentage: 100.0
NFL Duplicated Percentage: 100.0
FIFA Predicted Percentage: 59.82
FIFA Duplicated Percentage: 62.32
NBA Predicted Percentage: 25.29
NBA Duplicated Percentage: 26.67
Congress Predicted Percentage: 97.04
Congress Duplicated Percentage: 100.0
For the NFL, the team sizes were around 95 each. Since the probability of a match already reaches 99.9% at a group size of 70, a team size of 95 is so close to 100% that the rounding function ended up spitting out a full 100%. Pre-rounding, it was predicted at 99.99985601708488%. Unsurprisingly, NFL players should all plan to deal with birthday scheduling conflicts among their teammates at least once in the year.
For FIFA, I based the prediction on an average team size of 26, and for the NBA I used the average roster size of 15. Of course, some teams were smaller and some were larger in each of these datasets, so the observed percentages were never going to land exactly on the prediction for the average size. However, it still worked out, with both coming within just a few percentage points of the predicted probabilities.
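To make the predictions easier to check, here is a minimal sketch of the standard Birthday Problem calculation. It is a stand-in for the `factorial_equation` function from the previous post, not a copy of it; the name `birthday_probability`, the 365-day assumption, and the example roster sizes in `sizes` are my own choices for illustration.

```python
def birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes every day is equally likely and ignores leap years.
    """
    if n > days:
        # More people than days guarantees a shared birthday
        return 1.0
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


# Group sizes used for the predictions in this post
for label, size in [("NBA", 15), ("FIFA", 26), ("Congress", 50), ("NFL", 95)]:
    print(label + " Predicted Percentage: " + str(round(birthday_probability(size) * 100, 2)))

# Hypothetical roster sizes that average 26, to show that the average of the
# per-team probabilities need not equal the probability at the average size
sizes = [22, 26, 30]
mean_of_probs = sum(birthday_probability(s) for s in sizes) / len(sizes)
print("Mean over varied sizes: " + str(round(mean_of_probs * 100, 2)))
print("At the average size: " + str(round(birthday_probability(26) * 100, 2)))
```

Multiplying the chance that each new person avoids the birthdays already taken sidesteps the very large factorials in the textbook formula, which is why I wrote the sketch as a running product.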
For Congress, I ended up finding a duplicated birthday in every single group of Republicans and Democrats from all 33 incoming congresses measured in the dataset. With a 97% probability for each group, that is within the realm of likelihood: if the 66 groups were independent trials, there would be a 0.97^66 = 13.4% chance of achieving this result at random. However, the groups are not independent random samples, since some senators serve for a long, long time (Senator Byrd of WV, for example). Because senators serve more than one term, the incoming congresses every 2 years are not exactly randomly chosen. Also, the structure of the Senate means that only about a third of the seats are up for election with each incoming congress, so consecutive groups share most of their members.
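The arithmetic behind that figure is short enough to check directly. This sketch treats the 66 groups as independent trials, which is a simplifying assumption that the overlap between consecutive congresses breaks; the variable names are my own.

```python
# Rounded per-group probability of a shared birthday among about 50 senators
p_single_group = 0.97
# 33 congresses, each split into Democrats and Republicans
groups = 33 * 2

# Chance that every group has a shared birthday, if the groups were independent
p_all_groups = p_single_group ** groups
print("Chance of a match in all " + str(groups) + " groups: " + str(round(p_all_groups * 100, 1)) + "%")
```

Because the real groups share most of their members from one congress to the next, a match in one group makes a match in the next far more likely, so the true chance of a clean sweep is plausibly higher than this independent-trials estimate.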
In any case, even though these datasets were not necessarily entirely random, the numbers in the real world do follow the math! I think the Birthday Problem is so fascinating, and it is quite incredible that it proves so accurate.