The Birthday Problem with Real-world data (Birthday Problem Pt. 2)

Written by Chase Thacker

I am an adoptive father of two boys. My wife and I love West Virginia and hope to share that love of the Appalachian region with others. I do my small part by blogging about data science and Appalachian topics. For fun, I like to play hobbyist board games, read tons of books (particularly sci-fi and fantasy), and pretend to know what I am doing in my woodshop.



Data Science | Statistics



May 3, 2020

As promised in the previous post, I have found some real-world datasets to test out our Birthday Problem predictions on! The more advanced visualizations will have to wait until another weekend when I have some time.

For this post, I will be looking at four datasets and checking how they compare to our statistics. Do the rather remarkable predictions of the Birthday Problem equation hold up when compared to actual data rather than to programmatically generated datasets?

Side note: like all of my posts, the source code can be found in my public GitHub repository at https://github.com/acthacker/birthday_problem

The four datasets I gathered were all sourced from Kaggle, a website for data science learning, competitions, and dataset sharing. In the future, I may try scraping my own data from the web for an extension of this project, but I will save that for some time when I want to write about web crawling. Keep an eye out also for projects where I will show how to interface with an API to gather data.

Here are the links to the data:

The first dataset is a collection of stats about NFL players that the author scraped from NFL.com. Since this data included every player from the start of the league until the time it was scraped (2016), it included many retired players that I had to remove to narrow the focus to current players. I also had to remove the small handful of players for whom no birthday was included.

The second dataset is a complete collection of worldwide soccer player information scraped from FIFA 20, so the data should be relatively accurate for current teams. Very little data cleaning was needed here. The third dataset also needed little cleaning. It was a collection of data about NBA teams in 2014-2015 and was relatively complete.

The final dataset required a little more manipulation than the others. It is a collection of information about each Congress starting in 1947 and running up to the present. It was collected and hosted by FiveThirtyEight as part of a project they have looking at the age of congressmen. For this set, I looked only at the Senate, and I looked at the Democrats and Republicans separately. Therefore, I needed to filter out the House members and split the data on party lines. Doing so gave me datasets with an average of nearly 50 members in each group.

If you are interested in figuring out how many NBA teams have to plan big parties or how often senators should celebrate together, you can jump past the next section into the results. If you want to see how I got to those results, read the next section with the code breakdown and explanation!

As always, I am sure someone out there (or many someones) could have written more concise, quicker code. However, I found a way that worked for me. I will be sharing only the NFL code here rather than stepping through the same thing several times. For all four, I used that basic code structure with just some minor changes for the peculiarities of those individual datasets.

The basic approach was to start with data cleaning and making a list of the teams. At this point, I needed to convert the birthdays from full dates into just month/day info so we were not considering year. Then, I looped through the teams and made lists of the birthdays. With each list, I could then check for duplicated birthdays and increase my counts. Finally, I called my code from the first post to calculate the statistical prediction and compared it to the actual observed percentage of teams with duplicated birthdays.

#Importing in the function from the previous post
from birthday_problem import factorial_equation
#Importing in Pandas for dataframe management
import pandas as pd

#Read in data sets
congress = pd.read_csv("congress_data.csv")
nba = pd.read_csv("nba_data.csv")
nfl = pd.read_csv("nfl_data.csv")
fifa = pd.read_csv("fifa_data.csv")

#Filter the dataset to include only active players and exclude retired
nfl = nfl[nfl["Current Status"]=="Active"]
#Cut out the small number with Null birthday data
nfl = nfl[nfl["Birthday"].notnull()]

#Find the team names by making a list of the unique values in the "Current Team" column
nfl_teams = nfl["Current Team"].unique()

#Converts birthdays to datetime element
nfl["Birthday"] = pd.to_datetime(nfl["Birthday"])
#Converts datetime back to string--but in a month/day only format without the year
nfl["Birthday"] = nfl["Birthday"].apply(lambda x: x.strftime("%m/%d"))

#Initialize match and count numbers
matches = 0
count = 0
#Loop through every team
for i in nfl_teams:
    #Filter to the current team
    curr = nfl[nfl["Current Team"] == i]
    #Make a list of the current birthdays
    birthdays = list(curr["Birthday"])
    #Check whether the length of the list matches the length of the set of values in the list
    #A set keeps each unique value only once
    #If any birthday is duplicated, the set will be shorter than the list
    if len(birthdays) != len(set(birthdays)):
        #Increase our match counter by 1 if a duplicate is found
        matches +=1
    #Increase our team count counter regardless
    #Could also be found by just using len(nfl_teams)
    count += 1

#Find mathematically predicted percentage using function created in last post
#Multiply result by 100 and round to show percentage in a way more friendly to most people
nfl_pred_perc = round(factorial_equation(95)*100, 2)
#Find percentage of teams with a duplicated birthday by dividing the match by the team count
nfl_dupe_perc = round((matches / count) * 100, 2) 

#Print results
print("NFL Predicted Percentage: " + str(nfl_pred_perc))
print("NFL Duplicated Percentage: " + str(nfl_dupe_perc))
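
The set-length check above only tells us whether a team has a shared birthday, not which dates are shared. As a rough sketch (not part of my original code), the same idea can be tried on a toy roster using only the standard library. The function name `shared_birthdays`, the toy roster, and the `%m/%d/%Y` date format are all assumptions for illustration; the real dataset may store dates differently.

```python
from collections import Counter
from datetime import datetime


def shared_birthdays(dates, date_format="%m/%d/%Y"):
    """Return a dict of month/day strings that appear more than once.

    The date_format default is an assumption for this toy example.
    """
    # Drop the year so only month and day are compared
    month_days = [
        datetime.strptime(d, date_format).strftime("%m/%d") for d in dates
    ]
    counts = Counter(month_days)
    # Keep only the dates shared by at least two people
    return {day: n for day, n in counts.items() if n > 1}


# Hypothetical roster, not taken from any of the four datasets
toy_roster = ["5/3/1990", "11/20/1988", "5/3/1994", "1/7/1992"]
print(shared_birthdays(toy_roster))
```

An empty dictionary means no shared birthdays, so `bool(shared_birthdays(roster))` plays the same role as the length comparison in the loop.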

Ultimately, the observed results ended up really close to the predicted results!

NFL Predicted Percentage: 100.0
NFL Duplicated Percentage: 100.0

FIFA Predicted Percentage: 59.82
FIFA Duplicated Percentage: 62.32

NBA Predicted Percentage: 25.29
NBA Duplicated Percentage: 26.67

Congress Predicted Percentage: 97.04
Congress Duplicated Percentage: 100.0

For the NFL, the team sizes were around 95 each. Since the probability of a match already reaches 99.9% at a group size of 70, a team size of 95 is so close to 100% that the rounding function ended up spitting out a full 100%. Pre-rounding, it was predicted at 99.99985601708488%. Unsurprisingly, NFL players should all plan to deal with birthday scheduling conflicts among their teammates at least once in the year.
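
For anyone who does not have the first post handy, here is a minimal sketch of the kind of calculation behind these predictions. It is not the `factorial_equation` function from my repository; the name `shared_birthday_probability` is made up for this example, and it assumes 365 equally likely birthdays with no leap days. It multiplies the chances of each new person missing every earlier birthday instead of using factorials directly, which avoids very large intermediate numbers.

```python
from math import prod


def shared_birthday_probability(group_size, days=365):
    """Probability that at least two people in the group share a birthday.

    Assumes every day is equally likely and ignores leap days.
    """
    if group_size > days:
        # More people than days guarantees a match
        return 1.0
    # Chance that each successive person avoids all earlier birthdays
    p_all_distinct = prod((days - k) / days for k in range(group_size))
    return 1.0 - p_all_distinct


# Group sizes used in this post
for n in (15, 26, 50, 70, 95):
    print(n, round(shared_birthday_probability(n) * 100, 2))
```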

For FIFA, I predicted the statistics based on an average team size of 26, and I predicted the NBA on the average roster size of 15. Of course, some teams were smaller and some were larger in each of these datasets, so the observed percentages were never going to land exactly on the predicted probability. However, it still worked out, with both being within just a few percentage points of the predictions.
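
Because the probability does not grow in a straight line with team size, plugging in the average team size is only an approximation. A slightly more careful prediction would average the probability over each team's actual size. The sketch below shows that idea; it is not something I ran on the real data, the roster sizes are made up, and the function names are hypothetical.

```python
from math import prod
from statistics import mean


def shared_birthday_probability(group_size, days=365):
    """Probability of at least one shared birthday (365 equally likely days)."""
    if group_size > days:
        return 1.0
    return 1.0 - prod((days - k) / days for k in range(group_size))


def expected_match_share(team_sizes):
    """Expected share of teams with a shared birthday, given each team's size."""
    return mean(shared_birthday_probability(n) for n in team_sizes)


# Hypothetical roster sizes that average 26, not taken from the FIFA data
hypothetical_sizes = [22, 24, 26, 28, 30]
print("Using the average size:", shared_birthday_probability(26))
print("Averaging per-team probabilities:", expected_match_share(hypothetical_sizes))
```

With sizes spread around 26, the per-team average comes out a little lower than the single prediction at 26, so some gap between predicted and observed numbers is expected even before random chance comes into play.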

For Congress, I found a duplicated birthday in every single group of Republicans and Democrats from all 33 incoming congresses measured in the dataset. With a 97% probability for each group, that is within the realm of likelihood (0.97^66 ≈ 13.4% chance of seeing this result across 66 independent trials). However, the groups are probably not independent of one another, since some senators serve for a long, long time (Senator Byrd of WV, for example). Because senators serve more than one term, the incoming congresses every 2 years are not exactly randomly chosen. Also, the structure of the Senate means that only about 1/3 of the seats are up for election with each incoming congress.
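
The 13.4% figure is quick to reproduce. This small sketch treats the 66 groups (two parties across 33 congresses) as independent trials, which is exactly the assumption that overlapping Senate membership calls into question. The variable names are just for illustration.

```python
# Rounded per-group prediction for a group of about 50 senators
per_group_probability = 0.97
# Two parties across 33 incoming congresses
number_of_groups = 33 * 2

# Chance that every group has a shared birthday, if the groups were independent
chance_all_match = per_group_probability ** number_of_groups
print(f"{chance_all_match:.1%}")  # prints 13.4%
```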

In any case, even though these datasets were not necessarily entirely random, the numbers in the real world do follow the math! I think the Birthday Problem is so fascinating, and it is quite incredible that it proves so accurate.
