My first sequence of posts was all about COVID-19 in the early days of the outbreak in the US. Now, I want to turn towards something a little more fun–the Birthday Problem! This classic stats problem is also known as the “Birthday Paradox” because its conclusions run counter to so many people’s intuition.
This post is the first example of another type of post I hope to have, the “exploration of a concept in math/stats/programming” post. In these posts, I will take the problem, explain it, and provide some code to explore it. On down the line, I definitely hope to get into some Appalachian data (since this is an Appalachia-themed data blog). If you are looking for those posts, please be patient! I will get there soon.
Just like in my last series, I am open-sourcing all of my code, and it can be accessed through this link: https://github.com/acthacker/birthday_problem
Here is the classic question driving the Birthday Problem: “How many randomly-selected people have to be in a room to ensure at least a 50% probability that two (or more) of them share a birthday?” Of course, there are constraints on the problem. Twins and leap years are ignored. Seasonality in births is ignored (more people tend to be born in summer months), and a standard distribution is assumed. The birth date is all that is considered rather than considering year of birth.
How many do you think? 10, 50, 100, 200? Many people would likely come up with 123 (365/2). However, the correct answer is a measly 23! With 23 randomly-selected people, the probability of at least two sharing the same birthday is slightly over 50%. We will calculate the exact percentage later.
What leads to this seeming paradox? Most people, when they hear this question, think of comparing themselves against other people in the room. They forget the number of comparisons that happen among the other people.
With just two people in the room (A and B), only one comparison exists A<=>B. Add in just two more people, and the number quickly grows to six:
- 1. A<=>B
- 2. A<=>C
- 3. A<=>D
- 4. B<=>C
- 5. B<=>D
- C<=>D
Those comparisons increase by n – 1 for every additional person. For example, our magic number of 23 would have 22 + 21 + … +2 +1 total comparisons–253 total comparisons. You can also arrive at this number by counting pairs (23*22)/2 = 253.
Okay, so now we know there are way more comparisons than most people consider. How do we get from this number of comparisons to the actual probability of sharing birthdays for any number of people?
The easiest approach is not to look at the probability of having a match. Rather, the easiest approach is the look at the probability of no match existing. Once we know that probability, we can easily find the probability of a match existing.
- P(s) = probability of successful match
- P(u) = probability of unsuccessful match
Since these probabilities must add up to 1:
- P(s) + P(u) = 1P(s) = 1 – P(u)
All we have to do then is find the probability of no successful matches existing and subtract that number from 1 to get our probability of a successful match. I am going to walk through two different ways and give some Python code calculating each.
Method 1: Simple exponentiation
For this method, we will need the number of pairs as calculated previously. We will iterate through each pair and treat it as an individual event (like a coin flip). For each event, we will calculate the probability of that pair not being a match. Then, we will combine those probabilities to determine what the probability of none of the pairs containing a match is.
When comparing birthdays, the probability of no match occurring is 364/365; the second person could have been born on any of the days of the year other than the day of the first person.
- 364/365 = 0.99726
For no matches to have occurred, we have to hit that 0.99726 chance for every single pair in the room. That means multiplying that chance by itself for every pair. The formula (where n = number of people in the group) is:
- 0.99762^(n*(n-1) / 2 ) or in the case of our running example:
- 0.99762^253 = 0.49949
Plugging that into our other formula:
- P(s) = 1 = 0.49949Probability of successful match when 23 people are in the room = 0.50015 or roughly 50%
Here is some code to show this equation in Python:
#Function representing exponentiation approximation formula def exponentiation(num_people, num_days=365): #Probability of no match for each pair probability_of_no_match = (num_days - 1) / num_days #Calculate number of pairs num_pairs = num_people * (num_people - 1) / 2 #Determine overall probability of no match occurring by raising probability of no match to the number of pairs overall_probability_of_no_match = probability_of_no_match**num_pairs #Subtract probability from 1 to find probability of match occurring probability_of_match = 1 - overall_probability_of_no_match #Return result when functino is called return probability_of_match #Calls function with 23 random people print(exponentiation(23)) #Print output: 0.5004771540365807
As you can see, the code gave roughly the same result with some differences due to rounding in intermediate steps.
Method 2: Using factorials
Method 2 is a little more complex than method 1, but it does give a slightly different perspective in what is going on. Like Method 1, it relies on finding the probability of no matches occurring and then subtracting that amount from 100%.
This method starts by looking at the probability that each additional person does not match any previous person’s birthday. For the first person, no dates are taken so their probability is 365/365 or 100% chance that they do not share a birthday (with the non-existent people in the room). For the second person, it is 364/365 because they first person has already claimed one birthday. For the third, it is 363/365. This pattern continues through for every person in the group.
Next, you take all these probabilities and multiply them by each other. 365/365 * 364/365 * 363/365 and so on.
Of course, as the number of participants gets large, the multiplication would get long and messy. Luckily, an easy simplification exists! We can use factorials.
A factorial (represented by a “!” after the integer) is the multiplication of the integer by the next smallest integer repeated until it hits 1. For example, 4! = 4*3*2*1 = 24
Where do we see factorials in our problem above? We see it in the numerator. In the numerator, you end up with 365 * 364 * 363 … which is just 365! limited by the n number of participants.
Since it is limited to n, we should make sure the sequence stops after that n number of iterations. That can happen by dividing 365! by (365 – n)! because adding that term causes cancellation of all multipliers below 365 – n. For our test case of 23, that means 365!/342!.
At this point, we have our numerator figured out (365!/342!) so we just need to figure out the denominator. That is actually pretty easy since all the initial fractions have the same denominator, 365. The denominator of our function will be 365^n or in our test case 365^23.
At the end, that means our total function for the probability of no match occurring is:
- (365!/342!)/365^23 = .4927
Which means our probability of a match is:
- P(s) = 1 – .4927
- P(s) = .5073
**If you pay attention, you can see slightly different results between the two methods. I forgot to mention that Method 1 is simply an approximation for simpler and quicker calculation. Method 2 is a complete, more accurate equation**
We can take this method a little further and generalize it out to work with any number of people in the group and even any number of days selected. If:
n = number of randomly-selected participantsd = number of days
Our equation is: (d! / (d – n)!) / d^n = P(u)
Or the more commonly-used style (post algebraic restructuring): P(s) = 1 – (d! / ((d – n)! * d^n))
Here is this equation coded out in Python also:
#Importing the math module for factorial function import math #Create function to calculate using factorial equation def factorial_equation(num_people, num_days = 365): #Compute nummerator numerator = math.factorial(num_days) #Compute denominator denominator = math.factorial(num_days - num_people) * num_days**num_people #Calculate probability of no match probability_no_match = numerator / denominator #Calculate probability of match occurring probability_match = 1 - probability_no_match #Return probability of match whenever function is called return probability_match #Calls and prints result when function is given 23 as the number of people print(factorial_equation(23)) #Print output: 0.5072972343239854
The result again is roughly 50%
Naturally, I can throw numbers out all day, but do they hold up? In a future post, I plan to look at some real-world datasets, but we can just code some random sets today and check our math.
In the following example, I generated 10,000 sets of random birthdays for n number of people in each set. Then, I checked for duplicates within each set and reported back the percentage of sets with duplicates. This process should emulate real-world data pretty well. So, do the results match the predictions of the math?
#Import necessary functions from random import seed from random import randint #Since computers only work in pseudo-random numbers, we can set a seed to have replicable results #42 is my preferred number (see The Hitchhiker's Guide to the Galaxy for context) seed(42) #Function to build 10,000 random sets and calculate percentage that have a duplicated birthday def perc_matches(num_people, num_days = 365, num_iterations = 10000): #This creates a list of lists that is num_iterations long #Each list contains num_people number of randomm numbers with num_days as the amount of integers to chose from random_selections = [[randint(0,num_days) for i in range(0,num_people)] for x in range(0,num_iterations)] #Initiates count of matches matches = 0 #For each list in the list of lists for i in random_selections: #If the lenght of the list is different than the length of the set of items in the list #the list must have duplicated items. if len(i) != len(set(i)): #When duplicated items appear in the list, the number of matches increases by 1 matches += 1 #Number of lists with duplicated items is divided by number of lists to give percentage of list with duplicated birthdays perc = matches / num_iterations #Returns computed percentage return perc #Prints result of function call when 23 people are given to the function. print(perc_matches(23)) #Print output: 0.5131
Out of my 10,000 random sets, 51.31% contained a match. That is really close to the 50.73% predicted by the factorial equation. Since we are dealing with probabilities, we had no guarantee of getting the exact amount. As the number of trials approaches infinity, we should see the percentage with matches approach the mathematical probability of 50.73%.
Okay, so now we know that it is reasonably accurate for 23. What about other numbers of people in each trial? What is the probability that there is a match when there are more or less people in our random collection of people?
In the following code, I calculated the mathematical predication for each integer from 1-70 and then calculated the percentage of trials with duplicated birthdays. Then, I exported the results to CSV so you can see the results in a table.
#Import Pandas to work with dataframes import pandas as pd #Initiate dataframe with column names df = pd.DataFrame(columns = ["Number of People", "Exponentiation Approximation", "Factorial Equation", "Coded Output"]) #Loop to feed teh functions every number from 1-70 #70 is where the probability hits 99.9%, and computing much higher makes this code run very slowly for i in range(1,71): #Calls the functions for each number and adds the results to the dataframe df.loc[i] = [i, exponentiation(i), factorial_equation(i), perc_matches(i)] #Exports the dataframe as a CSV for use in other code and visualizations in future posts. :) df.to_csv("birthday_problem.csv", index=False)
*I cut the runs of the program off at 70 because that is where predictions hit 99.9% probability. Running for longer can cause very extended run times for the program.
The shard-eyed among you may have noticed that the coded output for 23 is not the same as the first time we ran the function. That is because a different set of random numbers was generated the second time the function was called.
If you look at the chart, the coded results from our randomly-selected groups lines up so perfectly with the mathematical predictions that you can barely see the distinctions in the lines. Our math is validated!
I chose to write this post because I think the Birthday Problem/Paradox is one of the most interesting quirks in stats. I know it is important to the computer science and cryptography folks because of some implications it has for computing. For me in my data work, it mostly serves to inspire me. It reveals how math and stats and challenge our expectations. It shows how following the data can lead to greater insights.
Stay tuned for my next post where I will find some real-world datasets to test this problem on and, hopefully, produce some better, interactive visualizations.