I see a lot of confusion surrounding the normal distribution. People expect it to appear in places where it shouldn’t, or they fail to understand why statisticians assume it in so many models. Part of me wants to blame this on the name, but I’m not a Sapir-Whorf guy. Rather, I think that few professors and textbooks explain it in an adequate way, and most people don’t care enough about the subject to delve into it on their own time. In my experience, statistics courses often emphasize tables and plugging numbers into formulas. This leads to some short-term understanding of the material but probably causes more confusion in the long run.
In an earlier article, I explained the statistical significance formula. I recommend reading that, but it still doesn’t explain why we bother with normal distributions in the first place. I will try to offer that explanation here. I will also provide all the relevant code, so you can follow along in your favorite Python environment if you’d like. Let’s start with a roulette wheel.
Uniform Distribution
A Las Vegas roulette wheel contains 38 numbers: 0-36, plus a “00.” To keep everything as a unique integer, I will treat the “00” as a “-1” in this article. Imagine that I walk into a casino and mark every outcome on a roulette wheel. I sit there for hours (if not days) and record 100,000 games of roulette. What will that distribution look like?
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import random
import seaborn as sns
# Setting up list of roulette values
roulette_values = [x for x in range(-1,36+1)]
roulette_dict = {x:0 for x in range(-1,36+1)}
# Playing 100,000 roulette games
random.seed(0)
for x in range(0, 100_000):
    roulette_dict[random.sample(roulette_values, k=1)[0]] += 1
    if x % 10_000 == 0:
        print('completed step', x)
figure(figsize=(12, 6), dpi=80)
temp_df = pd.DataFrame.from_dict(roulette_dict, orient='index', columns=['times_landed_on'])
temp_df.reset_index(inplace=True)
temp_df.rename(columns={'index': 'wheel_number'}, inplace=True)
sns.barplot(data=temp_df, x='wheel_number', y='times_landed_on')
This is not the beautiful bell curve that everyone remembers from their stats classes. Rather, the roulette wheel produces a “uniform” distribution. This is a fancy statistics term for a distribution in which every outcome is equally likely. On a given spin, you’re as likely to see a 36 as you are a 3 or a 14. Let me repeat: this is not a normal distribution, and we shouldn’t expect to see one in this case.
Imagine that someone asks you the following question. They observed a roulette wheel for 500 spins and saw that the average score was 20. That’s above the expected average of 17.5 (again, we’re treating “00” as “-1”). What are the odds of this occurring by random chance? This is your standard “is ___ statistically significant” question. Let’s not answer in the standard way, though. Ignore that “z-table” in the back of your statistics textbook. Pretend you’ve never heard of a normal distribution. How would you answer this?
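That 17.5 figure, by the way, falls straight out of the list we built earlier. Here’s a one-line check, reusing the roulette_values list from the setup code:
# Expected value of a single spin, with "00" coded as -1
print(np.mean(roulette_values))  # 17.5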
We want to see if 20 is a weird result. We know it’s above the mean, but we don’t know if a result that far above the mean is to be expected. We can figure this out, however, without resorting to any formulas or tables. We’re going to head back to that roulette table. This time, though, we don’t need to know how the game itself is distributed. We figured that part out: it’s uniform. Instead, we need to model the average score of a bunch of roulette games. To create this model, we’re going to watch 30 rounds at a time. After we watch those 30 rounds, we’ll average the 30 scores and record this average. Next, we’ll watch another 30 rounds and record the average of those 30 games. We’ll do this 100,000 times, giving us 100,000 averages across 3,000,000 total spins of the wheel.
random.seed(0)
result_li = []
for x in range(0, 100_000):
    result_li.append(np.mean(random.sample(roulette_values, k=30)))
sns.histplot(result_li, kde=True)  # distplot is deprecated in newer seaborn; histplot(kde=True) replaces it
This might look a bit more familiar. Most of our 30-round sets averaged a score close to 17.5. A smaller number saw something in the 15-16 range or the 19-20 range. An extremely small number of sets had an average score under 15 or over 20. Since we’re working with a real (albeit simulated) dataset, we can report exact values. Remember, we have 100,000 average scores. The average of those averages is 17.5, which is also, not coincidentally, the expected value of a single game of roulette. These sets have a standard deviation of 0.92. Roughly 16% of values are more than one standard deviation above the mean and roughly 16% are more than one standard deviation below it. Roughly 2.5% of values are more than two standard deviations above the mean and roughly 2.5% are more than two standard deviations below it. In other words, about 32% of values sit more than one standard deviation from the mean and about 5% rest more than two standard deviations away from the mean. I’m not making theoretical claims here! This is just what the data says.
Let’s return to that question from earlier. Someone saw a roulette wheel average a score of 20 across a large number of games. Could this be statistical noise? Or does this indicate a roulette wheel that’s biased toward higher numbers? With a mean of 17.5 and a standard deviation of 0.92, a value of 20 is more than two standard deviations from the mean ((20 − 17.5) / 0.92 ≈ 2.7, to be exact). Earlier, we saw that a value more than two standard deviations from the mean only occurred around 5% of the time: 2.5% on the high end (more than two standard deviations above the mean) and 2.5% on the low end (more than two standard deviations below the mean). In other words, a value as extreme as 20 has less than a 5% probability of occurring by random chance. There’s probably something wrong with that wheel. Alternatively, we can just look at the graph. Does a value of 20 look like an expected result? The curve looks pretty short there, so probably not. Contrast that to a value of 18. That’s not the most likely result, but it occurs enough that it shouldn’t raise any eyebrows.
Here’s how I got the values mentioned above:
dist_mean, dist_std = np.mean(result_li), np.std(result_li)
pct_below_1 = len([i for i in result_li if i < dist_mean - dist_std]) / len(result_li)
pct_above_1 = len([i for i in result_li if i > dist_mean + dist_std]) / len(result_li)
pct_above_2 = len([i for i in result_li if i > dist_mean + dist_std * 2]) / len(result_li)
pct_below_2 = len([i for i in result_li if i < dist_mean - dist_std * 2]) / len(result_li)
print(dist_mean, dist_std, pct_below_1, pct_above_1, pct_above_2, pct_below_2)
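We can also skip the standard-deviation machinery entirely and interrogate the data directly. Here’s a minimal sketch, reusing the result_li list from above, that asks how often a 30-round average actually reached 20:
# Direct empirical check: what fraction of our 100,000 simulated averages hit 20 or higher?
pct_at_least_20 = len([i for i in result_li if i >= 20]) / len(result_li)
print(pct_at_least_20)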
Negative Binomial Distribution
Now, let’s play a slightly different game. Instead of playing a fixed number of spins, we’ll play the roulette wheel until we hit a value of 28 or higher. Then, we’ll jot down how many attempts it took us to see this result. For example, say we get the following results: 4, 8, 25, 15, 12, 31. We’d write down “6,” since it took us 6 spins to see a result of 28 or higher. We will play 100,000 such games.
def play_roulette_game():
    plays = 0
    hit_28_plus = False
    while not hit_28_plus:
        my_sample = random.sample(roulette_values, k=1)[0]
        if my_sample >= 28:
            hit_28_plus = True
        plays += 1
    return plays
random.seed(0)
roulette_play_list = []
for x in range(0, 100_000):
    roulette_play_list.append(play_roulette_game())
sns.histplot(roulette_play_list, kde=True)
We call this a negative binomial distribution (this stop-after-the-first-success case is also known as a geometric distribution), but I encourage you to ignore the fancy stuff for now. Is this what you expected? If not, think about how this game works. Nine of the 38 numbers are 28 or higher, so you have a roughly 1-in-4 shot of hitting one on a given spin. About a quarter of our games will therefore end on the first play. You’ll need about four spins, on average, to get one of the magic numbers. On occasion, though, you can hit an unlucky streak and need 10 or more spins to get the right result.
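If you want to check that intuition against the simulation, here’s a quick sketch reusing the roulette_play_list from above:
# 9 of the 38 numbers (28 through 36) end the game,
# so the theoretical average number of spins is 38/9
print('theoretical average spins:', 38 / 9)  # about 4.22
print('simulated average spins:', np.mean(roulette_play_list))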
Imagine that someone else played this same game 500 times. He tells you that, on average, it took him 4.5 tries to get a 28 or higher. Should we expect that, or is the wheel rigged? To answer this, we’ll again throw away our textbooks and Z-tables. Instead, let’s produce another distribution of averages. We’ll play 100 games and record the average number of spins needed across those 100 games. Then we’ll play another 100 and record the average across those games. We’ll do that 10,000 times and see what the distribution of averages looks like.
def play_100_roulette_games():
    result_li = []
    for x in range(100):
        result_li.append(play_roulette_game())
    return np.mean(result_li)
random.seed(0)
average_games_taken_li = []
for x in range(0, 10_000):
    average_games_taken_li.append(play_100_roulette_games())
sns.histplot(average_games_taken_li, kde=True)
Huh, that’s a familiar shape. We can take the standard deviation and mean, and we find that…
game_mean, game_std = np.mean(average_games_taken_li), np.std(average_games_taken_li)
print(len([i for i in average_games_taken_li if i < game_mean - game_std]) / len(average_games_taken_li))
print(len([i for i in average_games_taken_li if i < game_mean - 2 * game_std]) / len(average_games_taken_li))
print(len([i for i in average_games_taken_li if i > game_mean + game_std]) / len(average_games_taken_li))
print(len([i for i in average_games_taken_li if i > game_mean + 2 * game_std]) / len(average_games_taken_li))
About 32% are more than one standard deviation above or below the mean, and roughly 5% are more than two. That 4.5 average that our friend saw? Pretty close to the middle of the distribution, so it doesn’t indicate any wonky behavior from the wheel. If he averaged 6 tries, on the other hand, we might assume that the wheel is biased against numbers 28 and above.
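To put rough numbers on those two scenarios, here’s a short sketch, reusing game_mean and game_std from above, that measures how far each hypothetical average sits from the simulated mean:
# How many standard deviations from the mean are averages of 4.5 and 6?
for observed in (4.5, 6):
    z = (observed - game_mean) / game_std
    print(observed, 'is', round(z, 1), 'standard deviations from the mean')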
Normal Distribution
In the first game, we saw a uniform distribution: all outcomes were equally likely. In the second, we saw a negative binomial distribution: small outcomes were most likely, but larger outcomes still occurred. Each game produced a very different dataset. Yet, when taking the average result of repeated plays, the results looked remarkably similar. Here’s what they look like together:
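If you’d like to reproduce that comparison yourself, one approach (a sketch reusing the lists and summary statistics computed above) is to standardize each set of averages so the two distributions share an x-axis:
# Subtract each distribution's mean and divide by its standard deviation,
# then overlay the two standardized sets of averages
standardized_spins = [(i - dist_mean) / dist_std for i in result_li]
standardized_games = [(i - game_mean) / game_std for i in average_games_taken_li]
figure(figsize=(12, 6), dpi=80)
sns.kdeplot(standardized_spins, label='30-spin averages (uniform game)')
sns.kdeplot(standardized_games, label='100-game averages (negative binomial game)')
plt.legend()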
Both average-of-multiple-plays distributions saw about 68% of values fall within one standard deviation of the mean and about 95% fall within two standard deviations of the mean. They were both, in other words, normal distributions.
That’s where the normal distribution comes from. We don’t need to assume that the data itself derives from a normal distribution. After all, the first dataset came from a uniform one and the second came from a negative binomial one. It’s simply the case that the average of repeated plays follows a normal distribution. This result doesn’t depend on physics. It doesn’t depend on the construction of a roulette wheel, real or simulated. It doesn’t even depend on what game we play with the roulette wheel. Spin three balls on the board and add their scores together. Play a rigged roulette wheel where even numbers have twice the probability of odd ones. Count the number of times you land on the numbers 4, 7, and 17 before you land on the 29 twice; then take a blackjack card and multiply your score by that number. It doesn’t matter. If you average the results of repeated plays, those averages will follow a normal distribution. Statisticians call this fact the central limit theorem: average enough independent plays of just about any game (anything whose payoffs have a finite variance), and the averages come out normal.
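You don’t have to take my word on those variants; they’re a few lines of code each. Here, for instance, is a sketch of the rigged wheel mentioned above, where even numbers are twice as likely as odd ones (random.choices draws with replacement, using whatever weights we hand it):
# A rigged wheel: even numbers get twice the weight of odd ones
rigged_weights = [2 if x % 2 == 0 else 1 for x in roulette_values]
random.seed(0)
rigged_averages = [np.mean(random.choices(roulette_values, weights=rigged_weights, k=30))
                   for _ in range(100_000)]
sns.histplot(rigged_averages, kde=True)  # still a bell curve, centered on the rigged wheel's own mean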
It’s worth considering when we expect to see normal distributions. Why do pickle lengths follow a normal distribution? There are a bunch of random genetic events that determine length, and each pickle received a random result for each event. When you add up (or average) a bunch of independent random effects, you get a normal distribution. Other times, however, we don’t expect a normal distribution. Consider my company’s website. One can think of each individual user as browsing the site until they lose interest. This represents a negative binomial distribution model, where “losing interest” fills the same role as “hitting a number of 28 or higher” in the second roulette game. Maybe that’s not a normal way to understand human behavior, but, empirically, I can confirm that the negative binomial model fits the data.
In short, remember that not every process will follow a normal distribution. The normal distribution stems from the average of a bunch of random processes, even if those random processes are themselves non-normal. When we run a hypothesis test, we aren’t assuming that our sample is normally distributed. Rather, we’re assuming that, if we sampled our data a super large number of times, the averages of those samples would follow a normal distribution.