# Histograms and Cumulative Distribution Functions

For this project we’re going to be using the SF Salaries data set from Kaggle. Here is what the data set looks like:

Our goal will be to compare the distributions of employee salaries in 2011 and 2014. We will try to detect any notable differences in the distributions.

First we’re going to split the original DataFrame by Year, so we get 4 smaller DataFrames:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

salaries = pd.read_csv('Salaries.csv')
salaries_2011 = salaries[salaries['Year'] == 2011]
salaries_2012 = salaries[salaries['Year'] == 2012]
salaries_2013 = salaries[salaries['Year'] == 2013]
salaries_2014 = salaries[salaries['Year'] == 2014]
```

The most common representation of a distribution is a histogram, a graph that shows the frequency of each value (or range of values) in the data. A histogram is very useful for inspecting the underlying distribution of the data, detecting outliers, spotting skewness, and so on. We’re going to plot the salary distribution histogram of employees for 2011. Histograms can be very informative, as you can see in the following plot:

```python
salaries.drop(salaries[salaries['TotalPay'] > 300000].index, axis=0, inplace=True)  # outlier removal

plt.figure(figsize=(15, 12))
sns.distplot(a=salaries_2011['TotalPay'], bins=350, label='2011', kde=True).set_title('Salary Distribution 2011')
plt.legend(loc='upper right')
plt.show()
```

*Note: The outlier removal method in the snippet above is not recommended, but it will satisfy the needs of this project.*

We said we wanted to compare the distributions of salaries for 2011 and 2014, so let’s try doing that using histograms:

```python
plt.figure(figsize=(18, 12))
sns.distplot(a=salaries_2011['TotalPay'], bins=300, label='2011', kde=True).set_title('Salary Distribution for 2011 and 2014')
sns.distplot(a=salaries_2014['TotalPay'], bins=300, label='2014', kde=True)
plt.legend(loc='upper right')
plt.show()
```

As you can see this is not very revealing. Two major problems occur when comparing distributions using histograms:

- They are affected by different sample sizes. If we had more samples for 2011 than for 2014, we would be misled into thinking that there is a substantial difference in the frequency distributions, when actually it’s just a result of different sample sizes.
- They are not easily interpretable. As you can see, they’re cluttered, and it’s really hard to spot any difference in the frequency distributions, even if there were one.

So, to address these problems, we’re going to introduce a new concept called the Cumulative Distribution Function, or CDF, for short.

### Cumulative Distribution Functions

A CDF is a function that describes the probability of a random variable taking on a given value or less. To calculate it for a particular value *x* of a distribution, you do the following:

- Sort the values in the distribution in ascending order,
- Count the number of values lower than or equal to *x*,
- Divide that count by the total number of values in the distribution.

In Python:

```python
def EvalCdf(sample, x):
    count = 0.0
    for value in sample:
        if value <= x:
            count += 1
    prob = count / len(sample)
    return prob
```
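For a quick sanity check, here is the same calculation written inline on a toy sample (the numbers are made up for illustration):

```python
# Inline equivalent of EvalCdf on a toy sample
sample = [1, 2, 3, 4, 5]
x = 3
prob = sum(value <= x for value in sample) / len(sample)
print(prob)  # 3 of the 5 values are <= 3, so the CDF at 3 is 0.6
```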

If you have ever taken a standardized test, then you will know what *percentile ranks* are. Well, CDFs are essentially the same thing, except that the result is a probability in the range 0-1, instead of a score from 0-100.

Percentile Ranks in Python:

```python
def PercentileRank(scores, your_score):
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank
```
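To see that the two really agree up to a factor of 100, we can run both on a small set of toy scores (both functions are repeated here so the snippet is self-contained):

```python
def PercentileRank(scores, your_score):
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    return 100.0 * count / len(scores)

def EvalCdf(sample, x):
    count = 0.0
    for value in sample:
        if value <= x:
            count += 1
    return count / len(sample)

scores = [55, 66, 77, 88, 99]
print(PercentileRank(scores, 88))   # 80.0
print(100 * EvalCdf(scores, 88))    # 80.0 -- identical, just rescaled
```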

#### Plotting CDFs

Most of my statistical evaluations rely on the visual representation of the Cumulative Distribution Function. Here is a function, in Python, that will do the calculation and the plotting for you:

```python
def plot_cdf(data, num_bins, title=None, xlabel=None):
    counts, bin_edges = np.histogram(data, bins=num_bins, density=True)
    cdf = np.cumsum(counts)

    sns.set_style("darkgrid")
    plt.figure(figsize=(18, 13))
    plt.plot(bin_edges[1:], cdf / cdf[-1])
    plt.ylabel('CDF')
    if title:
        plt.title(title)
    if xlabel:
        plt.xlabel(xlabel)
    plt.show()
```

This is a function that will produce a simple plot. You can modify it and make all kinds of alterations that will make the plot look nicer.
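The function above approximates the CDF through histogram binning. If you would rather have the exact empirical CDF, here is a sketch of an alternative that sorts the values and plots each one against its rank, with no bins involved (`plot_ecdf` is a name introduced here, not one used elsewhere in this post):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(data, title=None, xlabel=None):
    # The empirical CDF at the i-th smallest value is (i + 1) / n
    xs = np.sort(np.asarray(data))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    plt.plot(xs, ys)
    plt.ylabel('CDF')
    if title:
        plt.title(title)
    if xlabel:
        plt.xlabel(xlabel)
    plt.show()
```

Calling `plot_ecdf(salaries_2011['TotalPay'], 'Salaries 2011 CDF', 'Salary')` would then give the exact analogue of the binned plot.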

Let’s try plotting the CDF of the *TotalPay* feature from the *salaries_2011* DataFrame, without removing outliers:

```python
plot_cdf(salaries_2011['TotalPay'], 20, 'Salaries 2011 CDF', 'Salary')

The long, flat stretch at the top indicates that there are outliers in this distribution. Let’s try plotting again, after we have removed the outliers:

```python
salaries_2011.drop(salaries_2011[salaries_2011['TotalPay'] > 280000].index, axis=0, inplace=True)
plot_cdf(salaries_2011['TotalPay'], 20, 'Salaries 2011 CDF', 'Salary')
```

The histogram might look more informative than these plots at first glance, but let me point out situations in which CDF plots prove to be a lot more useful than histograms.

#### Comparing Distributions

Earlier in this post you saw that histograms aren’t the best tool for comparing distributions, and that is exactly why we introduced CDFs.

Let’s again try plotting the salary distributions for 2011 and 2014, only this time we will use CDFs instead of histograms:

```python
num_bins = 200
counts_2011, bin_edges_2011 = np.histogram(salaries_2011['TotalPay'], bins=num_bins, density=True)
counts_2014, bin_edges_2014 = np.histogram(salaries_2014['TotalPay'], bins=num_bins, density=True)
cdf_2011 = np.cumsum(counts_2011)
cdf_2014 = np.cumsum(counts_2014)

plt.figure(figsize=(18, 13))
sns.set_style("darkgrid")
plt.plot(bin_edges_2011[1:], cdf_2011 / cdf_2011[-1], label='2011')
plt.plot(bin_edges_2014[1:], cdf_2014 / cdf_2014[-1], label='2014')
plt.title('Salary Distribution for 2011 and 2014')
plt.xlabel('Salary')
plt.ylabel('CDF')
plt.legend(loc='lower right')
```

Unlike with the histograms, here we can almost immediately see the difference between the two distributions. CDF plots might be a bit harder to read at first, but you will get the hang of it.

With histograms, different sample sizes can trick you into thinking that there is a difference in the distributions, which was the first of the two problems with histograms we listed above. With CDFs, it’s irrelevant how much data each sample contains.
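A quick synthetic check of that claim: draw two samples of very different sizes from the same distribution and evaluate the empirical CDF of each at the same point. The fractions come out nearly identical, even though the raw histogram counts would differ by a factor of ten (the distribution parameters below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(loc=70000, scale=30000, size=1000)
large = rng.normal(loc=70000, scale=30000, size=10000)

def ecdf_at(sample, x):
    # Fraction of values <= x; dividing by len(sample) cancels out sample size
    return np.mean(sample <= x)

print(ecdf_at(small, 70000))  # both fractions come out close to 0.5
print(ecdf_at(large, 70000))
```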

The plot above suggests that the salaries have risen from 2011 to 2014. Comparing the mean and the median of the distributions supports that claim:

```python
print('2011 Salaries median - %.3f' % (salaries_2011['TotalPay'].median()))
print('2014 Salaries median - %.3f' % (salaries_2014['TotalPay'].median()) + '\n')
print('2011 Salaries mean - %.3f' % (salaries_2011['TotalPay'].mean()))
print('2014 Salaries mean - %.3f' % (salaries_2014['TotalPay'].mean()))
```

```
2011 Salaries median - 68205.860
2014 Salaries median - 72355.030

2011 Salaries mean - 71663.077
2014 Salaries mean - 75367.466
```

An arbitrary number of CDFs can be plotted on the same axes for comparison without any problems. Let’s try plotting the salary distributions for each year from 2011 to 2014:

```python
num_bins = 20
counts_2011, bin_edges_2011 = np.histogram(salaries_2011['TotalPay'], bins=num_bins, density=True)
counts_2012, bin_edges_2012 = np.histogram(salaries_2012['TotalPay'], bins=num_bins, density=True)
counts_2013, bin_edges_2013 = np.histogram(salaries_2013['TotalPay'], bins=num_bins, density=True)
counts_2014, bin_edges_2014 = np.histogram(salaries_2014['TotalPay'], bins=num_bins, density=True)

cdf_2011 = np.cumsum(counts_2011)
cdf_2012 = np.cumsum(counts_2012)
cdf_2013 = np.cumsum(counts_2013)
cdf_2014 = np.cumsum(counts_2014)

plt.figure(figsize=(18, 13))
sns.set_style("darkgrid")
plt.plot(bin_edges_2011[1:], cdf_2011 / cdf_2011[-1], label='2011')
plt.plot(bin_edges_2012[1:], cdf_2012 / cdf_2012[-1], label='2012')
plt.plot(bin_edges_2013[1:], cdf_2013 / cdf_2013[-1], label='2013')
plt.plot(bin_edges_2014[1:], cdf_2014 / cdf_2014[-1], label='2014')
plt.legend(loc='lower right')
plt.title('Salary Distributions from 2011 to 2014')
plt.xlabel('Salary')
plt.ylabel('CDF')
```

#### Identification of the Distribution Type

One of the things that histograms are very useful for is the identification of the distribution type. Some of the common analytic distributions are the Normal, the Poisson, the Exponential and the Uniform Distribution (there are a lot more).

But no single analytic distribution will be a perfect fit for our data. What we want to find out is not whether the analytic distribution fits our data perfectly, but whether it fits the data well enough for the task at hand. CDFs are especially useful for this task.

We will essentially do the same thing that we did when we compared the distributions above. We will plot the CDF of the analytic distribution that we want to test for and the CDF of our data on the same plot. Then, we will visually compare them and see whether the data fits the curve well enough.

We will compare our data with two different analytic distributions, starting with the Normal Distribution.

Here is a function that will plot the CDF of a Normal Distribution, and the CDF of our data on top:

```python
def MakeNormalModel(data, num_bins=200, title=None, xlabel=None):
    counts, bin_edges = np.histogram(data, bins=num_bins)
    cdf = np.cumsum(counts)

    data = np.asarray(data)
    mean = data.mean()
    std = data.std()

    # Evaluate the analytic Normal CDF over a range of +/- 4 standard deviations
    xmin = mean - 4 * std
    xmax = mean + 4 * std
    xs = np.linspace(xmin, xmax, 101)
    ps = stats.norm.cdf(xs, mean, std)

    plt.figure(figsize=(18, 13))
    plt.ylabel('CDF')
    plt.plot(xs, ps, label='Model', linewidth=4, color='0.8')
    plt.plot(bin_edges[1:], cdf / cdf[-1], label='Data')
    if title:
        plt.title(title)
    if xlabel:
        plt.xlabel(xlabel)
    plt.legend(loc='lower right')
    plt.show()
```

We can run this function with our 2011 salary data to see whether the data is normally distributed:

```python
salary_2011 = salaries_2011['TotalPay']
MakeNormalModel(salary_2011, title='Normal Distribution', xlabel='Salary')
```

As you can see, this data is clearly not well modeled by a Normal Distribution, especially in the lower part of the range. The data above the 50,000 threshold, however, could under certain circumstances be accepted as approximately normal.

We will now plot the data against a Lognormal Distribution. If you aren’t familiar with the Lognormal Distribution, you can learn more about it here.

```python
log_salary_2011 = salary_2011 + 1
log_salary_2011 = np.log10(log_salary_2011)
MakeNormalModel(log_salary_2011, title='Lognormal Distribution', xlabel='Salary (log10)')
```

Our original data has plenty of 0 values, and taking the logarithms of those is going to yield undefined or infinite values. So, to prevent that we added 1 to each value in the original data.
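A tiny illustration of the problem the shift avoids (toy values, chosen so the shifted logs come out round):

```python
import numpy as np

values = np.array([0.0, 9.0, 99.0])
with np.errstate(divide='ignore'):
    print(np.log10(values))      # first entry is -inf: the log of 0 is undefined
print(np.log10(values + 1))      # [0. 1. 2.] -- shifting by 1 keeps everything finite
```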

The Lognormal distribution is also not a good fit for our data. The problem might lie in the fact that we have too many 0 values.

Anyway, you have realised by now how useful CDFs are when comparing distributions. I have shown you how to compare this data to two common distributions. I will leave it to you to find out which of the other common distributions fits this data best.
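If you want to make that comparison quantitative rather than purely visual, one option is the Kolmogorov-Smirnov statistic from `scipy.stats`: it measures the largest vertical gap between the empirical CDF of the data and the CDF of a fitted analytic distribution, so a smaller value means a better fit. Here is a sketch on synthetic data (the lognormal sample below is only a stand-in for the salary column, with made-up parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in for salary data: strictly positive and right-skewed
data = rng.lognormal(mean=11.0, sigma=0.4, size=5000)

# Fit each candidate distribution, then measure the KS statistic of the fit
for name, dist in [('norm', stats.norm), ('lognorm', stats.lognorm)]:
    params = dist.fit(data)
    result = stats.kstest(data, name, args=params)
    print(name, round(result.statistic, 4))
```

On this synthetic sample the lognormal fit produces the smaller statistic, as you would expect; on the real salary data the rankings are yours to discover.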

The book *Think Stats* helped me out a lot in this project. It’s completely free and it’s one of the best statistics books out there. You can download it here.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Did I miss something or make a mistake somewhere?

Let me know in the comments below.