
The Use of CDF in Machine Learning

This article describes the concept of the CDF and its use in machine learning.

Jason Li · July 26, 2020 · 7 min read

What is CDF?

CDF stands for Cumulative Distribution Function. It is used in probability and statistics to describe how probability accumulates over the values of a random variable. The full, detailed definition is available on Wikipedia.

In probability and statistics, given a real-valued random variable $X$ and a value $x$ at which we want to evaluate it, the CDF of $X$ is the probability that $X$ is less than or equal to $x$.

The key word in CDF is "cumulative": to evaluate the CDF of $X$ at the point $x$, we add up the probability of every value of $X$ up to and including $x$. Let's have a look at the formula of the CDF.

Formula of CDF

For any real-valued random variable $X$, the CDF can be calculated by:

$$F_X(x) = P(X \le x)$$

where $F_X(x)$ is the CDF of $X$ evaluated at $x$, and $P(X \le x)$ is the probability that $X$ takes a value less than or equal to $x$. As shown below, the red point is the point $x$ and the blue shaded area is the interval of all possible values for $X$.

CDF-probability

When to Use CDF?

The CDF is often used to calculate the probability that the value of $X$ falls within a certain interval. For example, when $P(a < X \le b)$ needs to be calculated, we can use the CDF:

$$P(a < X \le b) = F_X(b) - F_X(a)$$

when-to-use-CDF

Note that if you need the probability of the fully closed interval, $P(a \le X \le b)$, you have to add the probability of $X = a$ to the above equation, i.e.

$$P(a \le X \le b) = F_X(b) - F_X(a) + P(X = a)$$
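The two interval formulas above can be checked numerically. The following is a minimal sketch that assumes a fair six-sided die as the random variable; the names `PMF`, `cdf`, `prob_half_open` and `prob_closed` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Assumption for illustration: X is a fair six-sided die, so each face
# 1..6 has probability 1/6.
PMF = {face: Fraction(1, 6) for face in range(1, 7)}


def cdf(x):
    """F_X(x) = P(X <= x): add up the probability of every outcome <= x."""
    return sum((p for face, p in PMF.items() if face <= x), Fraction(0))


def prob_half_open(a, b):
    """P(a < X <= b) = F_X(b) - F_X(a)."""
    return cdf(b) - cdf(a)


def prob_closed(a, b):
    """P(a <= X <= b) = F_X(b) - F_X(a) + P(X = a)."""
    return cdf(b) - cdf(a) + PMF.get(a, Fraction(0))


# (2, 4] contains the outcomes 3 and 4; [2, 4] also contains 2.
print("P(2 <  X <= 4) =", prob_half_open(2, 4))
print("P(2 <= X <= 4) =", prob_closed(2, 4))
```

Exact fractions are used instead of floats so that the results can be compared with the formulas without rounding error. The extra term $P(X = a)$ only matters when a single point can carry probability, as it does for a discrete variable like this one.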

CDF of Discrete Random Variables

Let's play a dice game. Suppose the die has 6 sides ($X \in \{1, 2, 3, 4, 5, 6\}$). The probability of rolling each side is:

$$P(X = 1) = P(X = 2) = P(X = 3) = P(X = 4) = P(X = 5) = P(X = 6) = 1/6$$

We can calculate the value of the CDF, $F_X(x)$, for

$$x \in [0, \dots, 1, \dots, 2, \dots, 3, \dots, 4, \dots, 5, \dots, 6, \dots (> 6)]$$

(each $\dots$ stands for the values in between) as:

| $x$ | $F_X(x)$ |
| --- | --- |
| 0 | $F_X(0) = P(X \le 0) = 0$ |
| ... | $F_X(...) = P(X \le ...) = 0$ |
| 1 | $F_X(1) = P(X \le 1) = 1/6$ |
| ... | $F_X(...) = P(X \le ...) = 1/6$ |
| 2 | $F_X(2) = P(X \le 2) = 2/6$ |
| ... | $F_X(...) = P(X \le ...) = 2/6$ |
| 3 | $F_X(3) = P(X \le 3) = 3/6$ |
| ... | $F_X(...) = P(X \le ...) = 3/6$ |
| 4 | $F_X(4) = P(X \le 4) = 4/6$ |
| ... | $F_X(...) = P(X \le ...) = 4/6$ |
| 5 | $F_X(5) = P(X \le 5) = 5/6$ |
| ... | $F_X(...) = P(X \le ...) = 5/6$ |
| 6 | $F_X(6) = P(X \le 6) = 1$ |
| ... | $F_X(...) = P(X \le ...) = 1$ |

Plot the CDF according to the above table:

Based on this graph, we can see that the CDF of $X$ is discrete: it jumps at each possible value of $X$ and stays flat in between.
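The dice CDF can also be computed in code. This is a minimal sketch, assuming a fair die; `dice_cdf`, `faces`, `probs` and `cumulative` are hypothetical names chosen for this example. It builds the running totals of the probabilities once and then looks up how many faces are less than or equal to $x$.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]            # possible outcomes, sorted
probs = [Fraction(1, 6)] * 6          # assumption: the die is fair
cumulative = list(accumulate(probs))  # running totals of the probabilities


def dice_cdf(x):
    """F_X(x) = P(X <= x) for the die, valid for any real x."""
    count = bisect_right(faces, x)    # number of faces <= x
    return cumulative[count - 1] if count else Fraction(0)


# Evaluate at the faces and at some points in between.
# Fractions print in lowest terms, so 2/6 appears as 1/3.
for x in [0, 0.5, 1, 1.5, 2, 3, 4, 5, 6, 7]:
    print(f"F_X({x}) = {dice_cdf(x)}")
```

The value only changes when $x$ reaches a face of the die, which is the step behaviour described above.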

CDF of Continuous Random Variables

Let’s use the game of Wheel of Fortune to explain the CDF of Continuous Random Variables.

Assume that the number the pointer points to when the wheel stops is $X$ ($X \in [0, 10)$) and that every position is equally likely. We can again use the CDF to calculate probabilities. Let's see what the CDF of $X$ is for $x \in \{-1, 0, 1, 5, 10, 15\}$.

With every position in $[0, 10)$ equally likely, a single point such as $X = 0$ has probability 0, and the CDF grows as $F_X(x) = \frac{x}{10}$ for $0 \le x < 10$:

$$
\begin{aligned}
F_X(-1) &= P(X \le -1) = 0 \\
F_X(0) &= P(X \le 0) = 0 \\
F_X(1) &= P(0 \le X \le 1) = \frac{1}{10} \\
F_X(5) &= P(0 \le X \le 5) = \frac{5}{10} \\
F_X(10) &= P(0 \le X \le 10) = \frac{10}{10} \\
F_X(15) &= P(0 \le X \le 15) = \frac{10}{10}
\end{aligned}
$$

Let’s plot these data:

With this graph, we can clearly see the continuity of its CDF values.
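As a rough sketch of the wheel example in code, the function below assumes the pointer is equally likely to stop anywhere in $[0, 10)$, i.e. a uniform distribution. That assumption, and the names `wheel_cdf`, `low` and `high`, are introduced for this illustration only.

```python
def wheel_cdf(x, low=0.0, high=10.0):
    """CDF of a wheel whose pointer lands uniformly in [low, high).

    Assumption: every position is equally likely, so the probability of
    landing at or below x is the covered fraction of the interval.
    """
    if x <= low:
        return 0.0   # no probability has accumulated yet
    if x >= high:
        return 1.0   # all probability has accumulated
    return (x - low) / (high - low)


for x in [-1, 0, 1, 5, 10, 15]:
    print(f"F_X({x}) = {wheel_cdf(x)}")
```

Unlike the dice example, there is no lookup over a finite list of outcomes here: the CDF rises smoothly from 0 to 1 across the interval.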

Differences in CDF Between Discrete and Continuous Random Variables

For discrete random variables, the CDF is not continuous: its value just before a jump point differs significantly from its value at that point. For example, going back to the dice game, $F_X(5) = \frac{5}{6}$ while $F_X(4.9999999...) = \frac{4}{6}$.

$$F_X(5) \ne F_X(4.9999999...)$$

For continuous random variables, on the other hand, the CDF is continuous, so its value at each point is very close to its value at the previous point. For example, going back to the Wheel of Fortune with every position equally likely, $F_X(5) = \frac{5}{10}$ and $F_X(4.9999999...) = \frac{4.9999999...}{10}$.

$$F_X(5) \approx F_X(4.9999999...)$$
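The contrast can be made concrete by evaluating both CDFs at 5 and at a point just below 5. This sketch assumes a fair die and a wheel that is uniform on $[0, 10)$; `dice_cdf`, `wheel_cdf` and `eps` are hypothetical names for this example.

```python
import math


def dice_cdf(x):
    """CDF of a fair six-sided die: (number of faces <= x) / 6."""
    return min(max(math.floor(x), 0), 6) / 6


def wheel_cdf(x):
    """CDF of a wheel assumed uniform on [0, 10): x / 10, clamped to [0, 1]."""
    return min(max(x / 10, 0.0), 1.0)


eps = 1e-9  # a tiny step to the left of the point of interest

# Difference between the CDF at 5 and just before 5.
dice_jump = dice_cdf(5) - dice_cdf(5 - eps)
wheel_jump = wheel_cdf(5) - wheel_cdf(5 - eps)

print("dice :", dice_jump)   # stays near 1/6 however small eps is
print("wheel:", wheel_jump)  # shrinks towards 0 as eps shrinks
```

For the die, the difference equals $P(X = 5)$ regardless of how small the step is. For the wheel, the difference is proportional to the step, which is what continuity of the CDF means here.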


Written by Jason Li

Jason is currently a PhD Candidate in Computer Science at RMIT University.
