STAT1013: Some Probability Distributions
Some discrete probability distributions
In this section, we learn some of the most frequently used discrete probability distributions, such as the Bernoulli and binomial distributions.
The basic idea is to find a formula for the probability mass function (PMF) of a discrete random variable $X$, that is, $f(x) = \mathbb{P}(X=x)$.
We will also learn formulas for the mean and variance of $X$.
Bernoulli distribution
Description. The Bernoulli distribution is a univariate discrete probability distribution used to model random experiments that have binary outcomes. The two possible outcomes of a Bernoulli trial are success and failure, with the probability of success denoted by $p$ and the probability of failure denoted by $q = 1 - p$.
Param. probability of success $p$
Notation. $X \sim \text{Bern}(p)$
Mean. $\mathbb{E}(X) = p.$
Variance. $\text{Var}(X) = p ( 1 - p)$
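The mean and variance above follow directly from the two-point PMF. A short derivation, using only the definitions of expectation and variance:

```latex
\begin{aligned}
\mathbb{E}(X)   &= 0 \cdot (1-p) + 1 \cdot p = p, \\
\mathbb{E}(X^2) &= 0^2 \cdot (1-p) + 1^2 \cdot p = p, \\
\text{Var}(X)   &= \mathbb{E}(X^2) - \big(\mathbb{E}(X)\big)^2 = p - p^2 = p(1-p).
\end{aligned}
```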
Example 1. Suppose that you have a huge mask machine. It is known that 50% of the masks are orange, 25% are yellow, and the other 25% are brown. You are going to draw one mask.
Let the random variable $X$ be 1 if the mask is yellow (‘success’) and 0 otherwise (‘failure’). Construct the probability distribution of $X$. Find the mean and variance of $X$.
Solution.
- Param: $p = \mathbb{P}(X = 1) = 1/4$.
- $X \sim \text{Bern}(0.25)$
- Probability mass function of $X$: \(f(0) = \mathbb{P}(X = 0) = 3/4, \quad f(1) = \mathbb{P}(X = 1) = 1/4.\)
- Mean: \(\mathbb{E}(X) = 0 \times (3/4) + 1 \times (1/4) = 1/4.\)
- Variance: \(\text{Var}(X) = p(1-p) = 3/16.\)
# Example 1: Python solution
# Step 1: find the routine/documentation in scipy.stats
from scipy.stats import bernoulli
# Step 2: define a random variable
X = bernoulli(0.25)
# Step 3: methods - pmf, cdf, quantile, expectation, sampling, mean, std
print(X.pmf(0))
print(X.pmf(1))
print(X.mean())
print(X.var())
0.1875
Binomial distribution
Description. The binomial distribution is the discrete probability distribution of the number of successes when a Bernoulli experiment $\text{Bern}(p)$ is performed $n$ independent times.
Param.
- $n \in \{0, 1, \cdots\}$: number of trials
- $p \in [0,1]$: success probability for each trial
Notation. $X \sim B(n,p)$
Mean. $\mathbb{E}(X) = np$.
Variance. $\text{Var}(X) = np(1-p)$.
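One way to see where these two formulas come from, sketched under the assumption that $X$ is written as a sum of $n$ independent Bernoulli indicators $X_1, \dots, X_n$ (notation introduced here only for illustration):

```latex
\begin{aligned}
X &= X_1 + X_2 + \cdots + X_n, \qquad X_i \sim \text{Bern}(p) \text{ independent}, \\
\mathbb{E}(X) &= \sum_{i=1}^{n} \mathbb{E}(X_i) = np, \\
\text{Var}(X) &= \sum_{i=1}^{n} \text{Var}(X_i) = np(1-p).
\end{aligned}
```

The variance step relies on the independence of the trials; the mean step does not.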
Example 2. Under the setting in Example 1, you are going to draw three masks. Find the probability that you draw exactly two yellows.
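Solution. Let $X$ be the number of yellow masks among the three draws, so $X \sim B(3, 0.25)$. Writing 1 for yellow and 0 otherwise, the event $\{X = 2\}$ consists of three equally likely outcomes:
\[\mathbb{P}(X = 2) = \mathbb{P}\big( \{(1,1,0), (1,0,1), (0,1,1)\} \big) = 3 \times (1/4)^2 \times (3/4) = 9/64.\]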
# Example 2: Python solution
# Step 1: find the routine/documentation in scipy.stats
from scipy.stats import binom
# Step 2: define a random variable
X = binom(n=3, p=0.25)
# Step 3: methods - pmf, cdf, quantile, expectation, sampling, mean, std
print(X.pmf(2))
0.14062499999999994
🧮 Probability mass function (PMF) of the binomial distribution. In general, if the random variable $X$ follows the binomial distribution with parameters $n$ and $p$, denoted $X \sim B(n, p)$, then the probability of getting exactly $k$ successes in $n$ independent Bernoulli trials is given by the probability mass function:
\({\displaystyle f(k,n,p)=\Pr(k;n,p)=\Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k}},\) for $k = 0, 1, 2, \cdots, n$, where \({\displaystyle {\binom {n}{k}}={\frac {n!}{k!(n-k)!}}}.\)
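The formula can be evaluated directly with the standard library. The following is a minimal sketch, not a replacement for scipy.stats; the helper name `binomial_pmf` is made up for this illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ B(n, p), computed directly from the PMF formula."""
    if not 0 <= k <= n:
        return 0.0  # values outside 0..n have probability zero
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example 2 revisited: three draws, P(yellow) = 1/4, exactly two yellows.
prob_two_yellow = binomial_pmf(2, 3, 0.25)
print(prob_two_yellow)  # should agree with 9/64 up to rounding

# The PMF values for k = 0, ..., n add up to one.
total = sum(binomial_pmf(k, 3, 0.25) for k in range(4))
print(total)
```

Here `comb(n, k)` plays the role of the binomial coefficient $\binom{n}{k}$.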
Example 3. Suppose that you are going to inspect a shipment of masks by randomly selecting 20 masks from the whole lot. If at least 5 of these masks are defective, the shipment is rejected. The manufacturer indicates that 5% of the masks are defective.
- What is the probability that exactly 3 defective masks are selected?
- What is the probability that at least one defective mask is selected?
- What is the probability of rejecting the lot?
# Example 3: Python Solution
# Step 1: find the routine/documentation in scipy.stats
from scipy.stats import binom
# Step 2: define a random variable
X = binom(n=20, p=0.05)
# Step 3: methods - pmf, cdf, quantile, expectation, sampling, mean, std
## Q1: P(X=3)
print(X.pmf(3))
## Q2: P(X>=1) = 1 - P(X=0)
print(1 - X.pmf(0))
## Q3: P(X>=5) = 1 - P(X<=4)
print(1 - X.cdf(4))
0.0025739403346523027
Some continuous probability distributions
A continuous random variable is a random variable that can take on any value within a certain range.
Examples of continuous random variables are:
- The weight of a newborn baby.
- The amount of rain that falls in a randomly selected storm.
- The length of time it takes to score 100 points in an NBA game.
🧮 Definition [Probability density function (pdf)]. The function $f(x)$ is a pdf for the continuous random variable $X$ if
- $f(x) \geq 0$, for all $x \in \mathbb{R}$.
- $\int_{- \infty}^{\infty} f(x) dx = 1$.
- $\mathbb{P}(a < X < b) = \int_a^b f(x) dx$, for all $a < b$.
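The three conditions can be checked numerically for a concrete density. The sketch below uses a made-up density, $f(x) = 2x$ on $[0,1]$ and zero elsewhere, together with a simple midpoint rule; both the density and the helper `integrate` are assumptions chosen for illustration.

```python
def f(x: float) -> float:
    """Illustrative density: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(g, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


# Condition 1: f is non-negative (checked on a grid from -1 to 2).
nonnegative = all(f(-1.0 + 0.01 * i) >= 0.0 for i in range(301))

# Condition 2: the total area is one (f vanishes outside [0, 1]).
total_area = integrate(f, 0.0, 1.0)

# Condition 3: P(0.2 < X < 0.5) is the area under f between 0.2 and 0.5.
prob = integrate(f, 0.2, 0.5)

print(nonnegative, total_area, prob)
```

For this density the exact value of the last probability is $0.5^2 - 0.2^2 = 0.21$, which the numerical integral should reproduce closely.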
Uniform Distribution
Description. The uniform distribution is a continuous probability distribution on an interval $[a,b]$ in which all subintervals of the same length are equally likely.
Param. $a < b$.
Notation. $X \sim U_{[a,b]}$.
PDF. \(f(x) = {\displaystyle {\begin{cases}{\frac {1}{b-a}}&{\text{for }}x\in [a,b]\\0&{\text{otherwise}}\end{cases}}}.\)
Mean. $\mathbb{E}(X) = (a + b) / 2$.
Variance. $\text{Var}(X) = (b-a)^2 / 12$
Example 1. A continuous random variable $X$ that can assume values between $x = 1$ and $x = b$ has a density function given by $f(x) = 1/2$.
- Find $b$.
- Find $\mathbb{P}(X < 1.5)$
- Find $\mathbb{P}(X \leq 2 \mid X > 1.5)$
- Find $\mathbb{E}(X)$ and $\text{Var}(X)$
Solution.
- $\int_1^b 1/2 dx = (b - 1)/2 = 1$, thus $b = 3$.
- $\mathbb{P}(X < 1.5) = \int_1^{1.5} 1/2 dx = 1/4$.
- $\mathbb{P}(X \leq 2 \mid X > 1.5) = \mathbb{P}(1.5 < X \leq 2) / \mathbb{P}(X > 1.5) = (1/4) / (3/4) = 1/3$.
- $\mathbb{E}(X) = (1 + 3)/2 = 2$.
- $\text{Var}(X) = (3 - 1)^2 / 12 = 1/3$.
# Example 1: Python Solution
# Step 1: find the routine/documentation in scipy.stats
from scipy.stats import uniform
# Step 2: define a random variable
# In the standard form, the distribution is uniform on [0, 1]. Using the parameters loc and scale, one obtains the uniform distribution on [loc, loc + scale].
X = uniform(loc=1,scale=2)
# Step 3: methods - pdf, cdf, quantile, expectation, sampling, mean, std
## Q2: P(X<1.5)
print(X.cdf(1.5))
## Q3: P(1.5<X<=2) / P(X>1.5)
print((X.cdf(2) - X.cdf(1.5))/(1 - X.cdf(1.5)))
## Q4: mean
print(X.mean())
## Q4: variance
print(X.var())
0.25
0.3333333333333333
2.0
0.3333333333333333
Normal Distribution
Description. A normal distribution, or Gaussian distribution, is a continuous probability distribution for a real-valued random variable whose density is a symmetric, bell-shaped curve centered at the mean.
Param. mean $\mu$ and standard deviation $\sigma$.
Notation. $X \sim N(\mu, \sigma)$.
PDF. \(f(x)= {\displaystyle {\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}}\)
Mean. $\mathbb{E}(X) = \mu$.
Variance. $\text{Var}(X) = \sigma^2$
Example 2. Plot the densities of two pairs of normal distributions:
- $X_1 \sim N(\mu=-1,\sigma=1)$ and $X_2 \sim N(\mu=1,\sigma=1)$
- $X_1 \sim N(\mu=0,\sigma=1)$ and $X_2 \sim N(\mu=0,\sigma=0.1)$
## Example 2: Python Solution
from scipy.stats import norm
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20, 10) #set default figure size
n = 10000
# CASE 1
X1 = norm(-1, 1)
X2 = norm(1, 1)
data1 = X1.rvs(n)
data2 = X2.rvs(n)
fig, ax = plt.subplots()
sns.kdeplot(data1, fill=True, ax=ax, label='N(-1,1)')
sns.kdeplot(data2, fill=True, ax=ax, label='N(1,1)')
plt.legend(loc="upper left")
plt.show()
# CASE 2
X1 = norm(0, 1)
X2 = norm(0, 0.1)
data1 = X1.rvs(n)
data2 = X2.rvs(n)
fig, ax = plt.subplots()
sns.kdeplot(data1, fill=True, ax=ax, label='N(0,1)')
sns.kdeplot(data2, fill=True, ax=ax, label='N(0,0.1)')
plt.legend(loc="upper left")
plt.show()
Some facts about the normal distribution.
- If $X \sim N(\mu, \sigma)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$, that is, $Z$ follows the standard normal distribution.
- The mode (the maximum of the pdf) occurs at $x = \mu$.
- The pdf curve is symmetric about the vertical line $x=\mu$.
- The pdf curve approaches zero asymptotically as we move away from the mean in either direction.
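These properties can be checked numerically with scipy.stats. The sketch below uses the illustrative parameters $\mu = 50$ and $\sigma = 2$; the evaluation points are arbitrary choices made for this example.

```python
from scipy.stats import norm

mu, sigma = 50, 2          # illustrative parameters
X = norm(mu, sigma)        # X ~ N(mu, sigma)
Z = norm(0, 1)             # standard normal

# Standardization: P(X <= x) equals P(Z <= (x - mu) / sigma).
x = 53.0
p_x = X.cdf(x)
p_z = Z.cdf((x - mu) / sigma)

# Symmetry: the pdf takes the same value at mu - d and mu + d.
d = 1.7
left, right = X.pdf(mu - d), X.pdf(mu + d)

# Mode: the pdf at mu is at least as large as at nearby points.
peak_is_at_mu = X.pdf(mu) >= max(X.pdf(mu - 0.5), X.pdf(mu + 0.5))

# Tails: the pdf is very small far away from the mean.
far_tail = X.pdf(mu + 10 * sigma)

print(p_x, p_z)
print(left, right)
print(peak_is_at_mu, far_tail)
```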
Example 3. A mask machine is regulated so that it produces an average of 50 pieces per bag. Suppose that the number of pieces is normally distributed with a standard deviation of 2 pieces.
- What fraction of the bags will contain more than 75 pieces?
- What is the probability that a randomly chosen bag contains between 25 and 60 pieces?
- Below what value do we get the smallest 2.5% of the bags?
## Example 3: Python Solution
from scipy.stats import norm
X = norm(50, 2)
#Q1: P(X>75) = 1 - P(X<=75) = 1 - cdf(75)
print(1-X.cdf(75))
#Q2: P(25<=X<=60) = P(X<=60) - P(X<=25) = cdf(60) - cdf(25)
print(X.cdf(60) - X.cdf(25))
#Q3: P(X<=q) = 0.025; cdf(q) = 0.025; q = inverse cdf at 0.025 = ppf(0.025)
print(X.ppf(0.025))
0.0
0.9999997133484281
46.080072030919894
Summary: Probability Distributions
CDF: A probability distribution $\mathbb{P}(X \in A)$ can be described by its cumulative distribution function (CDF) \(F_{X}(x) = \mathbb{P}(X \leq x).\)
PDF/PMF: Sometimes, a random variable can also be described by a density function $f(x)$ that is related to its CDF by \(F_X(x) = \mathbb{P}(X \leq x) = \int_{-\infty}^x f(t)dt.\) When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.
Quantile: The quantile function specifies the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability: \(Q_X(p) = F^{-1}_X(p), \quad F_X(Q_X(p)) = \mathbb{P}( X \leq Q_X(p) ) = p.\) For example, the median of $X$ is $Q_X(0.5)$, that is, the value $q$ such that $\mathbb{P}(X \leq q) = 0.5$.
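In scipy.stats the quantile function is exposed as `ppf` (percent point function). A small sketch of the inverse relationship between `ppf` and `cdf`, assuming the distribution $N(50, 2)$ from Example 3:

```python
from scipy.stats import norm

X = norm(50, 2)  # same distribution as in Example 3

p = 0.025
q = X.ppf(p)            # quantile: the value q with P(X <= q) = p
round_trip = X.cdf(q)   # applying the CDF to q recovers p

median = X.ppf(0.5)     # for a normal distribution the median equals the mean

print(q, round_trip, median)
```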
Discrete random variable
- The number of possible values of $X$ is finite, say, $x_1, x_2, x_3, \cdots, x_K$.
- We replace a density with a probability mass function, a non-negative sequence that sums to one, i.e., \(f_X(x) = \mathbb{P}(X = x).\)
- We replace integration with summation in the formula that relates a CDF to a probability mass function, that is, \(F_X(x) = \mathbb{P}(X \leq x) = \sum_{k:\, x_k \leq x} \mathbb{P}(X = x_k).\)
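For a discrete random variable, the CDF at $x$ is the sum of the PMF over the possible values that do not exceed $x$. A minimal sketch, assuming the distribution $B(3, 0.25)$ from Example 2:

```python
from scipy.stats import binom

X = binom(n=3, p=0.25)  # same distribution as in Example 2
possible_values = range(0, 4)  # 0, 1, 2, 3

x = 1
# Sum the PMF over every possible value that does not exceed x.
cdf_by_sum = sum(X.pmf(k) for k in possible_values if k <= x)
cdf_builtin = X.cdf(x)

print(cdf_by_sum, cdf_builtin)
```

Both numbers should agree with $\mathbb{P}(X \leq 1) = 27/64 + 27/64 = 54/64$.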
Continuous random variable
A continuous random variable is a random variable whose possible values form an uncountable set, such as an interval of real numbers.
If $F_X(x)$ is differentiable, then $f_X(x) = F'_X(x)$.
The area under the pdf curve over an interval is the probability that $X$ falls in that interval.
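The relationship between the density and the CDF can be illustrated numerically. The sketch below assumes a standard normal distribution and approximates the derivative of the CDF with a central difference; the evaluation point and step size are arbitrary choices.

```python
from scipy.stats import norm

X = norm(0, 1)  # standard normal, chosen only as an illustration

x, h = 0.7, 1e-5
# Central-difference approximation of the derivative F'(x).
derivative = (X.cdf(x + h) - X.cdf(x - h)) / (2 * h)
density = X.pdf(x)

# Area under the pdf over (a, b] expressed through the CDF.
a, b = -1.0, 1.0
prob_ab = X.cdf(b) - X.cdf(a)

print(derivative, density)
print(prob_ab)
```

The last value is the familiar probability of falling within one standard deviation of the mean, roughly 0.68.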
Python Solution: scipy.stats
- Find the routine/documentation in scipy.stats
- Define a random variable
- Methods: pdf, cdf, quantile, expectation, sampling
Methods:
- Continuous random variable: cdf, pdf, ppf (quantile), random sampling
- Discrete random variable: cdf, pmf, ppf (quantile), random sampling
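The workflow can be sketched side by side for one discrete and one continuous random variable. The distributions are taken from the examples earlier in these notes; the variable names and the `random_state` seed are choices made for this illustration.

```python
from scipy.stats import binom, norm

# Discrete example: X ~ B(20, 0.05), as in the inspection example.
D = binom(n=20, p=0.05)
d_pmf = D.pmf(3)                          # P(X = 3)
d_cdf = D.cdf(4)                          # P(X <= 4)
d_ppf = D.ppf(0.5)                        # smallest k with P(X <= k) >= 0.5
d_sample = D.rvs(size=5, random_state=0)  # five seeded random draws

# Continuous example: X ~ N(50, 2), as in the bag-filling example.
C = norm(50, 2)
c_pdf = C.pdf(50)                         # density at the mean (not a probability)
c_cdf = C.cdf(52)                         # P(X <= 52)
c_ppf = C.ppf(0.025)                      # value q with P(X <= q) = 0.025
c_sample = C.rvs(size=5, random_state=0)  # five seeded random draws

print(d_pmf, d_cdf, d_ppf, d_sample)
print(c_pdf, c_cdf, c_ppf, c_sample)
```

Note that a discrete random variable has `pmf` but no `pdf`, and a continuous one has `pdf` but no `pmf`; `cdf`, `ppf`, and `rvs` are available for both.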