STAT1013: Probability Distributions, LLN and CLT

  • CDF: A probability distribution $ \mathbb{P} (X \in A) $ can be described by its cumulative distribution function (CDF) \(F_{X}(x) = \mathbb{P}(X \leq x).\)

  • PDF/PMF: Sometimes, a random variable can also be described by density function $ f(x) $ that is related to its CDF by \(F_X(x) = \mathbb{P}(X \leq x) = \int_{-\infty}^x f(t)dt.\) When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.

  • Quantile: the quantile function specifies value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability. \(Q_X(p) = F^{-1}_X(p), \quad F(Q_X(p)) = \mathbb{P}( X \leq Q_X(p) ) = p.\) For example, the median of $X$ is $Q_X(0.5)$, that is, we try to find $q$ such that $\mathbb{P}(X \leq q) = 0.5$.

Discrete random variable

  • The number of possible values of $ X $ is finite, say, $x_1, x_2, x_3, \cdots, x_K$.
  • We replace a density with a probability mass function, a non-negative sequence that sums to one, i.e., \(f_X(x) = \mathbb{P}(X = x).\)
  • We replace integration with summation in the formula that relates a CDF to a probability mass function, that is, \(F_X(x) = \mathbb{P}(X \leq x) = \sum_{k=1}^K \mathbb{P}(X = x_k).\)

Continuous random variable

  • A continuous random variable is a random variable that has only continuous values. Continuous values are uncountable and are related to real numbers.

  • If $F_X(x)$ is differentiable, then $f_X(x) = F’_X(x)$.

  • The area under pdf curve is the probability.

rv_def.png

Explore scipy.stat

  1. Find the routine/document in scipy.stat
  2. Define a random variable
  3. methods: pdf, cdf, quantile, expectation, sampling, mean, std

Methods:

  • Continous random variable: cdf, pdf, ppf, random sampling
  • Discrete random variable: cdf, pmf, ppf, random sampling
## Define a continuous distribution
from scipy.stats import t, beta, lognorm, expon, gamma, uniform, norm, chi2
X = norm(loc=1, scale=10)   # Normal Distribution
## CDF: Cumulative Distribution Function
X.mean()
1.0
## CDF: Cumulative Distribution Function
X.cdf(1)
0.5
## PDF / PMF: Probability {Mass/Density} Functions
X.pdf(1)
0.03989422804014327
## PPF: Percent Point Function
X.ppf(.3)
-4.244005127080409
## generate random samples
X.rvs(10)
array([  2.40161217,  29.99044783,  -0.03616166,  -3.05208143,
        12.97647534,  23.63761371,   4.50803736,   9.37961747,
        -7.80510715, -15.14772065])
## Define a discrete distribution
from scipy.stats import poisson, binom

X = binom(n=2, p=0.5)
X.cdf(0.2)
0.25
print( X.ppf(0.2) )
print( X.ppf(0.6) )
print( X.ppf(0.99) )
0.0
1.0
2.0
print( X.pmf(0) )
print( X.pmf(1) )
print( X.pmf(2) )
0.25
0.5000000000000002
0.25
X.rvs(50)
array([2, 1, 1, 1, 0, 0, 1, 0, 0, 2, 2, 0, 1, 2, 0, 1, 1, 1, 0, 0, 2, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 2, 0, 1, 0, 2, 0, 1, 0, 2, 0, 0, 0, 2,
       0, 1, 1, 0, 1, 1])

InClass Practice

  • Define $Y$ as a geometric random variable, that is, $Y \sim \text{Geometric}(p=0.3)$
  • Compute std of $Y$
  • Compute the probability of $\mathbb{P}( Y \geq 2)$;
  • Compute the 75\% quantile of $Y$;
  • Generate 100 random samples from $Y$.

Law of Large Numbers (LLN)

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and tends to become closer to the expected value as more trials are performed.1

Statement of LLN

Let $ X_1, \ldots, X_n $ be independent and identically distributed scalar random variables, with common distribution $ F $.

When it exists, let $ \mu $ denote the common mean of this sample:

\[\mu := \mathbb E(X)\]

In addition, let

\[\bar X_n := \frac{1}{n} \sum_{i=1}^n X_i\]

(Strong) Law of Large Number (LLN) states that, if $ \mathbb E |X| $ is finite, then \(\bar X_n \to \mu, \quad \text{almost surely},\) that is \(\mathbb P \left\{ \bar X_n \to \mu \text{ as } n \to \infty \right\} = 1 \tag{1}\)

The LLN is important because it guarantees stable long-term results for the averages of some random events. Source

Simulation (LLN)

In this section, we want to illustrate the LLN using a simulated example.

  • For $i=1,\cdots, K$:
    • Generate $n$ samples $X_1, \cdots, X_n$, with different sample sizes $n = 10, 30, 50, 100, 500$;
    • Compute the mean of samples $\bar{X}_n$;
  • line plot $n$ vs. $\bar{X}_n$
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20, 10)  #set default figure size
import seaborn as sns
import random
import numpy as np
import pandas as pd
from scipy.stats import t, beta, lognorm, expon, gamma, uniform, cauchy
from scipy.stats import gaussian_kde, poisson, binom, norm, chi2
from scipy.linalg import inv, sqrtm
## inner-loop: one-round simulation
# Set parameters
n = 50                # Choice of sample size
X = expon(2)          # Exponential distribution, λ = 1/2
mu = X.mean()

print(np.mean(X.rvs(n)))
3.1366958653827144
df = {'sample_size': [], 'sample_mean': []}
n_sim = 100
n_range = [10, 30, 50, 100, 500, 1000, 3000, 5000]

for j in range(n_sim):
  for n in n_range:
    mu_tmp = np.mean(X.rvs(n))
    df['sample_size'].append(n)
    df['sample_mean'].append(mu_tmp)
df = pd.DataFrame(df)
df.head()
sample_sizesample_mean
0102.647633
1303.130454
2503.222623
31003.068034
45003.032763
sns.set()
sns.lineplot(data=df, x="sample_size", y="sample_mean", markers=True)
plt.axhline(y=X.mean(), xmin=0.05, color='red', alpha=0.5, linestyle='--')
plt.show()

png

InClass Practice

  • Is the LLN still correct if $X$ follows a student T distribution, e.g., $X \sim T(1)$;
  • Please run a similar simulation to demonstrate it.

Central Limit Theorem (CLT)

This notebook illustrates the most important theorem of probability and statistics: the central limit theorem (CLT).

Statement of CLT

In the classical IID setting, it tells us the following:

If the sequence $ X_1, \ldots, X_n $ is IID, with common mean $ \mu $ and common variance $ \sigma^2 \in (0, \infty) $, then

\[\frac{ \bar X_n - \mu }{\sigma / \sqrt{n}} \stackrel { d } {\to} N(0, 1) \quad \text{as} \quad n \to \infty \tag{2}\]

Here $ \stackrel { d } {\to} N(0, 1) $ indicates convergence in distribution to a centered (i.e, zero mean) normal with standard deviation $1$.

Roughly speaking, given a sufficiently large sample size, the (sampling) distribution of the sample mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population.

Simulation (CLT)

In this section, we design a simulation to verify CLT.

Toward this end, we now perform the following simulation

  • For $j=1, \cdots, m$:
    1. Choose an arbitrary distribution $ F $, and generate independent samples $ X_1, X_2, \cdots, X_n $.
    2. Compute the mean of each dataset $ Y_j := \bar X_n$.
  • Use these draws to plot the distribution (histogram) based on $Y_1, Y_2, \cdots, Y_m$.
  • Compare the result to the normal distribution.
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20, 10)  #set default figure size
import random
import numpy as np
from scipy.stats import t, beta, lognorm, expon, gamma, uniform, cauchy
from scipy.stats import gaussian_kde, poisson, binom, norm, chi2
from scipy.linalg import inv, sqrtm

Simulate for CLT

import seaborn as sns

# Set parameters
n = 250                  # Choice of n
k = 10000                # Number of draws of Y_n
distribution = expon(2)  # Exponential distribution, λ = 1/2
μ, s = distribution.mean(), distribution.std()

## Plot the original distribution
data = distribution.rvs(k)
sns.displot(data, kde=True)
plt.show()
## EXP distribution (not normal)

png

# Draw underlying RVs. Each row contains a draw of X_1,..,X_n
data = distribution.rvs((k, n))
# Compute mean of each row, producing k draws of \bar X_n
sample_means = data.mean(axis=1)
# Generate observations of Y_n
# Y = sample_means
Y = (sample_means - μ) * np.sqrt(n) / s

## Plot the distribution of sample_mean
sns.displot(Y, kde=True, stat='density')
plt.show()

## Compare with standard normal distribution
sns.displot(Y, kde=True, stat='density')
plt.hist(np.random.randn(k), density=True, bins='auto', alpha=0.3, color='red')
# sns.displot(np.random.randn(n), kde=True, stat='density', ax=ax)
plt.show()

png

png

Q: is there any better way to verify if data follows a normal distribution or a general distribution?

QQ-plot

What. A QQ plot is a graphical representation of the comparison between a theoretical distribution and the observed data set.

Aim. It is used to determine whether the data is likely to have come from the expected distribution.

How. To create a QQ plot, the data is sorted, then a straight line is drawn between the expected quantiles and the observed data points. If the data is consistent with the theoretical distribution, then the points will form a straight line. Otherwise, the points will deviate from the line, indicating that the data does not fit the expected distribution.

Python. We can use statsmodels.api to provide a QQ-plot of a dataset.

  • data: array_like
  • dist: Comparison distribution. The default is scipy.stats.distributions.norm (a standard normal).

PS. In most cases, QQ-plot is used to determine whether or not a set of data follows a normal distribution.

import statsmodels.api as sm

sm.qqplot(data=Y, line='45')
plt.show()

## Test for uniform distribution
U = uniform(0,1)
sm.qqplot(data=U.rvs(1000), line='45')
plt.show()

U = uniform(0,1)
sm.qqplot(data=U.rvs(1000), dist=U, line='45')
plt.show()
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

png

Trend over sample size

def sim_CLT(distribution, n, k=10000):
  # Set parameters
  # n = 250                  # Choice of n
  # k = 10000                # Number of draws of Y_n
  # distribution = expon(2)  # Exponential distribution, λ = 1/2
  μ, s = distribution.mean(), distribution.std()

  # Draw underlying RVs. Each row contains a draw of X_1,..,X_n
  data = distribution.rvs((k, n))
  # Compute mean of each row, producing k draws of \bar X_n
  sample_means = data.mean(axis=1)
  # Generate observations of Y_n
  # Y = sample_means
  Y = (sample_means - μ) * np.sqrt(n) / s

  ## Plot the distribution of sample_mean
  sns.displot(Y, kde=True, stat='density', height=10)
  sm.qqplot(data=Y, line='45')
  plt.show()
  return Y
distribution = gamma(0.23)
Y = sim_CLT(distribution=distribution, n=5)
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

Y = sim_CLT(distribution=distribution, n=20)
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

Y = sim_CLT(distribution=distribution, n=30)
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

Y = sim_CLT(distribution=distribution, n=50)
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

Y = sim_CLT(distribution=distribution, n=100)
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

Y = sim_CLT(distribution=distribution, n=1000)
/usr/local/lib/python3.8/dist-packages/statsmodels/graphics/gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
  ax.plot(x, y, fmt, **plot_style)

png

png

InClass Practice

  1. Use sim_CLT to verify the CLT for poisson distribution with $n=1000$.
  2. (hard) Let $Y_1, \cdots, Y_n$ be values from Q1, define $U_i = \texttt{norm().cdf}(Y_i)$, what is the distribution of $U_i$ supposed to be?
  3. Verify your conclusion by using QQ-plot.
  1. https://en.wikipedia.org/wiki/Law_of_large_numbers