Probability Distributions

Why do distributions matter?

Almost all statistical inference, hypothesis testing, and machine learning models are built on probability distributions. Being able to quickly pick the right distribution in an interview, and explain why you picked it, is a strong signal.

Probability Basics

Before discussing distributions, make sure the fundamentals are solid:

Random Variable

A random variable is a function that maps outcomes of a random experiment to numbers.

  • Discrete: possible values are finite or countable — die rolls (1-6), user click counts (0, 1, 2, ...)
  • Continuous: possible values form an interval — waiting times, heights, stock prices

Probability Rules

Interviews often test these basic rules through compound-event problems:

Addition rule (mutually exclusive events):

P(A \cup B) = P(A) + P(B) \quad \text{(if mutually exclusive)}

General addition rule:

P(A \cup B) = P(A) + P(B) - P(A \cap B)

Multiplication rule (independent events):

P(A \cap B) = P(A) \cdot P(B) \quad \text{(if independent)}

Conditional probability:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

Interview Classic: Independent vs Mutually Exclusive

Mutually exclusive: when A occurs, B cannot occur → P(A \cap B) = 0

Independent: whether A occurs has no effect on the probability of B → P(A \cap B) = P(A) \cdot P(B)

Pitfall: mutually exclusive events (with nonzero probabilities) are not independent! If A occurs, B definitely does not occur → A's occurrence changed B's probability.

Law of Total Probability

Decompose an event into mutually exclusive and exhaustive scenarios:

P(B) = \sum_{i} P(B \mid A_i) \cdot P(A_i)

This is the denominator of Bayes' theorem, and the key tool for solving conditional probability questions in interviews.
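As a sketch of how the decomposition feeds into Bayes' theorem, here is a small worked example (the supplier shares and defect rates below are made-up numbers for illustration):

```python
import numpy as np

# Hypothetical scenario: parts come from two suppliers.
# A1 ships 60% of parts with a 2% defect rate; A2 ships 40% with 5%.
p_a = np.array([0.6, 0.4])               # P(A_i): mutually exclusive, exhaustive
p_def_given_a = np.array([0.02, 0.05])   # P(B | A_i)

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_defect = np.sum(p_def_given_a * p_a)   # 0.6*0.02 + 0.4*0.05 = 0.032

# This is exactly the denominator of Bayes' theorem:
# P(A2 | B) = P(B | A2) P(A2) / P(B)
p_a2_given_def = p_def_given_a[1] * p_a[1] / p_defect  # 0.02/0.032 = 0.625

# Monte Carlo sanity check
rng = np.random.default_rng(0)
supplier = rng.choice(2, size=200_000, p=p_a)
defect = rng.random(200_000) < p_def_given_a[supplier]
print(p_defect, defect.mean())           # both ≈ 0.032
```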

PDF, PMF, and CDF

PDF vs PMF

|  | PMF (Probability Mass Function) | PDF (Probability Density Function) |
| --- | --- | --- |
| Applies to | Discrete random variables | Continuous random variables |
| P(X = x) | Gives probability directly | Always equals 0 |
| Value range | 0 ≤ P(X = x) ≤ 1 | f(x) ≥ 0, can exceed 1 |
| Compute probability | Summation | Integration |

PDF — for continuous random variables, f(x) describes the relative likelihood of X near x. The probability over an interval is the area under the curve:

P(a \le X \le b) = \int_a^b f(x)\,dx

PDF Value ≠ Probability

A common interviewer question: "Is the value of the PDF at a point a probability?" The answer is no. A PDF value can exceed 1 (for example, the PDF of Uniform(0, 0.5) equals 2), and only its integral is a probability. For a continuous distribution, P(X = x) = 0 at any single point.

CDF

Cumulative Distribution Function — F(x) = P(X \le x), the accumulated probability up to x:

F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{(continuous)} \qquad F(x) = \sum_{k \le x} P(X = k) \quad \text{(discrete)}

Three properties of the CDF (frequent interview topic):

  1. F(-\infty) = 0, F(\infty) = 1
  2. F is non-decreasing
  3. P(a < X \le b) = F(b) - F(a)
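These three properties are easy to check numerically; a minimal sketch using scipy's standard normal as the example distribution:

```python
import numpy as np
from scipy import stats

# Illustrating the three CDF properties with a standard normal.
dist = stats.norm(0, 1)

# 1. F(-inf) = 0 and F(inf) = 1 (checked far out in the tails)
print(dist.cdf(-10), dist.cdf(10))         # ≈ 0.0 and ≈ 1.0

# 2. F is non-decreasing
xs = np.linspace(-4, 4, 100)
assert np.all(np.diff(dist.cdf(xs)) >= 0)

# 3. P(a < X <= b) = F(b) - F(a): the ±1σ interval carries ≈ 68% mass
print(dist.cdf(1) - dist.cdf(-1))          # ≈ 0.6827
```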

Expected Value and Variance

Expected Value

Expected value (mean) — the theoretical long-run average:

E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx \quad \text{(continuous)} \qquad E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}

Linearity of expectation (no independence required!):

E[aX + bY + c] = aE[X] + bE[Y] + c

This is a powerful problem-solving tool in interviews. Example: "10 people get hats back at random — how many are expected to get their own hat?" Use linearity to split the problem into 10 indicator variables, each with expectation 1/10, so the answer is 10 \times 1/10 = 1.
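The hat problem is easy to verify by simulation — a sketch:

```python
import numpy as np

# Simulating the hat-check problem: 10 people, a random permutation of hats.
# Linearity of expectation predicts E[matches] = 10 * (1/10) = 1,
# even though the indicator variables are NOT independent.
rng = np.random.default_rng(42)
n_people, n_trials = 10, 100_000

matches = np.zeros(n_trials)
for t in range(n_trials):
    perm = rng.permutation(n_people)                   # random hat assignment
    matches[t] = np.sum(perm == np.arange(n_people))   # count of own-hat matches

print(matches.mean())  # ≈ 1.0
```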

Variance and Standard Deviation

Variance — measures the spread of data around the mean:

\text{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - (E[X])^2

The shortcut formula E[X^2] - (E[X])^2 is used constantly in derivations and interviews.

Variance properties:

\text{Var}(aX + b) = a^2\,\text{Var}(X) \qquad \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)

If X and Y are independent, \text{Cov}(X, Y) = 0, so \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).
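A quick numeric sanity check of both properties, using simulated independent normals:

```python
import numpy as np

# Numeric check of the variance properties on simulated data.
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=200_000)    # Var(X) ≈ 4
y = rng.normal(0, 1, size=200_000)     # independent of x, Var(Y) ≈ 1
a, b = 3.0, 7.0

# Var(aX + b) = a^2 Var(X): shifting by b adds no spread
lhs, rhs = np.var(a * x + b), a**2 * np.var(x)
print(lhs, rhs)                         # both ≈ 36

# Independent X, Y: Var(X + Y) = Var(X) + Var(Y)
print(np.var(x + y), np.var(x) + np.var(y))   # both ≈ 5
```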

Covariance and Correlation

Covariance — measures how two variables move together:

\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]

Correlation — standardized covariance, bounded in [-1, 1]:

\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

Correlation ≠ Independence

\rho = 0 only means there is no linear relationship; it does not imply independence. Example: X \sim \text{Uniform}(-1, 1) and Y = X^2 give \rho = 0 (no linear correlation), yet Y is completely determined by X (a strong nonlinear dependence).

The converse does hold: independence always implies \rho = 0.
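The Uniform/X² counterexample above takes only a few lines to verify:

```python
import numpy as np

# Classic counterexample: Y = X^2 with X ~ Uniform(-1, 1).
# Correlation ≈ 0, but Y is a deterministic function of X.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500_000)
y = x**2

rho = np.corrcoef(x, y)[0, 1]
print(rho)                        # ≈ 0: no linear relationship

# Yet knowing X pins down Y exactly (perfect dependence):
print(np.max(np.abs(y - x**2)))   # 0.0
```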

Chebyshev's Inequality

For any random variable XX with mean μ\mu and finite variance σ2\sigma^2:

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

No knowledge of the distribution's shape is needed — the mean and variance alone are enough to bound the probability.

| k | At most this fraction outside ±kσ | Compare with Normal |
| --- | --- | --- |
| 1 | 100% (trivial bound) | 31.7% |
| 2 | 25% | 4.6% |
| 3 | 11.1% | 0.3% |
| 4 | 6.25% | 0.006% |

Chebyshev's bound is much looser than the Normal 68-95-99.7 rule — precisely because it applies to any distribution. If you know the distribution is Normal, the 68-95-99.7 rule is far tighter.

ML Connections

| Application | How Chebyshev Is Used |
| --- | --- |
| Outlier detection | A data point beyond 3σ → by Chebyshev, at most 11% of the data lies there → likely an outlier |
| Convergence bounds | Error bound for the sample mean: P(\|X̄ − μ\| ≥ ε) ≤ σ²/(nε²) (a weak form of the Law of Large Numbers) |
| Feature engineering | Clipping features at ±3σ — Chebyshev guarantees at most 11% get clipped |
| Distribution-free inference | When the distribution is unknown, Chebyshev gives a conservative bound |

```python
import numpy as np

# Verify Chebyshev on different distributions
for dist_name, samples in [
    ("Normal", np.random.normal(0, 1, 100000)),
    ("Exponential", np.random.exponential(1, 100000)),
    ("Uniform", np.random.uniform(-1, 1, 100000)),
]:
    mu, sigma = samples.mean(), samples.std()
    for k in [2, 3]:
        actual = np.mean(np.abs(samples - mu) >= k * sigma)
        chebyshev_bound = 1 / k**2
        print(f"{dist_name}: P(|X-μ|≥{k}σ) = {actual:.4f} ≤ {chebyshev_bound:.4f} (Chebyshev)")
    # All satisfy the bound, but Normal is much tighter than the bound
```

Interview Link: Why the 3-Sigma Rule?

"Why use ±3σ as an outlier threshold?" — If the data is Normal → only 0.3% lies outside (68-95-99.7 rule). If the distribution is unknown → Chebyshev guarantees at most 11% lies outside. That makes 3σ a reasonable distribution-free threshold.

Key Distributions

Bernoulli Distribution

The simplest distribution: a single trial with two outcomes (success/failure).

  • X \in \{0, 1\}, with P(X = 1) = p
  • E[X] = p, Var(X) = p(1-p)

Use case: whether a single click converts, whether an email is opened, whether a transaction is fraudulent.

Binomial Distribution

The number of successes in n independent Bernoulli trials:

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n
  • E[X] = np, Var(X) = np(1-p)

Use case: how many of 100 users convert, how many defective items in a batch of 1000.

Normal approximation: when np \geq 5 and n(1-p) \geq 5, \text{Binomial}(n, p) \approx N(np, np(1-p)). This is the theoretical basis of the z-test for proportions in A/B testing.
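A sketch comparing an exact Binomial tail probability with its Normal approximation (the specific n, p, and cutoff are illustrative; the continuity correction adds 0.5 to the cutoff):

```python
import numpy as np
from scipy import stats

# Binomial(1000, 0.1): np = 100 and n(1-p) = 900, both far above 5,
# so the Normal approximation should be very close.
n, p = 1000, 0.1
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.cdf(110, n, p)            # P(X <= 110), exact
approx = stats.norm.cdf(110.5, mu, sigma)     # Normal with continuity correction

print(exact, approx)  # the two agree to about 2-3 decimal places
```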

Geometric Distribution

The number of trials needed to get the first success:

P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \dots
  • E[X] = 1/p, Var(X) = (1-p)/p^2
  • Also has the memoryless property — the discrete counterpart of Exponential

Use case: how many ads a user sees on average before clicking, how many support calls it takes before an issue is resolved.

Negative Binomial Distribution

The number of failures before the r-th success:

P(X = k) = \binom{k+r-1}{k}(1-p)^k p^r, \quad k = 0, 1, 2, \dots
  • E[X] = r(1-p)/p
  • Geometric is the special case when r = 1 (counting failures rather than trials, so the support starts at 0)

Use case: commonly used to model overdispersed count data (counts whose variance exceeds the mean), such as page clicks or insurance claim counts. When the variance is clearly larger than the mean, Poisson fits poorly and Negative Binomial is the alternative.

Poisson Distribution

Models the number of events in a fixed interval of time or space, when events occur independently at a constant average rate \lambda:

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
  • E[X] = \lambda, Var(X) = \lambda
  • Mean equals variance — the signature property

Use case: complaints received per hour, server errors per day, site clicks per minute. Mean equal to variance is the easiest way to recognize a Poisson distribution.

Three assumptions (frequent interview topic):

  1. Events in non-overlapping intervals are independent
  2. The average rate is constant over time
  3. At most one event can occur in an infinitesimally small interval

Poisson vs Binomial

When n is large and p is small, \text{Binomial}(n, p) \approx \text{Poisson}(\lambda = np). In interviews, when you see "rare events over many trials," think Poisson first.

Rule of thumb: if n > 20 and p < 0.05 (or np < 10), the Poisson approximation is acceptable.
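A sketch comparing the two PMFs in an illustrative rare-event setting (the values of n and p are made up to satisfy the rule of thumb):

```python
from scipy import stats

# Binomial(1000, 0.003) vs Poisson(3): rare events over many trials.
n, p = 1000, 0.003        # n > 20, p < 0.05
lam = n * p               # λ = np = 3

for k in range(6):
    b = stats.binom.pmf(k, n, p)
    po = stats.poisson.pmf(k, lam)
    print(f"k={k}: binomial={b:.4f}, poisson={po:.4f}")
# The two PMFs agree to roughly 3 decimal places
```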

Normal (Gaussian) Distribution

The most important distribution in statistics, defined by mean \mu and variance \sigma^2:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
  • E[X] = \mu, Var(X) = \sigma^2
  • The 68-95-99.7 rule: approximately 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ

Standard Normal: Z = \frac{X - \mu}{\sigma} \sim N(0, 1). Any normal can be standardized to Z.

Why it appears everywhere: the Central Limit Theorem — whatever the population distribution, the sample mean approaches Normal once n is large enough. This is what lets z-tests and t-tests work with large samples.

Key properties (interview bonus points):

  • Symmetric: mean = median = mode
  • A linear combination of normals is also normal: aX + bY \sim N(a\mu_X + b\mu_Y, a^2\sigma_X^2 + b^2\sigma_Y^2) (if X, Y independent)
  • Skewness = 0, Kurtosis = 3 (excess kurtosis = 0)

Log-Normal Distribution

If \ln(X) \sim N(\mu, \sigma^2), then X follows a Log-Normal distribution.

  • E[X] = e^{\mu + \sigma^2/2}
  • Right-skewed, positive-only, arises from multiplicative processes

Use case: income distributions, stock prices, city populations, file sizes — any data built from many independent multiplicative factors is a good fit for Log-Normal.

Interview Judgment: Normal vs Log-Normal

If your data:

  • has negative values → not Log-Normal
  • is right-skewed and positive-only → consider Log-Normal
  • looks symmetric after taking the log → very likely Log-Normal

Many ML targets (house prices, transaction amounts) regress better after a log transform — precisely because the raw data is closer to Log-Normal.

Exponential Distribution

Models the time between events in a Poisson process:

f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
  • E[X] = 1/\lambda, Var(X) = 1/\lambda^2
  • Memoryless property: P(X > s + t \mid X > s) = P(X > t) — the remaining waiting time doesn't depend on how long you've already waited
  • Exponential is the only continuous distribution with the memoryless property

Use case: time until the next customer arrives, intervals between server failures, radioactive decay.

Poisson-Exponential duality: if event counts follow Poisson(\lambda), the waiting times between events follow Exponential(\lambda). They are two views of the same Poisson process: one counts events, the other measures the gaps.
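The duality can be seen by simulating one process both ways — a sketch (the rate and time horizon are arbitrary choices):

```python
import numpy as np

# One Poisson process, two views: exponential gaps vs per-interval counts.
rng = np.random.default_rng(7)
lam, horizon = 4.0, 10_000.0          # rate: 4 events per unit time

# Draw exponential gaps, accumulate into event times, keep those in [0, horizon)
gaps = rng.exponential(1 / lam, size=int(lam * horizon * 1.2))
times = np.cumsum(gaps)
times = times[times < horizon]

# Count events per unit interval: should be Poisson(4) → mean ≈ variance ≈ 4
counts = np.bincount(times.astype(int), minlength=int(horizon))
print(counts.mean(), counts.var())    # both ≈ 4

# Gaps between events: should be Exponential(4) → mean ≈ 1/λ = 0.25
print(np.diff(times).mean())          # ≈ 0.25
```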

Gamma Distribution

A generalization of the Exponential — the waiting time until the k-th event:

f(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!}, \quad x \ge 0
  • E[X] = k/\lambda, Var(X) = k/\lambda^2
  • Exponential is the special case when k = 1
  • Chi-squared distribution is Gamma(k/2, 1/2)

(This is the integer-k form; for general k, the (k-1)! is replaced by \Gamma(k).)

Use case: total waiting time until the 5th customer arrives; the conjugate prior for the Poisson rate in Bayesian analysis.

Beta Distribution

Defined on [0, 1], with shape controlled by \alpha and \beta:

f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1
  • E[X] = \frac{\alpha}{\alpha + \beta}, Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
  • Extremely flexible shape: \alpha = \beta = 1 is Uniform, \alpha = \beta > 1 is a symmetric bell, \alpha \neq \beta is skewed

Use case: modeling probabilities or proportions (CTR, conversion rate). In Bayesian inference it is the conjugate prior for the Binomial likelihood.

Intuition: Beta(\alpha, \beta) can be read as "your belief about the success rate after having seen \alpha - 1 successes and \beta - 1 failures."
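That intuition is exactly the Beta-Binomial conjugate update; a minimal sketch with made-up A/B-test data:

```python
from scipy import stats

# Conjugate update: Beta(a, b) prior + Binomial data
# → Beta(a + successes, b + failures) posterior.
a, b = 1, 1                      # uniform prior over the conversion rate
successes, failures = 30, 70     # hypothetical data: 30 of 100 users converted

post = stats.beta(a + successes, b + failures)
print(post.mean())               # posterior mean = 31/102 ≈ 0.304
print(post.interval(0.95))       # 95% credible interval for the rate
```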

Uniform Distribution

All values equally likely over an interval [a, b]:

f(x) = \frac{1}{b-a}, \quad a \le x \le b
  • E[X] = \frac{a+b}{2}, Var(X) = \frac{(b-a)^2}{12}
  • Maximum entropy distribution for bounded support — the most "uninformative" distribution when only the bounds are known

Use case: random number generation, modeling complete uncertainty, the ideal output of a hash function.

Relationships Between Distributions

Understanding how distributions relate to one another is a strong bonus point in interviews:

  • Bernoulli → Binomial (sum of n trials) → Normal (large n)
  • Binomial → Poisson (large n, small p)
  • Poisson ↔ Exponential (counts vs. gaps) → Gamma (sum of k gaps)
  • Beta is the conjugate prior for the Binomial; Gamma for the Poisson

Real-World Use Cases

Case 1: Credit Card Fraud Detection

Data characteristics: 99.9% legitimate transactions, 0.1% fraud → extremely imbalanced

  • Fraud occurrences can be modeled with Poisson (average frauds per hour)
  • Gaps between frauds with Exponential
  • Whether each transaction is fraud is Bernoulli(p = 0.001)
  • Number of frauds among 100 transactions → Binomial(n=100, p=0.001) ≈ Poisson(0.1)
  • Transaction amounts are usually right-skewed → Log-Normal (take the log for feature engineering)

Case 2: House Price Prediction

  • Prices themselves are right-skewed → Log-Normal (log-transform the target)
  • Regression assumes the residuals follow a Normal distribution
  • Features like floor area are close to Normal (possibly with some skew)
  • Number of sales per month → Poisson

Case 3: Customer Segmentation

  • Monthly purchase counts per customer → Poisson or Negative Binomial (if overdispersed)
  • Prior on conversion rates → Beta distribution
  • Purchase amounts → Log-Normal
  • GMM (Gaussian Mixture Model) assumes each cluster is Normal

Quick Reference Table

| Distribution | Type | Parameters | Mean | Variance | Key Property |
| --- | --- | --- | --- | --- | --- |
| Bernoulli | Discrete | p | p | p(1−p) | Single binary trial |
| Binomial | Discrete | n, p | np | np(1−p) | Sum of Bernoulli trials |
| Geometric | Discrete | p | 1/p | (1−p)/p² | Memoryless (discrete) |
| Neg. Binomial | Discrete | r, p | r(1−p)/p | r(1−p)/p² | Overdispersed counts |
| Poisson | Discrete | λ | λ | λ | Mean = Variance |
| Normal | Continuous | μ, σ | μ | σ² | CLT makes it universal |
| Log-Normal | Continuous | μ, σ | e^(μ+σ²/2) | (e^(σ²)−1)e^(2μ+σ²) | Multiplicative processes |
| Exponential | Continuous | λ | 1/λ | 1/λ² | Memoryless (continuous) |
| Gamma | Continuous | k, λ | k/λ | k/λ² | Sum of Exponentials |
| Beta | Continuous | α, β | α/(α+β) | see formula | Conjugate prior for proportions |
| Uniform | Continuous | a, b | (a+b)/2 | (b−a)²/12 | Maximum entropy (bounded) |

Hands-on: Distributions in Python

Sampling & Verifying Distribution Properties

```python
import numpy as np
from scipy import stats

# Verify Poisson: mean ≈ variance (defining property)
samples = np.random.poisson(lam=5, size=10000)
# sample mean ≈ 5.0, sample variance ≈ 5.0 → Mean = Variance ✓

# Verify Exponential: memoryless property
# P(X > s+t | X > s) should equal P(X > t)
exp_samples = np.random.exponential(scale=0.5, size=100000)
p_conditional = np.mean(exp_samples[exp_samples > 1] > 3)  # P(X>3 | X>1)
p_marginal = np.mean(exp_samples > 2)                       # P(X>2)
# Both ≈ 0.018 → memoryless ✓

# Verify CLT: sample means become normal regardless of source distribution
population = np.random.exponential(1, size=100000)  # heavily skewed
sample_means = [np.random.choice(population, 50).mean() for _ in range(5000)]
_, p_value = stats.normaltest(sample_means)
# p_value > 0.05 → sample means are normally distributed ✓
```

Fitting Distributions to Data

```python
from scipy import stats
import numpy as np

# Transaction amounts (right-skewed, positive-only → try Log-Normal)
amounts = np.random.lognormal(mean=3, sigma=1, size=1000)

# Fit log-normal
shape, loc, scale = stats.lognorm.fit(amounts, floc=0)
# KS test: does the data fit the distribution?
ks_stat, p_value = stats.kstest(amounts, "lognorm", args=(shape, loc, scale))
# p > 0.05 → cannot reject that data follows log-normal ✓

# Compare: if you take log, it should look normal
log_amounts = np.log(amounts)
_, p_normal = stats.normaltest(log_amounts)
# p > 0.05 → log(amounts) is normally distributed ✓
```


Interview Signals

What interviewers listen for:

  • You pick the right distribution for the scenario instead of blindly applying formulas
  • You understand each distribution's assumptions (e.g., Poisson requires independent events at a constant rate)
  • You can explain the relationships between distributions (Bernoulli → Binomial → Normal; Poisson ↔ Exponential)
  • You know when to use approximations (Binomial → Poisson, Binomial → Normal)
  • You can distinguish correlation from independence
  • You know that a PDF value is not a probability

Practice

Flashcards


Q: What special property makes a Poisson distribution instantly recognizable?

A: Mean equals variance (mean = variance = λ). If a dataset's sample mean and sample variance are close, Poisson is a strong candidate. If the variance is clearly larger than the mean (overdispersion), consider Negative Binomial.

Quiz


A website receives an average of 5 complaints per hour. Which distribution should you use to compute the probability of receiving 0 complaints in a given hour? (Poisson with λ = 5: P(X = 0) = e^{-5} ≈ 0.0067.)
