Probability Distributions

Why do distributions matter?

Almost all statistical inference, hypothesis testing, and machine learning models are built on probability distributions. Being able to quickly pick the right distribution in an interview, and explain why you picked it, is a strong signal.

Probability Basics

Before discussing distributions, make sure the fundamentals are solid:

Random Variable

A random variable is a function that maps outcomes of a random experiment to numbers.

  • Discrete: possible values are finite or countable — die rolls (1-6), user click counts (0, 1, 2, ...)
  • Continuous: possible values form an interval — waiting times, heights, stock prices

Probability Rules

Interviews often test these basic rules through compound-event problems:

Addition rule (mutually exclusive events):

P(A \cup B) = P(A) + P(B) \quad \text{(if mutually exclusive)}

General addition rule:

P(A \cup B) = P(A) + P(B) - P(A \cap B)

Multiplication rule (independent events):

P(A \cap B) = P(A) \cdot P(B) \quad \text{(if independent)}

Conditional probability:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

Interview Classic: Independent vs Mutually Exclusive

Mutually exclusive: when A occurs, B cannot occur → P(A \cap B) = 0

Independent: whether A occurs has no effect on the probability of B → P(A \cap B) = P(A) \cdot P(B)

Pitfall: mutually exclusive events (with nonzero probabilities) are not independent! If A occurs, B definitely does not occur → A's occurrence changed B's probability.

Law of Total Probability

Decompose an event into mutually exclusive and exhaustive scenarios:

P(B) = \sum_{i} P(B \mid A_i) \cdot P(A_i)

This is the denominator of Bayes' theorem, and the key tool for solving conditional probability questions in interviews.
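As a sketch of how the decomposition feeds into Bayes' theorem, here is a small worked example (the supplier shares and defect rates below are made-up numbers for illustration):

```python
import numpy as np

# Hypothetical scenario: parts come from two suppliers.
# A1 ships 60% of parts with a 2% defect rate; A2 ships 40% with 5%.
p_a = np.array([0.6, 0.4])               # P(A_i): mutually exclusive, exhaustive
p_def_given_a = np.array([0.02, 0.05])   # P(B | A_i)

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_defect = np.sum(p_def_given_a * p_a)   # 0.6*0.02 + 0.4*0.05 = 0.032

# This is exactly the denominator of Bayes' theorem:
# P(A2 | B) = P(B | A2) P(A2) / P(B)
p_a2_given_def = p_def_given_a[1] * p_a[1] / p_defect  # 0.02/0.032 = 0.625

# Monte Carlo sanity check
rng = np.random.default_rng(0)
supplier = rng.choice(2, size=200_000, p=p_a)
defect = rng.random(200_000) < p_def_given_a[supplier]
print(p_defect, defect.mean())           # both ≈ 0.032
```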

PDF, PMF, and CDF

PDF vs PMF

|  | PMF (Probability Mass Function) | PDF (Probability Density Function) |
| --- | --- | --- |
| Applies to | Discrete random variables | Continuous random variables |
| P(X = x) | Gives probability directly | Always equals 0 |
| Value range | 0 ≤ P(X = x) ≤ 1 | f(x) ≥ 0, can exceed 1 |
| Compute probability | Summation | Integration |

PDF — for continuous random variables, f(x) describes the relative likelihood of X near x. The probability over an interval is the area under the curve:

P(a \le X \le b) = \int_a^b f(x)\,dx

PDF Value ≠ Probability

A common interviewer question: "Is the value of the PDF at a point a probability?" The answer is no. A PDF value can exceed 1 (for example, the PDF of Uniform(0, 0.5) equals 2), and only its integral is a probability. For a continuous distribution, P(X = x) = 0 at any single point.

CDF

Cumulative Distribution Function — F(x) = P(X \le x), the accumulated probability up to x:

F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{(continuous)} \qquad F(x) = \sum_{k \le x} P(X = k) \quad \text{(discrete)}

Three properties of the CDF (frequent interview topic):

  1. F(-\infty) = 0, F(\infty) = 1
  2. F is non-decreasing
  3. P(a < X \le b) = F(b) - F(a)
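These three properties are easy to check numerically; a minimal sketch using scipy's standard normal as the example distribution:

```python
import numpy as np
from scipy import stats

# Illustrating the three CDF properties with a standard normal.
dist = stats.norm(0, 1)

# 1. F(-inf) = 0 and F(inf) = 1 (checked far out in the tails)
print(dist.cdf(-10), dist.cdf(10))         # ≈ 0.0 and ≈ 1.0

# 2. F is non-decreasing
xs = np.linspace(-4, 4, 100)
assert np.all(np.diff(dist.cdf(xs)) >= 0)

# 3. P(a < X <= b) = F(b) - F(a): the ±1σ interval carries ≈ 68% mass
print(dist.cdf(1) - dist.cdf(-1))          # ≈ 0.6827
```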

Expected Value and Variance

Expected Value

Expected value (mean) — the theoretical long-run average:

E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx \quad \text{(continuous)} \qquad E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}

Linearity of expectation (no independence required!):

E[aX + bY + c] = aE[X] + bE[Y] + c

This is a powerful problem-solving tool in interviews. Example: "10 people get hats back at random — how many are expected to get their own hat?" Use linearity to split the problem into 10 indicator variables, each with expectation 1/10, so the answer is 10 \times 1/10 = 1.
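The hat problem is easy to verify by simulation — a sketch:

```python
import numpy as np

# Simulating the hat-check problem: 10 people, a random permutation of hats.
# Linearity of expectation predicts E[matches] = 10 * (1/10) = 1,
# even though the indicator variables are NOT independent.
rng = np.random.default_rng(42)
n_people, n_trials = 10, 100_000

matches = np.zeros(n_trials)
for t in range(n_trials):
    perm = rng.permutation(n_people)                   # random hat assignment
    matches[t] = np.sum(perm == np.arange(n_people))   # count of own-hat matches

print(matches.mean())  # ≈ 1.0
```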

Variance and Standard Deviation

Variance — measures the spread of data around the mean:

\text{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - (E[X])^2

The shortcut formula E[X^2] - (E[X])^2 is used constantly in derivations and interviews.

Variance properties:

\text{Var}(aX + b) = a^2\,\text{Var}(X) \qquad \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)

If X and Y are independent, \text{Cov}(X, Y) = 0, so \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).
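A quick numeric sanity check of both properties, using simulated independent normals:

```python
import numpy as np

# Numeric check of the variance properties on simulated data.
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=200_000)    # Var(X) ≈ 4
y = rng.normal(0, 1, size=200_000)     # independent of x, Var(Y) ≈ 1
a, b = 3.0, 7.0

# Var(aX + b) = a^2 Var(X): shifting by b adds no spread
lhs, rhs = np.var(a * x + b), a**2 * np.var(x)
print(lhs, rhs)                         # both ≈ 36

# Independent X, Y: Var(X + Y) = Var(X) + Var(Y)
print(np.var(x + y), np.var(x) + np.var(y))   # both ≈ 5
```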

Covariance and Correlation

Covariance — measures how two variables move together:

\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]

Correlation — standardized covariance, bounded in [-1, 1]:

\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

Correlation ≠ Independence

\rho = 0 only means there is no linear relationship; it does not imply independence. Example: X \sim \text{Uniform}(-1, 1) and Y = X^2 give \rho = 0 (no linear correlation), yet Y is completely determined by X (a strong nonlinear dependence).

The converse does hold: independence always implies \rho = 0.
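The Uniform/X² counterexample above takes only a few lines to verify:

```python
import numpy as np

# Classic counterexample: Y = X^2 with X ~ Uniform(-1, 1).
# Correlation ≈ 0, but Y is a deterministic function of X.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500_000)
y = x**2

rho = np.corrcoef(x, y)[0, 1]
print(rho)                        # ≈ 0: no linear relationship

# Yet knowing X pins down Y exactly (perfect dependence):
print(np.max(np.abs(y - x**2)))   # 0.0
```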

Chebyshev's Inequality

For any random variable XX with mean μ\mu and finite variance σ2\sigma^2:

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

No knowledge of the distribution's shape is needed — the mean and variance alone are enough to bound the probability.

| k | At most this fraction outside ±kσ | Compare with Normal |
| --- | --- | --- |
| 1 | 100% (trivial bound) | 31.7% |
| 2 | 25% | 4.6% |
| 3 | 11.1% | 0.3% |
| 4 | 6.25% | 0.006% |

Chebyshev's bound is much looser than the Normal 68-95-99.7 rule — precisely because it applies to any distribution. If you know the distribution is Normal, the 68-95-99.7 rule is far tighter.

ML Connections

| Application | How Chebyshev Is Used |
| --- | --- |
| Outlier detection | A data point beyond 3σ → by Chebyshev, at most 11% of the data lies there → likely an outlier |
| Convergence bounds | Error bound for the sample mean: P(\|X̄ − μ\| ≥ ε) ≤ σ²/(nε²) (a weak form of the Law of Large Numbers) |
| Feature engineering | Clipping features at ±3σ — Chebyshev guarantees at most 11% get clipped |
| Distribution-free inference | When the distribution is unknown, Chebyshev gives a conservative bound |

```python
import numpy as np

# Verify Chebyshev on different distributions
for dist_name, samples in [
    ("Normal", np.random.normal(0, 1, 100000)),
    ("Exponential", np.random.exponential(1, 100000)),
    ("Uniform", np.random.uniform(-1, 1, 100000)),
]:
    mu, sigma = samples.mean(), samples.std()
    for k in [2, 3]:
        actual = np.mean(np.abs(samples - mu) >= k * sigma)
        chebyshev_bound = 1 / k**2
        print(f"{dist_name}: P(|X-μ|≥{k}σ) = {actual:.4f} ≤ {chebyshev_bound:.4f} (Chebyshev)")
    # All satisfy the bound, but Normal is much tighter than the bound
```

Interview Link: Why the 3-Sigma Rule?

"Why use ±3σ as an outlier threshold?" — If the data is Normal → only 0.3% lies outside (68-95-99.7 rule). If the distribution is unknown → Chebyshev guarantees at most 11% lies outside. That makes 3σ a reasonable distribution-free threshold.

Key Distributions

Bernoulli Distribution

The simplest distribution: a single trial with two outcomes (success/failure).

  • X \in \{0, 1\}, with P(X = 1) = p
  • E[X] = p, Var(X) = p(1-p)

Use case: whether a single click converts, whether an email is opened, whether a transaction is fraudulent.

Binomial Distribution

The number of successes in n independent Bernoulli trials:

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n
  • E[X] = np, Var(X) = np(1-p)

Use case: how many of 100 users convert, how many defective items in a batch of 1000.

Normal approximation: when np \geq 5 and n(1-p) \geq 5, \text{Binomial}(n, p) \approx N(np, np(1-p)). This is the theoretical basis of the z-test for proportions in A/B testing.
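A sketch comparing an exact Binomial tail probability with its Normal approximation (the specific n, p, and cutoff are illustrative; the continuity correction adds 0.5 to the cutoff):

```python
import numpy as np
from scipy import stats

# Binomial(1000, 0.1): np = 100 and n(1-p) = 900, both far above 5,
# so the Normal approximation should be very close.
n, p = 1000, 0.1
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.cdf(110, n, p)            # P(X <= 110), exact
approx = stats.norm.cdf(110.5, mu, sigma)     # Normal with continuity correction

print(exact, approx)  # the two agree to about 2-3 decimal places
```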

Geometric Distribution

The number of trials needed to get the first success:

P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \dots
  • E[X] = 1/p, Var(X) = (1-p)/p^2
  • Also has the memoryless property — the discrete counterpart of Exponential

Use case: how many ads a user sees on average before clicking, how many support calls it takes before an issue is resolved.

Negative Binomial Distribution

The number of failures before the r-th success:

P(X = k) = \binom{k+r-1}{k}(1-p)^k p^r, \quad k = 0, 1, 2, \dots
  • E[X] = r(1-p)/p
  • Geometric is the special case when r = 1 (counting failures rather than trials, so the support starts at 0)

Use case: commonly used to model overdispersed count data (counts whose variance exceeds the mean), such as page clicks or insurance claim counts. When the variance is clearly larger than the mean, Poisson fits poorly and Negative Binomial is the alternative.

Poisson Distribution

Models the number of events in a fixed interval of time or space, when events occur independently at a constant average rate \lambda:

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
  • E[X] = \lambda, Var(X) = \lambda
  • Mean equals variance — the signature property

Use case: complaints received per hour, server errors per day, site clicks per minute. Mean equal to variance is the easiest way to recognize a Poisson distribution.

Three assumptions (frequent interview topic):

  1. Events in non-overlapping intervals are independent
  2. The average rate is constant over time
  3. At most one event can occur in an infinitesimally small interval

Poisson vs Binomial

When n is large and p is small, \text{Binomial}(n, p) \approx \text{Poisson}(\lambda = np). In interviews, when you see "rare events over many trials," think Poisson first.

Rule of thumb: if n > 20 and p < 0.05 (or np < 10), the Poisson approximation is acceptable.
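A sketch comparing the two PMFs in an illustrative rare-event setting (the values of n and p are made up to satisfy the rule of thumb):

```python
from scipy import stats

# Binomial(1000, 0.003) vs Poisson(3): rare events over many trials.
n, p = 1000, 0.003        # n > 20, p < 0.05
lam = n * p               # λ = np = 3

for k in range(6):
    b = stats.binom.pmf(k, n, p)
    po = stats.poisson.pmf(k, lam)
    print(f"k={k}: binomial={b:.4f}, poisson={po:.4f}")
# The two PMFs agree to roughly 3 decimal places
```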

Normal (Gaussian) Distribution

The most important distribution in statistics, defined by mean \mu and variance \sigma^2:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
  • E[X] = \mu, Var(X) = \sigma^2
  • The 68-95-99.7 rule: approximately 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ

Standard Normal: Z = \frac{X - \mu}{\sigma} \sim N(0, 1). Any normal can be standardized to Z.

Why it appears everywhere: the Central Limit Theorem — whatever the population distribution, the sample mean approaches Normal once n is large enough. This is what lets z-tests and t-tests work with large samples.

Key properties (interview bonus points):

  • Symmetric: mean = median = mode
  • A linear combination of normals is also normal: aX + bY \sim N(a\mu_X + b\mu_Y, a^2\sigma_X^2 + b^2\sigma_Y^2) (if X, Y independent)
  • Skewness = 0, Kurtosis = 3 (excess kurtosis = 0)

Log-Normal Distribution

If \ln(X) \sim N(\mu, \sigma^2), then X follows a Log-Normal distribution.

  • E[X] = e^{\mu + \sigma^2/2}
  • Right-skewed, positive-only, arises from multiplicative processes

Use case: income distributions, stock prices, city populations, file sizes — any data built from many independent multiplicative factors is a good fit for Log-Normal.

Interview Judgment: Normal vs Log-Normal

If your data:

  • has negative values → not Log-Normal
  • is right-skewed and positive-only → consider Log-Normal
  • looks symmetric after taking the log → very likely Log-Normal

Many ML targets (house prices, transaction amounts) regress better after a log transform — precisely because the raw data is closer to Log-Normal.

Exponential Distribution

Models the time between events in a Poisson process:

f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
  • E[X] = 1/\lambda, Var(X) = 1/\lambda^2
  • Memoryless property: P(X > s + t \mid X > s) = P(X > t) — the remaining waiting time doesn't depend on how long you've already waited
  • Exponential is the only continuous distribution with the memoryless property

Use case: time until the next customer arrives, intervals between server failures, radioactive decay.

Poisson-Exponential duality: if event counts follow Poisson(\lambda), the waiting times between events follow Exponential(\lambda). They are two views of the same Poisson process: one counts events, the other measures the gaps.
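The duality can be seen by simulating one process both ways — a sketch (the rate and time horizon are arbitrary choices):

```python
import numpy as np

# One Poisson process, two views: exponential gaps vs per-interval counts.
rng = np.random.default_rng(7)
lam, horizon = 4.0, 10_000.0          # rate: 4 events per unit time

# Draw exponential gaps, accumulate into event times, keep those in [0, horizon)
gaps = rng.exponential(1 / lam, size=int(lam * horizon * 1.2))
times = np.cumsum(gaps)
times = times[times < horizon]

# Count events per unit interval: should be Poisson(4) → mean ≈ variance ≈ 4
counts = np.bincount(times.astype(int), minlength=int(horizon))
print(counts.mean(), counts.var())    # both ≈ 4

# Gaps between events: should be Exponential(4) → mean ≈ 1/λ = 0.25
print(np.diff(times).mean())          # ≈ 0.25
```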

Gamma Distribution

A generalization of the Exponential — the waiting time until the k-th event:

f(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!}, \quad x \ge 0
  • E[X] = k/\lambda, Var(X) = k/\lambda^2
  • Exponential is the special case when k = 1
  • Chi-squared distribution is Gamma(k/2, 1/2)

(This is the integer-k form; for general k, the (k-1)! is replaced by \Gamma(k).)

Use case: total waiting time until the 5th customer arrives; the conjugate prior for the Poisson rate in Bayesian analysis.

Beta Distribution

Defined on [0, 1], with shape controlled by \alpha and \beta:

f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1
  • E[X] = \frac{\alpha}{\alpha + \beta}, Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
  • Extremely flexible shape: \alpha = \beta = 1 is Uniform, \alpha = \beta > 1 is a symmetric bell, \alpha \neq \beta is skewed

Use case: modeling probabilities or proportions (CTR, conversion rate). In Bayesian inference it is the conjugate prior for the Binomial likelihood.

Intuition: Beta(\alpha, \beta) can be read as "your belief about the success rate after having seen \alpha - 1 successes and \beta - 1 failures."
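That intuition is exactly the Beta-Binomial conjugate update; a minimal sketch with made-up A/B-test data:

```python
from scipy import stats

# Conjugate update: Beta(a, b) prior + Binomial data
# → Beta(a + successes, b + failures) posterior.
a, b = 1, 1                      # uniform prior over the conversion rate
successes, failures = 30, 70     # hypothetical data: 30 of 100 users converted

post = stats.beta(a + successes, b + failures)
print(post.mean())               # posterior mean = 31/102 ≈ 0.304
print(post.interval(0.95))       # 95% credible interval for the rate
```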

Uniform Distribution

All values equally likely over an interval [a, b]:

f(x) = \frac{1}{b-a}, \quad a \le x \le b
  • E[X] = \frac{a+b}{2}, Var(X) = \frac{(b-a)^2}{12}
  • Maximum entropy distribution for bounded support — the most "uninformative" distribution when only the bounds are known

Use case: random number generation, modeling complete uncertainty, the ideal output of a hash function.

Relationships Between Distributions

Understanding how distributions relate to one another is a strong bonus point in interviews:

  • Bernoulli → Binomial (sum of n trials) → Normal (large n)
  • Binomial → Poisson (large n, small p)
  • Poisson ↔ Exponential (counts vs. gaps) → Gamma (sum of k gaps)
  • Beta is the conjugate prior for the Binomial; Gamma for the Poisson

Real-World Use Cases

Case 1: Credit Card Fraud Detection

Data characteristics: 99.9% legitimate transactions, 0.1% fraud → extremely imbalanced

  • Fraud occurrences can be modeled with Poisson (average frauds per hour)
  • Gaps between frauds with Exponential
  • Whether each transaction is fraud is Bernoulli(p = 0.001)
  • Number of frauds among 100 transactions → Binomial(n=100, p=0.001) ≈ Poisson(0.1)
  • Transaction amounts are usually right-skewed → Log-Normal (take the log for feature engineering)

Case 2: House Price Prediction

  • Prices themselves are right-skewed → Log-Normal (log-transform the target)
  • Regression assumes the residuals follow a Normal distribution
  • Features like floor area are close to Normal (possibly with some skew)
  • Number of sales per month → Poisson

Case 3: Customer Segmentation

  • Monthly purchase counts per customer → Poisson or Negative Binomial (if overdispersed)
  • Prior on conversion rates → Beta distribution
  • Purchase amounts → Log-Normal
  • GMM (Gaussian Mixture Model) assumes each cluster is Normal

Quick Reference Table

| Distribution | Type | Parameters | Mean | Variance | Key Property |
| --- | --- | --- | --- | --- | --- |
| Bernoulli | Discrete | p | p | p(1−p) | Single binary trial |
| Binomial | Discrete | n, p | np | np(1−p) | Sum of Bernoulli trials |
| Geometric | Discrete | p | 1/p | (1−p)/p² | Memoryless (discrete) |
| Neg. Binomial | Discrete | r, p | r(1−p)/p | r(1−p)/p² | Overdispersed counts |
| Poisson | Discrete | λ | λ | λ | Mean = Variance |
| Normal | Continuous | μ, σ | μ | σ² | CLT makes it universal |
| Log-Normal | Continuous | μ, σ | e^(μ+σ²/2) | (e^(σ²)−1)e^(2μ+σ²) | Multiplicative processes |
| Exponential | Continuous | λ | 1/λ | 1/λ² | Memoryless (continuous) |
| Gamma | Continuous | k, λ | k/λ | k/λ² | Sum of Exponentials |
| Beta | Continuous | α, β | α/(α+β) | see formula | Conjugate prior for proportions |
| Uniform | Continuous | a, b | (a+b)/2 | (b−a)²/12 | Maximum entropy (bounded) |

Hands-on: Distributions in Python

Sampling & Verifying Distribution Properties

```python
import numpy as np
from scipy import stats

# Verify Poisson: mean ≈ variance (defining property)
samples = np.random.poisson(lam=5, size=10000)
# sample mean ≈ 5.0, sample variance ≈ 5.0 → Mean = Variance ✓

# Verify Exponential: memoryless property
# P(X > s+t | X > s) should equal P(X > t)
exp_samples = np.random.exponential(scale=0.5, size=100000)
p_conditional = np.mean(exp_samples[exp_samples > 1] > 3)  # P(X>3 | X>1)
p_marginal = np.mean(exp_samples > 2)                       # P(X>2)
# Both ≈ 0.018 → memoryless ✓

# Verify CLT: sample means become normal regardless of source distribution
population = np.random.exponential(1, size=100000)  # heavily skewed
sample_means = [np.random.choice(population, 50).mean() for _ in range(5000)]
_, p_value = stats.normaltest(sample_means)
# p_value > 0.05 → sample means are normally distributed ✓
```

Fitting Distributions to Data

```python
from scipy import stats
import numpy as np

# Transaction amounts (right-skewed, positive-only → try Log-Normal)
amounts = np.random.lognormal(mean=3, sigma=1, size=1000)

# Fit log-normal
shape, loc, scale = stats.lognorm.fit(amounts, floc=0)
# KS test: does the data fit the distribution?
ks_stat, p_value = stats.kstest(amounts, "lognorm", args=(shape, loc, scale))
# p > 0.05 → cannot reject that data follows log-normal ✓

# Compare: if you take log, it should look normal
log_amounts = np.log(amounts)
_, p_normal = stats.normaltest(log_amounts)
# p > 0.05 → log(amounts) is normally distributed ✓
```


Interview Signals

What interviewers listen for:

  • You pick the right distribution for the scenario instead of blindly applying formulas
  • You understand each distribution's assumptions (e.g., Poisson requires independent events at a constant rate)
  • You can explain the relationships between distributions (Bernoulli → Binomial → Normal; Poisson ↔ Exponential)
  • You know when to use approximations (Binomial → Poisson, Binomial → Normal)
  • You can distinguish correlation from independence
  • You know that a PDF value is not a probability

Practice

Flashcards


Q: What special property makes a Poisson distribution instantly recognizable?

A: Mean equals variance (mean = variance = λ). If a dataset's sample mean and sample variance are close, Poisson is a strong candidate. If the variance is clearly larger than the mean (overdispersion), consider Negative Binomial.

Quiz


A website receives an average of 5 complaints per hour. Which distribution should you use to compute the probability of receiving 0 complaints in a given hour? (Poisson with λ = 5: P(X = 0) = e^{-5} ≈ 0.0067.)
