Hypothesis Testing

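Interview Mode: Concept Rebuilding

Rebuild a solid footing in distributions, sampling, estimation, and hypothesis testing first, because many machine learning and experimentation questions rest on these foundations.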

What You Should Understand

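Explain clearly the contexts in which sample mean, variance, bias, and the bias-variance tradeoff each come up
Know that a confidence interval and a p-value are different things, and do not conflate them
Understand why the CLT makes the sample mean approximately normal when the sample is large enough, and why that matters for inference
Be able to describe Type I / Type II errors, and why power is a risk that product experiments often overlook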

Core Concepts

The Logic of Hypothesis Testing

Hypothesis testing is a framework for making decisions under uncertainty. The core idea:

State a null hypothesis ( $H_0$ ): The "nothing interesting is happening" claim
Collect data and compute a test statistic
Calculate the p-value: How likely is this data (or more extreme) if $H_0$ is true?
Decide: If the p-value is below your threshold ( $\alpha$ ), reject $H_0$

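An intuitive analogy: a hypothesis test works like a courtroom trial.

$H_0$ = the defendant is innocent (the default position)
The prosecutor (your data) must present sufficient evidence
p-value = "If the defendant really were innocent, how unlikely would evidence this strong be?"
If that probability is unreasonably low (p < $\alpha$ ), the jury convicts (reject $H_0$ )
"Failing to reject $H_0$ " does not mean the defendant is innocent; it only means the evidence was insufficient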

Common Misconception

A p-value of 0.03 does NOT mean "there's a 3% chance the null hypothesis is true." It means: "If the null were true, we'd see data this extreme about 3% of the time."

Likewise, p = 0.06 does not mean "almost significant." $\alpha$ is a threshold you fix before the experiment starts, not something to adjust after the fact.
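To make the "if the null were true" reading concrete, here is a minimal simulation sketch. It runs many experiments in which $H_0$ really is true and counts how often p falls below $\alpha$. The helper `z_test_p_value`, the seed, and all parameter values are illustrative assumptions, and a z-test with known $\sigma$ is used only to keep the example dependency-free.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, assuming the population sigma is known."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(0)  # fixed seed so the run is reproducible
alpha, n, n_experiments = 0.05, 50, 5000

# Every experiment is generated with H0 true (the real mean is 0).
p_values = [
    z_test_p_value([rng.gauss(0, 1) for _ in range(n)], mu0=0, sigma=1)
    for _ in range(n_experiments)
]

false_positive_rate = sum(p < alpha for p in p_values) / n_experiments
print(f"share of experiments with p < {alpha} under H0: {false_positive_rate:.3f}")
```

The share should land close to $\alpha$: that is what the threshold controls. It says nothing about the probability that $H_0$ itself is true.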

Population vs Sample

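Before making any inference, be clear about what you are talking about:

Concept	Population	Sample
Definition	Everything you want to learn about	The subset you actually observe
Size	$N$ (usually unknown or infinite)	$n$ (known)
Mean	$\mu$ (parameter, fixed but unknown)	$\bar{x}$ (statistic, varies from sample to sample)
Variance	$\sigma^2$	$s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$
Role	The truth you want to know	The basis for inferring about the population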

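Why divide by $n-1$? (Bessel's correction)

When you compute variance from a sample, $\bar{x}$ is itself calculated from that same sample, so the deviations around it are systematically too small and the spread is underestimated. Dividing by $n-1$ instead of $n$ corrects this bias and makes $s^2$ an unbiased estimator of $\sigma^2$.

$$E[s^2] = E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \sigma^2$$

Common interview follow-up

"What are degrees of freedom?" You have $n$ data points, but one degree of freedom is already used up estimating $\bar{x}$, so only $n-1$ independent deviations remain. That is why $s^2$ divides by $n-1$.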
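The bias that Bessel's correction removes is easy to see numerically. The sketch below draws many small samples from a population with a known variance and averages the two variance estimates. The population, seed, and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                 # true population variance (sd = 2)
n, n_repeats = 5, 100_000    # small n makes the bias easy to see

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_repeats, n))

# ddof=0 divides by n, ddof=1 divides by n - 1 (Bessel's correction).
avg_var_div_n = samples.var(axis=1, ddof=0).mean()
avg_var_div_n_minus_1 = samples.var(axis=1, ddof=1).mean()

print(f"true variance:             {sigma2:.2f}")
print(f"average of divide-by-n:    {avg_var_div_n:.2f}")
print(f"average of divide-by-(n-1): {avg_var_div_n_minus_1:.2f}")
```

Dividing by $n$ should average out near $\sigma^2 (n-1)/n$, while dividing by $n-1$ should average out near $\sigma^2$. Note that NumPy's `var` defaults to `ddof=0`.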

Sampling Distribution and Standard Error

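A sampling distribution is the distribution a statistic would follow if you repeated the sampling many times and computed the statistic each time. It is the distribution of your estimator, not of your data.

Standard Error (SE) is the standard deviation of the sampling distribution:

$$\text{SE}(\bar{x}) = \frac{s}{\sqrt{n}}$$

SE tells you how stable your estimate is: the smaller the SE, the closer $\bar{x}$ tends to be to the true $\mu$.

Note the distinction:

Standard deviation ( $s$ ): spread of the data itself → "how much does spending differ from person to person"
Standard error ( $s / \sqrt{n}$ ): spread of the estimator → "how precise is our estimate of average spending"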
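One way to check the SE formula is to build the sampling distribution by brute force. The sketch below repeats an experiment many times, measures the spread of the resulting sample means, and compares it with $\sigma/\sqrt{n}$ and with the plug-in estimate $s/\sqrt{n}$ from a single sample. The population parameters and seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, n_repeats = 50.0, 12.0, 100, 20_000

# Empirical sampling distribution: one sample mean per repeated experiment.
sample_means = rng.normal(mu, sigma, size=(n_repeats, n)).mean(axis=1)
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(n)

# In practice you have one sample, so you plug in s for the unknown sigma.
one_sample = rng.normal(mu, sigma, size=n)
estimated_se = one_sample.std(ddof=1) / np.sqrt(n)

print(f"SD of the data (one sample): {one_sample.std(ddof=1):.2f}")
print(f"SD of sample means:          {empirical_se:.2f}")
print(f"sigma / sqrt(n):             {theoretical_se:.2f}")
print(f"s / sqrt(n) from one sample: {estimated_se:.2f}")
```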

Confidence Intervals

A 95% confidence interval means: if we repeated this experiment many times, about 95% of the intervals we construct would contain the true parameter.

$$\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

Where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $z_{\alpha/2}$ is the standard normal critical value (about 1.96 for a 95% interval).

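A CI does not mean "the true value has a 95% chance of being inside"

The true value $\mu$ is fixed (not random); it is either inside the CI or it is not. The correct reading is: "If you construct CIs with this method over and over, 95% of them will cover the true value in the long run."

If you want to say "$\mu$ has a 95% probability of lying in this interval," that is a Bayesian credible interval, not a frequentist CI.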
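The repeated-sampling reading can be checked directly. The following sketch builds a 95% interval from each of many simulated samples and counts how often the interval covers the fixed true mean. The population values and seed are assumptions, and the z critical value is used because $n$ is fairly large here.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
mu, sigma, n, n_repeats = 50.0, 12.0, 200, 10_000
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

samples = rng.normal(mu, sigma, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# The true mu never moves; what varies from run to run is the interval.
covered = (means - z * ses <= mu) & (mu <= means + z * ses)
coverage = covered.mean()
print(f"share of intervals that contain mu: {coverage:.3f}")
```

Any single interval either contains $\mu$ or it does not; the 95% describes the long-run behavior of the procedure.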

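CI width depends on three factors:

$$\text{CI Width} = 2 \times z_{\alpha/2} \times \frac{s}{\sqrt{n}}$$

Confidence level ( $1-\alpha$ ): a 99% CI is wider than a 95% CI
Variability ( $s$ ): the more spread out the data, the wider the CI
Sample size ( $n$ ): the larger the sample, the narrower the CI ( $\sqrt{n}$ is in the denominator)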

Type I and Type II Errors

	$H_0$ is true	$H_0$ is false
Reject $H_0$	Type I Error ( $\alpha$ )	Correct (Power = $1-\beta$ )
Fail to reject $H_0$	Correct	Type II Error ( $\beta$ )

Type I Error ( $\alpha$ ): False positive — you see an effect that isn't there
Type II Error ( $\beta$ ): False negative — you miss a real effect
Power ( $1-\beta$ ): Probability of detecting a real effect

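Everyday examples make this concrete:

Scenario	Type I Error (false positive)	Type II Error (false negative)
Pregnancy test	Not pregnant, but the test is positive	Pregnant, but the test is negative
Fire alarm	No fire, but the alarm goes off	There is a fire, but the alarm stays silent
Spam filter	A legitimate email is flagged as spam	A spam email reaches the inbox

The tradeoff between $\alpha$ and $\beta$: at a fixed sample size, lowering $\alpha$ (being stricter) raises $\beta$ (real effects are missed more easily). With the effect size and variance held fixed, the only way to lower both at once is to increase the sample size.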

Central Limit Theorem (CLT)

Regardless of the population distribution, the distribution of sample means approaches a normal distribution as sample size increases:

$$\bar{X} \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n$$

This is why we can use z-tests and t-tests even when the underlying data isn't normal — as long as $n$ is large enough.

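Three key details about the CLT:

How large is large enough? $n \geq 30$ is the usual rule of thumb, but if the population is heavily skewed (income, transaction amounts), you may need $n > 100$
Preconditions: observations must be independent, and the population variance must be finite (the CLT does not apply to the Cauchy distribution)
The CLT is about the distribution of the mean, not of the data itself. Even if the data are exponentially distributed, $\bar{X}$ is still approximately normal when $n$ is large enough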
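The exponential example can be simulated to see the convergence. The sketch below draws sample means from an exponential population at several sample sizes and tracks their skewness, which is 0 for a perfectly symmetric distribution. The `skewness` helper, the sample sizes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

n_repeats = 20_000
skew_by_n = {}
for n in (1, 5, 30, 200):
    # Each row is one sample of size n; each row mean is one draw of X-bar.
    sample_means = rng.exponential(scale=1.0, size=(n_repeats, n)).mean(axis=1)
    skew_by_n[n] = skewness(sample_means)
    print(f"n = {n:>3}: skewness of the sample means = {skew_by_n[n]:.2f}")
```

The raw data ($n = 1$) stay strongly right-skewed, while the skewness of the means shrinks toward 0 as $n$ grows. The data never become normal; only the distribution of the mean does.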

Choosing the Right Test

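Scenario	Test	Assumptions
Compare one sample mean against a known value	One-sample t-test	Normal data or large $n$
Compare two independent group means	Two-sample t-test	Independent groups; normal data or large $n$
Same subjects before/after	Paired t-test	Differences approximately normal
Compare two proportions	z-test for proportions	$np \geq 5$ and $n(1-p) \geq 5$
Compare means of 3+ groups	One-way ANOVA (F-test)	Independence, normality, equal variances
Association between categorical variables	Chi-squared test	Expected count $\geq 5$ in every cell
Non-normal data, small samples	Mann-Whitney U / Wilcoxon	No distributional assumption (nonparametric)

Selection logic in an interview

When the interviewer asks "How would you compare X and Y?", answer three questions first:

Data type: continuous or categorical?
Number of groups: 2 or 3+?
Paired or independent: the same subjects before and after, or different subjects?

Then name the matching test and its assumptions.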
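The three screening questions can be written down as a small lookup. The function below is a hypothetical helper, not a library API, and it deliberately simplifies: it only returns a first candidate from the table, and the assumptions for that test still need to be checked by hand (falling back to a nonparametric test if they fail).

```python
def suggest_test(data_type: str, n_groups: int, paired: bool = False) -> str:
    """Return a first-candidate test from the three screening questions.

    data_type: "continuous" or "categorical"
    n_groups:  number of groups being compared (1 = one sample vs a known value)
    paired:    True when the same subjects are measured twice
    """
    if data_type == "categorical":
        # Two groups with a binary outcome reduce to comparing two proportions.
        return "z-test for proportions" if n_groups == 2 else "Chi-squared test"
    if data_type != "continuous":
        raise ValueError("data_type must be 'continuous' or 'categorical'")
    if n_groups == 1:
        return "One-sample t-test"
    if n_groups == 2:
        return "Paired t-test" if paired else "Two-sample t-test"
    return "One-way ANOVA (F-test)"

print(suggest_test("continuous", 2, paired=True))   # same users before/after
print(suggest_test("continuous", 4))                # four independent groups
print(suggest_test("categorical", 2))               # conversion rate, A vs B
```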

Statistical Power

Power depends on four factors:

Effect size: Larger effects are easier to detect
Sample size ( $n$ ): More data = more power
Significance level ( $\alpha$ ): Higher $\alpha$ = more power (but more false positives)
Variance: Lower variance = more power
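The four factors can be explored with a normal-approximation power formula. The function below is a sketch for a two-sided, two-sample z-test with equal group sizes and a common standard deviation; the name `two_sample_power` and the scenario numbers (a 3-unit lift on a metric with SD 12 and 200 users per group) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # True difference expressed in standard-error units.
    shift = abs(effect) / (sigma * sqrt(2 / n_per_group))
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

baseline = two_sample_power(effect=3, sigma=12, n_per_group=200)
bigger_effect = two_sample_power(effect=4, sigma=12, n_per_group=200)
more_data = two_sample_power(effect=3, sigma=12, n_per_group=400)
looser_alpha = two_sample_power(effect=3, sigma=12, n_per_group=200, alpha=0.10)
less_variance = two_sample_power(effect=3, sigma=9, n_per_group=200)

for label, value in [("baseline", baseline), ("larger effect", bigger_effect),
                     ("more data", more_data), ("higher alpha", looser_alpha),
                     ("lower variance", less_variance)]:
    print(f"{label:<15} power = {value:.2f}")
```

Changing any one factor in the direction listed above raises power relative to the baseline. Only the higher $\alpha$ does so at the cost of more false positives.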

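Effect Size

The p-value tells you whether there is evidence of a difference; the effect size tells you how large the difference is. A common measure:

Cohen's d (the difference between two group means, in units of the pooled SD):

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \quad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Cohen's d	Size	Intuition
0.2	Small	Needs a very large $n$ to detect
0.5	Medium	A moderate sample size is enough
0.8	Large	A difference visible to the naked eye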
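The Cohen's d formula translates directly into a few lines of code. The sketch below uses only the standard library; the two small groups are made-up numbers for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance already uses the n - 1 denominator.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Hypothetical order values for two groups.
treatment = [54, 49, 61, 58, 47, 55, 52, 60]
control = [50, 46, 57, 51, 44, 53, 48, 55]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

Unlike the p-value, d does not shrink or grow with sample size; it describes the size of the gap relative to the noise in the data.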

Interview Tip

When asked about A/B testing, always mention power analysis. Many candidates forget that not reaching statistical significance doesn't mean "no effect" — it could mean the test was underpowered.

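A good answer: "p = 0.12 does not mean there is no effect. Given our sample size and baseline variance, a power analysis shows we had only 40% power to detect a 2% effect. In other words, even if the effect is real, we had a 60% chance of missing it. I recommend extending the experiment or using CUPED to reduce variance."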

Multiple Testing Problem

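When you run many tests at once, the probability of at least one false positive rises quickly. For $m$ independent tests in which every $H_0$ is true:

$$P(\text{at least one FP}) = 1 - (1 - \alpha)^m$$

Number of tests $m$	FP probability ( $\alpha = 0.05$ )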
1	5.0%
5	22.6%
10	40.1%
20	64.2%
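The table values follow directly from the formula. A short sketch, assuming independent tests in which every null hypothesis is true (the function name is illustrative):

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20):
    print(f"m = {m:>2}: {family_wise_error_rate(m):.1%}")

# Testing each of 20 hypotheses at alpha / 20 pulls the family-wise rate back under 5%.
bonferroni_rate = family_wise_error_rate(20, alpha=0.05 / 20)
print(f"m = 20 at alpha/20: {bonferroni_rate:.1%}")
```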

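Corrections:

Bonferroni correction: $\alpha_{\text{adj}} = \alpha / m$. Simple but conservative; power drops sharply.
Benjamini-Hochberg (FDR): controls the false discovery rate (the expected proportion of false positives among the rejected $H_0$). Less strict than Bonferroni and well suited to exploratory analysis.
Primary metric approach: run a formal test on a single primary metric only; the other metrics are secondary/exploratory and need no correction.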
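The first two corrections are short enough to write by hand, which helps when an interviewer asks how they differ. The sketch below implements both on a hypothetical list of p-values. In real work a library routine would normally be used instead; the function names here are illustrative.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up BH: find the largest rank k with p_(k) <= (k / m) * alpha, reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from six metrics in one experiment.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.55]

bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("uncorrected (p < 0.05):", sum(p < 0.05 for p in p_values), "rejections")
print("Bonferroni:            ", sum(bonferroni), "rejections")
print("Benjamini-Hochberg:    ", sum(bh), "rejections")
```

On these numbers the uncorrected count is the largest, Bonferroni is the strictest, and Benjamini-Hochberg sits in between, which matches the "less strict than Bonferroni" description.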

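Interview landmine

"We ran 20 metrics and 3 of them had p < 0.05, so those are significant." This is the classic multiple testing trap. With 20 tests at $\alpha = 0.05$, you expect 1 false positive even if nothing is going on, so some or all of the 3 "significant" results may be chance. Proactively raising multiple testing correction in an interview is a strong positive signal.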

Real-World Use Cases

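Put hypothesis testing into three common ML scenarios and you will find it everywhere:

Case 1: Credit Card Fraud Detection (Classification)

You built a fraud detection model, and before launch your boss asks: "Is the new model really better than the old one?"

Setup:

$H_0$ : the new model's precision is no different from the old model's
$H_1$ : the new model's precision is higher
You run both models on the same labeled data and get paired predictions

Why can't you just look at a single number?

Suppose the old model has precision = 0.82 and the new model has precision = 0.85. The new one looks better, but that may only be because this particular test set happens to favor it. You need McNemar's test (a paired comparison of classification results) or a bootstrap confidence interval to check whether the difference is significant.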

from sklearn.metrics import precision_score
import numpy as np

# Assumes y_test, y_pred_old, y_pred_new are NumPy arrays of 0/1 labels,
# aligned row by row (both models scored on the same test set).
rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Paired bootstrap: resample the test set 1000 times, compare precision
precisions_old, precisions_new = [], []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), size=len(y_test))  # rows sampled with replacement
    precisions_old.append(precision_score(y_test[idx], y_pred_old[idx]))
    precisions_new.append(precision_score(y_test[idx], y_pred_new[idx]))

diff = np.array(precisions_new) - np.array(precisions_old)
ci = np.percentile(diff, [2.5, 97.5])
# e.g. CI = [0.01, 0.05] → excludes 0, new model is significantly better
# e.g. CI = [-0.02, 0.06] → includes 0, not enough evidence
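The McNemar's test mentioned above can be sketched with the standard library alone. Note the caveat: McNemar compares per-example correctness of the two models on the same cases, which is not the same quantity as precision. The function name and the counts below are hypothetical.

```python
from math import comb

def mcnemar_exact(old_correct, new_correct):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes."""
    # Only discordant pairs (exactly one model is right) carry information.
    only_old = sum(o and not n for o, n in zip(old_correct, new_correct))
    only_new = sum(n and not o for o, n in zip(old_correct, new_correct))
    n_discordant = only_old + only_new
    if n_discordant == 0:
        return only_old, only_new, 1.0
    # Under H0 each discordant pair favors either model with probability 0.5.
    k = min(only_old, only_new)
    tail = sum(comb(n_discordant, i) for i in range(k + 1)) / 2 ** n_discordant
    return only_old, only_new, min(1.0, 2 * tail)

# Hypothetical results on 100 test cases:
# 80 both right, 3 only the old model right, 12 only the new model right, 5 both wrong.
old_correct = [True] * 80 + [True] * 3 + [False] * 12 + [False] * 5
new_correct = [True] * 80 + [False] * 3 + [True] * 12 + [False] * 5

only_old, only_new, p_value = mcnemar_exact(old_correct, new_correct)
print(f"only old right: {only_old}, only new right: {only_new}, p = {p_value:.4f}")
```

The 85 cases where both models agree do not enter the test at all, which is why a paired test can be more sensitive than comparing two overall scores.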

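The cost of Type I / II errors in fraud detection

Type I Error (a legitimate transaction flagged as fraud) → the customer's transaction is blocked, they call support, the experience suffers, but the loss is contained
Type II Error (real fraud slips through) → the bank loses money directly

So fraud detection usually sets a lower threshold (tolerating more FPs to reduce FNs), which means caring more about recall than precision. In hypothesis testing terms: you accept a higher $\alpha$ (more false alarms) in exchange for higher power (catching more real fraud).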

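Case 2: House Price Prediction (Regression)

Your house price model has 20 features, and the PM asks: "Does floor area actually affect price? Did adding the distance-to-school feature help?"

Setup:

$H_0$ : $\beta_{\text{area}} = 0$ (the coefficient on floor area is zero; it has no effect on price)
$H_1$ : $\beta_{\text{area}} \neq 0$

In OLS regression, every coefficient comes with its own t-test:

$$t = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}$$

If $|t|$ is large enough that the p-value < 0.05, reject the null hypothesis that "this feature has no effect."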

import statsmodels.api as sm

# Assumes df is a pandas DataFrame containing these columns
X = sm.add_constant(df[["area", "rooms", "distance_to_school"]])
model = sm.OLS(df["price"], X).fit()
print(model.summary())
# Illustrative excerpt of the coefficient table:
#                        coef    std err     t      P>|t|    [0.025    0.975]
# area                 12.34      1.56    7.91    0.000     9.28      15.40   ← significant
# rooms                 3.21      2.88    1.11    0.266    -2.44       8.86   ← NOT significant
# distance_to_school   -5.67      1.23   -4.61    0.000    -8.08      -3.26   ← significant

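Common interview follow-up

"If the p-value for rooms is high, should we just drop it?"

Not necessarily. A high p-value can arise because of:

Multicollinearity: rooms and area are highly correlated; collinearity inflates the SE and shrinks the t-value
Underpowered: the sample is too small to detect a real but small effect
Genuinely uninformative: once confirmed, the feature can be removed

Use VIF to check for collinearity and an F-test to test several coefficients jointly; do not rely on a single p-value.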
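The VIF check can be computed from first principles: regress each feature on the others and convert the resulting $R^2$ into $1 / (1 - R^2)$. The sketch below does this with NumPy on synthetic data in which rooms is built to track area closely. The `vif` helper and the data-generating numbers are assumptions for illustration, and the commonly quoted cutoffs (around 5 to 10) are rules of thumb rather than hard limits.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress feature j on an intercept plus all the other features.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1 - residuals.var() / target.var()
        factors.append(1 / (1 - r_squared))
    return np.array(factors)

# Synthetic features: rooms is mostly a rescaled copy of area, distance is unrelated.
rng = np.random.default_rng(4)
area = rng.normal(30, 8, size=500)
rooms = area / 10 + rng.normal(0, 0.3, size=500)
distance = rng.normal(2, 1, size=500)

vifs = vif(np.column_stack([area, rooms, distance]))
for name, value in zip(["area", "rooms", "distance_to_school"], vifs):
    print(f"{name:<20} VIF = {value:.1f}")
```

A high VIF for both area and rooms signals that their individual standard errors are inflated, which is one reason a useful feature can show a high p-value.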

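F-test: does a whole group of features matter?

$$F = \frac{(SS_{\text{res,reduced}} - SS_{\text{res,full}}) / q}{SS_{\text{res,full}} / (n - p - 1)}$$

This compares the model with the group of features (full) against the model without it (reduced) to see whether the added features are jointly significant. Here $q$ is the number of coefficients being tested, $p$ is the number of predictors in the full model, and $n$ is the number of observations.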
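The F statistic above can be computed by fitting the reduced and full models and comparing their residual sums of squares. The sketch below uses NumPy least squares on synthetic data. The helper names and the data are assumptions; to turn the statistic into a p-value you would compare it against an F distribution with $(q, n - p - 1)$ degrees of freedom, for example with `scipy.stats.f.sf`.

```python
import numpy as np

def residual_ss(X, y):
    """Residual sum of squares from an OLS fit that includes an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def nested_f_statistic(X_reduced, X_full, y):
    """F statistic for H0: the q extra coefficients in the full model are all zero."""
    n, p = X_full.shape            # p predictors in the full model (intercept excluded)
    q = p - X_reduced.shape[1]     # number of coefficients being tested
    ss_reduced = residual_ss(X_reduced, y)
    ss_full = residual_ss(X_full, y)
    return ((ss_reduced - ss_full) / q) / (ss_full / (n - p - 1))

# Synthetic data: price depends on area and rooms; the two noise columns have no effect.
rng = np.random.default_rng(6)
n = 300
area, rooms = rng.normal(size=n), rng.normal(size=n)
noise_a, noise_b = rng.normal(size=n), rng.normal(size=n)
price = 3.0 * area + 2.0 * rooms + rng.normal(size=n)

with_rooms = np.column_stack([area, rooms])
f_useful = nested_f_statistic(np.column_stack([area]), with_rooms, price)
f_noise = nested_f_statistic(with_rooms, np.column_stack([area, rooms, noise_a, noise_b]), price)
print(f"adding rooms:          F = {f_useful:.1f}")
print(f"adding two noise cols: F = {f_noise:.2f}")
```

Adding a feature that truly drives the target produces a very large F, while adding pure noise produces a small one.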

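Case 3: Customer Segmentation (Clustering)

You used K-Means to split customers into 4 clusters, and marketing asks: "Is the average spend of the high-value cluster really higher than the others, or is it just random fluctuation?"

Setup:

$H_0$ : every cluster has the same mean monthly spend ( $\mu_1 = \mu_2 = \mu_3 = \mu_4$ )
$H_1$ : at least one cluster differs
Use one-way ANOVA (F-test) or the nonparametric Kruskal-Wallis test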

from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumes cluster_spending = [group1_values, group2_values, group3_values, group4_values]
f_stat, p_value = stats.f_oneway(*cluster_spending)
# e.g. p < 0.001 → at least one cluster is significantly different

# Follow-up: WHICH clusters differ? → Tukey's HSD for pairwise comparison
# Assumes all_spending is the flat array of spend values and cluster_labels
# gives the cluster of each value, in the same order
tukey = pairwise_tukeyhsd(all_spending, cluster_labels, alpha=0.05)
print(tukey)
# e.g. group1 vs group4: mean diff=120, p=0.001 → significant
# e.g. group2 vs group3: mean diff=15,  p=0.82  → not significant

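The trap in Clustering + Hypothesis Testing

You cannot run a hypothesis test directly on the clustering result to "validate" how good the clustering is.

K-Means minimizes within-cluster variance, which amounts to maximizing between-cluster differences, so running ANOVA on the same data will of course come out significant. That is circular reasoning.

The right approach:

Fit the clusters on a training set
Compute each cluster's metrics on a holdout set
Run the hypothesis test on the holdout results

Alternatively, test not "do the clusters differ?" (that is a tautology) but "does the segmentation help a downstream task?" (for example, whether marketing campaign ROI improves after segmentation).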
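The circularity is easy to demonstrate on data with no structure at all. In the sketch below every customer is drawn from the same distribution, so there are no real segments. A median split stands in for clustering on that one variable (an assumption made to keep the example short; it is not K-Means), and the test is then run on the very variable used to form the groups.

```python
import random
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic for the difference in means of two groups."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

rng = random.Random(5)

# Pure noise: one population, no true segments.
spending = sorted(rng.gauss(100, 20) for _ in range(400))

# "Cluster" on spending itself, then test spending between the clusters.
low, high = spending[:200], spending[200:]
t_circular = welch_t(high, low)

# Control: group labels assigned at random, without looking at spending.
shuffled = spending[:]
rng.shuffle(shuffled)
t_random = welch_t(shuffled[:200], shuffled[200:])

print(f"groups formed from the data: t = {t_circular:.1f}")
print(f"groups assigned at random:   t = {t_random:.1f}")
```

The groups formed from the data produce an enormous t statistic even though nothing real separates them, while random labels produce an unremarkable one. A significant result in the first case reflects how the groups were built, not a discovery.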

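Hypothesis tests across the three use cases

Scenario	$H_0$	Test	Type I cost	Type II cost
Fraud detection	No difference in precision between old and new models	Bootstrap CI / McNemar's test	Launching a model that is not actually better	Missing a better model
House price prediction	Feature coefficient = 0	t-test (individual) / F-test (group)	Keeping a useless feature	Dropping a useful feature
Customer segmentation	All clusters have the same mean spend	ANOVA + Tukey HSD	Seeing cluster differences that are not real	Missing real differences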

Hands-on: Hypothesis Testing in Python

t-test & Confidence Interval

import numpy as np
from scipy import stats

# Scenario: A/B test. Does the new checkout flow increase order value?
rng = np.random.default_rng(42)  # fixed seed for reproducibility
control = rng.normal(loc=50, scale=12, size=200)
treatment = rng.normal(loc=53, scale=12, size=200)

# Two-sample t-test (independent, equal variance); treatment first so t > 0 means an increase
t_stat, p_value = stats.ttest_ind(treatment, control)

# 95% Confidence Interval for the difference in means
diff = treatment.mean() - control.mean()
# ddof=1 gives the sample variance (NumPy divides by n by default)
se = np.sqrt(treatment.var(ddof=1)/len(treatment) + control.var(ddof=1)/len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
# → If the CI excludes 0, the difference is statistically significant at α=0.05

Power Analysis & Sample Size

from scipy import stats
import numpy as np

def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Calculate sample size per group for a two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # 1.96
    z_beta = stats.norm.ppf(power)            # 0.84
    p1, p2 = baseline, baseline + mde
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)  # sum of the two Bernoulli variances
    return int(np.ceil((z_alpha + z_beta) ** 2 * var_sum / mde ** 2))

# 10% → 12% (MDE=2%): ~  3,840 per group
# 10% → 11% (MDE=1%): ~ 14,750 per group
# 10% → 10.5% (MDE=0.5%): ~ 57,760 per group
# → Halving the MDE roughly quadruples the required sample size

Interview Signals

What interviewers listen for:

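You are not reciting formulas; you know when to reach for which tool
You proactively state a test's assumptions, such as independence, sample size, and distribution shape
You can explain what a statistical result means for a business decision, rather than only reporting significant or not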

Practice

Flashcards

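Why is a confidence interval more useful than a single point estimate?

Because it gives both the estimate and the range of uncertainty, letting you judge whether the result is stable enough to support a decision. For example, "average spend is $50" is less informative than "average spend is $50, 95% CI [$45, $55]."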

Quiz

An A/B test returns p-value = 0.03. What is the most reasonable interpretation?
