Sampling Methods
Interview Context
Sampling is the starting point of statistical inference: if the sample is drawn badly from the population, every downstream analysis is skewed. In interviews the topic usually appears in two forms: (1) "Your data has selection bias, what do you do?" and (2) "You have 10 billion rows of data, how do you sample them efficiently for analysis?"
What You Should Understand
- Know why random sampling is the foundation of inference, and when it is not enough
- Be able to compare the appropriate scenarios for stratified, cluster, and systematic sampling
- Understand how the bootstrap works and what it can (and cannot) do
- Know how reservoir sampling draws an equal-probability sample from streaming data
- Be able to recognize the types of sampling bias and propose corrections
Core Concepts
Why Sampling Matters
Statistical inference requires that the sample represents the population. If it doesn't, no amount of sophisticated analysis can fix the conclusions.
Two key concepts:
- Sampling frame: the list you can actually sample from. If the frame itself is incomplete (e.g., it contains only online users and no offline users), your inference suffers from coverage bias.
- Sampling mechanism: how individuals are selected from the frame. Non-random mechanisms almost always introduce bias.
Simple Random Sampling (SRS)
Every individual in the population has an equal probability of being selected. This is the gold standard, but often impractical.
Each unit is selected with probability n/N, where n is the sample size and N is the population size.
SRS is the cleanest option in theory, but it has practical limitations:
- It requires a complete population list, which often does not exist
- If the population has subgroups (e.g., users from different countries), SRS may under-represent small groups
- It is inefficient for rare events (e.g., fraudulent transactions at 0.1% are hard to capture in sufficient numbers under SRS)
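SRS itself is one line with NumPy; the population here is just a range of hypothetical unit IDs:

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1_000_000)  # hypothetical list of N unit IDs

# Draw n units without replacement: every unit has the same
# inclusion probability n/N
n = 1_000
sample = rng.choice(population, size=n, replace=False)
```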
Stratified Sampling
Divide the population into mutually exclusive strata (subgroups), then sample independently from each:
The stratified estimator is x_st = Σ_h W_h · x_h, where x_h is the sample mean in stratum h and W_h = N_h / N is the population weight of stratum h.
Two Allocation Strategies
Proportional allocation: Sample from each stratum in proportion to its size, n_h = n · N_h / N.
This guarantees that each stratum is represented.
Optimal (Neyman) allocation: Sample more from strata with higher variance, n_h ∝ N_h · σ_h.
If a stratum has large variance (e.g., spending among high-value customers varies widely), you need to draw more from that stratum to get a stable estimate.
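The two allocation rules can be compared numerically; the stratum sizes and standard deviations below are made-up:

```python
import numpy as np

N_h = np.array([7000, 2000, 1000])     # stratum sizes (hypothetical)
sigma_h = np.array([10.0, 50.0, 5.0])  # within-stratum std devs (hypothetical)
n = 500                                # total sample budget

# Proportional allocation: n_h = n * N_h / N
prop = n * N_h / N_h.sum()

# Neyman allocation: n_h proportional to N_h * sigma_h
neyman = n * (N_h * sigma_h) / (N_h * sigma_h).sum()
# The high-variance middle stratum gets far more than its
# proportional share under Neyman allocation
```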
When to Use Stratified Sampling
| Scenario | Why Stratified | Strata |
|---|---|---|
| A/B test across countries | Ensure every country is adequately represented | Country |
| Survey across income levels | High earners are few but high-variance | Income bracket |
| Model evaluation on rare classes | Ensure minority classes have enough samples | Fraud / not fraud |
| Cross-platform analysis | Mobile and desktop behavior differ substantially | Platform |
Interview Connection: Stratified K-Fold
sklearn's StratifiedKFold is an application of stratified sampling: it ensures each fold's class distribution matches the overall distribution. With plain KFold on an imbalanced classification problem, some folds may contain no minority-class examples at all.
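A minimal check of that behavior (the 5% positive rate and data shapes are arbitrary):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% minority class

# Every test fold keeps roughly the overall class ratio, so none
# of them ends up without minority-class examples
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos = [y[test_idx].sum() for _, test_idx in skf.split(X, y)]
```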
Cluster Sampling
Divide the population into clusters (e.g., geographic regions, schools), randomly select some clusters, then sample all individuals within selected clusters.
The key difference from stratified sampling:
- Stratified: sample from every stratum (strata differ from each other; each stratum is internally homogeneous)
- Cluster: sample only some clusters (clusters resemble each other; each cluster is internally heterogeneous)
Two-Stage Cluster Sampling
- Randomly select m clusters from the M total
- Within each selected cluster, randomly sample n_i individuals
Cluster sampling is very common in practice (e.g., A/B tests that split traffic by city cluster), but its variance is usually higher than SRS because individuals within a cluster tend to be similar (intraclass correlation).
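A two-stage sketch with made-up numbers (50 clusters of 200 units each; the ID strings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population: M = 50 clusters (e.g., cities), 200 units each
clusters = {c: [f"c{c}-u{u}" for u in range(200)] for c in range(50)}

# Stage 1: select m = 5 of the M clusters at random
chosen = rng.choice(50, size=5, replace=False)

# Stage 2: sample n_i = 20 units within each selected cluster
sample = []
for c in chosen:
    for u in rng.choice(200, size=20, replace=False):
        sample.append(clusters[c][u])
```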
Systematic Sampling
Select every k-th element from an ordered list, starting from a random offset in [0, k):
Simple and efficient, but if the list has periodic structure (e.g., a Monday-to-Sunday cycle), systematic sampling may happen to pick data from only certain days.
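The whole method is a slice with a random offset; a minimal sketch:

```python
import random

def systematic_sample(items, k):
    """Every k-th element, starting at a random offset in [0, k)."""
    start = random.randrange(k)
    return items[start::k]

random.seed(0)
sample = systematic_sample(list(range(100)), k=10)
# Sampled elements are exactly k = 10 apart, which is precisely why a
# list with period dividing k would be sampled in a biased way
```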
Bootstrap
The Core Idea
Bootstrap is a resampling method: treat your sample as if it were the population, and simulate the sampling distribution by drawing with replacement.
For B bootstrap iterations:
- Draw a bootstrap sample of size n (with replacement)
- Compute the statistic of interest (mean, median, coefficient, etc.)
- Repeat B times (typically B = 1,000 to 10,000)
- The distribution of the B statistics approximates the sampling distribution
Bootstrap Confidence Interval
Percentile method (simplest):
Take the 2.5th and 97.5th percentiles of the B bootstrap estimates.
BCa method (bias-corrected and accelerated): corrects the bootstrap's bias and skewness; usually more accurate.
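As a sketch, the percentile method for a median, a statistic with no simple closed-form SE; the exponential data is made up:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=10.0, size=200)  # a skewed sample

# B bootstrap medians, each from a same-size resample with replacement
B = 2000
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(B)
])

# Percentile 95% CI: the 2.5th and 97.5th percentiles of the B estimates
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
```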
When Bootstrap Works (and Doesn't)
| Works Well | Doesn't Work |
|---|---|
| Estimating SE of complex statistics (median, quantiles, ratios) | Extremes / tail probabilities (the bootstrap cannot produce values more extreme than the original data) |
| Model comparison (bootstrap difference in AUC) | Very small samples (small n): resampling the same few points gives unstable variance estimates |
| Non-standard estimators without closed-form SE | Dependent data (time series): requires block bootstrap |
| Hypothesis testing (permutation test variant) | Population with infinite variance (heavy tails) |
```python
import numpy as np
from sklearn.metrics import precision_score

# Bootstrap: compare precision of two models
def bootstrap_ci(y_true, y_pred_A, y_pred_B, n_iter=2000, alpha=0.05):
    """Percentile bootstrap CI for the precision difference (B - A).

    All three inputs must be NumPy arrays of the same length.
    """
    diffs = []
    n = len(y_true)
    for _ in range(n_iter):
        idx = np.random.choice(n, n, replace=True)
        p_A = precision_score(y_true[idx], y_pred_A[idx], zero_division=0)
        p_B = precision_score(y_true[idx], y_pred_B[idx], zero_division=0)
        diffs.append(p_B - p_A)
    lower = np.percentile(diffs, 100 * alpha / 2)
    upper = np.percentile(diffs, 100 * (1 - alpha / 2))
    return lower, upper

# If the CI excludes 0, model B is significantly better
```
Bootstrap ≠ Magic
The bootstrap cannot fix sampling bias. If your original sample does not represent the population (e.g., iOS users only), resampling it a million times just replays the same biased sample. The bootstrap improves estimation precision, not representativeness.
Reservoir Sampling
The Problem
You have a stream of data (possibly infinite, or of unknown length N), and you need to maintain a random sample of exactly k items, with each item having equal probability k/N of being selected.
Algorithm R (Vitter, 1985)
```python
import random

def reservoir_sampling(stream, k):
    """Maintain a uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k/(i+1),
            # replacing a uniformly chosen reservoir slot
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```
Why it works: After seeing n items, each item is in the reservoir with probability k/n. This invariant can be proven by induction on n.
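The invariant is easy to verify empirically; this repeats Algorithm R from above so the snippet is self-contained:

```python
import random
from collections import Counter

def reservoir_sampling(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Over many runs, each of the 10 stream items should land in a
# size-3 reservoir about k/n = 30% of the time
random.seed(0)
counts = Counter()
runs = 20_000
for _ in range(runs):
    counts.update(reservoir_sampling(range(10), k=3))

freqs = {item: c / runs for item, c in counts.items()}
```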
Use Cases
- Log sampling: a million request logs per second; keep only 1% for analysis
- A/B test traffic: randomly draw the experiment population from all users
- ML training: maintain a representative training set from streaming data
- Database sampling: SELECT * FROM table TABLESAMPLE RESERVOIR(1000); some databases implement sampling with a reservoir under the hood
A Frequent Interview Question
"You have an infinitely long data stream, a single pass, and memory for only k items. How do you guarantee every item has an equal probability of being selected?" The answer is reservoir sampling. Remember the core of Algorithm R: the i-th item is accepted with probability k/i on arrival, and after n items every item is retained with probability k/n.
Importance Sampling
The Problem
You want to estimate E_p[f(X)] under a target distribution p, but sampling from p is either difficult or inefficient (e.g., rare event probability).
The Solution
Sample from a proposal distribution q instead, and correct with importance weights w(x) = p(x)/q(x):
E_p[f(X)] = E_q[f(X) · p(X)/q(X)]
Intuition
If a region has very low probability under p (so it is hard to sample) but f matters there, you can sample from a q that puts more probability on that region, then correct with the weight p(x)/q(x).
Practical Concerns
- Variance: If q is very different from p, some weights become extremely large, producing high variance. A good q keeps f(x) · p(x)/q(x) as stable as possible.
- Effective sample size: ESS = (Σ_i w_i)^2 / Σ_i w_i^2. If most of the weight concentrates on a few samples, the effective sample size is small.
- Self-normalized importance sampling: when p is known only in unnormalized form, use the estimator Σ_i w_i f(x_i) / Σ_i w_i.
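A sketch for rare-event estimation: P(X > 4) for X ~ N(0, 1) is about 3.17e-5, far too rare for naive Monte Carlo at this sample size, but a proposal shifted to N(4, 1) covers the event well. The densities are computed by hand to keep the snippet dependency-free:

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(x, mu=0.0):
    """Unit-variance normal density centered at mu."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

n = 100_000
x = rng.normal(loc=4.0, size=n)            # draw from proposal q = N(4, 1)
w = normal_pdf(x) / normal_pdf(x, mu=4.0)  # weights p(x) / q(x)

# E_p[1{X > 4}] = E_q[1{X > 4} * p(X)/q(X)]
estimate = np.mean((x > 4) * w)
```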
Use Cases
| Application | Why Importance Sampling |
|---|---|
| Rare event simulation | Direct sampling almost never hits the rare event; a q biased toward it is far more efficient |
| Bayesian inference | Approximate the posterior with weighted draws from a proposal distribution |
| Off-policy evaluation (RL) | Evaluate a target policy using data collected under a behavior policy |
| Counterfactual evaluation | Simulate the effect of a different policy from logged data |
Sampling Bias
Types of Bias
| Bias | Description | Example |
|---|---|---|
| Selection bias | Sample does not represent the population | Online surveys answered only by people who opt in (self-selection) |
| Survivorship bias | Only "surviving" individuals are observed | Analyzing traits of successful companies while ignoring failed ones (e.g., studying listed stocks and ignoring delisted ones) |
| Non-response bias | Non-responders differ systematically from responders | High earners are less likely to answer salary surveys |
| Convenience bias | Using the most easily available data | Analyzing only the company's existing users and ignoring churned users |
| Observation bias (Hawthorne) | Observed people change their behavior | Users in an A/B test who know they are in an experiment (though usually they don't) |
| Time bias | The data collection window is unrepresentative | Modeling purchase behavior using only Black Friday data |
Bias Correction Techniques
Inverse Probability Weighting (IPW): weight each observation by the inverse of its selection probability, w_i = 1 / π_i, where π_i is the probability that unit i was sampled.
Under-represented individuals receive higher weight. Widely used in causal inference (propensity score weighting) and survey statistics.
Post-stratification: reweight to a known population distribution (e.g., you know the population is 50/50 male/female but the sample is 70% male, so women get higher weight).
Doubly robust estimation: combine an outcome model and a propensity model; as long as one of the two is correctly specified, the estimator is consistent.
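Post-stratification takes a few lines; the 70/30 sample vs. 50/50 population and the group means are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(5)

# Biased sample: 70% group M, 30% group F; population is known 50/50
group = rng.choice(["M", "F"], size=1000, p=[0.7, 0.3])
y = np.where(group == "M", 10.0, 20.0) + rng.normal(size=1000)

pop_share = {"M": 0.5, "F": 0.5}
sample_share = {g: np.mean(group == g) for g in pop_share}
weights = np.array([pop_share[g] / sample_share[g] for g in group])

naive = y.mean()                           # pulled toward group M
weighted = np.average(y, weights=weights)  # close to the 50/50 mean of 15
```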
Real-World Use Cases
Case 1: Credit Card Fraud Detection (Undersampling & Oversampling)
Fraud makes up only 0.1% of transactions. Train a model on a simple random sample and it learns that "always predict non-fraud" achieves 99.9% accuracy.
Remedies:
- Random undersampling: keep only as many majority-class samples as there are minority-class samples. Simple, but discards information.
- SMOTE (Synthetic Minority Oversampling): generate synthetic samples by k-NN interpolation in the minority class's feature space. Preserves more information but can overfit.
- Stratified train/test split: ensure the fraud ratio is the same in train and test.
```python
from sklearn.model_selection import train_test_split

# Stratified split: preserve class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
# y_train and y_test both have ~0.1% fraud
```
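Random undersampling can also be sketched with index arrays alone; the ~1% positive labels here are simulated:

```python
import numpy as np

rng = np.random.default_rng(9)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% simulated "fraud"

# Keep every minority row; sample an equal number of majority rows
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, majority_idx])
rng.shuffle(balanced_idx)
# X[balanced_idx], y[balanced_idx] would give a 50/50 training set
```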
Case 2: Recommender Systems (Popularity Bias & Exposure Bias)
Logged data from a recommender system has severe selection bias: you only observe items the user was shown and clicked; you never observe the user's preference for everything else.
This is called exposure bias, or missing not at random (MNAR). Training directly on such data makes the model drift further toward already-popular items (a popularity-bias feedback loop).
Remedies:
- Inverse propensity scoring: give under-exposed items higher weight
- Exploration: periodically recommend random items to collect unbiased data
- Causal inference: correct with propensity scores or instrumental variables
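A toy self-normalized inverse-propensity estimate; all quantities (propensities, clicks, policy-match indicator) are simulated, so this only illustrates the formula, not a real pipeline:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50_000
propensity = rng.uniform(0.05, 0.9, size=n)   # logging policy's exposure probs
clicks = (rng.random(n) < 0.1).astype(float)  # logged outcomes, ~10% CTR
match = (rng.random(n) < 0.5).astype(float)   # target policy shows same item?

# Self-normalized IPS: reweight logged outcomes by match / propensity
w = match / propensity
snips = (w * clicks).sum() / w.sum()
```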
Case 3: House Price Prediction (Non-Random Missing Data)
In a housing dataset, the "renovation year" feature is 30% missing. The missingness is not random: older houses are more likely to lack renovation records.
| Missing Mechanism | Description | Handling |
|---|---|---|
| MCAR (Missing Completely At Random) | Missingness is unrelated to any variable | Drop rows or use simple imputation |
| MAR (Missing At Random) | Missingness depends only on observed variables | Predict missing values from other features |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | Hardest case; requires domain knowledge or sensitivity analysis |
The missing renovation year is MNAR (houses that were never renovated naturally have no renovation year), so plain mean imputation introduces severe bias. A better approach: add a binary feature has_renovation and use model-based imputation.
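The has_renovation idea in pandas; column names and values are illustrative, and the fallback fill is just a placeholder for a proper model-based imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "build_year": [1960, 1975, 1990, 2005, 2015, 1955],
    "renovation_year": [np.nan, 1998, np.nan, 2010, 2018, np.nan],
})

# 1. Keep the missingness itself as a feature: under MNAR, the gap is a signal
df["has_renovation"] = df["renovation_year"].notna().astype(int)

# 2. Impute only for model convenience; here we fall back to build_year,
#    and the indicator lets the model treat these rows separately
df["renovation_year_filled"] = df["renovation_year"].fillna(df["build_year"])
```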
Hands-on: Sampling in Python
Stratified Sampling with pandas
```python
import pandas as pd
import numpy as np

# Stratified sample: 10% from each group
df = pd.DataFrame({
    "user_id": range(10000),
    "country": np.random.choice(["US", "TW", "JP"], 10000, p=[0.7, 0.2, 0.1]),
    "revenue": np.random.exponential(50, 10000),
})
stratified = df.groupby("country", group_keys=False).apply(
    lambda x: x.sample(frac=0.1, random_state=42)
)
# Each country represented proportionally
```
Reservoir Sampling
```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 1000 items from a stream of unknown length
sample = reservoir_sample(iter(range(10_000_000)), k=1000)
```
Interview Signals
What interviewers listen for:
- You can spot sampling bias in data and propose corrections
- You know what the bootstrap can do (estimate SEs, build CIs) and cannot do (fix bias)
- You can propose reservoir sampling in a streaming-data scenario
- You can distinguish when stratified vs. cluster sampling is appropriate
- You understand the three missing-data mechanisms (MCAR / MAR / MNAR)
Practice
Flashcards
What is the core difference between stratified sampling and cluster sampling?
Stratified: sample from every stratum (strata differ from each other; each stratum is internally homogeneous). Cluster: sample only some clusters (clusters resemble each other; each cluster is internally heterogeneous). Stratified sampling usually has lower variance; cluster sampling is cheaper.
Quiz
Your A/B test includes only iOS users, but 60% of the product's users are on Android. What problem is this?