Sampling Methods

Interview Context

Sampling is where statistical inference starts: if the sample is drawn from the population badly, every downstream analysis is distorted. In interviews this usually appears in two forms: (1) "Your data has selection bias. What do you do?" (2) "You have 10 billion records. How do you sample them efficiently for analysis?"

What You Should Understand

  • Know why random sampling is the foundation of inference, and when it falls short
  • Be able to compare when stratified, cluster, and systematic sampling each apply
  • Understand how the bootstrap works and what it can (and cannot) do
  • Know how reservoir sampling draws an equal-probability sample from streaming data
  • Be able to identify types of sampling bias and propose corrections

Core Concepts

Why Sampling Matters

Statistical inference requires that the sample represents the population. If it doesn't, no amount of sophisticated analysis can fix the conclusions.

Two key concepts:

  • Sampling frame: the list you can actually sample from. If the frame itself is incomplete (e.g., it contains only online users and no offline users), your inferences suffer from coverage bias.
  • Sampling mechanism: how individuals are selected from the frame. Non-random mechanisms almost always introduce bias.

Simple Random Sampling (SRS)

Every individual in the population has an equal probability of being selected. This is the gold standard, but often impractical.

P(\text{individual } i \text{ selected}) = \frac{n}{N}

where n is the sample size and N is the population size.

SRS is theoretically cleanest, but has practical limitations:

  • It requires a complete population list, which often does not exist
  • If the population has subgroups (e.g., users in different countries), SRS can under-represent small groups
  • It is inefficient for rare events (e.g., fraud at 0.1% of transactions is hard to capture in sufficient numbers with SRS)
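As a minimal sketch, SRS from an in-memory population with NumPy (the population array and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1_000_000)  # N = 1,000,000

# SRS without replacement: every unit has inclusion probability n/N
n = 1_000
sample = rng.choice(population, size=n, replace=False)

# No duplicates, since we sampled without replacement
assert len(np.unique(sample)) == n
```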

Stratified Sampling

Divide the population into mutually exclusive strata (subgroups), then sample independently from each:

\bar{x}_{\text{stratified}} = \sum_{h=1}^{H} W_h \bar{x}_h, \quad W_h = \frac{N_h}{N}

where W_h is the population weight of stratum h.

Two Allocation Strategies

Proportional allocation: Sample from each stratum in proportion to its size.

n_h = n \cdot \frac{N_h}{N}

This guarantees every stratum is represented.

Optimal (Neyman) allocation: Sample more from strata with higher variance.

n_h \propto N_h \cdot \sigma_h

If a stratum has high variance (e.g., spending among high-value customers varies widely), you need to sample more from it to get a stable estimate.
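The two allocation rules can be compared numerically; the stratum sizes and standard deviations below are made-up illustrative values:

```python
import numpy as np

N_h = np.array([7000, 2000, 1000])      # stratum sizes (N = 10,000)
sigma_h = np.array([5.0, 20.0, 80.0])   # stratum standard deviations
n = 500                                  # total sample budget

# Proportional allocation: n_h = n * N_h / N
prop = n * N_h / N_h.sum()               # -> [350., 100., 50.]

# Neyman allocation: n_h proportional to N_h * sigma_h
weights = N_h * sigma_h
neyman = n * weights / weights.sum()
# The smallest but highest-variance stratum gets far more samples
# under Neyman than under proportional allocation
```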

When to Use Stratified Sampling

| Scenario | Why Stratified | Strata |
|---|---|---|
| A/B test across countries | Ensure each country is adequately represented | Country |
| Survey across income levels | High earners are few but high-variance | Income bracket |
| Model evaluation on rare classes | Ensure enough samples of the minority class | Fraud / not fraud |
| Cross-platform analysis | Mobile and desktop behavior differ substantially | Platform |

Interview Connection: Stratified K-Fold

sklearn's StratifiedKFold is stratified sampling applied to cross-validation: it ensures each fold's class distribution matches the overall distribution. With plain KFold on imbalanced classification, some folds may contain no minority-class samples at all.
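A quick check of that behavior on synthetic labels with a 5% minority class:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 50 positives out of 1000 (5%)
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((1000, 1))  # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every fold preserves the ~5% positive rate
    assert abs(y[test_idx].mean() - 0.05) < 1e-9
```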

Cluster Sampling

Divide the population into clusters (e.g., geographic regions, schools), randomly select some clusters, then sample all individuals within selected clusters.

Key differences from stratified sampling:

  • Stratified: sample from every stratum (strata differ from each other; each stratum is internally homogeneous)
  • Cluster: sample only some clusters (clusters resemble each other; each cluster is internally heterogeneous)

Two-Stage Cluster Sampling

  1. Randomly select m clusters from M total
  2. Within each selected cluster, randomly sample n_i individuals

Cluster sampling is very common in practice (e.g., A/B tests that assign traffic by city), but its variance is usually higher than SRS because individuals within a cluster tend to be similar (intraclass correlation).
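A minimal two-stage sketch, assuming the population is already grouped into clusters (the cluster IDs and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: 20 clusters (e.g., cities), each with 1000 individuals
clusters = {c: list(range(c * 1000, c * 1000 + 1000)) for c in range(20)}

# Stage 1: randomly select m clusters out of M = 20
m = 4
chosen = rng.choice(list(clusters), size=m, replace=False)

# Stage 2: randomly sample n_i individuals within each chosen cluster
n_i = 50
sample = [
    x
    for c in chosen
    for x in rng.choice(clusters[c], size=n_i, replace=False)
]
# len(sample) == m * n_i == 200
```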

Systematic Sampling

Select every k-th element from an ordered list, starting from a random point:

k = \left\lfloor \frac{N}{n} \right\rfloor, \quad \text{start} \sim \text{Uniform}(1, k)

Simple and efficient, but if the list has periodicity (e.g., a Monday-to-Sunday cycle), systematic sampling may happen to pick data from only certain days.
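A sketch of the procedure (here the start offset is 0-indexed, i.e., Uniform(0, k-1), the usual array convention):

```python
import numpy as np

def systematic_sample(data, n, rng=None):
    """Select every k-th element starting from a random offset."""
    if rng is None:
        rng = np.random.default_rng()
    k = len(data) // n
    start = int(rng.integers(0, k))  # random start in [0, k)
    return data[start::k][:n]

data = np.arange(10_000)
sample = systematic_sample(data, n=100, rng=np.random.default_rng(7))
# Selected indices are evenly spaced k = 100 apart
```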

Bootstrap

The Core Idea

Bootstrap is a resampling method: treat your sample as if it were the population, and simulate the sampling distribution by drawing with replacement.

\text{Bootstrap sample } b: \quad \mathbf{x}^{*(b)} = \text{sample } n \text{ observations with replacement from } \mathbf{x}

For B bootstrap iterations:

  1. Draw a bootstrap sample of size n (with replacement)
  2. Compute the statistic of interest (mean, median, coefficient, etc.)
  3. Repeat B times (typically B = 1000 to 10000)
  4. The distribution of the B statistics approximates the sampling distribution

Bootstrap Confidence Interval

Percentile method (simplest):

\text{95\% CI} = \left[\hat{\theta}^*_{0.025},\; \hat{\theta}^*_{0.975}\right]

That is, the 2.5th and 97.5th percentiles of the B bootstrap estimates.

BCa method (bias-corrected and accelerated): corrects for the bias and skewness of the bootstrap distribution; usually more accurate.
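The percentile method applied to a sample median, as a minimal sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=10, size=500)   # a skewed sample

# B bootstrap resamples, computing the median each time
B = 2000
boot_medians = np.array([
    np.median(rng.choice(x, size=len(x), replace=True))
    for _ in range(B)
])

# Percentile CI: 2.5th and 97.5th percentiles of the bootstrap estimates
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
```

The same recipe works for any statistic with no closed-form standard error; only the `np.median` call changes.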

When Bootstrap Works (and Doesn't)

| Works Well | Doesn't Work |
|---|---|
| Estimating the SE of complex statistics (median, quantiles, ratios) | Extremes / tail probabilities (the bootstrap cannot produce values more extreme than the original data) |
| Model comparison (bootstrap the difference in AUC) | Very small samples (n < 20): resampling the same few points gives unstable variance estimates |
| Non-standard estimators without closed-form SE | Dependent data (time series): requires block bootstrap |
| Hypothesis testing (permutation test variant) | Populations with infinite variance (heavy tails) |
import numpy as np
from sklearn.metrics import precision_score

# Bootstrap: compare the precision of two models on the same test set.
# y_true, y_pred_A, y_pred_B must be numpy arrays (for fancy indexing).
def bootstrap_ci(y_true, y_pred_A, y_pred_B, n_iter=2000, alpha=0.05):
    diffs = []
    n = len(y_true)
    for _ in range(n_iter):
        idx = np.random.choice(n, n, replace=True)
        p_A = precision_score(y_true[idx], y_pred_A[idx], zero_division=0)
        p_B = precision_score(y_true[idx], y_pred_B[idx], zero_division=0)
        diffs.append(p_B - p_A)
    lower = np.percentile(diffs, 100 * alpha / 2)
    upper = np.percentile(diffs, 100 * (1 - alpha / 2))
    return lower, upper

# If the CI excludes 0 (entirely positive), model B is significantly better

Bootstrap ≠ Magic

The bootstrap cannot fix sampling bias. If your original sample does not represent the population (e.g., iOS users only), resampling it a million times just replays the same biased sample. The bootstrap improves estimation precision, not representativeness.

Reservoir Sampling

The Problem

You have a stream of data (possibly infinite or of unknown length N), and you need to maintain a random sample of exactly k items, with each item having equal probability of being selected.

Algorithm R (Vitter, 1985)

import random

def reservoir_sampling(stream, k):
    """Maintain a uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k/(i+1);
            # it replaces a uniformly chosen reservoir slot
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

Why it works: after seeing i+1 items, each item is in the reservoir with probability k/(i+1). This invariant can be proved by induction.
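The invariant can also be checked empirically: run the algorithm many times and confirm each item lands in the reservoir with frequency close to k/N (the function is redefined here so the snippet is self-contained):

```python
import random
from collections import Counter

def reservoir_sampling(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

random.seed(0)
trials, N, k = 20_000, 50, 5
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sampling(range(N), k))

# Each of the N items should appear in roughly k/N = 10% of trials
freqs = [counts[x] / trials for x in range(N)]
```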

Use Cases

  • Log sampling: millions of request logs per second; keep 1% for analysis
  • A/B test traffic: randomly draw the experiment population from all users
  • ML training: maintain a representative training set from streaming data
  • Database sampling: e.g., SELECT * FROM table TABLESAMPLE RESERVOIR(1000); some databases implement sampling with reservoir sampling under the hood

Frequent Interview Question

"You have an infinitely long data stream, you get a single pass, and memory holds only k items. How do you make every item equally likely to be selected?" The answer is reservoir sampling. Remember the core of Algorithm R: after i items have been seen, each one is retained with probability k/i.

Importance Sampling

The Problem

You want to estimate E_p[f(X)], but sampling from p(x) is either difficult or inefficient (e.g., rare-event probabilities).

The Solution

Sample from a proposal distribution q(x) instead, and correct with importance weights:

E_p[f(X)] = E_q\!\left[f(X) \cdot \frac{p(X)}{q(X)}\right] \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \cdot w_i, \quad w_i = \frac{p(x_i)}{q(x_i)}

Intuition

If a region has low probability under p (so it is rarely sampled) but f(x) matters there, you can sample from a q that puts more mass on that region, then correct with the weight p/q.

Practical Concerns

  • Variance: if q is very different from p, some weights become extremely large, which inflates variance. A good q keeps |f(x) \cdot p(x)/q(x)| as stable as possible.
  • Effective sample size: n_{\text{eff}} = \frac{(\sum_i w_i)^2}{\sum_i w_i^2}. If most of the weight concentrates on a few samples, the effective sample size is small.
  • Self-normalized importance sampling: when p is known only in unnormalized form, use \hat{w}_i = w_i / \sum_j w_j.
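A sketch of the rare-event use case: estimating P(X > 4) for X ~ N(0, 1), whose true value is about 3.17e-5. The shifted-exponential proposal below is one convenient choice (an assumption of this sketch, not the only option); it puts all its mass in the rare region, which keeps the weights stable:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000

# Naive Monte Carlo essentially never draws x > 4 at this sample size
naive = (rng.standard_normal(n) > 4).mean()

# Proposal q: an Exp(rate=4) tail shifted to start at 4, so every draw
# lands in the rare region and f(x) = 1{x > 4} is always 1
x = 4.0 + rng.exponential(scale=0.25, size=n)
q_pdf = 4.0 * np.exp(-4.0 * (x - 4.0))
w = norm.pdf(x) / q_pdf                 # importance weights p/q
is_est = w.mean()

# Effective sample size: close to n when the weights are stable
n_eff = w.sum() ** 2 / (w ** 2).sum()
```

With a poor proposal (say, one that leaves most of its mass where f(x) = 0), n_eff collapses and the estimate becomes unreliable; checking it is a cheap diagnostic.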

Use Cases

| Application | Why Importance Sampling |
|---|---|
| Rare event simulation | Direct sampling almost never hits the rare event; a q weighted toward it improves efficiency |
| Bayesian inference | Approximate posterior expectations with a tractable proposal (self-normalized IS when the posterior is unnormalized) |
| Off-policy evaluation (RL) | Evaluate a target policy using data collected under a behavior policy |
| Counterfactual evaluation | Simulate the effect of a different policy from logged data |

Sampling Bias

Types of Bias

| Bias | Description | Example |
|---|---|---|
| Selection bias | The sample does not represent the population | Online surveys answered only by people who opt in (self-selection) |
| Survivorship bias | Only "surviving" individuals are observed | Studying traits of successful companies while ignoring failed ones (e.g., analyzing listed stocks while ignoring delisted ones) |
| Non-response bias | Non-responders differ systematically from responders | High-income groups are less likely to answer salary surveys |
| Convenience bias | Using whatever data is easiest to obtain | Analyzing only the company's existing users while ignoring churned users |
| Observation bias (Hawthorne) | People change behavior when observed | Users in an A/B test who know they are in an experiment (usually they do not) |
| Time bias | The collection period is not representative | Modeling purchase behavior using only Black Friday data |

Bias Correction Techniques

Inverse Probability Weighting (IPW):

\hat{\mu} = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i}, \quad w_i = \frac{1}{P(\text{selected}_i)}

Under-represented individuals receive higher weight. Widely used in causal inference (propensity score weighting) and survey statistics.

Post-stratification: reweight to a known population distribution (e.g., if the population is half male and half female but your sample is 70% male, give female respondents higher weight).
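Post-stratification on that 70/30 example, as a sketch with synthetic data (the group means are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Biased sample: 70% male, 30% female; the outcome differs by group
n = 1_000
is_male = rng.random(n) < 0.7
y = np.where(is_male, rng.normal(10, 2, n), rng.normal(20, 2, n))

naive_mean = y.mean()   # biased toward the over-represented group (~13)

# Reweight to the known 50/50 population split
w_male = 0.5 / is_male.mean()
w_female = 0.5 / (1 - is_male.mean())
w = np.where(is_male, w_male, w_female)
adjusted_mean = (w * y).sum() / w.sum()   # ~15, the population mean
```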

Doubly robust estimation: combine an outcome model with a propensity model; the estimator is consistent as long as either one is correctly specified.

Real-World Use Cases

Case 1: Credit Card Fraud Detection — Undersampling & Oversampling

Fraud is only 0.1% of transactions. Train a model on an SRS and it will learn that always predicting "not fraud" achieves 99.9% accuracy.

Solutions:

  • Random undersampling: keep only as many majority-class samples as there are minority-class samples. Simple, but discards information.
  • SMOTE (Synthetic Minority Oversampling): generate synthetic minority samples by k-NN interpolation in feature space. Retains more information but can overfit.
  • Stratified train/test split: keep the fraud ratio identical in train and test.
from sklearn.model_selection import train_test_split

# Stratified split: preserve class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
# y_train and y_test both have ~0.1% fraud

Case 2: Recommender Systems — Popularity Bias & Exposure Bias

Logged data from a recommender system carries severe selection bias: you only observe items that were recommended to and clicked by a user; you never observe the user's preferences over everything else.

This is exposure bias, a form of missing not at random (MNAR) data. Training directly on it pushes the model to recommend already-popular items even more (a popularity-bias feedback loop).

Solutions:

  • Inverse propensity scoring: upweight low-exposure items
  • Exploration: periodically recommend random items to collect unbiased data
  • Causal inference: correct with propensity scores or instrumental variables

Case 3: House Price Prediction — Non-Random Missing Data

In housing data, the "renovation year" feature is 30% missing. The missingness is not random: older houses are more likely to lack renovation records.

| Missing Mechanism | Description | Handling |
|---|---|---|
| MCAR (Missing Completely At Random) | Missingness is unrelated to any variable | Drop rows or simple imputation |
| MAR (Missing At Random) | Missingness depends only on observed variables | Predict missing values from other features |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | Hardest to handle; requires domain knowledge or sensitivity analysis |

The missing renovation year is MNAR (houses that were never renovated naturally have no renovation year), so plain mean imputation introduces serious bias. A better approach: add a binary has_renovation feature and use model-based imputation.
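A sketch of that approach on synthetic data (column names and the imputation rule are illustrative; a real pipeline would use a learned imputer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "build_year": rng.integers(1950, 2020, 500),
    "renovation_year": np.nan,  # filled below
})
# Simulate MNAR: newer houses are more likely to have a renovation record
has_record = rng.random(500) < (df["build_year"] - 1950) / 70
df.loc[has_record, "renovation_year"] = df.loc[has_record, "build_year"] + 10

# 1) Keep the missingness itself as a feature
df["has_renovation"] = df["renovation_year"].notna().astype(int)

# 2) Model-based imputation sketch: fill missing values from build_year
#    (a trivial linear rule here; in practice fit an imputation model)
mask = df["renovation_year"].isna()
df.loc[mask, "renovation_year"] = df.loc[mask, "build_year"] + 10
```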

Hands-on: Sampling in Python

Stratified Sampling with pandas

import pandas as pd
import numpy as np

# Stratified sample: 10% from each group
df = pd.DataFrame({
    "user_id": range(10000),
    "country": np.random.choice(["US", "TW", "JP"], 10000, p=[0.7, 0.2, 0.1]),
    "revenue": np.random.exponential(50, 10000),
})

stratified = df.groupby("country").sample(frac=0.1, random_state=42)
# Each country represented proportionally

Reservoir Sampling

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 1000 items from a stream of unknown length
sample = reservoir_sample(iter(range(10_000_000)), k=1000)

Interview Signals

What interviewers listen for:

  • You can identify sampling bias in data and propose corrections
  • You know what the bootstrap can do (estimate SEs, build CIs) and cannot do (fix bias)
  • You propose reservoir sampling in streaming-data scenarios
  • You can distinguish when stratified vs. cluster sampling applies
  • You understand the three missing-data mechanisms (MCAR / MAR / MNAR)

Practice

Flashcards


What is the core difference between stratified sampling and cluster sampling?

Stratified: sample from every stratum (strata differ from each other; each stratum is internally homogeneous). Cluster: sample only some clusters (clusters resemble each other; each cluster is internally heterogeneous). Stratified usually has lower variance; cluster is cheaper.


Quiz


Your A/B test includes only iOS users, but 60% of the product's user base is on Android. What problem is this?
