A/B Testing & Experimentation
Interview Mode: Decision Reasoning
A/B testing is one of the most heavily weighted topics in data scientist interviews. Interviewers don't just check whether you can run a statistical test; they check whether you can design a complete experiment, define the right metrics, recognize the pitfalls, and turn the results into a product decision.
What You Should Understand
- Know why randomization controls confounders, but does not guarantee that every sample is balanced
- Be able to distinguish guardrail, primary, and secondary metrics
- Know what sample ratio mismatch is, and why it is a warning sign about experiment quality
- Be able to explain why an observed correlation does not necessarily support a causal claim
- Be able to do a sample size calculation and understand the relationship among effect size, power, and significance level
Core Concepts
Why Randomization Works
Randomization is the foundation of causal inference in experiments. By randomly assigning users to treatment and control:
- Observed confounders (age, device, location) are balanced in expectation
- Unobserved confounders (motivation, mood) are also balanced — this is the key advantage over observational studies
Randomization is probabilistic, not deterministic. In any finite sample the two groups can still differ by chance, and this is exactly what hypothesis testing is designed to handle.
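A quick simulation makes this concrete (a sketch with made-up numbers): across many random splits a pre-existing covariate is balanced on average, but any single split can be visibly off.

```python
import numpy as np

rng = np.random.default_rng(42)

# A pre-existing covariate (e.g., prior weekly engagement) with the same
# distribution for everyone; there is no treatment effect at all.
n = 200
engagement = rng.normal(loc=10.0, scale=5.0, size=n)

# One random 100/100 split: compare covariate means between the two arms.
assignment = rng.permutation(n) < n // 2
single_diff = engagement[assignment].mean() - engagement[~assignment].mean()
print(f"mean difference in one random split: {single_diff:.2f}")

# Repeat the split many times: differences center on 0 (balance in
# expectation) but individual splits scatter around it.
diffs = np.array([
    engagement[rng.permutation(n) < n // 2].mean() * 2 - engagement.mean() * 2
    for _ in range(2000)
])
print(f"avg difference: {diffs.mean():.3f}, sd of differences: {diffs.std():.3f}")
```

The standard deviation of the split-to-split difference is what a hypothesis test has to account for.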
Randomization ≠ Perfection
Randomization cannot fix instrumentation bugs, mis-defined metrics, interference effects, or selection bias. These are common causes of failed experiments, and randomization does nothing about them.
Experiment Design Checklist
A well-designed experiment answers these questions before launch:
- Hypothesis: What do we believe will happen, and why?
- Randomization unit: User, session, device, or page view?
- Primary metric: What single metric defines success?
- Guardrail metrics: What must not get worse?
- Sample size: How many users and how long?
- Triggering condition: Which users are eligible for the experiment?
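The checklist can be captured as a pre-launch spec. A minimal sketch as a Python dict; every field name and value here is illustrative, not any real platform's schema:

```python
# Hypothetical pre-launch experiment spec; names and values are illustrative.
experiment = {
    "hypothesis": "One-click checkout raises checkout completion",
    "randomization_unit": "user",
    "primary_metric": "checkout_completion_rate",
    "guardrail_metrics": ["page_load_time_p95", "support_contact_rate"],
    "traffic_split": {"control": 0.5, "treatment": 0.5},
    "min_sample_size_per_arm": 31_000,   # from the power calculation
    "planned_duration_days": 14,         # at least one full business cycle
    "trigger": "user reaches the checkout page",
}

# Sanity checks you would run before launch.
assert abs(sum(experiment["traffic_split"].values()) - 1.0) < 1e-9
assert experiment["primary_metric"] not in experiment["guardrail_metrics"]
print(f"spec OK: primary = {experiment['primary_metric']}")
```

Writing the spec down before launch is what prevents post-hoc metric shopping.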
Interview Tip
When an interviewer asks "how would you A/B test X?", start from this checklist instead of jumping straight to the p-value. It demonstrates systematic thinking.
Randomization Unit
The choice of randomization unit affects the analysis, metric definition, and validity:
| Unit | Pros | Cons | When to Use |
|---|---|---|---|
| User | Consistent experience, clean analysis | Slower to accumulate samples | Most product experiments |
| Session | More samples, faster | Same user sees different variants | Short-lived treatments (e.g., ranking) |
| Page view | Maximum samples | Inconsistent UX | Backend changes invisible to users |
| Cluster (e.g., market) | Handles network effects | Very few units, low power | Social features, marketplace |
The randomization unit determines your metric's denominator, your independence assumption, and your sample size calculation. Choosing the wrong unit can invalidate the entire experiment.
Metric Design
Metric Hierarchy
A robust experiment has three layers of metrics:
Primary metric (OEC — Overall Evaluation Criterion): The single metric that determines the launch decision. It should be:
- Directly movable by the treatment
- Measurable within the experiment timeframe
- Aligned with long-term business goals
Secondary metrics: Provide additional context about how the treatment works, helping you understand why the primary metric moved (or didn't).
Guardrail metrics: Must not degrade. They are the experiment's safety net:
- Page load time (performance)
- Crash rate (stability)
- Revenue per user (monetization)
- Customer support contacts (user satisfaction)
Good Metrics vs. Bad Metrics
A good metric should be:
| Property | Meaning | Bad Example |
|---|---|---|
| Sensitive | Detects real changes | Revenue (too noisy; driven by too many external factors) |
| Robust | Not easily gamed or noisy | Page views (can be inflated by auto-refresh) |
| Aligned | Connected to long-term value | Clicks (a click doesn't mean the user actually got value) |
| Timely | Observable within experiment duration | Annual retention rate (you cannot wait for it) |
Example — search engine improvement:
- Bad primary: Revenue → too many external factors, too much noise
- Better primary: Successful session rate → users found what they were looking for
- Good guardrail: Queries per session → an increase may mean users are struggling
Metric Surrogacy Problem
Short-term proxy metrics (e.g., clicks) don't always predict long-term outcomes (e.g., retention). You need to validate the relationship between the proxy and the true goal. Netflix found that some changes that lifted short-term engagement actually hurt long-term retention.
Sample Size & Power Analysis
The Power Calculation
Before running an experiment, compute the required sample size. For a two-sample z-test on means:

$$ n = \frac{2\,(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{\delta^2} $$

where:
- $n$ = sample size per group
- $z_{\alpha/2}$ = z-score for the significance level (1.96 for $\alpha = 0.05$)
- $z_\beta$ = z-score for the power (0.84 for power = 80%)
- $\sigma^2$ = variance of the metric
- $\delta$ = minimum detectable effect (MDE)

For proportions (e.g., a conversion rate moving from $p_1$ to $p_2$):

$$ n = \frac{(z_{\alpha/2} + z_\beta)^2\,\bigl(p_1(1-p_1) + p_2(1-p_2)\bigr)}{(p_2 - p_1)^2} $$
The Four Levers
Power depends on four interconnected factors:
| Factor | Increase → Power | Tradeoff |
|---|---|---|
| Sample size (n) | ↑ | Costs time and traffic |
| Effect size (δ) | ↑ | Can't always control; smaller effects need more data |
| Significance level (α) | ↑ | More false positives |
| Variance (σ²) | ↓ increases power | Use variance reduction techniques |
Variance Reduction: CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by leveraging pre-experiment behavior:

$$ Y^{\mathrm{cuped}} = Y - \theta\,(X - \bar{X}) $$

where $X$ is a pre-experiment covariate (e.g., last week's engagement) and $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$.
The idea is similar to regression adjustment: pre-experiment behavior explains part of the metric's variance, so subtracting out the individual differences that would exist anyway makes the treatment effect easier to detect.
CUPED can reduce variance by 30-50%, which is equivalent to getting 30-50% more samples for free. It is widely used at Microsoft, Netflix, Uber, and other companies.
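The adjustment fits in a few lines. This is a toy simulation, not production code; the covariate, noise levels, and the true effect of 2.0 are all made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: pre-experiment engagement X predicts the in-experiment
# metric Y; a true treatment effect of 2.0 is added on top.
n = 10_000
x = rng.normal(50, 20, size=n)                    # pre-period covariate
treated = rng.integers(0, 2, size=n).astype(bool)
y = 0.8 * x + rng.normal(0, 10, size=n) + 2.0 * treated

# CUPED adjustment: Y_cuped = Y - theta * (X - mean(X)),
# with theta = Cov(X, Y) / Var(X).
theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

# Compare the effect estimate and its standard error before and after.
for name, metric in [("raw", y), ("CUPED", y_cuped)]:
    diff = metric[treated].mean() - metric[~treated].mean()
    var = (metric[treated].var() / treated.sum()
           + metric[~treated].var() / (~treated).sum())
    print(f"{name:>5}: effect = {diff:.2f}, se = {np.sqrt(var):.3f}")
```

Both estimators recover an effect near 2.0, but the CUPED standard error is markedly smaller because the covariate soaks up most of the between-user variance.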
Interview Gold
Proactively mentioning CUPED in an interview signals hands-on experimentation experience. It is one of the most influential techniques in modern A/B testing.
Running the Experiment
How Long to Run
- Run for at least one full business cycle (typically 1-2 weeks) to capture day-of-week effects
- Do not stop early just because you see significance (peeking problem)
- Do not extend indefinitely hoping for significance (p-hacking)
The Peeking Problem
If you check your p-value daily and stop as soon as $p < 0.05$, you inflate your false positive rate far above 5%.
Checking once a day for 14 days: if the checks were independent, the chance of at least one false positive would be $1 - (1 - 0.05)^{14} \approx 0.51$, over 50%! (In practice the daily looks share data and are correlated, so the inflation is smaller, but still several times the nominal 5%.)
Solutions:
- Fixed-horizon testing: fix the experiment duration in advance and look only at the final result
- Sequential testing: Use alpha-spending functions (O'Brien-Fleming, Lan-DeMets) that control overall Type I error while allowing interim analyses
- Bayesian testing: Monitor continuously without p-value inflation
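The inflation is easy to demonstrate with an A/A simulation (traffic numbers are hypothetical; because the daily looks share accumulating data, the simulated rate lands below the independent-look bound but far above 5%):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A/A test: both arms convert at 10%, so every "significant" result is a
# false positive. Check the p-value after each daily batch and stop early
# at the first p < 0.05.
def peeked_to_false_positive(days=14, users_per_day=500, p=0.10):
    c = t = nc = nt = 0
    for _ in range(days):
        c += rng.binomial(users_per_day, p)
        t += rng.binomial(users_per_day, p)
        nc += users_per_day
        nt += users_per_day
        pool = (c + t) / (nc + nt)
        se = np.sqrt(pool * (1 - pool) * (1 / nc + 1 / nt))
        if se > 0:
            z = (t / nt - c / nc) / se
            if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:
                return True   # stopped early on a false positive
    return False

sims = 2000
fp_rate = sum(peeked_to_false_positive() for _ in range(sims)) / sims
print(f"false positive rate with daily peeking: {fp_rate:.1%}")
```

A single fixed-horizon look at day 14 would hold the rate at 5%; the extra 13 looks multiply it severalfold.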
Sample Ratio Mismatch (SRM)
If you designed a 50/50 split but observe 51.2/48.8, run a chi-squared test on the observed group counts:

$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$
If significant, do not trust the experiment results. Common SRM causes:
- Buggy randomization code
- Bots or crawlers assigned asymmetrically
- Redirect latency causing differential dropout
- Triggered analysis conditions that interact with treatment
SRM = Stop and Investigate
SRM is the most reliable diagnostic of experiment quality. If SRM is detected, the results cannot be trusted no matter what the primary metric looks like. Always check for SRM before interpreting results.
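The check is one call to `scipy.stats.chisquare`; the counts below are hypothetical, matching the 51.2/48.8 split above at 100k users:

```python
from scipy import stats

# Designed a 50/50 split; observed counts below (hypothetical numbers).
observed = [51_200, 48_800]      # users landing in control / treatment
expected = [50_000, 50_000]

chi2, p_srm = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.1f}, p = {p_srm:.2e}")

# A common convention is to flag SRM when p < 0.001. At this scale a
# 51.2/48.8 split is wildly improbable under a true 50/50 design.
if p_srm < 0.001:
    print("SRM detected: stop and investigate before reading any metrics")
```

Note that an imbalance that looks tiny in percentage terms becomes an extreme chi-squared statistic at large sample sizes.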
Common Pitfalls
Novelty and Primacy Effects
- Novelty effect: Users interact more with something new simply because it is new. The treatment effect fades over time.
- Primacy effect: Users resist change and initially perform worse, but adapt over time.
How to detect it: segment results by new vs. returning users, or by days since first exposure. If the treatment effect changes materially over time, you have a novelty or primacy problem.
Network Effects and Interference (SUTVA Violation)
Standard A/B tests assume SUTVA (Stable Unit Treatment Value Assumption): one user's assignment does not affect another's outcome. This is violated when:
- Social features: User A sees a new sharing button → shares with User B (in control) → B benefits without being treated
- Marketplace: Showing different prices to buyers affects seller supply, which affects all buyers
- Limited inventory: Recommending item X more to treatment means fewer items for control
Solutions:
- Cluster randomization: Randomize by geographic market, social cluster, or time
- Switchback experiments: Alternate treatment/control over time periods
- Ghost experiments: Log what the treatment would do without actually applying it
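Cluster randomization changes the analysis, not just the assignment: users within a cluster are correlated, so the effective sample size is the number of clusters. A toy sketch with assumed market-level effects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 20 markets, each with its own baseline conversion level; users within a
# market are correlated, violating user-level independence.
n_markets, users_per_market = 20, 500
market_effects = rng.normal(0, 0.03, size=n_markets)
treated_markets = rng.permutation(n_markets) < n_markets // 2

# Simulate user-level conversions with a market-level random effect and a
# small assumed treatment effect (+1pp).
market_means = []
for m in range(n_markets):
    p_m = np.clip(0.10 + market_effects[m] + 0.01 * treated_markets[m], 0, 1)
    conversions = rng.binomial(1, p_m, size=users_per_market)
    market_means.append(conversions.mean())
market_means = np.array(market_means)

# Correct analysis: a t-test on the 20 cluster means (n = 20, not 10,000).
t_stat, p_val = stats.ttest_ind(market_means[treated_markets],
                                market_means[~treated_markets])
print(f"cluster-level t = {t_stat:.2f}, p = {p_val:.3f} "
      f"(only {n_markets} effective units)")
```

Analyzing the 10,000 users as if they were independent would badly understate the standard error; with only 20 clusters, the test is honest but low-powered, which is the tradeoff named in the table above.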
Multiple Testing
Running 20 tests at α = 0.05 → expect 1 false positive. Corrections:
- Bonferroni: test each hypothesis at α/m; conservative, controls the family-wise error rate
- Benjamini-Hochberg (FDR): controls the false discovery rate; looser than Bonferroni but more practical
- Primary metric approach: designate one primary metric (no correction needed) and treat the others as exploratory
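The Benjamini-Hochberg step-up procedure is short enough to sketch directly (the p-values below are made up):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q.

    BH step-up: sort the m p-values, find the largest k with
    p_(k) <= (k / m) * q, and reject the k smallest.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])    # largest passing rank
        reject[order[: k + 1]] = True
    return reject

# 10 metrics: 3 with real effects (small p), 7 nulls.
pvals = [0.001, 0.004, 0.012, 0.20, 0.35, 0.41, 0.55, 0.62, 0.78, 0.91]
print(benjamini_hochberg(pvals))   # rejects exactly the first three

# Bonferroni at the same level would require p < 0.05 / 10 = 0.005,
# keeping only the first two: stricter, as noted above.
```

In practice you would reach for `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` rather than hand-rolling this, but the sketch shows why BH rejects more than Bonferroni.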
Interpreting Results
The Decision Framework
| Scenario | Action |
|---|---|
| Primary ↑, guardrails safe | Launch |
| Primary ↑, guardrail ↓ | Investigate tradeoff, may need redesign |
| Primary flat, guardrails safe | No launch — the feature does not help |
| Primary ↓ | Do not launch, analyze why |
| Inconclusive (underpowered) | Run longer, increase MDE, or use CUPED |
Statistical vs. Practical Significance
A p-value of 0.001 with a 0.01% conversion lift is statistically significant but practically meaningless. Always report:
- Effect size (absolute and relative)
- Confidence interval for the effect
- Business impact (estimated revenue, user hours, etc.)
How to Present Result Interpretation in an Interview
A good framing: "The primary metric improved significantly by X% (CI [a, b]), and no guardrail degraded significantly. However, [some secondary metric] shows a downward trend, so I would recommend continued monitoring after launch. At the current effect size, the estimated annual impact is about Y."
Real-World Use Cases
Case 1: Launching a Credit Card Fraud Detection Model
You trained a new fraud detection model and need an A/B test to decide whether to launch it.
Experiment design:
- Randomization unit: Transaction (not user), because different transactions from the same user can carry different risk
- Primary metric: Fraud loss rate (missed fraud amount / total transaction amount)
- Guardrail: False positive rate (share of legitimate transactions blocked); you cannot let too many legitimate transactions be declined
- Challenge: Fraud labels are delayed (chargebacks may be confirmed only 30-90 days later), so you need a proxy metric (e.g., the model score distribution) for early monitoring
Delayed Labels Problem
Ground truth in fraud detection is severely delayed, and you cannot wait 90 days to read out the experiment. The fix: use a short-term proxy (the share of transactions with model score above the threshold) paired with long-term validation (once labels arrive, verify that the proxy tracked the true outcome).
Case 2: Recommendation System Overhaul
An e-commerce platform is replacing its collaborative filtering recommender with a deep learning model.
Experiment design:
- Randomization unit: User, so each user sees a consistent recommendation experience
- Primary metric: Conversion rate (purchase / visit)
- Secondary: CTR on recommendations, average order value, items per order
- Guardrails: Page load time (is the new model's inference slower?), diversity of recommendations (does it only push popular items?)
- Challenge: Network effect: if treatment users buy up popular inventory, control users' recommendation quality degrades too (limited inventory)
Case 3: Subscription Pricing Strategy
A SaaS product wants to test a new pricing plan (raising the monthly fee to $12.99).
Experiment design:
- Randomization unit: New user (experiment on new users only, to avoid the primacy effect of raising prices on existing users)
- Primary metric: Revenue per user (30-day)
- Guardrails: Signup rate (how many people decline to sign up after the increase?), Day-7 retention
- Challenge:
- Long-term effects: short-term revenue may rise (those willing to pay pay more), but long-term churn may increase
- Ethical concerns: two prices for the same product at the same time can create trust problems
- Small effect, long timeline: pricing effects show up in LTV, which may require months of experimentation
Hands-on: A/B Testing in Python
Complete A/B Test Analysis
```python
import numpy as np
from scipy import stats

# Simulate an A/B test: new checkout flow (seeded for reproducibility)
rng = np.random.default_rng(42)
n_control, n_treatment = 5000, 5000
control = rng.binomial(1, 0.12, n_control)       # 12% baseline conversion
treatment = rng.binomial(1, 0.135, n_treatment)  # 13.5% treatment conversion
conv_c, conv_t = control.mean(), treatment.mean()

# Two-proportion z-test
p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
z_stat = (conv_t - conv_c) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

# 95% CI for the difference in conversion rates
se_diff = np.sqrt(conv_c * (1 - conv_c) / n_control
                  + conv_t * (1 - conv_t) / n_treatment)
diff = conv_t - conv_c
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

# SRM check (chi-squared test on the allocation ratio; here the split is
# exactly 50/50 by construction, so this is illustrative)
expected = (n_control + n_treatment) / 2
chi2_srm = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)
# p_srm > 0.01 → no sample ratio mismatch
```
Sample Size Calculation
```python
import numpy as np
from scipy import stats

def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Sample size per group for a two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1, p2 = baseline, baseline + mde
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil((z_alpha + z_beta) ** 2 * pooled_var / mde ** 2))

# baseline=5%, MDE=0.5%: ~31,000 per group
# baseline=5%, MDE=1%:   ~8,200 per group
# baseline=5%, MDE=2%:   ~2,200 per group
```
Interview Signals
What interviewers listen for:
- You don't just run the test; you first ask whether the experiment measures the right question
- You proactively raise experiment risks (novelty effect, interference, selection bias)
- You connect results back to the product mechanism instead of just reciting significance
- You know how to compute sample size and when variance reduction is needed
- You can distinguish statistical significance from practical significance
Practice
Flashcards
What is the role of a guardrail metric?
Even when the primary metric improves, you must confirm that other key qualities (latency, retention, support contact rate) were not harmed. Guardrails are the experiment's safety net, preventing a local optimization from hurting the overall product.
Quiz
The A/B test lifts conversion, but page load time degrades significantly. What is the most reasonable next step?