A/B Testing & Experimentation

Interview Mode: Decision Reasoning

A/B testing is one of the highest-priority topics in data scientist interviews. Interviewers test not just whether you can run a statistical test, but whether you can design a complete experiment, define the right metrics, recognize pitfalls, and turn results into product decisions.

What You Should Understand

  • Know why randomization controls for confounders, but does not guarantee that every sample is balanced
  • Be able to distinguish primary, secondary, and guardrail metrics
  • Know what sample ratio mismatch is, and why it is a red flag for experiment quality
  • Be able to explain why an observed correlation does not by itself support a causal claim
  • Be able to do a sample size calculation and understand the relationship between effect size, power, and significance level

Core Concepts

Why Randomization Works

Randomization is the foundation of causal inference in experiments. By randomly assigning users to treatment and control:

  • Observed confounders (age, device, location) are balanced in expectation
  • Unobserved confounders (motivation, mood) are also balanced — this is the key advantage over observational studies

Randomization is probabilistic, not deterministic. In any finite sample, the two groups may differ purely by chance; this is exactly what hypothesis testing is designed to handle.

Randomization ≠ Perfection

Randomization cannot fix instrumentation bugs, incorrect metric definitions, interference effects, or selection bias. These are common reasons experiments fail, and randomization does not address them.

Experiment Design Checklist

A well-designed experiment answers these questions before launch:

  1. Hypothesis: What do we believe will happen, and why?
  2. Randomization unit: User, session, device, or page view?
  3. Primary metric: What single metric defines success?
  4. Guardrail metrics: What must not get worse?
  5. Sample size: How many users and how long?
  6. Triggering condition: Which users are eligible for the experiment?

Interview Tip

When the interviewer asks "how would you A/B test X?", start from this checklist instead of jumping straight to p-values. It demonstrates systematic thinking.

Randomization Unit

The choice of randomization unit affects the analysis, metric definition, and validity:

  • User. Pros: consistent experience, clean analysis. Cons: slower to accumulate samples. When to use: most product experiments.
  • Session. Pros: more samples, faster. Cons: the same user sees different variants. When to use: short-lived treatments (e.g., ranking).
  • Page view. Pros: maximum samples. Cons: inconsistent UX. When to use: backend changes invisible to users.
  • Cluster (e.g., market). Pros: handles network effects. Cons: very few units, low power. When to use: social features, marketplaces.

The randomization unit determines your metric denominator, your independence assumption, and your sample size calculation. Choosing the wrong unit can invalidate the entire experiment.

Metric Design

Metric Hierarchy

A robust experiment has three layers of metrics:

Primary metric (OEC — Overall Evaluation Criterion): The single metric that determines the launch decision. It should be:

  • Directly movable by the treatment
  • Measurable within the experiment timeframe
  • Aligned with long-term business goals

Secondary metrics: Provide additional context about how the treatment works. They help you understand why the primary metric moved (or didn't).

Guardrail metrics: Must not degrade. They are the experiment's safety net:

  • Page load time (performance)
  • Crash rate (stability)
  • Revenue per user (monetization)
  • Customer support contacts (user satisfaction)

Good Metrics vs. Bad Metrics

A good metric should be:

  • Sensitive: detects real changes. Bad example: revenue (too noisy, affected by too many external factors).
  • Robust: not easily gamed or noisy. Bad example: page views (can be inflated by auto-refresh).
  • Aligned: connected to long-term value. Bad example: clicks (a click doesn't mean the user actually got value).
  • Timely: observable within the experiment duration. Bad example: annual retention rate (you can't wait for it).

Example — search engine improvement:

  • Bad primary: Revenue → affected by too many external factors, too noisy
  • Better primary: Successful session rate → users found what they were looking for
  • Good guardrail: Queries per session → an increase may mean users are struggling

Metric Surrogacy Problem

Short-term proxy metrics (e.g., clicks) don't always predict long-term outcomes (e.g., retention). You need to validate the relationship between the proxy and the true goal. Netflix found that some changes that boosted short-term engagement actually hurt long-term retention.

Sample Size & Power Analysis

The Power Calculation

Before running an experiment, compute the required sample size. For a two-sample z-test on means:

n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2}

where:

  • n = sample size per group
  • z_{\alpha/2} = z-score for the significance level (1.96 for \alpha = 0.05)
  • z_\beta = z-score for power (0.84 for power = 80%)
  • \sigma^2 = variance of the metric
  • \delta = minimum detectable effect (MDE)

For proportions (e.g., conversion rate p):

n = \frac{2(z_{\alpha/2} + z_\beta)^2 \cdot p(1-p)}{\delta^2}

The Four Levers

Power depends on four interconnected factors:

  • Sample size (n): more samples increase power, but cost time and traffic.
  • Effect size (\delta): larger effects are easier to detect, but you can't always control the effect; smaller effects need more data.
  • Significance level (\alpha): a looser \alpha increases power at the cost of more false positives.
  • Variance (\sigma^2): lower variance increases power; use variance reduction techniques.

Variance Reduction: CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by leveraging pre-experiment behavior:

\hat{Y}_{\text{cuped}} = \hat{Y} - \theta(\hat{X} - E[X])

where X is a pre-experiment covariate (e.g., last week's engagement) and \theta = \text{Cov}(Y, X) / \text{Var}(X).

The idea is similar to regression adjustment: use pre-experiment behavioral data to explain variance in the metric, subtracting out individual differences that would exist anyway, so the treatment effect becomes easier to detect.

CUPED can reduce variance by 30-50%, which is equivalent to getting 30-50% more samples for free. It is widely used at Microsoft, Netflix, Uber, and other companies.
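The adjustment fits in a few lines of NumPy. This is a simulation with made-up numbers (the covariate weight, noise level, and +0.3 true treatment effect are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Pre-experiment covariate: last week's engagement per user
x = rng.normal(10, 3, n)
assignment = rng.integers(0, 2, n)  # 0 = control, 1 = treatment

# In-experiment metric: driven largely by past behavior, plus a +0.3 true effect
y = 0.8 * x + rng.normal(0, 2, n) + 0.3 * assignment

# CUPED: theta = Cov(Y, X) / Var(X)
theta = np.cov(y, x)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

var_reduction = 1 - np.var(y_cuped) / np.var(y)
diff_cuped = y_cuped[assignment == 1].mean() - y_cuped[assignment == 0].mean()
print(f"theta={theta:.2f}, variance reduced by {var_reduction:.0%}, "
      f"estimated effect={diff_cuped:.3f}")
```

The estimated effect is unchanged in expectation, because X is measured before the experiment and is therefore independent of assignment; only the noise around the estimate shrinks.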

Interview Gold

Proactively mentioning CUPED in an interview signals hands-on experimentation experience. It is one of the most influential techniques in modern A/B testing.

Running the Experiment

How Long to Run

  • Run for at least one full business cycle (typically 1-2 weeks) to capture day-of-week effects
  • Do not stop early just because you see significance (peeking problem)
  • Do not extend indefinitely hoping for significance (p-hacking)

The Peeking Problem

If you check your p-value daily and stop when p < 0.05, you inflate your false positive rate far above 5%:

P(\text{at least one false positive}) = 1 - (1 - \alpha)^{k} \quad \text{(k = number of checks)}

Checking once a day for 14 days: 1 - 0.95^{14} \approx 0.51, a false positive rate above 50%! (This formula treats the daily checks as independent, which overstates the inflation somewhat since successive peeks are correlated, but the qualitative point stands.)
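A quick simulation of an A/A test (no true effect) with daily peeking makes the inflation concrete; the sample sizes and number of peeks are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_days, users_per_day = 1000, 14, 200

false_positives = 0
for _ in range(n_sims):
    # A/A test: both groups draw from the same distribution (no true effect)
    a = rng.normal(0, 1, (n_days, users_per_day))
    b = rng.normal(0, 1, (n_days, users_per_day))
    for day in range(1, n_days + 1):
        # Peek: test on all data accumulated so far, stop at first p < 0.05
        _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

fpr = false_positives / n_sims
print(f"False positive rate with daily peeking: {fpr:.1%}")  # well above 5%
```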

Solutions:

  • Fixed-horizon testing: decide the experiment duration in advance and look only at the final result
  • Sequential testing: Use alpha-spending functions (O'Brien-Fleming, Lan-DeMets) that control overall Type I error while allowing interim analyses
  • Bayesian testing: Monitor P(\text{B better than A} \mid \text{data}) continuously without p-value inflation

Sample Ratio Mismatch (SRM)

If you designed a 50/50 split but observe 51.2/48.8, run a chi-squared test:

\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}

If significant, do not trust the experiment results. Common SRM causes:

  • Buggy randomization code
  • Bots or crawlers assigned asymmetrically
  • Redirect latency causing differential dropout
  • Triggered analysis conditions that interact with treatment

SRM = Stop and Investigate

SRM is the most reliable diagnostic of experiment quality. If you detect SRM, the results cannot be trusted no matter how the primary metric looks. Always check for SRM before interpreting results.

Common Pitfalls

Novelty and Primacy Effects

  • Novelty effect: Users interact more with something new simply because it is new. The treatment effect fades over time.
  • Primacy effect: Users resist change and initially perform worse, but adapt over time.

How to detect: segment the results by new vs. returning users, or by days of exposure. If the treatment effect changes significantly over time, you have a novelty/primacy problem.
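A sketch of the exposure-time segmentation on simulated data (the decay curve, baseline metric, and sample sizes are illustrative assumptions): if the effect were real rather than novelty, the weekly estimates would stay roughly flat.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_days = 5000, 28

# Assumed novelty effect: lift starts near +0.5 and decays toward 0
days = np.arange(1, n_days + 1)
true_lift = 0.5 * np.exp(-days / 7)

# Daily metric values for control and treatment users
control = rng.normal(5.0, 1.0, (n_days, n_users))
treatment = rng.normal(5.0 + true_lift[:, None], 1.0, (n_days, n_users))

# Segment the observed treatment effect by week of exposure
weekly_effect = [
    treatment[w * 7:(w + 1) * 7].mean() - control[w * 7:(w + 1) * 7].mean()
    for w in range(4)
]
print([f"{e:.3f}" for e in weekly_effect])  # shrinks week over week
```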

Network Effects and Interference (SUTVA Violation)

Standard A/B tests assume SUTVA (Stable Unit Treatment Value Assumption): one user's assignment does not affect another's outcome. This is violated when:

  • Social features: User A sees a new sharing button → shares with User B (in control) → B benefits without being treated
  • Marketplace: Showing different prices to buyers affects seller supply, which affects all buyers
  • Limited inventory: Recommending item X more to treatment means fewer items for control

Solutions:

  • Cluster randomization: Randomize by geographic market, social cluster, or time
  • Switchback experiments: Alternate treatment/control over time periods
  • Ghost experiments: Log what the treatment would do without actually applying it

Multiple Testing

Running 20 tests at \alpha = 0.05 → expect 1 false positive. Corrections:

  • Bonferroni: \alpha_{\text{adj}} = \alpha / m — conservative, controls the family-wise error rate
  • Benjamini-Hochberg (FDR): controls the false discovery rate — looser than Bonferroni but more practical
  • Primary metric approach: designate one primary metric (no correction needed) and treat the rest as exploratory
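Both corrections fit in a few lines. Here `benjamini_hochberg` is a hand-rolled helper and the p-values are invented for illustration; in practice `statsmodels.stats.multitest.multipletests` implements both:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected under BH (controls FDR)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest i with p_(i) <= alpha * i / m
        rejected[order[:k + 1]] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.32, 0.90]
bonferroni = np.asarray(p_values) <= 0.05 / len(p_values)
bh = benjamini_hochberg(p_values)
print(bonferroni.sum(), bh.sum())  # Bonferroni rejects 1, BH rejects 2
```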

Interpreting Results

The Decision Framework

  • Primary ↑, guardrails safe: launch.
  • Primary ↑, guardrail ↓: investigate the tradeoff; may need a redesign.
  • Primary flat, guardrails safe: no launch — the feature does not help.
  • Primary ↓: do not launch; analyze why.
  • Inconclusive (underpowered): run longer, increase the MDE, or use CUPED.

Statistical vs. Practical Significance

A p-value of 0.001 with a 0.01% conversion lift is statistically significant but practically meaningless. Always report:

  1. Effect size (absolute and relative)
  2. Confidence interval for the effect
  3. Business impact (estimated revenue, user hours, etc.)

How to Interpret Results in an Interview

A good framing: "The primary metric improved significantly by X% (CI [a, b]), and the guardrails showed no significant degradation. However, [some secondary metric] trended down, so I recommend continued monitoring after launch. At the current effect size, the estimated annual impact is roughly Y."

Real-World Use Cases

Case 1: Launching a Credit Card Fraud Detection Model

You trained a new fraud detection model and need an A/B test to decide whether to ship it.

Experiment design:

  • Randomization unit: Transaction (not user) — different transactions from the same user can carry different risk
  • Primary metric: Fraud loss rate (missed fraud dollars / total transaction dollars)
  • Guardrail: False positive rate (share of legitimate transactions blocked) — you can't let too many legitimate transactions get declined
  • Challenge: fraud labels are delayed (chargebacks may only be confirmed 30-90 days later), so you need a proxy metric (e.g., the model score distribution) for early monitoring

Delayed Labels Problem

Ground truth in fraud detection is severely delayed; you can't wait 90 days for experiment results. Solution: use a short-term proxy (share of transactions with model score above threshold) together with long-term validation (confirm the proxy's accuracy once the labels arrive).

Case 2: Recommendation System Overhaul

An e-commerce platform switches its recommendation algorithm from collaborative filtering to a deep learning model.

Experiment design:

  • Randomization unit: User — ensures each user sees a consistent recommendation experience
  • Primary metric: Conversion rate (purchase / visit)
  • Secondary: CTR on recommendations, average order value, items per order
  • Guardrails: Page load time (is the new model slower at inference?), diversity of recommendations (is it only pushing popular items?)
  • Challenge: Network effect — if treatment users buy up popular items, recommendation quality for control users also suffers (limited inventory)

Case 3: Subscription Pricing Strategy

A SaaS product wants to test a new pricing plan (monthly fee raised from $9.99 to $12.99).

Experiment design:

  • Randomization unit: New user (experiment only on new users, to avoid the primacy effect of raising prices on existing users)
  • Primary metric: Revenue per user (30-day)
  • Guardrails: Signup rate (how many people decline to sign up after the price increase?), Day-7 retention
  • Challenge:
    • Long-term effects: short-term revenue may rise (those willing to pay pay more), but long-term churn may increase
    • Ethical concerns: two prices for the same product at the same time may raise trust issues
    • Small effect, long timeline: pricing effects show up in LTV, which may require running the experiment for months

Hands-on: A/B Testing in Python

Complete A/B Test Analysis

import numpy as np
from scipy import stats

# Simulate A/B test: new checkout flow
n_control, n_treatment = 5000, 5000
control = np.random.binomial(1, 0.12, n_control)    # 12% baseline
treatment = np.random.binomial(1, 0.135, n_treatment)  # 13.5% treatment

conv_c, conv_t = control.mean(), treatment.mean()

# Two-proportion z-test
p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment)
se = np.sqrt(p_pool * (1 - p_pool) * (1/n_control + 1/n_treatment))
z_stat = (conv_t - conv_c) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

# 95% CI for the difference
se_diff = np.sqrt(conv_c*(1-conv_c)/n_control + conv_t*(1-conv_t)/n_treatment)
diff = conv_t - conv_c
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

# SRM check (chi-squared test on allocation ratio)
expected = (n_control + n_treatment) / 2
chi2_srm = ((n_control - expected)**2 + (n_treatment - expected)**2) / expected
p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)
# p_srm > 0.01 → no sample ratio mismatch

Sample Size Calculation

from scipy import stats
import numpy as np

def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Sample size per group for a two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1, p2 = baseline, baseline + mde
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil((z_alpha + z_beta) ** 2 * pooled_var / mde ** 2))

# baseline=5%, MDE=0.5%: ~31,000 per group
# baseline=5%, MDE=1%:   ~8,200 per group
# baseline=5%, MDE=2%:   ~2,200 per group

Interview Signals

What interviewers listen for:

  • You don't just run tests — you first ask whether the experiment measures the right question
  • You proactively raise experiment risks (novelty effect, interference, selection bias)
  • You connect results back to product mechanisms instead of just reciting significance
  • You know how to compute sample size, and when variance reduction is needed
  • You can distinguish statistical significance from practical significance

Practice

Flashcards


What is the role of a guardrail metric?

Even if the primary metric improves, you must confirm that other key qualities (latency, retention, support contact rate) weren't harmed. Guardrails are the experiment's safety net, preventing local optimization from hurting the overall product.


Quiz


An A/B test shows a conversion lift, but page load time gets much worse. What is the most reasonable next step?
