A/B Testing & Experimentation
Interview Mode: Decision Reasoning
A/B testing is one of the most heavily weighted topics in data scientist interviews. Interviewers don't just check whether you can run a statistical test; they check whether you can design a complete experiment, define the right metrics, recognize the pitfalls, and turn the results into a product decision.
What You Should Understand
- Know why randomization controls confounders, but does not guarantee that every sample is balanced
- Be able to distinguish guardrail, primary, and secondary metrics
- Know what sample ratio mismatch is, and why it is a warning sign about experiment quality
- Be able to explain why an observed correlation does not necessarily support a causal claim
- Be able to do a sample size calculation and understand the relationship among effect size, power, and significance level
Core Concepts
Why Randomization Works
Randomization is the foundation of causal inference in experiments. By randomly assigning users to treatment and control:
- Observed confounders (age, device, location) are balanced in expectation
- Unobserved confounders (motivation, mood) are also balanced — this is the key advantage over observational studies
Randomization is probabilistic, not deterministic. In any finite sample the two groups can still differ by chance, and this is exactly what hypothesis testing is designed to handle.
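A quick simulation makes this concrete (a sketch with made-up numbers): across many random splits a pre-existing covariate is balanced on average, but any single split can be visibly off.

```python
import numpy as np

rng = np.random.default_rng(42)

# A pre-existing covariate (e.g., prior weekly engagement) with the same
# distribution for everyone; there is no treatment effect at all.
n = 200
engagement = rng.normal(loc=10.0, scale=5.0, size=n)

# One random 100/100 split: compare covariate means between the two arms.
assignment = rng.permutation(n) < n // 2
single_diff = engagement[assignment].mean() - engagement[~assignment].mean()
print(f"mean difference in one random split: {single_diff:.2f}")

# Repeat the split many times: differences center on 0 (balance in
# expectation) but individual splits scatter around it.
diffs = np.array([
    engagement[rng.permutation(n) < n // 2].mean() * 2 - engagement.mean() * 2
    for _ in range(2000)
])
print(f"avg difference: {diffs.mean():.3f}, sd of differences: {diffs.std():.3f}")
```

The standard deviation of the split-to-split difference is what a hypothesis test has to account for.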
Randomization ≠ Perfection
Randomization cannot fix instrumentation bugs, mis-defined metrics, interference effects, or selection bias. These are common causes of failed experiments, and randomization does nothing about them.
Experiment Design Checklist
A well-designed experiment answers these questions before launch:
- Hypothesis: What do we believe will happen, and why?
- Randomization unit: User, session, device, or page view?
- Primary metric: What single metric defines success?
- Guardrail metrics: What must not get worse?
- Sample size: How many users and how long?
- Triggering condition: Which users are eligible for the experiment?
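The checklist can be captured as a pre-launch spec. A minimal sketch as a Python dict; every field name and value here is illustrative, not any real platform's schema:

```python
# Hypothetical pre-launch experiment spec; names and values are illustrative.
experiment = {
    "hypothesis": "One-click checkout raises checkout completion",
    "randomization_unit": "user",
    "primary_metric": "checkout_completion_rate",
    "guardrail_metrics": ["page_load_time_p95", "support_contact_rate"],
    "traffic_split": {"control": 0.5, "treatment": 0.5},
    "min_sample_size_per_arm": 31_000,   # from the power calculation
    "planned_duration_days": 14,         # at least one full business cycle
    "trigger": "user reaches the checkout page",
}

# Sanity checks you would run before launch.
assert abs(sum(experiment["traffic_split"].values()) - 1.0) < 1e-9
assert experiment["primary_metric"] not in experiment["guardrail_metrics"]
print(f"spec OK: primary = {experiment['primary_metric']}")
```

Writing the spec down before launch is what prevents post-hoc metric shopping.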
Interview Tip
When an interviewer asks "how would you A/B test X?", start from this checklist instead of jumping straight to the p-value. It demonstrates systematic thinking.
Randomization Unit
The choice of randomization unit affects the analysis, metric definition, and validity:
| Unit | Pros | Cons | When to Use |
|---|---|---|---|
| User | Consistent experience, clean analysis | Slower to accumulate samples | Most product experiments |
| Session | More samples, faster | Same user sees different variants | Short-lived treatments (e.g., ranking) |
| Page view | Maximum samples | Inconsistent UX | Backend changes invisible to users |
| Cluster (e.g., market) | Handles network effects | Very few units, low power | Social features, marketplace |
The randomization unit determines your metric's denominator, your independence assumption, and your sample size calculation. Choosing the wrong unit can invalidate the entire experiment.
Metric Design
Metric Hierarchy
A robust experiment has three layers of metrics:
Primary metric (OEC — Overall Evaluation Criterion): The single metric that determines the launch decision. It should be:
- Directly movable by the treatment
- Measurable within the experiment timeframe
- Aligned with long-term business goals
Secondary metrics: Provide additional context about how the treatment works, helping you understand why the primary metric moved (or didn't).
Guardrail metrics: Must not degrade. They are the experiment's safety net:
- Page load time (performance)
- Crash rate (stability)
- Revenue per user (monetization)
- Customer support contacts (user satisfaction)
Good Metrics vs. Bad Metrics
A good metric should be:
| Property | Meaning | Bad Example |
|---|---|---|
| Sensitive | Detects real changes | Revenue (too noisy; driven by too many external factors) |
| Robust | Not easily gamed or noisy | Page views (can be inflated by auto-refresh) |
| Aligned | Connected to long-term value | Clicks (a click doesn't mean the user actually got value) |
| Timely | Observable within experiment duration | Annual retention rate (you cannot wait for it) |
Example — search engine improvement:
- Bad primary: Revenue → too many external factors, too much noise
- Better primary: Successful session rate → users found what they were looking for
- Good guardrail: Queries per session → an increase may mean users are struggling
Metric Surrogacy Problem
Short-term proxy metrics (e.g., clicks) don't always predict long-term outcomes (e.g., retention). You need to validate the relationship between the proxy and the true goal. Netflix found that some changes that lifted short-term engagement actually hurt long-term retention.
Sample Size & Power Analysis
The Power Calculation
Before running an experiment, compute the required sample size. For a two-sample z-test on means:

$$ n = \frac{2\,(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{\delta^2} $$

where:
- $n$ = sample size per group
- $z_{\alpha/2}$ = z-score for the significance level (1.96 for $\alpha = 0.05$)
- $z_\beta$ = z-score for the power (0.84 for power = 80%)
- $\sigma^2$ = variance of the metric
- $\delta$ = minimum detectable effect (MDE)

For proportions (e.g., a conversion rate moving from $p_1$ to $p_2$):

$$ n = \frac{(z_{\alpha/2} + z_\beta)^2\,\bigl(p_1(1-p_1) + p_2(1-p_2)\bigr)}{(p_2 - p_1)^2} $$
The Four Levers
Power depends on four interconnected factors:
| Factor | Increase → Power | Tradeoff |
|---|---|---|
| Sample size (n) | ↑ | Costs time and traffic |
| Effect size (δ) | ↑ | Can't always control; smaller effects need more data |
| Significance level (α) | ↑ | More false positives |
| Variance (σ²) | ↓ increases power | Use variance reduction techniques |
Variance Reduction: CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by leveraging pre-experiment behavior:

$$ Y^{\mathrm{cuped}} = Y - \theta\,(X - \bar{X}) $$

where $X$ is a pre-experiment covariate (e.g., last week's engagement) and $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$.
The idea is similar to regression adjustment: pre-experiment behavior explains part of the metric's variance, so subtracting out the individual differences that would exist anyway makes the treatment effect easier to detect.
CUPED can reduce variance by 30-50%, which is equivalent to getting 30-50% more samples for free. It is widely used at Microsoft, Netflix, Uber, and other companies.
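The adjustment fits in a few lines. This is a toy simulation, not production code; the covariate, noise levels, and the true effect of 2.0 are all made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: pre-experiment engagement X predicts the in-experiment
# metric Y; a true treatment effect of 2.0 is added on top.
n = 10_000
x = rng.normal(50, 20, size=n)                    # pre-period covariate
treated = rng.integers(0, 2, size=n).astype(bool)
y = 0.8 * x + rng.normal(0, 10, size=n) + 2.0 * treated

# CUPED adjustment: Y_cuped = Y - theta * (X - mean(X)),
# with theta = Cov(X, Y) / Var(X).
theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

# Compare the effect estimate and its standard error before and after.
for name, metric in [("raw", y), ("CUPED", y_cuped)]:
    diff = metric[treated].mean() - metric[~treated].mean()
    var = (metric[treated].var() / treated.sum()
           + metric[~treated].var() / (~treated).sum())
    print(f"{name:>5}: effect = {diff:.2f}, se = {np.sqrt(var):.3f}")
```

Both estimators recover an effect near 2.0, but the CUPED standard error is markedly smaller because the covariate soaks up most of the between-user variance.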
Interview Gold
Proactively mentioning CUPED in an interview signals hands-on experimentation experience. It is one of the most influential techniques in modern A/B testing.
Running the Experiment
How Long to Run
- Run for at least one full business cycle (typically 1-2 weeks) to capture day-of-week effects
- Do not stop early just because you see significance (peeking problem)
- Do not extend indefinitely hoping for significance (p-hacking)
The Peeking Problem
If you check your p-value daily and stop as soon as $p < 0.05$, you inflate your false positive rate far above 5%.
Checking once a day for 14 days: if the checks were independent, the chance of at least one false positive would be $1 - (1 - 0.05)^{14} \approx 0.51$, over 50%! (In practice the daily looks share data and are correlated, so the inflation is smaller, but still several times the nominal 5%.)
Solutions:
- Fixed-horizon testing: fix the experiment duration in advance and look only at the final result
- Sequential testing: Use alpha-spending functions (O'Brien-Fleming, Lan-DeMets) that control overall Type I error while allowing interim analyses
- Bayesian testing: Monitor continuously without p-value inflation
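The inflation is easy to demonstrate with an A/A simulation (traffic numbers are hypothetical; because the daily looks share accumulating data, the simulated rate lands below the independent-look bound but far above 5%):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A/A test: both arms convert at 10%, so every "significant" result is a
# false positive. Check the p-value after each daily batch and stop early
# at the first p < 0.05.
def peeked_to_false_positive(days=14, users_per_day=500, p=0.10):
    c = t = nc = nt = 0
    for _ in range(days):
        c += rng.binomial(users_per_day, p)
        t += rng.binomial(users_per_day, p)
        nc += users_per_day
        nt += users_per_day
        pool = (c + t) / (nc + nt)
        se = np.sqrt(pool * (1 - pool) * (1 / nc + 1 / nt))
        if se > 0:
            z = (t / nt - c / nc) / se
            if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:
                return True   # stopped early on a false positive
    return False

sims = 2000
fp_rate = sum(peeked_to_false_positive() for _ in range(sims)) / sims
print(f"false positive rate with daily peeking: {fp_rate:.1%}")
```

A single fixed-horizon look at day 14 would hold the rate at 5%; the extra 13 looks multiply it severalfold.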
Sample Ratio Mismatch (SRM)
If you designed a 50/50 split but observe 51.2/48.8, run a chi-squared test on the observed group counts:

$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$
If significant, do not trust the experiment results. Common SRM causes:
- Buggy randomization code
- Bots or crawlers assigned asymmetrically
- Redirect latency causing differential dropout
- Triggered analysis conditions that interact with treatment
SRM = Stop and Investigate
SRM is the most reliable diagnostic of experiment quality. If SRM is detected, the results cannot be trusted no matter what the primary metric looks like. Always check for SRM before interpreting results.
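The check is one call to `scipy.stats.chisquare`; the counts below are hypothetical, matching the 51.2/48.8 split above at 100k users:

```python
from scipy import stats

# Designed a 50/50 split; observed counts below (hypothetical numbers).
observed = [51_200, 48_800]      # users landing in control / treatment
expected = [50_000, 50_000]

chi2, p_srm = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.1f}, p = {p_srm:.2e}")

# A common convention is to flag SRM when p < 0.001. At this scale a
# 51.2/48.8 split is wildly improbable under a true 50/50 design.
if p_srm < 0.001:
    print("SRM detected: stop and investigate before reading any metrics")
```

Note that an imbalance that looks tiny in percentage terms becomes an extreme chi-squared statistic at large sample sizes.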
Common Pitfalls
Novelty and Primacy Effects
- Novelty effect: Users interact more with something new simply because it is new. The treatment effect fades over time.
- Primacy effect: Users resist change and initially perform worse, but adapt over time.
How to detect it: segment results by new vs. returning users, or by days since first exposure. If the treatment effect changes materially over time, you have a novelty or primacy problem.
Network Effects and Interference (SUTVA Violation)
Standard A/B tests assume SUTVA (Stable Unit Treatment Value Assumption): one user's assignment does not affect another's outcome. This is violated when:
- Social features: User A sees a new sharing button → shares with User B (in control) → B benefits without being treated
- Marketplace: Showing different prices to buyers affects seller supply, which affects all buyers
- Limited inventory: Recommending item X more to treatment means fewer items for control
Solutions:
- Cluster randomization: Randomize by geographic market, social cluster, or time
- Switchback experiments: Alternate treatment/control over time periods
- Ghost experiments: Log what the treatment would do without actually applying it
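Cluster randomization changes the analysis, not just the assignment: users within a cluster are correlated, so the effective sample size is the number of clusters. A toy sketch with assumed market-level effects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 20 markets, each with its own baseline conversion level; users within a
# market are correlated, violating user-level independence.
n_markets, users_per_market = 20, 500
market_effects = rng.normal(0, 0.03, size=n_markets)
treated_markets = rng.permutation(n_markets) < n_markets // 2

# Simulate user-level conversions with a market-level random effect and a
# small assumed treatment effect (+1pp).
market_means = []
for m in range(n_markets):
    p_m = np.clip(0.10 + market_effects[m] + 0.01 * treated_markets[m], 0, 1)
    conversions = rng.binomial(1, p_m, size=users_per_market)
    market_means.append(conversions.mean())
market_means = np.array(market_means)

# Correct analysis: a t-test on the 20 cluster means (n = 20, not 10,000).
t_stat, p_val = stats.ttest_ind(market_means[treated_markets],
                                market_means[~treated_markets])
print(f"cluster-level t = {t_stat:.2f}, p = {p_val:.3f} "
      f"(only {n_markets} effective units)")
```

Analyzing the 10,000 users as if they were independent would badly understate the standard error; with only 20 clusters, the test is honest but low-powered, which is the tradeoff named in the table above.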
Multiple Testing
Running 20 tests at α = 0.05 → expect 1 false positive. Corrections:
- Bonferroni: test each hypothesis at α/m; conservative, controls the family-wise error rate
- Benjamini-Hochberg (FDR): controls the false discovery rate; looser than Bonferroni but more practical
- Primary metric approach: designate one primary metric (no correction needed) and treat the others as exploratory
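The Benjamini-Hochberg step-up procedure is short enough to sketch directly (the p-values below are made up):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q.

    BH step-up: sort the m p-values, find the largest k with
    p_(k) <= (k / m) * q, and reject the k smallest.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])    # largest passing rank
        reject[order[: k + 1]] = True
    return reject

# 10 metrics: 3 with real effects (small p), 7 nulls.
pvals = [0.001, 0.004, 0.012, 0.20, 0.35, 0.41, 0.55, 0.62, 0.78, 0.91]
print(benjamini_hochberg(pvals))   # rejects exactly the first three

# Bonferroni at the same level would require p < 0.05 / 10 = 0.005,
# keeping only the first two: stricter, as noted above.
```

In practice you would reach for `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` rather than hand-rolling this, but the sketch shows why BH rejects more than Bonferroni.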
Interpreting Results
The Decision Framework
| Scenario | Action |
|---|---|
| Primary ↑, guardrails safe | Launch |
| Primary ↑, guardrail ↓ | Investigate tradeoff, may need redesign |
| Primary flat, guardrails safe | No launch — the feature does not help |
| Primary ↓ | Do not launch, analyze why |
| Inconclusive (underpowered) | Run longer, increase MDE, or use CUPED |
Statistical vs. Practical Significance
A p-value of 0.001 with a 0.01% conversion lift is statistically significant but practically meaningless. Always report:
- Effect size (absolute and relative)
- Confidence interval for the effect
- Business impact (estimated revenue, user hours, etc.)
How to Present Result Interpretation in an Interview
A good framing: "The primary metric improved significantly by X% (CI [a, b]), and no guardrail degraded significantly. However, [some secondary metric] shows a downward trend, so I would recommend continued monitoring after launch. At the current effect size, the estimated annual impact is about Y."
Real-World Use Cases
Case 1: Launching a Credit Card Fraud Detection Model
You trained a new fraud detection model and need an A/B test to decide whether to launch it.
Experiment design:
- Randomization unit: Transaction (not user), because different transactions from the same user can carry different risk
- Primary metric: Fraud loss rate (missed fraud amount / total transaction amount)
- Guardrail: False positive rate (share of legitimate transactions blocked); you cannot let too many legitimate transactions be declined
- Challenge: Fraud labels are delayed (chargebacks may be confirmed only 30-90 days later), so you need a proxy metric (e.g., the model score distribution) for early monitoring
Delayed Labels Problem
Ground truth in fraud detection is severely delayed, and you cannot wait 90 days to read out the experiment. The fix: use a short-term proxy (the share of transactions with model score above the threshold) paired with long-term validation (once labels arrive, verify that the proxy tracked the true outcome).
Case 2: Recommendation System Overhaul
An e-commerce platform is replacing its collaborative filtering recommender with a deep learning model.
Experiment design:
- Randomization unit: User, so each user sees a consistent recommendation experience
- Primary metric: Conversion rate (purchase / visit)
- Secondary: CTR on recommendations, average order value, items per order
- Guardrails: Page load time (is the new model's inference slower?), diversity of recommendations (does it only push popular items?)
- Challenge: Network effect: if treatment users buy up popular inventory, control users' recommendation quality degrades too (limited inventory)
Case 3: Subscription Pricing Strategy
A SaaS product wants to test a new pricing plan (raising the monthly fee to $12.99).
Experiment design:
- Randomization unit: New user (experiment on new users only, to avoid the primacy effect of raising prices on existing users)
- Primary metric: Revenue per user (30-day)
- Guardrails: Signup rate (how many people decline to sign up after the increase?), Day-7 retention
- Challenge:
- Long-term effects: short-term revenue may rise (those willing to pay pay more), but long-term churn may increase
- Ethical concerns: two prices for the same product at the same time can create trust problems
- Small effect, long timeline: pricing effects show up in LTV, which may require months of experimentation
Hands-on: A/B Testing in Python
Complete A/B Test Analysis
```python
import numpy as np
from scipy import stats

# Simulate an A/B test: new checkout flow (seeded for reproducibility)
rng = np.random.default_rng(42)
n_control, n_treatment = 5000, 5000
control = rng.binomial(1, 0.12, n_control)       # 12% baseline conversion
treatment = rng.binomial(1, 0.135, n_treatment)  # 13.5% treatment conversion
conv_c, conv_t = control.mean(), treatment.mean()

# Two-proportion z-test
p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
z_stat = (conv_t - conv_c) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

# 95% CI for the difference in conversion rates
se_diff = np.sqrt(conv_c * (1 - conv_c) / n_control
                  + conv_t * (1 - conv_t) / n_treatment)
diff = conv_t - conv_c
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

# SRM check (chi-squared test on the allocation ratio; here the split is
# exactly 50/50 by construction, so this is illustrative)
expected = (n_control + n_treatment) / 2
chi2_srm = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)
# p_srm > 0.01 → no sample ratio mismatch
```
Sample Size Calculation
```python
import numpy as np
from scipy import stats

def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Sample size per group for a two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1, p2 = baseline, baseline + mde
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil((z_alpha + z_beta) ** 2 * pooled_var / mde ** 2))

# baseline=5%, MDE=0.5%: ~31,000 per group
# baseline=5%, MDE=1%:   ~8,200 per group
# baseline=5%, MDE=2%:   ~2,200 per group
```
Interview Signals
What interviewers listen for:
- You don't just run the test; you first ask whether the experiment measures the right question
- You proactively raise experiment risks (novelty effect, interference, selection bias)
- You connect results back to the product mechanism instead of just reciting significance
- You know how to compute sample size and when variance reduction is needed
- You can distinguish statistical significance from practical significance
Practice
Flashcards
What is the role of a guardrail metric?
Even when the primary metric improves, you must confirm that other key qualities (latency, retention, support contact rate) were not harmed. Guardrails are the experiment's safety net, preventing a local optimization from hurting the overall product.
Quiz
The A/B test lifts conversion, but page load time degrades significantly. What is the most reasonable next step?