Causal Inference
Interview Context
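A/B testing is the gold standard for causal inference, but many situations rule out an experiment: random assignment may be impossible (for ethical, cost, or policy reasons), or only observational data may be available. The interviewer wants to know which tools you can still use to infer causality when an A/B test is not an option.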
What You Should Understand
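- Know the concrete reasons why correlation ≠ causation (confounders, reverse causality, collider bias)
- Be able to explain the potential outcomes framework and the average treatment effect (ATE)
- Understand the four main quasi-experimental methods: DiD, IV, PSM, RDD
- Know each method's assumptions, use cases, and limitations
- Be able to judge when observational methods are not enough and an experiment is required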
Core Concepts
Why Correlation ≠ Causation
Three main reasons why observational association does not imply causation:
1. Confounding: A third variable causes both X and Y.
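Example: ice cream sales and drowning deaths are positively correlated, but ice cream does not cause drowning. The confounder is hot weather, which raises both ice cream sales and the number of people swimming.
2. Reverse causality: Y causes X, not X causes Y.
Example: employees with larger training budgets perform better, but this may be because the company invests more training in employees who already perform well (a selection effect), not because training improves performance.
3. Collider bias: Conditioning on a common effect creates a spurious association.
Example: among admitted students, SAT scores and extracurricular activities are negatively correlated, but only because admission (the collider) requires at least one of them to be strong. Among all applicants there is no such relationship.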
Simpson's Paradox
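A trend that holds within every subgroup reverses once the data are pooled. Classic case: at a hospital, Treatment A outperforms Treatment B within every severity stratum, yet B looks better overall, because A was assigned to more of the severe cases. This is confounding at work.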
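The reversal can be reproduced with a handful of counts. The sketch below uses illustrative numbers (patterned on the often-quoted kidney stone treatment example; they are an assumption for demonstration, not data from this section) in which Treatment A handles most of the severe cases.

```python
# Illustrative (successes, patients) counts per treatment and severity stratum.
# These numbers are assumed for demonstration only.
counts = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, total):
    """Success rate as a fraction."""
    return successes / total

# Success rate within each severity stratum
stratum_rates = {
    t: {s: rate(*counts[t][s]) for s in counts[t]} for t in counts
}

# Pooled success rate, ignoring severity
overall_rates = {
    t: rate(
        sum(c[0] for c in counts[t].values()),
        sum(c[1] for c in counts[t].values()),
    )
    for t in counts
}

for t in counts:
    print(t, stratum_rates[t], round(overall_rates[t], 3))
```

A wins inside both strata but loses in the pooled comparison, because the pooled rate mixes the treatment effect with the difference in case mix.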
The Potential Outcomes Framework (Rubin Causal Model)
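The foundation of modern causal inference. For each individual $i$:
- $Y_i(1)$ = potential outcome if treated
- $Y_i(0)$ = potential outcome if not treated
- Individual treatment effect: $\tau_i = Y_i(1) - Y_i(0)$
The Fundamental Problem of Causal Inference: We can never observe both $Y_i(1)$ and $Y_i(0)$ for the same individual. You only ever see one of them; the other is the counterfactual.
We therefore usually estimate causal effects at the population level:
Average Treatment Effect (ATE): $\text{ATE} = E[Y_i(1) - Y_i(0)]$
Average Treatment Effect on the Treated (ATT): $\text{ATT} = E[Y_i(1) - Y_i(0) \mid T_i = 1]$
The ATT only concerns the effect on those who were actually treated, and is more commonly used in policy evaluation ("What was the effect of this policy on its participants?").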
Why A/B Testing Solves This
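In a randomized experiment, treatment assignment is independent of potential outcomes:
$$(Y_i(1), Y_i(0)) \perp T_i$$
Therefore $E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] = E[Y_i(1)] - E[Y_i(0)] = \text{ATE}$.
A simple difference in means is the causal effect. That is the power of randomization.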
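A small simulation makes the contrast concrete. The sketch below is a toy setup under stated assumptions: `ability` is a hypothetical confounder, the true effect is fixed at 2.0 for every unit, and both potential outcomes are visible only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder that raises the outcome and the chance of treatment
ability = rng.normal(size=n)

# Both potential outcomes exist in a simulation (never in real data)
y0 = 1.0 + ability + rng.normal(size=n)
y1 = y0 + 2.0                      # true effect is 2.0 for every unit
true_ate = (y1 - y0).mean()

def diff_in_means(treated, y1, y0):
    """Difference in observed mean outcomes between treated and untreated."""
    y_obs = np.where(treated, y1, y0)
    return y_obs[treated].mean() - y_obs[~treated].mean()

# Observational world: high-ability units select into treatment
p_treat = 1.0 / (1.0 + np.exp(-2.0 * ability))
t_obs = rng.random(n) < p_treat
naive_obs = diff_in_means(t_obs, y1, y0)

# Experimental world: a coin flip decides treatment
t_rand = rng.random(n) < 0.5
naive_rand = diff_in_means(t_rand, y1, y0)

print(f"true ATE={true_ate:.2f}  observational={naive_obs:.2f}  randomized={naive_rand:.2f}")
```

Under self-selection the difference in means overstates the effect, while under the coin flip it lands close to the true ATE.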
When A/B Testing Is Not Possible
| Scenario | Why Can't Randomize | Example |
|---|---|---|
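| Ethical constraints | Cannot randomly assign a harmful treatment | Effect of smoking on health |
| Already happened | The event is in the past and cannot be re-randomized | Effect of the 2008 financial crisis on employment |
| Contamination risk | Treatment spills over to the control group | Effect of a feature on a social network |
| Political/business | Stakeholders will not allow a holdout group | A new policy mandated company-wide |
| Long time horizon | Outcomes take many years to observe | Effect of education policy on lifetime earnings |
These scenarios call for quasi-experimental methods: research designs that infer causal relationships from observational data.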
Difference-in-Differences (DiD)
The Idea
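Compare the change in outcomes over time between a treatment group and a control group:
$$\hat{\tau}_{\text{DiD}} = (\bar{Y}_{\text{treat, post}} - \bar{Y}_{\text{treat, pre}}) - (\bar{Y}_{\text{control, post}} - \bar{Y}_{\text{control, pre}})$$
The first difference (post minus pre within each group) removes time-invariant differences in level between the groups. The second difference (treatment minus control) removes the time trend common to both groups.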
The Parallel Trends Assumption
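DiD's core assumption: without the treatment, the treatment group and the control group would have followed parallel trends.
Note: the two groups do not need the same level, only the same trend.
Parallel Trends Cannot Be Verified Directly
We never observe the treatment group's counterfactual trend in the absence of treatment. What you can do: (1) check whether trends were parallel in the pre-treatment period, (2) run placebo tests (a fake DiD using time points before the treatment), and (3) add covariates to make parallel trends more plausible.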
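The placebo idea can be sketched on simulated data. Everything here is assumed for illustration: four periods (0 to 2 are pre-treatment, 3 is post), a shared trend of 0.5 per period, groups at different levels, and a true effect of 1.5. A fake DiD between two pre-treatment periods should come out near zero if trends are parallel.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units = 5_000
periods = np.arange(4)             # periods 0-2 are pre-treatment, 3 is post
true_effect = 1.5

def simulate(level, effect):
    """Outcome matrix (units x periods): level + shared trend + noise."""
    y = level + 0.5 * periods + rng.normal(size=(n_units, periods.size))
    y[:, 3] += effect              # treatment only matters in the post period
    return y

treat = simulate(level=3.0, effect=true_effect)   # different level, same trend
control = simulate(level=1.0, effect=0.0)

def did(pre, post):
    """2x2 difference-in-differences between two period indices."""
    change_treat = treat[:, post].mean() - treat[:, pre].mean()
    change_control = control[:, post].mean() - control[:, pre].mean()
    return change_treat - change_control

placebo = did(pre=1, post=2)       # both periods are before treatment
estimate = did(pre=2, post=3)      # the real comparison

print(f"placebo DiD={placebo:.3f}  actual DiD={estimate:.3f}")
```

A placebo estimate far from zero would be evidence against parallel trends. A placebo near zero is reassuring but does not prove the assumption holds after treatment.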
DiD in Regression Form
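$$Y_{it} = \beta_0 + \beta_1 \text{Treat}_i + \beta_2 \text{Post}_t + \beta_3 (\text{Treat}_i \times \text{Post}_t) + \varepsilon_{it}$$
$\beta_3$ is the DiD estimator: the causal effect of the treatment, provided parallel trends holds.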
import statsmodels.formula.api as smf
# Treat: 1 if in treatment group, 0 if control
# Post: 1 if after treatment, 0 if before
# Treat_Post: interaction term, built here in case df does not already contain it
df["treat_post"] = df["treat"] * df["post"]
model = smf.ols("revenue ~ treat + post + treat_post", data=df).fit()
# model.params["treat_post"] = DiD estimate of causal effect
Instrumental Variables (IV)
The Problem
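You want to estimate the causal effect of $X$ on $Y$, but there's an unobserved confounder $U$ that affects both:
$$Y = \beta_0 + \beta_1 X + \varepsilon, \quad \text{where } \varepsilon \text{ contains } U$$
The OLS estimate of $X$'s coefficient $\beta_1$ is biased because $X$ is correlated with the error term (endogeneity).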
The Solution
Find an instrument $Z$ that:
- Relevance: $Z$ is correlated with $X$: $\text{Cov}(Z, X) \neq 0$
- Exclusion restriction: $Z$ affects $Y$ only through $X$; $Z$ has no direct effect on $Y$
- Independence: $Z$ is uncorrelated with $U$
Two-Stage Least Squares (2SLS)
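Stage 1: Regress $X$ on $Z$ to get the predicted $\hat{X}$ (the part of $X$ explained by $Z$, free from confounding):
$$\hat{X} = \hat{\pi}_0 + \hat{\pi}_1 Z$$
Stage 2: Regress $Y$ on $\hat{X}$. Since $\hat{X}$ is driven only by $Z$ (exogenous), the coefficient $\beta_1$ estimates the causal effect:
$$Y = \beta_0 + \beta_1 \hat{X} + \nu$$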
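The two stages can be written out by hand with plain least squares. This is a minimal sketch on simulated data, assuming a single instrument `z`, an unobserved confounder `u`, and a true effect of 2.0; the helper `ols` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument, independent of u
x = 0.8 * z + u + rng.normal(size=n)          # treatment driven by z and u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true causal effect of x is 2.0

def ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

const = np.ones(n)

# Naive OLS of y on x: biased, because x is correlated with u
beta_ols = ols(np.column_stack([const, x]), y)[1]

# Stage 1: predict x from z
stage1 = np.column_stack([const, z])
x_hat = stage1 @ ols(stage1, x)

# Stage 2: regress y on the predicted x
beta_2sls = ols(np.column_stack([const, x_hat]), y)[1]

print(f"OLS={beta_ols:.2f}  2SLS={beta_2sls:.2f}  truth=2.00")
```

The point estimate from this manual version is fine, but standard errors read off a hand-run second stage are not valid, so a dedicated IV routine is the safer choice for inference.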
Classic Examples
| Instrument | Endogenous | Outcome | Why It Works |
|---|---|---|---|
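| Distance to college | Years of education | Income | Distance affects whether someone attends college but is assumed not to affect income directly |
| Rainfall | Agricultural output | Economic growth | Rainfall affects agricultural output but is assumed not to affect other economic activity directly |
| Draft lottery number | Military service | Earnings | A random lottery determines who serves and does not affect earnings directly |
The Biggest Challenge with IV: Finding a Good Instrument
Good instruments are very hard to find in practice. The exclusion restriction cannot be tested statistically (it has to be defended with domain knowledge), and weak instruments (only weakly correlated with $X$) lead to severe bias. Whenever you bring up IV in an interview, discuss why the instrument is plausible.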
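Instrument strength, unlike the exclusion restriction, can be checked from the data. A common rule of thumb treats a first-stage F statistic below about 10 as a warning sign. The sketch below computes that statistic for a single instrument on simulated data; the function name `first_stage_f` and the coefficients are assumptions for illustration.

```python
import numpy as np

def first_stage_f(z, x):
    """F statistic for H0: the instrument has no effect on x (one instrument)."""
    n = len(x)
    zc = z - z.mean()
    slope = (zc @ (x - x.mean())) / (zc @ zc)
    intercept = x.mean() - slope * z.mean()
    resid = x - intercept - slope * z
    sigma2 = (resid @ resid) / (n - 2)        # residual variance
    se_slope = np.sqrt(sigma2 / (zc @ zc))
    return (slope / se_slope) ** 2            # with one instrument, F = t^2

rng = np.random.default_rng(3)
n = 1_000
z = rng.normal(size=n)
u = rng.normal(size=n)
noise = rng.normal(size=n)

x_strong = 0.8 * z + u + noise    # instrument moves x a lot
x_weak = 0.01 * z + u + noise     # instrument barely moves x

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)

print(f"strong instrument F={f_strong:.1f}  weak instrument F={f_weak:.1f}")
```

A large F only speaks to relevance. It says nothing about whether the instrument satisfies the exclusion restriction.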
Propensity Score Matching (PSM)
The Idea
When treatment assignment is not random but depends on observable characteristics, you can estimate the causal effect by matching treated individuals with similar untreated individuals.
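Propensity score: The probability of receiving treatment given observed covariates:
$$e(X) = P(T = 1 \mid X)$$
Why Propensity Scores Work
Rosenbaum & Rubin (1983) proved that if treatment assignment is strongly ignorable (no unobserved confounders, and every unit has a propensity strictly between 0 and 1), conditioning on the propensity score is enough:
$$(Y(1), Y(0)) \perp T \mid X \;\implies\; (Y(1), Y(0)) \perp T \mid e(X)$$
Comparing units within a subgroup that shares the same propensity score is therefore like running a mini randomized experiment.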
Steps
- Estimate propensity score: Use logistic regression (or GBM, random forest) to predict $P(T = 1 \mid X)$
- Match: Pair each treated individual with the control individual whose propensity score is closest
- Check balance: Confirm that covariate distributions are similar between treated and control in the matched sample
- Estimate effect: Compute the difference in means on the matched sample
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Step 1: Estimate propensity scores
ps_model = LogisticRegression().fit(X_covariates, treatment)
propensity = ps_model.predict_proba(X_covariates)[:, 1]
# Step 2: Match treated to control using nearest neighbor
treated_idx = np.where(treatment == 1)[0]
control_idx = np.where(treatment == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(propensity[control_idx].reshape(-1, 1))
distances, matches = nn.kneighbors(propensity[treated_idx].reshape(-1, 1))
matched_control_idx = control_idx[matches.flatten()]
# Step 4: Estimate ATT on the matched sample (Step 3, the balance check, is not part of this snippet)
att = y[treated_idx].mean() - y[matched_control_idx].mean()
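The balance check from the steps above is commonly done with a standardized mean difference (SMD) per covariate, before and after matching. The sketch below is dependency-light and rests on assumptions: `age` is a hypothetical standardized covariate, the simulated true propensity stands in for a fitted model, and `nearest_control` is an illustrative 1-nearest-neighbor matcher with replacement.

```python
import numpy as np

def standardized_mean_diff(a, b):
    """(mean(a) - mean(b)) divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled_sd

def nearest_control(ps_treated, ps_control):
    """Index of the closest control propensity for each treated unit (with replacement)."""
    order = np.argsort(ps_control)
    sorted_ps = ps_control[order]
    pos = np.clip(np.searchsorted(sorted_ps, ps_treated), 1, len(sorted_ps) - 1)
    use_left = (ps_treated - sorted_ps[pos - 1]) <= (sorted_ps[pos] - ps_treated)
    return order[np.where(use_left, pos - 1, pos)]

rng = np.random.default_rng(4)
n = 20_000
age = rng.normal(size=n)                      # hypothetical standardized covariate
propensity = 1.0 / (1.0 + np.exp(-age))       # stand-in for a fitted propensity model
treatment = rng.random(n) < propensity

x_t, x_c = age[treatment], age[~treatment]
matched = nearest_control(propensity[treatment], propensity[~treatment])

smd_before = standardized_mean_diff(x_t, x_c)
smd_after = standardized_mean_diff(x_t, x_c[matched])

print(f"SMD before matching={smd_before:.2f}  after matching={smd_after:.3f}")
```

An absolute SMD below roughly 0.1 is a frequently used threshold for acceptable balance. With real data the check would be repeated for every covariate in the propensity model.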
Alternatives to Matching
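- Inverse Probability Weighting (IPW): Weight each observation by $1/e(X)$ for treated and $1/(1 - e(X))$ for control. No data has to be discarded.
- Stratification: Split the sample into strata by propensity score and compare within each stratum.
- Doubly robust: Combine an outcome model with a propensity model; the estimator is consistent as long as either one is correctly specified.
PSM's Critical Assumption: No Unobserved Confounders
PSM can only control for observed confounders. If an unobserved confounder exists (for example "motivation" or "health consciousness"), the estimate is still biased even when propensity scores match perfectly. Unlike IV, PSM cannot handle unobserved confounding. Raise this limitation proactively in an interview.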
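Of these alternatives, IPW is the quickest to sketch. The example below is simulated and rests on assumptions: a single observed confounder `x`, a propensity that is known by construction rather than estimated, a true effect of 2.0, and normalized weights (each arm's weights are rescaled to sum to one).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)                          # observed confounder
e = 1.0 / (1.0 + np.exp(-x))                    # propensity score, known by construction
t = (rng.random(n) < e).astype(float)
y = 1.0 + 2.0 * t + 1.5 * x + rng.normal(size=n)   # true effect is 2.0

# Naive comparison ignores that treated units have higher x
naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse probability weights: 1/e for treated, 1/(1-e) for control
w_treated = t / e
w_control = (1.0 - t) / (1.0 - e)

# Weighted means with normalized weights, then their difference
ate_ipw = (w_treated @ y) / w_treated.sum() - (w_control @ y) / w_control.sum()

print(f"naive={naive:.2f}  IPW={ate_ipw:.2f}  truth=2.00")
```

In practice the propensity has to be estimated, and scores very close to 0 or 1 produce extreme weights, which is why trimming or weight clipping often accompanies IPW.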
Regression Discontinuity Design (RDD)
The Idea
When treatment is assigned based on whether a running variable crosses a cutoff, individuals just above and just below the cutoff are nearly identical — creating a "natural experiment."
Sharp vs Fuzzy RDD
Sharp RDD: Treatment is deterministic at the cutoff — everyone above gets treated, everyone below doesn't.
Fuzzy RDD: The probability of treatment jumps at the cutoff but isn't deterministic. Like an IV where the cutoff instruments for treatment.
Classic Examples
| Cutoff | Running Variable | Treatment | Outcome |
|---|---|---|---|
| SAT score ≥ 1400 | SAT score | Admission to elite university | Lifetime earnings |
| Age ≥ 21 | Age | Legal alcohol purchase | Health outcomes |
| Poverty score ≤ threshold | Poverty score | Eligibility for subsidy | Child education |
| GPA ≥ 3.0 | GPA | Dean's list recognition | Subsequent academic performance |
Key Assumptions
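- Continuity: The expected potential outcomes $E[Y(1) \mid X]$ and $E[Y(0) \mid X]$ are continuous in the running variable $X$ at the cutoff, so units near the cutoff do not differ systematically except in treatment
- No manipulation: Individuals cannot precisely manipulate the running variable to land just above or below the cutoff. If they can, units near the cutoff are no longer "as-if random"
Detecting manipulation: use the McCrary density test to check whether the density of the running variable jumps at the cutoff. A jump suggests that units are manipulating their scores.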
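The intuition behind a density check can be shown with a much cruder comparison: count observations in a narrow window on each side of the cutoff and ask whether the split is plausibly 50/50. This is a simplified stand-in, not the McCrary test itself; the function `density_jump_z`, the window width, and the simulated manipulation are all assumptions.

```python
import math
import numpy as np

def density_jump_z(scores, cutoff, window):
    """z-score for 'equal counts just below and just above the cutoff'.

    Assumes the density is roughly flat inside the window, so each
    observation there is equally likely to fall on either side.
    """
    below = np.sum((scores >= cutoff - window) & (scores < cutoff))
    above = np.sum((scores >= cutoff) & (scores < cutoff + window))
    return (above - below) / math.sqrt(above + below)

rng = np.random.default_rng(6)
cutoff = 70.0
clean = rng.uniform(50, 90, size=20_000)

# Hypothetical manipulation: 40% of units just below the cutoff nudge themselves over it
manipulated = clean.copy()
near_miss = (manipulated >= cutoff - 2) & (manipulated < cutoff)
movers = near_miss & (rng.random(manipulated.size) < 0.4)
manipulated[movers] = cutoff + rng.uniform(0, 0.5, size=movers.sum())

z_clean = density_jump_z(clean, cutoff, window=2.0)
z_manip = density_jump_z(manipulated, cutoff, window=2.0)

print(f"clean z={z_clean:.2f}  manipulated z={z_manip:.2f}")
```

A real density test fits the density separately on each side, so it also works when the running variable is not locally uniform.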
RDD in Practice
import numpy as np
import statsmodels.formula.api as smf
# Local linear regression around the cutoff
bandwidth = 5 # only use observations within ±5 of cutoff
cutoff = 70
df_local = df[(df["score"] >= cutoff - bandwidth) & (df["score"] <= cutoff + bandwidth)].copy()  # copy so the new columns below do not write into a view of df
df_local["above"] = (df_local["score"] >= cutoff).astype(int)
df_local["score_centered"] = df_local["score"] - cutoff
model = smf.ols("outcome ~ above * score_centered", data=df_local).fit()
# model.params["above"] = RDD estimate of causal effect at the cutoff
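Pros and Cons of RDD
Pros: Among the weakest assumptions of these methods (only continuity near the cutoff) and strong internal validity; it has been called "the closest thing to a true experiment with observational data".
Cons: The estimate is valid only near the cutoff (a local effect) and does not generalize to populations far from it. It also needs enough observations near the cutoff.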
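Because the estimate relies on observations near the cutoff, it is worth checking how it moves with the bandwidth. The sketch below refits a local linear model at several bandwidths on simulated data, assuming a cutoff of 70, a linear relationship, and a true jump of 4.0; `rdd_estimate` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40_000
cutoff = 70.0
score = rng.uniform(40, 100, size=n)
above = (score >= cutoff).astype(float)
# Simulated outcome: linear in the score with a true jump of 4.0 at the cutoff
outcome = 10.0 + 0.3 * (score - cutoff) + 4.0 * above + rng.normal(size=n)

def rdd_estimate(bandwidth):
    """Local linear fit with separate slopes on each side; returns the jump."""
    keep = np.abs(score - cutoff) <= bandwidth
    centered = score[keep] - cutoff
    d = above[keep]
    design = np.column_stack([np.ones(keep.sum()), d, centered, d * centered])
    coef, *_ = np.linalg.lstsq(design, outcome[keep], rcond=None)
    return coef[1]                 # coefficient on the 'above' indicator

estimates = {bw: rdd_estimate(bw) for bw in (3, 5, 10)}

for bw, est in estimates.items():
    print(f"bandwidth={bw:>2}  estimated jump={est:.2f}")
```

Narrow bandwidths trade bias for variance: fewer observations but less reliance on the functional form. Estimates that swing widely across bandwidths suggest the result is fragile.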
Method Comparison
| Method | Identifies | Key Assumption | Handles Unobserved Confounders? | Effect Type |
|---|---|---|---|---|
| A/B Test | ATE | Randomization | Yes (by design) | Global |
| DiD | ATT | Parallel trends | Only time-invariant ones | Group-level |
| IV | LATE | Exclusion restriction | Yes (if instrument valid) | Compliers only |
| PSM | ATT | No unobserved confounders | No | Matched sample |
| RDD | LATE | Continuity at cutoff | Yes (locally) | At cutoff only |
LATE = Local Average Treatment Effect
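IV and RDD estimate a LATE, not the ATE for everyone: it applies only to "compliers" (units whose behavior changes because of the instrument or the cutoff). When asked "What does IV estimate?" in an interview, answer LATE, not ATE.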
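The gap between LATE and ATE shows up clearly when effects differ by compliance type. The simulation below is a sketch under assumptions: a randomly assigned binary instrument, three hypothetical types (compliers, always-takers, never-takers, and no defiers), and type-specific effects of 1.0, 3.0, and 2.0. The IV estimate is computed as the Wald ratio.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000

# Hypothetical population: 0 = complier, 1 = always-taker, 2 = never-taker
kind = rng.choice(3, size=n, p=[0.5, 0.25, 0.25])
effect = np.array([1.0, 3.0, 2.0])[kind]      # treatment effect differs by type

z = rng.random(n) < 0.5                        # randomly assigned instrument
d = (kind == 1) | ((kind == 0) & z)            # treatment actually taken
y = rng.normal(size=n) + effect * d

# Population ATE (knowable only because effects are simulated)
ate = effect.mean()

# Wald / IV estimate: effect of z on y divided by effect of z on d
wald = (y[z].mean() - y[~z].mean()) / (d[z].mean() - d[~z].mean())

complier_effect = effect[kind == 0].mean()

print(f"ATE={ate:.2f}  IV (Wald)={wald:.2f}  complier effect={complier_effect:.2f}")
```

The Wald ratio recovers the complier effect rather than the population average, because always-takers and never-takers do not change their treatment status when the instrument changes.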
Real-World Use Cases
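Case 1: Credit Cards (Causal Effect of Credit Limit on Spending)
The bank wants to know: does raising the credit limit increase spending?
Directly comparing the spending of high-limit and low-limit customers does not work: high-limit customers have more spending power to begin with (confounding by creditworthiness).
RDD approach: The bank sets credit limits from the credit score, and customers scoring 700 or above get a limit \$5K higher than those below 700. Compare customers with scores between 695 and 705: they are nearly identical, but their limits differ by \$5K.
Case 2: Recommender System (Causal Effect of a New Algorithm on Engagement)
A new recommendation algorithm has already been rolled out to all users (with no A/B test), and your manager asks: "Did the new algorithm actually improve engagement?"
DiD approach: Compare the before/after change in engagement against a comparable product line (or market) that did not switch algorithms. If pre-treatment trends are parallel, DiD can estimate the causal effect.
Case 3: Pricing Strategy (Causal Effect of Discounts on LTV)
You want to know: does a first-month discount for new users increase LTV? Directly comparing discounted and full-price users does not work: users who take the discount may be more price-sensitive, lower-LTV users to begin with.
PSM approach: Estimate a propensity score from observed features (demographics, acquisition channel, first session behavior) and match users who took the discount with users who did not. Compare 12-month LTV on the matched sample. Acknowledge, though, that if there is an unobserved confounder (for example "purchase intent"), the PSM estimate is still biased.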
Hands-on: Causal Inference in Python
Difference-in-Differences
import pandas as pd
import statsmodels.formula.api as smf
# DiD regression
# treat: 1 if treatment group
# post: 1 if after treatment period
# treat_post: interaction (the DiD estimator)
df["treat_post"] = df["treat"] * df["post"]  # build the interaction in case df lacks it
model = smf.ols("revenue ~ treat + post + treat_post", data=df).fit()
print(model.summary())
# treat_post coefficient = causal effect estimate
# Check: pre-treatment trends should be parallel
Propensity Score Matching
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
# Estimate propensity scores
ps = LogisticRegression(C=1.0).fit(X, treatment).predict_proba(X)[:, 1]
# Nearest neighbor matching
treated = np.where(treatment == 1)[0]
control = np.where(treatment == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, matches = nn.kneighbors(ps[treated].reshape(-1, 1))
matched = control[matches.flatten()]
# ATT estimate
att = outcome[treated].mean() - outcome[matched].mean()
# Check covariate balance after matching
Regression Discontinuity
import statsmodels.formula.api as smf
# Local linear regression within bandwidth
bw = 5
cutoff = 70
local = df[df["score"].between(cutoff - bw, cutoff + bw)].copy()
local["above"] = (local["score"] >= cutoff).astype(int)
local["score_c"] = local["score"] - cutoff
model = smf.ols("outcome ~ above * score_c", data=local).fit()
# model.params["above"] = RDD causal effect at cutoff
Interview Signals
What interviewers listen for:
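- You know the concrete reasons why correlation ≠ causation (not just the slogan)
- You can pick the right causal inference method for the scenario
- You volunteer each method's core assumptions and limitations
- You understand the difference between ATE, ATT, and LATE
- You know that PSM cannot handle unobserved confounding and that IV needs a good instrument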
Practice
Flashcards
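What is the core assumption of Difference-in-Differences, and how do you check it?
Parallel trends assumption: without the treatment, the two groups' outcome trends would have been parallel. It cannot be verified directly (the counterfactual is unobservable), but you can (1) check whether pre-treatment trends are parallel and (2) run a placebo test, a fake DiD using pre-treatment time points.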
Quiz
You find that users of the company's premium feature have a higher retention rate. Can you conclude that the premium feature improves retention?