Causal Inference

Interview Context

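A/B testing is the gold standard for causal inference, but many situations rule out an experiment: random assignment may be impossible (ethics, cost, policy constraints), or only observational data exist. The interviewer wants to know which tools you can still use to infer causality when an A/B test is not an option.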

What You Should Understand

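Know the concrete reasons why correlation ≠ causation (confounders, reverse causality, collider bias)
Be able to explain the potential outcomes framework and the average treatment effect (ATE)
Understand the four main quasi-experimental methods: DiD, IV, PSM, RDD
Know each method's assumptions, use cases, and limitations
Be able to judge when observational methods are not enough and an experiment is required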

Core Concepts

Why Correlation ≠ Causation

Three main reasons why observational association does not imply causation:

1. Confounding: A third variable causes both X and Y.

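Example: ice cream sales and drowning deaths are positively correlated, but ice cream does not cause drowning. The confounder is hot weather, which increases both ice cream sales and the number of people swimming.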

2. Reverse causality: Y causes X, not X causes Y.

Example: employees with larger training budgets perform better, but this may be because high performers are given more training (a selection effect), not because training improves performance.

3. Collider bias: Conditioning on a common effect creates a spurious association.

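Example: among admitted students, SAT scores and extracurricular activities are negatively correlated, but only because admission (the collider) requires at least one of the two to be strong. Among all applicants the relationship does not exist.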
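A small simulation can make the collider example concrete. The sketch below assumes hypothetical applicants whose SAT score and activity score are independent standard normal draws, and a made-up admission rule (`sat + activities > 1.0`). None of these numbers come from real admissions data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical applicants: the two traits are generated independently
sat = rng.normal(size=n)
activities = rng.normal(size=n)

# Admission (the collider) depends on both traits; the rule is an assumption
admitted = (sat + activities) > 1.0

corr_all = np.corrcoef(sat, activities)[0, 1]
corr_admitted = np.corrcoef(sat[admitted], activities[admitted])[0, 1]

print(f"correlation, all applicants: {corr_all:.3f}")
print(f"correlation, admitted only:  {corr_admitted:.3f}")
```

Under these assumptions the correlation is close to zero among all applicants and clearly negative among admitted students, even though neither trait influences the other.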

Simpson's Paradox

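A trend holds within every subgroup but reverses once the data are pooled. Classic case: in a hospital, Treatment A outperforms B within every severity subgroup, yet B looks better overall, because A was given to more of the severe cases. This is a consequence of confounding.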
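The reversal is easy to reproduce with arithmetic. The counts below are invented purely for illustration (they are not from any real hospital), and the names `counts`, `stratum_rates`, `overall_rates`, and `adjusted_rates` are hypothetical.

```python
# Hypothetical (recovered, total) counts, invented for illustration only
counts = {
    "A": {"mild": (90, 100), "severe": (210, 300)},
    "B": {"mild": (255, 300), "severe": (65, 100)},
}

# Recovery rate within each severity stratum
stratum_rates = {
    trt: {sev: rec / tot for sev, (rec, tot) in strata.items()}
    for trt, strata in counts.items()
}

# Pooled recovery rate, ignoring severity
overall_rates = {
    trt: sum(rec for rec, _ in strata.values()) / sum(tot for _, tot in strata.values())
    for trt, strata in counts.items()
}

# Adjust for severity: weight each stratum by its share of all patients
n_by_severity = {
    sev: sum(counts[trt][sev][1] for trt in counts) for sev in ("mild", "severe")
}
n_total = sum(n_by_severity.values())
adjusted_rates = {
    trt: sum(stratum_rates[trt][sev] * n_by_severity[sev] / n_total for sev in n_by_severity)
    for trt in counts
}

print(stratum_rates, overall_rates, adjusted_rates)
```

With these made-up counts, A wins in both strata, B wins in the pooled comparison, and the severity-adjusted comparison puts A back on top, because A was given mostly to severe cases.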

The Potential Outcomes Framework (Rubin Causal Model)

The foundation of modern causal inference. For each individual $i$:

$Y_i(1)$ = potential outcome if treated
$Y_i(0)$ = potential outcome if not treated
Individual treatment effect: $\tau_i = Y_i(1) - Y_i(0)$

The Fundamental Problem of Causal Inference: We can never observe both $Y_i(1)$ and $Y_i(0)$ for the same individual. You only ever see one of them; the other is the counterfactual.

Because of this, we usually estimate causal effects at the population level:

Average Treatment Effect (ATE):

$$\text{ATE} = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]$$

Average Treatment Effect on the Treated (ATT):

$$\text{ATT} = E[Y(1) - Y(0) \mid T = 1]$$

ATT looks only at the effect among those who were actually treated, and is the more common target in policy evaluation ("What was the effect of this policy on its participants?").
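ATE and ATT differ whenever treatment effects vary across people and the treated are not a random subset. The sketch below is a simulation under assumed numbers: individual effects `tau` are drawn from a normal distribution, and a hypothetical selection rule makes units with larger gains more likely to be treated. Both potential outcomes are visible only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200_000

# Hypothetical individual treatment effects tau_i = Y_i(1) - Y_i(0)
tau = rng.normal(loc=1.0, scale=1.0, size=n)
y0 = rng.normal(size=n)
y1 = y0 + tau

# Assumed selection rule: units with larger gains are more likely to be treated
t = (tau + rng.normal(size=n)) > 1.0

ate = np.mean(y1 - y0)         # average effect over everyone
att = np.mean((y1 - y0)[t])    # average effect over the treated only

print(f"ATE = {ate:.2f}, ATT = {att:.2f}")
```

In this setup the ATT exceeds the ATE, because the people who select into treatment are the ones who benefit most.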

Why A/B Testing Solves This

In a randomized experiment, treatment assignment $T$ is independent of potential outcomes:

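$$(Y(0), Y(1)) \perp T$$

So $E[Y \mid T=1] - E[Y \mid T=0] = E[Y(1) \mid T=1] - E[Y(0) \mid T=0] = E[Y(1)] - E[Y(0)] = \text{ATE}$. The first equality holds because we observe $Y(1)$ for treated units and $Y(0)$ for control units; the second follows from independence.

A simple difference in means is the causal effect. That is the power of randomization.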
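To see the difference randomization makes, the following sketch simulates one population under two assignment rules. Everything here is an assumption for illustration: the confounder `u`, the constant true effect of 2.0, and the self-selection rule.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process; the true effect is 2.0 for everyone
u = rng.normal(size=n)                      # confounder (e.g. motivation)
y0 = 1.0 + 3.0 * u + rng.normal(size=n)     # potential outcome without treatment
y1 = y0 + 2.0                               # potential outcome with treatment
true_ate = np.mean(y1 - y0)

def diff_in_means(t):
    """Observed difference in means: only one potential outcome is seen per unit."""
    y = np.where(t == 1, y1, y0)
    return y[t == 1].mean() - y[t == 0].mean()

t_selected = (u + rng.normal(size=n) > 0).astype(int)   # self-selection driven by u
t_random = rng.integers(0, 2, size=n)                   # coin-flip assignment

naive = diff_in_means(t_selected)
randomized = diff_in_means(t_random)

print(f"true ATE = {true_ate:.2f}, self-selected = {naive:.2f}, randomized = {randomized:.2f}")
```

Under self-selection the difference in means absorbs the effect of `u` and overstates the effect; under random assignment it lands close to the true ATE.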

When A/B Testing Is Not Possible

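| Scenario | Why Can't Randomize | Example |
| --- | --- | --- |
| Ethical constraints | Cannot randomly assign a harmful treatment | Effect of smoking on health |
| Already happened | The event has occurred; you cannot go back and randomize | Effect of the 2008 financial crisis on employment |
| Contamination risk | Treatment spills over to control | Effect of a feature on a social network |
| Political/business | Stakeholders will not allow a holdout group | A new policy mandated company-wide |
| Long time horizon | Outcomes take many years to observe | Effect of education policy on lifetime earnings |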

These scenarios call for quasi-experimental methods, which use careful research design to infer causality from observational data.

Difference-in-Differences (DiD)

The Idea

Compare the change in outcomes over time between a treatment group and a control group:

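$$\hat{\tau}_{\text{DiD}} = (\bar{Y}_{\text{treat,after}} - \bar{Y}_{\text{treat,before}}) - (\bar{Y}_{\text{control,after}} - \bar{Y}_{\text{control,before}})$$

The first difference (after minus before, within each group) removes each group's time-invariant level. The second difference (treatment minus control) removes the common time trend that affects both groups.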
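As a worked example with made-up numbers (not taken from any study): suppose average revenue in the treatment group moves from 100 to 130, while the control group moves from 80 to 95.

```latex
\hat{\tau}_{\text{DiD}} = (130 - 100) - (95 - 80) = 30 - 15 = 15
```

The treated group's raw change of 30 overstates the effect: 15 of it would have happened anyway, provided the control group's change is a valid stand-in for the trend.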

The Parallel Trends Assumption

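The core assumption of DiD: without the treatment, the treatment and control groups would have followed parallel trends.

$$E[Y(0)_{\text{treat,after}} - Y(0)_{\text{treat,before}}] = E[Y(0)_{\text{control,after}} - Y(0)_{\text{control,before}}]$$

Note: the two groups do not need the same level, only the same trend.

Parallel Trends Cannot Be Verified Directly

We never observe the counterfactual trend of the treatment group in the absence of treatment. What you can do: (1) check whether trends were parallel in the pre-treatment period, (2) run placebo tests (a fake DiD using a date before the actual treatment), and (3) add covariates to make parallel trends more plausible.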
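A placebo test can be sketched on simulated panel data. The setup below is entirely hypothetical: 6 periods, treatment starting at period 4, a true effect of 5.0, and parallel trends built into the simulation. The helper `did` is an illustrative name, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods, start = 2_000, 6, 4    # treatment begins at period 4
true_effect = 5.0

treat = rng.integers(0, 2, size=n_units)   # group membership (0 = control)
period = np.arange(n_periods)

# Outcome: level gap between groups + common trend + effect after `start` + noise
y = (10.0 + 3.0 * treat[:, None] + 1.5 * period[None, :]
     + true_effect * treat[:, None] * (period[None, :] >= start)
     + rng.normal(size=(n_units, n_periods)))

def did(y, treat, pre_periods, post_periods):
    """2x2 DiD computed from per-unit changes between two sets of periods."""
    change = y[:, post_periods].mean(axis=1) - y[:, pre_periods].mean(axis=1)
    return change[treat == 1].mean() - change[treat == 0].mean()

real = did(y, treat, pre_periods=[0, 1, 2, 3], post_periods=[4, 5])
# Placebo: pretend treatment started at period 2, using pre-treatment data only
placebo = did(y, treat, pre_periods=[0, 1], post_periods=[2, 3])

print(f"real DiD = {real:.2f}, placebo DiD = {placebo:.2f}")
```

A placebo estimate near zero is consistent with parallel pre-trends; a clearly non-zero placebo would be a warning sign. Passing the placebo test does not prove that trends would have stayed parallel after the treatment date.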

DiD in Regression Form

$$Y_{it} = \beta_0 + \beta_1 \cdot \text{Treat}_i + \beta_2 \cdot \text{Post}_t + \beta_3 \cdot (\text{Treat}_i \times \text{Post}_t) + \epsilon_{it}$$

$\beta_3$ is the DiD estimator: the causal effect of the treatment, provided parallel trends holds.

import statsmodels.formula.api as smf

# Assumes df is a DataFrame with columns revenue, treat, post
# treat: 1 if in treatment group, 0 if control
# post: 1 if after treatment, 0 if before
# treat:post is the interaction term, built by the formula itself
model = smf.ols("revenue ~ treat + post + treat:post", data=df).fit()
did_estimate = model.params["treat:post"]  # DiD estimate of the causal effect

Instrumental Variables (IV)

The Problem

You want to estimate the causal effect of $X$ on $Y$, but there's an unobserved confounder $U$ that affects both:

$$X \leftarrow U \rightarrow Y$$

The OLS estimate of the coefficient on $X$ is biased because $X$ is correlated with the error term (endogeneity).

The Solution

Find an instrument $Z$ that:

Relevance: $Z$ is correlated with $X$ — $\text{Cov}(Z, X) \neq 0$
Exclusion restriction: $Z$ affects $Y$ only through $X$ — $Z$ has no direct effect on $Y$
Independence: $Z$ is uncorrelated with $U$

$$Z \rightarrow X \rightarrow Y \quad (\text{but } Z \not\rightarrow Y \text{ directly})$$

Two-Stage Least Squares (2SLS)

Stage 1: Regress $X$ on $Z$ and keep the fitted values $\hat{X}$ (the part of $X$ explained by $Z$, free from confounding):

$$\hat{X} = \hat{\alpha}_0 + \hat{\alpha}_1 Z$$

Stage 2: Regress $Y$ on $\hat{X}$. Because $\hat{X}$ varies only with $Z$ (exogenous), $\beta_1$ has a causal interpretation when the instrument is valid:

$$Y = \beta_0 + \beta_1 \hat{X} + \epsilon$$
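The two stages can be run by hand on simulated data to see the bias correction. This is a minimal sketch under assumed values: a true coefficient of 2.0, a confounder `u` that we pretend is unobserved, and an instrument `z` generated to satisfy relevance and independence by construction. The helper `ols` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
true_beta = 2.0

u = rng.normal(size=n)                    # unobserved confounder
z = rng.normal(size=n)                    # instrument, independent of u
x = 0.8 * z + u + rng.normal(size=n)      # endogenous regressor
y = true_beta * x + 2.0 * u + rng.normal(size=n)

def ols(design, target):
    """Least-squares coefficients for target ~ design (design includes the intercept)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones(n)
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: regress x on z and keep the fitted values
stage1 = np.column_stack([ones, z])
x_hat = stage1 @ ols(stage1, x)

# Stage 2: regress y on the fitted values
beta_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"OLS = {beta_ols:.2f}, 2SLS = {beta_2sls:.2f}, truth = {true_beta}")
```

OLS is pulled away from 2.0 by the confounder, while the 2SLS point estimate lands close to it. The standard errors from a hand-run second stage are not valid, because they ignore that $\hat{X}$ was itself estimated; a dedicated IV routine is the safer choice for inference.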

Classic Examples

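| Instrument $Z$ | Endogenous $X$ | Outcome $Y$ | Why It Works |
| --- | --- | --- | --- |
| Distance to college | Years of education | Income | Distance affects whether someone attends college, but is assumed not to affect income directly |
| Rainfall | Agricultural output | Economic growth | Rainfall affects agricultural output, but is assumed not to affect other economic activity directly |
| Draft lottery number | Military service | Earnings | A random lottery determines who serves, and the number itself does not affect earnings directly |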

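The Biggest Challenge with IV: Finding a Good Instrument

Good instruments are very hard to find in practice. The exclusion restriction cannot be tested statistically (it has to be defended with domain knowledge), and weak instruments (only weakly correlated with X) can cause severe bias. Whenever you bring up IV in an interview, discuss why the instrument is plausible.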
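Relevance, unlike the exclusion restriction, can be checked from data. A common rule of thumb is to look for a first-stage F-statistic above roughly 10. The sketch below computes that statistic for a single instrument on simulated data; the function name `first_stage_f` and the coefficients 0.8 and 0.01 are assumptions chosen for illustration.

```python
import numpy as np

def first_stage_f(z, x):
    """F-statistic of the regression x ~ 1 + z with a single instrument z."""
    n = len(x)
    r2 = np.corrcoef(z, x)[0, 1] ** 2
    return r2 / (1.0 - r2) * (n - 2)

rng = np.random.default_rng(9)
n = 1_000
z = rng.normal(size=n)
noise = rng.normal(size=n)

x_strong = 0.8 * z + noise     # hypothetical strong first stage
x_weak = 0.01 * z + noise      # hypothetical weak first stage

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)

print(f"F (strong) = {f_strong:.1f}, F (weak) = {f_weak:.1f}")
```

A large F only speaks to relevance. It says nothing about whether the exclusion restriction holds.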

Propensity Score Matching (PSM)

The Idea

When treatment assignment is not random but depends on observable characteristics, you can estimate the causal effect by matching treated individuals with similar untreated individuals.

Propensity score: The probability of receiving treatment given observed covariates:

$$e(x) = P(T = 1 \mid X = x)$$

Why Propensity Scores Work

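Rosenbaum & Rubin (1983) proved that if treatment assignment is strongly ignorable given $X$ (no unobserved confounders, and every unit has a propensity score strictly between 0 and 1), then:

$$(Y(0), Y(1)) \perp T \mid e(X)$$

Comparing units within a subgroup that shares the same propensity score is therefore like running a mini randomized experiment.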

Steps

1. Estimate propensity score: use logistic regression (or GBM, random forest) to predict $P(T=1 \mid X)$
2. Match: pair each treated individual with the control individual whose propensity score is closest
3. Check balance: confirm that covariate distributions are similar between treated and control in the matched sample
4. Estimate effect: compute the difference in means on the matched sample

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Assumes X_covariates (2-D), treatment (0/1), and y are NumPy arrays
# Step 1: Estimate propensity scores
ps_model = LogisticRegression().fit(X_covariates, treatment)
propensity = ps_model.predict_proba(X_covariates)[:, 1]

# Step 2: Match each treated unit to its nearest control (with replacement)
treated_idx = np.where(treatment == 1)[0]
control_idx = np.where(treatment == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(propensity[control_idx].reshape(-1, 1))
distances, matches = nn.kneighbors(propensity[treated_idx].reshape(-1, 1))
matched_control_idx = control_idx[matches.flatten()]

# Step 3 (balance check) is omitted in this snippet
# Step 4: Estimate ATT as the difference in means on the matched sample
att = y[treated_idx].mean() - y[matched_control_idx].mean()
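The balance check (step 3) is often done with the standardized mean difference (SMD) of each covariate, where values below about 0.1 are commonly read as acceptable balance. The sketch below is NumPy-only and uses simulated data: a single hypothetical covariate `age`, a propensity score assumed to be known rather than estimated, and an illustrative helper `nearest_control` standing in for the nearest-neighbor step.

```python
import numpy as np

def standardized_mean_diff(x_treated, x_control):
    """(mean_t - mean_c) / pooled SD; values near 0 indicate balance."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

def nearest_control(ps_treated, ps_control):
    """Index into ps_control of the closest score for each treated unit (with replacement)."""
    order = np.argsort(ps_control)
    sorted_ps = ps_control[order]
    pos = np.clip(np.searchsorted(sorted_ps, ps_treated), 1, len(sorted_ps) - 1)
    left, right = sorted_ps[pos - 1], sorted_ps[pos]
    pick = np.where(ps_treated - left <= right - ps_treated, pos - 1, pos)
    return order[pick]

rng = np.random.default_rng(4)
n = 20_000
age = rng.normal(size=n)                   # hypothetical standardized covariate
ps = 1.0 / (1.0 + np.exp(-age))            # propensity score, assumed known here
t = rng.random(n) < ps                     # treatment is more likely at higher age

treated_idx, control_idx = np.where(t)[0], np.where(~t)[0]
matched_idx = control_idx[nearest_control(ps[treated_idx], ps[control_idx])]

smd_before = standardized_mean_diff(age[treated_idx], age[control_idx])
smd_after = standardized_mean_diff(age[treated_idx], age[matched_idx])

print(f"SMD before matching = {smd_before:.2f}, after = {smd_after:.3f}")
```

If the SMD stays large after matching, the usual responses are to re-specify the propensity model or to restrict the sample to the region where treated and control scores overlap.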

Alternatives to Matching

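Inverse Probability Weighting (IPW): Weight each observation by $1/e(x)$ for treated and $1/(1-e(x))$ for control (these weights target the ATE). No data need to be discarded.
Stratification: stratify by propensity score and compare within each stratum.
Doubly robust: combines an outcome model and a propensity model; the estimator is consistent as long as either one is correctly specified.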
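IPW can be sketched in a few lines. The example below assumes a single discrete confounder `x` with three levels, so the propensity score can be estimated as the share treated within each level instead of with a fitted model. The true ATE of 1.0 and the treatment probabilities are made-up values, and the estimator shown is the normalized (Hajek) variant.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
true_ate = 1.0

x = rng.integers(0, 3, size=n)                  # hypothetical discrete confounder
p_treat = np.array([0.2, 0.5, 0.8])[x]          # treatment is more likely when x is high
t = (rng.random(n) < p_treat).astype(int)
y = 2.0 * x + true_ate * t + rng.normal(size=n)

# Propensity estimate: share treated within each level of x
e_hat = np.array([t[x == k].mean() for k in range(3)])[x]

naive = y[t == 1].mean() - y[t == 0].mean()

# Normalized IPW estimate of the ATE
w1 = t / e_hat
w0 = (1 - t) / (1 - e_hat)
ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

print(f"naive = {naive:.2f}, IPW = {ipw:.2f}, truth = {true_ate}")
```

Weights become unstable when estimated propensities approach 0 or 1, which is why overlap matters in practice.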

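PSM's Critical Assumption: No Unobserved Confounders

PSM can only control for observed confounders. If an unobserved confounder exists (for example, motivation or health consciousness), the estimate is still biased even when propensity scores are matched perfectly. Unlike IV, PSM cannot handle unobserved confounding. Raise this limitation proactively in an interview.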

Regression Discontinuity Design (RDD)

The Idea

When treatment is assigned based on whether a running variable crosses a cutoff, individuals just above and just below the cutoff are nearly identical — creating a "natural experiment."

$$T_i = \mathbf{1}(X_i \geq c)$$

Sharp vs Fuzzy RDD

Sharp RDD: Treatment is deterministic at the cutoff — everyone above gets treated, everyone below doesn't.

$$\tau_{\text{RDD}} = \lim_{x \downarrow c} E[Y \mid X = x] - \lim_{x \uparrow c} E[Y \mid X = x]$$

Fuzzy RDD: The probability of treatment jumps at the cutoff but isn't deterministic. It works like an IV in which crossing the cutoff instruments for treatment: the effect is the jump in the outcome divided by the jump in the treatment probability.
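The fuzzy RDD ratio can be illustrated on simulated data. In this sketch the treatment probability is assumed to jump from 0.2 to 0.8 at a cutoff of 0, the true effect is a constant 3.0, and `jump_at_cutoff` is a hypothetical helper that fits a straight line on each side within a fixed bandwidth.

```python
import numpy as np

rng = np.random.default_rng(6)
n, cutoff, bw = 200_000, 0.0, 0.5
true_effect = 3.0

x = rng.uniform(-1, 1, size=n)                   # running variable
p_treat = np.where(x >= cutoff, 0.8, 0.2)        # probability jumps by 0.6 at the cutoff
t = (rng.random(n) < p_treat).astype(int)
y = 1.0 + 2.0 * x + true_effect * t + rng.normal(size=n)

def jump_at_cutoff(v):
    """Difference between local linear fits of v on x, evaluated at the cutoff."""
    right = (x >= cutoff) & (x <= cutoff + bw)
    left = (x < cutoff) & (x >= cutoff - bw)
    fit_right = np.polyfit(x[right] - cutoff, v[right], 1)
    fit_left = np.polyfit(x[left] - cutoff, v[left], 1)
    # polyfit returns [slope, intercept]; the intercept is the fitted value at the cutoff
    return fit_right[1] - fit_left[1]

outcome_jump = jump_at_cutoff(y)
treatment_jump = jump_at_cutoff(t)
fuzzy_rdd = outcome_jump / treatment_jump

print(f"outcome jump = {outcome_jump:.2f}, treatment jump = {treatment_jump:.2f}, "
      f"fuzzy RDD = {fuzzy_rdd:.2f}")
```

In a sharp design the treatment jump equals 1, so the ratio reduces to the outcome jump alone.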

Classic Examples

| Cutoff | Running Variable | Treatment | Outcome |
| --- | --- | --- | --- |
| SAT score ≥ 1400 | SAT score | Admission to elite university | Lifetime earnings |
| Age ≥ 21 | Age | Legal alcohol purchase | Health outcomes |
| Poverty score ≤ threshold | Poverty score | Eligibility for subsidy | Child education |
| GPA ≥ 3.0 | GPA | Dean's list recognition | Subsequent academic performance |

Key Assumptions

Continuity: The conditional expectations $E[Y(0) \mid X=x]$ and $E[Y(1) \mid X=x]$ are continuous in $x$ at the cutoff, meaning units near the cutoff do not differ systematically apart from treatment
No manipulation: Individuals cannot precisely manipulate the running variable to land just above or below the cutoff; if they can, units near the cutoff are no longer "as-if random"

Detecting manipulation: use the McCrary density test to check whether the density of the running variable jumps at the cutoff. A jump suggests that someone is manipulating it.
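The intuition behind a density check can be sketched with a simple count comparison. The code below is a crude stand-in and not the McCrary estimator: it only compares how many observations fall in equal-width windows on each side of the cutoff, which is reasonable only if the density is roughly flat nearby. The function name `density_jump_z` and the manipulation rule are assumptions for illustration.

```python
import numpy as np

def density_jump_z(running, cutoff, width):
    """z-statistic for equal counts in [c - width, c) and [c, c + width)."""
    below = np.sum((running >= cutoff - width) & (running < cutoff))
    above = np.sum((running >= cutoff) & (running < cutoff + width))
    total = below + above
    return (above - total / 2.0) / np.sqrt(total / 4.0)

rng = np.random.default_rng(7)
score = rng.uniform(0, 100, size=50_000)
cutoff = 70.0

# Hypothetical manipulation: half of those within 2 points below nudge themselves over
manipulated = score.copy()
nudge = (score >= cutoff - 2) & (score < cutoff) & (rng.random(score.size) < 0.5)
manipulated[nudge] = cutoff + rng.uniform(0, 1, size=nudge.sum())

z_clean = density_jump_z(score, cutoff, width=2.0)
z_manip = density_jump_z(manipulated, cutoff, width=2.0)

print(f"z without manipulation = {z_clean:.2f}, with manipulation = {z_manip:.2f}")
```

A large positive statistic means too many observations sit just above the cutoff, which is the pattern manipulation would produce.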

RDD in Practice

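import statsmodels.formula.api as smf

# Local linear regression around the cutoff
# Assumes df has columns score (running variable) and outcome
bandwidth = 5  # only use observations within ±5 of cutoff
cutoff = 70

# .copy() so the new columns are not written to a view of df
df_local = df[(df["score"] >= cutoff - bandwidth) & (df["score"] <= cutoff + bandwidth)].copy()
df_local["above"] = (df_local["score"] >= cutoff).astype(int)
df_local["score_centered"] = df_local["score"] - cutoff

# above * score_centered expands to both main effects plus their interaction,
# so the slope is allowed to differ on each side of the cutoff
model = smf.ols("outcome ~ above * score_centered", data=df_local).fit()
rdd_estimate = model.params["above"]  # RDD estimate of the causal effect at the cutoff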

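Strengths and Weaknesses of RDD

Strengths: the weakest assumptions of the four methods (only continuity near the cutoff) and strong internal validity. It is often described as "the closest thing to a true experiment with observational data".

Weaknesses: the estimate is valid only near the cutoff (a local effect) and cannot be generalized to populations far from it. It also needs enough observations close to the cutoff.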

Method Comparison

| Method | Identifies | Key Assumption | Handles Unobserved Confounders? | Effect Type |
| --- | --- | --- | --- | --- |
| A/B Test | ATE | Randomization | Yes (by design) | Global |
| DiD | ATT | Parallel trends | Only time-invariant ones | Group-level |
| IV | LATE | Exclusion restriction | Yes (if instrument valid) | Compliers only |
| PSM | ATT | No unobserved confounders | No | Matched sample |
| RDD | LATE | Continuity at cutoff | Yes (locally) | At cutoff only |

LATE = Local Average Treatment Effect

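IV estimates a LATE: it is valid only for compliers, the people whose treatment status changes because of the instrument. RDD is local in a related sense: sharp RDD gives the average effect for units at the cutoff, and fuzzy RDD gives the effect for compliers at the cutoff. Neither is the ATE for the whole population. If asked "What does IV estimate?", answer LATE, not ATE.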
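A simulation makes the gap between LATE and ATE visible. The population below is hypothetical: 50% compliers (effect 1.0), 25% always-takers (effect 4.0), and 25% never-takers (effect 0.0), with a randomized binary instrument `z`. The Wald estimator divides the instrument's effect on the outcome by its effect on treatment uptake.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400_000

# Hypothetical population with heterogeneous effects by compliance type
kind = rng.choice(["complier", "always", "never"], size=n, p=[0.5, 0.25, 0.25])
effect = np.select([kind == "complier", kind == "always"], [1.0, 4.0], default=0.0)

z = rng.integers(0, 2, size=n)     # randomized instrument
# Always-takers take treatment regardless, never-takers never do, compliers follow z
t = np.where(kind == "always", 1, np.where(kind == "never", 0, z))
y = 2.0 * (kind == "always") + effect * t + rng.normal(size=n)

# Wald estimator: reduced form divided by first stage
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = t[z == 1].mean() - t[z == 0].mean()
wald = reduced_form / first_stage

ate = effect.mean()                          # population value is 1.5
late = effect[kind == "complier"].mean()     # effect among compliers

print(f"Wald = {wald:.2f}, complier effect = {late:.2f}, ATE = {ate:.2f}")
```

The Wald estimate tracks the complier effect rather than the ATE, because always-takers and never-takers do not change their treatment status with the instrument and so contribute nothing to the numerator.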

Real-World Use Cases

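Case 1: Credit Cards (Causal Effect of Credit Limit on Spending)

A bank wants to know: does raising the credit limit increase spending?

Directly comparing spending between high-limit and low-limit customers does not work, because high-limit customers have more spending capacity to begin with (confounding by creditworthiness).

RDD approach: the bank sets credit limits by credit score. A score ≥ 700 gets a $10K limit, and a score < 700 gets $5K. Compare customers with scores between 695 and 705: they are nearly identical, but their limits differ by $5K.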

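Case 2: Recommender Systems (Causal Effect of a New Algorithm on Engagement)

A new recommendation algorithm has already been rolled out to all traffic (no A/B test was run), and your boss asks: "Did the new algorithm actually improve engagement?"

DiD approach: compare the change in engagement before and after launch against a comparable product line (or market) that kept the old algorithm. If pre-treatment trends are parallel, DiD can estimate the causal effect.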

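Case 3: Pricing Strategy (Causal Effect of Discounts on LTV)

You want to know: does a first-month discount for new users increase LTV? Directly comparing discounted and full-price users does not work, because users who take the discount may already be price-sensitive, lower-LTV users.

PSM approach: estimate the propensity score from observed features (demographics, acquisition channel, first session behavior), then match users who took the discount with users who did not. Compare 12-month LTV on the matched sample. Acknowledge the limitation: if there is an unobserved confounder (for example, purchase intent), the PSM estimate is still biased.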

Hands-on: Causal Inference in Python

Difference-in-Differences

import pandas as pd
import statsmodels.formula.api as smf

# DiD regression
# Assumes df is a pandas DataFrame with columns revenue, treat, post
# treat: 1 if treatment group
# post: 1 if after treatment period
# treat:post: interaction (the DiD estimator), built by the formula itself
model = smf.ols("revenue ~ treat + post + treat:post", data=df).fit()
print(model.summary())
# treat:post coefficient = causal effect estimate
# Check: pre-treatment trends should be parallel

Propensity Score Matching

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Assumes X (covariates), treatment (0/1), and outcome are NumPy arrays
# Estimate propensity scores
ps = LogisticRegression(C=1.0).fit(X, treatment).predict_proba(X)[:, 1]

# Nearest neighbor matching (with replacement)
treated = np.where(treatment == 1)[0]
control = np.where(treatment == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, matches = nn.kneighbors(ps[treated].reshape(-1, 1))
matched = control[matches.flatten()]

# ATT estimate
att = outcome[treated].mean() - outcome[matched].mean()
# Check covariate balance after matching

Regression Discontinuity

import statsmodels.formula.api as smf

# Local linear regression within bandwidth
bw = 5
cutoff = 70
local = df[df["score"].between(cutoff - bw, cutoff + bw)].copy()
local["above"] = (local["score"] >= cutoff).astype(int)
local["score_c"] = local["score"] - cutoff

model = smf.ols("outcome ~ above * score_c", data=local).fit()
# model.params["above"] = RDD causal effect at cutoff

Interview Signals

What interviewers listen for:

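You know the concrete reasons why correlation ≠ causation (not just the slogan)
You can pick the right causal inference method for the scenario
You proactively state each method's core assumptions and limitations
You understand the differences between ATE, ATT, and LATE
You know that PSM cannot handle unobserved confounding and that IV needs a good instrument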

Practice

Flashcards


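What is the core assumption of Difference-in-Differences? How do you test it?

Parallel trends assumption: without the treatment, the outcome trends of the two groups would be parallel. It cannot be verified directly (the counterfactual is unobservable), but you can: (1) check whether pre-treatment trends are parallel, and (2) run a placebo test, that is, a fake DiD using a pre-treatment date.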


Quiz


You find that users of the company's premium feature have higher retention. Can you conclude that the premium feature improves retention?
