Bayesian Inference
Interview Context
Bayesian questions show up in interviews in two forms: (1) computation problems that apply Bayes' theorem directly, and (2) open-ended discussions of the Bayesian vs Frequentist philosophical divide. Prepare for both.
Bayes' Theorem
The foundation of all Bayesian inference:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

Each term has a specific name:
- $P(\theta \mid D)$ — Posterior: what you believe about $\theta$ after observing $D$
- $P(D \mid \theta)$ — Likelihood: how probable the observed data is, given $\theta$
- $P(\theta)$ — Prior: what you believed about $\theta$ before seeing data
- $P(D)$ — Evidence (marginal likelihood): a normalizing constant
The denominator is usually computed with the law of total probability:

$$P(D) = \sum_i P(D \mid \theta_i)\, P(\theta_i)$$
Classic Interview Example: Disease Testing
A disease affects 1% of the population. A test has 95% sensitivity (TPR) and 90% specificity (TNR). If a person tests positive, what is the probability they actually have the disease?

$$P(D \mid T^+) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.1085} \approx 0.088$$

Only 8.8% — even with a test that is 95% accurate!
Base Rate Fallacy
This is the most common Bayesian trap in interviews. Intuition says a 95%-accurate test should be reliable, but when the disease itself is rare (a low prior), the flood of false positives overwhelms the true positives. Always ask about the base rate first.
Step-by-Step Framework for Bayes Problems
When a Bayes computation question comes up in an interview, this framework keeps you from going wrong:
- Define events clearly: $D$ = has disease, $T^+$ = tests positive
- List what you know: $P(D)$, $P(T^+ \mid D)$, $P(T^+ \mid \neg D)$
- Compute the denominator: $P(T^+) = P(T^+ \mid D)\,P(D) + P(T^+ \mid \neg D)\,P(\neg D)$
- Apply Bayes' theorem: plug in and compute
- Sanity check: if the base rate is low, the posterior usually won't be high either
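Applied to the disease-testing example above, the framework is a few lines of Python (a sketch; the numbers are taken from the example):

```python
p_disease = 0.01     # base rate (prior), P(D)
sensitivity = 0.95   # P(T+ | D)
specificity = 0.90   # P(T- | not D), so FPR = 1 - 0.90 = 0.10

# Step 3: denominator via the law of total probability
p_positive = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)

# Step 4: Bayes' theorem
posterior = sensitivity * p_disease / p_positive
print(round(posterior, 3))  # → 0.088
```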
The Bayesian Framework
In the Bayesian approach, parameters are treated as random variables with distributions, not fixed unknown constants:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}$$

Or more concisely:

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$

The posterior condenses everything we know about the parameter after observing the data. As data accumulates, the posterior concentrates more and more — the data gradually overwhelms the prior.
Prior, Likelihood, and Posterior
Choosing a Prior
The prior encodes your beliefs before seeing data. Common choices:
| Prior Type | Description | When to Use |
|---|---|---|
| Uninformative (flat) | $p(\theta) \propto 1$ — no preference | When you have no prior knowledge at all |
| Weakly informative | Regularizes extremes, e.g., $\mathcal{N}(0, 10^2)$ | When you have a rough range but are unsure |
| Informative | Based on domain knowledge or past data | When you have strong prior knowledge (e.g., historical data) |
| Jeffreys prior | $p(\theta) \propto \sqrt{I(\theta)}$ — invariant under reparameterization | When you want an "objective" Bayesian analysis |
How to Answer the Prior Question in an Interview
A strong answer: "I would set a weakly informative prior based on domain knowledge, then run a sensitivity analysis to see how much the results depend on it. With enough data, the prior's influence is small — which is itself a check on the robustness of the analysis."
Conjugate Priors
A prior is conjugate to a likelihood if the posterior belongs to the same distribution family. The payoff is a closed-form posterior — you can write down its parameters directly, with no numerical methods:
| Likelihood | Conjugate Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli / Binomial | Beta($\alpha, \beta$) | Beta($\alpha + s, \beta + f$) for $s$ successes, $f$ failures | CTR, conversion rate |
| Poisson | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum x_i, \beta + n$) | Event counts |
| Normal (known $\sigma^2$) | Normal($\mu_0, \tau^2$) | Normal (precision-weighted mean) | Continuous measurements |
| Exponential | Gamma($\alpha, \beta$) | Gamma($\alpha + n, \beta + \sum x_i$) | Waiting times |
| Multinomial | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_1 + c_1, \dots, \alpha_K + c_K$) | Category counts (NLP) |
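Each row of the table is a one-line update. As a sketch, here is the Gamma-Poisson row with hypothetical event counts (rate parameterization of the Gamma):

```python
import numpy as np

# Gamma(alpha, beta) prior on a Poisson rate; observing counts x_1..x_n
# gives Gamma(alpha + sum(x_i), beta + n) in closed form.
alpha0, beta0 = 2.0, 1.0             # prior mean = alpha/beta = 2 events/day
counts = np.array([3, 5, 4, 2, 6])   # hypothetical daily event counts

alpha_post = alpha0 + counts.sum()   # 2 + 20 = 22
beta_post = beta0 + len(counts)      # 1 + 5 = 6
posterior_mean = alpha_post / beta_post  # ≈ 3.67 events/day
```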
Beta-Binomial: The Most Important Example
You want to estimate a website's conversion rate.
Prior: Beta(2, 8) — a rough belief that the conversion rate is around 20% (prior mean $= 2/(2+8) = 0.2$)
Data: 15 conversions out of 100 visits
Posterior: Beta($2 + 15,\ 8 + 85$) = Beta(17, 93), with posterior mean $17/110 \approx 0.155$
The MLE is 15/100 = 0.15; the posterior mean is pulled toward the prior's 0.20. The more data you have, the closer the posterior gets to the MLE.
Intuition: Beta($\alpha, \beta$) behaves as if you had already seen $\alpha - 1$ successes and $\beta - 1$ failures. The prior Beta(2, 8) is like having 1 success + 7 failures of "virtual observations". After 100 real observations, those virtual ones barely matter.
Posterior as a Compromise
The posterior mean always lies between the prior mean and the MLE:

$$\mathbb{E}[\theta \mid D] = w \cdot \hat{\theta}_{\text{MLE}} + (1 - w) \cdot \mathbb{E}[\theta]$$

where $w = \frac{n}{n + \alpha + \beta}$ in the Beta-Binomial case.

The more data (the higher the data precision), the closer $w \to 1$, and the posterior → MLE. The prior only has a visible effect in small samples.
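The compromise formula can be checked numerically for the conversion-rate example (a sketch using the Beta-Binomial weights):

```python
# Posterior mean = w * MLE + (1 - w) * prior mean, with w = n / (n + alpha + beta)
alpha, beta = 2, 8
s, n = 15, 100

prior_mean = alpha / (alpha + beta)           # 0.20
mle = s / n                                   # 0.15
post_mean = (alpha + s) / (alpha + beta + n)  # 17/110 ≈ 0.1545

w = n / (n + alpha + beta)                    # data weight, 100/110
assert abs(post_mean - (w * mle + (1 - w) * prior_mean)) < 1e-12
```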
MAP vs MLE
Two key point estimation methods:
Maximum Likelihood Estimation (MLE) — find the parameter that maximizes the likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta\, p(D \mid \theta)$$

Maximum A Posteriori (MAP) — find the parameter that maximizes the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta\, p(D \mid \theta)\, p(\theta)$$
The Key Connection to Regularization
| Prior on weights | MAP = | Regularization term |
|---|---|---|
| Gaussian: $w_j \sim \mathcal{N}(0, \sigma_w^2)$ | Ridge regression | $\lambda \sum_j w_j^2$ (L2) |
| Laplace: $w_j \sim \text{Laplace}(0, b)$ | Lasso regression | $\lambda \sum_j \lvert w_j \rvert$ (L1) |
| Flat: $p(w) \propto 1$ | OLS (no regularization) | None |
Connecting Bayesian and Regularization
This connection is a reliable bonus point in interviews: L2 regularization = Gaussian prior on the weights, L1 = Laplace prior. The prior precision ($1/\sigma_w^2$) maps to the regularization strength $\lambda$. Once you see this, you can explain in Bayesian language why regularization prevents overfitting — the prior pulls the weights toward zero.
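The MAP = Ridge equivalence can be verified in a few lines (a sketch on synthetic data; for Gaussian noise $\sigma^2$ and weight prior $\sigma_w^2$, the equivalent ridge penalty is $\lambda = \sigma^2 / \sigma_w^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=50)

sigma2, sigma_w2 = 0.25, 1.0
lam = sigma2 / sigma_w2

# Ridge closed form: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Same solution derived as the mode of the Gaussian posterior over w
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(3) / sigma_w2, X.T @ y / sigma2)
assert np.allclose(w_ridge, w_map)
```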
Bayesian vs Frequentist
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown constants | Random variables with distributions |
| Probability | Long-run frequency | Degree of belief |
| Prior information | Not used | Explicitly incorporated |
| Result | Point estimate + confidence interval | Full posterior distribution |
| Interval interpretation | "95% of CIs from repeated experiments contain the true value" | "95% probability the parameter is in this interval" |
| Small sample | Can be unreliable | Prior helps regularize |
| Computation | Usually closed-form or simple | May need MCMC, variational inference |
Neither side is absolutely better or worse — it depends on the setting:
- Bayesian fits better: small data, strong prior knowledge, quantifying full uncertainty, sequential updating, wanting an intuitive probability statement
- Frequentist fits better: large data, computational simplicity, regulatory settings (the FDA expects well-understood procedures), avoiding debates over the choice of prior
Bayesian Updating
One of Bayesian inference's most elegant properties — the posterior from one analysis becomes the prior for the next:

$$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(\theta \mid D_1)$$

No need to reprocess historical data: when new data arrives, you update the posterior directly. This makes Bayesian methods a natural fit for online learning and sequential decision-making.
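For conjugate models the updating property is easy to demonstrate: sequential Beta updates over batches give exactly the same posterior as one batch update (a sketch with made-up batch counts):

```python
# Flat Beta(1, 1) prior; each batch is (successes, failures)
alpha, beta = 1, 1
batches = [(12, 88), (20, 80), (9, 91)]

for s, f in batches:
    alpha, beta = alpha + s, beta + f  # yesterday's posterior = today's prior

total_s = sum(s for s, _ in batches)  # 41
total_f = sum(f for _, f in batches)  # 259
assert (alpha, beta) == (1 + total_s, 1 + total_f)
```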
Credible Interval vs Confidence Interval
| | Confidence Interval (Frequentist) | Credible Interval (Bayesian) |
|---|---|---|
| Statement | "95% of CIs from repeated experiments contain $\theta$" | "$P(\theta \in [a, b] \mid D) = 0.95$" |
| Fixed quantity | $\theta$ is fixed, the interval is random | The interval is fixed, $\theta$ is random |
| Interpretation | About the long-run performance of the procedure | About your belief in this particular result |
| Depends on prior? | No | Yes |
If the interviewer asks "what's the difference between a CI and a credible interval?" — the core difference: a CI is a frequentist procedure guarantee, while a credible interval is a direct probability statement about the parameter. The interpretation most people intuitively want is actually the credible interval.
Real-World Use Cases
Case 1: Credit Card Fraud Detection — Naive Bayes
The Naive Bayes classifier applies Bayes' theorem directly:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
"Naive" assumption: features are conditionally independent given the class.
Even though this assumption almost never holds (transaction amount and location are surely correlated), Naive Bayes still performs well in practice, because it only needs to rank classes correctly — it does not need accurate posterior probabilities.
Interview follow-up: "Naive Bayes assumes feature independence, but real features are almost always correlated. Why does it still work?" — Because classification only needs $\arg\max_y P(y \mid x)$; even when the absolute probabilities are off, the ranking is usually right.
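The argmax computation behind this can be sketched from scratch with hypothetical per-word likelihoods and class priors (log space avoids underflow):

```python
import math

# Hypothetical spam-filter probabilities (illustrative numbers only)
priors = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"free": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "meeting": 0.10},
}

def log_score(cls, words):
    # log P(y) + sum_i log P(x_i | y); the argmax over classes is the prediction
    return math.log(priors[cls]) + sum(math.log(likelihood[cls][w]) for w in words)

words = ["free", "free"]
pred = max(priors, key=lambda c: log_score(c, words))
print(pred)  # → spam
```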
Case 2: Recommender Systems — Thompson Sampling
Thompson Sampling is a Bayesian bandit algorithm for balancing exploration and exploitation:
- For each option (arm), maintain a Beta posterior for its success rate
- Sample from each posterior: $\tilde{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)$
- Select the arm with the highest sampled value
- Observe the outcome, update the posterior
Thompson Sampling naturally balances exploration (trying uncertain options) against exploitation (picking the known best). Arms with high posterior variance get explored more often.
```python
import numpy as np

class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return int(np.argmax(samples))

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```
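A usage sketch: simulate three arms with hidden success rates and watch the pulls concentrate on the best one (the same logic as the class above, written inline so it runs standalone; the rates are made up):

```python
import numpy as np

np.random.seed(0)
true_rates = np.array([0.10, 0.15, 0.25])  # hidden; arm 2 is best
alpha = np.ones(3)  # Beta(1, 1) prior per arm
beta = np.ones(3)

for _ in range(5000):
    arm = int(np.argmax(np.random.beta(alpha, beta)))  # sample, pick the max
    if np.random.rand() < true_rates[arm]:             # simulate the reward
        alpha[arm] += 1
    else:
        beta[arm] += 1

pulls = alpha + beta - 2  # pulls per arm (subtract the prior pseudo-counts)
# The best arm (rate 0.25) should receive the vast majority of pulls.
```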
Case 3: House Price Prediction — Bayesian Linear Regression
Standard linear regression gives point estimates for coefficients. Bayesian linear regression gives a full posterior distribution for each coefficient:

$$y = X w + \varepsilon, \quad w \sim \mathcal{N}(0, \sigma_w^2 I), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)$$
Benefits:
- Uncertainty quantification: every prediction comes with a credible interval, not just a single number
- Automatic regularization: the prior on $w$ is the regularization
- Interpretable feature importance: posterior width reflects each coefficient's uncertainty
A typical interview question: "Your model says this house is worth $500K. How sure are you?" — A frequentist can only offer a prediction interval; a Bayesian hands over the posterior predictive distribution directly, which is much easier to communicate.
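A minimal sketch with scikit-learn's BayesianRidge on synthetic "house price" data; the feature matrix and coefficients are made up for illustration. `predict(..., return_std=True)` returns both the point prediction and a predictive standard deviation — the direct answer to "how sure are you?":

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))            # hypothetical standardized features
w_true = np.array([50.0, 20.0, -10.0, 5.0])
y = 300 + X @ w_true + rng.normal(scale=25.0, size=200)  # prices in $K

model = BayesianRidge()
model.fit(X, y)

# Point prediction + predictive std for one house
y_mean, y_std = model.predict(X[:1], return_std=True)
# y_std quantifies predictive uncertainty; model.coef_ holds posterior means
```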
When to Use Bayesian in Practice
| Application | Why Bayesian | Method |
|---|---|---|
| A/B testing (small sample) | Directly gives $P(\text{B} > \text{A} \mid \text{data})$, more intuitive than a p-value | Beta-Bernoulli model |
| Personalization / Ads | Automatically balances exploration and exploitation | Thompson Sampling |
| Spam filtering | Computes $P(\text{spam} \mid \text{words})$ for classification | Naive Bayes |
| Forecasting | Injects domain knowledge as a prior | Bayesian structural time series |
| Medical trials | Sequential updating; trials can stop early | Bayesian adaptive designs |
| Calibration | Turns model scores into calibrated probabilities | Platt scaling (MAP) |
Hands-on: Bayesian Inference in Python
Beta-Binomial Updating & Bayesian A/B Test
```python
import numpy as np
from scipy import stats

# === Beta-Binomial Updating ===
alpha_prior, beta_prior = 2, 8   # Prior: ~20% conversion belief
successes, trials = 15, 100

alpha_post = alpha_prior + successes                    # 17
beta_post = beta_prior + (trials - successes)           # 93
posterior_mean = alpha_post / (alpha_post + beta_post)  # ≈ 0.155
# MLE = 15/100 = 0.15 → posterior mean pulled toward the prior's 0.20

# 95% Credible Interval
ci = stats.beta.ppf([0.025, 0.975], alpha_post, beta_post)

# === Bayesian A/B Test ===
# Control: 120/1000, Treatment: 145/1000
a_A, b_A = 1 + 120, 1 + 880  # Control posterior (flat Beta(1, 1) prior)
a_B, b_B = 1 + 145, 1 + 855  # Treatment posterior

# Monte Carlo estimate of P(B > A | data)
samples_A = np.random.beta(a_A, b_A, 100_000)
samples_B = np.random.beta(a_B, b_B, 100_000)
prob_B_better = (samples_B > samples_A).mean()
# prob_B_better ≈ 0.95 → strong evidence Treatment is better

expected_lift = ((samples_B - samples_A) / samples_A).mean()
```
Naive Bayes Classifier
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score

# X, y and X_text, y are assumed to be your feature matrices and labels

# For continuous features (e.g., fraud detection)
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=5, scoring="f1")

# For text/count features (e.g., spam filtering)
mnb = MultinomialNB(alpha=1.0)  # alpha = Laplace smoothing
scores = cross_val_score(mnb, X_text, y, cv=5, scoring="f1")
# alpha=1.0 corresponds to adding 1 "virtual" count per feature per class
```
Interview Signals
What interviewers listen for:
- You can apply Bayes' theorem correctly on computation problems without mixing up the prior and the likelihood
- You understand why the base rate matters and don't draw conclusions from test accuracy alone
- You can explain the difference between MAP and MLE, and how they connect to regularization
- You know Bayesian methods aren't a cure-all and can compare the two schools objectively
- You can distinguish a credible interval from a confidence interval
Practice
Flashcards
How are MAP estimation and Ridge regression related?
MAP with a Gaussian prior on the weights = Ridge regression (L2). The prior precision $1/\sigma^2$ corresponds to the regularization strength $\lambda$. A Laplace prior corresponds to Lasso (L1). A flat prior gives MAP = MLE = OLS.
Quiz
A factory's defect rate is 0.1%. A detector has sensitivity = 99% and specificity = 98%. If a product is flagged as defective, the probability it is actually defective is closest to?