Bayesian Inference

Interview Context

Bayesian questions show up in interviews in two forms: (1) computation problems that apply Bayes' theorem directly, and (2) open-ended questions about the philosophical differences between Bayesian and frequentist statistics. Prepare for both.

Bayes' Theorem

The foundation of all Bayesian inference:

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

Each term has a specific name:

  • P(A \mid B) — Posterior: what you believe about A after observing B
  • P(B \mid A) — Likelihood: how probable the observed data B is, given A
  • P(A) — Prior: what you believed about A before seeing data
  • P(B) — Evidence (marginal likelihood): a normalizing constant

The denominator is usually computed with the law of total probability:

P(B) = \sum_i P(B \mid A_i) \cdot P(A_i)

Classic Interview Example: Disease Testing

A disease affects 1% of the population. A test has 95% sensitivity (TPR) and 90% specificity (TNR). If a person tests positive, what is the probability they actually have the disease?

P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease}) \cdot P(\text{disease})}{P(+)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.1085} \approx 0.088

Only 8.8%, even though the test is "95% accurate"!

Base Rate Fallacy

This is the most common Bayesian trap in interviews. Intuition says a 95%-accurate test should be reliable, but when the disease itself is rare (a low prior), the flood of false positives swamps the true positives. Always ask about the base rate first.

Step-by-Step Framework for Bayes Problems

When you hit a Bayes computation problem in an interview, this framework keeps you from making mistakes:

  1. Define events clearly: A = has disease, B = tests positive
  2. List what you know: P(A), P(B \mid A), P(B \mid A^c)
  3. Compute the denominator: P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)
  4. Apply Bayes' theorem: plug in and compute
  5. Sanity check: if the base rate is low, the posterior usually will not be high either
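
As a sanity check, the five steps can be sketched as a small helper (the function name `bayes_posterior` is ours; the numbers reuse the disease-testing example):

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) for a binary partition {A, not A}, following steps 1-4."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)  # step 3
    return likelihood * prior / evidence                                # step 4

# Disease-testing numbers: P(A) = 0.01, P(B|A) = 0.95, P(B|A^c) = 0.10
posterior = bayes_posterior(prior=0.01, likelihood=0.95, likelihood_given_not=0.10)
# ~0.088, matching the worked example; the low base rate keeps the posterior low (step 5)
```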

The Bayesian Framework

In the Bayesian approach, parameters are treated as random variables with distributions, not fixed unknown constants:

P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}

Or more concisely:

\text{Posterior} \propto \text{Likelihood} \times \text{Prior}

The posterior condenses everything we know about the parameter after observing the data. As data accumulates, the posterior becomes more and more concentrated: the data gradually overwhelms the prior.

Prior, Likelihood, and Posterior

Choosing a Prior

The prior encodes your beliefs before seeing data. Common choices:

| Prior Type | Description | When to Use |
|---|---|---|
| Uninformative (flat) | P(\theta) \propto 1 (no preference) | No prior knowledge at all |
| Weakly informative | Regularizes extremes, e.g., \theta \sim N(0, 10) | A rough range is known, but uncertain |
| Informative | Based on domain knowledge or past data | Strong prior knowledge (e.g., historical data) |
| Jeffreys prior | P(\theta) \propto \sqrt{I(\theta)}, invariant under reparameterization | When an "objective" Bayesian analysis is wanted |

How to Answer the Prior-Choice Question in an Interview

A good answer: "I would set a weakly informative prior based on domain knowledge, then run a sensitivity analysis to see how much the results depend on the prior. With enough data the prior's influence is small, which is itself a check on the robustness of the analysis."

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distribution family. The benefit: the posterior's parameters can be written down directly, with no numerical methods needed:

| Likelihood | Conjugate Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli / Binomial | Beta(\alpha, \beta) | Beta(\alpha + k, \beta + n - k) | CTR, conversion rate |
| Poisson | Gamma(\alpha, \beta) | Gamma(\alpha + \sum x_i, \beta + n) | Event counts |
| Normal (known \sigma) | Normal(\mu_0, \sigma_0^2) | Normal (weighted mean) | Continuous measurements |
| Exponential | Gamma(\alpha, \beta) | Gamma(\alpha + n, \beta + \sum x_i) | Waiting times |
| Multinomial | Dirichlet(\boldsymbol{\alpha}) | Dirichlet(\boldsymbol{\alpha} + \mathbf{k}) | Category counts (NLP) |

Beta-Binomial: The Most Important Example

You want to estimate a website's conversion rate.

Prior: Beta(2, 8), encoding a rough belief that the conversion rate is around 20% (mean \frac{2}{2+8} = 0.2)

Data: 15 conversions out of 100 visits

Posterior:

\text{Beta}(2 + 15,\; 8 + 85) = \text{Beta}(17, 93)

\text{Posterior mean} = \frac{17}{17 + 93} \approx 0.155

The MLE is 15/100 = 0.15; the posterior mean is pulled toward the prior's 0.20. The more data you have, the closer the posterior gets to the MLE.

Intuition: Beta(\alpha, \beta) behaves as if you had already seen \alpha - 1 successes and \beta - 1 failures. The prior Beta(2, 8) is like carrying 1 success + 7 failures of "virtual observations". After 100 real data points, those virtual observations barely matter.
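
The claim that more data pulls the posterior toward the MLE is easy to check numerically (a sketch with the same Beta(2, 8) prior and a conversion rate held at 15%):

```python
alpha0, beta0 = 2, 8                     # Beta(2, 8) prior: 1 + 7 virtual observations
means = []
for trials in [100, 1000, 10000]:
    k = int(0.15 * trials)               # conversion rate held at 15%
    means.append((alpha0 + k) / (alpha0 + beta0 + trials))
# means: ~0.1545, ~0.1505, ~0.1500 -> sliding toward the MLE of 0.15
```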

Posterior as a Compromise

The posterior mean always lies between the prior mean and the MLE:

\text{Posterior mean} = w \cdot \text{Prior mean} + (1 - w) \cdot \text{MLE}

where w = \frac{\text{prior precision}}{\text{prior precision} + \text{data precision}}.

The more data (the higher the data precision), the closer w gets to 0 and the posterior to the MLE. The prior only matters noticeably in small samples.
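
For the Normal likelihood with known variance this compromise is exact; a minimal sketch with made-up numbers (prior N(0, 1), observation variance 4):

```python
import numpy as np

mu0, tau2 = 0.0, 1.0       # prior: theta ~ N(mu0, tau2)
sigma2 = 4.0               # known observation variance
rng = np.random.default_rng(0)
x = rng.normal(2.0, np.sqrt(sigma2), size=50)   # data with true mean 2.0

prior_prec = 1 / tau2
data_prec = len(x) / sigma2
w = prior_prec / (prior_prec + data_prec)       # weight on the prior mean

post_mean = w * mu0 + (1 - w) * x.mean()        # the compromise formula
# equivalent closed form: (mu0 / tau2 + x.sum() / sigma2) / (prior_prec + data_prec)
# post_mean lies between mu0 and the MLE x.mean()
```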

MAP vs MLE

Two key point estimation methods:

Maximum Likelihood Estimation (MLE) — find the parameter that maximizes the likelihood:

\hat{\theta}_{\text{MLE}} = \arg\max_\theta \; P(\text{data} \mid \theta)

Maximum A Posteriori (MAP) — find the parameter that maximizes the posterior:

\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; P(\text{data} \mid \theta) \cdot P(\theta)

The Key Connection to Regularization

| Prior on weights | MAP = | Regularization term |
|---|---|---|
| Gaussian: w \sim N(0, \sigma^2) | Ridge regression | \lambda \sum w_j^2 (L2) |
| Laplace: w \sim \text{Laplace}(0, b) | Lasso regression | \lambda \sum \lvert w_j \rvert (L1) |
| Flat: P(w) \propto 1 | OLS (no regularization) | None |

Connecting Bayesian and Regularization

This connection is a reliable bonus point in interviews: L2 regularization = Gaussian prior on the weights, L1 = Laplace prior. The prior's precision (1/\sigma^2) corresponds to the regularization strength \lambda. Once you see this, you can explain in Bayesian language why regularization prevents overfitting: the prior pulls the weights toward zero.
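
The correspondence can be checked numerically: with Gaussian noise N(0, \sigma^2) and prior w \sim N(0, \tau^2 I), the MAP estimate is exactly the ridge solution with \lambda = \sigma^2 / \tau^2. A sketch on synthetic data of our own:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])             # made-up coefficients
sigma2, tau2 = 1.0, 0.5                         # noise variance, prior variance
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), size=200)

lam = sigma2 / tau2                             # prior precision ratio -> ridge strength
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # = ridge solution
w_ols = np.linalg.solve(X.T @ X, X.T @ y)       # flat prior: MAP = MLE = OLS
# w_map is shrunk toward zero relative to w_ols, as the Gaussian prior demands
```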

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown constants | Random variables with distributions |
| Probability | Long-run frequency | Degree of belief |
| Prior information | Not used | Explicitly incorporated |
| Result | Point estimate + confidence interval | Full posterior distribution |
| Interval interpretation | "95% of CIs from repeated experiments contain the true value" | "95% probability the parameter is in this interval" |
| Small sample | Can be unreliable | Prior helps regularize |
| Computation | Usually closed-form or simple | May need MCMC, variational inference |

Neither school is absolutely better or worse; it depends on the setting:

  • Bayesian fits better: small data, strong prior knowledge, quantifying full uncertainty, sequential updating, or when an intuitive probability statement is needed
  • Frequentist fits better: large data, computational simplicity, regulatory settings (the FDA requires well-understood procedures), or when you don't want to argue over the choice of prior

Bayesian Updating

One of Bayesian inference's most elegant properties — the posterior from one analysis becomes the prior for the next:

P(\theta \mid D_1, D_2) \propto P(D_2 \mid \theta) \cdot \underbrace{P(\theta \mid D_1)}_{\text{previous posterior = new prior}}

There is no need to reprocess historical data: when new data arrives, just update the posterior. This makes Bayesian methods a natural fit for online learning and sequential decision-making.
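
A quick sketch of this property with the Beta-Binomial model: processing two batches sequentially gives exactly the same posterior as processing them together (the batch sizes are made up):

```python
# Batch: all the data at once
alpha0, beta0 = 2, 8                         # prior
k1, n1, k2, n2 = 15, 100, 30, 200            # two data batches (made up)
batch = (alpha0 + k1 + k2, beta0 + (n1 - k1) + (n2 - k2))

# Sequential: yesterday's posterior is today's prior
a, b = alpha0, beta0
for k, n in [(k1, n1), (k2, n2)]:
    a, b = a + k, b + (n - k)

assert (a, b) == batch == (47, 263)          # identical posteriors
```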

Credible Interval vs Confidence Interval

| | Confidence Interval (Frequentist) | Credible Interval (Bayesian) |
|---|---|---|
| Statement | "95% of CIs from repeated experiments contain \theta" | "P(\theta \in [a,b] \mid \text{data}) = 0.95" |
| What is random | \theta is fixed, the interval is random | The interval is fixed, \theta is random |
| Interpretation | About the procedure's long-run performance | About belief in this particular result |
| Depends on prior? | No | Yes |

If the interviewer asks "What's the difference between a CI and a credible interval?", the core difference is: a CI is a frequentist procedure guarantee, while a credible interval is a direct probability statement about the parameter. The interpretation most people intuitively want is actually the credible interval.
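
The two intervals can be computed side by side for the earlier conversion example (the Wald interval is the standard normal approximation; prior and data as before):

```python
import numpy as np
from scipy import stats

k, n = 15, 100               # 15 conversions in 100 visits
p_hat = k / n                # MLE

# Frequentist: 95% Wald confidence interval
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: 95% credible interval from the Beta(2+k, 8+n-k) posterior
cred = stats.beta.ppf([0.025, 0.975], 2 + k, 8 + n - k)

# The numbers are similar here, but only the credible interval licenses
# the statement "P(theta in [a, b] | data) = 0.95".
```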

Real-World Use Cases

Case 1: Credit Card Fraud Detection — Naive Bayes

A Naive Bayes classifier applies Bayes' theorem directly:

P(\text{fraud} \mid \mathbf{x}) \propto P(\mathbf{x} \mid \text{fraud}) \cdot P(\text{fraud})

"Naive" assumption: features are conditionally independent given the class.

Even though this assumption almost never holds (transaction amount and location are surely correlated), Naive Bayes still performs well in practice, because it only needs the ranking to be right, not the exact posterior probabilities.

Interview follow-up: "Naive Bayes assumes feature independence, but real-world features are almost always correlated. Why does it still work?" Because classification only needs \arg\max_c P(c \mid \mathbf{x}); even when the absolute probabilities are off, the ranking is usually correct.

Case 2: Recommender Systems — Thompson Sampling

Thompson Sampling is a Bayesian bandit algorithm for balancing exploration and exploitation:

  1. For each option (arm), maintain a Beta posterior for its success rate
  2. Sample from each posterior: \tilde{p}_i \sim \text{Beta}(\alpha_i, \beta_i)
  3. Select the arm with the highest sampled value
  4. Observe the outcome, update the posterior

Thompson Sampling naturally balances exploration (trying uncertain options) against exploitation (choosing the known best). Arms with high posterior variance are explored with higher probability.

import numpy as np

class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
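
A short simulation of the loop above, written inline so it runs on its own (the three true success rates are made up); the sampler should concentrate its pulls on the best arm:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.2, 0.5, 0.7])   # made-up Bernoulli success rates
alpha = np.ones(3)                        # Beta posterior parameters per arm
beta = np.ones(3)

for _ in range(5000):
    samples = rng.beta(alpha, beta)       # one draw from each arm's posterior
    arm = int(np.argmax(samples))         # play the arm with the highest draw
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

pulls = alpha + beta - 2                  # observations per arm
# the best arm (index 2) ends up with the large majority of the pulls
```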

Case 3: House Price Prediction — Bayesian Linear Regression

Standard linear regression gives point estimates for coefficients. Bayesian linear regression gives a full posterior distribution for each coefficient:

P(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto P(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \cdot P(\boldsymbol{\beta})

Benefits:

  • Uncertainty quantification: every prediction comes with a credible interval, not just a single number
  • Automatic regularization: the prior on \boldsymbol{\beta} is the regularization
  • Interpretable feature importance: the posterior width reflects each coefficient's uncertainty

A typical interview question: "Your house-price model says this house is worth $500K. How sure are you?" A frequentist can only offer a prediction interval; a Bayesian hands over the posterior predictive distribution directly, which is much easier to communicate.
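
A minimal conjugate sketch under simplifying assumptions (known noise variance, Gaussian prior, toy data of our own): the posterior over the coefficients is Gaussian with a closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 2
X = rng.normal(size=(n, d))
beta_true = np.array([3.0, -1.0])          # made-up coefficients
sigma2, tau2 = 1.0, 10.0                   # known noise variance, prior variance
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), size=n)

# Conjugate posterior over coefficients: N(mu_post, Sigma_post)
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# 95% credible interval per coefficient, read straight off the posterior
sd = np.sqrt(np.diag(Sigma_post))
intervals = np.column_stack([mu_post - 1.96 * sd, mu_post + 1.96 * sd])
```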

When to Use Bayesian in Practice

| Application | Why Bayesian | Method |
|---|---|---|
| A/B testing (small sample) | Gives P(\text{B} > \text{A} \mid \text{data}) directly; more intuitive than a p-value | Beta-Bernoulli model |
| Personalization / Ads | Automatically balances exploration and exploitation | Thompson Sampling |
| Spam filtering | Computes P(\text{spam} \mid \text{words}) for classification | Naive Bayes |
| Forecasting | Injects domain knowledge as a prior | Bayesian structural time series |
| Medical trials | Sequential updating; allows early stopping | Bayesian adaptive designs |
| Calibration | Turns model scores into calibrated probabilities | Platt scaling (MAP) |

Hands-on: Bayesian Inference in Python

Beta-Binomial Updating & Bayesian A/B Test

import numpy as np
from scipy import stats

# === Beta-Binomial Updating ===
alpha_prior, beta_prior = 2, 8  # Prior: ~20% conversion belief
successes, trials = 15, 100

alpha_post = alpha_prior + successes           # 17
beta_post = beta_prior + (trials - successes)  # 93
posterior_mean = alpha_post / (alpha_post + beta_post)  # 0.155
# MLE = 15/100 = 0.15 → posterior pulled toward prior

# 95% Credible Interval
ci = stats.beta.ppf([0.025, 0.975], alpha_post, beta_post)

# === Bayesian A/B Test ===
# Control: 120/1000, Treatment: 145/1000
a_A, b_A = 1 + 120, 1 + 880   # Control posterior
a_B, b_B = 1 + 145, 1 + 855   # Treatment posterior

# Monte Carlo: P(B > A | data)
samples_A = np.random.beta(a_A, b_A, 100000)
samples_B = np.random.beta(a_B, b_B, 100000)
prob_B_better = (samples_B > samples_A).mean()
# prob_B_better ≈ 0.95 → strong evidence Treatment is better
expected_lift = ((samples_B - samples_A) / samples_A).mean()

Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score

# For continuous features (e.g., fraud detection); X, y assumed already loaded
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=5, scoring="f1")

# For text/count features (e.g., spam filtering)
mnb = MultinomialNB(alpha=1.0)  # alpha = Laplace smoothing
scores = cross_val_score(mnb, X_text, y, cv=5, scoring="f1")
# alpha=1.0 corresponds to adding 1 "virtual" count per feature per class

Interview Signals

What interviewers listen for:

  • You can apply Bayes' theorem correctly on computation problems without confusing the prior and the likelihood
  • You understand why the base rate matters and don't draw conclusions from test accuracy alone
  • You can explain the difference between MAP and MLE, and their connection to regularization
  • You know Bayesian methods are not a silver bullet and can compare the two schools objectively
  • You can distinguish credible intervals from confidence intervals

Practice

Flashcards


What is the relationship between MAP estimation and Ridge regression?

MAP with a Gaussian prior on the weights = Ridge regression (L2). The prior precision 1/σ² corresponds to the regularization strength λ. A Laplace prior corresponds instead to Lasso (L1). A flat prior gives MAP = MLE = OLS.


Quiz


A factory's defect rate is 0.1%. A detector has 99% sensitivity and 98% specificity. If a product is flagged as defective, the probability it is actually defective is closest to?
