Bayesian Inference

Interview Context

Bayesian questions show up in interviews in two forms: (1) computation problems that apply Bayes' theorem directly, and (2) open-ended questions about the philosophical differences between Bayesian and frequentist statistics. Prepare for both.

Bayes' Theorem

The foundation of all Bayesian inference:

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

Each term has a specific name:

  • P(A \mid B) — Posterior: what you believe about A after observing B
  • P(B \mid A) — Likelihood: how probable the observed data B is, given A
  • P(A) — Prior: what you believed about A before seeing data
  • P(B) — Evidence (marginal likelihood): a normalizing constant

The denominator is usually computed with the law of total probability:

P(B) = \sum_i P(B \mid A_i) \cdot P(A_i)

Classic Interview Example: Disease Testing

A disease affects 1% of the population. A test has 95% sensitivity (TPR) and 90% specificity (TNR). If a person tests positive, what is the probability they actually have the disease?

P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease}) \cdot P(\text{disease})}{P(+)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.1085} \approx 0.088

Only 8.8%, even though the test is "95% accurate"!

Base Rate Fallacy

This is the most common Bayesian trap in interviews. Intuition says a 95%-accurate test should be reliable, but when the disease itself is rare (a low prior), the flood of false positives swamps the true positives. Always ask about the base rate first.

Step-by-Step Framework for Bayes Problems

When you hit a Bayes computation problem in an interview, this framework keeps you from making mistakes:

  1. Define events clearly: A = has disease, B = tests positive
  2. List what you know: P(A), P(B \mid A), P(B \mid A^c)
  3. Compute the denominator: P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)
  4. Apply Bayes' theorem: plug in and compute
  5. Sanity check: if the base rate is low, the posterior usually will not be high either
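
As a sanity check, the five steps can be sketched as a small helper (the function name `bayes_posterior` is ours; the numbers reuse the disease-testing example):

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) for a binary partition {A, not A}, following steps 1-4."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)  # step 3
    return likelihood * prior / evidence                                # step 4

# Disease-testing numbers: P(A) = 0.01, P(B|A) = 0.95, P(B|A^c) = 0.10
posterior = bayes_posterior(prior=0.01, likelihood=0.95, likelihood_given_not=0.10)
# ~0.088, matching the worked example; the low base rate keeps the posterior low (step 5)
```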

The Bayesian Framework

In the Bayesian approach, parameters are treated as random variables with distributions, not fixed unknown constants:

P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}

Or more concisely:

\text{Posterior} \propto \text{Likelihood} \times \text{Prior}

The posterior condenses everything we know about the parameter after observing the data. As data accumulates, the posterior becomes more and more concentrated: the data gradually overwhelms the prior.

Prior, Likelihood, and Posterior

Choosing a Prior

The prior encodes your beliefs before seeing data. Common choices:

| Prior Type | Description | When to Use |
|---|---|---|
| Uninformative (flat) | P(\theta) \propto 1 (no preference) | No prior knowledge at all |
| Weakly informative | Regularizes extremes, e.g., \theta \sim N(0, 10) | A rough range is known, but uncertain |
| Informative | Based on domain knowledge or past data | Strong prior knowledge (e.g., historical data) |
| Jeffreys prior | P(\theta) \propto \sqrt{I(\theta)}, invariant under reparameterization | When an "objective" Bayesian analysis is wanted |

How to Answer the Prior-Choice Question in an Interview

A good answer: "I would set a weakly informative prior based on domain knowledge, then run a sensitivity analysis to see how much the results depend on the prior. With enough data the prior's influence is small, which is itself a check on the robustness of the analysis."

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distribution family. The benefit: the posterior's parameters can be written down directly, with no numerical methods needed:

| Likelihood | Conjugate Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli / Binomial | Beta(\alpha, \beta) | Beta(\alpha + k, \beta + n - k) | CTR, conversion rate |
| Poisson | Gamma(\alpha, \beta) | Gamma(\alpha + \sum x_i, \beta + n) | Event counts |
| Normal (known \sigma) | Normal(\mu_0, \sigma_0^2) | Normal (weighted mean) | Continuous measurements |
| Exponential | Gamma(\alpha, \beta) | Gamma(\alpha + n, \beta + \sum x_i) | Waiting times |
| Multinomial | Dirichlet(\boldsymbol{\alpha}) | Dirichlet(\boldsymbol{\alpha} + \mathbf{k}) | Category counts (NLP) |

Beta-Binomial: The Most Important Example

You want to estimate a website's conversion rate.

Prior: Beta(2, 8), encoding a rough belief that the conversion rate is around 20% (mean \frac{2}{2+8} = 0.2)

Data: 15 conversions out of 100 visits

Posterior:

\text{Beta}(2 + 15,\; 8 + 85) = \text{Beta}(17, 93)

\text{Posterior mean} = \frac{17}{17 + 93} \approx 0.155

The MLE is 15/100 = 0.15; the posterior mean is pulled toward the prior's 0.20. The more data you have, the closer the posterior gets to the MLE.

Intuition: Beta(\alpha, \beta) behaves as if you had already seen \alpha - 1 successes and \beta - 1 failures. The prior Beta(2, 8) is like carrying 1 success + 7 failures of "virtual observations". After 100 real data points, those virtual observations barely matter.
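
The claim that more data pulls the posterior toward the MLE is easy to check numerically (a sketch with the same Beta(2, 8) prior and a conversion rate held at 15%):

```python
alpha0, beta0 = 2, 8                     # Beta(2, 8) prior: 1 + 7 virtual observations
means = []
for trials in [100, 1000, 10000]:
    k = int(0.15 * trials)               # conversion rate held at 15%
    means.append((alpha0 + k) / (alpha0 + beta0 + trials))
# means: ~0.1545, ~0.1505, ~0.1500 -> sliding toward the MLE of 0.15
```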

Posterior as a Compromise

The posterior mean always lies between the prior mean and the MLE:

\text{Posterior mean} = w \cdot \text{Prior mean} + (1 - w) \cdot \text{MLE}

where w = \frac{\text{prior precision}}{\text{prior precision} + \text{data precision}}.

The more data (the higher the data precision), the closer w gets to 0 and the posterior to the MLE. The prior only matters noticeably in small samples.
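
For the Normal likelihood with known variance this compromise is exact; a minimal sketch with made-up numbers (prior N(0, 1), observation variance 4):

```python
import numpy as np

mu0, tau2 = 0.0, 1.0       # prior: theta ~ N(mu0, tau2)
sigma2 = 4.0               # known observation variance
rng = np.random.default_rng(0)
x = rng.normal(2.0, np.sqrt(sigma2), size=50)   # data with true mean 2.0

prior_prec = 1 / tau2
data_prec = len(x) / sigma2
w = prior_prec / (prior_prec + data_prec)       # weight on the prior mean

post_mean = w * mu0 + (1 - w) * x.mean()        # the compromise formula
# equivalent closed form: (mu0 / tau2 + x.sum() / sigma2) / (prior_prec + data_prec)
# post_mean lies between mu0 and the MLE x.mean()
```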

MAP vs MLE

Two key point estimation methods:

Maximum Likelihood Estimation (MLE) — find the parameter that maximizes the likelihood:

\hat{\theta}_{\text{MLE}} = \arg\max_\theta \; P(\text{data} \mid \theta)

Maximum A Posteriori (MAP) — find the parameter that maximizes the posterior:

\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; P(\text{data} \mid \theta) \cdot P(\theta)

The Key Connection to Regularization

| Prior on weights | MAP = | Regularization term |
|---|---|---|
| Gaussian: w \sim N(0, \sigma^2) | Ridge regression | \lambda \sum w_j^2 (L2) |
| Laplace: w \sim \text{Laplace}(0, b) | Lasso regression | \lambda \sum \lvert w_j \rvert (L1) |
| Flat: P(w) \propto 1 | OLS (no regularization) | None |

Connecting Bayesian and Regularization

This connection is a reliable bonus point in interviews: L2 regularization = Gaussian prior on the weights, L1 = Laplace prior. The prior's precision (1/\sigma^2) corresponds to the regularization strength \lambda. Once you see this, you can explain in Bayesian language why regularization prevents overfitting: the prior pulls the weights toward zero.
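
The correspondence can be checked numerically: with Gaussian noise N(0, \sigma^2) and prior w \sim N(0, \tau^2 I), the MAP estimate is exactly the ridge solution with \lambda = \sigma^2 / \tau^2. A sketch on synthetic data of our own:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])             # made-up coefficients
sigma2, tau2 = 1.0, 0.5                         # noise variance, prior variance
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), size=200)

lam = sigma2 / tau2                             # prior precision ratio -> ridge strength
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # = ridge solution
w_ols = np.linalg.solve(X.T @ X, X.T @ y)       # flat prior: MAP = MLE = OLS
# w_map is shrunk toward zero relative to w_ols, as the Gaussian prior demands
```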

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown constants | Random variables with distributions |
| Probability | Long-run frequency | Degree of belief |
| Prior information | Not used | Explicitly incorporated |
| Result | Point estimate + confidence interval | Full posterior distribution |
| Interval interpretation | "95% of CIs from repeated experiments contain the true value" | "95% probability the parameter is in this interval" |
| Small sample | Can be unreliable | Prior helps regularize |
| Computation | Usually closed-form or simple | May need MCMC, variational inference |

Neither school is absolutely better or worse; it depends on the setting:

  • Bayesian fits better: small data, strong prior knowledge, quantifying full uncertainty, sequential updating, or when an intuitive probability statement is needed
  • Frequentist fits better: large data, computational simplicity, regulatory settings (the FDA requires well-understood procedures), or when you don't want to argue over the choice of prior

Bayesian Updating

One of Bayesian inference's most elegant properties — the posterior from one analysis becomes the prior for the next:

P(\theta \mid D_1, D_2) \propto P(D_2 \mid \theta) \cdot \underbrace{P(\theta \mid D_1)}_{\text{previous posterior = new prior}}

There is no need to reprocess historical data: when new data arrives, just update the posterior. This makes Bayesian methods a natural fit for online learning and sequential decision-making.
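
A quick sketch of this property with the Beta-Binomial model: processing two batches sequentially gives exactly the same posterior as processing them together (the batch sizes are made up):

```python
# Batch: all the data at once
alpha0, beta0 = 2, 8                         # prior
k1, n1, k2, n2 = 15, 100, 30, 200            # two data batches (made up)
batch = (alpha0 + k1 + k2, beta0 + (n1 - k1) + (n2 - k2))

# Sequential: yesterday's posterior is today's prior
a, b = alpha0, beta0
for k, n in [(k1, n1), (k2, n2)]:
    a, b = a + k, b + (n - k)

assert (a, b) == batch == (47, 263)          # identical posteriors
```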

Credible Interval vs Confidence Interval

| | Confidence Interval (Frequentist) | Credible Interval (Bayesian) |
|---|---|---|
| Statement | "95% of CIs from repeated experiments contain \theta" | "P(\theta \in [a,b] \mid \text{data}) = 0.95" |
| What is random | \theta is fixed, the interval is random | The interval is fixed, \theta is random |
| Interpretation | About the procedure's long-run performance | About belief in this particular result |
| Depends on prior? | No | Yes |

If the interviewer asks "What's the difference between a CI and a credible interval?", the core difference is: a CI is a frequentist procedure guarantee, while a credible interval is a direct probability statement about the parameter. The interpretation most people intuitively want is actually the credible interval.
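
The two intervals can be computed side by side for the earlier conversion example (the Wald interval is the standard normal approximation; prior and data as before):

```python
import numpy as np
from scipy import stats

k, n = 15, 100               # 15 conversions in 100 visits
p_hat = k / n                # MLE

# Frequentist: 95% Wald confidence interval
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: 95% credible interval from the Beta(2+k, 8+n-k) posterior
cred = stats.beta.ppf([0.025, 0.975], 2 + k, 8 + n - k)

# The numbers are similar here, but only the credible interval licenses
# the statement "P(theta in [a, b] | data) = 0.95".
```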

Real-World Use Cases

Case 1: Credit Card Fraud Detection — Naive Bayes

A Naive Bayes classifier applies Bayes' theorem directly:

P(\text{fraud} \mid \mathbf{x}) \propto P(\mathbf{x} \mid \text{fraud}) \cdot P(\text{fraud})

"Naive" assumption: features are conditionally independent given the class.

Even though this assumption almost never holds (transaction amount and location are surely correlated), Naive Bayes still performs well in practice, because it only needs the ranking to be right, not the exact posterior probabilities.

Interview follow-up: "Naive Bayes assumes feature independence, but real-world features are almost always correlated. Why does it still work?" Because classification only needs \arg\max_c P(c \mid \mathbf{x}); even when the absolute probabilities are off, the ranking is usually correct.

Case 2: Recommender Systems — Thompson Sampling

Thompson Sampling is a Bayesian bandit algorithm for balancing exploration and exploitation:

  1. For each option (arm), maintain a Beta posterior for its success rate
  2. Sample from each posterior: \tilde{p}_i \sim \text{Beta}(\alpha_i, \beta_i)
  3. Select the arm with the highest sampled value
  4. Observe the outcome, update the posterior

Thompson Sampling naturally balances exploration (trying uncertain options) against exploitation (choosing the known best). Arms with high posterior variance are explored with higher probability.

import numpy as np

class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
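
A short simulation of the loop above, written inline so it runs on its own (the three true success rates are made up); the sampler should concentrate its pulls on the best arm:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.2, 0.5, 0.7])   # made-up Bernoulli success rates
alpha = np.ones(3)                        # Beta posterior parameters per arm
beta = np.ones(3)

for _ in range(5000):
    samples = rng.beta(alpha, beta)       # one draw from each arm's posterior
    arm = int(np.argmax(samples))         # play the arm with the highest draw
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

pulls = alpha + beta - 2                  # observations per arm
# the best arm (index 2) ends up with the large majority of the pulls
```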

Case 3: House Price Prediction — Bayesian Linear Regression

Standard linear regression gives point estimates for coefficients. Bayesian linear regression gives a full posterior distribution for each coefficient:

P(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto P(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \cdot P(\boldsymbol{\beta})

Benefits:

  • Uncertainty quantification: every prediction comes with a credible interval, not just a single number
  • Automatic regularization: the prior on \boldsymbol{\beta} is the regularization
  • Interpretable feature importance: the posterior width reflects each coefficient's uncertainty

A typical interview question: "Your house-price model says this house is worth $500K. How sure are you?" A frequentist can only offer a prediction interval; a Bayesian hands over the posterior predictive distribution directly, which is much easier to communicate.
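
A minimal conjugate sketch under simplifying assumptions (known noise variance, Gaussian prior, toy data of our own): the posterior over the coefficients is Gaussian with a closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 2
X = rng.normal(size=(n, d))
beta_true = np.array([3.0, -1.0])          # made-up coefficients
sigma2, tau2 = 1.0, 10.0                   # known noise variance, prior variance
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), size=n)

# Conjugate posterior over coefficients: N(mu_post, Sigma_post)
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# 95% credible interval per coefficient, read straight off the posterior
sd = np.sqrt(np.diag(Sigma_post))
intervals = np.column_stack([mu_post - 1.96 * sd, mu_post + 1.96 * sd])
```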

When to Use Bayesian in Practice

| Application | Why Bayesian | Method |
|---|---|---|
| A/B testing (small sample) | Gives P(\text{B} > \text{A} \mid \text{data}) directly; more intuitive than a p-value | Beta-Bernoulli model |
| Personalization / Ads | Automatically balances exploration and exploitation | Thompson Sampling |
| Spam filtering | Computes P(\text{spam} \mid \text{words}) for classification | Naive Bayes |
| Forecasting | Injects domain knowledge as a prior | Bayesian structural time series |
| Medical trials | Sequential updating; allows early stopping | Bayesian adaptive designs |
| Calibration | Turns model scores into calibrated probabilities | Platt scaling (MAP) |

Hands-on: Bayesian Inference in Python

Beta-Binomial Updating & Bayesian A/B Test

import numpy as np
from scipy import stats

# === Beta-Binomial Updating ===
alpha_prior, beta_prior = 2, 8  # Prior: ~20% conversion belief
successes, trials = 15, 100

alpha_post = alpha_prior + successes           # 17
beta_post = beta_prior + (trials - successes)  # 93
posterior_mean = alpha_post / (alpha_post + beta_post)  # 0.155
# MLE = 15/100 = 0.15 → posterior pulled toward prior

# 95% Credible Interval
ci = stats.beta.ppf([0.025, 0.975], alpha_post, beta_post)

# === Bayesian A/B Test ===
# Control: 120/1000, Treatment: 145/1000
a_A, b_A = 1 + 120, 1 + 880   # Control posterior
a_B, b_B = 1 + 145, 1 + 855   # Treatment posterior

# Monte Carlo: P(B > A | data)
samples_A = np.random.beta(a_A, b_A, 100000)
samples_B = np.random.beta(a_B, b_B, 100000)
prob_B_better = (samples_B > samples_A).mean()
# prob_B_better ≈ 0.95 → strong evidence Treatment is better
expected_lift = ((samples_B - samples_A) / samples_A).mean()

Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score

# For continuous features (e.g., fraud detection); X, y assumed already loaded
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=5, scoring="f1")

# For text/count features (e.g., spam filtering)
mnb = MultinomialNB(alpha=1.0)  # alpha = Laplace smoothing
scores = cross_val_score(mnb, X_text, y, cv=5, scoring="f1")
# alpha=1.0 corresponds to adding 1 "virtual" count per feature per class

Interview Signals

What interviewers listen for:

  • You can apply Bayes' theorem correctly on computation problems without confusing the prior and the likelihood
  • You understand why the base rate matters and don't draw conclusions from test accuracy alone
  • You can explain the difference between MAP and MLE, and their connection to regularization
  • You know Bayesian methods are not a silver bullet and can compare the two schools objectively
  • You can distinguish credible intervals from confidence intervals

Practice

Flashcards


What is the relationship between MAP estimation and Ridge regression?

MAP with a Gaussian prior on the weights = Ridge regression (L2). The prior precision 1/σ² corresponds to the regularization strength λ. A Laplace prior corresponds instead to Lasso (L1). A flat prior gives MAP = MLE = OLS.


Quiz


A factory's defect rate is 0.1%. A detector has 99% sensitivity and 98% specificity. If a product is flagged as defective, the probability it is actually defective is closest to?
