Bandits & Exploration
Interview Context
Multi-armed bandits are the most practical RL concept in DS interviews — directly applicable to A/B test optimization, ad serving, recommendation, and dynamic pricing. Unlike full RL, which must model state transitions, bandits are stateless → easier to understand and implement. Common interview questions: "How would you speed up an A/B test?" or "How do you balance exploration and exploitation?"
What You Should Understand
- Describe the multi-armed bandit problem and its relationship to A/B testing
- Understand the three main strategies: $\epsilon$-greedy, UCB, Thompson Sampling
- Know how contextual bandits incorporate user/item features
- Understand the concept of regret and the regret bounds of different algorithms
- Apply bandits in recommendation, advertising, and A/B testing scenarios
The Multi-Armed Bandit Problem
Imagine a row of slot machines (one-armed bandits). Each machine has an unknown reward probability. You have $T$ total pulls. Goal: maximize total reward.
The fundamental tradeoff:
- Exploit: Pull the machine that looks best so far → maximize short-term
- Explore: Try other machines that might be better → gather information
Formal Definition
- $K$ arms, each with an unknown reward distribution $\nu_a$ and mean $\mu_a$
- At each round $t = 1, \dots, T$: choose arm $a_t$, receive reward $r_t \sim \nu_{a_t}$
- Goal: maximize $\mathbb{E}\left[\sum_{t=1}^{T} r_t\right]$, equivalently minimize regret
Bandit vs A/B Test vs Full RL
| Aspect | A/B Test | Multi-Armed Bandit | Full RL |
|---|---|---|---|
| Exploration | Fixed ratio (50/50) for entire duration | Adaptive — shift traffic to better arm | Adaptive |
| State | None | None (stateless) | Has state transitions |
| When to decide | After experiment ends | During experiment — continuously update | Continuously |
| Regret | High (50% of traffic goes to the worse arm throughout) | Low (quickly converges to the best arm) | Depends on problem |
| Use case | Definitive causal effect estimation | Optimize while learning | Sequential decisions with states |
Bandit ≠ A/B Test Replacement
Bandits optimize allocation (minimize regret); A/B tests estimate causal effects (statistical inference). After the experiment ends, a bandit cannot give you a clean effect-size estimate with a confidence interval, because the allocation is not balanced. If you need rigorous causal inference → A/B test. If you need to quickly allocate traffic to the best option → bandit.
Epsilon-Greedy
The simplest strategy: with probability $\epsilon$ explore, otherwise exploit:

$$a_t = \begin{cases} \text{random arm} & \text{with probability } \epsilon \\ \arg\max_a \hat{\mu}_a & \text{with probability } 1 - \epsilon \end{cases}$$

where $\hat{\mu}_a$ is the estimated mean reward of arm $a$.
| Pros | Cons |
|---|---|
| Simple to implement | Explores uniformly — even obviously bad arms get pulled |
| Easy to understand | $\epsilon$ is fixed — doesn't adapt over time |
| Reasonable baseline | Linear regret — never fully converges |
Decaying $\epsilon$
Explore more early (high $\epsilon$), exploit more later ($\epsilon \to 0$). But the decay schedule needs tuning — no universally optimal schedule exists.
import numpy as np
class EpsilonGreedy:
def __init__(self, n_arms, epsilon=0.1):
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
self.epsilon = epsilon
def select_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(len(self.values))
return self.values.argmax()
def update(self, arm, reward):
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n # incremental mean
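The decaying schedule can be sketched as a small variant of the class above, here with $\epsilon_t = \min(1, c/t)$, one common choice (the constant `c` and the schedule itself are illustrative assumptions, not the unique answer):

```python
import numpy as np

class DecayingEpsilonGreedy:
    """Epsilon-greedy with epsilon_t = min(1, c / t): explore heavily
    early, exploit later. c is a tuning knob (c=5 is an assumption)."""
    def __init__(self, n_arms, c=5.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c
        self.t = 0

    def select_arm(self):
        self.t += 1
        epsilon = min(1.0, self.c / self.t)  # decays toward 0
        if np.random.random() < epsilon:
            return np.random.randint(len(self.values))
        return int(self.values.argmax())

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Early rounds are fully random (while $c/t \ge 1$); later the algorithm is almost purely greedy, so per-round regret shrinks instead of staying at the fixed-$\epsilon$ floor.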
Upper Confidence Bound (UCB)
The Principle: Optimism in the Face of Uncertainty
UCB selects the arm with the highest upper confidence bound — the estimated reward plus an exploration bonus:

$$a_t = \arg\max_a \left[ \hat{\mu}_a + c \sqrt{\frac{\ln t}{n_a}} \right]$$
| Term | Meaning |
|---|---|
| $\hat{\mu}_a$ | Estimated mean reward (exploitation) |
| $c\sqrt{\ln t / n_a}$ | Exploration bonus — large when $n_a$ is small (rarely pulled arms get a bigger bonus) |
| $c$ | Controls exploration strength |
Intuition: "If an arm has been pulled only a few times, its true mean might be higher than the estimate — give it the benefit of the doubt."
UCB Properties
- No random exploration — deterministically picks the arm with highest UCB
- Arms that haven't been pulled much get inflated scores → automatically explored
- As $n_a$ grows → bonus shrinks → converges to the best arm
- Logarithmic regret — much better than $\epsilon$-greedy's linear regret
class UCB:
def __init__(self, n_arms, c=2.0):
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
self.c = c
self.t = 0
def select_arm(self):
self.t += 1
# Pull each arm once first
for arm in range(len(self.values)):
if self.counts[arm] == 0:
return arm
ucb_values = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
return ucb_values.argmax()
def update(self, arm, reward):
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
Thompson Sampling
The Bayesian Approach
Maintain a posterior distribution for each arm's reward probability. At each round:
- Sample from each arm's posterior: $\tilde{\mu}_a \sim p(\mu_a \mid \text{data})$
- Select the arm with the highest sampled value: $a_t = \arg\max_a \tilde{\mu}_a$
- Observe the reward, update that arm's posterior
Beta-Bernoulli Bandit
For binary rewards (click/no-click), use a Beta posterior:
- Prior: $\text{Beta}(1, 1)$ (uniform — no prior belief)
- After observing a success: $\alpha \leftarrow \alpha + 1$
- After observing a failure: $\beta \leftarrow \beta + 1$
- Posterior mean: $\alpha / (\alpha + \beta)$
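A quick worked example of these update rules (the reward sequence is made up for illustration): an arm that starts at Beta(1, 1) and sees 3 clicks in 10 impressions ends at Beta(4, 8):

```python
# Beta-Bernoulli updates: alpha += 1 per success, beta += 1 per failure
alpha, beta = 1, 1                         # Beta(1, 1) uniform prior
rewards = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # 3 clicks in 10 impressions
for r in rewards:
    if r:
        alpha += 1
    else:
        beta += 1

print(alpha, beta)             # 4 8  → posterior Beta(4, 8)
print(alpha / (alpha + beta))  # posterior mean ≈ 0.333
```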
Why Thompson Sampling Works
| Property | Explanation |
|---|---|
| Natural exploration | Arms with high uncertainty (wide posterior) → sometimes sample high → get explored |
| Automatic convergence | As data grows → posterior narrows → samples concentrate near the true mean → exploit |
| No hyperparameters | No $\epsilon$ or $c$ to tune — the posterior automatically calibrates exploration |
| Optimal regret | $O(\log T)$ — matches UCB's regret bound |
Intuition: Thompson Sampling's exploration is probability matching — the probability of choosing an arm ≈ its posterior probability of being the best arm. High uncertainty → a chance of sampling a high value → gets explored. Low uncertainty → samples stay near the true mean → exploit if best.
class ThompsonSampling:
def __init__(self, n_arms):
self.alpha = np.ones(n_arms) # successes + 1
self.beta = np.ones(n_arms) # failures + 1
def select_arm(self):
samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
return np.argmax(samples)
def update(self, arm, reward):
if reward:
self.alpha[arm] += 1
else:
self.beta[arm] += 1
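The probability-matching behavior can be verified by Monte Carlo: the fraction of joint posterior draws in which each arm's sample is the largest approximates $P(\text{arm is best})$, which is exactly the rate at which Thompson Sampling would pick it. The Beta parameters below are illustrative:

```python
import numpy as np

np.random.seed(42)
# Posteriors after some data: one narrow low arm, one medium arm,
# one totally uncertain arm (illustrative counts)
alphas = np.array([20.0, 5.0, 2.0])
betas  = np.array([80.0, 15.0, 2.0])

# Draw many joint posterior samples; count how often each arm wins
draws = np.random.beta(alphas, betas, size=(100_000, 3))
p_best = np.bincount(draws.argmax(axis=1), minlength=3) / len(draws)
print(p_best)  # ≈ the probability Thompson Sampling selects each arm
```

Note how the wide Beta(2, 2) arm wins most often despite its low data count; this is exploration driven purely by uncertainty.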
Thompson Sampling in Interviews
Thompson Sampling is the bandit strategy interviewers ask about most, because it connects Bayesian inference (the Beta posterior) with exploration/exploitation. Being able to state the Beta-Bernoulli update rule and explain why uncertainty automatically drives exploration is a strong signal.
Algorithm Comparison
| Algorithm | Exploration | Regret | Hyperparams | Complexity | Best For |
|---|---|---|---|---|---|
| $\epsilon$-greedy | Uniform random | $O(T)$ linear | $\epsilon$ | $O(1)$ | Simple baseline |
| UCB | Optimistic (bonus) | $O(\log T)$ | $c$ | $O(1)$ | Deterministic, theoretical guarantees |
| Thompson | Bayesian (sample) | $O(\log T)$ | None | $O(K)$ per sample | Industry standard, no tuning |
| EXP3 | Randomized (exponential weights) | $O(\sqrt{T})$ | $\gamma$ | — | Adversarial setting, non-stochastic rewards |
Why Thompson Sampling Wins in Practice
Empirically, Thompson Sampling usually outperforms UCB, because its exploration is more targeted (probability matching vs an optimistic upper bound), and it requires no hyperparameter tuning. Google, Microsoft, Netflix, and others run Thompson Sampling in production.
Regret
Definition
Regret = the total reward you missed by not always choosing the best arm:

$$R(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right] = \sum_a \Delta_a \, \mathbb{E}[n_a(T)]$$

where $\mu^* = \max_a \mu_a$ (the best arm's mean) and $\Delta_a = \mu^* - \mu_a$ (the gap of arm $a$).
Regret Intuition
| Strategy | Regret Growth | Intuition |
|---|---|---|
| Always random | $O(T)$ linear | Never learns — constant per-step regret |
| $\epsilon$-greedy (fixed $\epsilon$) | $O(\epsilon T)$ linear | Always explores an $\epsilon$ fraction randomly |
| UCB / Thompson | $O(\log T)$ sublinear | Exploration decreases over time → per-step regret → 0 |
| Optimal | $\Omega(\log T)$ | Theoretical lower bound — no algorithm can do better |
Sublinear regret means learning — per-round regret shrinks over time. Linear regret means nothing was learned.
Contextual Bandits
The Extension
Standard bandits: each arm's reward comes from a fixed distribution. But in recommendation and advertising, reward depends on context (user features, item features).
| Standard Bandit | Contextual Bandit |
|---|---|
| Reward $r \sim \nu_a$ (fixed per arm) | Reward $r = f(x, a)$ (depends on context $x$) |
| Same treatment for all users | Personalized — different actions for different users |
| Few arms (A/B/C) | Potentially many arms with shared features |
LinUCB
Model the expected reward as a linear function of the context:

$$\mathbb{E}[r \mid x, a] = x^\top \theta_a$$

UCB version, with $A_a = X_a^\top X_a + \lambda I$:

$$a_t = \arg\max_a \left[ x^\top \hat{\theta}_a + \alpha \sqrt{x^\top A_a^{-1} x} \right]$$

$\sqrt{x^\top A_a^{-1} x}$ captures the uncertainty in $\hat{\theta}_a$ — contexts where we have less data get a larger exploration bonus.
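A minimal LinUCB sketch (per-arm ridge regression with $\lambda = 1$; the `alpha` exploration strength and its default are assumptions):

```python
import numpy as np

class LinUCB:
    """Per-arm ridge regression: A_a = X^T X + I, b_a = X^T y.
    Score = x·theta_hat + alpha * sqrt(x^T A^{-1} x)."""
    def __init__(self, n_arms, context_dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(context_dim) for _ in range(n_arms)]
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_arm(self, context):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta_hat = A_inv @ b                     # ridge estimate
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)
            scores.append(context @ theta_hat + bonus)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context
```

Inverting $A_a$ on every call is fine for a sketch; a production version would maintain $A_a^{-1}$ incrementally (e.g. via the Sherman-Morrison update).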
Contextual Thompson Sampling
Thompson Sampling extends to the contextual setting the same way — maintain a posterior over $\theta_a$ (e.g., a multivariate Gaussian), sample $\tilde{\theta}_a$ each round, and choose the arm with the largest $x^\top \tilde{\theta}_a$.
class ContextualThompsonSampling:
def __init__(self, n_arms, context_dim, lambda_=1.0):
self.n_arms = n_arms
self.d = context_dim
# Per-arm: B = X^T X + λI, f = X^T y
self.B = [lambda_ * np.eye(context_dim) for _ in range(n_arms)]
self.f = [np.zeros(context_dim) for _ in range(n_arms)]
def select_arm(self, context):
samples = []
for a in range(self.n_arms):
theta_hat = np.linalg.solve(self.B[a], self.f[a])
cov = np.linalg.inv(self.B[a])
theta_sample = np.random.multivariate_normal(theta_hat, cov)
samples.append(context @ theta_sample)
return np.argmax(samples)
def update(self, arm, context, reward):
self.B[arm] += np.outer(context, context)
self.f[arm] += reward * context
Bandits vs A/B Testing
When to Use Which
| Need | Use | Why |
|---|---|---|
| Causal effect estimation with CI | A/B test | Equal allocation → unbiased estimate, valid statistical inference |
| Quick optimization — find and use the best option | Bandit | Adaptively shift traffic to winner → less regret |
| Multiple options (> 2-3 treatments) | Bandit | A/B test with 10 arms → 10% traffic per arm → underpowered |
| Personalization (different best option per user) | Contextual bandit | One-size-fits-all A/B test misses personalization opportunity |
| Continuous deployment (no clear end) | Bandit | A/B test has fixed duration; bandits operate continuously |
Hybrid: A/B Test + Bandit
| Phase | Approach | Goal |
|---|---|---|
| Phase 1 | Pure A/B test (equal split) | Estimate causal effects, validate metrics |
| Phase 2 | Switch to bandit | Optimize allocation, minimize opportunity cost |
Or use a Bayesian A/B test — Thompson Sampling's posteriors naturally provide $P(\text{variant is best})$ → inference and optimization at the same time.
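A sketch of how that posterior probability falls out of Beta posteriors for a two-variant test (the conversion counts are illustrative):

```python
import numpy as np

np.random.seed(0)
# Conversions / trials for each variant (illustrative numbers)
a_succ, a_trials = 120, 1000
b_succ, b_trials = 150, 1000

# Beta(1, 1) prior + observed data → posterior draws
a_draws = np.random.beta(1 + a_succ, 1 + a_trials - a_succ, 100_000)
b_draws = np.random.beta(1 + b_succ, 1 + b_trials - b_succ, 100_000)

# P(B is best) = fraction of joint draws where B's rate exceeds A's
p_b_best = (b_draws > a_draws).mean()
print(f"P(B is best) ≈ {p_b_best:.3f}")
```

The same posteriors drive the Thompson allocation, so the inference comes for free while traffic shifts toward the winner.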
Real-World Use Cases
Case 1: Ad Serving — Ad Creative Selection
You have 5 different ad creatives to show to users. Which one has the highest CTR?
| Approach | How | Tradeoff |
|---|---|---|
| A/B test | Allocate 20% of traffic to each creative, run for 2 weeks | Accurate but slow; 80% of traffic goes to non-optimal creatives |
| $\epsilon$-greedy | 90% of traffic to the best-so-far, 10% random exploration | Simple but wastes exploration on obviously bad creatives |
| Thompson Sampling | Sample from Beta posteriors, pick highest sample | Best: natural exploration, fast convergence, no tuning |
# Ad creative Thompson Sampling
class AdBandit:
def __init__(self, creative_ids):
self.creatives = creative_ids
self.alpha = {c: 1 for c in creative_ids} # clicks + 1
self.beta = {c: 1 for c in creative_ids} # no-clicks + 1
def select_creative(self):
samples = {c: np.random.beta(self.alpha[c], self.beta[c]) for c in self.creatives}
return max(samples, key=samples.get)
def record(self, creative, clicked):
if clicked:
self.alpha[creative] += 1
else:
self.beta[creative] += 1
Interview follow-up: "What if there are 10K ads and each ad doesn't get enough data?" → Contextual bandit — share information across ads with similar features.
Case 2: Recommender Systems — Explore-Exploit for Cold Start
A new item has no interaction data → the recommender never recommends it → it never collects data → the cold-start loop.
Bandits solve this naturally: a new item's posterior uncertainty is large → Thompson Sampling sometimes samples a high value → the item gets recommended → data comes in → the posterior converges.
| Item Type | Bandit Behavior |
|---|---|
| Popular item (known good) | Posterior narrow, high mean → almost always selected |
| Known bad item | Posterior narrow, low mean → rarely selected |
| New item (uncertain) | Posterior wide → sometimes samples high → gets explored |
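This behavior is easy to demonstrate: sample all three posteriors jointly and count how often each item wins (the counts below are illustrative):

```python
import numpy as np

np.random.seed(1)
# Posteriors: popular item (narrow, ~10% CTR), known bad (narrow, ~2%),
# brand-new item (Beta(1, 1): totally uncertain). Counts are illustrative.
items = {"popular": (100, 900), "bad": (20, 980), "new": (1, 1)}

wins = dict.fromkeys(items, 0)
for _ in range(10_000):
    samples = {k: np.random.beta(a, b) for k, (a, b) in items.items()}
    wins[max(samples, key=samples.get)] += 1

print(wins)  # the new item wins a large share despite having no track record
```

As the new item accumulates impressions its posterior narrows, and its selection rate converges to wherever its true CTR actually sits.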
Case 3: Dynamic Pricing — Price Optimization
You want to find the best price for a product. Each price is an arm:
| Bandit Component | In Pricing |
|---|---|
| Arms | Discrete price points, e.g. \$12.99, \$19.99 |
| Reward | Revenue = price × conversion_rate(price) |
| Challenge | Conversion rate and price have a monotone relationship → the arms are not truly independent |
Use a contextual bandit: price is the action, context = user features + time + demand signals → personalized pricing.
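A sketch of the non-contextual version: each price point is an arm with a Beta posterior over conversion, and arms are ranked by sampled conversion × price (the price points and the demand curve are assumptions for illustration):

```python
import numpy as np

np.random.seed(7)
prices = [12.99, 19.99, 29.99]   # discrete price arms (illustrative)
alpha = np.ones(len(prices))     # conversions + 1
beta = np.ones(len(prices))      # non-conversions + 1

def select_price():
    # Sample a conversion rate per price, rank by expected revenue
    samples = np.random.beta(alpha, beta)
    return int(np.argmax(samples * np.array(prices)))

def record(i, converted):
    if converted:
        alpha[i] += 1
    else:
        beta[i] += 1

# Simulate: true conversion falls with price (assumed demand curve)
true_conv = [0.30, 0.18, 0.08]
for _ in range(5000):
    i = select_price()
    record(i, np.random.random() < true_conv[i])
```

With this demand curve the expected revenues are roughly 3.90, 3.60, and 2.40, so traffic concentrates on the lowest price while the others keep a small exploratory share.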
Ethics of Pricing Experiments
Showing different users different prices at the same time raises ethical concerns (price discrimination). Proactively mentioning this in an interview is a plus. Mitigations: (1) experiment only on new users, (2) vary prices within an acceptable range, (3) keep the pricing policy transparent.
Case 4: Feature / UI Testing
You want to test 10 different homepage layouts. An A/B test would split traffic 10 ways → only 10% each → underpowered.
Bandit approach: Thompson Sampling quickly identifies the top 2-3 layouts → allocates most traffic there → keeps exploring the others with less traffic → finds the best layout faster with less wasted traffic.
Hands-on: Bandits in Python
Complete Simulation
import numpy as np
def simulate_bandit(bandit, true_probs, n_rounds=10000):
"""Simulate a bandit algorithm and return cumulative regret."""
best_prob = max(true_probs)
regret = 0
regrets = []
for t in range(n_rounds):
arm = bandit.select_arm()
reward = np.random.binomial(1, true_probs[arm])
bandit.update(arm, reward)
regret += best_prob - true_probs[arm]
regrets.append(regret)
return regrets
# True click probabilities (unknown to the algorithm)
true_probs = [0.05, 0.08, 0.12, 0.10, 0.03]
# Compare algorithms
eg = EpsilonGreedy(5, epsilon=0.1)
ucb = UCB(5, c=2.0)
ts = ThompsonSampling(5)
regrets_eg = simulate_bandit(eg, true_probs)
regrets_ucb = simulate_bandit(ucb, true_probs)
regrets_ts = simulate_bandit(ts, true_probs)
# Thompson Sampling typically has lowest regret
Thompson Sampling with Gaussian Rewards
For continuous rewards (not binary), maintain a Gaussian posterior over each arm's mean (the fully conjugate treatment uses a Normal-Inverse-Gamma prior; the class below is a simpler approximation):
class GaussianThompsonSampling:
    def __init__(self, n_arms, prior_var=100.0):
        self.n = np.zeros(n_arms)
        self.mean = np.zeros(n_arms)
        self.m2 = np.zeros(n_arms)      # Welford sum of squared deviations
        self.prior_var = prior_var      # wide prior → unpulled arms get explored
    def select_arm(self):
        samples = []
        for m, m2, n in zip(self.mean, self.m2, self.n):
            # Posterior variance of the mean ≈ sample variance / n;
            # fall back to the wide prior while data is scarce
            var = self.prior_var if n < 2 else (m2 / (n - 1)) / n
            samples.append(np.random.normal(m, np.sqrt(var)))
        return int(np.argmax(samples))
    def update(self, arm, reward):
        self.n[arm] += 1
        delta = reward - self.mean[arm]
        self.mean[arm] += delta / self.n[arm]
        self.m2[arm] += delta * (reward - self.mean[arm])
Interview Signals
What interviewers listen for:
- You can distinguish when bandits vs A/B tests are appropriate
- You know Thompson Sampling's Beta-Bernoulli update rule
- You can explain UCB's optimism principle and exploration bonus
- You understand regret (linear vs sublinear = not learning vs learning)
- You can formulate bandit problems in recommendation and advertising settings
Practice
Flashcards
What is the core difference between a multi-armed bandit and an A/B test?
A/B test: fixed allocation (50/50) until the experiment ends → accurate causal effect estimation. Bandit: dynamic allocation → fast convergence to the best arm → minimized regret. But the bandit's unequal allocation invalidates standard statistical inference (CIs, p-values). Need causal inference → A/B test. Need optimization → bandit.
Quiz
What is Thompson Sampling's posterior for a new arm that has been pulled only twice (1 success, 1 failure)?