Bandits & Exploration

Interview Context

Multi-armed bandits are the most practical RL concept in DS interviews: directly applicable to A/B test optimization, ad serving, recommendation, and dynamic pricing. Unlike full RL, which must model state transitions, bandits are stateless, which makes them easier to understand and implement. Common interview questions include "How would you speed up an A/B test?" and "How do you balance exploration and exploitation?"

What You Should Understand

  • Describe the multi-armed bandit problem and its relationship to A/B testing
  • Understand the three main strategies: $\epsilon$-greedy, UCB, and Thompson Sampling
  • Know how contextual bandits incorporate user/item features
  • Understand the concept of regret and the regret bounds of different algorithms
  • Apply bandits in recommendation, advertising, and A/B testing scenarios

The Multi-Armed Bandit Problem

Imagine a row of slot machines (one-armed bandits). Each machine has an unknown reward probability. You have $T$ total pulls. Goal: maximize total reward.

The fundamental tradeoff:

  • Exploit: pull the machine that looks best so far → maximize short-term reward
  • Explore: try other machines that might be better → gather information

Formal Definition

  • $K$ arms, each with unknown reward distribution $P_k$ and mean $\mu_k$
  • At each round $t = 1, \ldots, T$: choose arm $A_t$, receive reward $r_t \sim P_{A_t}$
  • Goal: maximize $\sum_{t=1}^{T} r_t$; equivalently, minimize regret

Bandit vs A/B Test vs Full RL

| Aspect | A/B Test | Multi-Armed Bandit | Full RL |
|---|---|---|---|
| Exploration | Fixed ratio (50/50) for entire duration | Adaptive: shift traffic to better arms | Adaptive |
| State | None | None (stateless) | Has state transitions |
| When to decide | After the experiment ends | During the experiment: continuously updated | Continuously |
| Regret | High (50% of traffic goes to the worse arm throughout) | Low (quickly converges to the best arm) | Depends on the problem |
| Use case | Definitive causal effect estimation | Optimize while learning | Sequential decisions with states |

Bandit ≠ A/B Test Replacement

Bandits optimize allocation (minimize regret); A/B tests estimate causal effects (statistical inference). A bandit cannot give you a clean effect-size estimate with a confidence interval after the experiment ends, because its allocation is not balanced. If you need rigorous causal inference, run an A/B test. If you need to quickly allocate traffic to the best option, use a bandit.

Epsilon-Greedy

The simplest strategy:

$$A_t = \begin{cases} \text{random arm} & \text{with probability } \epsilon \\ \arg\max_k \hat{\mu}_k & \text{with probability } 1 - \epsilon \end{cases}$$

where $\hat{\mu}_k$ is the estimated mean reward of arm $k$.

| Pros | Cons |
|---|---|
| Simple to implement | Explores uniformly: even obviously bad arms keep getting pulled |
| Easy to understand | $\epsilon$ is fixed; doesn't adapt over time |
| Reasonable baseline | Linear regret $O(\epsilon T)$: never fully converges |

Decaying $\epsilon$

$$\epsilon_t = \min\left(1, \frac{c}{t}\right)$$

Explore more early (high $\epsilon$), exploit more later ($\epsilon \to 0$). But the decay schedule needs tuning; there is no universally optimal schedule.

import numpy as np

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.epsilon = epsilon

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.values))
        return self.values.argmax()

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean
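The decaying schedule can be sketched as a standalone variant of the class above (the class name and the `c` default are illustrative choices):

```python
import numpy as np

class DecayingEpsilonGreedy:
    """Epsilon-greedy with the epsilon_t = min(1, c / t) schedule."""
    def __init__(self, n_arms, c=5.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c
        self.t = 0

    def select_arm(self):
        self.t += 1
        epsilon = min(1.0, self.c / self.t)  # high early, decays toward 0
        if np.random.random() < epsilon:
            return np.random.randint(len(self.values))
        return int(self.values.argmax())

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

With `c=5`, the expected number of random pulls over $T$ rounds is only about $5 + 5\ln(T/5)$, so regret becomes sublinear at the cost of tuning `c`.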

Upper Confidence Bound (UCB)

The Principle: Optimism in the Face of Uncertainty

UCB selects the arm with the highest upper confidence bound, combining the estimated reward with an exploration bonus:

$$A_t = \arg\max_k \left[\hat{\mu}_k + c\sqrt{\frac{\ln t}{N_k}}\right]$$

| Term | Meaning |
|---|---|
| $\hat{\mu}_k$ | Estimated mean reward (exploitation) |
| $c\sqrt{\frac{\ln t}{N_k}}$ | Exploration bonus: large when $N_k$ is small (arms pulled less often get a bigger bonus) |
| $c$ | Controls exploration strength |

Intuition: if an arm has only been pulled a few times, its true mean could be higher than the estimate, so give it the benefit of the doubt.

UCB Properties

  • No random exploration: deterministically picks the arm with the highest UCB
  • Arms that haven't been pulled much get inflated scores → automatically explored
  • As $N_k$ grows, the bonus shrinks → converges to the best arm
  • Sublinear regret $O(\sqrt{KT \ln T})$ (the gap-dependent bound is logarithmic in $T$), far better than $\epsilon$-greedy's linear regret

class UCB:
    def __init__(self, n_arms, c=2.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Pull each arm once first
        for arm in range(len(self.values)):
            if self.counts[arm] == 0:
                return arm
        ucb_values = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return ucb_values.argmax()

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

Thompson Sampling

The Bayesian Approach

Maintain a posterior distribution for each arm's reward probability. At each round:

  1. Sample from each arm's posterior: $\tilde{\mu}_k \sim \text{posterior}_k$
  2. Select the arm with the highest sampled value: $A_t = \arg\max_k \tilde{\mu}_k$
  3. Observe reward, update posterior

Beta-Bernoulli Bandit

For binary rewards (click/no-click), use a Beta posterior:

  • Prior: $\text{Beta}(\alpha_k = 1, \beta_k = 1)$ (uniform: no prior belief)
  • After observing a success: $\alpha_k \leftarrow \alpha_k + 1$
  • After observing a failure: $\beta_k \leftarrow \beta_k + 1$
  • Posterior mean: $\hat{\mu}_k = \frac{\alpha_k}{\alpha_k + \beta_k}$

Sampling step: $\tilde{\mu}_k \sim \text{Beta}(\alpha_k, \beta_k)$

Why Thompson Sampling Works

| Property | Explanation |
|---|---|
| Natural exploration | Arms with high uncertainty (wide posterior) sometimes sample high → get explored |
| Automatic convergence | As data accumulates, the posterior narrows → samples concentrate near the true mean → exploit |
| No hyperparameters | No $\epsilon$ or $c$ to tune; the posterior calibrates exploration automatically |
| Near-optimal regret | $O(\sqrt{KT \ln T})$, the same order as UCB and close to the $\Omega(\sqrt{KT})$ lower bound |

Intuition: Thompson Sampling's exploration is probability matching: the probability of selecting an arm roughly equals its posterior probability of being the best arm. High uncertainty → occasional high samples → the arm gets explored. Low uncertainty → samples stay near the true mean → exploited if best.

class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

Thompson Sampling in Interviews

Thompson Sampling is the most frequently asked bandit strategy in interviews because it connects Bayesian inference (the Beta posterior) to exploration/exploitation. Being able to state the Beta-Bernoulli update rule and explain why uncertainty automatically drives exploration is a strong signal.

Algorithm Comparison

| Algorithm | Exploration | Regret | Hyperparams | Complexity | Best For |
|---|---|---|---|---|---|
| $\epsilon$-greedy | Uniform random | $O(\epsilon T)$ linear | $\epsilon$ | $O(1)$ | Simple baseline |
| UCB | Optimistic (bonus) | $O(\sqrt{KT \ln T})$ | $c$ | $O(1)$ | Deterministic; theoretical guarantees |
| Thompson | Bayesian (sample) | $O(\sqrt{KT \ln T})$ | None | $O(K)$ per sample | Industry standard; no tuning |
| EXP3 | Adversarial setting | $O(\sqrt{KT \ln K})$ | $\gamma$ | $O(K)$ | Non-stochastic rewards |
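EXP3 appears in the comparison without code. A minimal sketch (the `gamma` default is an illustrative choice): exponential weights with importance-weighted reward estimates, which keep the update unbiased even when each arm is only observed with probability $p_k$.

```python
import numpy as np

class EXP3:
    """EXP3 for adversarial bandits: exponential weights over arms.

    gamma mixes in uniform exploration and scales the learning rate.
    Assumes rewards in [0, 1].
    """
    def __init__(self, n_arms, gamma=0.1):
        self.weights = np.ones(n_arms)
        self.gamma = gamma

    def _probs(self):
        w = self.weights / self.weights.sum()
        return (1 - self.gamma) * w + self.gamma / len(self.weights)

    def select_arm(self):
        return int(np.random.choice(len(self.weights), p=self._probs()))

    def update(self, arm, reward):
        # Importance-weighted estimate: unbiased since E[r/p * 1{chosen}] = r
        x_hat = reward / self._probs()[arm]
        self.weights[arm] *= np.exp(self.gamma * x_hat / len(self.weights))
        self.weights /= self.weights.max()  # rescale; leaves probabilities unchanged
```

The rescaling step is purely for numerical stability: dividing all weights by a constant does not change `_probs()`.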

Why Thompson Sampling Wins in Practice

Empirically, Thompson Sampling often outperforms UCB because its exploration is more targeted (probability matching rather than an optimistic upper bound), and there are no hyperparameters to tune. Google, Microsoft, Netflix, and other companies use Thompson Sampling in production.

Regret

Definition

Regret = the total reward you missed by not always choosing the best arm:

$$\text{Regret}_T = T \mu^* - \sum_{t=1}^{T} \mu_{A_t} = \sum_{k=1}^{K} \Delta_k \, E[N_k(T)]$$

where $\mu^* = \max_k \mu_k$ is the best arm's mean and $\Delta_k = \mu^* - \mu_k$ is the gap of arm $k$.

Regret Intuition

| Strategy | Regret Growth | Intuition |
|---|---|---|
| Always random | $O(T)$ linear | Never learns; constant per-step regret |
| $\epsilon$-greedy (fixed) | $O(\epsilon T)$ linear | Always explores a random $\epsilon$ fraction |
| UCB / Thompson | $O(\sqrt{T \ln T})$ sublinear | Exploration decreases over time; per-step regret → 0 |
| Optimal | $\Omega(\sqrt{KT})$ | Theoretical lower bound; no algorithm can do better |

Sublinear regret means the algorithm is learning: per-round regret shrinks over time. Linear regret means nothing was learned.
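The gap decomposition can be sanity-checked numerically (the arm means and pull counts below are made-up illustrations):

```python
import numpy as np

# Hypothetical arm means and pull counts over T rounds
mu = np.array([0.05, 0.08, 0.12])     # best arm is index 2
n_pulls = np.array([100, 150, 9750])  # T = 10,000
T = n_pulls.sum()

# Two equivalent ways to compute expected regret
regret_direct = T * mu.max() - (mu * n_pulls).sum()
gaps = mu.max() - mu
regret_gaps = (gaps * n_pulls).sum()

assert np.isclose(regret_direct, regret_gaps)
print(round(float(regret_direct), 6))  # 13.0
```

Note how almost all regret comes from the 250 pulls spent on suboptimal arms; a good algorithm keeps those counts growing only logarithmically.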

Contextual Bandits

The Extension

In a standard bandit, each arm's reward comes from a fixed distribution. But in recommendation and advertising, the reward depends on context (user features, item features).

| Standard Bandit | Contextual Bandit |
|---|---|
| Reward $\sim P_k$ (fixed) | Reward $\sim P_k(\mathbf{x})$ (depends on context $\mathbf{x}$) |
| Same treatment for all users | Personalized: different actions for different users |
| Few arms (A/B/C) | Potentially many arms with shared features |

LinUCB

Model the expected reward as a linear function of context:

$$\hat{r}_{a} = \mathbf{x}^\top \boldsymbol{\theta}_a$$

UCB version:

$$A_t = \arg\max_a \left[\mathbf{x}_t^\top \hat{\boldsymbol{\theta}}_a + c\sqrt{\mathbf{x}_t^\top \mathbf{A}_a^{-1} \mathbf{x}_t}\right]$$

$\mathbf{A}_a^{-1}$ captures the uncertainty in $\hat{\boldsymbol{\theta}}_a$: contexts where we have less data receive a larger exploration bonus.
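A minimal LinUCB sketch of the rule above, with one ridge regression per arm; the ridge penalty `lambda_` and all names are illustrative:

```python
import numpy as np

class LinUCB:
    """LinUCB: per-arm ridge regression plus an uncertainty bonus."""
    def __init__(self, n_arms, context_dim, c=1.0, lambda_=1.0):
        self.c = c
        # A_a = lambda*I + X^T X,  b_a = X^T y  (per arm)
        self.A = [lambda_ * np.eye(context_dim) for _ in range(n_arms)]
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_arm(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                        # ridge estimate of theta_a
            bonus = self.c * np.sqrt(x @ A_inv @ x)  # uncertainty in this context
            scores.append(x @ theta + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

As each arm's $\mathbf{A}_a$ accumulates data, $\mathbf{A}_a^{-1}$ shrinks and the bonus vanishes, leaving pure exploitation of the fitted linear model.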

Contextual Thompson Sampling

Thompson Sampling extends to the contextual setting the same way: maintain a posterior distribution over $\boldsymbol{\theta}_a$ (for example, a multivariate Gaussian), sample $\tilde{\boldsymbol{\theta}}_a$ each round, and select the arm maximizing $\mathbf{x}^\top \tilde{\boldsymbol{\theta}}_a$.

class ContextualThompsonSampling:
    def __init__(self, n_arms, context_dim, lambda_=1.0):
        self.n_arms = n_arms
        self.d = context_dim
        # Per-arm: B = X^T X + λI, f = X^T y
        self.B = [lambda_ * np.eye(context_dim) for _ in range(n_arms)]
        self.f = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_arm(self, context):
        samples = []
        for a in range(self.n_arms):
            theta_hat = np.linalg.solve(self.B[a], self.f[a])
            cov = np.linalg.inv(self.B[a])
            theta_sample = np.random.multivariate_normal(theta_hat, cov)
            samples.append(context @ theta_sample)
        return np.argmax(samples)

    def update(self, arm, context, reward):
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context

Bandits vs A/B Testing

When to Use Which

| Need | Use | Why |
|---|---|---|
| Causal effect estimation with CI | A/B test | Equal allocation → unbiased estimate, valid statistical inference |
| Quick optimization: find and use the best option | Bandit | Adaptively shift traffic to the winner → less regret |
| Many options (more than 2-3 treatments) | Bandit | A/B test with 10 arms → 10% traffic per arm → underpowered |
| Personalization (different best option per user) | Contextual bandit | A one-size-fits-all A/B test misses the personalization opportunity |
| Continuous deployment (no clear end) | Bandit | An A/B test has a fixed duration; bandits operate continuously |

Hybrid: A/B Test + Bandit

| Phase | Approach | Goal |
|---|---|---|
| Phase 1 | Pure A/B test (equal split) | Estimate causal effects, validate metrics |
| Phase 2 | Switch to bandit | Optimize allocation, minimize opportunity cost |

Alternatively, run a Bayesian A/B test: Thompson Sampling's posteriors directly give $P(\text{B} > \text{A} \mid \text{data})$, providing inference and optimization at the same time.
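That posterior probability can be estimated by Monte Carlo from the same Beta posteriors Thompson Sampling maintains (the click counts below are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: (clicks, impressions) per variant
clicks_a, n_a = 120, 1000
clicks_b, n_b = 150, 1000

# Beta(1, 1) prior + observed successes/failures -> Beta posterior per variant
post_a = rng.beta(1 + clicks_a, 1 + n_a - clicks_a, size=100_000)
post_b = rng.beta(1 + clicks_b, 1 + n_b - clicks_b, size=100_000)

# Fraction of joint posterior draws where B's CTR exceeds A's
p_b_beats_a = (post_b > post_a).mean()
print(f"P(B > A | data) ~ {p_b_beats_a:.3f}")
```

A common decision rule is to ship B once this probability crosses a threshold (e.g. 0.95), though peeking repeatedly still deserves care.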

Real-World Use Cases

Case 1: Ad Serving — Ad Creative Selection

You have 5 different ad creatives to show users. Which has the highest CTR?

| Approach | How | Tradeoff |
|---|---|---|
| A/B test | Allocate 20% of traffic to each creative, run for 2 weeks | Accurate but slow; 80% of traffic goes to non-optimal creatives |
| $\epsilon$-greedy | 90% of traffic to the best-so-far, 10% random exploration | Simple, but wastes exploration on obviously bad creatives |
| Thompson Sampling | Sample from Beta posteriors, pick the highest sample | Best: natural exploration, fast convergence, no tuning |

# Ad creative Thompson Sampling
class AdBandit:
    def __init__(self, creative_ids):
        self.creatives = creative_ids
        self.alpha = {c: 1 for c in creative_ids}  # clicks + 1
        self.beta = {c: 1 for c in creative_ids}   # no-clicks + 1

    def select_creative(self):
        samples = {c: np.random.beta(self.alpha[c], self.beta[c]) for c in self.creatives}
        return max(samples, key=samples.get)

    def record(self, creative, clicked):
        if clicked:
            self.alpha[creative] += 1
        else:
            self.beta[creative] += 1

Interview follow-up: "What if there are 10K ads and not enough data per ad?" → Contextual bandit: share information across ads with similar features.

Case 2: Recommender Systems — Explore-Exploit for Cold Start

A new item has no interaction data → the recommender never recommends it → it never accumulates data: the cold-start loop.

Bandits solve this naturally: a new item's posterior uncertainty is large → Thompson Sampling sometimes samples a high value → the item gets recommended → data is collected → the posterior converges.

| Item Type | Bandit Behavior |
|---|---|
| Popular item (known good) | Posterior narrow, high mean → almost always selected |
| Known bad item | Posterior narrow, low mean → rarely selected |
| New item (uncertain) | Posterior wide → sometimes samples high → gets explored |
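The cold-start behavior can be quantified: assuming a weakly informative Beta(1, 10) prior for the new item (an assumption, not from the text), how often does its sample beat an established 10%-CTR item's?

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Established item: 500 clicks over 5,000 impressions -> narrow posterior near 10%
popular = rng.beta(1 + 500, 1 + 4500, size=n)
# New item: no data yet; weakly informative Beta(1, 10) prior -> wide posterior
new_item = rng.beta(1, 10, size=n)

# Fraction of rounds Thompson Sampling would show the new item instead
explore_rate = (new_item > popular).mean()
print(f"new item shown {explore_rate:.0%} of the time")  # roughly 35%
```

The wide posterior gives the new item a substantial share of impressions at first; as its clicks and non-clicks accumulate, the posterior narrows and the share adjusts automatically.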

Case 3: Dynamic Pricing — Price Optimization

You want to find the best price for a product. Each price is an arm:

| Component | In Pricing |
|---|---|
| Arms | Discrete price points: $9.99, $12.99, $14.99, $19.99 |
| Reward | Revenue = price × conversion_rate(price) |
| Challenge | Conversion rate and price have a monotone relationship → the arms are not truly independent |

With a contextual bandit: price is the action, and context = user features + time + demand signals → personalized pricing.
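A sketch of the non-contextual version, treating prices as independent arms (ignoring the monotone structure noted above; prices, conversion rates, and horizon are made up): each price gets a Beta posterior on its conversion rate, and Thompson Sampling picks the price maximizing sampled revenue.

```python
import numpy as np

rng = np.random.default_rng(7)

prices = np.array([9.99, 12.99, 14.99, 19.99])
# Hypothetical true conversion rates, decreasing in price (unknown to the algorithm)
true_conv = np.array([0.10, 0.08, 0.06, 0.03])

# Beta(1, 1) posterior per price point on its conversion rate
alpha = np.ones(len(prices))
beta = np.ones(len(prices))

for _ in range(20_000):
    # Sample a plausible conversion rate per price; pick the revenue-maximizing price
    sampled_conv = rng.beta(alpha, beta)
    arm = int(np.argmax(prices * sampled_conv))
    converted = rng.random() < true_conv[arm]
    alpha[arm] += converted
    beta[arm] += 1 - converted

pulls = alpha + beta - 2  # how often each price was tried
```

Note the reward here is revenue, not a click: the Beta posterior models conversion, and the sampled conversion is multiplied by price before the argmax. With these made-up numbers the two cheapest prices have nearly identical expected revenue, so the algorithm keeps splitting traffic between them while quickly suppressing the overpriced arms.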

Ethical Considerations in Pricing Experiments

Showing different prices to different users at the same time raises ethical concerns (price discrimination). Proactively mentioning this in an interview is a plus. Mitigations: (1) experiment only on new users, (2) vary prices within an acceptable range, (3) maintain a transparent pricing policy.

Case 4: Feature / UI Testing

You want to test 10 different homepage layouts. An A/B test would split traffic 10 ways → 10% per layout → underpowered.

Bandit approach: Thompson Sampling quickly identifies the top 2-3 layouts, allocates most traffic there, keeps exploring the others with less traffic, and finds the best layout faster with less wasted traffic.

Hands-on: Bandits in Python

Complete Simulation

import numpy as np

def simulate_bandit(bandit, true_probs, n_rounds=10000):
    """Simulate a bandit algorithm and return cumulative regret."""
    best_prob = max(true_probs)
    regret = 0
    regrets = []

    for t in range(n_rounds):
        arm = bandit.select_arm()
        reward = np.random.binomial(1, true_probs[arm])
        bandit.update(arm, reward)
        regret += best_prob - true_probs[arm]
        regrets.append(regret)

    return regrets

# True click probabilities (unknown to the algorithm)
true_probs = [0.05, 0.08, 0.12, 0.10, 0.03]

# Compare algorithms
eg = EpsilonGreedy(5, epsilon=0.1)
ucb = UCB(5, c=2.0)
ts = ThompsonSampling(5)

regrets_eg = simulate_bandit(eg, true_probs)
regrets_ucb = simulate_bandit(ucb, true_probs)
regrets_ts = simulate_bandit(ts, true_probs)

# Thompson Sampling typically has lowest regret

Thompson Sampling with Gaussian Rewards

For continuous rewards (not binary), a Normal approximation of the posterior over each arm's mean is a common sketch (a fully conjugate treatment would use a Normal-Inverse-Gamma posterior):

class GaussianThompsonSampling:
    def __init__(self, n_arms, prior_var=100.0):
        self.n = np.zeros(n_arms)
        self.mean = np.zeros(n_arms)
        self.m2 = np.zeros(n_arms)      # running sum of squared deviations (Welford)
        self.prior_var = prior_var      # wide prior variance -> unpulled arms get explored

    def select_arm(self):
        samples = []
        for k in range(len(self.n)):
            if self.n[k] < 2:
                var = self.prior_var                            # too little data: fall back to prior
            else:
                var = self.m2[k] / (self.n[k] - 1) / self.n[k]  # variance of the sample mean
            samples.append(np.random.normal(self.mean[k], np.sqrt(var)))
        return int(np.argmax(samples))

    def update(self, arm, reward):
        self.n[arm] += 1
        delta = reward - self.mean[arm]
        self.mean[arm] += delta / self.n[arm]
        self.m2[arm] += delta * (reward - self.mean[arm])  # Welford's online update

Interview Signals

What interviewers listen for:

  • You can distinguish when bandits vs A/B tests are appropriate
  • You know Thompson Sampling's Beta-Bernoulli update rule
  • You can explain UCB's optimism principle and exploration bonus
  • You understand regret (linear vs sublinear = not learning vs learning)
  • You can formulate bandit problems in recommendation and advertising scenarios

Practice

Flashcards


What is the core difference between a multi-armed bandit and an A/B test?

A/B test: fixed allocation (50/50) until the experiment ends → accurate causal effect estimation. Bandit: dynamic allocation → converges quickly to the best arm → minimizes regret. But the bandit's unequal allocation invalidates standard statistical inference (CIs, p-values). Need causal inference → A/B test. Need optimization → bandit.


Quiz

Question 1/10

What is the Thompson Sampling posterior for a new arm that has been pulled twice, with 1 success and 1 failure?
