Bandits & Exploration

Interview Context

Multi-armed bandits are the most practical RL concept in DS interviews: directly applicable to A/B test optimization, ad serving, recommendation, and dynamic pricing. Unlike full RL, which must model state transitions, bandits are stateless, which makes them easier to understand and implement. Common interview questions include "How would you speed up an A/B test?" and "How do you balance exploration and exploitation?"

What You Should Understand

  • Describe the multi-armed bandit problem and its relationship to A/B testing
  • Understand the three main strategies: $\epsilon$-greedy, UCB, and Thompson Sampling
  • Know how contextual bandits incorporate user/item features
  • Understand the concept of regret and the regret bounds of different algorithms
  • Apply bandits in recommendation, advertising, and A/B testing scenarios

The Multi-Armed Bandit Problem

Imagine a row of slot machines (one-armed bandits). Each machine has an unknown reward probability. You have $T$ total pulls. Goal: maximize total reward.

The fundamental tradeoff:

  • Exploit: pull the machine that looks best so far → maximize short-term reward
  • Explore: try other machines that might be better → gather information

Formal Definition

  • $K$ arms, each with unknown reward distribution $P_k$ and mean $\mu_k$
  • At each round $t = 1, \ldots, T$: choose arm $A_t$, receive reward $r_t \sim P_{A_t}$
  • Goal: maximize $\sum_{t=1}^{T} r_t$; equivalently, minimize regret

Bandit vs A/B Test vs Full RL

| Aspect | A/B Test | Multi-Armed Bandit | Full RL |
|---|---|---|---|
| Exploration | Fixed ratio (50/50) for entire duration | Adaptive: shift traffic to better arms | Adaptive |
| State | None | None (stateless) | Has state transitions |
| When to decide | After the experiment ends | During the experiment: continuously updated | Continuously |
| Regret | High (50% of traffic goes to the worse arm throughout) | Low (quickly converges to the best arm) | Depends on the problem |
| Use case | Definitive causal effect estimation | Optimize while learning | Sequential decisions with states |

Bandit ≠ A/B Test Replacement

Bandits optimize allocation (minimize regret); A/B tests estimate causal effects (statistical inference). A bandit cannot give you a clean effect-size estimate with a confidence interval after the experiment ends, because its allocation is not balanced. If you need rigorous causal inference, run an A/B test. If you need to quickly allocate traffic to the best option, use a bandit.

Epsilon-Greedy

The simplest strategy:

$$A_t = \begin{cases} \text{random arm} & \text{with probability } \epsilon \\ \arg\max_k \hat{\mu}_k & \text{with probability } 1 - \epsilon \end{cases}$$

where $\hat{\mu}_k$ is the estimated mean reward of arm $k$.

| Pros | Cons |
|---|---|
| Simple to implement | Explores uniformly: even obviously bad arms keep getting pulled |
| Easy to understand | $\epsilon$ is fixed; doesn't adapt over time |
| Reasonable baseline | Linear regret $O(\epsilon T)$: never fully converges |

Decaying $\epsilon$

$$\epsilon_t = \min\left(1, \frac{c}{t}\right)$$

Explore more early (high $\epsilon$), exploit more later ($\epsilon \to 0$). But the decay schedule needs tuning; there is no universally optimal schedule.

import numpy as np

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.epsilon = epsilon

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.values))
        return self.values.argmax()

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean
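The decaying schedule can be sketched as a standalone variant of the class above (the class name and the `c` default are illustrative choices):

```python
import numpy as np

class DecayingEpsilonGreedy:
    """Epsilon-greedy with the epsilon_t = min(1, c / t) schedule."""
    def __init__(self, n_arms, c=5.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c
        self.t = 0

    def select_arm(self):
        self.t += 1
        epsilon = min(1.0, self.c / self.t)  # high early, decays toward 0
        if np.random.random() < epsilon:
            return np.random.randint(len(self.values))
        return int(self.values.argmax())

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

With `c=5`, the expected number of random pulls over $T$ rounds is only about $5 + 5\ln(T/5)$, so regret becomes sublinear at the cost of tuning `c`.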

Upper Confidence Bound (UCB)

The Principle: Optimism in the Face of Uncertainty

UCB selects the arm with the highest upper confidence bound, combining the estimated reward with an exploration bonus:

$$A_t = \arg\max_k \left[\hat{\mu}_k + c\sqrt{\frac{\ln t}{N_k}}\right]$$

| Term | Meaning |
|---|---|
| $\hat{\mu}_k$ | Estimated mean reward (exploitation) |
| $c\sqrt{\frac{\ln t}{N_k}}$ | Exploration bonus: large when $N_k$ is small (arms pulled less often get a bigger bonus) |
| $c$ | Controls exploration strength |

Intuition: if an arm has only been pulled a few times, its true mean could be higher than the estimate, so give it the benefit of the doubt.

UCB Properties

  • No random exploration: deterministically picks the arm with the highest UCB
  • Arms that haven't been pulled much get inflated scores → automatically explored
  • As $N_k$ grows, the bonus shrinks → converges to the best arm
  • Sublinear regret $O(\sqrt{KT \ln T})$ (the gap-dependent bound is logarithmic in $T$), far better than $\epsilon$-greedy's linear regret

class UCB:
    def __init__(self, n_arms, c=2.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Pull each arm once first
        for arm in range(len(self.values)):
            if self.counts[arm] == 0:
                return arm
        ucb_values = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return ucb_values.argmax()

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

Thompson Sampling

The Bayesian Approach

Maintain a posterior distribution for each arm's reward probability. At each round:

  1. Sample from each arm's posterior: $\tilde{\mu}_k \sim \text{posterior}_k$
  2. Select the arm with the highest sampled value: $A_t = \arg\max_k \tilde{\mu}_k$
  3. Observe reward, update posterior

Beta-Bernoulli Bandit

For binary rewards (click/no-click), use a Beta posterior:

  • Prior: $\text{Beta}(\alpha_k = 1, \beta_k = 1)$ (uniform: no prior belief)
  • After observing a success: $\alpha_k \leftarrow \alpha_k + 1$
  • After observing a failure: $\beta_k \leftarrow \beta_k + 1$
  • Posterior mean: $\hat{\mu}_k = \frac{\alpha_k}{\alpha_k + \beta_k}$

Sampling step: $\tilde{\mu}_k \sim \text{Beta}(\alpha_k, \beta_k)$

Why Thompson Sampling Works

| Property | Explanation |
|---|---|
| Natural exploration | Arms with high uncertainty (wide posterior) sometimes sample high → get explored |
| Automatic convergence | As data accumulates, the posterior narrows → samples concentrate near the true mean → exploit |
| No hyperparameters | No $\epsilon$ or $c$ to tune; the posterior calibrates exploration automatically |
| Near-optimal regret | $O(\sqrt{KT \ln T})$, the same order as UCB and close to the $\Omega(\sqrt{KT})$ lower bound |

Intuition: Thompson Sampling's exploration is probability matching: the probability of selecting an arm roughly equals its posterior probability of being the best arm. High uncertainty → occasional high samples → the arm gets explored. Low uncertainty → samples stay near the true mean → exploited if best.

class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

Thompson Sampling in Interviews

Thompson Sampling is the most frequently asked bandit strategy in interviews because it connects Bayesian inference (the Beta posterior) to exploration/exploitation. Being able to state the Beta-Bernoulli update rule and explain why uncertainty automatically drives exploration is a strong signal.

Algorithm Comparison

| Algorithm | Exploration | Regret | Hyperparams | Complexity | Best For |
|---|---|---|---|---|---|
| $\epsilon$-greedy | Uniform random | $O(\epsilon T)$ linear | $\epsilon$ | $O(1)$ | Simple baseline |
| UCB | Optimistic (bonus) | $O(\sqrt{KT \ln T})$ | $c$ | $O(1)$ | Deterministic; theoretical guarantees |
| Thompson | Bayesian (sample) | $O(\sqrt{KT \ln T})$ | None | $O(K)$ per sample | Industry standard; no tuning |
| EXP3 | Adversarial setting | $O(\sqrt{KT \ln K})$ | $\gamma$ | $O(K)$ | Non-stochastic rewards |
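EXP3 appears in the comparison without code. A minimal sketch (the `gamma` default is an illustrative choice): exponential weights with importance-weighted reward estimates, which keep the update unbiased even when each arm is only observed with probability $p_k$.

```python
import numpy as np

class EXP3:
    """EXP3 for adversarial bandits: exponential weights over arms.

    gamma mixes in uniform exploration and scales the learning rate.
    Assumes rewards in [0, 1].
    """
    def __init__(self, n_arms, gamma=0.1):
        self.weights = np.ones(n_arms)
        self.gamma = gamma

    def _probs(self):
        w = self.weights / self.weights.sum()
        return (1 - self.gamma) * w + self.gamma / len(self.weights)

    def select_arm(self):
        return int(np.random.choice(len(self.weights), p=self._probs()))

    def update(self, arm, reward):
        # Importance-weighted estimate: unbiased since E[r/p * 1{chosen}] = r
        x_hat = reward / self._probs()[arm]
        self.weights[arm] *= np.exp(self.gamma * x_hat / len(self.weights))
        self.weights /= self.weights.max()  # rescale; leaves probabilities unchanged
```

The rescaling step is purely for numerical stability: dividing all weights by a constant does not change `_probs()`.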

Why Thompson Sampling Wins in Practice

Empirically, Thompson Sampling often outperforms UCB because its exploration is more targeted (probability matching rather than an optimistic upper bound), and there are no hyperparameters to tune. Google, Microsoft, Netflix, and other companies use Thompson Sampling in production.

Regret

Definition

Regret = the total reward you missed by not always choosing the best arm:

$$\text{Regret}_T = T \mu^* - \sum_{t=1}^{T} \mu_{A_t} = \sum_{k=1}^{K} \Delta_k \, E[N_k(T)]$$

where $\mu^* = \max_k \mu_k$ is the best arm's mean and $\Delta_k = \mu^* - \mu_k$ is the gap of arm $k$.

Regret Intuition

| Strategy | Regret Growth | Intuition |
|---|---|---|
| Always random | $O(T)$ linear | Never learns; constant per-step regret |
| $\epsilon$-greedy (fixed) | $O(\epsilon T)$ linear | Always explores a random $\epsilon$ fraction |
| UCB / Thompson | $O(\sqrt{T \ln T})$ sublinear | Exploration decreases over time; per-step regret → 0 |
| Optimal | $\Omega(\sqrt{KT})$ | Theoretical lower bound; no algorithm can do better |

Sublinear regret means the algorithm is learning: per-round regret shrinks over time. Linear regret means nothing was learned.
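The gap decomposition can be sanity-checked numerically (the arm means and pull counts below are made-up illustrations):

```python
import numpy as np

# Hypothetical arm means and pull counts over T rounds
mu = np.array([0.05, 0.08, 0.12])     # best arm is index 2
n_pulls = np.array([100, 150, 9750])  # T = 10,000
T = n_pulls.sum()

# Two equivalent ways to compute expected regret
regret_direct = T * mu.max() - (mu * n_pulls).sum()
gaps = mu.max() - mu
regret_gaps = (gaps * n_pulls).sum()

assert np.isclose(regret_direct, regret_gaps)
print(round(float(regret_direct), 6))  # 13.0
```

Note how almost all regret comes from the 250 pulls spent on suboptimal arms; a good algorithm keeps those counts growing only logarithmically.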

Contextual Bandits

The Extension

In a standard bandit, each arm's reward comes from a fixed distribution. But in recommendation and advertising, the reward depends on context (user features, item features).

| Standard Bandit | Contextual Bandit |
|---|---|
| Reward $\sim P_k$ (fixed) | Reward $\sim P_k(\mathbf{x})$ (depends on context $\mathbf{x}$) |
| Same treatment for all users | Personalized: different actions for different users |
| Few arms (A/B/C) | Potentially many arms with shared features |

LinUCB

Model the expected reward as a linear function of context:

$$\hat{r}_{a} = \mathbf{x}^\top \boldsymbol{\theta}_a$$

UCB version:

$$A_t = \arg\max_a \left[\mathbf{x}_t^\top \hat{\boldsymbol{\theta}}_a + c\sqrt{\mathbf{x}_t^\top \mathbf{A}_a^{-1} \mathbf{x}_t}\right]$$

$\mathbf{A}_a^{-1}$ captures the uncertainty in $\hat{\boldsymbol{\theta}}_a$: contexts where we have less data receive a larger exploration bonus.
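A minimal LinUCB sketch of the rule above, with one ridge regression per arm; the ridge penalty `lambda_` and all names are illustrative:

```python
import numpy as np

class LinUCB:
    """LinUCB: per-arm ridge regression plus an uncertainty bonus."""
    def __init__(self, n_arms, context_dim, c=1.0, lambda_=1.0):
        self.c = c
        # A_a = lambda*I + X^T X,  b_a = X^T y  (per arm)
        self.A = [lambda_ * np.eye(context_dim) for _ in range(n_arms)]
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_arm(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                        # ridge estimate of theta_a
            bonus = self.c * np.sqrt(x @ A_inv @ x)  # uncertainty in this context
            scores.append(x @ theta + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

As each arm's $\mathbf{A}_a$ accumulates data, $\mathbf{A}_a^{-1}$ shrinks and the bonus vanishes, leaving pure exploitation of the fitted linear model.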

Contextual Thompson Sampling

Thompson Sampling extends to the contextual setting the same way: maintain a posterior distribution over $\boldsymbol{\theta}_a$ (for example, a multivariate Gaussian), sample $\tilde{\boldsymbol{\theta}}_a$ each round, and select the arm maximizing $\mathbf{x}^\top \tilde{\boldsymbol{\theta}}_a$.

class ContextualThompsonSampling:
    def __init__(self, n_arms, context_dim, lambda_=1.0):
        self.n_arms = n_arms
        self.d = context_dim
        # Per-arm: B = X^T X + λI, f = X^T y
        self.B = [lambda_ * np.eye(context_dim) for _ in range(n_arms)]
        self.f = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_arm(self, context):
        samples = []
        for a in range(self.n_arms):
            theta_hat = np.linalg.solve(self.B[a], self.f[a])
            cov = np.linalg.inv(self.B[a])
            theta_sample = np.random.multivariate_normal(theta_hat, cov)
            samples.append(context @ theta_sample)
        return np.argmax(samples)

    def update(self, arm, context, reward):
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context

Bandits vs A/B Testing

When to Use Which

| Need | Use | Why |
|---|---|---|
| Causal effect estimation with CI | A/B test | Equal allocation → unbiased estimate, valid statistical inference |
| Quick optimization: find and use the best option | Bandit | Adaptively shift traffic to the winner → less regret |
| Many options (more than 2-3 treatments) | Bandit | A/B test with 10 arms → 10% traffic per arm → underpowered |
| Personalization (different best option per user) | Contextual bandit | A one-size-fits-all A/B test misses the personalization opportunity |
| Continuous deployment (no clear end) | Bandit | An A/B test has a fixed duration; bandits operate continuously |

Hybrid: A/B Test + Bandit

| Phase | Approach | Goal |
|---|---|---|
| Phase 1 | Pure A/B test (equal split) | Estimate causal effects, validate metrics |
| Phase 2 | Switch to bandit | Optimize allocation, minimize opportunity cost |

Alternatively, run a Bayesian A/B test: Thompson Sampling's posteriors directly give $P(\text{B} > \text{A} \mid \text{data})$, providing inference and optimization at the same time.
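That posterior probability can be estimated by Monte Carlo from the same Beta posteriors Thompson Sampling maintains (the click counts below are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: (clicks, impressions) per variant
clicks_a, n_a = 120, 1000
clicks_b, n_b = 150, 1000

# Beta(1, 1) prior + observed successes/failures -> Beta posterior per variant
post_a = rng.beta(1 + clicks_a, 1 + n_a - clicks_a, size=100_000)
post_b = rng.beta(1 + clicks_b, 1 + n_b - clicks_b, size=100_000)

# Fraction of joint posterior draws where B's CTR exceeds A's
p_b_beats_a = (post_b > post_a).mean()
print(f"P(B > A | data) ~ {p_b_beats_a:.3f}")
```

A common decision rule is to ship B once this probability crosses a threshold (e.g. 0.95), though peeking repeatedly still deserves care.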

Real-World Use Cases

Case 1: Ad Serving — Ad Creative Selection

You have 5 different ad creatives to show users. Which has the highest CTR?

| Approach | How | Tradeoff |
|---|---|---|
| A/B test | Allocate 20% of traffic to each creative, run for 2 weeks | Accurate but slow; 80% of traffic goes to non-optimal creatives |
| $\epsilon$-greedy | 90% of traffic to the best-so-far, 10% random exploration | Simple, but wastes exploration on obviously bad creatives |
| Thompson Sampling | Sample from Beta posteriors, pick the highest sample | Best: natural exploration, fast convergence, no tuning |

# Ad creative Thompson Sampling
class AdBandit:
    def __init__(self, creative_ids):
        self.creatives = creative_ids
        self.alpha = {c: 1 for c in creative_ids}  # clicks + 1
        self.beta = {c: 1 for c in creative_ids}   # no-clicks + 1

    def select_creative(self):
        samples = {c: np.random.beta(self.alpha[c], self.beta[c]) for c in self.creatives}
        return max(samples, key=samples.get)

    def record(self, creative, clicked):
        if clicked:
            self.alpha[creative] += 1
        else:
            self.beta[creative] += 1

Interview follow-up: "What if there are 10K ads and not enough data per ad?" → Contextual bandit: share information across ads with similar features.

Case 2: Recommender Systems — Explore-Exploit for Cold Start

A new item has no interaction data → the recommender never recommends it → it never accumulates data: the cold-start loop.

Bandits solve this naturally: a new item's posterior uncertainty is large → Thompson Sampling sometimes samples a high value → the item gets recommended → data is collected → the posterior converges.

| Item Type | Bandit Behavior |
|---|---|
| Popular item (known good) | Posterior narrow, high mean → almost always selected |
| Known bad item | Posterior narrow, low mean → rarely selected |
| New item (uncertain) | Posterior wide → sometimes samples high → gets explored |
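The cold-start behavior can be quantified: assuming a weakly informative Beta(1, 10) prior for the new item (an assumption, not from the text), how often does its sample beat an established 10%-CTR item's?

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Established item: 500 clicks over 5,000 impressions -> narrow posterior near 10%
popular = rng.beta(1 + 500, 1 + 4500, size=n)
# New item: no data yet; weakly informative Beta(1, 10) prior -> wide posterior
new_item = rng.beta(1, 10, size=n)

# Fraction of rounds Thompson Sampling would show the new item instead
explore_rate = (new_item > popular).mean()
print(f"new item shown {explore_rate:.0%} of the time")  # roughly 35%
```

The wide posterior gives the new item a substantial share of impressions at first; as its clicks and non-clicks accumulate, the posterior narrows and the share adjusts automatically.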

Case 3: Dynamic Pricing — Price Optimization

You want to find the best price for a product. Each price is an arm:

| Component | In Pricing |
|---|---|
| Arms | Discrete price points: $9.99, $12.99, $14.99, $19.99 |
| Reward | Revenue = price × conversion_rate(price) |
| Challenge | Conversion rate and price have a monotone relationship → the arms are not truly independent |

With a contextual bandit: price is the action, and context = user features + time + demand signals → personalized pricing.
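A sketch of the non-contextual version, treating prices as independent arms (ignoring the monotone structure noted above; prices, conversion rates, and horizon are made up): each price gets a Beta posterior on its conversion rate, and Thompson Sampling picks the price maximizing sampled revenue.

```python
import numpy as np

rng = np.random.default_rng(7)

prices = np.array([9.99, 12.99, 14.99, 19.99])
# Hypothetical true conversion rates, decreasing in price (unknown to the algorithm)
true_conv = np.array([0.10, 0.08, 0.06, 0.03])

# Beta(1, 1) posterior per price point on its conversion rate
alpha = np.ones(len(prices))
beta = np.ones(len(prices))

for _ in range(20_000):
    # Sample a plausible conversion rate per price; pick the revenue-maximizing price
    sampled_conv = rng.beta(alpha, beta)
    arm = int(np.argmax(prices * sampled_conv))
    converted = rng.random() < true_conv[arm]
    alpha[arm] += converted
    beta[arm] += 1 - converted

pulls = alpha + beta - 2  # how often each price was tried
```

Note the reward here is revenue, not a click: the Beta posterior models conversion, and the sampled conversion is multiplied by price before the argmax. With these made-up numbers the two cheapest prices have nearly identical expected revenue, so the algorithm keeps splitting traffic between them while quickly suppressing the overpriced arms.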

Ethical Considerations in Pricing Experiments

Showing different prices to different users at the same time raises ethical concerns (price discrimination). Proactively mentioning this in an interview is a plus. Mitigations: (1) experiment only on new users, (2) vary prices within an acceptable range, (3) maintain a transparent pricing policy.

Case 4: Feature / UI Testing

You want to test 10 different homepage layouts. An A/B test would split traffic 10 ways → 10% per layout → underpowered.

Bandit approach: Thompson Sampling quickly identifies the top 2-3 layouts, allocates most traffic there, keeps exploring the others with less traffic, and finds the best layout faster with less wasted traffic.

Hands-on: Bandits in Python

Complete Simulation

import numpy as np

def simulate_bandit(bandit, true_probs, n_rounds=10000):
    """Simulate a bandit algorithm and return cumulative regret."""
    best_prob = max(true_probs)
    regret = 0
    regrets = []

    for t in range(n_rounds):
        arm = bandit.select_arm()
        reward = np.random.binomial(1, true_probs[arm])
        bandit.update(arm, reward)
        regret += best_prob - true_probs[arm]
        regrets.append(regret)

    return regrets

# True click probabilities (unknown to the algorithm)
true_probs = [0.05, 0.08, 0.12, 0.10, 0.03]

# Compare algorithms
eg = EpsilonGreedy(5, epsilon=0.1)
ucb = UCB(5, c=2.0)
ts = ThompsonSampling(5)

regrets_eg = simulate_bandit(eg, true_probs)
regrets_ucb = simulate_bandit(ucb, true_probs)
regrets_ts = simulate_bandit(ts, true_probs)

# Thompson Sampling typically has lowest regret

Thompson Sampling with Gaussian Rewards

For continuous rewards (not binary), a Normal approximation of the posterior over each arm's mean is a common sketch (a fully conjugate treatment would use a Normal-Inverse-Gamma posterior):

class GaussianThompsonSampling:
    def __init__(self, n_arms, prior_var=100.0):
        self.n = np.zeros(n_arms)
        self.mean = np.zeros(n_arms)
        self.m2 = np.zeros(n_arms)      # running sum of squared deviations (Welford)
        self.prior_var = prior_var      # wide prior variance -> unpulled arms get explored

    def select_arm(self):
        samples = []
        for k in range(len(self.n)):
            if self.n[k] < 2:
                var = self.prior_var                            # too little data: fall back to prior
            else:
                var = self.m2[k] / (self.n[k] - 1) / self.n[k]  # variance of the sample mean
            samples.append(np.random.normal(self.mean[k], np.sqrt(var)))
        return int(np.argmax(samples))

    def update(self, arm, reward):
        self.n[arm] += 1
        delta = reward - self.mean[arm]
        self.mean[arm] += delta / self.n[arm]
        self.m2[arm] += delta * (reward - self.mean[arm])  # Welford's online update

Interview Signals

What interviewers listen for:

  • You can distinguish when bandits vs A/B tests are appropriate
  • You know Thompson Sampling's Beta-Bernoulli update rule
  • You can explain UCB's optimism principle and exploration bonus
  • You understand regret (linear vs sublinear = not learning vs learning)
  • You can formulate bandit problems in recommendation and advertising scenarios

Practice

Flashcards


What is the core difference between a multi-armed bandit and an A/B test?

A/B test: fixed allocation (50/50) until the experiment ends → accurate causal effect estimation. Bandit: dynamic allocation → converges quickly to the best arm → minimizes regret. But the bandit's unequal allocation invalidates standard statistical inference (CIs, p-values). Need causal inference → A/B test. Need optimization → bandit.


Quiz

Question 1/10

What is the Thompson Sampling posterior for a new arm that has been pulled twice, with 1 success and 1 failure?
