RL Fundamentals

Interview Context

RL in DS interviews usually isn't tested at the level of deep algorithm implementation — it's about conceptual understanding: MDP formulation, exploration vs exploitation, reward design. ML Engineer interviews may probe the details of DQN or policy gradients. For a DS, the most important things are understanding how RL differs from supervised learning, and how RL applies to recommendation systems, ad bidding, and similar settings.

What You Should Understand

  • Can describe an RL problem using the MDP framework
  • Understand the meaning of the Bellman equation and value functions
  • Know the difference between value-based and policy-based methods
  • Can explain the exploration vs exploitation tradeoff
  • Understand the fundamental differences between RL and supervised learning

RL vs Supervised Learning

| Aspect | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Correct label for each input | Reward signal (delayed, sparse) |
| Data | Fixed dataset, i.i.d. | Agent generates its own data by interacting |
| Goal | Minimize prediction error | Maximize cumulative reward |
| Sequential decisions | No | Yes — current action affects future states |
| Exploration | Not needed | Must explore to discover good actions |
| Credit assignment | Immediate | Delayed — which past action caused this reward? |

Why Should a DS Understand RL?

RL concepts apply directly to: (1) recommendation systems (a sequence of recommendations = sequential decisions), (2) ad bidding (bid → outcome → learn), (3) A/B test optimization (Thompson Sampling = bandit = simple RL), (4) dynamic pricing. You don't need to train AlphaGo, but you do need to understand the framework.

Markov Decision Process (MDP)

Components

An MDP is defined by the tuple $(S, A, P, R, \gamma)$:

| Component | Symbol | Meaning |
|---|---|---|
| States | $S$ | All possible situations the agent can be in |
| Actions | $A$ | All possible actions the agent can take |
| Transition | $P(s' \mid s, a)$ | Probability of reaching state $s'$ from $s$ after action $a$ |
| Reward | $R(s, a, s')$ | Immediate reward for taking action $a$ in state $s$ |
| Discount factor | $\gamma \in [0, 1]$ | How much to value future rewards vs immediate |

The Markov Property

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$

The future depends only on the current state, not on how we got there. This greatly simplifies the problem — there is no need to remember the full history.

A common interview challenge: "When does the Markov property fail?" — when the state representation is incomplete (e.g., a recommender that looks only at the current page and ignores browsing history → the browsing history needs to become part of the state).

Discount Factor $\gamma$

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

| $\gamma$ | Behavior |
|---|---|
| $\gamma = 0$ | Completely myopic — only care about immediate reward |
| $\gamma = 0.9$ | Balanced — future rewards matter but are discounted |
| $\gamma = 0.99$ | Very far-sighted — almost equal weight to the far future |
| $\gamma = 1$ | No discounting — can cause divergence in infinite-horizon MDPs |

$\gamma < 1$ guarantees that the infinite sum converges; it also reflects uncertainty (the farther into the future, the less certain) and time preference (reward now is worth more than reward later).
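To make the discount concrete, here is a minimal sketch (pure Python, with a made-up reward sequence) that computes $G_t$ for a few values of $\gamma$:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k}, accumulated right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]  # four steps of reward 1

print(discounted_return(rewards, 0.0))  # myopic: 1.0
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return(rewards, 1.0))  # no discount: 4.0
```

Note how $\gamma = 0$ keeps only the first reward, while $\gamma = 1$ weights every step equally.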

Policy

A policy $\pi$ maps states to actions:

  • Deterministic: $a = \pi(s)$
  • Stochastic: $a \sim \pi(a \mid s)$

The goal of RL: find the optimal policy $\pi^*$ that maximizes the expected cumulative reward.

Value Functions

State-Value Function $V^\pi(s)$

Expected cumulative reward starting from state $s$, following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right]$$

"In state $s$, if I keep making decisions according to policy $\pi$, how much total reward can I expect to collect?"

Action-Value Function $Q^\pi(s, a)$

Expected cumulative reward starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right]$$

"Take action $a$ in state $s$, then keep following $\pi$ — how much total reward can I expect?"

Relationship

$$V^\pi(s) = \sum_a \pi(a \mid s) \, Q^\pi(s, a)$$

$V$ = the average over all actions according to the policy. $Q$ = specific to a particular action.
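A quick numeric check of this identity for a single state with two actions (the numbers are made up for illustration):

```python
pi = [0.3, 0.7]   # pi(a | s) for actions a0, a1
Q = [10.0, 2.0]   # Q(s, a0), Q(s, a1)

# V(s) = sum_a pi(a|s) * Q(s, a)
V = sum(p * q for p, q in zip(pi, Q))
print(V)  # 0.3*10 + 0.7*2 = 4.4
```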

Bellman Equations

Bellman Expectation Equation

VV and QQ satisfy recursive relationships:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$

Intuition: current value = immediate reward + discounted future value. This decomposes long-term planning into a one-step lookahead.

Bellman Optimality Equation

For the optimal policy $\pi^*$:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right]$$

$V^*$ takes the max (choose the best action) rather than averaging over the policy. If you know $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Interview Key Insight

The Bellman equation converts the RL problem from "optimize over the entire trajectory" into "optimize one step at a time". This is the core idea of dynamic programming — break a large problem into overlapping subproblems.
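The Bellman optimality equation turns directly into the value iteration algorithm: repeatedly apply its right-hand side until $V$ stops changing. A minimal sketch on an invented two-state, two-action MDP (all transition and reward numbers are made up for illustration):

```python
import numpy as np

# Made-up MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # from state 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V       # Q[s, a] = R + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)       # Bellman optimality backup
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)       # greedy policy w.r.t. Q*
print(V, policy)
```

At convergence, $V$ satisfies the Bellman optimality equation, and the greedy policy read off from $Q$ is optimal for this toy MDP.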

Value-Based Methods

Q-Learning (Off-Policy)

Learn $Q^*$ directly without knowing the transition model:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
| Term | Meaning |
|---|---|
| $r + \gamma \max_{a'} Q(s', a')$ | TD target — estimate of the true value |
| $Q(s, a)$ | Current estimate |
| Difference | TD error — how wrong our estimate was |
| $\alpha$ | Learning rate |

Off-policy: the update uses $\max_{a'}$ (the best possible next action), regardless of which action the agent actually took. The agent can therefore explore freely (e.g., with $\epsilon$-greedy) while still learning the optimal Q.

SARSA (On-Policy)

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma Q(s', a') - Q(s, a)\right]$$

Here $a'$ is the action the agent actually chose (according to the current policy), not the $\max$.

On-policy: updates are based on what the agent actually does → it learns the value of the current policy (including the cost of exploration).

Q-Learning vs SARSA

| Aspect | Q-Learning | SARSA |
|---|---|---|
| Type | Off-policy | On-policy |
| Update target | $\max_{a'} Q(s', a')$ | $Q(s', a')$ (actual next action) |
| What it learns | $Q^*$ (optimal Q) | $Q^\pi$ (Q of the current policy) |
| Exploration | Can explore aggressively | Exploration affects the learned Q |
| Safety | May learn a risky optimal policy | Learns a safer policy (accounts for exploration noise) |
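The two update rules differ only in their target. A side-by-side sketch on a single made-up transition, where the agent's actual next action is an exploratory (suboptimal) one:

```python
alpha, gamma = 0.5, 0.9

# Made-up Q table: state 's', next state 'sp' with two actions.
Q = {('s', 'a1'): 0.0, ('sp', 'a1'): 5.0, ('sp', 'a2'): 1.0}

r = 1.0
a_actual = 'a2'  # the exploratory action actually taken in s'

# Q-learning: bootstrap from the best next action, ignoring a_actual
q_target = r + gamma * max(Q[('sp', 'a1')], Q[('sp', 'a2')])
q_learning = Q[('s', 'a1')] + alpha * (q_target - Q[('s', 'a1')])

# SARSA: bootstrap from the action actually taken
s_target = r + gamma * Q[('sp', a_actual)]
sarsa = Q[('s', 'a1')] + alpha * (s_target - Q[('s', 'a1')])

print(q_learning)  # 0 + 0.5 * (1 + 0.9*5 - 0) = 2.75
print(sarsa)       # 0 + 0.5 * (1 + 0.9*1 - 0) = 0.95
```

Q-learning credits the state with the optimal continuation; SARSA's smaller update reflects the cost of the exploratory action actually taken.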

Deep Q-Network (DQN)

When the state space is too large for a table → use a neural network to approximate $Q(s, a; \theta)$:

$$\mathcal{L} = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

Key innovations (Mnih et al., 2015):

| Innovation | Why Needed |
|---|---|
| Experience replay | Store transitions in a buffer → sample random batches → breaks correlation between consecutive samples |
| Target network $\theta^-$ | Separate network for the TD target → updated periodically → stabilizes training (avoids chasing a moving target) |
| $\epsilon$-greedy exploration | With probability $\epsilon$ take a random action, otherwise $\arg\max Q$ |
import random

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),  # output Q(s, a) for each action
        )

    def forward(self, state):
        return self.net(state)  # [batch, action_dim]

# Select action: epsilon-greedy
def select_action(state, q_net, action_dim, epsilon):
    if random.random() < epsilon:
        return random.randint(0, action_dim - 1)  # explore
    with torch.no_grad():
        return q_net(state).argmax().item()  # exploit

Policy-Based Methods

Why Not Just Q-Learning?

| Limitation of Value-Based | How Policy-Based Solves It |
|---|---|
| Discrete actions only ($\arg\max$ over actions) | Continuous actions (policy outputs a distribution) |
| Can't represent stochastic optimal policies | Naturally stochastic ($\pi(a \mid s)$ is a distribution) |
| Small changes in Q → large policy changes | Smooth policy updates |

REINFORCE (Monte Carlo Policy Gradient)

Directly optimize the policy $\pi_\theta(a \mid s)$ by gradient ascent on the expected return:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot G_t\right]$$

  • $G_t$ = total return from time step $t$ onwards
  • $\nabla_\theta \log \pi_\theta$ = score function — the direction in which to update the policy
  • Intuition: if action $a$ led to a high return → increase $\pi(a \mid s)$; a low return → decrease it.

Problem: $G_t$ has very high variance (it depends on the entire future trajectory) → slow convergence.

Baseline and Advantage

Subtract a baseline $b(s)$ to reduce variance without introducing bias:

$$\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot (G_t - b(s))\right]$$

A natural choice of baseline: $V(s)$.

Advantage function: $A(s, a) = Q(s, a) - V(s)$ — how much better action $a$ is than the average.

$$\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot A(s, a)\right]$$

$A > 0$ → the action is better than average → increase its probability. $A < 0$ → worse than average → decrease it.
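For a stateless softmax policy over three actions, the advantage-weighted score-function update can be sketched in a few lines (the sampled action, advantage estimate, and step size are all made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros(3)  # logits for 3 actions (uniform policy)
pi = softmax(theta)

a = 0      # sampled action (fixed here for illustration)
A = 2.0    # made-up advantage estimate for that action

# For softmax logits: grad of log pi(a) w.r.t. theta = one_hot(a) - pi
one_hot = np.eye(3)[a]
grad_log_pi = one_hot - pi

theta_new = theta + 0.1 * grad_log_pi * A  # gradient ascent step
pi_new = softmax(theta_new)

print(pi, pi_new)  # probability of action 0 increases since A > 0
```

With a negative advantage the same update would push $\pi(a \mid s)$ down instead.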

Actor-Critic

Combine the best of both worlds:

  • Actor (policy $\pi_\theta$): decides which action to take
  • Critic (value $V_\phi$): evaluates how good the action was
Actor: π_θ(a|s) → action
          ↓
    Environment → reward, next state
          ↓
Critic: V_ϕ(s) → evaluate (compute advantage)
          ↓
    Update actor using advantage
    Update critic using TD error

A2C / A3C

Advantage Actor-Critic (A2C):

  • The critic estimates $V(s)$
  • Advantage = $r + \gamma V(s') - V(s)$ (the TD error as an advantage estimate)
  • Actor update: $\nabla_\theta J \propto \nabla_\theta \log \pi_\theta(a \mid s) \cdot A$

A3C (Asynchronous): multiple agents explore in parallel → faster training and more diverse experience.

PPO (Proximal Policy Optimization)

The de facto standard for modern RL:

$$\mathcal{L}^{\text{CLIP}} = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$ is the probability ratio.

Core idea: limit the size of each policy update → avoid catastrophic policy collapse → far more stable than vanilla policy gradient.

PPO is used in ChatGPT's RLHF (Reinforcement Learning from Human Feedback) stage.
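What the clip actually does, on made-up numbers: with a positive advantage, the objective's contribution is capped once the ratio exceeds $1+\epsilon$, so there is no incentive to push the policy further in a single update:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)  # pessimistic bound

A = 1.0  # made-up positive advantage
for r in [0.5, 1.0, 1.5, 3.0]:
    print(r, ppo_clip_objective(r, A))
# ratio 3.0 with A > 0 is capped at 1.2 * A: no reward for moving further
```

Taking the min makes the bound pessimistic: the clip only ever removes the incentive for large policy changes, never adds one.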

Exploration vs Exploitation

| Strategy | How | Pros | Cons |
|---|---|---|---|
| $\epsilon$-greedy | With prob $\epsilon$: random; else: best | Simple | Random exploration is inefficient |
| Softmax / Boltzmann | $P(a) \propto \exp(Q(a)/\tau)$ | Smooth, respects Q values | Temperature $\tau$ needs tuning |
| UCB | Choose $\arg\max\left(Q(a) + c\sqrt{\frac{\ln t}{N(a)}}\right)$ | Optimism in the face of uncertainty | Needs counts $N(a)$ |
| Thompson Sampling | Sample from the posterior, pick the best | Bayesian, naturally balances | Needs posterior updates |
| Intrinsic motivation | Add a curiosity reward for novel states | Explores systematically | Complex implementation |
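A sketch of the UCB score with made-up counts: the rarely-pulled arm wins despite a lower mean reward estimate, because its uncertainty bonus is larger:

```python
import math

def ucb_score(q, n_pulls, t, c=2.0):
    # Q estimate plus an optimism bonus that shrinks as the arm is pulled more
    return q + c * math.sqrt(math.log(t) / n_pulls)

t = 100                  # total pulls so far
arms = {'a': (0.6, 90),  # (estimated Q, times pulled): well-explored arm
        'b': (0.5, 10)}  # under-explored arm with lower estimate

scores = {name: ucb_score(q, n, t) for name, (q, n) in arms.items()}
best = max(scores, key=scores.get)
print(scores, best)  # 'b' wins: lower Q, much larger uncertainty bonus
```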

Classic Interview Question

"What is the exploration vs exploitation tradeoff?" — Exploitation: pick the known-best action (maximize short-term reward). Exploration: try new actions (one of them might be better). Too much exploitation → stuck at a local optimum. Too much exploration → resources wasted on bad actions. The best strategy sits between the two — $\epsilon$-greedy and Thompson Sampling are the most common solutions.

Reward Design

Reward design is the most important and most difficult part of RL:

Common Pitfalls

| Pitfall | Example | Consequence |
|---|---|---|
| Reward hacking | Robot told to walk fast → learns to fall forward very quickly | Optimizes the reward but not the intent |
| Sparse rewards | Game: reward only on win/lose | Agent explores randomly for a long time → very slow learning |
| Misaligned rewards | Optimize clicks → recommend clickbait | Short-term reward ↑ but long-term user satisfaction ↓ |

Reward Shaping

Add intermediate rewards to guide learning without changing the optimal policy:

$$R' = R + \gamma \Phi(s') - \Phi(s)$$

where $\Phi$ is a potential function. As long as the shaping reward is potential-based, the optimal policy is unchanged — shaping only speeds up learning.
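Why potential-based shaping preserves the optimal policy: along any trajectory the $\gamma \Phi(s') - \Phi(s)$ terms telescope, so the shaped return differs from the original only by a constant that depends on the start state. A quick check with made-up numbers:

```python
# Made-up episode: states s0..s3, rewards r0..r2.
gamma = 0.9
states = ['s0', 's1', 's2', 's3']   # s3 is terminal
rewards = [0.0, 0.0, 10.0]
phi = {'s0': 1.0, 's1': 4.0, 's2': 8.0, 's3': 0.0}  # potential, 0 at terminal

def ret(rs, gamma):
    return sum(gamma**k * r for k, r in enumerate(rs))

# Shaped reward at step k: r_k + gamma * phi(s_{k+1}) - phi(s_k)
shaped = [r + gamma * phi[states[k + 1]] - phi[states[k]]
          for k, r in enumerate(rewards)]

# The shaped return differs from the original only by -phi(s0),
# so the ranking of policies from any fixed start state is preserved.
print(ret(rewards, gamma), ret(shaped, gamma))
assert abs(ret(shaped, gamma) - (ret(rewards, gamma) - phi['s0'])) < 1e-9
```

The intermediate rewards guide the agent toward the goal (here, toward higher-potential states), without changing which policy is best.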

Real-World Use Cases

Case 1: Recommendation Systems — RL for Sequential Recommendation

Recommendation is not one-shot classification — it is a sequence of decisions:

| MDP Component | In Recommendation |
|---|---|
| State | User profile + browsing history + context |
| Action | Which item(s) to recommend |
| Reward | Click, purchase, session length, user return |
| Transition | User's updated state after seeing the recommendation |
| $\gamma$ | How much to value long-term engagement vs immediate clicks |

Interview follow-up: "Why not use supervised learning?" — Supervised learning only optimizes a one-step prediction. RL considers the long term: recommending clickbait → immediate clicks ↑ but the user leaves the platform → long-term value ↓. RL can learn to sacrifice short-term clicks for long-term retention.

Case 2: Ad Bidding — Dynamic Bidding

| MDP Component | In Ad Bidding |
|---|---|
| State | Ad slot features, budget remaining, time of day |
| Action | Bid amount |
| Reward | Conversion value − bid cost |
| Constraint | Budget must last until the end of the campaign |

RL can learn a dynamic bidding strategy — adjusting bids based on the remaining budget and expected ROI, which is more efficient than a fixed-bid strategy.

Case 3: RLHF — How ChatGPT Is Trained

| Step | What Happens |
|---|---|
| 1. Pre-train | GPT does next-token prediction on a large text corpus (supervised) |
| 2. SFT | Fine-tune on human-written responses (supervised fine-tuning) |
| 3. Reward model | Train a model to predict human preference (which response is better) |
| 4. RLHF | Use PPO to optimize GPT's policy to maximize the reward model's score |

The RLHF reward = human preference (better aligned with user intent than next-token prediction alone). PPO's clipping keeps GPT from drifting too far from the SFT model.

Method Comparison

| Method | Type | Pros | Cons | Best For |
|---|---|---|---|---|
| Q-Learning | Value, off-policy | Simple, sample efficient | Discrete actions only | Small discrete action spaces |
| DQN | Value, off-policy | Handles large state spaces | Still discrete actions, can be unstable | Atari games, discrete control |
| REINFORCE | Policy, on-policy | Continuous actions, simple | High variance, slow | Simple continuous control |
| A2C/A3C | Actor-Critic | Lower variance than REINFORCE | More complex | General continuous control |
| PPO | Actor-Critic | Stable, easy to tune | Sample inefficient | Industry standard (robotics, RLHF) |

Hands-on: RL in Python

Q-Learning (Tabular)

import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])  # explore
        return self.Q[state].argmax()  # exploit

    def update(self, state, action, reward, next_state):
        td_target = reward + self.gamma * self.Q[next_state].max()
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.lr * td_error

DQN (PyTorch)

import torch
import torch.nn as nn
from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, x):
        return self.net(x)

# Training: sample batch from replay buffer → compute TD loss → update

Interview Signals

What interviewers listen for:

  • You can describe a real-world problem in the MDP framework (states, actions, rewards)
  • You know the intuition behind the Bellman equation (current value = immediate reward + discounted future value)
  • You can distinguish where value-based vs policy-based methods are appropriate
  • You understand the exploration vs exploitation tradeoff and the common strategies
  • You know the fundamental differences between RL and supervised learning

Practice

Flashcards


What is the most fundamental difference between RL and supervised learning?

SL: a correct label for each input (i.i.d. data, immediate feedback). RL: only a reward signal (delayed, sparse, sequential decisions; the agent generates its own data through interaction). RL's credit assignment problem — which past action caused the current reward? — is the core challenge.


Quiz


What does the Markov property of an MDP mean?
