RL Fundamentals

Interview Context

RL in DS interviews usually isn't tested at the level of deep algorithm implementation — it's about conceptual understanding: MDP formulation, exploration vs exploitation, reward design. ML Engineer interviews may probe the details of DQN or policy gradients. For a DS, the most important things are understanding how RL differs from supervised learning, and how RL applies to recommendation systems, ad bidding, and similar settings.

What You Should Understand

  • Can describe an RL problem using the MDP framework
  • Understand the meaning of the Bellman equation and value functions
  • Know the difference between value-based and policy-based methods
  • Can explain the exploration vs exploitation tradeoff
  • Understand the fundamental differences between RL and supervised learning

RL vs Supervised Learning

| Aspect | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Correct label for each input | Reward signal (delayed, sparse) |
| Data | Fixed dataset, i.i.d. | Agent generates its own data by interacting |
| Goal | Minimize prediction error | Maximize cumulative reward |
| Sequential decisions | No | Yes — current action affects future states |
| Exploration | Not needed | Must explore to discover good actions |
| Credit assignment | Immediate | Delayed — which past action caused this reward? |

Why Should a DS Understand RL?

RL concepts apply directly to: (1) recommendation systems (a sequence of recommendations = sequential decisions), (2) ad bidding (bid → outcome → learn), (3) A/B test optimization (Thompson Sampling = bandit = simple RL), (4) dynamic pricing. You don't need to train AlphaGo, but you do need to understand the framework.

Markov Decision Process (MDP)

Components

An MDP is defined by the tuple $(S, A, P, R, \gamma)$:

| Component | Symbol | Meaning |
|---|---|---|
| States | $S$ | All possible situations the agent can be in |
| Actions | $A$ | All possible actions the agent can take |
| Transition | $P(s' \mid s, a)$ | Probability of reaching state $s'$ from $s$ after action $a$ |
| Reward | $R(s, a, s')$ | Immediate reward for taking action $a$ in state $s$ |
| Discount factor | $\gamma \in [0, 1]$ | How much to value future rewards vs immediate |

The Markov Property

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$

The future depends only on the current state, not on how we got there. This greatly simplifies the problem — there is no need to remember the full history.

A common interview challenge: "When does the Markov property fail?" — when the state representation is incomplete (e.g., a recommender that looks only at the current page and ignores browsing history → the browsing history needs to become part of the state).

Discount Factor $\gamma$

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

| $\gamma$ | Behavior |
|---|---|
| $\gamma = 0$ | Completely myopic — only care about immediate reward |
| $\gamma = 0.9$ | Balanced — future rewards matter but are discounted |
| $\gamma = 0.99$ | Very far-sighted — almost equal weight to the far future |
| $\gamma = 1$ | No discounting — can cause divergence in infinite-horizon MDPs |

$\gamma < 1$ guarantees that the infinite sum converges; it also reflects uncertainty (the farther into the future, the less certain) and time preference (reward now is worth more than reward later).
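To make the discount concrete, here is a minimal sketch (pure Python, with a made-up reward sequence) that computes $G_t$ for a few values of $\gamma$:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k}, accumulated right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]  # four steps of reward 1

print(discounted_return(rewards, 0.0))  # myopic: 1.0
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return(rewards, 1.0))  # no discount: 4.0
```

Note how $\gamma = 0$ keeps only the first reward, while $\gamma = 1$ weights every step equally.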

Policy

A policy $\pi$ maps states to actions:

  • Deterministic: $a = \pi(s)$
  • Stochastic: $a \sim \pi(a \mid s)$

The goal of RL: find the optimal policy $\pi^*$ that maximizes the expected cumulative reward.

Value Functions

State-Value Function $V^\pi(s)$

Expected cumulative reward starting from state $s$, following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right]$$

"In state $s$, if I keep making decisions according to policy $\pi$, how much total reward can I expect to collect?"

Action-Value Function $Q^\pi(s, a)$

Expected cumulative reward starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right]$$

"Take action $a$ in state $s$, then keep following $\pi$ — how much total reward can I expect?"

Relationship

$$V^\pi(s) = \sum_a \pi(a \mid s) \, Q^\pi(s, a)$$

$V$ = the average over all actions according to the policy. $Q$ = specific to a particular action.
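A quick numeric check of this identity for a single state with two actions (the numbers are made up for illustration):

```python
pi = [0.3, 0.7]   # pi(a | s) for actions a0, a1
Q = [10.0, 2.0]   # Q(s, a0), Q(s, a1)

# V(s) = sum_a pi(a|s) * Q(s, a)
V = sum(p * q for p, q in zip(pi, Q))
print(V)  # 0.3*10 + 0.7*2 = 4.4
```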

Bellman Equations

Bellman Expectation Equation

VV and QQ satisfy recursive relationships:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$

Intuition: current value = immediate reward + discounted future value. This decomposes long-term planning into a one-step lookahead.

Bellman Optimality Equation

For the optimal policy $\pi^*$:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right]$$

$V^*$ takes the max (choose the best action) rather than averaging over the policy. If you know $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Interview Key Insight

The Bellman equation converts the RL problem from "optimize over the entire trajectory" into "optimize one step at a time". This is the core idea of dynamic programming — break a large problem into overlapping subproblems.
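The Bellman optimality equation turns directly into the value iteration algorithm: repeatedly apply its right-hand side until $V$ stops changing. A minimal sketch on an invented two-state, two-action MDP (all transition and reward numbers are made up for illustration):

```python
import numpy as np

# Made-up MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # from state 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V       # Q[s, a] = R + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)       # Bellman optimality backup
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)       # greedy policy w.r.t. Q*
print(V, policy)
```

At convergence, $V$ satisfies the Bellman optimality equation, and the greedy policy read off from $Q$ is optimal for this toy MDP.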

Value-Based Methods

Q-Learning (Off-Policy)

Learn $Q^*$ directly without knowing the transition model:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
| Term | Meaning |
|---|---|
| $r + \gamma \max_{a'} Q(s', a')$ | TD target — estimate of the true value |
| $Q(s, a)$ | Current estimate |
| Difference | TD error — how wrong our estimate was |
| $\alpha$ | Learning rate |

Off-policy: the update uses $\max_{a'}$ (the best possible next action), regardless of which action the agent actually took. The agent can therefore explore freely (e.g., with $\epsilon$-greedy) while still learning the optimal Q.

SARSA (On-Policy)

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma Q(s', a') - Q(s, a)\right]$$

Here $a'$ is the action the agent actually chose (according to the current policy), not the $\max$.

On-policy: updates are based on what the agent actually does → it learns the value of the current policy (including the cost of exploration).

Q-Learning vs SARSA

| Aspect | Q-Learning | SARSA |
|---|---|---|
| Type | Off-policy | On-policy |
| Update target | $\max_{a'} Q(s', a')$ | $Q(s', a')$ (actual next action) |
| What it learns | $Q^*$ (optimal Q) | $Q^\pi$ (Q of the current policy) |
| Exploration | Can explore aggressively | Exploration affects the learned Q |
| Safety | May learn a risky optimal policy | Learns a safer policy (accounts for exploration noise) |
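The two update rules differ only in their target. A side-by-side sketch on a single made-up transition, where the agent's actual next action is an exploratory (suboptimal) one:

```python
alpha, gamma = 0.5, 0.9

# Made-up Q table: state 's', next state 'sp' with two actions.
Q = {('s', 'a1'): 0.0, ('sp', 'a1'): 5.0, ('sp', 'a2'): 1.0}

r = 1.0
a_actual = 'a2'  # the exploratory action actually taken in s'

# Q-learning: bootstrap from the best next action, ignoring a_actual
q_target = r + gamma * max(Q[('sp', 'a1')], Q[('sp', 'a2')])
q_learning = Q[('s', 'a1')] + alpha * (q_target - Q[('s', 'a1')])

# SARSA: bootstrap from the action actually taken
s_target = r + gamma * Q[('sp', a_actual)]
sarsa = Q[('s', 'a1')] + alpha * (s_target - Q[('s', 'a1')])

print(q_learning)  # 0 + 0.5 * (1 + 0.9*5 - 0) = 2.75
print(sarsa)       # 0 + 0.5 * (1 + 0.9*1 - 0) = 0.95
```

Q-learning credits the state with the optimal continuation; SARSA's smaller update reflects the cost of the exploratory action actually taken.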

Deep Q-Network (DQN)

When the state space is too large for a table → use a neural network to approximate $Q(s, a; \theta)$:

$$\mathcal{L} = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

Key innovations (Mnih et al., 2015):

| Innovation | Why Needed |
|---|---|
| Experience replay | Store transitions in a buffer → sample random batches → breaks correlation between consecutive samples |
| Target network $\theta^-$ | Separate network for the TD target → updated periodically → stabilizes training (avoids chasing a moving target) |
| $\epsilon$-greedy exploration | With probability $\epsilon$ take a random action, otherwise $\arg\max Q$ |
import random

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),  # output Q(s, a) for each action
        )

    def forward(self, state):
        return self.net(state)  # [batch, action_dim]

# Select action: epsilon-greedy
def select_action(state, q_net, action_dim, epsilon):
    if random.random() < epsilon:
        return random.randint(0, action_dim - 1)  # explore
    with torch.no_grad():
        return q_net(state).argmax().item()  # exploit

Policy-Based Methods

Why Not Just Q-Learning?

| Limitation of Value-Based | How Policy-Based Solves It |
|---|---|
| Discrete actions only ($\arg\max$ over actions) | Continuous actions (policy outputs a distribution) |
| Can't represent stochastic optimal policies | Naturally stochastic ($\pi(a \mid s)$ is a distribution) |
| Small changes in Q → large policy changes | Smooth policy updates |

REINFORCE (Monte Carlo Policy Gradient)

Directly optimize the policy $\pi_\theta(a \mid s)$ by gradient ascent on the expected return:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot G_t\right]$$

  • $G_t$ = total return from time step $t$ onwards
  • $\nabla_\theta \log \pi_\theta$ = score function — the direction in which to update the policy
  • Intuition: if action $a$ led to a high return → increase $\pi(a \mid s)$; a low return → decrease it.

Problem: $G_t$ has very high variance (it depends on the entire future trajectory) → slow convergence.

Baseline and Advantage

Subtract a baseline $b(s)$ to reduce variance without introducing bias:

$$\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot (G_t - b(s))\right]$$

A natural choice of baseline: $V(s)$.

Advantage function: $A(s, a) = Q(s, a) - V(s)$ — how much better action $a$ is than the average.

$$\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot A(s, a)\right]$$

$A > 0$ → the action is better than average → increase its probability. $A < 0$ → worse than average → decrease it.
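For a stateless softmax policy over three actions, the advantage-weighted score-function update can be sketched in a few lines (the sampled action, advantage estimate, and step size are all made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros(3)  # logits for 3 actions (uniform policy)
pi = softmax(theta)

a = 0      # sampled action (fixed here for illustration)
A = 2.0    # made-up advantage estimate for that action

# For softmax logits: grad of log pi(a) w.r.t. theta = one_hot(a) - pi
one_hot = np.eye(3)[a]
grad_log_pi = one_hot - pi

theta_new = theta + 0.1 * grad_log_pi * A  # gradient ascent step
pi_new = softmax(theta_new)

print(pi, pi_new)  # probability of action 0 increases since A > 0
```

With a negative advantage the same update would push $\pi(a \mid s)$ down instead.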

Actor-Critic

Combine the best of both worlds:

  • Actor (policy $\pi_\theta$): decides which action to take
  • Critic (value $V_\phi$): evaluates how good the action was
Actor: π_θ(a|s) → action
          ↓
    Environment → reward, next state
          ↓
Critic: V_ϕ(s) → evaluate (compute advantage)
          ↓
    Update actor using advantage
    Update critic using TD error

A2C / A3C

Advantage Actor-Critic (A2C):

  • The critic estimates $V(s)$
  • Advantage = $r + \gamma V(s') - V(s)$ (the TD error as an advantage estimate)
  • Actor update: $\nabla_\theta J \propto \nabla_\theta \log \pi_\theta(a \mid s) \cdot A$

A3C (Asynchronous): multiple agents explore in parallel → faster training and more diverse experience.

PPO (Proximal Policy Optimization)

The de facto standard for modern RL:

$$\mathcal{L}^{\text{CLIP}} = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$ is the probability ratio.

Core idea: limit the size of each policy update → avoid catastrophic policy collapse → far more stable than vanilla policy gradient.

PPO is used in ChatGPT's RLHF (Reinforcement Learning from Human Feedback) stage.
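What the clip actually does, on made-up numbers: with a positive advantage, the objective's contribution is capped once the ratio exceeds $1+\epsilon$, so there is no incentive to push the policy further in a single update:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)  # pessimistic bound

A = 1.0  # made-up positive advantage
for r in [0.5, 1.0, 1.5, 3.0]:
    print(r, ppo_clip_objective(r, A))
# ratio 3.0 with A > 0 is capped at 1.2 * A: no reward for moving further
```

Taking the min makes the bound pessimistic: the clip only ever removes the incentive for large policy changes, never adds one.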

Exploration vs Exploitation

| Strategy | How | Pros | Cons |
|---|---|---|---|
| $\epsilon$-greedy | With prob $\epsilon$: random; else: best | Simple | Random exploration is inefficient |
| Softmax / Boltzmann | $P(a) \propto \exp(Q(a)/\tau)$ | Smooth, respects Q values | Temperature $\tau$ needs tuning |
| UCB | Choose $\arg\max\left(Q(a) + c\sqrt{\frac{\ln t}{N(a)}}\right)$ | Optimism in the face of uncertainty | Needs counts $N(a)$ |
| Thompson Sampling | Sample from the posterior, pick the best | Bayesian, naturally balances | Needs posterior updates |
| Intrinsic motivation | Add a curiosity reward for novel states | Explores systematically | Complex implementation |
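A sketch of the UCB score with made-up counts: the rarely-pulled arm wins despite a lower mean reward estimate, because its uncertainty bonus is larger:

```python
import math

def ucb_score(q, n_pulls, t, c=2.0):
    # Q estimate plus an optimism bonus that shrinks as the arm is pulled more
    return q + c * math.sqrt(math.log(t) / n_pulls)

t = 100                  # total pulls so far
arms = {'a': (0.6, 90),  # (estimated Q, times pulled): well-explored arm
        'b': (0.5, 10)}  # under-explored arm with lower estimate

scores = {name: ucb_score(q, n, t) for name, (q, n) in arms.items()}
best = max(scores, key=scores.get)
print(scores, best)  # 'b' wins: lower Q, much larger uncertainty bonus
```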

Classic Interview Question

"What is the exploration vs exploitation tradeoff?" — Exploitation: pick the known-best action (maximize short-term reward). Exploration: try new actions (one of them might be better). Too much exploitation → stuck at a local optimum. Too much exploration → resources wasted on bad actions. The best strategy sits between the two — $\epsilon$-greedy and Thompson Sampling are the most common solutions.

Reward Design

Reward design is the most important and most difficult part of RL:

Common Pitfalls

| Pitfall | Example | Consequence |
|---|---|---|
| Reward hacking | Robot told to walk fast → learns to fall forward very quickly | Optimizes the reward but not the intent |
| Sparse rewards | Game: reward only on win/lose | Agent explores randomly for a long time → very slow learning |
| Misaligned rewards | Optimize clicks → recommend clickbait | Short-term reward ↑ but long-term user satisfaction ↓ |

Reward Shaping

Add intermediate rewards to guide learning without changing the optimal policy:

$$R' = R + \gamma \Phi(s') - \Phi(s)$$

where $\Phi$ is a potential function. As long as the shaping reward is potential-based, the optimal policy is unchanged — shaping only speeds up learning.
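Why potential-based shaping preserves the optimal policy: along any trajectory the $\gamma \Phi(s') - \Phi(s)$ terms telescope, so the shaped return differs from the original only by a constant that depends on the start state. A quick check with made-up numbers:

```python
# Made-up episode: states s0..s3, rewards r0..r2.
gamma = 0.9
states = ['s0', 's1', 's2', 's3']   # s3 is terminal
rewards = [0.0, 0.0, 10.0]
phi = {'s0': 1.0, 's1': 4.0, 's2': 8.0, 's3': 0.0}  # potential, 0 at terminal

def ret(rs, gamma):
    return sum(gamma**k * r for k, r in enumerate(rs))

# Shaped reward at step k: r_k + gamma * phi(s_{k+1}) - phi(s_k)
shaped = [r + gamma * phi[states[k + 1]] - phi[states[k]]
          for k, r in enumerate(rewards)]

# The shaped return differs from the original only by -phi(s0),
# so the ranking of policies from any fixed start state is preserved.
print(ret(rewards, gamma), ret(shaped, gamma))
assert abs(ret(shaped, gamma) - (ret(rewards, gamma) - phi['s0'])) < 1e-9
```

The intermediate rewards guide the agent toward the goal (here, toward higher-potential states), without changing which policy is best.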

Real-World Use Cases

Case 1: Recommendation Systems — RL for Sequential Recommendation

Recommendation is not one-shot classification — it is a sequence of decisions:

| MDP Component | In Recommendation |
|---|---|
| State | User profile + browsing history + context |
| Action | Which item(s) to recommend |
| Reward | Click, purchase, session length, user return |
| Transition | User's updated state after seeing the recommendation |
| $\gamma$ | How much to value long-term engagement vs immediate clicks |

Interview follow-up: "Why not use supervised learning?" — Supervised learning only optimizes a one-step prediction. RL considers the long term: recommending clickbait → immediate clicks ↑ but the user leaves the platform → long-term value ↓. RL can learn to sacrifice short-term clicks for long-term retention.

Case 2: Ad Bidding — Dynamic Bidding

| MDP Component | In Ad Bidding |
|---|---|
| State | Ad slot features, budget remaining, time of day |
| Action | Bid amount |
| Reward | Conversion value − bid cost |
| Constraint | Budget must last until the end of the campaign |

RL can learn a dynamic bidding strategy — adjusting bids based on the remaining budget and expected ROI, which is more efficient than a fixed-bid strategy.

Case 3: RLHF — How ChatGPT Is Trained

| Step | What Happens |
|---|---|
| 1. Pre-train | GPT does next-token prediction on a large text corpus (supervised) |
| 2. SFT | Fine-tune on human-written responses (supervised fine-tuning) |
| 3. Reward model | Train a model to predict human preference (which response is better) |
| 4. RLHF | Use PPO to optimize GPT's policy to maximize the reward model's score |

The RLHF reward = human preference (better aligned with user intent than next-token prediction alone). PPO's clipping keeps GPT from drifting too far from the SFT model.

Method Comparison

| Method | Type | Pros | Cons | Best For |
|---|---|---|---|---|
| Q-Learning | Value, off-policy | Simple, sample efficient | Discrete actions only | Small discrete action spaces |
| DQN | Value, off-policy | Handles large state spaces | Still discrete actions, can be unstable | Atari games, discrete control |
| REINFORCE | Policy, on-policy | Continuous actions, simple | High variance, slow | Simple continuous control |
| A2C/A3C | Actor-Critic | Lower variance than REINFORCE | More complex | General continuous control |
| PPO | Actor-Critic | Stable, easy to tune | Sample inefficient | Industry standard (robotics, RLHF) |

Hands-on: RL in Python

Q-Learning (Tabular)

import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])  # explore
        return self.Q[state].argmax()  # exploit

    def update(self, state, action, reward, next_state):
        td_target = reward + self.gamma * self.Q[next_state].max()
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.lr * td_error

DQN (PyTorch)

import torch
import torch.nn as nn
from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, x):
        return self.net(x)

# Training: sample batch from replay buffer → compute TD loss → update

Interview Signals

What interviewers listen for:

  • You can describe a real-world problem in the MDP framework (states, actions, rewards)
  • You know the intuition behind the Bellman equation (current value = immediate reward + discounted future value)
  • You can distinguish where value-based vs policy-based methods are appropriate
  • You understand the exploration vs exploitation tradeoff and the common strategies
  • You know the fundamental differences between RL and supervised learning

Practice

Flashcards


What is the most fundamental difference between RL and supervised learning?

SL: a correct label for each input (i.i.d. data, immediate feedback). RL: only a reward signal (delayed, sparse, sequential decisions; the agent generates its own data through interaction). RL's credit assignment problem — which past action caused the current reward? — is the core challenge.


Quiz


What does the Markov property of an MDP mean?
