RL Fundamentals
Interview Context
DS interviews rarely test deep algorithm implementation for RL; they test conceptual understanding — MDP formulation, exploration vs exploitation, reward design. ML Engineer interviews may probe DQN or policy-gradient details. For a DS, the most important things are understanding how RL differs from supervised learning, and how RL applies to settings like recommender systems and ad bidding.
What You Should Understand
- Can describe an RL problem using the MDP framework
- Understand the Bellman equation and the meaning of value functions
- Know the difference between value-based and policy-based methods
- Can explain the exploration vs exploitation tradeoff
- Understand the fundamental differences between RL and supervised learning
RL vs Supervised Learning
| Aspect | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Correct label for each input | Reward signal (delayed, sparse) |
| Data | Fixed dataset, i.i.d. | Agent generates its own data by interacting |
| Goal | Minimize prediction error | Maximize cumulative reward |
| Sequential decisions | No | Yes — current action affects future states |
| Exploration | Not needed | Must explore to discover good actions |
| Credit assignment | Immediate | Delayed — which past action caused this reward? |
Why does a DS need to understand RL?
RL concepts are directly applicable to: (1) recommender systems (a sequence of recommendations = sequential decisions), (2) ad bidding (bid → outcome → learn), (3) A/B test optimization (Thompson Sampling = a bandit = simple RL), (4) dynamic pricing. You don't need to train AlphaGo, but you do need to understand the framework.
Markov Decision Process (MDP)
Components
An MDP is defined by a tuple $(S, A, P, R, \gamma)$:
| Component | Symbol | Meaning |
|---|---|---|
| States | $S$ | All possible situations the agent can be in |
| Actions | $A$ | All possible actions the agent can take |
| Transition | $P(s' \mid s, a)$ | Probability of reaching state $s'$ from $s$ after action $a$ |
| Reward | $R(s, a)$ | Immediate reward for taking action $a$ in state $s$ |
| Discount factor | $\gamma \in [0, 1]$ | How much to value future rewards vs immediate |
The Markov Property
The future depends only on the current state, not on how we got there: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$. This greatly simplifies the problem — there is no need to remember the full history.
A common interview challenge: "When does the Markov property fail?" — when the state representation is incomplete (e.g. a recommender that only looks at the current page without the browsing history → include the history as part of the state).
Discount Factor
| $\gamma$ | Behavior |
|---|---|
| $\gamma = 0$ | Completely myopic — only care about immediate reward |
| $\gamma \approx 0.9$ | Balanced — future rewards matter but are discounted |
| $\gamma \approx 0.99$ | Very far-sighted — almost equal weight to the far future |
| $\gamma = 1$ | No discounting — can cause divergence in infinite-horizon MDPs |
A discount $\gamma < 1$ ensures the infinite sum converges; it also reflects uncertainty (the further in the future, the less certain) and time preference (a reward now is worth more than the same reward later).
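As a quick numeric check (a minimal sketch; the constant reward stream is invented for illustration), the discounted return of a 1-per-step reward stream approaches $1/(1-\gamma)$:

```python
# Discounted return: G = sum_t gamma^t * r_t.
# A constant reward of 1 per step converges to 1 / (1 - gamma).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0] * 1000
print(discounted_return(rewards, gamma=0.9))   # ≈ 1 / (1 - 0.9) = 10
print(discounted_return(rewards, gamma=0.99))  # ≈ 1 / (1 - 0.99) = 100
```

Note how a higher $\gamma$ makes the agent weigh a much longer effective horizon.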
Policy
A policy $\pi$ maps states to actions:
- Deterministic: $a = \pi(s)$
- Stochastic: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
The goal of RL: find the optimal policy $\pi^*$ that maximizes the expected cumulative reward.
Value Functions
State-Value Function
Expected cumulative reward starting from state $s$, following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

"In state $s$, if I keep making decisions according to policy $\pi$, how much total reward can I expect?"
Action-Value Function
Expected cumulative reward starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$

"In state $s$, take action $a$, then keep following $\pi$ — how much total reward can I expect?"
Relationship
$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$ — $V$ is $Q$ averaged over all actions according to the policy; $Q$ is specific to a particular action.
Bellman Equations
Bellman Expectation Equation
$V^\pi$ and $Q^\pi$ satisfy recursive relationships:

$$V^\pi(s) = \sum_a \pi(a \mid s)\left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right]$$

Intuition: current value = immediate reward + discounted future value. This decomposes long-term planning into a one-step lookahead.
Bellman Optimality Equation
For the optimal policy $\pi^*$:

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$$

Take the max (pick the best action) instead of averaging over the policy. If you know $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s, a)$.
Interview Key Insight
The Bellman equation turns the RL problem from "optimize over the entire trajectory" into "optimize one step at a time". This is the core idea of dynamic programming — break a big problem into overlapping subproblems.
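This one-step decomposition is exactly what value iteration exploits. A minimal sketch on a hypothetical 2-state, 2-action MDP (all transition probabilities and rewards are invented), repeatedly applying the Bellman optimality backup:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probs, R[s, a] = rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    V = np.max(R + gamma * P @ V, axis=1)

policy = np.argmax(R + gamma * P @ V, axis=1)  # act greedily w.r.t. the converged V*
print(V, policy)
```

After convergence, acting greedily with respect to $V^*$ recovers the optimal policy.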
Value-Based Methods
Q-Learning (Off-Policy)
Learn $Q^*$ directly, without knowing the transition model:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

| Term | Meaning |
|---|---|
| $r + \gamma \max_{a'} Q(s', a')$ | TD target — estimated true value |
| $Q(s, a)$ | Current estimate |
| $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ | TD error — how wrong our estimate was |
| $\alpha$ | Learning rate |
Off-policy: updates use $\max_{a'} Q(s', a')$ (the best possible next action), regardless of which action the agent actually took. The agent can therefore explore freely (e.g. with $\epsilon$-greedy) while still learning the optimal Q.
SARSA (On-Policy)
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$

Here $a'$ is the action the agent actually chose (according to the current policy), not $\arg\max_{a'}$.
On-policy: updates are based on what the agent actually does → it learns the value of the current policy (including the cost of exploration).
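A minimal numeric sketch of how the two targets diverge (the Q table and transition are invented): suppose the agent's exploratory next action happens to be the worse one.

```python
import numpy as np

Q = np.array([[1.0, 3.0],    # Q[s, a], invented values
              [2.0, 0.5]])
alpha, gamma = 0.1, 0.9
s, a, r, s_next = 0, 0, 1.0, 1
a_next = 1  # action actually taken in s_next (e.g. an exploratory move)

# Q-learning target uses the best next action, regardless of what was taken:
q_target = r + gamma * Q[s_next].max()        # 1 + 0.9 * 2.0 = 2.8
# SARSA target uses the action actually taken:
sarsa_target = r + gamma * Q[s_next, a_next]  # 1 + 0.9 * 0.5 = 1.45

Q_qlearning = Q[s, a] + alpha * (q_target - Q[s, a])
Q_sarsa = Q[s, a] + alpha * (sarsa_target - Q[s, a])
print(Q_qlearning, Q_sarsa)  # 1.18 vs 1.045
```

SARSA's lower update reflects the exploration cost baked into the actual next action.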
Q-Learning vs SARSA
| Aspect | Q-Learning | SARSA |
|---|---|---|
| Type | Off-policy | On-policy |
| Update target | $r + \gamma \max_{a'} Q(s', a')$ | $r + \gamma Q(s', a')$ (actual next action) |
| What it learns | $Q^*$ (optimal Q) | $Q^\pi$ (Q of the current policy) |
| Exploration | Can explore aggressively | Exploration affects learned Q |
| Safety | May learn a risky optimal policy | Learns a safer policy (accounts for exploration noise) |
Deep Q-Network (DQN)
When the state space is too large for a table → use a neural network $Q_\theta(s, a)$ to approximate $Q^*$:
Key innovations (Mnih et al., 2015):
| Innovation | Why Needed |
|---|---|
| Experience replay | Store transitions in buffer → sample random batches → breaks correlation between consecutive samples |
| Target network | Separate network for TD target → updated periodically → stabilizes training(avoid chasing moving target) |
| $\epsilon$-greedy exploration | With prob $\epsilon$ take a random action, otherwise $\arg\max_a Q_\theta(s, a)$ |
```python
import random
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),  # output Q(s, a) for each action
        )

    def forward(self, state):
        return self.net(state)  # [batch, action_dim]

# Select action: epsilon-greedy
def select_action(state, q_net, action_dim, epsilon):
    if random.random() < epsilon:
        return random.randint(0, action_dim - 1)  # explore
    with torch.no_grad():
        return q_net(state).argmax().item()  # exploit
```
Policy-Based Methods
Why Not Just Q-Learning?
| Limitation of Value-Based | How Policy-Based Solves It |
|---|---|
| Discrete actions only ($\arg\max$ over actions) | Continuous actions (policy outputs a distribution) |
| Can't represent stochastic optimal policies | Naturally stochastic ($\pi(a \mid s)$ is a distribution) |
| Small changes in Q → large policy changes | Smooth policy updates |
REINFORCE (Monte Carlo Policy Gradient)
Directly optimize the policy by gradient ascent on expected return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

- $G_t$ = total return from time step $t$ onwards
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ = score function — the direction in which to update the policy
- Intuition: if an action led to a high return → increase $\pi_\theta(a_t \mid s_t)$; low return → decrease it.
Problem: $G_t$ has very high variance (it depends on the entire future trajectory) → slow convergence.
Baseline and Advantage
Subtract a baseline $b(s_t)$ to reduce variance without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \big(G_t - b(s_t)\big) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

Natural choice for the baseline: $b(s) = V^\pi(s)$.
Advantage function: $A(s, a) = Q(s, a) - V(s)$ — how much better action $a$ is than the average.
$A > 0$ → the action is better than average → increase its probability. $A < 0$ → worse than average → decrease it.
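A minimal sketch of computing returns-to-go and advantages for one episode (rewards and critic estimates invented):

```python
import numpy as np

rewards = [0.0, 0.0, 1.0]   # hypothetical 3-step episode
values = [0.4, 0.5, 0.9]    # critic's V(s_t) estimates (invented)
gamma = 0.9

# Returns-to-go, folded from the end: G_t = r_t + gamma * G_{t+1}
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns = returns[::-1]

# Advantage against the baseline: A_t = G_t - V(s_t)
advantages = np.array(returns) - np.array(values)
print(returns)     # ≈ [0.81, 0.9, 1.0]
print(advantages)  # positive => the action beat the critic's expectation
```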
Actor-Critic
Combine the best of both worlds:
- Actor (policy ): Decides which action to take
- Critic (value ): Evaluates how good the action was
```
Actor: π_θ(a|s) → action
          ↓
Environment → reward, next state
          ↓
Critic: V_ϕ(s) → evaluate (compute advantage)
          ↓
Update actor using advantage
Update critic using TD error
```
A2C / A3C
Advantage Actor-Critic (A2C):
- Critic estimates $V_\phi(s)$
- Advantage: $\hat{A} = r + \gamma V_\phi(s') - V_\phi(s)$ (TD error as advantage estimate)
- Actor updates: $\nabla_\theta J = \mathbb{E}\!\left[ \hat{A}\, \nabla_\theta \log \pi_\theta(a \mid s) \right]$
A3C (Asynchronous): multiple agents explore in parallel → faster training and more diverse experience.
PPO (Proximal Policy Optimization)
The de facto standard for modern RL:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$ is the probability ratio.
Core idea: limit the size of each policy update → avoid catastrophic policy collapse → far more stable than vanilla policy gradient.
PPO is used in ChatGPT's RLHF (Reinforcement Learning from Human Feedback).
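The clipping itself is a one-liner. A minimal numpy sketch of the clipped surrogate for a batch (ratios and advantages invented):

```python
import numpy as np

ratios = np.array([0.5, 1.0, 1.5])      # pi_new / pi_old per sample (invented)
advantages = np.array([1.0, 1.0, 1.0])
eps = 0.2

unclipped = ratios * advantages
clipped = np.clip(ratios, 1 - eps, 1 + eps) * advantages
surrogate = np.minimum(unclipped, clipped)  # pessimistic bound per sample
print(surrogate)  # [0.5, 1.0, 1.2] — large ratios get no extra credit
loss = -surrogate.mean()                    # maximize surrogate = minimize its negative
```

The ratio 1.5 is clipped to 1.2: once the new policy has moved far enough from the old one, the gradient incentive to move further disappears.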
Exploration vs Exploitation
| Strategy | How | Pros | Cons |
|---|---|---|---|
| $\epsilon$-greedy | With prob $\epsilon$: random; else: best action | Simple | Random exploration is inefficient |
| Softmax / Boltzmann | $P(a) \propto e^{Q(a)/\tau}$ | Smooth, respects Q values | Temperature $\tau$ needs tuning |
| UCB | Choose $\arg\max_a \big[Q(a) + c\sqrt{\ln t / N(a)}\big]$ | Optimism in the face of uncertainty | Needs visit count $N(a)$ |
| Thompson Sampling | Sample from posterior, pick best | Bayesian, naturally balances | Needs posterior update |
| Intrinsic motivation | Add curiosity reward for novel states | Explores systematically | Complex implementation |
Classic interview question
"What is the exploration vs exploitation tradeoff?" — Exploitation: pick the best known action (maximize short-term reward). Exploration: try new actions (you might discover better ones). Too much exploitation → stuck in a local optimum. Too much exploration → resources wasted on bad actions. The best strategy sits in between — $\epsilon$-greedy and Thompson Sampling are the most common solutions.
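Thompson Sampling is easy to sketch for a Bernoulli bandit (a minimal sketch; the arm probabilities are invented, and Beta posteriors start uniform):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.3, 0.5, 0.7]     # hypothetical arm success probabilities
alpha = np.ones(3)           # Beta posterior: successes + 1
beta = np.ones(3)            # Beta posterior: failures + 1

for _ in range(2000):
    samples = rng.beta(alpha, beta)   # one draw per arm from its posterior
    arm = int(np.argmax(samples))     # play the arm with the best draw
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward              # Bayesian update of the played arm
    beta[arm] += 1 - reward

best = int(np.argmax(alpha / (alpha + beta)))
print(best)  # typically arm 2 (the p = 0.7 arm)
```

Uncertain arms produce wide posteriors and occasionally sample high — that is the exploration; confident good arms dominate — that is the exploitation.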
Reward Design
Reward design is the most important — and most difficult — part of RL:
Common Pitfalls
| Pitfall | Example | Consequence |
|---|---|---|
| Reward hacking | Robot told to walk fast → learns to fall forward very quickly | Optimizes reward but not intent |
| Sparse rewards | Game: only reward when win/lose | Agent explores randomly for long → very slow learning |
| Misaligned rewards | Optimize clicks → recommend clickbait | Short-term reward ↑ but long-term user satisfaction ↓ |
Reward Shaping
Add intermediate rewards to guide learning without changing the optimal policy:
$$F(s, s') = \gamma\, \Phi(s') - \Phi(s)$$

where $\Phi$ is a potential function. As long as the shaping reward is potential-based, the optimal policy is unchanged (shaping only speeds up learning).
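A minimal sketch of a potential-based shaping term, assuming a hypothetical one-dimensional environment with the goal at position 10:

```python
gamma = 0.99

def phi(state):
    # Hypothetical potential: negative distance to a goal at position 10
    return -abs(10 - state)

def shaped_reward(r, s, s_next):
    # F(s, s') = gamma * phi(s') - phi(s), added on top of the env reward r
    return r + gamma * phi(s_next) - phi(s)

# Moving toward the goal yields a positive shaping bonus:
print(shaped_reward(0.0, s=4, s_next=5))  # > 0 (got closer to the goal)
print(shaped_reward(0.0, s=5, s_next=4))  # < 0 (moved away)
```

Along any full trajectory the $\Phi$ terms telescope, which is why the shaped and unshaped problems share the same optimal policy.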
Real-World Use Cases
Case 1: Recommender Systems — RL for Sequential Recommendation
Recommendation is not one-shot classification — it is a sequence of decisions:
| MDP Component | In Recommendation |
|---|---|
| State | User profile + browsing history + context |
| Action | Which item(s) to recommend |
| Reward | Click, purchase, session length, user return |
| Transition | User's updated state after seeing recommendation |
| Discount $\gamma$ | How much to value long-term engagement vs immediate clicks |
Interview follow-up: "Why not just use supervised learning?" — Supervised learning only optimizes one-step prediction. RL considers the long term: recommending clickbait → immediate clicks ↑ but users leave the platform → long-term value ↓. RL can learn to sacrifice short-term clicks for long-term retention.
Case 2: Ad Bidding — Dynamic Bidding
| MDP Component | In Ad Bidding |
|---|---|
| State | Ad slot features, budget remaining, time of day |
| Action | Bid amount |
| Reward | Conversion value - bid cost |
| Constraint | Budget must last until end of campaign |
RL can learn a dynamic bidding strategy — adjusting bids based on the remaining budget and the expected ROI, which is more efficient than a fixed-bid strategy.
Case 3: RLHF — How ChatGPT Is Trained
| Step | What Happens |
|---|---|
| 1. Pre-train | GPT does next-token prediction on large amounts of text (supervised) |
| 2. SFT | Fine-tune on human-written responses (supervised fine-tuning) |
| 3. Reward model | Train a model to predict human preference(which response is better) |
| 4. RLHF | Use PPO to optimize GPT's policy to maximize reward model's score |
RLHF's reward = human preference (better aligned with user intent than next-token prediction). PPO's clipping mechanism keeps GPT from drifting too far from the SFT model.
Method Comparison
| Method | Type | Pros | Cons | Best For |
|---|---|---|---|---|
| Q-Learning | Value, off-policy | Simple, sample efficient | Discrete actions only | Small discrete action spaces |
| DQN | Value, off-policy | Handles large state spaces | Still discrete actions, unstable | Atari games, discrete control |
| REINFORCE | Policy, on-policy | Continuous actions, simple | High variance, slow | Simple continuous control |
| A2C/A3C | Actor-Critic | Lower variance than REINFORCE | More complex | General continuous control |
| PPO | Actor-Critic | Stable, easy to tune | Sample inefficient | Industry standard (robotics, RLHF) |
Hands-on: RL in Python
Q-Learning (Tabular)
```python
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])  # explore
        return self.Q[state].argmax()  # exploit

    def update(self, state, action, reward, next_state):
        td_target = reward + self.gamma * self.Q[next_state].max()
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.lr * td_error
```
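A usage sketch of the same update rule on a hypothetical 5-state chain (the environment and hyperparameters are invented for illustration). Because Q-learning is off-policy, even a purely random behavior policy recovers the optimal greedy policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2   # chain 0..4; action 1 = right, action 0 = left
Q = np.zeros((n_states, n_actions))
lr, gamma = 0.1, 0.9

for _ in range(2000):                      # episodes
    s = 0
    for _ in range(30):                    # step cap per episode
        a = int(rng.integers(n_actions))   # behave randomly (off-policy)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1      # reward only at the right end
        r = 1.0 if done else 0.0
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])
        s = s_next
        if done:
            break

print(Q.argmax(axis=1))  # greedy policy: 1 (right) in states 0-3; row 4 is terminal
```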
DQN (PyTorch)
```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, x):
        return self.net(x)

# Training: sample batch from replay buffer → compute TD loss → update
```
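The TD-loss step can be sketched in numpy (batch values invented; in practice `next_q` would come from the target network):

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 0.5])
dones = np.array([False, True, False])
# Q-values of next states from the *target* network, shape [batch, n_actions]:
next_q = np.array([[0.2, 0.8],
                   [0.5, 0.1],
                   [0.3, 0.9]])

# TD target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states
targets = rewards + gamma * next_q.max(axis=1) * (~dones)
print(targets)  # ≈ [1.792, 0.0, 1.391]
# The loss would be the MSE between Q(s, a) for the taken actions and these targets.
```

Masking with `done` matters: bootstrapping past a terminal state would invent value that does not exist.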
Interview Signals
What interviewers listen for:
- You can describe a real-world problem with the MDP framework (states, actions, rewards)
- You know the intuition of the Bellman equation (current value = immediate reward + discounted future value)
- You can distinguish where value-based vs policy-based methods are appropriate
- You understand the exploration vs exploitation tradeoff and the common strategies
- You know the fundamental differences between RL and supervised learning
Practice
Flashcards
What is the most fundamental difference between RL and supervised learning?
SL: a correct label for each input (i.i.d. data, immediate feedback). RL: only a reward signal (delayed, sparse; sequential decisions; the agent generates its own data through interaction). RL's credit assignment problem — which past action caused the current reward? — is the core challenge.
Quiz
What does the Markov property of an MDP mean?