Backpropagation
Interview Focus
Interviewers often ask you to compute gradients for a small network by hand, explain the chain rule on a computational graph, or compare optimizers. A deep understanding of backprop, beyond simply calling loss.backward(), is what separates strong candidates from weak ones.
The Chain Rule
Backpropagation is the chain rule of calculus applied systematically to a computational graph.
For a composition $y = f(g(x))$ with $u = g(x)$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

For a deeper composition $y = f(g(h(x)))$:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$
Each intermediate derivative is a local gradient: the gradient of a node with respect to its immediate input. The essence of backprop is chaining these local gradients together.
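A quick autograd check of the chain rule (a minimal sketch; the functions and values here are chosen for illustration):

```python
import torch

# Chain rule on y = f(g(x)) with g(x) = 3x + 1 and f(u) = u**2:
# dy/dx = (dy/du) * (du/dx) = 2u * 3
x = torch.tensor(2.0, requires_grad=True)
u = 3 * x + 1        # local gradient du/dx = 3
y = u ** 2           # local gradient dy/du = 2u
y.backward()

manual = 2 * (3 * 2.0 + 1) * 3   # 2 * 7 * 3 = 42
print(x.grad.item(), manual)     # 42.0 42.0
```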
Computational Graphs
Any neural network computation can be represented as a DAG (directed acyclic graph):
- Nodes = operations (add, multiply, activation)
- Edges = tensor values (forward) and gradients (backward)
Worked Example
Consider $f(x, y, z) = (x + y) \cdot z$ with $x = -2$, $y = 5$, $z = -4$.
Forward pass: $q = x + y = 3$, then $f = q \cdot z = -12$.
Backward pass (reverse order, chain rule): $\frac{\partial f}{\partial q} = z = -4$, $\frac{\partial f}{\partial z} = q = 3$, and $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = -4 \cdot 1 = -4$; likewise $\frac{\partial f}{\partial y} = -4$.
Fan-out Rule (Multivariate Chain Rule)
When a variable feeds multiple downstream operations, its gradient is the sum of the gradients along all paths: $\frac{\partial L}{\partial x} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x}$. This comes up frequently in interviews.
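The fan-out rule is easy to verify with autograd (a minimal sketch; the two branches below are arbitrary):

```python
import torch

# x feeds two branches; the gradients along both paths add up:
# L = x**2 + 3x, so dL/dx = 2x + 3
x = torch.tensor(4.0, requires_grad=True)
a = x ** 2           # path 1
b = 3 * x            # path 2
loss = a + b
loss.backward()
print(x.grad.item())  # 2*4 + 3 = 11.0
```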
Common Local Gradients
Interviewers may ask you to state the local gradient of an operation on the spot:
| Operation | Forward | Local Gradient |
|---|---|---|
| Add | $z = x + y$ | $\partial z/\partial x = 1$, $\partial z/\partial y = 1$ |
| Multiply | $z = xy$ | $\partial z/\partial x = y$, $\partial z/\partial y = x$ |
| ReLU | $y = \max(0, x)$ | $1$ if $x > 0$, else $0$ |
| Sigmoid | $y = \sigma(x)$ | $\sigma(x)(1 - \sigma(x))$ |
| MatMul | $Z = XW$ | $\partial L/\partial X = \frac{\partial L}{\partial Z} W^\top$, $\partial L/\partial W = X^\top \frac{\partial L}{\partial Z}$ |
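These local gradients can be spot-checked with autograd (a minimal sketch with arbitrary values):

```python
import torch

# Multiply node: z = x * y, so dz/dx = y and dz/dy = x
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = x * y
z.backward()
print(x.grad.item(), y.grad.item())  # 5.0 3.0

# Sigmoid node: dy/dx = sigmoid(x) * (1 - sigmoid(x))
t = torch.tensor(0.7, requires_grad=True)
s = torch.sigmoid(t)
s.backward()
manual = (s * (1 - s)).item()
print(t.grad.item(), manual)  # the two values agree
```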
Forward and Backward Pass
Forward Pass
Compute the output layer by layer, caching intermediate values (the backward pass needs them):

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = \sigma(z^{(l)})$$

Cache: $a^{(l-1)}$ (used to compute $\partial L/\partial W^{(l)}$) and $z^{(l)}$ (used to compute $\sigma'(z^{(l)})$).
Backward Pass
Starting from the loss, compute gradients layer by layer in reverse:

$$\delta^{(L)} = \nabla_{a^{(L)}} L \odot \sigma'(z^{(L)}), \quad \delta^{(l)} = \big((W^{(l+1)})^\top \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$$

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^\top, \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

Here $\delta^{(l)}$ is the error signal of layer $l$, and $\odot$ is element-wise multiplication.
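The backward recursion can be checked against autograd on a tiny two-layer net (a minimal sketch; shapes are arbitrary and biases are omitted for brevity):

```python
import torch

# Manual backward pass vs. autograd for sigmoid-hidden-layer MSE net
torch.manual_seed(0)
X = torch.randn(4, 3)
y = torch.randn(4, 1)
W1 = torch.randn(3, 5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)

# Forward, caching a1 (the backward pass needs it)
a1 = torch.sigmoid(X @ W1)
y_hat = a1 @ W2
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Manual backward: delta2 = dL/dy_hat, then chain through W2 and sigmoid'
with torch.no_grad():
    delta2 = 2 * (y_hat - y) / y.numel()        # MSE derivative
    dW2 = a1.t() @ delta2
    delta1 = (delta2 @ W2.t()) * a1 * (1 - a1)  # sigmoid'(z1) = a1*(1-a1)
    dW1 = X.t() @ delta1

print(torch.allclose(dW2, W2.grad, atol=1e-5),
      torch.allclose(dW1, W1.grad, atol=1e-5))  # True True
```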
Memory Cost of Backprop
The forward pass must cache all intermediate activations, so memory grows with model depth. This is why training uses far more memory than inference. Remedy: gradient checkpointing, which caches only some layers and recomputes the rest when needed (trading compute for memory).
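A minimal checkpointing sketch using torch.utils.checkpoint (the block sizes here are arbitrary):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Two blocks; the second is checkpointed, so its internal activations
# are recomputed during backward instead of being cached.
block1 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())

x = torch.randn(16, 64, requires_grad=True)
h = block1(x)
out = checkpoint(block2, h, use_reentrant=False)  # compute traded for memory
out.sum().backward()
print(x.grad.shape)  # torch.Size([16, 64])
```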
Loss Functions
Comparison
| Loss | Formula | Task | Gradient | Properties |
|---|---|---|---|---|
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | Regression | $\frac{2}{n}(\hat{y} - y)$ | Penalizes large errors heavily |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Regression | $\frac{1}{n}\operatorname{sign}(\hat{y} - y)$ | Robust to outliers, non-smooth at 0 |
| Huber | MSE if error small, MAE if large | Regression | Smooth transition | Best of both worlds |
| BCE | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Binary clf | $\hat{y} - y$ (with sigmoid) | Pairs with a sigmoid output layer |
| CE | $-\sum_i y_i \log \hat{y}_i$ | Multi-class | $\hat{y} - y$ (with softmax) | Pairs with a softmax output layer |
Softmax + Cross-Entropy: The Perfect Pair
The gradient simplifies to $\frac{\partial L}{\partial z_i} = p_i - y_i$, remarkably clean. This is why frameworks provide a fused CrossEntropyLoss (internally it applies log_softmax first to avoid numerical issues, then computes the NLL loss).
Classic Interview Question
"Why not apply softmax, then log, then cross-entropy separately?" Because softmax values can underflow to 0, and $\log(0) = -\infty$. The fused computation uses the log-sum-exp trick to avoid this and is numerically more stable.
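The underflow is easy to demonstrate (a minimal sketch with deliberately extreme logits):

```python
import torch
import torch.nn.functional as F

# Naive log(softmax(z)) underflows to log(0) = -inf for extreme logits;
# log_softmax uses the log-sum-exp trick and stays finite.
z = torch.tensor([[1000.0, 0.0, -1000.0]])

naive = torch.log(torch.softmax(z, dim=1))
stable = F.log_softmax(z, dim=1)

print(naive)   # contains -inf entries
print(stable)  # all finite
```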
Learning Rate
| Learning Rate | Effect |
|---|---|
| Too large | Overshoots minima, loss oscillates or diverges |
| Too small | Convergence painfully slow, may get trapped in poor local minimum |
| Just right | Smooth convergence to good minimum |
Learning Rate Schedules
| Schedule | Formula / Behavior | Use Case |
|---|---|---|
| Step decay | Multiply $\eta$ by $\gamma$ every $k$ epochs | Classic, simple |
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$ | Modern default (smooth decay) |
| Warmup + decay | Linear increase for the first $k$ steps, then decay | Transformers (Adam + warmup) |
| One-cycle | Increase to max, then decrease to near zero | Fast training (Leslie Smith) |
| ReduceOnPlateau | Reduce when validation loss plateaus | Adaptive, easy to use |
```python
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle policy
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# In the training loop:
for epoch in range(num_epochs):
    train_one_epoch()
    scheduler.step()
```
Warmup for Transformers
Transformer training usually needs warmup: linearly increase the lr from 0 to the target over the first few thousand steps. Adam's moment estimates are inaccurate early in training, and a large lr combined with inaccurate moments produces unstable updates. Warmup lets the model reach the full lr only after the moments stabilize.
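A warmup-then-decay schedule can be sketched with LambdaLR (the step counts here are illustrative, and the dummy training loop skips gradient computation):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Linear warmup for warmup_steps, then linear decay to zero
warmup_steps, total_steps = 100, 1000
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                      # 0 -> 1 linearly
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
lrs = []
for _ in range(total_steps):
    optimizer.step()          # no-op here; real code computes grads first
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(max(lrs))  # peak lr = 1e-3, reached right after warmup
```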
Gradient Descent Variants
Batch vs Stochastic vs Mini-Batch
| Variant | Batch Size | Gradient Quality | Speed | GPU Utilization |
|---|---|---|---|---|
| Batch GD | $n$ (all data) | Exact, stable | Very slow per update | Full |
| SGD | 1 | Very noisy | Fastest updates | Poor |
| Mini-Batch | $m$ (32–256) | Balanced noise | Balanced | Best |
Mini-batch GD is the practical standard:
- Noise from sampling acts as implicit regularization (helps escape sharp minima)
- Exploits GPU parallelism (vectorized batch operations)
- Typical batch sizes: 32, 64, 128, 256
Batch Size Effects
| Small Batch | Large Batch |
|---|---|
| Noisier gradient → better generalization | Less noise → potentially sharp minima (worse generalization) |
| More updates per epoch → often faster in wall-clock time | Fewer updates but higher GPU utilization |
| Needs a smaller learning rate | Can scale learning rate linearly (linear scaling rule) |
Large Batch Training Challenge
Large batches tend to converge to sharp minima, which generalize worse. Remedies: the linear scaling rule (batch size ×2 → lr ×2), warmup, and the LARS/LAMB optimizers. This is why Google and Facebook need special techniques for distributed training.
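The linear scaling rule in code (the base values here are illustrative, not from any particular paper's recipe):

```python
# Linear scaling rule: when the batch grows k-fold, scale the learning
# rate k-fold, typically combined with warmup.
base_lr, base_batch = 0.1, 256
batch_size = 1024
scaled_lr = base_lr * batch_size / base_batch
print(scaled_lr)  # 0.4
```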
Gradient Descent Visualization
Explore how different optimizers navigate a loss surface:
(Interactive demo: optimizers descending the loss surface $f(x, y) = x^2 + 3y^2 + 2xy + 5$.)
Optimizers
SGD with Momentum
Accumulates past gradients to smooth updates and accelerate through narrow valleys:

$$v_t = \beta v_{t-1} + \nabla_\theta L, \quad \theta \leftarrow \theta - \eta v_t$$

Momentum ($\beta \approx 0.9$) acts like a heavy ball rolling on the loss surface: it accelerates along consistent gradient directions and dampens oscillating ones.
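The update rule in a few lines (a minimal sketch on a toy quadratic; hyperparameters are arbitrary):

```python
import torch

# Manual SGD with momentum on f(theta) = ||theta||^2 (gradient 2*theta):
# v accumulates past gradients; consistent directions accelerate.
torch.manual_seed(0)
theta = torch.randn(3)
v = torch.zeros(3)
beta, lr = 0.9, 0.1

for _ in range(200):
    g = 2 * theta            # gradient of the toy quadratic
    v = beta * v + g
    theta = theta - lr * v

print(theta.norm().item())   # near 0: converged to the minimum
```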
RMSProp
Adapts the learning rate per parameter using a running average of squared gradients:

$$s_t = \rho s_{t-1} + (1 - \rho) g_t^2, \quad \theta \leftarrow \theta - \frac{\eta}{\sqrt{s_t} + \epsilon} g_t$$

A large gradient gets a smaller effective lr (avoids overshoot); a small gradient gets a larger effective lr (speeds up progress). This fixes Adagrad's problem of the learning rate decaying to zero.
Adam (Adaptive Moment Estimation)
Combines momentum (first moment) and RMSProp (second moment) with bias correction:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \quad \theta \leftarrow \theta - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
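A good interview exercise is to reproduce one Adam step by hand and check it against torch.optim.Adam (a minimal sketch; the toy loss $L = \sum p^2$ gives gradient $g = 2p$):

```python
import torch

# First Adam step, so m and v start at zero
torch.manual_seed(0)
p0 = torch.randn(5)
ref = p0.clone().requires_grad_(True)

opt = torch.optim.Adam([ref], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
(ref ** 2).sum().backward()
opt.step()

g = 2 * p0
m = (1 - 0.9) * g                 # first moment after one step
v = (1 - 0.999) * g ** 2          # second moment after one step
m_hat = m / (1 - 0.9)             # bias correction with t = 1
v_hat = v / (1 - 0.999)
manual = p0 - 1e-3 * m_hat / (v_hat.sqrt() + 1e-8)

print(torch.allclose(ref.detach(), manual, atol=1e-6))  # True
```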
AdamW (Decoupled Weight Decay)
Standard Adam couples weight decay with the adaptive learning rate, so the effective decay strength varies inconsistently across parameters with large effective lrs.

AdamW decouples weight decay from the adaptive update:

$$\theta \leftarrow \theta - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta \right)$$

AdamW is the de facto standard for Transformer training (BERT and GPT both use it).
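Decoupling is visible in a one-line experiment (a minimal sketch; the lr and decay values are arbitrary):

```python
import torch

# With a zero gradient, AdamW still shrinks the weight by
# lr * weight_decay * w, independent of the gradient statistics.
w = torch.ones(1, requires_grad=True)
opt = torch.optim.AdamW([w], lr=0.1, weight_decay=0.5)
w.grad = torch.zeros(1)
opt.step()
print(w.item())  # 1 - 0.1 * 0.5 = 0.95
```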
Optimizer Comparison
| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| SGD + Momentum | Finds flatter minima → better generalization | Needs careful lr tuning + schedule | CNNs (ResNet), when best generalization matters |
| Adam | Fast convergence, adaptive, less tuning | May converge to sharper minima | Quick prototyping, default starting point |
| AdamW | Better weight decay behavior than Adam | — | Transformers (standard), most modern DL |
| LARS/LAMB | Handles very large batch training | Complex | Distributed training |
Adam vs SGD Generalization
Adam converges faster, but SGD with momentum often finds flatter minima that generalize better. Many SOTA results (especially in image classification) use SGD with a well-tuned cosine schedule. AdamW outperforms vanilla Adam on Transformers. Being able to articulate this tradeoff is a plus in interviews.
Gradient Clipping
Prevents exploding gradients by capping the gradient magnitude.

Clip by norm (more common): if $\lVert g \rVert > c$, rescale $g \leftarrow c \cdot \frac{g}{\lVert g \rVert}$.
```python
# PyTorch: clip gradients before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
Clip by value: Clamp each gradient element to [-threshold, threshold]. Less common because it changes gradient direction.
In interviews: gradient clipping is almost always used in RNN/LSTM training (BPTT is prone to exploding gradients), and it is also common in Transformer training.
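A quick sanity check that clipping by norm actually bounds the total gradient norm (a minimal sketch with deliberately huge inputs):

```python
import torch

# After clip_grad_norm_, the total gradient norm is at most max_norm
model = torch.nn.Linear(10, 1)
model(torch.randn(8, 10) * 100).sum().backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
total = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(total.item())  # <= 1.0
```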
Hands-on: Backpropagation in PyTorch
Training with Autograd
```python
import torch
import torch.nn as nn

# XOR: not linearly separable, needs a hidden layer
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)

optimizer = torch.optim.SGD(model.parameters(), lr=2.0)
loss_fn = nn.MSELoss()

for epoch in range(3000):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()   # clear previous gradients
    loss.backward()         # compute gradients via chain rule
    optimizer.step()        # update parameters

# [0,0]→0.02, [0,1]→0.97, [1,0]→0.97, [1,1]→0.03 ✓
```
Inspecting Gradients
```python
# Check gradient values (useful for debugging)
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm():.4f}")

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train()
    scheduler.step()
    print(f"lr = {scheduler.get_last_lr()[0]:.6f}")
```
Manual Gradient Computation (Interview Exercise)
```python
import torch

# Manual backprop for a single neuron: y = sigmoid(w*x + b)
x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(1.0)

# Forward
z = w * x + b
y_pred = torch.sigmoid(z)
loss = (y_pred - y_true) ** 2

# Autograd
loss.backward()
print(f"dL/dw (autograd) = {w.grad:.6f}")
print(f"dL/db (autograd) = {b.grad:.6f}")

# Manual: dL/dw = 2(y_pred - y_true) * sigmoid'(z) * x
#               = 2(y_pred - y_true) * y_pred * (1 - y_pred) * x
manual_dw = 2 * (y_pred - y_true) * y_pred * (1 - y_pred) * x
print(f"dL/dw (manual) = {manual_dw.item():.6f}")
# Should match!
```
Interview Signals
What interviewers listen for:
- You can compute gradients by hand on a computational graph (not just recite formulas)
- You know why the forward pass caches intermediate values (the memory cost of backprop)
- You can compare the tradeoffs of SGD / Adam / AdamW
- You understand why learning rate schedules matter (warmup for Transformers)
- You know what gradient clipping does and when to use it
Practice
Flashcards
What is the fan-out rule in backpropagation?
When a variable feeds multiple downstream operations, its gradient is the sum of the gradients along all paths. This is a direct application of the multivariate chain rule, and it frequently appears in computational-graph interview questions.
Quiz
What is the main advantage of mini-batch GD over full-batch GD?