Backpropagation

Interview Focus

Interviewers often ask you to compute gradients for a small network by hand, explain the chain rule on a computational graph, or compare optimizers. A deep understanding of backprop, not just the ability to call loss.backward(), is what separates strong candidates from weak ones.

The Chain Rule

Backpropagation is the chain rule of calculus applied systematically to a computational graph.

For a composition $f(g(x))$:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$$

For a deeper composition $f(g(h(x)))$:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

Each intermediate derivative is a local gradient: the gradient of a node with respect to its immediate input. The essence of backprop is chaining these local gradients together.

Computational Graphs

Any neural network computation can be represented as a DAG (directed acyclic graph):

  • Nodes = operations (add, multiply, activation)
  • Edges = tensor values (forward) and gradients (backward)

Worked Example

$$\mathcal{L} = (wx + b - y)^2$$

Forward pass:

  1. $z_1 = wx$
  2. $z_2 = z_1 + b$
  3. $z_3 = z_2 - y$
  4. $\mathcal{L} = z_3^2$

Backward pass (reverse order, chain rule):

  1. $\frac{\partial \mathcal{L}}{\partial z_3} = 2z_3$
  2. $\frac{\partial \mathcal{L}}{\partial z_2} = 2z_3 \cdot 1 = 2z_3$
  3. $\frac{\partial \mathcal{L}}{\partial w} = 2z_3 \cdot 1 \cdot x = 2(wx + b - y) \cdot x$
  4. $\frac{\partial \mathcal{L}}{\partial b} = 2z_3 \cdot 1 \cdot 1 = 2(wx + b - y)$
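The derivation above is easy to check with autograd. A quick sketch, with arbitrary example values (w = 0.5, b = 0.1, x = 2, y = 1) chosen for illustration:

```python
import torch

# Arbitrary example values for L = (w*x + b - y)^2
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

loss = (w * x + b - y) ** 2
loss.backward()

# Manual gradients from the backward pass above
z3 = (w * x + b - y).item()       # residual
manual_dw = 2 * z3 * x.item()     # dL/dw = 2*z3*x
manual_db = 2 * z3                # dL/db = 2*z3

assert abs(w.grad.item() - manual_dw) < 1e-6
assert abs(b.grad.item() - manual_db) < 1e-6
```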

Fan-out Rule (Multivariate Chain Rule)

When a variable feeds multiple downstream operations, its gradient is the sum of the gradients along all paths: $\frac{\partial \mathcal{L}}{\partial x} = \sum_i \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial x}$. This comes up often in interviews.
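A minimal sketch of the fan-out rule, using a toy graph where x feeds two operations:

```python
import torch

# x fans out into two downstream ops; L = z1 + z2
x = torch.tensor(2.0, requires_grad=True)
z1 = x ** 2          # dz1/dx = 2x
z2 = 3 * x           # dz2/dx = 3
loss = z1 + z2
loss.backward()

# Gradients from both paths accumulate: dL/dx = 2x + 3 = 7
assert x.grad.item() == 7.0
```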

Common Local Gradients

An interviewer may ask you to state the local gradient of an operation on the spot:

| Operation | Forward | Local Gradient |
|---|---|---|
| Add | $z = x + y$ | $\frac{\partial z}{\partial x} = 1$, $\frac{\partial z}{\partial y} = 1$ |
| Multiply | $z = xy$ | $\frac{\partial z}{\partial x} = y$, $\frac{\partial z}{\partial y} = x$ |
| ReLU | $z = \max(0, x)$ | $\frac{\partial z}{\partial x} = \mathbf{1}_{x > 0}$ |
| Sigmoid | $z = \sigma(x)$ | $\frac{\partial z}{\partial x} = z(1-z)$ |
| MatMul | $\mathbf{Z} = \mathbf{X}\mathbf{W}$ | $\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{X}^\top \delta$, $\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \delta \mathbf{W}^\top$ |
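Each entry can be verified against autograd. A quick sketch for the sigmoid row:

```python
import torch

x = torch.tensor(0.3, requires_grad=True)
z = torch.sigmoid(x)
z.backward()

# Local gradient from the table: dz/dx = z(1 - z)
assert torch.allclose(x.grad, (z * (1 - z)).detach())
```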

Forward and Backward Pass

Forward Pass

Compute the output layer by layer, caching intermediate values (the backward pass needs them):

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]} \qquad \mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$$

Cache: $\mathbf{a}^{[l-1]}$ (needed for $\partial \mathcal{L}/\partial \mathbf{W}^{[l]}$) and $\mathbf{z}^{[l]}$ (needed for $\sigma'$).

Backward Pass

Starting from the loss, compute gradients layer by layer in reverse:

$$\delta^{[l]} = \left((\mathbf{W}^{[l+1]})^\top \delta^{[l+1]}\right) \odot \sigma'(\mathbf{z}^{[l]})$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^\top \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \delta^{[l]}$$

$\delta^{[l]}$ is the error signal at layer $l$, and $\odot$ is element-wise multiplication.
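The recurrence can be checked against autograd on a tiny sigmoid network. In this sketch the layer sizes (3 → 4 → 1) and the loss $\mathcal{L} = \frac{1}{2}\sum a^2$ are arbitrary choices:

```python
import torch

torch.manual_seed(0)
# Tiny 3 -> 4 -> 1 sigmoid network; loss = 0.5 * sum(a2^2) (arbitrary choice)
W1 = torch.randn(4, 3, requires_grad=True); b1 = torch.randn(4, 1, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True); b2 = torch.randn(1, 1, requires_grad=True)
a0 = torch.randn(3, 1)

z1 = W1 @ a0 + b1; a1 = torch.sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = torch.sigmoid(z2)
loss = 0.5 * (a2 ** 2).sum()
loss.backward()

with torch.no_grad():
    # output delta: dL/da2 * sigma'(z2), where sigma'(z) = a(1 - a)
    delta2 = a2 * (a2 * (1 - a2))
    # recurrence: delta1 = (W2^T delta2) elementwise sigma'(z1)
    delta1 = (W2.T @ delta2) * (a1 * (1 - a1))

assert torch.allclose(W2.grad, delta2 @ a1.detach().T, atol=1e-6)
assert torch.allclose(W1.grad, delta1 @ a0.T, atol=1e-6)
assert torch.allclose(b1.grad, delta1, atol=1e-6)
```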

Memory Cost of Backprop

The forward pass must cache all intermediate activations, so memory grows with model depth. This is why training uses far more memory than inference. The fix: gradient checkpointing caches only some layers and recomputes the rest on demand, trading compute for memory.
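A sketch of checkpointing with `torch.utils.checkpoint.checkpoint_sequential`; the layer stack and sizes here are made up for illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Made-up deep stack; real models would be far larger
model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(4, 64, requires_grad=True)

# Split into 4 segments: only segment-boundary activations are cached;
# activations inside a segment are recomputed during the backward pass
out = checkpoint_sequential(model, 4, x)
out.sum().backward()

assert all(p.grad is not None for p in model.parameters())
```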

Loss Functions

Comparison

| Loss | Formula | Task | Gradient | Properties |
|---|---|---|---|---|
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Regression | $\frac{2}{n}(\hat{y}-y)$ | Penalizes large errors heavily |
| MAE | $\frac{1}{n}\sum\lvert y-\hat{y}\rvert$ | Regression | $\text{sign}(\hat{y}-y)/n$ | Robust to outliers, non-smooth at 0 |
| Huber | MSE if small, MAE if large | Regression | Smooth transition | Best of both worlds |
| BCE | $-[y\log\hat{y}+(1-y)\log(1-\hat{y})]$ | Binary clf | $\hat{y}-y$ (with sigmoid) | Sigmoid output layer |
| CE | $-\sum y_c \log \hat{y}_c$ | Multi-class | $\hat{y}-y$ (with softmax) | Softmax output layer |

Softmax + Cross-Entropy: The Perfect Pair

$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z_i} = \hat{y}_i - y_i$$

The gradient simplifies to $\hat{y} - y$, which is remarkably clean. This is why frameworks provide a fused CrossEntropyLoss (internally it applies log_softmax first to avoid numerical issues, then computes the NLL loss).

Classic Interview Question

"Why not apply softmax, then log, then cross-entropy?" Because softmax values can underflow to 0, giving $\log(0) = -\infty$. The fused computation uses the log-sum-exp trick to avoid this and is numerically stable.
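A quick numerical sketch of the failure mode, with an extreme logit chosen to force underflow:

```python
import torch
import torch.nn.functional as F

# Extreme logits: softmax underflows to exactly 0 for the small entries
logits = torch.tensor([[1000.0, 0.0, -1000.0]])

naive = torch.log(torch.softmax(logits, dim=-1))   # log(0) -> -inf
fused = F.log_softmax(logits, dim=-1)              # log-sum-exp trick, all finite

assert torch.isinf(naive).any()
assert torch.isfinite(fused).all()
```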

Learning Rate

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}$$

| $\eta$ | Effect |
|---|---|
| Too large | Overshoots minima, loss oscillates or diverges |
| Too small | Convergence painfully slow, may get trapped in a poor local minimum |
| Just right | Smooth convergence to a good minimum |

Learning Rate Schedules

| Schedule | Formula / Behavior | Use Case |
|---|---|---|
| Step decay | $\eta \to \eta/10$ every $k$ epochs | Classic, simple |
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{t}{T}\pi))$ | Modern default (smooth decay) |
| Warmup + decay | Linear increase for first $k$ steps, then decay | Transformers (Adam + warmup) |
| One-cycle | Increase to max, then decrease to near zero | Fast training (Leslie Smith) |
| ReduceOnPlateau | Reduce when validation loss plateaus | Adaptive, easy to use |
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle policy
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# In training loop:
for epoch in range(num_epochs):
    train_one_epoch()
    scheduler.step()

Warmup for Transformers

Transformer training typically needs warmup: linearly increase the lr from 0 to the target over the first few thousand steps. Adam's moment estimates are inaccurate early in training, and a large lr combined with inaccurate moments produces unstable updates. Warmup lets the moments stabilize before the model sees the full lr.
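A minimal warmup sketch with LambdaLR; the model, optimizer, and warmup_steps value are placeholders, and the decay phase that normally follows warmup is omitted for brevity:

```python
import torch

model = torch.nn.Linear(4, 4)                   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup_steps = 100

# lr ramps linearly from lr/warmup_steps to the target, then stays constant
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(10):
    optimizer.step()    # (gradient computation omitted in this sketch)
    scheduler.step()
```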

Gradient Descent Variants

Batch vs Stochastic vs Mini-Batch

| Variant | Batch Size | Gradient Quality | Speed | GPU Utilization |
|---|---|---|---|---|
| Batch GD | $N$ (all data) | Exact, stable | Very slow per update | Full |
| SGD | 1 | Very noisy | Fastest updates | Poor |
| Mini-Batch | $B$ (32-256) | Balanced noise | Balanced | Best |

Mini-batch GD is the practical standard:

  • Noise from sampling acts as implicit regularization (helps escape sharp minima)
  • Exploits GPU parallelism (vectorized batch operations)
  • Typical batch sizes: 32, 64, 128, 256

Batch Size Effects

| Small Batch | Large Batch |
|---|---|
| Noisier gradients → better generalization | Less noise → potentially sharp minima (worse generalization) |
| More updates per epoch | Fewer updates but higher GPU utilization |
| Needs a smaller learning rate | Can scale the learning rate linearly (linear scaling rule) |

Large Batch Training Challenge

Large batches tend to converge to sharp minima, which generalize worse. Fixes: the linear scaling rule (batch size ×2 → lr ×2), warmup, and the LARS/LAMB optimizers. This is why Google and Facebook need special techniques for distributed training.

Gradient Descent Visualization

Explore how different optimizers navigate a loss surface:

(Interactive demo: loss surface f(x, y) = x² + 3y² + 2xy + 5. Click the contour to set a starting point, choose an optimizer, then press Start.)

Optimizers

SGD with Momentum

Accumulates past gradients to smooth updates and accelerate through narrow valleys:

$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L} \qquad \theta_{t+1} = \theta_t - \eta v_t$$

Momentum ($\beta \approx 0.9$) acts like a heavy ball rolling on the loss surface: it accelerates along consistent gradient directions and dampens oscillating ones.
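The two update equations in plain Python, on the toy loss $\mathcal{L} = \theta^2$ (starting point and hyperparameters chosen arbitrarily):

```python
# Momentum on L = theta^2, so grad = 2*theta
theta, v = 5.0, 0.0
beta, eta = 0.9, 0.1

for _ in range(3):
    grad = 2 * theta
    v = beta * v + grad    # v_t = beta * v_{t-1} + grad
    theta -= eta * v       # theta_{t+1} = theta_t - eta * v_t

# velocity keeps growing while successive gradients point the same way
assert abs(v - 19.9) < 1e-9
assert abs(theta - 0.31) < 1e-9
```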

RMSProp

Adapts learning rate per-parameter using running average of squared gradients:

$$s_t = \beta s_{t-1} + (1 - \beta)(\nabla_\theta \mathcal{L})^2 \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \epsilon}} \nabla_\theta \mathcal{L}$$

Large gradients get a smaller effective lr (avoids overshooting); small gradients get a larger effective lr (speeds up progress). This fixes Adagrad's problem of the learning rate decaying monotonically toward zero.

Adam (Adaptive Moment Estimation)

Combines momentum (first moment) and RMSProp (second moment) with bias correction:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla_\theta \mathcal{L} \quad \text{(first moment / mean)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta \mathcal{L})^2 \quad \text{(second moment / variance)}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
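One Adam step worked by hand and checked against torch.optim.Adam. At $t = 1$ the bias correction makes $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the update collapses to $\eta g / (|g| + \epsilon)$; the parameter values are arbitrary:

```python
import torch

p = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.Adam([p], lr=0.1)   # default beta1=0.9, beta2=0.999, eps=1e-8

loss = (p ** 2).sum()
loss.backward()
g = p.grad.clone()
opt.step()

# First step: m_hat = g, v_hat = g^2, so the update is lr * g / (|g| + eps)
expected = torch.tensor([1.0, -2.0]) - 0.1 * g / (g.abs() + 1e-8)

assert torch.allclose(p.detach(), expected, atol=1e-6)
```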

AdamW (Decoupled Weight Decay)

In standard Adam, weight decay is coupled to the adaptive learning rate, so its effect is inconsistent across parameters with large effective lrs.

AdamW decouples weight decay from the adaptive update:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

AdamW is the de facto standard for Transformer training (BERT and GPT both use it).

Optimizer Comparison

| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| SGD + Momentum | Finds flatter minima → better generalization | Needs careful lr tuning + schedule | CNNs (ResNet), when best generalization matters |
| Adam | Fast convergence, adaptive, less tuning | May converge to sharper minima | Quick prototyping, default starting point |
| AdamW | Better weight decay behavior than Adam | — | Transformers (standard), most modern DL |
| LARS/LAMB | Handles very large batch training | Complex | Distributed training |

Adam vs SGD Generalization

Adam converges faster, but SGD with momentum often finds flatter minima that generalize better. Many SOTA results (especially in image classification) use SGD with a well-tuned cosine schedule. AdamW beats vanilla Adam on Transformers. Being able to articulate this tradeoff is a plus in interviews.

Gradient Clipping

Prevents exploding gradients by capping gradient magnitude:

Clip by norm (more common):

$$\text{if } \|\nabla\| > \text{threshold}: \quad \nabla \leftarrow \text{threshold} \cdot \frac{\nabla}{\|\nabla\|}$$
# PyTorch: clip gradients before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Clip by value: Clamp each gradient element to [-threshold, threshold]. Less common because it changes gradient direction.
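A sketch of value clipping with clip_grad_value_, using a hand-set gradient so the clamping is visible:

```python
import torch

p = torch.zeros(3, requires_grad=True)
p.grad = torch.tensor([5.0, -0.3, -7.0])   # hand-set gradient for illustration

# Clamp each element to [-1, 1]; note this changes the gradient's direction
torch.nn.utils.clip_grad_value_([p], clip_value=1.0)

assert torch.allclose(p.grad, torch.tensor([1.0, -0.3, -1.0]))
```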

In interviews: gradient clipping is almost always used in RNN/LSTM training (BPTT is prone to exploding gradients), and it is common in Transformer training as well.

Hands-on: Backpropagation in PyTorch

Training with Autograd

import torch
import torch.nn as nn

# XOR — not linearly separable, needs hidden layer
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)
loss_fn = nn.MSELoss()

for epoch in range(3000):
    pred = model(X)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()  # clear previous gradients
    loss.backward()        # compute gradients via chain rule
    optimizer.step()       # update parameters

# [0,0]→0.02, [0,1]→0.97, [1,0]→0.97, [1,1]→0.03 ✓

Inspecting Gradients

# Check gradient values (useful for debugging)
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm():.4f}")

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train()
    scheduler.step()
    print(f"lr = {scheduler.get_last_lr()[0]:.6f}")

Manual Gradient Computation (Interview Exercise)

import torch

# Manual backprop for a single neuron: y = sigmoid(w*x + b)
x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(1.0)

# Forward
z = w * x + b
y_pred = torch.sigmoid(z)
loss = (y_pred - y_true) ** 2

# Autograd
loss.backward()
print(f"dL/dw (autograd) = {w.grad:.6f}")
print(f"dL/db (autograd) = {b.grad:.6f}")

# Manual: dL/dw = 2(y_pred - y_true) * sigmoid'(z) * x
#                = 2(y_pred - y_true) * y_pred * (1 - y_pred) * x
manual_dw = 2 * (y_pred - y_true) * y_pred * (1 - y_pred) * x
print(f"dL/dw (manual)   = {manual_dw.item():.6f}")
# Should match!

Interview Signals

What interviewers listen for:

  • You can compute gradients by hand on a computational graph (not just recite formulas)
  • You know why the forward pass caches intermediate values (the memory cost of backprop)
  • You can compare the tradeoffs of SGD / Adam / AdamW
  • You understand why learning rate schedules matter (warmup for Transformers)
  • You know what gradient clipping does and when to use it

Practice

Flashcards

What is the fan-out rule in backpropagation?

When a variable feeds multiple downstream operations, its gradient is the sum of the gradients along all paths. This is a direct application of the multivariate chain rule, and it comes up often in computational-graph interview questions.

Quiz

What is the main advantage of mini-batch GD over full-batch GD?