Backpropagation

Interview Focus

Interviewers often ask you to compute gradients for a small network by hand, explain the chain rule on a computational graph, or compare optimizers. A deep understanding of backprop, not just the ability to call loss.backward(), is what separates strong candidates from weak ones.

The Chain Rule

Backpropagation is the chain rule of calculus applied systematically to a computational graph.

For a composition $f(g(x))$:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$$

For a deeper composition $f(g(h(x)))$:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

Each intermediate derivative is a local gradient: the gradient of a node with respect to its immediate input. The essence of backprop is chaining these local gradients together.

Computational Graphs

Any neural network computation can be represented as a DAG (directed acyclic graph):

  • Nodes = operations (add, multiply, activation)
  • Edges = tensor values (forward) and gradients (backward)

Worked Example

$$\mathcal{L} = (wx + b - y)^2$$

Forward pass:

  1. $z_1 = wx$
  2. $z_2 = z_1 + b$
  3. $z_3 = z_2 - y$
  4. $\mathcal{L} = z_3^2$

Backward pass (reverse order, chain rule):

  1. $\frac{\partial \mathcal{L}}{\partial z_3} = 2z_3$
  2. $\frac{\partial \mathcal{L}}{\partial z_2} = 2z_3 \cdot 1 = 2z_3$
  3. $\frac{\partial \mathcal{L}}{\partial w} = 2z_3 \cdot 1 \cdot x = 2(wx + b - y) \cdot x$
  4. $\frac{\partial \mathcal{L}}{\partial b} = 2z_3 \cdot 1 \cdot 1 = 2(wx + b - y)$
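The derivation above is easy to check with autograd. A quick sketch, with arbitrary example values (w = 0.5, b = 0.1, x = 2, y = 1) chosen for illustration:

```python
import torch

# Arbitrary example values for L = (w*x + b - y)^2
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

loss = (w * x + b - y) ** 2
loss.backward()

# Manual gradients from the backward pass above
z3 = (w * x + b - y).item()       # residual
manual_dw = 2 * z3 * x.item()     # dL/dw = 2*z3*x
manual_db = 2 * z3                # dL/db = 2*z3

assert abs(w.grad.item() - manual_dw) < 1e-6
assert abs(b.grad.item() - manual_db) < 1e-6
```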

Fan-out Rule (Multivariate Chain Rule)

When a variable feeds multiple downstream operations, its gradient is the sum of the gradients along all paths: $\frac{\partial \mathcal{L}}{\partial x} = \sum_i \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial x}$. This comes up often in interviews.
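A minimal sketch of the fan-out rule, using a toy graph where x feeds two operations:

```python
import torch

# x fans out into two downstream ops; L = z1 + z2
x = torch.tensor(2.0, requires_grad=True)
z1 = x ** 2          # dz1/dx = 2x
z2 = 3 * x           # dz2/dx = 3
loss = z1 + z2
loss.backward()

# Gradients from both paths accumulate: dL/dx = 2x + 3 = 7
assert x.grad.item() == 7.0
```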

Common Local Gradients

An interviewer may ask you to state the local gradient of an operation on the spot:

| Operation | Forward | Local Gradient |
|---|---|---|
| Add | $z = x + y$ | $\frac{\partial z}{\partial x} = 1$, $\frac{\partial z}{\partial y} = 1$ |
| Multiply | $z = xy$ | $\frac{\partial z}{\partial x} = y$, $\frac{\partial z}{\partial y} = x$ |
| ReLU | $z = \max(0, x)$ | $\frac{\partial z}{\partial x} = \mathbf{1}_{x > 0}$ |
| Sigmoid | $z = \sigma(x)$ | $\frac{\partial z}{\partial x} = z(1-z)$ |
| MatMul | $\mathbf{Z} = \mathbf{X}\mathbf{W}$ | $\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{X}^\top \delta$, $\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \delta \mathbf{W}^\top$ |
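Each entry can be verified against autograd. A quick sketch for the sigmoid row:

```python
import torch

x = torch.tensor(0.3, requires_grad=True)
z = torch.sigmoid(x)
z.backward()

# Local gradient from the table: dz/dx = z(1 - z)
assert torch.allclose(x.grad, (z * (1 - z)).detach())
```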

Forward and Backward Pass

Forward Pass

Compute the output layer by layer, caching intermediate values (the backward pass needs them):

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]} \qquad \mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$$

Cache: $\mathbf{a}^{[l-1]}$ (needed for $\partial \mathcal{L}/\partial \mathbf{W}^{[l]}$) and $\mathbf{z}^{[l]}$ (needed for $\sigma'$).

Backward Pass

Starting from the loss, compute gradients layer by layer in reverse:

$$\delta^{[l]} = \left((\mathbf{W}^{[l+1]})^\top \delta^{[l+1]}\right) \odot \sigma'(\mathbf{z}^{[l]})$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^\top \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \delta^{[l]}$$

$\delta^{[l]}$ is the error signal at layer $l$, and $\odot$ is element-wise multiplication.
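The recurrence can be checked against autograd on a tiny sigmoid network. In this sketch the layer sizes (3 → 4 → 1) and the loss $\mathcal{L} = \frac{1}{2}\sum a^2$ are arbitrary choices:

```python
import torch

torch.manual_seed(0)
# Tiny 3 -> 4 -> 1 sigmoid network; loss = 0.5 * sum(a2^2) (arbitrary choice)
W1 = torch.randn(4, 3, requires_grad=True); b1 = torch.randn(4, 1, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True); b2 = torch.randn(1, 1, requires_grad=True)
a0 = torch.randn(3, 1)

z1 = W1 @ a0 + b1; a1 = torch.sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = torch.sigmoid(z2)
loss = 0.5 * (a2 ** 2).sum()
loss.backward()

with torch.no_grad():
    # output delta: dL/da2 * sigma'(z2), where sigma'(z) = a(1 - a)
    delta2 = a2 * (a2 * (1 - a2))
    # recurrence: delta1 = (W2^T delta2) elementwise sigma'(z1)
    delta1 = (W2.T @ delta2) * (a1 * (1 - a1))

assert torch.allclose(W2.grad, delta2 @ a1.detach().T, atol=1e-6)
assert torch.allclose(W1.grad, delta1 @ a0.T, atol=1e-6)
assert torch.allclose(b1.grad, delta1, atol=1e-6)
```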

Memory Cost of Backprop

The forward pass must cache all intermediate activations, so memory grows with model depth. This is why training uses far more memory than inference. The fix: gradient checkpointing caches only some layers and recomputes the rest on demand, trading compute for memory.
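A sketch of checkpointing with `torch.utils.checkpoint.checkpoint_sequential`; the layer stack and sizes here are made up for illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Made-up deep stack; real models would be far larger
model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(4, 64, requires_grad=True)

# Split into 4 segments: only segment-boundary activations are cached;
# activations inside a segment are recomputed during the backward pass
out = checkpoint_sequential(model, 4, x)
out.sum().backward()

assert all(p.grad is not None for p in model.parameters())
```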

Loss Functions

Comparison

| Loss | Formula | Task | Gradient | Properties |
|---|---|---|---|---|
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Regression | $\frac{2}{n}(\hat{y}-y)$ | Penalizes large errors heavily |
| MAE | $\frac{1}{n}\sum\lvert y-\hat{y}\rvert$ | Regression | $\text{sign}(\hat{y}-y)/n$ | Robust to outliers, non-smooth at 0 |
| Huber | MSE if small, MAE if large | Regression | Smooth transition | Best of both worlds |
| BCE | $-[y\log\hat{y}+(1-y)\log(1-\hat{y})]$ | Binary clf | $\hat{y}-y$ (with sigmoid) | Sigmoid output layer |
| CE | $-\sum y_c \log \hat{y}_c$ | Multi-class | $\hat{y}-y$ (with softmax) | Softmax output layer |

Softmax + Cross-Entropy: The Perfect Pair

$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z_i} = \hat{y}_i - y_i$$

The gradient simplifies to $\hat{y} - y$, which is remarkably clean. This is why frameworks provide a fused CrossEntropyLoss (internally it applies log_softmax first to avoid numerical issues, then computes the NLL loss).

Classic Interview Question

"Why not apply softmax, then log, then cross-entropy?" Because softmax values can underflow to 0, giving $\log(0) = -\infty$. The fused computation uses the log-sum-exp trick to avoid this and is numerically stable.
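A quick numerical sketch of the failure mode, with an extreme logit chosen to force underflow:

```python
import torch
import torch.nn.functional as F

# Extreme logits: softmax underflows to exactly 0 for the small entries
logits = torch.tensor([[1000.0, 0.0, -1000.0]])

naive = torch.log(torch.softmax(logits, dim=-1))   # log(0) -> -inf
fused = F.log_softmax(logits, dim=-1)              # log-sum-exp trick, all finite

assert torch.isinf(naive).any()
assert torch.isfinite(fused).all()
```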

Learning Rate

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}$$

| $\eta$ | Effect |
|---|---|
| Too large | Overshoots minima, loss oscillates or diverges |
| Too small | Convergence painfully slow, may get trapped in a poor local minimum |
| Just right | Smooth convergence to a good minimum |

Learning Rate Schedules

| Schedule | Formula / Behavior | Use Case |
|---|---|---|
| Step decay | $\eta \to \eta/10$ every $k$ epochs | Classic, simple |
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{t}{T}\pi))$ | Modern default (smooth decay) |
| Warmup + decay | Linear increase for first $k$ steps, then decay | Transformers (Adam + warmup) |
| One-cycle | Increase to max, then decrease to near zero | Fast training (Leslie Smith) |
| ReduceOnPlateau | Reduce when validation loss plateaus | Adaptive, easy to use |
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle policy
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# In training loop:
for epoch in range(num_epochs):
    train_one_epoch()
    scheduler.step()

Warmup for Transformers

Transformer training typically needs warmup: linearly increase the lr from 0 to the target over the first few thousand steps. Adam's moment estimates are inaccurate early in training, and a large lr combined with inaccurate moments produces unstable updates. Warmup lets the moments stabilize before the model sees the full lr.
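A minimal warmup sketch with LambdaLR; the model, optimizer, and warmup_steps value are placeholders, and the decay phase that normally follows warmup is omitted for brevity:

```python
import torch

model = torch.nn.Linear(4, 4)                   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup_steps = 100

# lr ramps linearly from lr/warmup_steps to the target, then stays constant
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(10):
    optimizer.step()    # (gradient computation omitted in this sketch)
    scheduler.step()
```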

Gradient Descent Variants

Batch vs Stochastic vs Mini-Batch

| Variant | Batch Size | Gradient Quality | Speed | GPU Utilization |
|---|---|---|---|---|
| Batch GD | $N$ (all data) | Exact, stable | Very slow per update | Full |
| SGD | 1 | Very noisy | Fastest updates | Poor |
| Mini-Batch | $B$ (32-256) | Balanced noise | Balanced | Best |

Mini-batch GD is the practical standard:

  • Noise from sampling acts as implicit regularization (helps escape sharp minima)
  • Exploits GPU parallelism (vectorized batch operations)
  • Typical batch sizes: 32, 64, 128, 256

Batch Size Effects

| Small Batch | Large Batch |
|---|---|
| Noisier gradients → better generalization | Less noise → potentially sharp minima (worse generalization) |
| More updates per epoch | Fewer updates but higher GPU utilization |
| Needs a smaller learning rate | Can scale the learning rate linearly (linear scaling rule) |

Large Batch Training Challenge

Large batches tend to converge to sharp minima, which generalize worse. Fixes: the linear scaling rule (batch size ×2 → lr ×2), warmup, and the LARS/LAMB optimizers. This is why Google and Facebook need special techniques for distributed training.

Gradient Descent Visualization

Explore how different optimizers navigate a loss surface:

(Interactive demo: loss surface f(x, y) = x² + 3y² + 2xy + 5. Click the contour to set a starting point, choose an optimizer, then press Start.)

Optimizers

SGD with Momentum

Accumulates past gradients to smooth updates and accelerate through narrow valleys:

$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L} \qquad \theta_{t+1} = \theta_t - \eta v_t$$

Momentum ($\beta \approx 0.9$) acts like a heavy ball rolling on the loss surface: it accelerates along consistent gradient directions and dampens oscillating ones.
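The two update equations in plain Python, on the toy loss $\mathcal{L} = \theta^2$ (starting point and hyperparameters chosen arbitrarily):

```python
# Momentum on L = theta^2, so grad = 2*theta
theta, v = 5.0, 0.0
beta, eta = 0.9, 0.1

for _ in range(3):
    grad = 2 * theta
    v = beta * v + grad    # v_t = beta * v_{t-1} + grad
    theta -= eta * v       # theta_{t+1} = theta_t - eta * v_t

# velocity keeps growing while successive gradients point the same way
assert abs(v - 19.9) < 1e-9
assert abs(theta - 0.31) < 1e-9
```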

RMSProp

Adapts learning rate per-parameter using running average of squared gradients:

$$s_t = \beta s_{t-1} + (1 - \beta)(\nabla_\theta \mathcal{L})^2 \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \epsilon}} \nabla_\theta \mathcal{L}$$

Large gradients get a smaller effective lr (avoids overshooting); small gradients get a larger effective lr (speeds up progress). This fixes Adagrad's problem of the learning rate decaying monotonically toward zero.

Adam (Adaptive Moment Estimation)

Combines momentum (first moment) and RMSProp (second moment) with bias correction:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla_\theta \mathcal{L} \quad \text{(first moment / mean)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta \mathcal{L})^2 \quad \text{(second moment / variance)}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
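One Adam step worked by hand and checked against torch.optim.Adam. At $t = 1$ the bias correction makes $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the update collapses to $\eta g / (|g| + \epsilon)$; the parameter values are arbitrary:

```python
import torch

p = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.Adam([p], lr=0.1)   # default beta1=0.9, beta2=0.999, eps=1e-8

loss = (p ** 2).sum()
loss.backward()
g = p.grad.clone()
opt.step()

# First step: m_hat = g, v_hat = g^2, so the update is lr * g / (|g| + eps)
expected = torch.tensor([1.0, -2.0]) - 0.1 * g / (g.abs() + 1e-8)

assert torch.allclose(p.detach(), expected, atol=1e-6)
```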

AdamW (Decoupled Weight Decay)

In standard Adam, weight decay is coupled to the adaptive learning rate, so its effect is inconsistent across parameters with large effective lrs.

AdamW decouples weight decay from the adaptive update:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

AdamW is the de facto standard for Transformer training (BERT and GPT both use it).

Optimizer Comparison

| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| SGD + Momentum | Finds flatter minima → better generalization | Needs careful lr tuning + schedule | CNNs (ResNet), when best generalization matters |
| Adam | Fast convergence, adaptive, less tuning | May converge to sharper minima | Quick prototyping, default starting point |
| AdamW | Better weight decay behavior than Adam | — | Transformers (standard), most modern DL |
| LARS/LAMB | Handles very large batch training | Complex | Distributed training |

Adam vs SGD Generalization

Adam converges faster, but SGD with momentum often finds flatter minima that generalize better. Many SOTA results (especially in image classification) use SGD with a well-tuned cosine schedule. AdamW beats vanilla Adam on Transformers. Being able to articulate this tradeoff is a plus in interviews.

Gradient Clipping

Prevents exploding gradients by capping gradient magnitude:

Clip by norm (more common):

$$\text{if } \|\nabla\| > \text{threshold}: \quad \nabla \leftarrow \text{threshold} \cdot \frac{\nabla}{\|\nabla\|}$$
# PyTorch: clip gradients before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Clip by value: Clamp each gradient element to [-threshold, threshold]. Less common because it changes gradient direction.
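A sketch of value clipping with clip_grad_value_, using a hand-set gradient so the clamping is visible:

```python
import torch

p = torch.zeros(3, requires_grad=True)
p.grad = torch.tensor([5.0, -0.3, -7.0])   # hand-set gradient for illustration

# Clamp each element to [-1, 1]; note this changes the gradient's direction
torch.nn.utils.clip_grad_value_([p], clip_value=1.0)

assert torch.allclose(p.grad, torch.tensor([1.0, -0.3, -1.0]))
```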

In interviews: gradient clipping is almost always used in RNN/LSTM training (BPTT is prone to exploding gradients), and it is common in Transformer training as well.

Hands-on: Backpropagation in PyTorch

Training with Autograd

import torch
import torch.nn as nn

# XOR — not linearly separable, needs hidden layer
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)
loss_fn = nn.MSELoss()

for epoch in range(3000):
    pred = model(X)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()  # clear previous gradients
    loss.backward()        # compute gradients via chain rule
    optimizer.step()       # update parameters

# [0,0]→0.02, [0,1]→0.97, [1,0]→0.97, [1,1]→0.03 ✓

Inspecting Gradients

# Check gradient values (useful for debugging)
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm():.4f}")

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train()
    scheduler.step()
    print(f"lr = {scheduler.get_last_lr()[0]:.6f}")

Manual Gradient Computation (Interview Exercise)

import torch

# Manual backprop for a single neuron: y = sigmoid(w*x + b)
x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(1.0)

# Forward
z = w * x + b
y_pred = torch.sigmoid(z)
loss = (y_pred - y_true) ** 2

# Autograd
loss.backward()
print(f"dL/dw (autograd) = {w.grad:.6f}")
print(f"dL/db (autograd) = {b.grad:.6f}")

# Manual: dL/dw = 2(y_pred - y_true) * sigmoid'(z) * x
#                = 2(y_pred - y_true) * y_pred * (1 - y_pred) * x
manual_dw = 2 * (y_pred - y_true) * y_pred * (1 - y_pred) * x
print(f"dL/dw (manual)   = {manual_dw.item():.6f}")
# Should match!

Interview Signals

What interviewers listen for:

  • You can compute gradients by hand on a computational graph (not just recite formulas)
  • You know why the forward pass caches intermediate values (the memory cost of backprop)
  • You can compare the tradeoffs of SGD / Adam / AdamW
  • You understand why learning rate schedules matter (warmup for Transformers)
  • You know what gradient clipping does and when to use it

Practice

Flashcards

What is the fan-out rule in backpropagation?

When a variable feeds multiple downstream operations, its gradient is the sum of the gradients along all paths. This is a direct application of the multivariate chain rule, and it comes up often in computational-graph interview questions.

Quiz

What is the main advantage of mini-batch GD over full-batch GD?