Neural Networks Fundamentals
Interview Priority
Neural network fundamentals come up in almost every deep learning interview. Interviewers expect you to explain why each technique exists, not just what it does. Focus on the principles and tradeoffs of activation functions, initialization, normalization, and regularization.
The Perceptron
The simplest neural network unit — computes a weighted sum and applies a step function:

$$y = \begin{cases} 1 & \text{if } w^\top x + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

A single perceptron can only learn linearly separable functions. It cannot learn XOR — this limitation motivated the development of multi-layer networks.
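A quick way to make this concrete is to brute-force a perceptron over a grid of weights: a hand-picked hyperplane fits AND, but no setting on the grid fits XOR. The helper name and grid below are illustrative choices, not from the text above.

```python
import itertools
import numpy as np

def perceptron(x, w, b):
    # weighted sum followed by a hard threshold (step function)
    return int(np.dot(w, x) + b > 0)

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

# AND is linearly separable: one hand-picked hyperplane suffices
print(all(perceptron(x, (1, 1), -1.5) == (x[0] and x[1]) for x in X))  # True

# brute-force a coarse grid of (w1, w2, b): nothing reproduces XOR
grid = np.linspace(-2.0, 2.0, 9)
solved = any(
    all(perceptron(x, (w1, w2), b) == t for x, t in zip(X, xor_targets))
    for w1, w2, b in itertools.product(grid, grid, grid)
)
print(solved)  # False
```

The grid search is only an illustration, but the underlying fact is geometric: no single hyperplane separates {(0,1), (1,0)} from {(0,0), (1,1)}.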
Multi-Layer Perceptron (MLP)
An MLP stacks multiple layers of neurons with nonlinear activations:

$$h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$$

where $\sigma$ is a nonlinear activation function.
Why nonlinearity is essential: Without it, stacking layers is pointless — $W_2(W_1 x) = (W_2 W_1)x$ (just a single linear transformation). Nonlinearity is what lets the network approximate arbitrarily complex functions.
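The collapse is a two-line check with random matrices (the shapes are arbitrary):

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(4, 3), torch.randn(2, 4)
x = torch.randn(3)

# two stacked linear maps with no activation equal one precomputed linear map
print(torch.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x, atol=1e-5))  # True

# inserting a ReLU between them breaks the equivalence
print(torch.allclose(W2 @ torch.relu(W1 @ x), (W2 @ W1) @ x, atol=1e-5))
```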
Activation Functions
Comparison Table
| Activation | Formula | Output Range | Derivative | Usage |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | $[0, \infty)$ | 0 or 1 | Hidden layers (default) |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | $\alpha$ or 1 | Hidden layers (avoids dying ReLU) |
| GELU | $x \cdot \Phi(x)$ | $\approx (-0.17, \infty)$ | Smooth | Transformers (BERT, GPT) |
| Sigmoid | $\frac{1}{1 + e^{-x}}$ | $(0, 1)$ | $\sigma(x)(1 - \sigma(x))$, max=0.25 | Output layer (binary classification) |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | $1 - \tanh^2(x)$, max=1 | Hidden layers (legacy), LSTM gates |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | $(0, 1)$, sum=1 | — | Output layer (multi-class) |
ReLU and Its Variants
ReLU: $\text{ReLU}(x) = \max(0, x)$
- Pros: computationally efficient, mitigates vanishing gradients (gradient = 1 for positive values), promotes sparsity (many neurons output 0)
- Cons: Dying ReLU — if a neuron's input is always negative → gradient = 0 → it never updates → the neuron is "dead"
Leaky ReLU: $\max(\alpha x, x)$, typically $\alpha = 0.01$
Fixes dying ReLU — the negative side keeps a small gradient ($\alpha$), so the neuron still has a chance to recover.
GELU (Gaussian Error Linear Unit): $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF
A smooth approximation of ReLU. Essentially the standard in Transformers (BERT, GPT). Smoother than ReLU → more stable training.
Sigmoid: Why Not in Hidden Layers
Sigmoid's maximum derivative is 0.25. After $n$ layers:

$$\left|\frac{\partial \mathcal{L}}{\partial x}\right| \lesssim 0.25^n$$

10 layers → $0.25^{10} \approx 10^{-6}$ → vanishing gradients. So sigmoid is used only in output layers (binary classification) or gating mechanisms (LSTM forget gate), not in hidden layers.
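The decay is easy to observe with autograd: chain ten sigmoids and inspect the gradient at the input (a toy setup, not a real network):

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)  # each step multiplies the gradient by sigma' <= 0.25
h.backward()

print(x.grad.item())  # roughly 1e-7: ten factors of at most 0.25 compound
```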
Softmax: Details
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Converts logits → a probability distribution (all values > 0, sum = 1).
Numerical Stability
Computing $e^{x_i}$ directly can overflow when $x_i$ is large. In practice, subtract the max first: $\text{softmax}(x_i) = \frac{e^{x_i - \max_j x_j}}{\sum_k e^{x_k - \max_j x_j}}$. The result is unchanged (numerator and denominator are multiplied by the same constant), but overflow is avoided. This is a common interview follow-up.
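A minimal NumPy sketch of the max-subtraction trick (the function names are ours):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)                 # overflows for large logits
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - np.max(x))     # same ratios, but every exponent is <= 0
    return e / e.sum()

logits = np.array([1000.0, 1000.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(logits))  # [nan nan] -- exp(1000) overflows to inf
print(softmax_stable(logits))     # [0.5 0.5]
```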
Choosing Activation Functions
| Layer | Recommended | Why |
|---|---|---|
| Hidden layers (general) | ReLU or Leaky ReLU | Fast, avoids vanishing gradient |
| Hidden layers (Transformer) | GELU | Smooth, empirically better for attention |
| Output (binary classification) | Sigmoid | Maps to probability [0, 1] |
| Output (multi-class) | Softmax | Probability distribution over K classes |
| Output (regression) | None (linear) | Unbounded real-valued output |
| LSTM/GRU gates | Sigmoid (0-1 range = gate) | Controls flow of information |
| LSTM cell candidate | Tanh | Zero-centered, bounded |
Universal Approximation Theorem
A feedforward network with a single hidden layer and a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$.
The key nuance for interviews — the theorem guarantees existence, but not:
- how many neurons are needed (possibly exponentially many)
- that gradient descent can find the right weights
- that the network generalizes to unseen data
This is why depth matters — deep networks can represent certain functions with exponentially fewer parameters.
Vanishing and Exploding Gradients
In a deep network with $L$ layers, gradients involve a product of factors:

$$\frac{\partial \mathcal{L}}{\partial h^{(1)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \prod_{l=2}^{L} \frac{\partial h^{(l)}}{\partial h^{(l-1)}}$$

Each factor involves a weight matrix and an activation derivative. If the factors are consistently < 1 → gradients vanish exponentially. Consistently > 1 → gradients explode.
Solutions
| Problem | Solutions |
|---|---|
| Vanishing | ReLU (gradient = 1 for positive values), proper initialization (He/Xavier), residual connections (ResNet), LSTM cell state |
| Exploding | Gradient clipping (cap the gradient norm), proper initialization, BatchNorm |
| Both | Careful architecture design, normalization layers |
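Gradient clipping in PyTorch is a single call. The sketch below deliberately produces oversized gradients and rescales them; the model and the scaled inputs are made-up toys.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)

# large inputs -> large loss -> exploding-scale gradients
loss = model(torch.randn(64, 10) * 100).pow(2).mean()
loss.backward()

# rescale in place so the total gradient norm is at most max_norm;
# the call returns the norm measured before clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
post_clip_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(pre_clip_norm.item(), post_clip_norm.item())
```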
Sigmoid + Deep Networks = Disaster
Sigmoid max derivative = 0.25. 10 layers → gradient ≤ $0.25^{10} \approx 10^{-6}$. The early layers barely update. This is why deep networks were so hard to train before ~2010 — until ReLU and proper initialization arrived.
Weight Initialization
Proper initialization keeps the variance of activations and gradients stable across layers — too large and they explode, too small and they vanish.
Xavier / Glorot Initialization
Designed for sigmoid and tanh:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

This keeps each layer's output variance ≈ its input variance, so the signal neither shrinks nor grows with depth.
He / Kaiming Initialization
Designed for ReLU:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

ReLU zeroes out half the values → the output variance is halved. The factor of 2 compensates for this.
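The variance argument can be checked numerically. This simplified sketch uses only fan-in for the Xavier-style variance (the full Glorot formula averages fan-in and fan-out) and tracks the activation spread through 20 ReLU layers:

```python
import torch

torch.manual_seed(0)

def activation_std(gain, depth=20, n=256):
    # forward a batch through `depth` ReLU layers with Var(W) = gain / n
    x = torch.randn(1024, n)
    for _ in range(depth):
        W = torch.randn(n, n) * (gain / n) ** 0.5
        x = torch.relu(x @ W)
    return x.std().item()

xavier_std = activation_std(gain=1.0)  # Var(W) = 1/n: signal shrinks toward 0
he_std = activation_std(gain=2.0)      # Var(W) = 2/n: the factor 2 keeps std O(1)
print(xavier_std, he_std)
```

With gain 1 the second moment roughly halves per ReLU layer, so 20 layers shrink the signal by about $2^{-10}$; with gain 2 it stays stable.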
Why Initialization Matters
| Init Strategy | Activation | What Happens If Wrong |
|---|---|---|
| All zeros | Any | All neurons compute same function → no symmetry breaking → useless |
| Too large | Any | Activations explode → sigmoid saturates → vanishing gradient |
| Too small | Any | Activations → 0 → gradients → 0 → nothing learns |
| Xavier | ReLU | Variance shrinks by half each layer (doesn't account for ReLU) |
| He | Sigmoid/Tanh | Variance grows each layer (too aggressive for bounded activations) |
A Classic Interview Question
"Why can't we initialize all weights to zero?" — all neurons would compute the same function, receive the same gradient, and make the same update → they stay identical forever (symmetry breaking never happens). Random initialization is needed so neurons differentiate.
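The symmetry argument can be verified directly: initialize every weight to the same constant and inspect the first-layer gradient rows. The constant 0.5 and the toy shapes below are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)  # every neuron starts identical

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = (net(x) - y).pow(2).mean()
loss.backward()

g = net[0].weight.grad
# all three hidden neurons get the exact same gradient row, so they make
# the same update and remain clones of each other forever
print(torch.allclose(g[0], g[1]), torch.allclose(g[1], g[2]))  # True True
```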
Batch Normalization
Normalize the input to each layer across the mini-batch:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

where $\mu_B$, $\sigma_B^2$ are mini-batch statistics, and $\gamma$, $\beta$ are learnable parameters.
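In training mode, PyTorch's BatchNorm1d matches this formula exactly (with $\gamma = 1$, $\beta = 0$ at initialization); a quick check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8) * 3 + 5   # features with nonzero mean and scale

bn = nn.BatchNorm1d(8)
bn.train()                        # use mini-batch statistics
out = bn(x)

mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)        # biased variance, as in the formula
manual = (x - mu) / torch.sqrt(var + bn.eps)
print(torch.allclose(out, manual, atol=1e-5))  # True
```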
Why BatchNorm Works
| Benefit | Explanation |
|---|---|
| Reduces internal covariate shift | Each layer's input distribution keeps changing during training → BatchNorm stabilizes it |
| Allows higher learning rates | Normalized inputs → more stable gradients → a larger lr is safe |
| Mild regularization | Mini-batch statistics are noisy → an effect similar to dropout |
| Smooths loss landscape | Makes the optimization surface smoother → easier to optimize |
Train vs Inference
| Phase | Statistics Used |
|---|---|
| Training | Mini-batch mean and variance ($\mu_B$, $\sigma_B^2$) |
| Inference | Running averages accumulated during training (deterministic, batch-independent) |
BatchNorm vs LayerNorm
| | BatchNorm | LayerNorm |
|---|---|---|
| Normalizes across | Batch dimension (across samples) | Feature dimension (within a single sample) |
| Depends on batch size? | Yes (unstable with small batches) | No |
| Best for | CNNs, large-batch training | RNNs, Transformers (variable-length sequences) |
| At inference | Uses running stats | Same as training |
LayerNorm is the standard in Transformers — because sequence lengths vary, batch statistics are unreliable.
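The same kind of check works for LayerNorm: it normalizes each token's feature vector independently, so batch size and sequence length never enter the statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, seq_len, features), Transformer-style

ln = nn.LayerNorm(16)
out = ln(x)

# statistics over the last (feature) dim only: one mean/var per token
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps)
print(torch.allclose(out, manual, atol=1e-5))  # True
```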
Dropout
During training, randomly set each neuron's output to zero with probability $p$:

$$\tilde{h} = \frac{m \odot h}{1 - p}, \qquad m_i \sim \text{Bernoulli}(1 - p)$$

The scaling factor $\frac{1}{1-p}$ = inverted dropout — it keeps the expected output unchanged, so no adjustment is needed at inference.
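Inverted dropout in a few lines of NumPy (the helper name is ours), confirming that the $\frac{1}{1-p}$ rescaling preserves the expected activation:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(h, p, training=True):
    if not training:
        return h                            # inference: no mask, no rescale
    mask = rng.random(h.shape) >= p         # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)             # rescale survivors by 1/(1-p)

h = np.ones(100_000)
out = inverted_dropout(h, p=0.5)
print(out.mean())         # ~1.0: expectation matches the undropped activations
print((out == 0).mean())  # ~0.5: about half the units were zeroed
```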
Why Dropout Works
- Prevents co-adaptation: Each neuron must learn to be useful independently
- Implicit ensemble: Training with dropout ≈ averaging exponentially many sub-networks
- Regularization: theoretically related to L2 regularization (Gaussian dropout ≈ L2)
Dropout in Practice
| Context | Typical $p$ | Note |
|---|---|---|
| FC layers | 0.5 | Standard rate |
| Conv layers | 0.1-0.3 or none | Conv's spatial structure already has a regularizing effect |
| After embedding | 0.1-0.3 | Common in Transformers |
| At inference | 0 (off) | All neurons active |
Modern Alternatives to Dropout
In modern architectures, dropout is gradually being replaced by other regularization: BatchNorm (implicit regularization), data augmentation, weight decay, early stopping. Transformers still use dropout (attention dropout + residual dropout), but usually at a small rate (0.1).
Other Regularization Techniques
| Technique | How It Works | When to Use |
|---|---|---|
| Weight decay (L2) | Add $\lambda \|w\|_2^2$ to the loss | Almost always (standard in Adam/SGD) |
| Early stopping | Stop training when validation loss stops improving | The simplest and most underrated regularization |
| Data augmentation | Apply random transformations to training data | Images (flip, crop, rotate), text (synonym replacement) |
| Label smoothing | Replace hard targets 1/0 with $1 - \epsilon$ and $\epsilon/(K-1)$ | Prevents overconfident models |
| Mixup | Train on convex combinations of input pairs | Creates virtual training examples |
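Label smoothing is built into PyTorch's CrossEntropyLoss. A small example with made-up 4-class logits; note that PyTorch's variant spreads $\epsilon$ uniformly over all $K$ classes, so the true class gets $1 - \epsilon + \epsilon/K$:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[3.0, 0.1, -1.0, 0.5]])  # model is confident in class 0
target = torch.tensor([0])

hard = nn.CrossEntropyLoss()(logits, target)
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)

# the smoothed loss is larger here: part of the target mass sits on
# low-probability classes, penalizing the overconfident prediction
print(hard.item(), smooth.item())
```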
Hands-on: Neural Networks in PyTorch
MLP
```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(input_dim=784, hidden_dim=256, output_dim=10)
# Note: no activation after last layer — CrossEntropyLoss includes softmax
```
Training Loop
```python
import torch.optim as optim

# assumes num_epochs, train_loader, X_val, y_val are defined elsewhere
criterion = nn.CrossEntropyLoss()  # includes softmax internally
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(num_epochs):
    model.train()  # enable dropout + batch norm training mode
    for X_batch, y_batch in train_loader:
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()  # disable dropout + use running stats for batch norm
    with torch.no_grad():
        val_logits = model(X_val)
        val_loss = criterion(val_logits, y_val)
```
Initialization
```python
# He initialization for ReLU layers
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

model.apply(init_weights)
```
Real-World Use Cases
Case 1: Credit Card Fraud Detection — MLP vs Traditional ML
| Approach | Pros | Cons |
|---|---|---|
| GBM (LightGBM) | Handles tabular data best, interpretable, fast | Less flexible for raw features |
| MLP | Can learn from raw features, flexible | Needs more data, harder to interpret |
| Autoencoder | Unsupervised anomaly detection (learn normal → detect abnormal) | Doesn't directly optimize for fraud detection |
Interview follow-up: "When is a NN better than GBM?" — when features are raw/unstructured (images, text, sequences) or the dataset is very large (> 10M rows, where NN capacity can actually be used). For moderate-size tabular data, GBM is almost always better.
Case 2: Recommender Systems — Embedding Layers
The core contribution of neural networks to recommender systems is embedding layers — mapping high-cardinality categorical features (user ID, item ID) to dense low-dimensional vectors:
```python
class RecModel(nn.Module):
    def __init__(self, n_users, n_items, embed_dim=64):
        super().__init__()
        self.user_embed = nn.Embedding(n_users, embed_dim)
        self.item_embed = nn.Embedding(n_items, embed_dim)
        self.fc = nn.Sequential(
            nn.Linear(embed_dim * 2, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, user_ids, item_ids):
        u = self.user_embed(user_ids)
        i = self.item_embed(item_ids)
        return self.fc(torch.cat([u, i], dim=1)).squeeze()
```
An embedding layer is essentially a lookup table — but its weights are learned via backprop, so each user/item ends up with a meaningful vector.
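That lookup behavior is easy to verify: forwarding an index through nn.Embedding returns the corresponding row of its weight matrix (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)

ids = torch.tensor([3, 41, 3])
vecs = embed(ids)                             # shape (3, 64)

print(torch.equal(vecs[0], embed.weight[3]))  # True: pure row lookup
print(torch.equal(vecs[0], vecs[2]))          # True: same id -> same vector
```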
Case 3: NLP — From Bag-of-Words to Embeddings
Word embeddings (Word2Vec → BERT → GPT) revolutionized NLP:
- One-hot: 10K vocab → 10K-dim sparse vector (no semantic info)
- Word2Vec embedding: 10K vocab → 300-dim dense vector (king - man + woman ≈ queen)
- BERT contextual embedding: the same word gets different embeddings in different contexts
Knowing this progression matters in interviews — moving from sparse representations to learned dense embeddings is the core contribution of neural networks to NLP.
Interview Signals
What interviewers listen for:
- You know why nonlinearity is needed (without it → stacked layers collapse into a single linear layer)
- You can explain the cause of and fixes for vanishing gradients (ReLU, initialization, skip connections)
- You know which activation Xavier vs He initialization is designed for
- You can compare the use cases of BatchNorm vs LayerNorm
- You understand the intuition behind dropout (implicit ensemble, preventing co-adaptation)
Practice
Flashcards
Why can't a single perceptron learn XOR?
XOR is not linearly separable — no single hyperplane can separate positive from negative examples. You need at least one hidden layer (two perceptrons) to create a nonlinear decision boundary.
Quiz
Which activation is most likely to cause vanishing gradients in a deep network?