Neural Networks Fundamentals

Interview Priority

Neural network fundamentals come up in nearly every deep learning interview. Interviewers expect you to explain why each technique exists, not just what it does. Focus on the principles and tradeoffs of activation functions, initialization, normalization, and regularization.

The Perceptron

The simplest neural network unit — computes a weighted sum and applies a step function:

y = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} + b \geq 0 \\ 0 & \text{otherwise} \end{cases}

A single perceptron can only learn linearly separable functions. It cannot learn XOR; this limitation motivated the development of multi-layer networks.
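To make this concrete, here is a minimal sketch of a hard-threshold perceptron: a hand-picked weight setting classifies AND, which is linearly separable, while a brute-force search over a coarse weight grid finds no setting that reproduces XOR.

```python
import itertools

def perceptron(x, w, b):
    # weighted sum followed by a hard threshold (step function)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: w = (1, 1), b = -1.5 classifies all four points
and_outputs = [perceptron(x, (1, 1), -1.5) for x in inputs]  # -> [0, 0, 0, 1]

# XOR = [0, 1, 1, 0] is not: no (w1, w2, b) on this grid reproduces it
grid = [i / 2 for i in range(-6, 7)]
xor_found = any(
    [perceptron(x, (w1, w2), b) for x in inputs] == [0, 1, 1, 0]
    for w1, w2, b in itertools.product(grid, repeat=3)
)  # -> False
```

The grid search is only suggestive (a full proof argues over all real-valued weights), but it matches the geometric picture: no single hyperplane separates XOR's classes.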

Multi-Layer Perceptron (MLP)

An MLP stacks multiple layers of neurons with nonlinear activations:

\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) \qquad \mathbf{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2

where $\sigma$ is a nonlinear activation function.

Why nonlinearity is essential: Without it, stacking layers is pointless, since $\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = \mathbf{W}'\mathbf{x}$ is just a single linear transformation. Nonlinearity lets the network approximate arbitrarily complex functions.
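This collapse is easy to verify numerically; a minimal sketch with arbitrary matrices:

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3)  # first "layer"
W2 = torch.randn(2, 4)  # second "layer"
x = torch.randn(3)

two_layers = W2 @ (W1 @ x)  # two stacked linear maps, no activation
collapsed = (W2 @ W1) @ x   # a single linear map with W' = W2 @ W1
# the two outputs are identical: depth without nonlinearity buys nothing
```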

Activation Functions

Comparison Table

| Activation | Formula | Output Range | Derivative | Usage |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | $[0, \infty)$ | 0 or 1 | Hidden layers (default) |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | $\alpha$ or 1 | Hidden layers (avoids dying ReLU) |
| GELU | $x \cdot \Phi(x)$ | $\approx (-0.17, \infty)$ | Smooth | Transformers (BERT, GPT) |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $(0, 1)$ | $\sigma(1-\sigma)$, max = 0.25 | Output layer (binary classification) |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | $1 - \tanh^2(x)$, max = 1 | Hidden layers (legacy), LSTM gates |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $(0, 1)$, sum = 1 | | Output layer (multi-class) |

ReLU and Its Variants

ReLU: $\max(0, x)$

  • Pros: computationally efficient, mitigates vanishing gradients (gradient = 1 for positive inputs), promotes sparsity (many neurons output 0)
  • Cons: Dying ReLU: if a neuron's input is always negative, its gradient is 0, it never updates, and the neuron is effectively "dead"

Leaky ReLU: $\max(\alpha x, x)$, $\alpha \approx 0.01$

Fixes dying ReLU: the negative side has a small gradient ($\alpha$), so the neuron still has a chance to recover.

GELU (Gaussian Error Linear Unit): $x \cdot \Phi(x)$

A smooth approximation of ReLU, where $\Phi$ is the standard normal CDF. Nearly standard in Transformers (BERT, GPT). Smoother than ReLU, which makes training more stable.

Sigmoid: Why Not in Hidden Layers

Sigmoid's maximum derivative is 0.25. After $L$ layers:

\text{gradient} \leq 0.25^L

10 layers → $0.25^{10} \approx 10^{-6}$: vanishing gradients. This is why sigmoid appears only in output layers (binary classification) or gating mechanisms (LSTM forget gate), never in hidden layers.

Softmax: Details

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Converts logits into a probability distribution (all values > 0, sum = 1).

Numerical Stability

Computing $e^{z_i}$ directly can overflow when $z$ is large. In practice, subtract the max first: $\text{softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}$. The result is unchanged (numerator and denominator are multiplied by the same constant), but overflow is avoided. A common interview follow-up.
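The max-subtraction trick fits in a few lines of plain Python (a minimal sketch; `softmax_stable` is an illustrative name):

```python
import math

def softmax_stable(z):
    # subtracting max(z) cancels in the ratio but keeps every exponent <= 0,
    # so exp() can no longer overflow
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1000.0, 1001.0, 1002.0]  # math.exp(1000.0) alone raises OverflowError
p = softmax_stable(z)         # a valid distribution: positive, sums to 1
```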

Choosing Activation Functions

| Layer | Recommended | Why |
|---|---|---|
| Hidden layers (general) | ReLU or Leaky ReLU | Fast, avoids vanishing gradient |
| Hidden layers (Transformer) | GELU | Smooth, empirically better for attention |
| Output (binary classification) | Sigmoid | Maps to probability [0, 1] |
| Output (multi-class) | Softmax | Probability distribution over K classes |
| Output (regression) | None (linear) | Unbounded real-valued output |
| LSTM/GRU gates | Sigmoid (0-1 range = gate) | Controls flow of information |
| LSTM cell candidate | Tanh | Zero-centered, bounded |

Universal Approximation Theorem

A feedforward network with a single hidden layer and a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$.

The key nuance for interviews: the theorem guarantees existence but says nothing about:

  • How many neurons are needed (possibly exponentially many)
  • Whether gradient descent can find the right weights
  • Whether the network generalizes to unseen data

This is why depth matters: deep networks can represent certain functions with exponentially fewer parameters.

Vanishing and Exploding Gradients

In a deep network with $L$ layers, gradients involve a product of $L$ factors:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \prod_{l=2}^{L} \frac{\partial \mathbf{h}_l}{\partial \mathbf{h}_{l-1}} \cdot \frac{\partial \mathbf{h}_1}{\partial \mathbf{W}_1}

Each factor involves a weight matrix and an activation derivative. If the factors are consistently < 1, gradients vanish exponentially; consistently > 1, gradients explode.
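A small experiment makes the contrast concrete (a sketch with assumed depth and width; `input_grad_norm` is an illustrative helper, not a library function):

```python
import torch

def input_grad_norm(act, depth=20, width=64):
    # push a signal through `depth` random layers, then measure how much
    # gradient survives the trip back to the input
    torch.manual_seed(0)
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) / width ** 0.5
        h = act(h @ W)
    h.sum().backward()
    return x.grad.norm().item()

sigmoid_grad = input_grad_norm(torch.sigmoid)
relu_grad = input_grad_norm(torch.relu)
# sigmoid's bounded derivative shrinks the gradient far more than ReLU's
```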

Solutions

| Problem | Solutions |
|---|---|
| Vanishing | ReLU (gradient = 1 for positive inputs), proper initialization (He/Xavier), residual connections (ResNet), LSTM cell state |
| Exploding | Gradient clipping ($\|\nabla\| \leq \text{threshold}$), proper initialization, BatchNorm |
| Both | Careful architecture design, normalization layers |

Sigmoid + Deep Networks = Disaster

Sigmoid's max derivative is 0.25, so 10 layers give gradient ≤ $0.25^{10} \approx 10^{-6}$ and the early layers barely update. This is why deep networks were so hard to train before ~2010, until ReLU and proper initialization arrived.

Weight Initialization

Proper initialization keeps the variance of activations and gradients stable across layers: too large and they explode, too small and they vanish.

Xavier / Glorot Initialization

Designed for sigmoid and tanh:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)

This keeps each layer's output variance approximately equal to its input variance, so the signal neither shrinks nor grows with depth.

He / Kaiming Initialization

Designed for ReLU:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)

ReLU zeros out half the values, halving the output variance; the factor of 2 compensates for this.
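The variance claim can be checked directly (a minimal sketch; the second moment of the activations is the quantity He init preserves):

```python
import torch

torch.manual_seed(0)
n_in = 512
x = torch.randn(10_000, n_in)  # inputs with unit second moment

# He init: Var(W) = 2 / n_in; pre-activations get variance 2, and ReLU
# (which zeros half of a symmetric distribution) halves it back to 1
W = torch.randn(n_in, n_in) * (2.0 / n_in) ** 0.5
h = torch.relu(x @ W)
signal_scale = h.pow(2).mean().item()  # stays close to 1.0
```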

Why Initialization Matters

| Init Strategy | Activation | What Happens If Wrong |
|---|---|---|
| All zeros | Any | All neurons compute same function → no symmetry breaking → useless |
| Too large | Any | Activations explode → sigmoid saturates → vanishing gradient |
| Too small | Any | Activations → 0 → gradients → 0 → nothing learns |
| Xavier | ReLU | Variance shrinks by half each layer (doesn't account for ReLU) |
| He | Sigmoid/Tanh | Variance grows each layer (too aggressive for bounded activations) |

Classic Interview Question

"Why can't we initialize all weights to zero?" All neurons would compute the same function, receive the same gradient, and make the same update, so they stay identical forever (symmetry is never broken). Random initialization is required so that neurons differentiate.
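The symmetry argument is easy to demonstrate (a minimal sketch using a constant rather than strictly zero init, so the gradients are nonzero but still identical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))
for p in net.parameters():
    nn.init.constant_(p, 0.1)  # every hidden neuron starts identical

x = torch.randn(16, 4)
net(x).sum().backward()

g = net[0].weight.grad  # one row per hidden neuron
# every row is identical: the neurons receive the same update, so they
# remain identical after every step (symmetry is never broken)
```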

Batch Normalization

Normalize the input to each layer across the mini-batch:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta

where $\mu_B$, $\sigma_B^2$ are mini-batch statistics, and $\gamma$ and $\beta$ are learnable parameters.
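The formula maps directly to a few lines (a sketch of the training-mode forward pass only; running statistics for inference are omitted):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # normalize each feature over the mini-batch, then rescale and shift
    mu = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

torch.manual_seed(0)
x = torch.randn(32, 8) * 3 + 5  # a shifted, scaled batch
y = batchnorm_forward(x, torch.ones(8), torch.zeros(8))
# per-feature mean ~ 0, variance ~ 1; matches nn.BatchNorm1d in train mode
```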

Why BatchNorm Works

| Benefit | Explanation |
|---|---|
| Reduces internal covariate shift | Each layer's input distribution keeps shifting during training; BatchNorm stabilizes it |
| Allows higher learning rates | Normalized inputs give more stable gradients, so a larger lr is usable |
| Mild regularization | Mini-batch statistics are noisy, giving a dropout-like effect |
| Smooths loss landscape | Makes the optimization surface smoother and easier to optimize |

Train vs Inference

| Phase | Statistics Used |
|---|---|
| Training | Mini-batch mean and variance ($\mu_B, \sigma_B^2$) |
| Inference | Running averages accumulated during training (deterministic, batch-independent) |

BatchNorm vs LayerNorm

| | BatchNorm | LayerNorm |
|---|---|---|
| Normalizes across | Batch dimension (across samples) | Feature dimension (within a single sample) |
| Depends on batch size? | Yes (unstable with small batches) | No |
| Best for | CNNs, large-batch training | RNNs, Transformers (variable-length sequences) |
| At inference | Uses running stats | Same as training |

LayerNorm is the standard in Transformers: sequence lengths vary, so batch statistics are unreliable.
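LayerNorm's batch-independence can be verified directly (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(8)  # normalizes each sample over its feature dimension
x = torch.randn(4, 8)

full_batch = ln(x)
lone_sample = ln(x[:1])  # batch of one
# sample 0's output is identical either way; a BatchNorm output would
# change with the batch composition
```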

Dropout

During training, randomly set each neuron's output to zero with probability pp:

\tilde{h}_i = \frac{1}{1-p} \cdot h_i \cdot m_i, \quad m_i \sim \text{Bernoulli}(1-p)

The scaling factor $\frac{1}{1-p}$ is what makes this inverted dropout: it keeps the expected output unchanged, so no adjustment is needed at inference.
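Inverted dropout in a few lines (a minimal sketch; `inverted_dropout` is an illustrative helper):

```python
import torch

def inverted_dropout(h, p, training=True):
    # scale kept units by 1/(1-p) so the expected output equals h;
    # at inference the function is then a plain identity
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) > p).float()
    return h * mask / (1 - p)

torch.manual_seed(0)
h = torch.ones(1_000_000)
train_out = inverted_dropout(h, p=0.5)                 # mean stays ~ 1.0
eval_out = inverted_dropout(h, p=0.5, training=False)  # untouched
```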

Why Dropout Works

  • Prevents co-adaptation: Each neuron must learn to be useful independently
  • Implicit ensemble: Training with dropout ≈ averaging exponentially many sub-networks
  • Regularization: theoretically connected to L2 regularization (Gaussian dropout ≈ L2)

Dropout in Practice

| Context | Typical $p$ | Note |
|---|---|---|
| FC layers | 0.5 | Standard rate |
| Conv layers | 0.1-0.3 or none | Conv's spatial structure already has a regularizing effect |
| After embedding | 0.1-0.3 | Common in Transformers |
| At inference | 0 (off) | All neurons active |

Modern Alternatives to Dropout

In modern architectures, dropout is gradually being replaced by other regularization: BatchNorm (implicit regularization), data augmentation, weight decay, and early stopping. Transformers still use dropout (attention dropout + residual dropout), but usually at a small rate (0.1).

Other Regularization Techniques

| Technique | How It Works | When to Use |
|---|---|---|
| Weight decay (L2) | Add $\lambda \sum w_i^2$ to the loss | Almost always (standard in Adam/SGD) |
| Early stopping | Stop training when validation loss stops improving | The simplest and most underrated regularization |
| Data augmentation | Apply random transformations to training data | Images (flip, crop, rotate), text (synonym replacement) |
| Label smoothing | Replace hard targets with $(1-\epsilon) \cdot y + \epsilon/K$ | Prevents the model from becoming overconfident |
| Mixup | Train on convex combinations of input pairs | Creates virtual training examples |
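As one concrete instance from the table, label smoothing is a one-liner over one-hot targets (a sketch; `nn.CrossEntropyLoss(label_smoothing=...)` implements the same thing internally):

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, n_classes, eps=0.1):
    # hard one-hot -> (1 - eps) on the true class, eps / K everywhere else
    one_hot = F.one_hot(targets, n_classes).float()
    return one_hot * (1 - eps) + eps / n_classes

t = smooth_labels(torch.tensor([2]), n_classes=4, eps=0.1)
# -> tensor([[0.0250, 0.0250, 0.9250, 0.0250]])
```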

Hands-on: Neural Networks in PyTorch

MLP

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(input_dim=784, hidden_dim=256, output_dim=10)
# Note: no activation after last layer — CrossEntropyLoss includes softmax

Training Loop

import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # includes softmax internally
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(num_epochs):
    model.train()  # enable dropout + batch norm training mode
    for X_batch, y_batch in train_loader:
        logits = model(X_batch)
        loss = criterion(logits, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()  # disable dropout + use running stats for batch norm
    with torch.no_grad():
        val_logits = model(X_val)
        val_loss = criterion(val_logits, y_val)

Initialization

# He initialization for ReLU layers
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

model.apply(init_weights)

Real-World Use Cases

Case 1: Credit Card Fraud Detection — MLP vs Traditional ML

| Approach | Pros | Cons |
|---|---|---|
| GBM (LightGBM) | Handles tabular data best, interpretable, fast | Less flexible for raw features |
| MLP | Can learn from raw features, flexible | Needs more data, harder to interpret |
| Autoencoder | Unsupervised anomaly detection (learn normal → detect abnormal) | Doesn't directly optimize for fraud detection |

Interview follow-up: "When is an NN better than a GBM?" When features are raw/unstructured (images, text, sequences) or the dataset is very large (> 10M rows, where the NN's capacity can actually be used). For moderate-size tabular data, GBM is almost always better.

Case 2: Recommendation Systems — Embedding Layers

The core contribution of neural networks to recommender systems is embedding layers, which map high-cardinality categorical features (user ID, item ID) to dense low-dimensional vectors:

class RecModel(nn.Module):
    def __init__(self, n_users, n_items, embed_dim=64):
        super().__init__()
        self.user_embed = nn.Embedding(n_users, embed_dim)
        self.item_embed = nn.Embedding(n_items, embed_dim)
        self.fc = nn.Sequential(
            nn.Linear(embed_dim * 2, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, user_ids, item_ids):
        u = self.user_embed(user_ids)
        i = self.item_embed(item_ids)
        return self.fc(torch.cat([u, i], dim=1)).squeeze()

An embedding layer is essentially a lookup table, but its weights are learned via backprop, so each user/item ends up with a meaningful vector.
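The lookup-table claim can be made concrete (a minimal sketch): an embedding lookup equals a one-hot matrix multiplied by the weight table, just without materializing the one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(5, 3)  # 5 ids, each mapped to a learnable 3-dim vector
ids = torch.tensor([0, 3, 3])

lookup = emb(ids)              # table lookup, differentiable
one_hot = F.one_hot(ids, 5).float()
matmul = one_hot @ emb.weight  # same rows, computed the expensive way
```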

Case 3: NLP — From Bag-of-Words to Embeddings

Word embeddings from neural networks (Word2Vec → BERT → GPT) revolutionized NLP:

  • One-hot: 10K vocab → 10K-dim sparse vector (no semantic info)
  • Word2Vec embedding: 10K vocab → 300-dim dense vector (king - man + woman ≈ queen)
  • BERT contextual embedding: the same word gets a different embedding in each context

Knowing this progression matters in interviews: the move from sparse representations to learned dense embeddings is neural networks' core contribution to NLP.

Interview Signals

What interviewers listen for:

  • You know why nonlinearity is needed (without it, stacked layers collapse into a single linear layer)
  • You can explain the cause of vanishing gradients and the fixes (ReLU, initialization, skip connections)
  • You know which activation Xavier vs He initialization is designed for
  • You can compare when BatchNorm vs LayerNorm is appropriate
  • You understand the intuition behind dropout (implicit ensemble, preventing co-adaptation)

Practice

Flashcards


Why can't a single perceptron learn XOR?

XOR is not linearly separable — no single hyperplane can separate positive from negative examples. At least one hidden layer (two perceptrons) is needed to create a nonlinear decision boundary.


Quiz


Which activation is most likely to cause vanishing gradients in a deep network?
