Neural Networks Fundamentals

Interview Priority

Neural network fundamentals come up in nearly every deep learning interview. Interviewers expect you to explain why each technique exists, not just what it does. Focus on the principles and tradeoffs of activation functions, initialization, normalization, and regularization.

The Perceptron

The simplest neural network unit — computes a weighted sum and applies a step function:

y = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} + b \geq 0 \\ 0 & \text{otherwise} \end{cases}

A single perceptron can only learn linearly separable functions. It cannot learn XOR; this limitation motivated the development of multi-layer networks.
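To make this concrete, here is a minimal sketch of a hard-threshold perceptron: a hand-picked weight setting classifies AND, which is linearly separable, while a brute-force search over a coarse weight grid finds no setting that reproduces XOR.

```python
import itertools

def perceptron(x, w, b):
    # weighted sum followed by a hard threshold (step function)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: w = (1, 1), b = -1.5 classifies all four points
and_outputs = [perceptron(x, (1, 1), -1.5) for x in inputs]  # -> [0, 0, 0, 1]

# XOR = [0, 1, 1, 0] is not: no (w1, w2, b) on this grid reproduces it
grid = [i / 2 for i in range(-6, 7)]
xor_found = any(
    [perceptron(x, (w1, w2), b) for x in inputs] == [0, 1, 1, 0]
    for w1, w2, b in itertools.product(grid, repeat=3)
)  # -> False
```

The grid search is only suggestive (a full proof argues over all real-valued weights), but it matches the geometric picture: no single hyperplane separates XOR's classes.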

Multi-Layer Perceptron (MLP)

An MLP stacks multiple layers of neurons with nonlinear activations:

\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) \qquad \mathbf{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2

where $\sigma$ is a nonlinear activation function.

Why nonlinearity is essential: Without it, stacking layers is pointless, since $\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = \mathbf{W}'\mathbf{x}$ is just a single linear transformation. Nonlinearity lets the network approximate arbitrarily complex functions.
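This collapse is easy to verify numerically; a minimal sketch with arbitrary matrices:

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3)  # first "layer"
W2 = torch.randn(2, 4)  # second "layer"
x = torch.randn(3)

two_layers = W2 @ (W1 @ x)  # two stacked linear maps, no activation
collapsed = (W2 @ W1) @ x   # a single linear map with W' = W2 @ W1
# the two outputs are identical: depth without nonlinearity buys nothing
```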

Activation Functions

Comparison Table

| Activation | Formula | Output Range | Derivative | Usage |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | $[0, \infty)$ | 0 or 1 | Hidden layers (default) |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | $\alpha$ or 1 | Hidden layers (avoids dying ReLU) |
| GELU | $x \cdot \Phi(x)$ | $\approx (-0.17, \infty)$ | Smooth | Transformers (BERT, GPT) |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $(0, 1)$ | $\sigma(1-\sigma)$, max = 0.25 | Output layer (binary classification) |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | $1 - \tanh^2(x)$, max = 1 | Hidden layers (legacy), LSTM gates |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $(0, 1)$, sum = 1 | | Output layer (multi-class) |

ReLU and Its Variants

ReLU: $\max(0, x)$

  • Pros: computationally efficient, mitigates vanishing gradients (gradient = 1 for positive inputs), promotes sparsity (many neurons output 0)
  • Cons: Dying ReLU: if a neuron's input is always negative, its gradient is 0, it never updates, and the neuron is effectively "dead"

Leaky ReLU: $\max(\alpha x, x)$, $\alpha \approx 0.01$

Fixes dying ReLU: the negative side has a small gradient ($\alpha$), so the neuron still has a chance to recover.

GELU (Gaussian Error Linear Unit): $x \cdot \Phi(x)$

A smooth approximation of ReLU, where $\Phi$ is the standard normal CDF. Nearly standard in Transformers (BERT, GPT). Smoother than ReLU, which makes training more stable.

Sigmoid: Why Not in Hidden Layers

Sigmoid's maximum derivative is 0.25. After $L$ layers:

\text{gradient} \leq 0.25^L

10 layers → $0.25^{10} \approx 10^{-6}$: vanishing gradients. This is why sigmoid appears only in output layers (binary classification) or gating mechanisms (LSTM forget gate), never in hidden layers.

Softmax: Details

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Converts logits into a probability distribution (all values > 0, sum = 1).

Numerical Stability

Computing $e^{z_i}$ directly can overflow when $z$ is large. In practice, subtract the max first: $\text{softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}$. The result is unchanged (numerator and denominator are multiplied by the same constant), but overflow is avoided. A common interview follow-up.
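The max-subtraction trick fits in a few lines of plain Python (a minimal sketch; `softmax_stable` is an illustrative name):

```python
import math

def softmax_stable(z):
    # subtracting max(z) cancels in the ratio but keeps every exponent <= 0,
    # so exp() can no longer overflow
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1000.0, 1001.0, 1002.0]  # math.exp(1000.0) alone raises OverflowError
p = softmax_stable(z)         # a valid distribution: positive, sums to 1
```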

Choosing Activation Functions

| Layer | Recommended | Why |
|---|---|---|
| Hidden layers (general) | ReLU or Leaky ReLU | Fast, avoids vanishing gradient |
| Hidden layers (Transformer) | GELU | Smooth, empirically better for attention |
| Output (binary classification) | Sigmoid | Maps to probability [0, 1] |
| Output (multi-class) | Softmax | Probability distribution over K classes |
| Output (regression) | None (linear) | Unbounded real-valued output |
| LSTM/GRU gates | Sigmoid (0-1 range = gate) | Controls flow of information |
| LSTM cell candidate | Tanh | Zero-centered, bounded |

Universal Approximation Theorem

A feedforward network with a single hidden layer and a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$.

The key nuance for interviews: the theorem guarantees existence but says nothing about:

  • How many neurons are needed (possibly exponentially many)
  • Whether gradient descent can find the right weights
  • Whether the network generalizes to unseen data

This is why depth matters: deep networks can represent certain functions with exponentially fewer parameters.

Vanishing and Exploding Gradients

In a deep network with $L$ layers, gradients involve a product of $L$ factors:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \prod_{l=2}^{L} \frac{\partial \mathbf{h}_l}{\partial \mathbf{h}_{l-1}} \cdot \frac{\partial \mathbf{h}_1}{\partial \mathbf{W}_1}

Each factor involves a weight matrix and an activation derivative. If the factors are consistently < 1, gradients vanish exponentially; consistently > 1, gradients explode.
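A small experiment makes the contrast concrete (a sketch with assumed depth and width; `input_grad_norm` is an illustrative helper, not a library function):

```python
import torch

def input_grad_norm(act, depth=20, width=64):
    # push a signal through `depth` random layers, then measure how much
    # gradient survives the trip back to the input
    torch.manual_seed(0)
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) / width ** 0.5
        h = act(h @ W)
    h.sum().backward()
    return x.grad.norm().item()

sigmoid_grad = input_grad_norm(torch.sigmoid)
relu_grad = input_grad_norm(torch.relu)
# sigmoid's bounded derivative shrinks the gradient far more than ReLU's
```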

Solutions

| Problem | Solutions |
|---|---|
| Vanishing | ReLU (gradient = 1 for positive inputs), proper initialization (He/Xavier), residual connections (ResNet), LSTM cell state |
| Exploding | Gradient clipping ($\|\nabla\| \leq \text{threshold}$), proper initialization, BatchNorm |
| Both | Careful architecture design, normalization layers |

Sigmoid + Deep Networks = Disaster

Sigmoid's max derivative is 0.25, so 10 layers give gradient ≤ $0.25^{10} \approx 10^{-6}$ and the early layers barely update. This is why deep networks were so hard to train before ~2010, until ReLU and proper initialization arrived.

Weight Initialization

Proper initialization keeps the variance of activations and gradients stable across layers: too large and they explode, too small and they vanish.

Xavier / Glorot Initialization

Designed for sigmoid and tanh:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)

This keeps each layer's output variance approximately equal to its input variance, so the signal neither shrinks nor grows with depth.

He / Kaiming Initialization

Designed for ReLU:

W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)

ReLU zeros out half the values, halving the output variance; the factor of 2 compensates for this.
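The variance claim can be checked directly (a minimal sketch; the second moment of the activations is the quantity He init preserves):

```python
import torch

torch.manual_seed(0)
n_in = 512
x = torch.randn(10_000, n_in)  # inputs with unit second moment

# He init: Var(W) = 2 / n_in; pre-activations get variance 2, and ReLU
# (which zeros half of a symmetric distribution) halves it back to 1
W = torch.randn(n_in, n_in) * (2.0 / n_in) ** 0.5
h = torch.relu(x @ W)
signal_scale = h.pow(2).mean().item()  # stays close to 1.0
```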

Why Initialization Matters

| Init Strategy | Activation | What Happens If Wrong |
|---|---|---|
| All zeros | Any | All neurons compute same function → no symmetry breaking → useless |
| Too large | Any | Activations explode → sigmoid saturates → vanishing gradient |
| Too small | Any | Activations → 0 → gradients → 0 → nothing learns |
| Xavier | ReLU | Variance shrinks by half each layer (doesn't account for ReLU) |
| He | Sigmoid/Tanh | Variance grows each layer (too aggressive for bounded activations) |

Classic Interview Question

"Why can't we initialize all weights to zero?" All neurons would compute the same function, receive the same gradient, and make the same update, so they stay identical forever (symmetry is never broken). Random initialization is required so that neurons differentiate.
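The symmetry argument is easy to demonstrate (a minimal sketch using a constant rather than strictly zero init, so the gradients are nonzero but still identical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))
for p in net.parameters():
    nn.init.constant_(p, 0.1)  # every hidden neuron starts identical

x = torch.randn(16, 4)
net(x).sum().backward()

g = net[0].weight.grad  # one row per hidden neuron
# every row is identical: the neurons receive the same update, so they
# remain identical after every step (symmetry is never broken)
```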

Batch Normalization

Normalize the input to each layer across the mini-batch:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta

where $\mu_B$, $\sigma_B^2$ are mini-batch statistics, and $\gamma$ and $\beta$ are learnable parameters.
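The formula maps directly to a few lines (a sketch of the training-mode forward pass only; running statistics for inference are omitted):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # normalize each feature over the mini-batch, then rescale and shift
    mu = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

torch.manual_seed(0)
x = torch.randn(32, 8) * 3 + 5  # a shifted, scaled batch
y = batchnorm_forward(x, torch.ones(8), torch.zeros(8))
# per-feature mean ~ 0, variance ~ 1; matches nn.BatchNorm1d in train mode
```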

Why BatchNorm Works

| Benefit | Explanation |
|---|---|
| Reduces internal covariate shift | Each layer's input distribution keeps shifting during training; BatchNorm stabilizes it |
| Allows higher learning rates | Normalized inputs give more stable gradients, so a larger lr is usable |
| Mild regularization | Mini-batch statistics are noisy, giving a dropout-like effect |
| Smooths loss landscape | Makes the optimization surface smoother and easier to optimize |

Train vs Inference

| Phase | Statistics Used |
|---|---|
| Training | Mini-batch mean and variance ($\mu_B, \sigma_B^2$) |
| Inference | Running averages accumulated during training (deterministic, batch-independent) |

BatchNorm vs LayerNorm

| | BatchNorm | LayerNorm |
|---|---|---|
| Normalizes across | Batch dimension (across samples) | Feature dimension (within a single sample) |
| Depends on batch size? | Yes (unstable with small batches) | No |
| Best for | CNNs, large-batch training | RNNs, Transformers (variable-length sequences) |
| At inference | Uses running stats | Same as training |

LayerNorm is the standard in Transformers: sequence lengths vary, so batch statistics are unreliable.
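LayerNorm's batch-independence can be verified directly (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(8)  # normalizes each sample over its feature dimension
x = torch.randn(4, 8)

full_batch = ln(x)
lone_sample = ln(x[:1])  # batch of one
# sample 0's output is identical either way; a BatchNorm output would
# change with the batch composition
```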

Dropout

During training, randomly set each neuron's output to zero with probability pp:

\tilde{h}_i = \frac{1}{1-p} \cdot h_i \cdot m_i, \quad m_i \sim \text{Bernoulli}(1-p)

The scaling factor $\frac{1}{1-p}$ is what makes this inverted dropout: it keeps the expected output unchanged, so no adjustment is needed at inference.
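Inverted dropout in a few lines (a minimal sketch; `inverted_dropout` is an illustrative helper):

```python
import torch

def inverted_dropout(h, p, training=True):
    # scale kept units by 1/(1-p) so the expected output equals h;
    # at inference the function is then a plain identity
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) > p).float()
    return h * mask / (1 - p)

torch.manual_seed(0)
h = torch.ones(1_000_000)
train_out = inverted_dropout(h, p=0.5)                 # mean stays ~ 1.0
eval_out = inverted_dropout(h, p=0.5, training=False)  # untouched
```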

Why Dropout Works

  • Prevents co-adaptation: Each neuron must learn to be useful independently
  • Implicit ensemble: Training with dropout ≈ averaging exponentially many sub-networks
  • Regularization: theoretically connected to L2 regularization (Gaussian dropout ≈ L2)

Dropout in Practice

| Context | Typical $p$ | Note |
|---|---|---|
| FC layers | 0.5 | Standard rate |
| Conv layers | 0.1-0.3 or none | Conv's spatial structure already has a regularizing effect |
| After embedding | 0.1-0.3 | Common in Transformers |
| At inference | 0 (off) | All neurons active |

Modern Alternatives to Dropout

In modern architectures, dropout is gradually being replaced by other regularization: BatchNorm (implicit regularization), data augmentation, weight decay, and early stopping. Transformers still use dropout (attention dropout + residual dropout), but usually at a small rate (0.1).

Other Regularization Techniques

| Technique | How It Works | When to Use |
|---|---|---|
| Weight decay (L2) | Add $\lambda \sum w_i^2$ to the loss | Almost always (standard in Adam/SGD) |
| Early stopping | Stop training when validation loss stops improving | The simplest and most underrated regularization |
| Data augmentation | Apply random transformations to training data | Images (flip, crop, rotate), text (synonym replacement) |
| Label smoothing | Replace hard targets with $(1-\epsilon) \cdot y + \epsilon/K$ | Prevents the model from becoming overconfident |
| Mixup | Train on convex combinations of input pairs | Creates virtual training examples |
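As one concrete instance from the table, label smoothing is a one-liner over one-hot targets (a sketch; `nn.CrossEntropyLoss(label_smoothing=...)` implements the same thing internally):

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, n_classes, eps=0.1):
    # hard one-hot -> (1 - eps) on the true class, eps / K everywhere else
    one_hot = F.one_hot(targets, n_classes).float()
    return one_hot * (1 - eps) + eps / n_classes

t = smooth_labels(torch.tensor([2]), n_classes=4, eps=0.1)
# -> tensor([[0.0250, 0.0250, 0.9250, 0.0250]])
```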

Hands-on: Neural Networks in PyTorch

MLP

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(input_dim=784, hidden_dim=256, output_dim=10)
# Note: no activation after last layer — CrossEntropyLoss includes softmax

Training Loop

import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # includes softmax internally
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(num_epochs):
    model.train()  # enable dropout + batch norm training mode
    for X_batch, y_batch in train_loader:
        logits = model(X_batch)
        loss = criterion(logits, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()  # disable dropout + use running stats for batch norm
    with torch.no_grad():
        val_logits = model(X_val)
        val_loss = criterion(val_logits, y_val)

Initialization

# He initialization for ReLU layers
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

model.apply(init_weights)

Real-World Use Cases

Case 1: Credit Card Fraud Detection — MLP vs Traditional ML

| Approach | Pros | Cons |
|---|---|---|
| GBM (LightGBM) | Handles tabular data best, interpretable, fast | Less flexible for raw features |
| MLP | Can learn from raw features, flexible | Needs more data, harder to interpret |
| Autoencoder | Unsupervised anomaly detection (learn normal → detect abnormal) | Doesn't directly optimize for fraud detection |

Interview follow-up: "When is an NN better than a GBM?" When features are raw/unstructured (images, text, sequences) or the dataset is very large (> 10M rows, where the NN's capacity can actually be used). For moderate-size tabular data, GBM is almost always better.

Case 2: Recommendation Systems — Embedding Layers

The core contribution of neural networks to recommender systems is embedding layers, which map high-cardinality categorical features (user ID, item ID) to dense low-dimensional vectors:

class RecModel(nn.Module):
    def __init__(self, n_users, n_items, embed_dim=64):
        super().__init__()
        self.user_embed = nn.Embedding(n_users, embed_dim)
        self.item_embed = nn.Embedding(n_items, embed_dim)
        self.fc = nn.Sequential(
            nn.Linear(embed_dim * 2, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, user_ids, item_ids):
        u = self.user_embed(user_ids)
        i = self.item_embed(item_ids)
        return self.fc(torch.cat([u, i], dim=1)).squeeze()

An embedding layer is essentially a lookup table, but its weights are learned via backprop, so each user/item ends up with a meaningful vector.
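The lookup-table claim can be made concrete (a minimal sketch): an embedding lookup equals a one-hot matrix multiplied by the weight table, just without materializing the one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(5, 3)  # 5 ids, each mapped to a learnable 3-dim vector
ids = torch.tensor([0, 3, 3])

lookup = emb(ids)              # table lookup, differentiable
one_hot = F.one_hot(ids, 5).float()
matmul = one_hot @ emb.weight  # same rows, computed the expensive way
```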

Case 3: NLP — From Bag-of-Words to Embeddings

Word embeddings from neural networks (Word2Vec → BERT → GPT) revolutionized NLP:

  • One-hot: 10K vocab → 10K-dim sparse vector (no semantic info)
  • Word2Vec embedding: 10K vocab → 300-dim dense vector (king - man + woman ≈ queen)
  • BERT contextual embedding: the same word gets a different embedding in each context

Knowing this progression matters in interviews: the move from sparse representations to learned dense embeddings is neural networks' core contribution to NLP.

Interview Signals

What interviewers listen for:

  • You know why nonlinearity is needed (without it, stacked layers collapse into a single linear layer)
  • You can explain the cause of vanishing gradients and the fixes (ReLU, initialization, skip connections)
  • You know which activation Xavier vs He initialization is designed for
  • You can compare when BatchNorm vs LayerNorm is appropriate
  • You understand the intuition behind dropout (implicit ensemble, preventing co-adaptation)

Practice

Flashcards


Why can't a single perceptron learn XOR?

XOR is not linearly separable — no single hyperplane can separate positive from negative examples. At least one hidden layer (two perceptrons) is needed to create a nonlinear decision boundary.


Quiz


Which activation is most likely to cause vanishing gradients in a deep network?
