Convolutional Neural Networks
Interview Essentials
CNN interviews focus on: (1) the convolution operation and output-dimension calculation, (2) the key innovations of classic architectures (AlexNet → VGG → ResNet), (3) transfer learning strategies. You should be able to explain why convolutions beat fully connected layers for images.
Why Convolutions?
FC layers treat every input pixel independently. For a 224×224×3 image, a single FC layer with 1000 neurons needs:

$$224 \times 224 \times 3 \times 1000 \approx 150\text{M parameters}$$

150M parameters for a single layer — infeasible.
Convolutions exploit two key inductive biases of images:
| Property | Meaning | How Conv Exploits It |
|---|---|---|
| Translation invariance | A cat is a cat no matter where it appears in the image | Same kernel applied everywhere (weight sharing) |
| Locality | Nearby pixels are more strongly related than distant ones | Small kernel — only look at a local neighborhood |
| Compositionality | Complex features are built from simpler ones | Hierarchical layers: edges → textures → parts → objects |
The Convolution Operation
A 2D convolution slides a kernel (filter) across the input and computes a sum of element-wise products:

$$(I * K)(i, j) = \sum_m \sum_n I(i+m,\, j+n)\, K(m, n)$$

For multi-channel input (e.g., RGB), each filter spans all input channels:

$$\text{out}(i, j) = \sum_{c=1}^{C_{in}} \sum_m \sum_n I_c(i+m,\, j+n)\, K_c(m, n) + b$$

Each filter produces one feature map (output channel). $C_{out}$ filters → $C_{out}$ feature maps.
Output Dimensions
For input size $W$, kernel size $K$, padding $P$, stride $S$:

$$W_{out} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
Quick Reference
| Config | Input 32×32, K=3 | Result |
|---|---|---|
| Valid (P=0, S=1) | (32-3+0)/1+1 = 30 | 30×30 |
| Same (P=1, S=1) | (32-3+2)/1+1 = 32 | 32×32 (size preserved) |
| Stride 2 (P=1, S=2) | (32-3+2)/2+1 = 16 | 16×16 (downsampled) |
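The formula and the table can be checked with a few lines of plain Python (a quick sketch, independent of any framework):

```python
def conv_out(w_in, k, p=0, s=1):
    """Output spatial size: floor((W - K + 2P) / S) + 1."""
    return (w_in - k + 2 * p) // s + 1

# The three configs from the table, input 32x32, K=3
print(conv_out(32, 3, p=0, s=1))  # valid:    30
print(conv_out(32, 3, p=1, s=1))  # same:     32
print(conv_out(32, 3, p=1, s=2))  # stride 2: 16
```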
'Same' vs 'Valid' Padding
Valid ($P=0$): the output shrinks by $K-1$ pixels. Same ($P=\lfloor K/2 \rfloor$, $S=1$): the output matches the input size. Modern architectures almost always use same padding.
Parameter Count
For a conv layer with a $K \times K$ kernel, $C_{in}$ input channels, and $C_{out}$ output channels:

$$\text{params} = C_{out} \times (K \times K \times C_{in} + 1)$$

The $+1$ is the bias per filter.

vs FC: Conv(3→32, 3×3) = 32×(3×3×3+1) = 896 params. An equivalent FC layer on a flattened 32×32×3 input with 32 outputs: 32×32×3×32 = 98,304 params — 109× more.
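Assuming PyTorch is available, both counts can be verified directly:

```python
import torch.nn as nn

# Conv(3 -> 32, 3x3): 32 x (3*3*3 + 1) = 896 parameters
conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)
n_conv = sum(p.numel() for p in conv.parameters())

# "Equivalent" FC mapping a flattened 32x32x3 input to 32 units (no bias)
fc = nn.Linear(32 * 32 * 3, 32, bias=False)
n_fc = sum(p.numel() for p in fc.parameters())

print(n_conv, n_fc, n_fc // n_conv)  # 896 98304 109
```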
Key Building Blocks
Stride and Padding
- Stride: how far the kernel moves at each step. Stride 2 halves the spatial dimensions (often used instead of pooling)
- Padding: zeros added around the input border. Without padding, the spatial dimensions shrink at every layer → limits depth
Pooling
| Type | Operation | Use |
|---|---|---|
| Max Pooling | Take the maximum value in each window | Keeps the strongest activation, discards exact position |
| Average Pooling | Take the mean of each window | Smoother, preserves more spatial info |
| Global Average Pooling (GAP) | Average an entire feature map to one value | Replaces FC layers → drastically fewer parameters |
GAP replaced the final FC layers in modern architectures (ResNet, EfficientNet) — average each feature map down to a single value, then feed directly into softmax.
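A minimal sketch of GAP feeding a classifier head, assuming PyTorch (the 512-channel, 10-class sizes are illustrative):

```python
import torch
import torch.nn as nn

# GAP: average each feature map to one value, then classify
x = torch.randn(8, 512, 7, 7)       # [B, C, H, W] from a conv backbone
gap = nn.AdaptiveAvgPool2d(1)       # -> [B, 512, 1, 1]
head = nn.Linear(512, 10)           # illustrative 10-class head
logits = head(gap(x).flatten(1))    # -> [B, 10]
print(logits.shape)  # torch.Size([8, 10])
```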
Receptive Field
The region of input that influences a particular output neuron. For $n$ stacked 3×3 (stride 1) convolutions, the receptive field is $1 + 2n$:
| Layers | Receptive Field | Equivalent Kernel |
|---|---|---|
| 1 | 3×3 | — |
| 2 | 5×5 | Fewer params than single 5×5 |
| 3 | 7×7 | Fewer params than single 7×7 |
Each additional 3×3 layer grows the receptive field by 2. This is why stacking small kernels beats one large kernel — the same receptive field with fewer parameters and more nonlinearity (the core insight of VGGNet).
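The receptive-field growth and the parameter savings can be checked with quick arithmetic (the channel count of 64 is illustrative):

```python
def stacked_rf(n_layers, k=3):
    """Receptive field of n stacked k x k, stride-1 convs: 1 + n*(k - 1)."""
    return 1 + n_layers * (k - 1)

print([stacked_rf(n) for n in (1, 2, 3)])  # [3, 5, 7]

# Parameter comparison at C channels in and out (weights only)
C = 64
two_3x3 = 2 * (3 * 3 * C * C)  # 73,728
one_5x5 = 5 * 5 * C * C        # 102,400
print(two_3x3 < one_5x5)       # True: same 5x5 receptive field, fewer params
```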
1×1 Convolutions
Operates only across channels at each spatial position — like a per-pixel FC layer:
Uses:
- Bottleneck: reduce to fewer channels before an expensive conv → less computation
- Channel mixing: Combine information across feature maps
- Add nonlinearity: 1×1 conv + ReLU = per-pixel nonlinear transform
Popularized by Network-in-Network and used extensively in GoogLeNet/Inception and ResNet bottleneck blocks.
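A small PyTorch sketch of the bottleneck idea (the channel sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 14, 14)           # [B, C, H, W]
proj = nn.Conv2d(256, 64, kernel_size=1)  # 1x1: mixes channels only
y = proj(x)
print(y.shape)  # torch.Size([1, 64, 14, 14]) — spatial dims unchanged

# 3x3 conv cost with vs. without the 1x1 reduction (weights only)
direct = 3 * 3 * 256 * 256    # 589,824
bottleneck = 3 * 3 * 64 * 64  # 36,864
print(direct, bottleneck)
```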
Classic Architectures
Being able to state each architecture's key innovation and the problem it solved is a strong interview signal:
Architecture Timeline
| Year | Model | Key Innovation | Depth | ImageNet Top-5 |
|---|---|---|---|---|
| 1998 | LeNet-5 | Conv-Pool-Conv-Pool-FC pattern | 5 | — (MNIST) |
| 2012 | AlexNet | ReLU, Dropout, GPU training, data augmentation | 8 | 16.4% |
| 2014 | VGGNet | Only 3×3 convs, depth matters | 16-19 | 7.3% |
| 2014 | GoogLeNet | Inception module, 1×1 bottleneck | 22 | 6.7% |
| 2015 | ResNet | Skip connections | 50-152 | 3.6% |
| 2017 | DenseNet | Dense connections (each layer connects to all subsequent layers) | 121+ | — |
| 2019 | EfficientNet | Compound scaling(width × depth × resolution) | — | 2.9% |
| 2020 | ViT | Vision Transformer — patches as tokens | — | Competitive |
ResNet: The Game Changer
Introduced skip connections (residual connections):

$$y = F(x) + x$$

The network learns the residual $F(x)$ instead of the full mapping.

Why it works:

- Gradient highway: through the identity shortcut, $\partial y / \partial x = \partial F / \partial x + 1$ — the $+1$ term means gradients never vanish through shortcuts
- Easy identity: if extra depth is unnecessary, $F(x) = 0$ is easy to learn → $y = x$ (identity mapping)
- Deeper = better: in the pre-ResNet era, adding depth to plain networks degraded performance (an optimization difficulty, not overfitting). Skip connections solved this
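A minimal sketch of a residual block, assuming PyTorch (channel count fixed for simplicity — real ResNet blocks also handle stride and channel changes in the shortcut):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A ResNet-style basic block (sketch): out = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)  # identity shortcut: gradient highway

block = BasicBlock(64)
x = torch.randn(2, 64, 8, 8)
out = block(x)
print(out.shape)  # torch.Size([2, 64, 8, 8])
```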
Why ResNet Changed Everything
Before ResNet, adding more layers to VGG-style plain networks made performance worse — not because of overfitting, but because optimization became too hard. Skip connections let gradients flow directly to early layers, making 100+ layer networks trainable.
ResNet Variants
| Block | Structure | When |
|---|---|---|
| Basic Block | 3×3 → BN → ReLU → 3×3 → BN → + → ReLU | ResNet-18, 34 |
| Bottleneck Block | 1×1 → 3×3 → 1×1 (reduce → compute → restore channels) | ResNet-50, 101, 152 |
The bottleneck uses a 1×1 conv to reduce channels, runs the 3×3 conv at the reduced width, then a 1×1 conv to restore channels → far less computation → the network can go deeper.
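A sketch of the bottleneck block, assuming PyTorch (same-channel shortcut only; the reduction factor of 4 matches ResNet-50):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet bottleneck (sketch): 1x1 reduce -> 3x3 -> 1x1 restore + shortcut."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)

x = torch.randn(2, 256, 8, 8)
out = Bottleneck(256)(x)
print(out.shape)  # torch.Size([2, 256, 8, 8])
```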
Vision Transformer (ViT)
The Vision Transformer cuts an image into fixed-size patches and feeds each patch as a token into a standard Transformer:
- Split image into patches
- Flatten each patch → linear projection to embedding
- Add positional embedding
- Feed into standard Transformer encoder
- Use [CLS] token output for classification
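Steps 1–2 above are commonly implemented as a single strided convolution; a sketch assuming PyTorch (patch size 16 and embedding dim 768 follow ViT-Base):

```python
import torch
import torch.nn as nn

# ViT patch embedding (sketch): a strided conv splits the image into
# non-overlapping P x P patches and projects each to an embedding vector.
patch, dim = 16, 768
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

x = torch.randn(1, 3, 224, 224)
tokens = embed(x).flatten(2).transpose(1, 2)  # 14x14 = 196 patch tokens
print(tokens.shape)  # torch.Size([1, 196, 768])
```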
ViT vs CNN:
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Translation invariance, locality | Minimal (learned from data) |
| Data efficiency | Better with small datasets | Needs large datasets (or pre-training) |
| Global context | Limited by receptive field | Attention sees all patches from layer 1 |
| Scalability | Saturates with more data | Keeps improving with more data |
ViT in Interviews
ViT does not make CNNs obsolete. CNNs are still better on small-to-medium datasets (thanks to their inductive bias). ViT overtakes CNNs given enough data and compute (learned inductive bias > hand-designed inductive bias). Hybrid approaches (ConvNeXt, CoAtNet) combine the strengths of both.
Data Augmentation
The workhorse of CNN regularization — random transformations of the training images increase the effective dataset size:
| Augmentation | How | Effect |
|---|---|---|
| Random crop | Randomly crop a region | Translation invariance |
| Horizontal flip | Mirror left-right | Doubles effective data |
| Color jitter | Random brightness, contrast, saturation | Robustness to lighting |
| Random rotation | Rotate by small angles | Rotation invariance |
| Cutout / Random erasing | Mask random patches | Forces model to use more features |
| Mixup | Blend two images + their labels | Smoother decision boundaries |
| CutMix | Paste patch from one image onto another | Combine Cutout + Mixup benefits |
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# ImageNet mean/std are standard for pre-trained models
Transfer Learning
Pre-trained CNNs (typically on ImageNet) serve as powerful feature extractors.
Why It Works
Lower layers learn universal features (edges, textures, colors) → transfer well across tasks. Higher layers learn task-specific features.
Two Strategies
| Strategy | How | When |
|---|---|---|
| Feature extraction | Freeze conv base, train only classifier head | Small dataset, similar domain |
| Fine-tuning | Unfreeze top layers, train with small lr | Large dataset or different domain |
Transfer Learning Decision Matrix
| Your Dataset | Similar Domain | Different Domain |
|---|---|---|
| Small | Feature extraction only | Fine-tune top layers + aggressive augmentation |
| Large | Fine-tune entire network (small lr) | Fine-tune entire network (small lr) |
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet50
model = models.resnet50(weights="IMAGENET1K_V2")

# Strategy 1: Feature extraction — freeze all conv layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head for your task
model.fc = nn.Linear(2048, num_classes)  # only this trains

# Strategy 2: Fine-tune — unfreeze top layers
for param in model.layer4.parameters():
    param.requires_grad = True
# Use a smaller lr for pre-trained layers, a larger one for the new head
Transfer Learning in Interviews
When an interviewer asks "you have 500 medical images to classify," the answer is almost always transfer learning. Start from an ImageNet pre-trained ResNet/EfficientNet, freeze the conv layers, and train only the final classifier. 500 images is not enough to train a CNN from scratch, but pre-trained features (edges, textures) are still useful for medical images.
Real-World Use Cases
Case 1: Credit Card Fraud — Where Does a CNN Fit?
Intuitively, credit card fraud has nothing to do with images, but CNNs can be applied to:
- Transaction sequence as image: turn a user's transaction sequence (amount × time) into a 2D representation → use a 1D CNN to capture temporal patterns
- Signature verification: a CNN compares signature images
- Document fraud: a CNN detects forged documents (altered text, inconsistent fonts)
Interview follow-up: "Would you use a CNN on tabular fraud data?" — Usually no. For tabular data, GBM > MLP > CNN. The CNN inductive biases (translation invariance, locality) do not help on tabular data.
Case 2: Recommender Systems — Visual Features
CNNs extract visual features for recommender systems:
- E-commerce: product image → CNN embedding → recommend by visual similarity ("items with a similar style to what you viewed")
- Fashion: CNNs learn color, pattern, style → visual recommendation
- Real estate: house photos → CNN features feed into the pricing model
Case 3: Medical Imaging — Transfer Learning
Medical imaging is the textbook transfer-learning scenario:
- Small datasets (hundreds to a few thousand labeled images)
- But ImageNet pre-trained features (edges, textures, shapes) are still useful
- Fine-tuning the top layers + heavy augmentation usually works well
- Interpretability matters → Grad-CAM shows which regions the model attends to
Hands-on: CNN in PyTorch
Complete CNN Architecture
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: [B, 3, 32, 32] → [B, 32, 16, 16]
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: → [B, 64, 8, 8]
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: → [B, 128, 4, 4]
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # GAP: → [B, 128, 1, 1]
            nn.Flatten(),             # → [B, 128]
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
Transfer Learning with ResNet
import torchvision.models as models
import torch.nn as nn

# Pre-trained ResNet50
model = models.resnet50(weights="IMAGENET1K_V2")

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier
model.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(2048, num_classes),
)
# Only model.fc parameters will be updated
Interview Signals
What interviewers listen for:
- You can compute output dimensions and parameter counts
- You know the key innovation of each classic architecture
- You can explain why skip connections solved the training problem of deep networks
- You know the transfer learning strategies and the decision matrix
- You can compare the tradeoffs between CNNs and ViT
Practice
Flashcards
How to compute output spatial dimension of a conv layer?
W_out = floor((W_in - K + 2P) / S) + 1. Same padding (S=1): P = floor(K/2) → W_out = W_in. You should be able to answer this instantly in an interview.
Quiz
32×32 input, 5×5 kernel, stride 1, no padding. Output size?