Convolutional Neural Networks

Interview Essentials

CNN interviews focus on: (1) the convolution operation and output-dimension calculations, (2) the key innovations of classic architectures (AlexNet → VGG → ResNet), and (3) transfer learning strategies. Be ready to explain why convolution beats fully connected layers for images.

Why Convolutions?

FC layers ignore spatial structure: every pixel connects to every neuron. For a 224×224×3 image, a single FC layer with 1000 neurons needs:

224 \times 224 \times 3 \times 1000 \approx 150\text{M parameters}

That is 150M for a single layer: infeasible.

Convolutions exploit two key inductive biases of images:

| Property | Meaning | How Conv Exploits It |
|---|---|---|
| Translation invariance | A cat is a cat wherever it appears in the image | Same kernel applied everywhere (weight sharing) |
| Locality | Nearby pixels are more related | Small kernel size: only look at a local neighborhood |
| Compositionality | Complex features are built from simple ones | Hierarchical layers: edges → textures → parts → objects |

The Convolution Operation

A 2D convolution slides a kernel (filter) across input and computes element-wise products:

(I * K)(i, j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m, n)

For multi-channel input (e.g., RGB), each filter spans all input channels:

\text{output}(i, j) = \sum_{c=1}^{C_{\text{in}}} \sum_{m}\sum_{n} I_c(i+m, j+n) \cdot K_c(m, n) + b

Each filter produces one feature map (output channel). C_out filters → C_out feature maps.
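The multi-channel formula above can be sanity-checked with a naive triple loop against PyTorch's built-in conv (a sketch assuming PyTorch is available; note that deep learning "convolution" is actually cross-correlation, which is exactly what the formula computes):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_in, H, W, K = 3, 5, 5, 3
x = torch.randn(C_in, H, W)             # RGB-like input
weight = torch.randn(1, C_in, K, K)     # one filter spanning all input channels
bias = torch.randn(1)

# Naive implementation of the formula (valid padding, stride 1)
out = torch.zeros(H - K + 1, W - K + 1)
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = (x[:, i:i + K, j:j + K] * weight[0]).sum() + bias[0]

ref = F.conv2d(x.unsqueeze(0), weight, bias).squeeze()
assert torch.allclose(out, ref, atol=1e-5)
```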

Output Dimensions

For input size W, kernel size K, padding P, stride S:

W_{\text{out}} = \left\lfloor\frac{W - K + 2P}{S}\right\rfloor + 1

Quick Reference

| Config | Input 32×32, K=3 | Result |
|---|---|---|
| Valid (P=0, S=1) | (32−3+0)/1+1 = 30 | 30×30 |
| Same (P=1, S=1) | (32−3+2)/1+1 = 32 | 32×32 (size preserved) |
| Stride 2 (P=1, S=2) | (32−3+2)/2+1 = 16 | 16×16 (downsampled) |
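The formula is worth wrapping in a tiny helper; a minimal sketch, using the three configs from the table as checks:

```python
def conv_out(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size: floor((w - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1

assert conv_out(32, 3, p=0, s=1) == 30   # valid
assert conv_out(32, 3, p=1, s=1) == 32   # same
assert conv_out(32, 3, p=1, s=2) == 16   # stride-2 downsampling
```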

'Same' vs 'Valid' Padding

Valid (P=0): the output shrinks by K-1 pixels. Same (P = \lfloor K/2 \rfloor, S=1): the output matches the input size. Modern architectures use same padding almost everywhere.

Parameter Count

For a conv layer with a K × K kernel, C_in input channels, and C_out output channels:

\text{Parameters} = C_{\text{out}} \times (K \times K \times C_{\text{in}} + 1)

The +1 is the bias per filter.

vs FC: Conv(3→32, 3×3) = 32×(3×3×3+1) = 896 params. The equivalent FC layer on a 32×32×3 input: 32×32×3×32 = 98,304 weights, roughly 110× more.
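Both counts can be verified directly (assuming PyTorch; the FC side counts weights only, matching the 98,304 figure above):

```python
import torch.nn as nn

conv = nn.Conv2d(3, 32, kernel_size=3)
fc = nn.Linear(32 * 32 * 3, 32)

conv_params = sum(p.numel() for p in conv.parameters())
assert conv_params == 896            # 32 * (3*3*3 + 1), bias included
assert fc.weight.numel() == 98_304   # 32*32*3 * 32, weights only
```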

Key Building Blocks

Stride and Padding

  • Stride: The distance the kernel moves each step. Stride 2 halves the spatial dimensions (can replace pooling)
  • Padding: Zeros added around the input border. Without padding, every layer shrinks the spatial dimensions, limiting depth

Pooling

| Type | Operation | Use |
|---|---|---|
| Max Pooling | Take the maximum in each window | Keeps the strongest activation, discards position info |
| Average Pooling | Take the mean of each window | Smoother, retains more spatial info |
| Global Average Pooling (GAP) | Entire feature map → 1 value | Replaces FC layers, drastically reducing parameters |

GAP has replaced the final FC layers in modern architectures (ResNet, EfficientNet): average each feature map to a single value, then connect directly to the softmax classifier.
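A shape-level sketch (assuming PyTorch) of GAP replacing a big FC stack; the head shrinks to a single small Linear:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)                    # [B, C, H, W]: final conv feature maps
pooled = nn.AdaptiveAvgPool2d(1)(x).flatten(1)   # GAP → [8, 512]
logits = nn.Linear(512, 1000)(pooled)            # head → [8, 1000]
assert pooled.shape == (8, 512) and logits.shape == (8, 1000)
# Compare: flattening 512*7*7 into an FC of 1000 units needs ~25M weights;
# GAP + Linear needs 512*1000 + 1000 ≈ 0.5M.
```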

Receptive Field

The region of input that influences a particular output neuron. For stacked 3×3 convolutions:

| Layers | Receptive Field | Equivalent Kernel |
|---|---|---|
| 1 | 3×3 | — |
| 2 | 5×5 | Fewer params than a single 5×5 |
| 3 | 7×7 | Fewer params than a single 7×7 |

Each additional 3×3 layer grows the receptive field by 2. This is why stacking small kernels beats large kernels: the same receptive field with fewer parameters and more nonlinearity (VGGNet's core insight).
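The +2-per-layer rule and the parameter savings can be sketched in a few lines (plain Python; weight counts only, biases ignored):

```python
def receptive_field(layers: int, k: int = 3) -> int:
    """RF of `layers` stacked k x k stride-1 convs: 1 + layers * (k - 1)."""
    return 1 + layers * (k - 1)

assert [receptive_field(n) for n in (1, 2, 3)] == [3, 5, 7]

# Param comparison for C -> C channels (weights only):
C = 64
two_3x3 = 2 * 3 * 3 * C * C   # 18 C^2, same 5x5 receptive field
one_5x5 = 5 * 5 * C * C       # 25 C^2
assert two_3x3 < one_5x5
```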

1×1 Convolutions

Operates only across channels at each spatial position — like a per-pixel FC layer:

\text{output}(i, j, k) = \sum_{c=1}^{C_{\text{in}}} W_{k,c} \cdot \text{input}(i, j, c) + b_k

Uses:

  • Bottleneck: Reduce C_in to a smaller C_out → less computation
  • Channel mixing: Combine information across feature maps
  • Add nonlinearity: 1×1 conv + ReLU = per-pixel nonlinear transform

Popularized by Network-in-Network and used extensively in GoogLeNet/Inception and ResNet bottleneck blocks.
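A shape check (assuming PyTorch) makes the "per-pixel FC" reading concrete: channels change, spatial dimensions do not.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)
reduce = nn.Conv2d(256, 64, kernel_size=1)   # bottleneck: 256 → 64 channels
y = reduce(x)
assert y.shape == (1, 64, 28, 28)            # spatial size untouched
```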

Classic Architectures

Being able to state each architecture's key innovation and the problem it solved is a strong interview signal:

Architecture Timeline

| Year | Model | Key Innovation | Depth | ImageNet Top-5 Error |
|---|---|---|---|---|
| 1998 | LeNet-5 | Conv-Pool-Conv-Pool-FC pattern | 5 | — (MNIST) |
| 2012 | AlexNet | ReLU, Dropout, GPU training, data augmentation | 8 | 16.4% |
| 2014 | VGGNet | Only 3×3 convs, depth matters | 16-19 | 7.3% |
| 2014 | GoogLeNet | Inception module, 1×1 bottleneck | 22 | 6.7% |
| 2015 | ResNet | Skip connections | 50-152 | 3.6% |
| 2017 | DenseNet | Dense connections (each layer connects to all subsequent) | 121+ | — |
| 2019 | EfficientNet | Compound scaling (width × depth × resolution) | — | 2.9% |
| 2020 | ViT | Vision Transformer: patches as tokens | — | Competitive |

ResNet: The Game Changer

Introduced skip connections (residual connections):

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}

The network learns the residual \mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x} instead of the full mapping.

Why it works:

  • Gradient highway: The identity shortcut gives \partial \mathbf{y}/\partial \mathbf{x} a +1 term, so gradients never vanish through the shortcuts
  • Easy identity: If extra depth is unnecessary, \mathcal{F}(\mathbf{x}) = 0 is easy to learn → \mathbf{y} = \mathbf{x} (identity mapping)
  • Deeper = better: Before ResNet, adding depth to plain networks degraded performance (an optimization difficulty, not overfitting). Skip connections solved this

Why ResNet Changed Everything

Before ResNet, adding more layers to VGG-style plain networks hurt performance, not from overfitting but because optimization became too hard. Skip connections let gradients flow directly to early layers, making 100+ layer networks trainable.

ResNet Variants

| Block | Structure | When |
|---|---|---|
| Basic Block | 3×3 → BN → ReLU → 3×3 → BN → + → ReLU | ResNet-18, 34 |
| Bottleneck Block | 1×1 → 3×3 → 1×1 (reduce → compute → restore channels) | ResNet-50, 101, 152 |

The bottleneck uses a 1×1 conv to reduce channels first, runs the 3×3 conv at the low channel count, then restores channels with another 1×1 conv: less computation, so the network can go deeper.
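A minimal bottleneck-block sketch (assuming PyTorch; a hypothetical simplified version with stride 1 and matching channel counts, so the shortcut is pure identity):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce → 3x3 compute → 1x1 restore, plus identity shortcut."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # y = F(x) + x

block = Bottleneck(256)
x = torch.randn(2, 256, 14, 14)
assert block(x).shape == x.shape
```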

Vision Transformer (ViT)

The Vision Transformer splits the image into fixed-size patches and feeds each patch as a token into a standard Transformer:

  1. Split image into 16×16 patches
  2. Flatten each patch → linear projection to embedding
  3. Add positional embedding
  4. Feed into standard Transformer encoder
  5. Use [CLS] token output for classification
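Steps 1-3 are commonly implemented as a single Conv2d whose kernel and stride both equal the patch size; a sketch assuming PyTorch and ViT-Base sizes (16×16 patches, 768-dim embeddings):

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + project
x = torch.randn(1, 3, 224, 224)
tokens = embed(x).flatten(2).transpose(1, 2)  # [1, 196, 768]: 14x14 patch tokens
assert tokens.shape == (1, 196, 768)
```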

ViT vs CNN:

| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Translation invariance, locality | Minimal (learned from data) |
| Data efficiency | Better with small datasets | Needs large datasets (or pre-training) |
| Global context | Limited by receptive field | Attention sees all patches from layer 1 |
| Scalability | Saturates with more data | Keeps improving with more data |

ViT in Interviews

ViT does not make CNNs obsolete. CNNs still win on small-to-medium datasets (thanks to their inductive bias). ViT surpasses CNNs given enough data and compute (learned inductive bias beats hand-designed inductive bias). Hybrid approaches (ConvNeXt, CoAtNet) combine the strengths of both.

Data Augmentation

The workhorse of CNN regularization: apply random transformations to training images to increase the effective dataset size:

| Augmentation | How | Effect |
|---|---|---|
| Random crop | Randomly crop a region | Translation invariance |
| Horizontal flip | Mirror left-right | Doubles effective data |
| Color jitter | Random brightness, contrast, saturation | Robustness to lighting |
| Random rotation | Rotate by small angles | Rotation invariance |
| Cutout / Random erasing | Mask random patches | Forces model to use more features |
| Mixup | Blend two images + their labels | Smoother decision boundaries |
| CutMix | Paste patch from one image onto another | Combines Cutout + Mixup benefits |
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# ImageNet mean/std are standard for pre-trained models

Transfer Learning

Pre-trained CNNs (typically on ImageNet) serve as powerful feature extractors.

Why It Works

Lower layers learn universal features (edges, textures, colors) → transfer well across tasks. Higher layers learn task-specific features.

Two Strategies

| Strategy | How | When |
|---|---|---|
| Feature extraction | Freeze conv base, train only the classifier head | Small dataset, similar domain |
| Fine-tuning | Unfreeze top layers, train with small lr | Large dataset or different domain |

Transfer Learning Decision Matrix

| Your Dataset | Similar Domain | Different Domain |
|---|---|---|
| Small | Feature extraction only | Fine-tune top layers + aggressive augmentation |
| Large | Fine-tune entire network (small lr) | Fine-tune entire network (small lr) |
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet50 (weights API; pretrained=True is deprecated)
model = models.resnet50(weights="IMAGENET1K_V2")

# Strategy 1: Feature extraction — freeze all conv layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head for your task (num_classes = your class count)
model.fc = nn.Linear(2048, num_classes)  # only this trains

# Strategy 2: Fine-tune — unfreeze top layers
for param in model.layer4.parameters():
    param.requires_grad = True
# Use smaller lr for pre-trained layers, larger for new head

Transfer Learning in Interviews

When an interviewer asks "you have 500 medical images to classify", the answer is almost always transfer learning. Start from an ImageNet pre-trained ResNet/EfficientNet, freeze the conv layers, and train only the final classifier. 500 images is not enough to train a CNN from scratch, but pre-trained features (edges, textures) remain useful for medical images.

Real-World Use Cases

Case 1: Credit Card Fraud — Where Does a CNN Fit?

Intuitively, credit card fraud has nothing to do with images, but CNNs can be used for:

  • Transaction sequence as image: Convert a user's transaction sequence (amount × time) into a 2D representation, then use a 1D CNN to capture temporal patterns
  • Signature verification: CNNs compare signature images
  • Document fraud: CNNs detect forged documents (altered text, inconsistent fonts)

Interview follow-up: "Would you use a CNN on tabular fraud data?" Usually not. For tabular data, GBM > MLP > CNN. The CNN's inductive biases (translation invariance, locality) buy nothing on tabular data.

Case 2: Recommender Systems — Visual Features

CNNs extract visual features for recommender systems:

  • E-commerce: product image → CNN embedding → recommend by visual similarity ("items in a style similar to what you viewed")
  • Fashion: CNNs learn color, pattern, and style → visual recommendations
  • Real estate: house photos → CNN features feed into the pricing model

Case 3: Medical Imaging — Transfer Learning

Medical imaging is the textbook transfer learning scenario:

  • Small datasets (hundreds to a few thousand labeled images)
  • But ImageNet pre-trained features (edges, textures, shapes) are still useful
  • Fine-tuning the top layers + heavy augmentation usually works well
  • Interpretability is needed → Grad-CAM shows which regions the model attends to

Hands-on: CNN in PyTorch

Complete CNN Architecture

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: [B, 3, 32, 32] → [B, 32, 16, 16]
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: → [B, 64, 8, 8]
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: → [B, 128, 4, 4]
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # GAP: → [B, 128, 1, 1]
            nn.Flatten(),             # → [B, 128]
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

Transfer Learning with ResNet

import torchvision.models as models

# Pre-trained ResNet50
model = models.resnet50(weights="IMAGENET1K_V2")

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier
model.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(2048, num_classes),
)
# Only model.fc parameters will be updated

Interview Signals

What interviewers listen for:

  • You can compute output dimensions and parameter counts
  • You know each classic architecture's key innovation
  • You can explain why skip connections solved the training problem in deep networks
  • You know the transfer learning strategies and the decision matrix
  • You can compare the tradeoffs of CNNs vs ViT

Practice

Flashcards

How to compute output spatial dimension of a conv layer?

W_out = floor((W_in - K + 2P) / S) + 1. Same padding (S=1): P = floor(K/2) → W_out = W_in. You must be able to answer this instantly in an interview.

Quiz

32×32 input, 5×5 kernel, stride 1, no padding. Output size? (Answer: (32−5)/1+1 = 28 → 28×28)
