Convolutional Neural Networks

Interview Essentials

CNN interviews focus on: (1) the convolution operation and output-dimension calculations, (2) the key innovations of classic architectures (AlexNet → VGG → ResNet), and (3) transfer learning strategies. Be ready to explain why convolution beats fully connected layers for images.

Why Convolutions?

FC layers ignore spatial structure: every pixel connects to every neuron. For a 224×224×3 image, a single FC layer with 1000 neurons needs:

224 \times 224 \times 3 \times 1000 \approx 150\text{M parameters}

That is 150M for a single layer: infeasible.

Convolutions exploit two key inductive biases of images:

| Property | Meaning | How Conv Exploits It |
|---|---|---|
| Translation invariance | A cat is a cat wherever it appears in the image | Same kernel applied everywhere (weight sharing) |
| Locality | Nearby pixels are more related | Small kernel size: only look at a local neighborhood |
| Compositionality | Complex features are built from simple ones | Hierarchical layers: edges → textures → parts → objects |

The Convolution Operation

A 2D convolution slides a kernel (filter) across input and computes element-wise products:

(I * K)(i, j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m, n)

For multi-channel input (e.g., RGB), each filter spans all input channels:

\text{output}(i, j) = \sum_{c=1}^{C_{\text{in}}} \sum_{m}\sum_{n} I_c(i+m, j+n) \cdot K_c(m, n) + b

Each filter produces one feature map (output channel). C_out filters → C_out feature maps.
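The multi-channel formula above can be sanity-checked with a naive triple loop against PyTorch's built-in conv (a sketch assuming PyTorch is available; note that deep learning "convolution" is actually cross-correlation, which is exactly what the formula computes):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_in, H, W, K = 3, 5, 5, 3
x = torch.randn(C_in, H, W)             # RGB-like input
weight = torch.randn(1, C_in, K, K)     # one filter spanning all input channels
bias = torch.randn(1)

# Naive implementation of the formula (valid padding, stride 1)
out = torch.zeros(H - K + 1, W - K + 1)
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = (x[:, i:i + K, j:j + K] * weight[0]).sum() + bias[0]

ref = F.conv2d(x.unsqueeze(0), weight, bias).squeeze()
assert torch.allclose(out, ref, atol=1e-5)
```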

Output Dimensions

For input size W, kernel size K, padding P, stride S:

W_{\text{out}} = \left\lfloor\frac{W - K + 2P}{S}\right\rfloor + 1

Quick Reference

| Config | Input 32×32, K=3 | Result |
|---|---|---|
| Valid (P=0, S=1) | (32−3+0)/1+1 = 30 | 30×30 |
| Same (P=1, S=1) | (32−3+2)/1+1 = 32 | 32×32 (size preserved) |
| Stride 2 (P=1, S=2) | (32−3+2)/2+1 = 16 | 16×16 (downsampled) |
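The formula is worth wrapping in a tiny helper; a minimal sketch, using the three configs from the table as checks:

```python
def conv_out(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size: floor((w - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1

assert conv_out(32, 3, p=0, s=1) == 30   # valid
assert conv_out(32, 3, p=1, s=1) == 32   # same
assert conv_out(32, 3, p=1, s=2) == 16   # stride-2 downsampling
```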

'Same' vs 'Valid' Padding

Valid (P=0): the output shrinks by K-1 pixels. Same (P = \lfloor K/2 \rfloor, S=1): the output matches the input size. Modern architectures use same padding almost everywhere.

Parameter Count

For a conv layer with a K × K kernel, C_in input channels, and C_out output channels:

\text{Parameters} = C_{\text{out}} \times (K \times K \times C_{\text{in}} + 1)

The +1 is the bias per filter.

vs FC: Conv(3→32, 3×3) = 32×(3×3×3+1) = 896 params. The equivalent FC layer on a 32×32×3 input: 32×32×3×32 = 98,304 weights, roughly 110× more.
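Both counts can be verified directly (assuming PyTorch; the FC side counts weights only, matching the 98,304 figure above):

```python
import torch.nn as nn

conv = nn.Conv2d(3, 32, kernel_size=3)
fc = nn.Linear(32 * 32 * 3, 32)

conv_params = sum(p.numel() for p in conv.parameters())
assert conv_params == 896            # 32 * (3*3*3 + 1), bias included
assert fc.weight.numel() == 98_304   # 32*32*3 * 32, weights only
```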

Key Building Blocks

Stride and Padding

  • Stride: The distance the kernel moves each step. Stride 2 halves the spatial dimensions (can replace pooling)
  • Padding: Zeros added around the input border. Without padding, every layer shrinks the spatial dimensions, limiting depth

Pooling

| Type | Operation | Use |
|---|---|---|
| Max Pooling | Take the maximum in each window | Keeps the strongest activation, discards position info |
| Average Pooling | Take the mean of each window | Smoother, retains more spatial info |
| Global Average Pooling (GAP) | Entire feature map → 1 value | Replaces FC layers, drastically reducing parameters |

GAP has replaced the final FC layers in modern architectures (ResNet, EfficientNet): average each feature map to a single value, then connect directly to the softmax classifier.
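A shape-level sketch (assuming PyTorch) of GAP replacing a big FC stack; the head shrinks to a single small Linear:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)                    # [B, C, H, W]: final conv feature maps
pooled = nn.AdaptiveAvgPool2d(1)(x).flatten(1)   # GAP → [8, 512]
logits = nn.Linear(512, 1000)(pooled)            # head → [8, 1000]
assert pooled.shape == (8, 512) and logits.shape == (8, 1000)
# Compare: flattening 512*7*7 into an FC of 1000 units needs ~25M weights;
# GAP + Linear needs 512*1000 + 1000 ≈ 0.5M.
```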

Receptive Field

The region of input that influences a particular output neuron. For stacked 3×3 convolutions:

| Layers | Receptive Field | Equivalent Kernel |
|---|---|---|
| 1 | 3×3 | — |
| 2 | 5×5 | Fewer params than a single 5×5 |
| 3 | 7×7 | Fewer params than a single 7×7 |

Each additional 3×3 layer grows the receptive field by 2. This is why stacking small kernels beats large kernels: the same receptive field with fewer parameters and more nonlinearity (VGGNet's core insight).
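The +2-per-layer rule and the parameter savings can be sketched in a few lines (plain Python; weight counts only, biases ignored):

```python
def receptive_field(layers: int, k: int = 3) -> int:
    """RF of `layers` stacked k x k stride-1 convs: 1 + layers * (k - 1)."""
    return 1 + layers * (k - 1)

assert [receptive_field(n) for n in (1, 2, 3)] == [3, 5, 7]

# Param comparison for C -> C channels (weights only):
C = 64
two_3x3 = 2 * 3 * 3 * C * C   # 18 C^2, same 5x5 receptive field
one_5x5 = 5 * 5 * C * C       # 25 C^2
assert two_3x3 < one_5x5
```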

1×1 Convolutions

Operates only across channels at each spatial position — like a per-pixel FC layer:

\text{output}(i, j, k) = \sum_{c=1}^{C_{\text{in}}} W_{k,c} \cdot \text{input}(i, j, c) + b_k

Uses:

  • Bottleneck: Reduce C_in to a smaller C_out → less computation
  • Channel mixing: Combine information across feature maps
  • Add nonlinearity: 1×1 conv + ReLU = per-pixel nonlinear transform

Popularized by Network-in-Network and used extensively in GoogLeNet/Inception and ResNet bottleneck blocks.
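A shape check (assuming PyTorch) makes the "per-pixel FC" reading concrete: channels change, spatial dimensions do not.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)
reduce = nn.Conv2d(256, 64, kernel_size=1)   # bottleneck: 256 → 64 channels
y = reduce(x)
assert y.shape == (1, 64, 28, 28)            # spatial size untouched
```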

Classic Architectures

Being able to state each architecture's key innovation and the problem it solved is a strong interview signal:

Architecture Timeline

| Year | Model | Key Innovation | Depth | ImageNet Top-5 Error |
|---|---|---|---|---|
| 1998 | LeNet-5 | Conv-Pool-Conv-Pool-FC pattern | 5 | — (MNIST) |
| 2012 | AlexNet | ReLU, Dropout, GPU training, data augmentation | 8 | 16.4% |
| 2014 | VGGNet | Only 3×3 convs, depth matters | 16-19 | 7.3% |
| 2014 | GoogLeNet | Inception module, 1×1 bottleneck | 22 | 6.7% |
| 2015 | ResNet | Skip connections | 50-152 | 3.6% |
| 2017 | DenseNet | Dense connections (each layer connects to all subsequent) | 121+ | — |
| 2019 | EfficientNet | Compound scaling (width × depth × resolution) | — | 2.9% |
| 2020 | ViT | Vision Transformer: patches as tokens | — | Competitive |

ResNet: The Game Changer

Introduced skip connections (residual connections):

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}

The network learns the residual \mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x} instead of the full mapping.

Why it works:

  • Gradient highway: The identity shortcut gives \partial \mathbf{y}/\partial \mathbf{x} a +1 term, so gradients never vanish through the shortcuts
  • Easy identity: If extra depth is unnecessary, \mathcal{F}(\mathbf{x}) = 0 is easy to learn → \mathbf{y} = \mathbf{x} (identity mapping)
  • Deeper = better: Before ResNet, adding depth to plain networks degraded performance (an optimization difficulty, not overfitting). Skip connections solved this

Why ResNet Changed Everything

Before ResNet, adding more layers to VGG-style plain networks hurt performance, not from overfitting but because optimization became too hard. Skip connections let gradients flow directly to early layers, making 100+ layer networks trainable.

ResNet Variants

| Block | Structure | When |
|---|---|---|
| Basic Block | 3×3 → BN → ReLU → 3×3 → BN → + → ReLU | ResNet-18, 34 |
| Bottleneck Block | 1×1 → 3×3 → 1×1 (reduce → compute → restore channels) | ResNet-50, 101, 152 |

The bottleneck uses a 1×1 conv to reduce channels first, runs the 3×3 conv at the low channel count, then restores channels with another 1×1 conv: less computation, so the network can go deeper.
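A minimal bottleneck-block sketch (assuming PyTorch; a hypothetical simplified version with stride 1 and matching channel counts, so the shortcut is pure identity):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce → 3x3 compute → 1x1 restore, plus identity shortcut."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # y = F(x) + x

block = Bottleneck(256)
x = torch.randn(2, 256, 14, 14)
assert block(x).shape == x.shape
```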

Vision Transformer (ViT)

The Vision Transformer splits the image into fixed-size patches and feeds each patch as a token into a standard Transformer:

  1. Split image into 16×16 patches
  2. Flatten each patch → linear projection to embedding
  3. Add positional embedding
  4. Feed into standard Transformer encoder
  5. Use [CLS] token output for classification
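Steps 1-3 are commonly implemented as a single Conv2d whose kernel and stride both equal the patch size; a sketch assuming PyTorch and ViT-Base sizes (16×16 patches, 768-dim embeddings):

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + project
x = torch.randn(1, 3, 224, 224)
tokens = embed(x).flatten(2).transpose(1, 2)  # [1, 196, 768]: 14x14 patch tokens
assert tokens.shape == (1, 196, 768)
```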

ViT vs CNN:

| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Translation invariance, locality | Minimal (learned from data) |
| Data efficiency | Better with small datasets | Needs large datasets (or pre-training) |
| Global context | Limited by receptive field | Attention sees all patches from layer 1 |
| Scalability | Saturates with more data | Keeps improving with more data |

ViT in Interviews

ViT does not make CNNs obsolete. CNNs still win on small-to-medium datasets (thanks to their inductive bias). ViT surpasses CNNs given enough data and compute (learned inductive bias beats hand-designed inductive bias). Hybrid approaches (ConvNeXt, CoAtNet) combine the strengths of both.

Data Augmentation

The workhorse of CNN regularization: apply random transformations to training images to increase the effective dataset size:

| Augmentation | How | Effect |
|---|---|---|
| Random crop | Randomly crop a region | Translation invariance |
| Horizontal flip | Mirror left-right | Doubles effective data |
| Color jitter | Random brightness, contrast, saturation | Robustness to lighting |
| Random rotation | Rotate by small angles | Rotation invariance |
| Cutout / Random erasing | Mask random patches | Forces model to use more features |
| Mixup | Blend two images + their labels | Smoother decision boundaries |
| CutMix | Paste patch from one image onto another | Combines Cutout + Mixup benefits |
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# ImageNet mean/std are standard for pre-trained models

Transfer Learning

Pre-trained CNNs (typically on ImageNet) serve as powerful feature extractors.

Why It Works

Lower layers learn universal features (edges, textures, colors) → transfer well across tasks. Higher layers learn task-specific features.

Two Strategies

| Strategy | How | When |
|---|---|---|
| Feature extraction | Freeze conv base, train only the classifier head | Small dataset, similar domain |
| Fine-tuning | Unfreeze top layers, train with small lr | Large dataset or different domain |

Transfer Learning Decision Matrix

| Your Dataset | Similar Domain | Different Domain |
|---|---|---|
| Small | Feature extraction only | Fine-tune top layers + aggressive augmentation |
| Large | Fine-tune entire network (small lr) | Fine-tune entire network (small lr) |
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet50 (weights API; pretrained=True is deprecated)
model = models.resnet50(weights="IMAGENET1K_V2")

# Strategy 1: Feature extraction — freeze all conv layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head for your task (num_classes = your class count)
model.fc = nn.Linear(2048, num_classes)  # only this trains

# Strategy 2: Fine-tune — unfreeze top layers
for param in model.layer4.parameters():
    param.requires_grad = True
# Use smaller lr for pre-trained layers, larger for new head

Transfer Learning in Interviews

When an interviewer asks "you have 500 medical images to classify", the answer is almost always transfer learning. Start from an ImageNet pre-trained ResNet/EfficientNet, freeze the conv layers, and train only the final classifier. 500 images is not enough to train a CNN from scratch, but pre-trained features (edges, textures) remain useful for medical images.

Real-World Use Cases

Case 1: Credit Card Fraud — Where Does a CNN Fit?

Intuitively, credit card fraud has nothing to do with images, but CNNs can be used for:

  • Transaction sequence as image: Convert a user's transaction sequence (amount × time) into a 2D representation, then use a 1D CNN to capture temporal patterns
  • Signature verification: CNNs compare signature images
  • Document fraud: CNNs detect forged documents (altered text, inconsistent fonts)

Interview follow-up: "Would you use a CNN on tabular fraud data?" Usually not. For tabular data, GBM > MLP > CNN. The CNN's inductive biases (translation invariance, locality) buy nothing on tabular data.

Case 2: Recommender Systems — Visual Features

CNNs extract visual features for recommender systems:

  • E-commerce: product image → CNN embedding → recommend by visual similarity ("items in a style similar to what you viewed")
  • Fashion: CNNs learn color, pattern, and style → visual recommendations
  • Real estate: house photos → CNN features feed into the pricing model

Case 3: Medical Imaging — Transfer Learning

Medical imaging is the textbook transfer learning scenario:

  • Small datasets (hundreds to a few thousand labeled images)
  • But ImageNet pre-trained features (edges, textures, shapes) are still useful
  • Fine-tuning the top layers + heavy augmentation usually works well
  • Interpretability is needed → Grad-CAM shows which regions the model attends to

Hands-on: CNN in PyTorch

Complete CNN Architecture

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: [B, 3, 32, 32] → [B, 32, 16, 16]
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: → [B, 64, 8, 8]
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: → [B, 128, 4, 4]
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # GAP: → [B, 128, 1, 1]
            nn.Flatten(),             # → [B, 128]
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

Transfer Learning with ResNet

import torchvision.models as models

# Pre-trained ResNet50
model = models.resnet50(weights="IMAGENET1K_V2")

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier
model.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(2048, num_classes),
)
# Only model.fc parameters will be updated

Interview Signals

What interviewers listen for:

  • You can compute output dimensions and parameter counts
  • You know each classic architecture's key innovation
  • You can explain why skip connections solved the training problem in deep networks
  • You know the transfer learning strategies and the decision matrix
  • You can compare the tradeoffs of CNNs vs ViT

Practice

Flashcards

How to compute output spatial dimension of a conv layer?

W_out = floor((W_in - K + 2P) / S) + 1. Same padding (S=1): P = floor(K/2) → W_out = W_in. You must be able to answer this instantly in an interview.

Quiz

32×32 input, 5×5 kernel, stride 1, no padding. Output size? (Answer: (32−5)/1+1 = 28 → 28×28)
