Support Vector Machines

Interview Context

In practice, SVM has fallen behind tree-based methods for tabular data, but it remains a high-frequency interview topic because it tests your understanding of optimization, geometry, and the kernel trick. Know the theory, and also know when you should (and should not) use SVM.

Maximum Margin Classifier

The core idea of SVM: find the hyperplane that separates two classes with the largest possible margin.

Given training data $\{(\mathbf{x}_i, y_i)\}$ where $y_i \in \{-1, +1\}$, the decision boundary is:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

The margin is the distance between the two parallel hyperplanes $\mathbf{w}^\top \mathbf{x} + b = \pm 1$:

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$

Maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$. Intuitively, a wider margin makes the model more robust to noise, which means better generalization.
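That margin expression follows from the standard point-to-hyperplane distance; a short derivation:

```latex
% Distance from a point x_0 to the hyperplane w^T x + b = 0:
d(\mathbf{x}_0) = \frac{|\mathbf{w}^\top \mathbf{x}_0 + b|}{\|\mathbf{w}\|}
% The planes w^T x + b = +1 and w^T x + b = -1 each sit at distance
% 1/\|w\| from the decision boundary, so the gap between them is:
\text{Margin} = \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}
```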

Hard Margin SVM (Linearly Separable Case)

$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \;\; \forall i$$

This is a convex quadratic program: it has a unique global optimum and no local-minima issues.

The constraint ensures that every point is correctly classified and lies outside the margin.

Support Vectors

Only the data points lying exactly on the margin boundary ($y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$) determine the decision boundary. These are the support vectors.

All other points can be moved or removed without changing the solution; this is SVM's distinctive sparsity property.

Why This Matters

Because SVM depends only on support vectors: (1) at prediction time you only need to store the support vectors, not the entire training set (memory efficient), and (2) the model is robust to outliers far from the boundary. But it also means a noisy point sitting right on the boundary has an outsized influence on the model.
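The sparsity property can be checked directly: refitting on only the support vectors should recover the same hyperplane. A minimal sketch, assuming a synthetic two-blob dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters (toy data)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Refit using ONLY the support vectors
sv_idx = svm.support_
svm_sv = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

print(f"Support vectors: {len(sv_idx)} of {len(X)} points")
# The hyperplane is unchanged (up to solver tolerance)
print(np.allclose(svm.coef_, svm_sv.coef_, atol=1e-3))
```

This is the KKT argument in miniature: the dropped points had zero Lagrange multipliers, so removing their constraints leaves the optimum intact.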

Soft Margin SVM (C Parameter)

Real data is rarely linearly separable. Soft margin introduces slack variables $\xi_i \geq 0$ that allow misclassifications at a penalty:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

Understanding C

$C$ controls the tradeoff between margin width and misclassification:

| $C$ | Margin | Misclassifications | Behavior |
|---|---|---|---|
| Very large | Narrow | Almost none | Like hard margin → overfitting risk |
| Medium | Balanced | Some allowed | Typical operating point |
| Very small | Wide | Many allowed | Strong regularization → underfitting risk |

C Is Inverse Regularization

Note that $C$ runs in the opposite direction from Ridge/Lasso's $\lambda$: larger $C$ means weaker regularization (more penalty on errors, less on $\|\mathbf{w}\|$), while larger $\lambda$ means stronger regularization. This is a common point of confusion in interviews.

Equivalence: SVM's $C$ roughly corresponds to $1/\lambda$ in Ridge/Lasso.

Hinge Loss Interpretation

The soft-margin SVM can be rewritten as unconstrained optimization with the hinge loss:

$$\min_{\mathbf{w}, b} \sum_{i=1}^{n} \max(0, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)) + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

where $\lambda = 1/C$.

Hinge loss: $\max(0, 1 - y \cdot f(x))$

  • Correctly classified with margin ≥ 1 → loss = 0
  • Correctly classified but inside margin → loss > 0
  • Misclassified → loss grows linearly

Compared with the logistic loss (cross-entropy): hinge loss is exactly zero outside the margin (well-classified points are not penalized further), while logistic loss always contributes a small gradient.
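A small numeric comparison makes the difference concrete (illustrative margin values chosen by hand):

```python
import numpy as np

# Margins m = y * f(x): negative = misclassified, in (0, 1) = inside margin,
# >= 1 = correctly classified with full margin
m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])

hinge = np.maximum(0.0, 1.0 - m)
logistic = np.log1p(np.exp(-m))  # log(1 + e^{-m})

for mi, h, l in zip(m, hinge, logistic):
    print(f"m={mi:+.1f}  hinge={h:.4f}  logistic={l:.4f}")

# hinge is exactly 0 once m >= 1; logistic stays strictly positive forever
```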

The Kernel Trick

Why Kernels Are Needed

A linear SVM can only draw straight lines/hyperplanes as decision boundaries, but many real-world datasets need non-linear ones.

Naive approach: manually map the features into a higher-dimensional space (e.g., add $x_1^2, x_2^2, x_1 x_2$), then run a linear SVM there. Problem: the cost of computing an explicit high-dimensional mapping explodes with dimension.

Kernel trick: never perform the mapping explicitly. You only need the dot products in the mapped space, and certain kernel functions compute those dot products implicitly.

Dual Formulation

The dual form of the SVM, using Lagrange multipliers $\alpha_i$:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$.

Key observation: the data appears only through dot products $\mathbf{x}_i^\top \mathbf{x}_j$. Replace the dot product with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ and you are running the SVM in the high-dimensional space.
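A concrete sketch of that observation: the dual only ever touches the Gram matrix of pairwise products, so kernelizing is literally swapping one matrix for another (`rbf_kernel` is sklearn's pairwise helper; the data here is random):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# A linear SVM's dual sees this matrix of dot products ...
gram_linear = X @ X.T

# ... a kernel SVM substitutes K(x_i, x_j) entries instead
gamma = 0.5
gram_rbf = rbf_kernel(X, gamma=gamma)

# Spot-check one entry against the RBF formula exp(-gamma ||x_i - x_j||^2)
manual = np.exp(-gamma * np.sum((X[0] - X[1]) ** 2))
print(np.isclose(gram_rbf[0, 1], manual))   # True
print(np.allclose(np.diag(gram_rbf), 1.0))  # K(x, x) = 1 for RBF
```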

Kernel Functions

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$$

This computes the dot product in the high-dimensional space without explicitly computing $\phi(\mathbf{x})$.

| Kernel | Formula | Implicit Feature Space | Best For |
|---|---|---|---|
| Linear | $\mathbf{x}_i^\top \mathbf{x}_j$ | Original space | High-dim sparse data (text) |
| Polynomial | $(\gamma \mathbf{x}_i^\top \mathbf{x}_j + r)^d$ | All polynomials up to degree $d$ | Low-dim data with polynomial relationships |
| RBF (Gaussian) | $\exp(-\gamma\Vert\mathbf{x}_i - \mathbf{x}_j\Vert^2)$ | Infinite-dimensional | General nonlinear, default choice |
| Sigmoid | $\tanh(\gamma \mathbf{x}_i^\top \mathbf{x}_j + r)$ | Not always a valid PD kernel | Rarely used |
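To see why the implicit feature space is meaningful, here is a hand-checkable sketch for the homogeneous degree-2 polynomial kernel in 2D (taking $\gamma = 1$, $r = 0$), whose explicit map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input."""
    v1, v2 = v
    return np.array([v1**2, np.sqrt(2) * v1 * v2, v2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_val = (x @ z) ** 2        # computed in the original 2-D space
explicit_val = phi(x) @ phi(z)   # same number, computed in the mapped 3-D space

print(kernel_val, explicit_val)  # both 1.0: (1*3 + 2*(-1))^2
```

The kernel gets the same answer with a 2-D dot product that the explicit map needs 3 dimensions for; with RBF, the mapped space is infinite-dimensional, so the shortcut is not optional.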

Understanding RBF Gamma

$\gamma$ controls the "radius of influence" of each support vector:

| $\gamma$ | Influence Radius | Decision Boundary | Risk |
|---|---|---|---|
| Large | Local only | Complex, follows each point | Overfitting |
| Small | Wide reach | Smooth | Underfitting |

Intuition: large $\gamma$ means each support vector influences only a small neighborhood, so the boundary becomes very wiggly. Small $\gamma$ means each support vector's influence reaches far, so the boundary stays smooth.

Kernel Selection Heuristic

Use a linear kernel when $p \gg n$ (text classification, genomics: thousands of features but only hundreds of samples). Use RBF when $n$ is manageable and the relationship is non-linear (the default starting point). Polynomial kernels are rarely used (numerically unstable at high degree).

Mercer's Condition

Not every function can be a valid kernel. A symmetric function $K$ is a valid kernel if and only if the resulting Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite for any set of points.

Interviewers rarely ask for the derivation, but knowing that Mercer's condition exists signals that you understand kernels are not arbitrary functions.
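The condition can be probed empirically by eigendecomposing Gram matrices on random data (a sketch; the `sigmoid_kernel` parameters below are arbitrary choices, not canonical ones):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 4))

# Valid kernel: all Gram-matrix eigenvalues are >= 0 (up to float error)
eig_rbf = np.linalg.eigvalsh(rbf_kernel(X, gamma=1.0))
print("RBF min eigenvalue:", eig_rbf.min())

# The sigmoid/tanh kernel is not PSD for all parameter choices, so its
# Gram matrix may have negative eigenvalues
eig_sig = np.linalg.eigvalsh(sigmoid_kernel(X, gamma=1.0, coef0=-1.0))
print("Sigmoid min eigenvalue:", eig_sig.min())
```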

SVM for Regression (SVR)

Support Vector Regression uses epsilon-insensitive loss:

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

subject to:

$$|y_i - \mathbf{w}^\top \mathbf{x}_i - b| \leq \epsilon + \xi_i$$

Errors inside the $\epsilon$-tube are not penalized; only points outside the tube become support vectors. Unlike MAE, SVR's $\epsilon$ provides a "don't care" zone.
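A minimal SVR sketch on synthetic data showing the tube effect: widening $\epsilon$ swallows more points into the zero-loss zone, so typically fewer support vectors remain.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

for eps in (0.05, 0.2, 0.5):
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(svr.support_)} / {len(X)} support vectors")
# points strictly inside the tube carry zero loss and are not support vectors
```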

SVM vs Other Methods

| Scenario | Best Choice | Why |
|---|---|---|
| High-dimensional, sparse ($p \gg n$) | Linear SVM | Effective margins in high-dim, fast with LinearSVC |
| Tabular data, moderate $n$ | Gradient boosting | Handles mixed types, interactions, missing values natively |
| Small dataset, non-linear | SVM with RBF | Kernel trick is powerful with limited data |
| Very large $n$ (millions) | Tree-based or linear | Kernel SVM training is $O(n^2)$ to $O(n^3)$, too slow |
| Need probability outputs | Logistic regression or trees | SVM outputs are not probabilities (Platt scaling is a workaround) |
| Need interpretability | Logistic regression or trees | SVM's decision function is unintuitive (especially kernel SVM) |

Computational Complexity

| Operation | Complexity | Note |
|---|---|---|
| Training (kernel SVM) | $O(n^2)$ to $O(n^3)$ | QP solver dominates |
| Training (LinearSVC) | $O(n \cdot p)$ | Uses coordinate descent, scales well |
| Prediction (kernel) | $O(n_{sv} \cdot p)$ | $n_{sv}$ = number of support vectors |
| Prediction (linear) | $O(p)$ | Just a dot product with $\mathbf{w}$ |

When an interviewer asks about SVM scalability, distinguishing kernel SVM from linear SVM matters. sklearn.svm.LinearSVC is much faster than SVC(kernel='linear') because it uses a different solver.
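A quick sanity check of that claim (sketch; the dataset is synthetic): both estimators fit a linear decision function, though LinearSVC uses squared hinge loss and penalizes the intercept by default, so coefficients can differ slightly.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# liblinear coordinate descent vs libsvm's SMO-style QP solver
lin = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000)).fit(X, y)
ker = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X, y)

print(f"LinearSVC          train acc: {lin.score(X, y):.3f}")
print(f"SVC(kernel=linear) train acc: {ker.score(X, y):.3f}")
# at large n, LinearSVC is dramatically faster; prefer it for linear models
```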

Feature Scaling Is Critical

The SVM margin is defined in terms of Euclidean distance. If one feature ranges 0-1000 and another 0-1, the large feature dominates the distance and the margin is distorted. Always standardize before SVM (zero mean, unit variance). Tree-based methods don't need this, because their splits are threshold-based.
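A sketch of the effect on synthetic data, with one feature inflated by 1000 to mimic mismatched units:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 1000  # this feature now dominates every Euclidean distance

acc_raw = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
acc_std = cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")), X, y, cv=5
).mean()
print(f"No scaling:   {acc_raw:.3f}")
print(f"Standardized: {acc_std:.3f}")
# standardization typically recovers accuracy the inflated feature destroyed
```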

Real-World Use Cases

Case 1: Text Classification

SVM was once the king of NLP (in the pre-deep-learning era).

Why SVM works well for text:

  • Text features (TF-IDF) are typically high-dimensional (10K+ features) and sparse
  • Linear SVM is effective in high-dimensional spaces and resists overfitting when $p \gg n$
  • Margin maximization generalizes well in sparse spaces

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

text_clf = make_pipeline(
    TfidfVectorizer(max_features=10000),
    LinearSVC(C=1.0)  # not SVC(kernel='linear') — much faster!
)
text_clf.fit(X_train_text, y_train)
# LinearSVC scales to large text datasets; kernel SVM would be too slow

Interview follow-up: "Why not use a kernel SVM for text classification?" Because $n$ is usually large (tens of thousands to millions of documents), kernel SVM's $O(n^2)$ training is infeasible. Besides, a linear SVM is already good enough in high-dimensional TF-IDF space.

Case 2: Credit Card Fraud Detection

Challenge: SVM has no native support for class imbalance.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Option 1: class_weight
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, class_weight="balanced")
)
# class_weight='balanced' → C_fraud = C × (n_total / (2 × n_fraud))

# Option 2: Need probabilities? Use Platt scaling
svm_prob = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, probability=True)  # enables Platt scaling
)
# .predict_proba() available but adds computational cost

Interview follow-up: "Is SVM's output a probability?" No. The SVM output is a signed distance from the hyperplane. Setting probability=True enables Platt scaling (fitting a logistic regression on the SVM outputs) to get approximate probabilities, but that is post-hoc calibration, not native.

Case 3: Anomaly Detection (One-Class SVM)

One-Class SVM is used for anomaly detection: train only on normal data, then flag points that differ from the normal data as outliers.

from sklearn.svm import OneClassSVM

# Train on normal transactions only
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
oc_svm.fit(X_normal)  # only normal data!

# Predict: +1 = normal, -1 = anomaly
predictions = oc_svm.predict(X_test)
# nu ≈ expected proportion of outliers

The nu parameter ≈ the expected fraction of outliers. One-Class SVM draws a tight boundary around the normal data; anything outside that boundary is an anomaly. This suits scenarios where you only have normal data (no labeled anomalies).

Hands-on: SVM in Python

SVM Kernels & C Parameter

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)  # fixed seed for reproducibility

configs = [
    ("Linear, C=1",     SVC(kernel="linear", C=1)),
    ("RBF, C=1",        SVC(kernel="rbf", C=1, gamma="scale")),
    ("RBF, C=100",      SVC(kernel="rbf", C=100, gamma="scale")),
    ("RBF, C=0.01",     SVC(kernel="rbf", C=0.01, gamma="scale")),
]
for name, svm in configs:
    pipe = make_pipeline(StandardScaler(), svm)  # always scale for SVM!
    scores = cross_val_score(pipe, X, y, cv=5)
    pipe.fit(X, y)
    n_sv = pipe.named_steps["svc"].n_support_.sum()
    print(f"{name}: CV Acc={scores.mean():.3f}, Support Vectors={n_sv}")

# Linear fails on non-linear data (moons)
# Large C → fewer SVs, tighter margin, overfitting risk
# Small C → more SVs, wider margin, more regularization

Grid Search for C and Gamma

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", "auto", 0.01, 0.1, 1],
    "svc__kernel": ["rbf"],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best F1: {grid.best_score_:.4f}")

Interactive: Decision Boundaries

Experiment with SVM and other classifiers to see how decision boundaries change:


Interview Signals

What interviewers listen for:

  • You can explain both the intuition and the math of margin maximization
  • You know $C$ and $\lambda$ run in opposite directions (large $C$ = weak regularization)
  • You can explain the kernel trick in non-technical language
  • You know SVM's scalability limits and when to use LinearSVC
  • You can compare SVM with other methods (logistic regression, trees)

Practice

Flashcards


Why does SVM only depend on support vectors?

In the dual formulation, only points with non-zero Lagrange multipliers (α > 0) affect the solution. These are the points on or inside the margin boundary. All other points have α = 0, and removing them does not change the decision boundary (KKT complementary slackness).


Quiz


RBF kernel implicitly maps data to what kind of feature space?
