Support Vector Machines

Interview Context

In practice, SVM has fallen behind tree-based methods for tabular data, but it remains a high-frequency interview topic because it tests your understanding of optimization, geometry, and the kernel trick. Know the theory, and also know when you should (and should not) use SVM.

Maximum Margin Classifier

The core idea of SVM: find the hyperplane that separates two classes with the largest possible margin.

Given training data $\{(\mathbf{x}_i, y_i)\}$ where $y_i \in \{-1, +1\}$, the decision boundary is:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

The margin is the distance between the two parallel hyperplanes $\mathbf{w}^\top \mathbf{x} + b = \pm 1$:

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$

Maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$. Intuitively, a wider margin makes the model more robust to noise, which means better generalization.
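That margin expression follows from the standard point-to-hyperplane distance; a short derivation:

```latex
% Distance from a point x_0 to the hyperplane w^T x + b = 0:
d(\mathbf{x}_0) = \frac{|\mathbf{w}^\top \mathbf{x}_0 + b|}{\|\mathbf{w}\|}
% The planes w^T x + b = +1 and w^T x + b = -1 each sit at distance
% 1/\|w\| from the decision boundary, so the gap between them is:
\text{Margin} = \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}
```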

Hard Margin SVM (Linearly Separable Case)

$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \;\; \forall i$$

This is a convex quadratic program: it has a unique global optimum and no local-minima issues.

The constraint ensures that every point is correctly classified and lies outside the margin.

Support Vectors

Only the data points lying exactly on the margin boundary ($y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$) determine the decision boundary. These are the support vectors.

All other points can be moved or removed without changing the solution; this is SVM's distinctive sparsity property.

Why This Matters

Because SVM depends only on support vectors: (1) at prediction time you only need to store the support vectors, not the entire training set (memory efficient), and (2) the model is robust to outliers far from the boundary. But it also means a noisy point sitting right on the boundary has an outsized influence on the model.
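The sparsity property can be checked directly: refitting on only the support vectors should recover the same hyperplane. A minimal sketch, assuming a synthetic two-blob dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters (toy data)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Refit using ONLY the support vectors
sv_idx = svm.support_
svm_sv = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

print(f"Support vectors: {len(sv_idx)} of {len(X)} points")
# The hyperplane is unchanged (up to solver tolerance)
print(np.allclose(svm.coef_, svm_sv.coef_, atol=1e-3))
```

This is the KKT argument in miniature: the dropped points had zero Lagrange multipliers, so removing their constraints leaves the optimum intact.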

Soft Margin SVM (C Parameter)

Real data is rarely linearly separable. Soft margin introduces slack variables $\xi_i \geq 0$ that allow misclassifications at a penalty:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

Understanding C

$C$ controls the tradeoff between margin width and misclassification:

| $C$ | Margin | Misclassifications | Behavior |
|---|---|---|---|
| Very large | Narrow | Almost none | Like hard margin → overfitting risk |
| Medium | Balanced | Some allowed | Typical operating point |
| Very small | Wide | Many allowed | Strong regularization → underfitting risk |

C Is Inverse Regularization

Note that $C$ runs in the opposite direction from Ridge/Lasso's $\lambda$: larger $C$ means weaker regularization (more penalty on errors, less on $\|\mathbf{w}\|$), while larger $\lambda$ means stronger regularization. This is a common point of confusion in interviews.

Equivalence: SVM's $C$ roughly corresponds to $1/\lambda$ in Ridge/Lasso.

Hinge Loss Interpretation

The soft-margin SVM can be rewritten as unconstrained optimization with the hinge loss:

$$\min_{\mathbf{w}, b} \sum_{i=1}^{n} \max(0, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)) + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

where $\lambda = 1/C$.

Hinge loss: $\max(0, 1 - y \cdot f(x))$

  • Correctly classified with margin ≥ 1 → loss = 0
  • Correctly classified but inside margin → loss > 0
  • Misclassified → loss grows linearly

Compared with the logistic loss (cross-entropy): hinge loss is exactly zero outside the margin (well-classified points are not penalized further), while logistic loss always contributes a small gradient.
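A small numeric comparison makes the difference concrete (illustrative margin values chosen by hand):

```python
import numpy as np

# Margins m = y * f(x): negative = misclassified, in (0, 1) = inside margin,
# >= 1 = correctly classified with full margin
m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])

hinge = np.maximum(0.0, 1.0 - m)
logistic = np.log1p(np.exp(-m))  # log(1 + e^{-m})

for mi, h, l in zip(m, hinge, logistic):
    print(f"m={mi:+.1f}  hinge={h:.4f}  logistic={l:.4f}")

# hinge is exactly 0 once m >= 1; logistic stays strictly positive forever
```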

The Kernel Trick

Why Kernels Are Needed

A linear SVM can only draw straight lines/hyperplanes as decision boundaries, but many real-world datasets need non-linear ones.

Naive approach: manually map the features into a higher-dimensional space (e.g., add $x_1^2, x_2^2, x_1 x_2$), then run a linear SVM there. Problem: the cost of computing an explicit high-dimensional mapping explodes with dimension.

Kernel trick: never perform the mapping explicitly. You only need the dot products in the mapped space, and certain kernel functions compute those dot products implicitly.

Dual Formulation

The dual form of the SVM, using Lagrange multipliers $\alpha_i$:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$.

Key observation: the data appears only through dot products $\mathbf{x}_i^\top \mathbf{x}_j$. Replace the dot product with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ and you are running the SVM in the high-dimensional space.
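A concrete sketch of that observation: the dual only ever touches the Gram matrix of pairwise products, so kernelizing is literally swapping one matrix for another (`rbf_kernel` is sklearn's pairwise helper; the data here is random):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# A linear SVM's dual sees this matrix of dot products ...
gram_linear = X @ X.T

# ... a kernel SVM substitutes K(x_i, x_j) entries instead
gamma = 0.5
gram_rbf = rbf_kernel(X, gamma=gamma)

# Spot-check one entry against the RBF formula exp(-gamma ||x_i - x_j||^2)
manual = np.exp(-gamma * np.sum((X[0] - X[1]) ** 2))
print(np.isclose(gram_rbf[0, 1], manual))   # True
print(np.allclose(np.diag(gram_rbf), 1.0))  # K(x, x) = 1 for RBF
```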

Kernel Functions

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$$

This computes the dot product in the high-dimensional space without explicitly computing $\phi(\mathbf{x})$.

| Kernel | Formula | Implicit Feature Space | Best For |
|---|---|---|---|
| Linear | $\mathbf{x}_i^\top \mathbf{x}_j$ | Original space | High-dim sparse data (text) |
| Polynomial | $(\gamma \mathbf{x}_i^\top \mathbf{x}_j + r)^d$ | All polynomials up to degree $d$ | Low-dim data with polynomial relationships |
| RBF (Gaussian) | $\exp(-\gamma\Vert\mathbf{x}_i - \mathbf{x}_j\Vert^2)$ | Infinite-dimensional | General nonlinear, default choice |
| Sigmoid | $\tanh(\gamma \mathbf{x}_i^\top \mathbf{x}_j + r)$ | Not always a valid PD kernel | Rarely used |
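To see why the implicit feature space is meaningful, here is a hand-checkable sketch for the homogeneous degree-2 polynomial kernel in 2D (taking $\gamma = 1$, $r = 0$), whose explicit map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input."""
    v1, v2 = v
    return np.array([v1**2, np.sqrt(2) * v1 * v2, v2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_val = (x @ z) ** 2        # computed in the original 2-D space
explicit_val = phi(x) @ phi(z)   # same number, computed in the mapped 3-D space

print(kernel_val, explicit_val)  # both 1.0: (1*3 + 2*(-1))^2
```

The kernel gets the same answer with a 2-D dot product that the explicit map needs 3 dimensions for; with RBF, the mapped space is infinite-dimensional, so the shortcut is not optional.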

Understanding RBF Gamma

$\gamma$ controls the "radius of influence" of each support vector:

| $\gamma$ | Influence Radius | Decision Boundary | Risk |
|---|---|---|---|
| Large | Local only | Complex, follows each point | Overfitting |
| Small | Wide reach | Smooth | Underfitting |

Intuition: large $\gamma$ means each support vector influences only a small neighborhood, so the boundary becomes very wiggly. Small $\gamma$ means each support vector's influence reaches far, so the boundary stays smooth.

Kernel Selection Heuristic

Use a linear kernel when $p \gg n$ (text classification, genomics: thousands of features but only hundreds of samples). Use RBF when $n$ is manageable and the relationship is non-linear (the default starting point). Polynomial kernels are rarely used (numerically unstable at high degree).

Mercer's Condition

Not every function can be a valid kernel. A symmetric function $K$ is a valid kernel if and only if the resulting Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite for any set of points.

Interviewers rarely ask for the derivation, but knowing that Mercer's condition exists signals that you understand kernels are not arbitrary functions.
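The condition can be probed empirically by eigendecomposing Gram matrices on random data (a sketch; the `sigmoid_kernel` parameters below are arbitrary choices, not canonical ones):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 4))

# Valid kernel: all Gram-matrix eigenvalues are >= 0 (up to float error)
eig_rbf = np.linalg.eigvalsh(rbf_kernel(X, gamma=1.0))
print("RBF min eigenvalue:", eig_rbf.min())

# The sigmoid/tanh kernel is not PSD for all parameter choices, so its
# Gram matrix may have negative eigenvalues
eig_sig = np.linalg.eigvalsh(sigmoid_kernel(X, gamma=1.0, coef0=-1.0))
print("Sigmoid min eigenvalue:", eig_sig.min())
```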

SVM for Regression (SVR)

Support Vector Regression uses epsilon-insensitive loss:

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

subject to:

$$|y_i - \mathbf{w}^\top \mathbf{x}_i - b| \leq \epsilon + \xi_i$$

Errors inside the $\epsilon$-tube are not penalized; only points outside the tube become support vectors. Unlike MAE, SVR's $\epsilon$ provides a "don't care" zone.
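A minimal SVR sketch on synthetic data showing the tube effect: widening $\epsilon$ swallows more points into the zero-loss zone, so typically fewer support vectors remain.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

for eps in (0.05, 0.2, 0.5):
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(svr.support_)} / {len(X)} support vectors")
# points strictly inside the tube carry zero loss and are not support vectors
```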

SVM vs Other Methods

| Scenario | Best Choice | Why |
|---|---|---|
| High-dimensional, sparse ($p \gg n$) | Linear SVM | Effective margins in high-dim, fast with LinearSVC |
| Tabular data, moderate $n$ | Gradient boosting | Handles mixed types, interactions, missing values natively |
| Small dataset, non-linear | SVM with RBF | Kernel trick is powerful with limited data |
| Very large $n$ (millions) | Tree-based or linear | Kernel SVM training is $O(n^2)$ to $O(n^3)$, too slow |
| Need probability outputs | Logistic regression or trees | SVM outputs are not probabilities (Platt scaling is a workaround) |
| Need interpretability | Logistic regression or trees | SVM's decision function is unintuitive (especially kernel SVM) |

Computational Complexity

| Operation | Complexity | Note |
|---|---|---|
| Training (kernel SVM) | $O(n^2)$ to $O(n^3)$ | QP solver dominates |
| Training (LinearSVC) | $O(n \cdot p)$ | Uses coordinate descent, scales well |
| Prediction (kernel) | $O(n_{sv} \cdot p)$ | $n_{sv}$ = number of support vectors |
| Prediction (linear) | $O(p)$ | Just a dot product with $\mathbf{w}$ |

When an interviewer asks about SVM scalability, distinguishing kernel SVM from linear SVM matters. sklearn.svm.LinearSVC is much faster than SVC(kernel='linear') because it uses a different solver.
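A quick sanity check of that claim (sketch; the dataset is synthetic): both estimators fit a linear decision function, though LinearSVC uses squared hinge loss and penalizes the intercept by default, so coefficients can differ slightly.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# liblinear coordinate descent vs libsvm's SMO-style QP solver
lin = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000)).fit(X, y)
ker = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X, y)

print(f"LinearSVC          train acc: {lin.score(X, y):.3f}")
print(f"SVC(kernel=linear) train acc: {ker.score(X, y):.3f}")
# at large n, LinearSVC is dramatically faster; prefer it for linear models
```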

Feature Scaling Is Critical

The SVM margin is defined in terms of Euclidean distance. If one feature ranges 0-1000 and another 0-1, the large feature dominates the distance and the margin is distorted. Always standardize before SVM (zero mean, unit variance). Tree-based methods don't need this, because their splits are threshold-based.
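A sketch of the effect on synthetic data, with one feature inflated by 1000 to mimic mismatched units:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 1000  # this feature now dominates every Euclidean distance

acc_raw = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
acc_std = cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")), X, y, cv=5
).mean()
print(f"No scaling:   {acc_raw:.3f}")
print(f"Standardized: {acc_std:.3f}")
# standardization typically recovers accuracy the inflated feature destroyed
```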

Real-World Use Cases

Case 1: Text Classification

SVM was once the king of NLP (in the pre-deep-learning era).

Why SVM works well for text:

  • Text features (TF-IDF) are typically high-dimensional (10K+ features) and sparse
  • Linear SVM is effective in high-dimensional spaces and resists overfitting when $p \gg n$
  • Margin maximization generalizes well in sparse spaces

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

text_clf = make_pipeline(
    TfidfVectorizer(max_features=10000),
    LinearSVC(C=1.0)  # not SVC(kernel='linear') — much faster!
)
text_clf.fit(X_train_text, y_train)
# LinearSVC scales to large text datasets; kernel SVM would be too slow

Interview follow-up: "Why not use a kernel SVM for text classification?" Because $n$ is usually large (tens of thousands to millions of documents), kernel SVM's $O(n^2)$ training is infeasible. Besides, a linear SVM is already good enough in high-dimensional TF-IDF space.

Case 2: Credit Card Fraud Detection

Challenge: SVM has no native support for class imbalance.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Option 1: class_weight
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, class_weight="balanced")
)
# class_weight='balanced' → C_fraud = C × (n_total / (2 × n_fraud))

# Option 2: Need probabilities? Use Platt scaling
svm_prob = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, probability=True)  # enables Platt scaling
)
# .predict_proba() available but adds computational cost

Interview follow-up: "Is SVM's output a probability?" No. The SVM output is a signed distance from the hyperplane. Setting probability=True enables Platt scaling (fitting a logistic regression on the SVM outputs) to get approximate probabilities, but that is post-hoc calibration, not native.

Case 3: Anomaly Detection (One-Class SVM)

One-Class SVM is used for anomaly detection: train only on normal data, then flag points that differ from the normal data as outliers.

from sklearn.svm import OneClassSVM

# Train on normal transactions only
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
oc_svm.fit(X_normal)  # only normal data!

# Predict: +1 = normal, -1 = anomaly
predictions = oc_svm.predict(X_test)
# nu ≈ expected proportion of outliers

The nu parameter ≈ the expected fraction of outliers. One-Class SVM draws a tight boundary around the normal data; anything outside that boundary is an anomaly. This suits scenarios where you only have normal data (no labeled anomalies).

Hands-on: SVM in Python

SVM Kernels & C Parameter

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)  # fixed seed for reproducibility

configs = [
    ("Linear, C=1",     SVC(kernel="linear", C=1)),
    ("RBF, C=1",        SVC(kernel="rbf", C=1, gamma="scale")),
    ("RBF, C=100",      SVC(kernel="rbf", C=100, gamma="scale")),
    ("RBF, C=0.01",     SVC(kernel="rbf", C=0.01, gamma="scale")),
]
for name, svm in configs:
    pipe = make_pipeline(StandardScaler(), svm)  # always scale for SVM!
    scores = cross_val_score(pipe, X, y, cv=5)
    pipe.fit(X, y)
    n_sv = pipe.named_steps["svc"].n_support_.sum()
    print(f"{name}: CV Acc={scores.mean():.3f}, Support Vectors={n_sv}")

# Linear fails on non-linear data (moons)
# Large C → fewer SVs, tighter margin, overfitting risk
# Small C → more SVs, wider margin, more regularization

Grid Search for C and Gamma

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", "auto", 0.01, 0.1, 1],
    "svc__kernel": ["rbf"],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best F1: {grid.best_score_:.4f}")

Interactive: Decision Boundaries

Experiment with SVM and other classifiers to see how decision boundaries change:


Interview Signals

What interviewers listen for:

  • You can explain both the intuition and the math of margin maximization
  • You know $C$ and $\lambda$ run in opposite directions (large $C$ = weak regularization)
  • You can explain the kernel trick in non-technical language
  • You know SVM's scalability limits and when to use LinearSVC
  • You can compare SVM with other methods (logistic regression, trees)

Practice

Flashcards


Why does SVM only depend on support vectors?

In the dual formulation, only points with non-zero Lagrange multipliers (α > 0) affect the solution. These are the points on or inside the margin boundary. All other points have α = 0, and removing them does not change the decision boundary (KKT complementary slackness).


Quiz


RBF kernel implicitly maps data to what kind of feature space?
