Support Vector Machines
Interview Context
In practice SVM is no longer as common as tree-based methods for tabular data, but it remains a high-frequency interview topic, because it tests your understanding of optimization, geometry, and the kernel trick. Know the theory, and also know when you should (and should not) use SVM.
Maximum Margin Classifier
The core idea of SVM: find the hyperplane that separates two classes with the largest possible margin.
Given training data $(x_i, y_i)$ where $y_i \in \{-1, +1\}$, the decision boundary is:

$$w^T x + b = 0$$

The margin is the distance between the two parallel hyperplanes $w^T x + b = \pm 1$:

$$\text{margin} = \frac{2}{\|w\|}$$
Maximizing the margin = minimizing $\|w\|$. Intuitively, wider margin → the model is more robust to noise → better generalization.
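As a quick sanity check, the margin width $2/\|w\|$ can be read off a fitted linear SVM. This is a minimal sketch on a made-up, cleanly separable toy dataset; `C=1e6` is an illustrative choice to approximate a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first axis: x=0 vs x=2
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ≈ hard margin
clf.fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print(f"w = {w}, margin width = {margin:.2f}")  # margin ≈ 2: planes at x=0 and x=2
```

The margin planes pass through the two clusters, so the geometric width (here 2.0) matches the formula directly.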
Hard Margin SVM (Linearly Separable Case)
$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 \;\; \forall i$$

This is a convex quadratic program: it has a unique global optimum, with no local-minima issues.

The constraint $y_i(w^T x_i + b) \ge 1$ ensures that every point is correctly classified and lies outside the margin.
Support Vectors
Only the data points lying exactly on the margin boundary ($y_i(w^T x_i + b) = 1$) determine the decision boundary. These are the support vectors.
All other points can be moved or removed without changing the solution; this sparsity property is unique to SVM.
Why This Matters
SVM depending only on support vectors means: (1) at prediction time you only need to store the support vectors, not the entire training set (memory efficient); (2) it is robust to outliers far from the boundary. But it also means: a noisy point that lands right on the boundary can have a huge influence on the model.
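The sparsity claim can be verified directly: removing a point that is not a support vector leaves the fitted boundary essentially unchanged. A sketch on synthetic blobs (the dataset is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)

# Drop one non-support-vector point and refit: coefficients barely move
non_sv = np.setdiff1d(np.arange(len(X)), clf.support_)[0]
mask = np.arange(len(X)) != non_sv
clf2 = SVC(kernel="linear", C=1.0).fit(X[mask], y[mask])
print("max coefficient change:", np.abs(clf.coef_ - clf2.coef_).max())
```

Dropping a support vector instead would generally shift the boundary, which is exactly the noisy-point sensitivity mentioned above.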
Soft Margin SVM (C Parameter)
Real data is rarely linearly separable. Soft margin introduces slack variables $\xi_i \ge 0$ that allow misclassifications at a penalty:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$$
Understanding C
$C$ controls the tradeoff between margin width and misclassification:
| $C$ | Margin | Misclassifications | Behavior |
|---|---|---|---|
| Very large | Narrow | Almost none | Like hard margin → overfitting risk |
| Medium | Balanced | Some allowed | Typical operating point |
| Very small | Wide | Many allowed | Strong regularization → underfitting risk |
C Is Inverse Regularization
Note that $C$ points in the opposite direction from Ridge/Lasso's $\lambda$: larger $C$ = weaker regularization (more penalty on errors, less on $\|w\|$). Larger $\lambda$ = stronger regularization. This is a common point of confusion in interviews.
Equivalence: SVM's $C$ roughly corresponds to Ridge/Lasso's $1/\lambda$.
Hinge Loss Interpretation
Soft margin SVM can be rewritten as an unconstrained optimization with hinge loss:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 + C\sum_i \max\bigl(0,\; 1 - y_i f(x_i)\bigr)$$

where $f(x) = w^T x + b$.

Hinge loss $= \max(0,\; 1 - y \cdot f(x))$:
- Correctly classified with margin ≥ 1 → loss = 0
- Correctly classified but inside margin → loss > 0
- Misclassified → loss grows linearly
Comparison with logistic loss (cross-entropy): hinge loss is exactly zero outside the margin (it does not penalize well-classified points), whereas logistic loss always retains a small gradient.
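A quick numeric comparison of the two losses (a sketch; the margin values are arbitrary):

```python
import numpy as np

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])  # m = y * f(x)
hinge = np.maximum(0.0, 1.0 - margins)
logistic = np.log1p(np.exp(-margins))  # log(1 + e^{-m})

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin={m:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
# hinge is exactly 0 for m >= 1; logistic stays positive for every m
```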
The Kernel Trick
Why Kernels Are Needed
Linear SVM can only draw straight lines/planes as decision boundaries. But many real datasets have non-linear decision boundaries.
Naive approach: manually map features into a high-dimensional space (e.g. adding polynomial terms such as $x_1^2, x_1 x_2, x_2^2$), then run a linear SVM there. Problem: the cost of an explicit high-dimensional mapping explodes with dimension.
Kernel trick: you never need to compute the mapping itself, only the dot product in the mapped space, and certain kernel functions compute that dot product implicitly.
Dual Formulation
The dual form of the SVM, using Lagrange multipliers $\alpha_i$:

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$

subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
Key observation: the data appears only through dot products $x_i^T x_j$. Replace the dot product with a kernel function $K(x_i, x_j)$ → you are now running the SVM in a high-dimensional space.
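The substitution can be checked concretely. For 2-D inputs, the polynomial kernel $(x^T z + 1)^2$ equals a dot product under an explicit 6-dimensional feature map. A small verification sketch, with the map `phi` written out by hand:

```python
import numpy as np

def phi(v):
    # Explicit feature map whose dot product equals (x·z + 1)^2 in 2-D
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_val = (x @ z + 1.0) ** 2     # computed in 2-D: cheap
explicit_val = phi(x) @ phi(z)      # computed in 6-D: what the kernel implies
print(kernel_val, explicit_val)     # both ≈ 4.0
```

The kernel evaluates in the original 2-D space, yet gives the 6-D dot product. That gap is exactly what makes infinite-dimensional kernels like RBF feasible.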
Kernel Functions
A kernel $K(x, x') = \phi(x)^T \phi(x')$ computes the dot product in a high-dimensional space without explicitly computing $\phi$.
| Kernel | Formula | Implicit Feature Space | Best For |
|---|---|---|---|
| Linear | $x^T x'$ | Original space | High-dim sparse data (text) |
| Polynomial | $(x^T x' + c)^d$ | All polynomials up to degree $d$ | Low-dim data with polynomial relationships |
| RBF (Gaussian) | $\exp(-\gamma \|x - x'\|^2)$ | Infinite-dimensional | General nonlinear, default choice |
| Sigmoid | $\tanh(\kappa\, x^T x' + c)$ | Not always a valid PD kernel | Rarely used |
Understanding RBF Gamma
$\gamma$ controls the "influence radius" of each support vector:
| $\gamma$ | Influence Radius | Decision Boundary | Risk |
|---|---|---|---|
| Large | Local only | Complex, follows each point | Overfitting |
| Small | Wide reach | Smooth | Underfitting |
Intuition: large $\gamma$ = each support vector influences only a small neighborhood → a very wiggly boundary. Small $\gamma$ = each support vector's influence reaches far → a very smooth boundary.
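The influence radius can be seen numerically by evaluating the RBF kernel at a few distances (a sketch; the distance and gamma values are arbitrary):

```python
import numpy as np

# RBF similarity K = exp(-gamma * d^2) as a function of distance d
distances = np.array([0.0, 0.5, 1.0, 2.0])
for gamma in (0.1, 1.0, 10.0):
    k = np.exp(-gamma * distances ** 2)
    print(f"gamma={gamma:>4}: " + "  ".join(f"{v:.3f}" for v in k))
# Larger gamma: similarity decays much faster with distance,
# so each support vector only "sees" its immediate neighborhood.
```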
Kernel Selection Heuristic
Linear kernel when $p \gg n$ (text classification, genomics: thousands of features but only hundreds of samples). RBF when $n$ is manageable and the relationship is non-linear (the default starting point). Polynomial is rarely used (numerical instability at high degrees).
Mercer's Condition
Not every function can be a valid kernel. A function $K$ is a valid kernel if and only if the resulting Gram matrix $K_{ij} = K(x_i, x_j)$ is positive semi-definite for any set of points.
Interviews rarely ask for the derivation, but knowing that Mercer's condition exists shows you understand that kernels cannot be defined arbitrarily.
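Mercer's condition can be checked numerically by looking at the Gram matrix's eigenvalues. In this sketch, the RBF Gram matrix is PSD, while $K(x, z) = -\|x - z\|$ is a deliberately made-up function that fails the test:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

rbf_gram = np.exp(-0.5 * sq_dists)   # valid kernel: all eigenvalues >= 0
bad_gram = -np.sqrt(sq_dists)        # invalid "kernel": zero trace forces a
                                     # negative eigenvalue

print("RBF min eigenvalue:", np.linalg.eigvalsh(rbf_gram).min())
print("bad min eigenvalue:", np.linalg.eigvalsh(bad_gram).min())
```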
SVM for Regression (SVR)
Support Vector Regression uses the epsilon-insensitive loss:

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\|w\|^2 + C\sum_i (\xi_i + \xi_i^*)$$

subject to:

$$y_i - (w^T x_i + b) \le \epsilon + \xi_i, \quad (w^T x_i + b) - y_i \le \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$

Errors inside the $\epsilon$-tube are not penalized; only points outside the tube become support vectors. Unlike MAE, SVR's $\epsilon$ provides a "don't care" zone.
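The "don't care" zone is visible in sklearn's SVR: widening the tube leaves fewer points outside it, so the support vector set shrinks. A sketch on synthetic sine data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)  # noise std = 0.1

for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", C=10, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(svr.support_)} support vectors")
# Wider tube -> fewer points fall outside it -> fewer support vectors
```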
SVM vs Other Methods
| Scenario | Best Choice | Why |
|---|---|---|
| High-dimensional, sparse ($p \gg n$) | Linear SVM | Effective margins in high-dim, fast with LinearSVC |
| Tabular data, moderate $n$ | Gradient boosting | Handles mixed types, interactions, missing values natively |
| Small dataset, non-linear | SVM with RBF | Kernel trick powerful with limited data |
| Very large $n$ (millions) | Tree-based or linear | SVM training is $O(n^2)$ to $O(n^3)$, too slow |
| Need probability outputs | Logistic regression or trees | SVM outputs are not probabilities (Platt scaling is a workaround) |
| Need interpretability | Logistic regression or trees | SVM's decision function is not intuitive (especially kernel SVM) |
Computational Complexity
| Operation | Complexity | Note |
|---|---|---|
| Training (kernel SVM) | $O(n^2)$ to $O(n^3)$ | QP solver dominates |
| Training (LinearSVC) | $\approx O(n \cdot p)$ | Uses coordinate descent, scales well |
| Prediction (kernel) | $O(n_{sv} \cdot p)$ | $n_{sv}$ = number of support vectors |
| Prediction (linear) | $O(p)$ | Just a dot product with $w$ |
When asked about SVM scalability in an interview, distinguishing kernel SVM from linear SVM matters. sklearn.svm.LinearSVC is much faster than SVC(kernel='linear') because it uses a different solver.
Feature Scaling Is Critical
The SVM margin is defined via Euclidean distance. If one feature ranges 0-1000 and another 0-1, the large feature dominates the distance and distorts the margin. Always standardize before SVM (zero mean, unit variance). Tree-based methods don't need this, because their splits are threshold-based.
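A sketch of the effect on synthetic data (the 1000x blow-up of one feature is contrived for illustration; with `shuffle=False`, the last feature of `make_classification` is pure noise):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, shuffle=False,
                           random_state=0)
X_bad = X.copy()
X_bad[:, -1] *= 1000.0  # a noise feature now dominates Euclidean distance

raw = cross_val_score(SVC(kernel="rbf"), X_bad, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                         X_bad, y, cv=5).mean()
print(f"unscaled: {raw:.3f}, scaled: {scaled:.3f}")
# Standardizing restores the informative features' influence on the margin
```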
Real-World Use Cases
Case 1: Text Classification
SVM used to be king in NLP (in the pre-deep-learning era).
Why SVM works well for text:
- Text features (TF-IDF) are typically high-dimensional (10K+ features) and sparse
- Linear SVM is effective in high-dim space: it does not overfit easily even when $p \gg n$
- Margin maximization provides good generalization in sparse spaces
```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

text_clf = make_pipeline(
    TfidfVectorizer(max_features=10000),
    LinearSVC(C=1.0)  # not SVC(kernel='linear') — much faster!
)
text_clf.fit(X_train_text, y_train)
# LinearSVC scales to large text datasets; kernel SVM would be too slow
```
Interview follow-up: "Why not use kernel SVM for text classification?" Because $n$ is usually large (tens of thousands to millions of documents), kernel SVM training is infeasible. And in the high-dim TF-IDF space, linear SVM is already good enough.
Case 2: Credit Card Fraud Detection
Challenge: SVM has no native support for class imbalance.
```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Option 1: class_weight
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, class_weight="balanced")
)
# class_weight='balanced' → C_fraud = C × (n_total / (2 × n_fraud))

# Option 2: Need probabilities? Use Platt scaling
svm_prob = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, probability=True)  # enables Platt scaling
)
# .predict_proba() available but adds computational cost
```
Interview follow-up: "Is the SVM output a probability?" No. The SVM output is a signed distance from the hyperplane. Setting probability=True enables Platt scaling (fitting a logistic regression on the SVM outputs) to obtain approximate probabilities, but this is post-hoc calibration, not native.
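Besides SVC's probability=True, sklearn's CalibratedClassifierCV applies the same sigmoid (Platt) calibration to any decision function, including LinearSVC, which has no probability option of its own. A sketch on synthetic data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

# Fit a logistic sigmoid on the SVM's decision values via cross-validation
calibrated = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:3])
print(proba)  # each row sums to 1
```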
Case 3: Anomaly Detection (One-Class SVM)
One-Class SVM is used for anomaly detection: train on normal data only, then flag points that differ from the normal data as outliers.
```python
from sklearn.svm import OneClassSVM

# Train on normal transactions only
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
oc_svm.fit(X_normal)  # only normal data!

# Predict: +1 = normal, -1 = anomaly
predictions = oc_svm.predict(X_test)
# nu ≈ expected proportion of outliers
```
The nu parameter ≈ the expected fraction of outliers. One-Class SVM draws a tight boundary around the normal data; anything outside it is an anomaly. This fits scenarios with only normal data and no labeled anomalies.
Hands-on: SVM in Python
SVM Kernels & C Parameter
```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=300, noise=0.2)

configs = [
    ("Linear, C=1", SVC(kernel="linear", C=1)),
    ("RBF, C=1", SVC(kernel="rbf", C=1, gamma="scale")),
    ("RBF, C=100", SVC(kernel="rbf", C=100, gamma="scale")),
    ("RBF, C=0.01", SVC(kernel="rbf", C=0.01, gamma="scale")),
]

for name, svm in configs:
    pipe = make_pipeline(StandardScaler(), svm)  # always scale for SVM!
    scores = cross_val_score(pipe, X, y, cv=5)
    pipe.fit(X, y)
    n_sv = pipe.named_steps["svc"].n_support_.sum()
    print(f"{name}: CV Acc={scores.mean():.3f}, Support Vectors={n_sv}")

# Linear fails on non-linear data (moons)
# Large C → fewer SVs, tighter margin, overfitting risk
# Small C → more SVs, wider margin, more regularization
```
Grid Search for C and Gamma
```python
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", "auto", 0.01, 0.1, 1],
    "svc__kernel": ["rbf"],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best F1: {grid.best_score_:.4f}")
```
Interview Signals
What interviewers listen for:
- You can explain both the intuition and the math of margin maximization
- You know C and $\lambda$ point in opposite directions (large C = weak regularization)
- You can explain the kernel trick in non-technical language
- You know SVM's scalability limits and when to use LinearSVC
- You can compare SVM with other methods (logistic regression, trees)
Practice
Flashcards
Why does SVM only depend on support vectors?
In the dual formulation, only points with non-zero Lagrange multipliers (α > 0) affect the solution. These are the points on or inside the margin boundary. All other points have α = 0, and removing them does not change the decision boundary (KKT complementary slackness).
Quiz
RBF kernel implicitly maps data to what kind of feature space?