Dimensionality Reduction
Interview Context
Dimensionality reduction typically shows up in interviews in two forms: (1) "You have 1000 features; how do you handle them?" and (2) "Explain how PCA works and its limitations." You need the theory, but also a sense of when dimensionality reduction helps and when it is unnecessary.
What You Should Understand
- Explain the math behind PCA (eigendecomposition / SVD) and its geometric intuition
- Know PCA's assumptions and limitations (linearity, variance = importance)
- Compare when PCA, t-SNE, and UMAP are each appropriate
- Understand how the curse of dimensionality affects distance-based methods
- Know when dimensionality reduction helps and when it actually hurts
The Curse of Dimensionality
Why High Dimensions Are Problematic
As dimensionality increases, several counterintuitive phenomena emerge:
1. Distance concentration: As dimension grows, the ratio between the largest and smallest pairwise distances approaches 1.
All points become nearly equidistant, so distance-based methods (KNN, K-Means, SVM with an RBF kernel) lose their ability to discriminate.
2. Volume explosion: The volume of a unit hypersphere relative to the unit hypercube shrinks to 0.
Almost all of the data ends up in the "corners" of the hypercube; very little lies near the center.
3. Sample complexity: The number of samples needed for reliable statistical estimates in high-dimensional space grows exponentially with the dimension.
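The first two phenomena are easy to verify numerically. A minimal sketch with synthetic uniform data (200 points per setting, all names and sample sizes are illustrative choices): as the dimension grows, the max/min pairwise distance ratio shrinks toward 1, and the fraction of points inside the unit ball inscribed in the cube collapses to 0.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for d in [2, 10, 100, 1000]:
    X = rng.uniform(-1, 1, size=(200, d))               # points in the cube [-1, 1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)        # distances from one point to the rest
    ratio = dists.max() / dists.min()                   # concentration: ratio -> 1 as d grows
    inside = np.mean(np.linalg.norm(X, axis=1) <= 1.0)  # fraction inside the inscribed unit ball
    results[d] = (ratio, inside)
    print(f"d={d:5d}  max/min distance ratio={ratio:6.2f}  inside unit ball={inside:.2f}")
```

At d=2 the ratio is large (some neighbors are very close, others far); at d=1000 it is close to 1, and essentially no point lies inside the unit ball.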
Interview tip
When asked about the curse of dimensionality, don't just say "high dimensions = bad." Be specific: distance concentration breaks KNN/K-Means, volume explosion makes density estimation hard, and sample complexity makes models prone to overfitting.
When to Reduce Dimensions
| Scenario | Reduce? | Method |
|---|---|---|
| 1000 features, many correlated | Yes | PCA removes the redundancy |
| Visualization of clusters | Yes | t-SNE / UMAP down to 2D |
| KNN/K-Means on high-dim data | Yes | PCA → clustering |
| Tree-based models (RF, GBM) | Usually no | Trees do feature selection naturally and are unaffected by the distance curse |
| Deep learning (images, text) | Usually no | The network learns its own representation |
| Feature selection already done | Maybe no | If features are already curated, PCA may be unnecessary |
PCA (Principal Component Analysis)
The Core Idea
PCA finds the orthogonal directions (principal components) along which the data has maximum variance:
- PC1 = direction of maximum variance
- PC2 = direction of maximum variance orthogonal to PC1
- PC3 = ... and so on
Mathematical Formulation
Given a centered data matrix $X$ ($n \times p$), PCA is equivalent to:
1. Eigendecomposition of the covariance matrix: $C = \frac{1}{n} X^\top X = V \Lambda V^\top$,
where $V$ = eigenvectors (principal component directions), $\Lambda$ = diagonal matrix of eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$
2. Singular Value Decomposition (SVD) of $X$: $X = U \Sigma V^\top$.
The principal components are the columns of $V$. The projected data is $X V_k$ (keeping the top $k$ components).
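The equivalence of the two routes can be checked numerically. A small sketch on synthetic correlated data: eigendecomposition of the covariance matrix and SVD of the centered data give the same eigenvalues ($\lambda_i = \sigma_i^2 / n$) and the same directions, up to sign.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xc = X - X.mean(axis=0)                                  # center the data

# Route 1: eigendecomposition of the covariance matrix C = Xc^T Xc / n
C = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(C)                # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Same eigenvalues: lambda_i = sigma_i^2 / n
assert np.allclose(eigvals, S**2 / len(Xc))
# Same directions up to sign: columnwise |dot products| are all 1
assert np.allclose(np.abs(np.sum(eigvecs * Vt.T, axis=0)), 1.0)
```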
Explained Variance
Each eigenvalue $\lambda_i$ represents the variance captured by component $i$: explained variance ratio $= \lambda_i / \sum_j \lambda_j$.
Cumulative explained variance: plot $\sum_{i \le k} \lambda_i / \sum_j \lambda_j$ vs $k$ and choose $k$ so that the cumulative variance exceeds some threshold (typically 90-95%).
```python
from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components=0.95)  # retain 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Original: {X_scaled.shape[1]} features")
print(f"Reduced: {X_reduced.shape[1]} components")
print(f"Explained variance: {pca.explained_variance_ratio_.cumsum()[-1]:.1%}")
```
Geometric Interpretation
The geometric intuition: PCA rotates the coordinate system to align with the shape of the data.
- The original features may be correlated, so the axes don't align with the directions the data spreads in
- PCA finds the directions along which the data actually spreads; these become the new axes (the principal components)
- Reducing dimensions = dropping the lowest-variance directions, on the assumption that they are noise
PCA Assumptions and Limitations
| Assumption | Meaning | When It Fails |
|---|---|---|
| Linearity | Components are linear combinations of features | Non-linear manifold structure |
| Variance = importance | High-variance directions are informative | The highest-variance direction may be noise |
| Orthogonality | Components are uncorrelated | True latent factors may be non-orthogonal |
| Continuous features | Distance meaningful | Categorical / binary features |
| Centered data | Mean subtracted | Running PCA without standardizing lets the largest-scale feature dominate |
Always Standardize Before PCA
If features are on different scales (e.g., income 0-1M vs age 0-100), PCA is dominated by the large-scale feature, because it has the largest variance. Always apply StandardScaler before PCA, unless you are sure all features are already on the same scale.
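The effect is easy to demonstrate. A sketch with hypothetical income/age features (the values and the correlation are made up for illustration): without scaling, PC1 is essentially the income axis; after standardizing, both features contribute equally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales: age (~10s) and a correlated income (~10,000s)
age = rng.normal(40, 10, size=500)
income = 1_500 * age + rng.normal(0, 20_000, size=500)
X = np.column_stack([income, age])

# Without scaling, PC1 is essentially the income axis: income's variance
# dwarfs age's, so age contributes almost nothing to the component
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# After standardizing, both features contribute to PC1
X_std = StandardScaler().fit_transform(X)
pc1_std = PCA(n_components=1).fit(X_std).components_[0]

print("PC1 without scaling:", np.round(np.abs(pc1_raw), 4))
print("PC1 with scaling:   ", np.round(np.abs(pc1_std), 4))
```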
PCA for Feature Engineering
PCA is not just a dimensionality reduction tool; it also works as a feature engineering step:
```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# PCA as a preprocessing step
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),
    GradientBoostingClassifier()
)
# Reduces overfitting when p >> n
# Also removes multicollinearity (PCA components are orthogonal)
```
SVD (Singular Value Decomposition)
Relationship to PCA
Any matrix $X$ ($n \times p$) can be decomposed as $X = U \Sigma V^\top$:
- $U$ ($n \times n$): left singular vectors
- $\Sigma$ ($n \times p$): diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots$
- $V$ ($p \times p$): right singular vectors = PCA eigenvectors
- Relationship: $\lambda_i = \sigma_i^2 / n$ for centered $X$
PCA via SVD is more numerically stable and more efficient than an explicit eigendecomposition; sklearn computes PCA via SVD internally.
Truncated SVD
Keep only the top $k$ singular values/vectors: $X \approx X_k = U_k \Sigma_k V_k^\top$.
This is the best rank-$k$ approximation of $X$ in Frobenius norm (Eckart-Young theorem).
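The Eckart-Young statement can be checked directly on random data: the Frobenius error of the truncated reconstruction equals the $\ell_2$ norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 30))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = (U[:, :k] * S[:k]) @ Vt[:k]   # rank-k truncated reconstruction

# Eckart-Young: the Frobenius error equals the l2 norm of the discarded singular values
err = np.linalg.norm(A - A_k, "fro")
print(f"rank-{k} Frobenius error: {err:.4f}")
print(f"norm of discarded sigmas: {np.linalg.norm(S[k:]):.4f}")
```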
Key advantage: TruncatedSVD works on sparse matrices. PCA requires centering (mean subtraction), which destroys sparsity; for text data (TF-IDF), use TruncatedSVD rather than PCA.
```python
from sklearn.decomposition import TruncatedSVD

# For sparse data (e.g., TF-IDF text features)
svd = TruncatedSVD(n_components=100)
X_reduced = svd.fit_transform(X_tfidf)  # accepts sparse input directly; no centering needed
```
SVD in Recommender Systems
Matrix factorization for recommendations: factor the user-item rating matrix as $R \approx U_k \Sigma_k V_k^\top$.
Users and items are both mapped into a low-dimensional latent space, and a missing rating is predicted as $\hat{r}_{ij} = u_i \cdot v_j$. This was one of the core methods in the Netflix Prize.
t-SNE
Algorithm
t-Distributed Stochastic Neighbor Embedding — a non-linear method for visualization (reducing to 2D or 3D).
Steps:
- Compute pairwise similarities in high dimensions using Gaussian kernels (nearby points have high similarity)
- Initialize points randomly in 2D
- Define similarities in 2D using a Student-t distribution (heavier tails than a Gaussian)
- Minimize KL divergence between high-dim and low-dim similarity distributions via gradient descent
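The steps above can be sketched numerically. This toy version (all sizes are arbitrary) fixes a single Gaussian bandwidth instead of calibrating one per point to the target perplexity, and evaluates the KL objective for a given random 2D layout rather than optimizing it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # high-dim points
Y = rng.normal(size=(20, 2))   # candidate 2D embedding

def sq_dists(Z):
    # pairwise squared Euclidean distances
    s = (Z**2).sum(axis=1)
    return s[:, None] + s[None, :] - 2 * Z @ Z.T

# High-dim similarities: Gaussian kernel. Real t-SNE tunes sigma_i per point
# via binary search to hit the target perplexity, then symmetrizes the
# conditional probabilities; with one shared sigma, P is already symmetric.
P = np.exp(-sq_dists(X) / (2 * 1.0**2))
np.fill_diagonal(P, 0)
P /= P.sum()

# Low-dim similarities: Student-t with 1 degree of freedom (heavy tails)
Q = 1.0 / (1.0 + sq_dists(Y))
np.fill_diagonal(Q, 0)
Q /= Q.sum()

# Objective minimized by gradient descent: KL(P || Q)
kl = np.sum(P[P > 0] * np.log(P[P > 0] / Q[P > 0]))
print(f"KL(P||Q) = {kl:.3f}")
```

Gradient descent would move the rows of Y to shrink this KL value; the displayed number is just the objective for one random layout.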
Key Parameters
Perplexity (typically 5-50): roughly the number of effective nearest neighbors. It controls the local vs global structure tradeoff:
| Perplexity | Effect |
|---|---|
| Low (5-10) | Very local — small tight clusters, may miss global structure |
| Medium (30) | Balanced — typical default |
| High (50-100) | More global — clusters spread out, may blur fine structure |
t-SNE Limitations
| Limitation | Detail |
|---|---|
| Visualization only | The output doesn't preserve distances, so don't use it for downstream ML |
| Non-deterministic | Different random seeds give different results; run several times to confirm the pattern is stable |
| Doesn't preserve global structure | Distances between clusters and cluster sizes are meaningless |
| Slow | O(n²) exact, O(n log n) with the Barnes-Hut approximation |
| Can't transform new data | No transform(); the whole dataset must be refit |
| Crowding problem | Moderate distances are hard to represent faithfully in 2D; the Student-t heavy tail mitigates but does not fully fix this |
Common Misreadings of t-SNE
In a t-SNE plot: (1) distances between clusters are meaningless; (2) cluster sizes are meaningless; (3) only which clusters exist and which points group together carries meaning. Saying in an interview "cluster A is far from B, so they are very different" is wrong.
UMAP
Why UMAP Over t-SNE
Uniform Manifold Approximation and Projection is based on topological data analysis and is usually preferable to t-SNE:
| Aspect | t-SNE | UMAP |
|---|---|---|
| Speed | O(n²) or O(n log n) | Much faster in practice |
| Global structure | Not preserved | Better preservation |
| Scalability | Struggles > 100K | Handles millions |
| Determinism | Non-deterministic | More consistent (but still stochastic) |
| Dimensionality | Usually only 2D/3D | Can reduce to any d |
| Transform new data | No transform() | Has transform(), so it applies to new data |
| Theory | Probabilistic | Topological (Riemannian geometry) |
```python
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X_scaled)
# Can transform new data:
new_embedding = reducer.transform(X_new)
```
UMAP Parameters
| Parameter | Default | Effect |
|---|---|---|
| n_neighbors | 15 | Local vs global (like t-SNE's perplexity). Larger = more global. |
| min_dist | 0.1 | Minimum distance between points in the low-dim space. Smaller = tighter clusters. |
| n_components | 2 | Target dimensionality. Can be set > 2 for downstream ML. |
| metric | euclidean | Distance metric. Supports cosine, manhattan, etc. |
Other Methods
Kernel PCA
PCA captures linear structure. Kernel PCA uses the kernel trick to capture non-linear structure: it eigendecomposes the (centered) kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ instead of the covariance matrix.
As with kernel SVM, there is no need to explicitly map into the high-dimensional space. RBF kernel PCA can unroll non-linear manifolds such as the "Swiss roll".
LDA (Linear Discriminant Analysis)
Supervised dimensionality reduction: find the projection direction $w$ that best separates the classes by maximizing $J(w) = \frac{w^\top S_B w}{w^\top S_W w}$,
where $S_B$ = between-class scatter, $S_W$ = within-class scatter.
PCA finds the maximum-variance directions (unsupervised); LDA finds the directions that best separate classes (supervised). LDA can reduce to at most $C - 1$ dimensions ($C$ = number of classes).
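A minimal sklearn sketch of the $C - 1$ limit, using the iris dataset (4 features, 3 classes): LDA can project to at most 2 dimensions, and asking for more raises an error.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 4 features, C = 3 classes

# LDA projects to at most C - 1 = 2 dimensions
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)                  # (150, 2)

# Asking for more than C - 1 components is rejected
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
except ValueError as exc:
    print("n_components=3 rejected:", exc)
```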
Autoencoders
Neural network-based non-linear dimensionality reduction: an encoder maps $x$ to a low-dimensional latent code $z$, and a decoder reconstructs $\hat{x}$ from $z$; training minimizes the reconstruction error $\lVert x - \hat{x} \rVert^2$.
The latent code $z$ is the reduced representation. Relationship to PCA: a linear autoencoder with MSE loss recovers the PCA subspace. A non-linear autoencoder (e.g., with ReLU) can learn non-linear manifold structure.
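The linear-AE/PCA connection can be sketched without training a network: plug the PCA basis in as tied encoder/decoder weights ($z = XW$, $\hat{X} = zW^\top$) and the reconstruction matches PCA's exactly. (A trained linear AE would learn some rotation of this basis, not necessarily the ordered PCs themselves; this sketch only checks that the PCA basis is an optimal set of AE weights.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Data lying near a 2D subspace of a 10D space
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
X += rng.normal(scale=0.1, size=X.shape)
X -= X.mean(axis=0)   # center (a bias-free linear AE assumes centered data)

pca = PCA(n_components=2).fit(X)
W = pca.components_.T               # 10 x 2: PCA basis as tied AE weights

X_hat_ae = X @ W @ W.T              # "autoencoder" reconstruction: encode then decode
X_hat_pca = pca.inverse_transform(pca.transform(X))
print("max difference:", np.abs(X_hat_ae - X_hat_pca).max())
```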
Method Comparison
| Method | Linear? | Supervised? | Scalable? | New Data? | Best For |
|---|---|---|---|---|---|
| PCA | Yes | No | Yes (SVD, $O(np \min(n, p))$) | Yes | General purpose, preprocessing |
| TruncatedSVD | Yes | No | Yes (sparse) | Yes | Sparse data (text) |
| LDA | Yes | Yes | Yes | Yes | Classification preprocessing |
| Kernel PCA | No | No | No ($O(n^3)$) | No | Non-linear, small data |
| t-SNE | No | No | No ($O(n^2)$) | No | 2D visualization only |
| UMAP | No | No | Yes | Yes | Visualization + downstream ML |
| Autoencoder | No | No | Yes (GPU) | Yes | Large-scale, complex manifolds |
Real-World Use Cases
Case 1: PCA for Anomaly Detection in Credit Card Fraud
The raw features of credit card transactions (V1-V28 in the Kaggle dataset) are already PCA-transformed: banks don't release the original features, only the PCA components.
Another use: PCA reconstruction error as an anomaly detector.
```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_train_normal)  # train on normal data only
X_reconstructed = pca.inverse_transform(X_reduced)
reconstruction_error = np.mean((X_train_normal - X_reconstructed) ** 2, axis=1)
# High reconstruction error → anomaly
# Normal data is well-represented by the top PCs; fraud data isn't
```
Intuition: normal transactions live in a low-dimensional subspace, so PCA can reconstruct them from a few components. Fraudulent transactions deviate from that subspace, produce high reconstruction error, and get flagged as anomalies.
Case 2: SVD Matrix Factorization for Recommender Systems
The user-item rating matrix has 99% missing values. Use truncated SVD to find latent factors:
```python
from sklearn.decomposition import TruncatedSVD

# R = user-item matrix (sparse, n_users × n_items)
svd = TruncatedSVD(n_components=50)
user_factors = svd.fit_transform(R)  # n_users × 50
item_factors = svd.components_.T     # n_items × 50

# Predict rating for user i, item j
predicted_rating = user_factors[i] @ item_factors[j]
```
Interview follow-up: "How does SVD relate to collaborative filtering?" SVD is the core of model-based collaborative filtering: users and items are mapped into the same latent space, where proximity means similar preferences.
Case 3: UMAP + Clustering for Customer Segmentation
Running K-Means directly on high-dimensional customer features works poorly (curse of dimensionality). Reduce with UMAP first, then cluster:
```python
import umap
from sklearn.cluster import KMeans

# Step 1: Reduce to manageable dimensions
reducer = umap.UMAP(n_components=10, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

# Step 2: Cluster in reduced space
km = KMeans(n_clusters=5, random_state=42)
labels = km.fit_predict(X_umap)

# Step 3: Visualize in 2D
reducer_2d = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer_2d.fit_transform(X_scaled)
# Color by cluster labels → see if clusters are visually separated
```
Interview follow-up: "Why UMAP(n_components=10) before K-Means rather than clustering in the original space?" The original space may have hundreds of dimensions, where Euclidean distance loses its discriminating power. UMAP down to 10 dimensions preserves the main structure, and K-Means performs better in the lower-dimensional space.
Hands-on: Dimensionality Reduction in Python
PCA with Explained Variance
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always standardize before PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and check explained variance
pca = PCA()
pca.fit(X_scaled)
cumulative_var = pca.explained_variance_ratio_.cumsum()
n_95 = (cumulative_var < 0.95).sum() + 1
print(f"Components for 95% variance: {n_95} / {X_scaled.shape[1]}")

# Apply dimensionality reduction
pca_final = PCA(n_components=n_95)
X_reduced = pca_final.fit_transform(X_scaled)
```
t-SNE vs UMAP Visualization
```python
from sklearn.manifold import TSNE
import umap

# t-SNE (slow, 2D only, no transform)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP (fast, any dim, has transform)
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
X_new_umap = reducer.transform(X_new)  # can apply to new data!
```
Interview Signals
What interviewers listen for:
- You can explain PCA's math (eigendecomposition or SVD) and its geometric intuition
- You know to standardize before PCA, and why
- You can contrast t-SNE and UMAP, and you know that distances and sizes in a t-SNE plot are meaningless
- You understand the concrete effects of the curse of dimensionality (distance concentration, volume explosion)
- You know when dimensionality reduction helps and when it doesn't (trees are unaffected by the curse)
Practice
Flashcards
What directions does PCA find, and why?
Maximum-variance directions (the principal components). The assumption: the highest-variance directions carry the most information, and the lowest-variance ones are noise. Dimensionality reduction = dropping the low-variance directions. Mathematically, they are the eigenvectors of the covariance matrix.
Quiz
What does PCA's first principal component represent?