Dimensionality Reduction

Interview Context

Dimensionality reduction commonly appears in interviews in two forms: (1) "You have 1,000 features — how do you handle them?" (2) "Explain how PCA works and what its limitations are." You need the theory, but also a sense of when reduction helps and when it is unnecessary.

What You Should Understand

  • Explain the math behind PCA (eigendecomposition / SVD) and its geometric intuition
  • Know PCA's assumptions and limitations (linear, variance = importance)
  • Compare when PCA, t-SNE, and UMAP are each appropriate
  • Understand how the curse of dimensionality affects distance-based methods
  • Know when dimensionality reduction helps and when it actively hurts

The Curse of Dimensionality

Why High Dimensions Are Problematic

As dimensionality d increases, several counterintuitive phenomena emerge:

1. Distance concentration: All pairwise distances converge to the same value.

\frac{\max \|x_i - x_j\| - \min \|x_i - x_j\|}{\min \|x_i - x_j\|} \to 0 \quad \text{as } d \to \infty

All pairwise distances become nearly equal → distance-based methods (KNN, K-Means, SVM with RBF) lose their discriminative power.
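This effect is easy to check empirically. Below is a minimal sketch (assuming uniformly distributed data; the function name is illustrative) comparing the relative contrast of distances in low and high dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=500):
    """(max - min) / min over distances from one reference point."""
    X = rng.random((n, d))  # n points uniform in the unit hypercube
    dist = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to the first point
    return (dist.max() - dist.min()) / dist.min()

print(relative_contrast(2))     # large: distances vary widely in low dimensions
print(relative_contrast(1000))  # small: distances concentrate
```

The contrast shrinks toward 0 as d grows, which is exactly why nearest-neighbor rankings stop being meaningful.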

2. Volume explosion: The volume of a unit hypersphere relative to the unit hypercube shrinks to 0.

\frac{V_{\text{sphere}}}{V_{\text{cube}}} = \frac{\pi^{d/2}}{d \cdot 2^{d-1} \cdot \Gamma(d/2)} \to 0

Almost all of the data concentrates in the "corners" of the hypercube → very little data lies near the center.
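The shrinking ratio can be checked numerically with the closed-form ball volume (the helper name is illustrative):

```python
import math

def sphere_to_cube_ratio(d):
    # volume of the unit-radius d-ball / volume of the enclosing cube of side 2:
    # pi^{d/2} / (Gamma(d/2 + 1) * 2^d), which equals the formula above since
    # Gamma(d/2 + 1) = (d/2) * Gamma(d/2)
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in (2, 10, 100):
    print(d, sphere_to_cube_ratio(d))
# ratio ~0.785 at d=2, ~0.0025 at d=10, astronomically small at d=100
```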

3. Sample complexity: The number of samples needed for reliable statistical estimates in a high-dimensional space grows exponentially with the dimension.

Interview Tip

When asked about the curse of dimensionality, don't just say "high dimensions = bad." Be specific: distance concentration breaks KNN/K-Means, volume explosion makes density estimation hard, and sample complexity makes models prone to overfitting.

When to Reduce Dimensions

| Scenario | Reduce? | Method |
|---|---|---|
| 1000 features, many correlated | Yes | PCA — removes redundancy |
| Visualization of clusters | Yes | t-SNE / UMAP — reduce to 2D |
| KNN/K-Means on high-dim data | Yes | PCA → clustering |
| Tree-based models (RF, GBM) | Usually no | Trees do feature selection naturally and are unaffected by the distance curse |
| Deep learning (images, text) | Usually no | The network learns its own representation |
| Feature selection already done | Maybe no | If features are already curated, PCA adds little |

PCA (Principal Component Analysis)

The Core Idea

PCA finds the orthogonal directions (principal components) along which the data has maximum variance:

  • PC1 = direction of maximum variance
  • PC2 = direction of maximum variance orthogonal to PC1
  • PC3 = ... and so on

Mathematical Formulation

Given a centered data matrix \mathbf{X} (n \times p), PCA is equivalent to:

1. Eigendecomposition of the covariance matrix:

\mathbf{C} = \frac{1}{n-1}\mathbf{X}^\top \mathbf{X} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top

where \mathbf{V} = eigenvectors (principal component directions) and \boldsymbol{\Lambda} = diagonal matrix of eigenvalues.

2. Singular Value Decomposition (SVD) of \mathbf{X}:

\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top

The principal components are the columns of \mathbf{V}. Projected data = \mathbf{X}\mathbf{V}_k (keeping the first k components).
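As a sanity check, the SVD-based projection can be compared against sklearn's PCA on synthetic data (a minimal sketch; components may differ by per-column sign flips, so we compare absolute values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)  # center the data

# SVD of the centered matrix: X = U S V^T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_proj = Xc @ Vt[:k].T  # project onto the first k principal directions

# sklearn's PCA gives the same projection up to sign flips per component
X_sklearn = PCA(n_components=k).fit_transform(X)
assert np.allclose(np.abs(X_proj), np.abs(X_sklearn))
```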

Explained Variance

Each eigenvalue \lambda_i represents the variance captured by that component:

\text{Explained variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}

Cumulative explained variance: plot \sum_{i=1}^{k} \lambda_i / \sum_j \lambda_j vs. k, and choose k so that the cumulative variance exceeds some threshold (typically 90-95%).

from sklearn.decomposition import PCA
import numpy as np

# X_scaled: standardized feature matrix (e.g., StandardScaler output)
pca = PCA(n_components=0.95)  # retain 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Original: {X_scaled.shape[1]} features")
print(f"Reduced:  {X_reduced.shape[1]} components")
print(f"Explained variance: {pca.explained_variance_ratio_.cumsum()[-1]:.1%}")

Geometric Interpretation

The geometric intuition: PCA rotates the coordinate system to align with the shape of the data.

  • The original features may be correlated → the axes don't align with the directions the data spreads in
  • PCA finds the directions along which the data actually spreads → the new axes = principal components
  • Dimensionality reduction = discarding the lowest-variance directions (assuming they are noise)

PCA Assumptions and Limitations

| Assumption | Meaning | When It Fails |
|---|---|---|
| Linearity | Components are linear combinations of features | Non-linear manifold structure |
| Variance = importance | High-variance directions are informative | The highest-variance direction may be noise |
| Orthogonality | Components are uncorrelated | True latent factors may be non-orthogonal |
| Continuous features | Distances are meaningful | Categorical / binary features |
| Centered data | Mean subtracted | PCA without standardization → dominated by the largest-scale feature |

Always Standardize Before PCA

If features are on different scales (e.g., income 0-1M vs. age 0-100), PCA will be dominated by the large-scale feature, because it has the largest variance. Always apply StandardScaler before PCA — unless you are sure all features are already on the same scale.
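A quick illustration of scale dominance, using hypothetical income/age features (all numbers are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(50_000, 20_000, size=(300, 1))  # variance ~ 4e8
age = rng.normal(40, 12, size=(300, 1))             # variance ~ 144
X = np.hstack([income, age])

# Without scaling, PC1 is essentially just the income axis
pca_raw = PCA().fit(X)
print(pca_raw.explained_variance_ratio_)  # first ratio ~ 1.0

# After standardization, both features contribute
X_std = StandardScaler().fit_transform(X)
pca_std = PCA().fit(X_std)
print(pca_std.explained_variance_ratio_)  # roughly balanced
```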

PCA for Feature Engineering

PCA is not just a dimensionality reduction tool — it also works as a feature engineering step:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# PCA as preprocessing step
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),
    GradientBoostingClassifier()
)
# Reduces overfitting when p >> n
# Also removes multicollinearity (PCA components are orthogonal)

SVD (Singular Value Decomposition)

Relationship to PCA

Any matrix \mathbf{X} (n \times p) can be decomposed as:

\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top

  • \mathbf{U} (n \times n): left singular vectors
  • \boldsymbol{\Sigma} (n \times p): diagonal matrix of singular values \sigma_i
  • \mathbf{V} (p \times p): right singular vectors = PCA eigenvectors
  • Relationship: \lambda_i = \sigma_i^2 / (n-1)

Computing PCA via SVD is more numerically stable and more efficient than an explicit eigendecomposition — sklearn computes PCA via SVD internally.
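The \lambda_i = \sigma_i^2 / (n-1) relationship from the list above can be confirmed numerically (a small sketch on random data; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Eigenvalues of the covariance matrix, sorted descending
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Singular values of the centered data matrix (descending by convention)
S = np.linalg.svd(Xc, compute_uv=False)

assert np.allclose(eigvals, S**2 / (n - 1))  # lambda_i = sigma_i^2 / (n-1)
```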

Truncated SVD

Keep only the top k singular values/vectors:

\mathbf{X} \approx \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^\top

This is the best rank-k approximation of \mathbf{X} (Eckart-Young theorem).
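The theorem has a convenient numerical signature: the Frobenius error of the rank-k truncation equals the root of the discarded squared singular values (a minimal sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = (U[:, :k] * S[:k]) @ Vt[:k]  # best rank-k approximation

err = np.linalg.norm(A - A_k)  # Frobenius norm of the residual
assert np.isclose(err, np.sqrt(np.sum(S[k:] ** 2)))
```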

Key advantage: TruncatedSVD works on sparse matrices — PCA requires centering (mean subtraction), which destroys sparsity. For text data (TF-IDF), use TruncatedSVD instead of PCA.

from sklearn.decomposition import TruncatedSVD

# For sparse data (e.g., TF-IDF text features)
svd = TruncatedSVD(n_components=100)
X_reduced = svd.fit_transform(X_tfidf)  # keeps sparsity during computation

SVD in Recommender Systems

Matrix factorization for recommendations: user-item rating matrix \mathbf{R} \approx \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top.

Both users and items are mapped into a low-dimensional latent space → predict a missing rating as \mathbf{u}_i^\top \boldsymbol{\Sigma} \mathbf{v}_j. This was one of the core methods of the Netflix Prize.

t-SNE

Algorithm

t-Distributed Stochastic Neighbor Embedding — a non-linear method for visualization (reducing to 2D or 3D).

Steps:

  1. Compute pairwise similarities in high dimensions using Gaussian kernels (nearby points have high similarity)
  2. Initialize points randomly in 2D
  3. Define similarities in 2D using a Student-t distribution (heavier tails than a Gaussian)
  4. Minimize the KL divergence between the high-dim and low-dim similarity distributions via gradient descent

Key Parameters

Perplexity (typically 5-50): roughly the number of effective nearest neighbors. It controls the local vs. global structure tradeoff:

| Perplexity | Effect |
|---|---|
| Low (5-10) | Very local — small tight clusters, may miss global structure |
| Medium (30) | Balanced — typical default |
| High (50-100) | More global — clusters spread out, may blur fine structure |

t-SNE Limitations

| Limitation | Detail |
|---|---|
| Visualization only | Don't feed t-SNE output into downstream ML (distances are not preserved) |
| Non-deterministic | Different random seeds → different results; run several times to confirm the pattern is stable |
| Doesn't preserve global structure | Distances between clusters and cluster sizes are meaningless |
| Slow | O(n^2) exact; O(n \log n) with the Barnes-Hut approximation |
| Can't transform new data | No transform() — the whole dataset must be refit |
| Crowding problem | 2D has too little room to represent moderate distances faithfully; the Student-t's heavy tails mitigate but don't eliminate the distortion |

Common Misreadings of t-SNE Plots

In a t-SNE plot: (1) distances between clusters are meaningless; (2) cluster sizes are meaningless; (3) only cluster membership — which points end up together — carries information. In an interview, looking at a t-SNE plot and saying "cluster A is far from B, so they must be very different" is wrong.

UMAP

Why UMAP Over t-SNE

Uniform Manifold Approximation and Projection — based on topological data analysis, and generally an improvement over t-SNE:

| Aspect | t-SNE | UMAP |
|---|---|---|
| Speed | O(n^2) or O(n \log n) | O(n^{1.14}) empirically — much faster |
| Global structure | Not preserved | Better preserved |
| Scalability | Struggles beyond ~100K points | Handles millions |
| Determinism | Non-deterministic | More consistent (but still stochastic) |
| Dimensionality | Usually only 2D/3D | Can reduce to any k |
| Transform new data | No transform() | Has transform() — can be applied to new data |
| Theory | Probabilistic | Topological (Riemannian geometry) |

import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X_scaled)
# Can transform new data:
new_embedding = reducer.transform(X_new)

UMAP Parameters

| Parameter | Default | Effect |
|---|---|---|
| n_neighbors | 15 | Local vs. global (analogous to t-SNE's perplexity); larger = more global |
| min_dist | 0.1 | Minimum spacing between points in the embedding; smaller = tighter clusters |
| n_components | 2 | Target dimensionality; can be set > 2 for downstream ML |
| metric | euclidean | Distance metric; supports cosine, manhattan, etc. |

Other Methods

Kernel PCA

PCA captures linear structure. Kernel PCA uses the kernel trick to capture non-linear structure:

\text{Kernel PCA} = \text{PCA in the kernel-induced feature space}

As with kernel SVM, there is no need to explicitly map into the high-dimensional space. RBF kernel PCA can unroll non-linear manifolds such as the "Swiss roll".
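A minimal sketch with sklearn's KernelPCA on concentric circles (the gamma value here is an illustrative choice, not a recommendation):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric rings: no linear direction separates them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA: the rings become (nearly) linearly separable in the new space
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

# A linear classifier fails on the raw circles but succeeds after kernel PCA
acc_raw = LogisticRegression().fit(X, y).score(X, y)
acc_kpca = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)
print(acc_raw, acc_kpca)
```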

LDA (Linear Discriminant Analysis)

Supervised dimensionality reduction — finds the projection directions that best separate the classes:

\max_{\mathbf{w}} \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}}

where \mathbf{S}_B = between-class scatter and \mathbf{S}_W = within-class scatter.

PCA finds the maximum-variance directions (unsupervised); LDA finds the most class-discriminative directions (supervised). LDA can reduce to at most K-1 dimensions (K = number of classes).
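A minimal sketch with sklearn on the iris dataset (K = 3 classes, so at most 2 discriminant directions):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# n_components must be <= K - 1 = 2 for 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # note: fit requires labels, unlike PCA
print(X_lda.shape)  # (150, 2)
```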

Autoencoders

Neural network-based non-linear dimensionality reduction:

\text{Input} \xrightarrow{\text{Encoder}} \text{Latent code (bottleneck)} \xrightarrow{\text{Decoder}} \text{Reconstruction}

The latent code is the reduced representation. Relationship to PCA: a linear autoencoder with MSE loss learns the same subspace as PCA. A non-linear autoencoder (e.g., with ReLU activations) can learn non-linear manifold structure.

Method Comparison

| Method | Linear? | Supervised? | Scalable? | New Data? | Best For |
|---|---|---|---|---|---|
| PCA | Yes | No | Yes (O(np^2)) | Yes | General purpose, preprocessing |
| TruncatedSVD | Yes | No | Yes (sparse) | Yes | Sparse data (text) |
| LDA | Yes | Yes | Yes | Yes | Classification preprocessing |
| Kernel PCA | No | No | No (O(n^3)) | No | Non-linear, small data |
| t-SNE | No | No | No (O(n^2)) | No | 2D visualization only |
| UMAP | No | No | Yes | Yes | Visualization + downstream ML |
| Autoencoder | No | No | Yes (GPU) | Yes | Large-scale, complex manifolds |

Real-World Use Cases

Case 1: Credit Card Fraud Detection — PCA for Anomaly Detection

The raw features of credit card transactions (V1-V28 in the Kaggle dataset) are already PCA-transformed — the bank doesn't release the original features, only PCA components.

Another use: PCA reconstruction error as an anomaly score.

from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_train_normal)  # train on normal only
X_reconstructed = pca.inverse_transform(X_reduced)
reconstruction_error = np.mean((X_train_normal - X_reconstructed) ** 2, axis=1)

# High reconstruction error → anomaly
# Normal data is well-represented by top PCs; fraud data isn't

Intuition: normal transactions live in a low-dimensional subspace, so PCA can reconstruct them with a few components. Fraudulent transactions deviate from this subspace → high reconstruction error → flagged as anomalies.

Case 2: Recommender Systems — SVD Matrix Factorization

The user-item rating matrix is ~99% missing values. Use truncated SVD to find latent factors:

from sklearn.decomposition import TruncatedSVD

# R = user-item matrix (sparse, n_users × n_items)
svd = TruncatedSVD(n_components=50)
user_factors = svd.fit_transform(R)          # n_users × 50
item_factors = svd.components_.T              # n_items × 50

# Predict rating for user i, item j
predicted_rating = user_factors[i] @ item_factors[j]

Interview follow-up: "How does SVD relate to collaborative filtering?" — SVD is the core of model-based collaborative filtering. Users and items are mapped into the same latent space; being close in that space = similar preferences.

Case 3: Customer Segmentation — UMAP + Clustering

Running K-Means directly on high-dimensional customer features performs poorly (curse of dimensionality). Reduce with UMAP first, then cluster:

import umap
from sklearn.cluster import KMeans

# Step 1: Reduce to manageable dimensions
reducer = umap.UMAP(n_components=10, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

# Step 2: Cluster in reduced space
km = KMeans(n_clusters=5, random_state=42)
labels = km.fit_predict(X_umap)

# Step 3: Visualize in 2D
reducer_2d = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer_2d.fit_transform(X_scaled)
# Color by cluster labels → see if clusters are visually separated

Interview follow-up: "Why UMAP(n_components=10) then K-Means, instead of clustering in the original space?" — The original space may have hundreds of dimensions, where Euclidean distance loses its discriminative power. UMAP down to 10 dimensions keeps the main structure, and K-Means performs better in the low-dimensional space.

Hands-on: Dimensionality Reduction in Python

PCA with Explained Variance

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always standardize before PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and check explained variance
pca = PCA()
pca.fit(X_scaled)

cumulative_var = pca.explained_variance_ratio_.cumsum()
n_95 = (cumulative_var < 0.95).sum() + 1
print(f"Components for 95% variance: {n_95} / {X_scaled.shape[1]}")

# Apply dimensionality reduction
pca_final = PCA(n_components=n_95)
X_reduced = pca_final.fit_transform(X_scaled)

t-SNE vs UMAP Visualization

from sklearn.manifold import TSNE
import umap

# t-SNE (slow, 2D only, no transform)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP (fast, any dim, has transform)
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
X_new_umap = reducer.transform(X_new)  # can apply to new data!

Interview Signals

What interviewers listen for:

  • You can explain PCA's math (eigendecomposition or SVD) and its geometric intuition
  • You know to standardize before PCA, and why
  • You can weigh t-SNE against UMAP, and know that distances/sizes in a t-SNE plot are meaningless
  • You understand the concrete effects of the curse of dimensionality (distance concentration, volume explosion)
  • You know when reduction helps and when it's unnecessary (trees are unaffected by the curse)

Practice

Flashcards


What directions does PCA find, and why?

Maximum-variance directions (the principal components). Assumption: the highest-variance directions carry the most information, and the lowest-variance ones are noise. Reduction = dropping the low-variance directions. Mathematically, they are the eigenvectors of the covariance matrix.


Quiz


What does PCA's first principal component represent?
