Decompositions & Matrix Calculus
Course Reference
Extends Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford) with decompositions and matrix calculus essential for ML/DL.
Why This Matters
Eigendecomposition is the mathematical foundation of PCA, SVD is the core of recommender systems and LoRA, and matrix calculus is the language of backpropagation. Understanding these lets you reason about ML algorithms from first principles instead of just calling APIs.
Eigendecomposition
Definition
For a square matrix $A \in \mathbb{R}^{n \times n}$, an eigenvector $v \neq 0$ and eigenvalue $\lambda$ satisfy:
$$Av = \lambda v$$
Intuition: $A$ acting on $v$ only scales it (the direction is unchanged); $\lambda$ is the scaling factor.
Eigendecomposition
For a symmetric matrix $A$:
$$A = V \Lambda V^\top$$
- $V$: columns = eigenvectors (orthonormal when $A$ is symmetric)
- $\Lambda$: diagonal matrix of eigenvalues
ML Connections
| Application | How Eigendecomposition Is Used |
|---|---|
| PCA | Eigenvectors of covariance matrix = principal components; eigenvalues = explained variance |
| PageRank | Dominant eigenvector of transition matrix = page importance scores |
| Spectral clustering | Eigenvectors of graph Laplacian = cluster indicators |
| Markov chains (RL) | Eigenvector of the transition matrix with $\lambda = 1$ = stationary distribution |
| Convexity check | Eigenvalues of Hessian: all $\lambda_i > 0$ → convex, mixed signs → saddle points |
import numpy as np
A = np.array([[4, 2], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)  # general matrix: possibly complex, unsorted
# For symmetric matrices (e.g. covariance), use eigh (guaranteed real, sorted ascending)
X = np.random.randn(100, 5)  # data matrix: rows = samples, columns = features
cov = np.cov(X.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# eigenvalues sorted ascending; last = largest = PC1
import numpy as np
# Eigendecomposition
A = np.array([[4, 2], [2, 3]])
eigenvalues, eigenvectors = np.linalg.eigh(A) # eigh for symmetric (real eigenvalues, sorted)
# eigenvalues: [1.44, 5.56]  (= (7 ∓ √17)/2)
# eigenvectors: columns are the eigenvectors
# Verify: A @ v = λ * v
v = eigenvectors[:, 1] # eigenvector for largest eigenvalue
lam = eigenvalues[1]
np.allclose(A @ v, lam * v) # True ✓
# Reconstruct: A = V Λ V^T
A_reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
np.allclose(A, A_reconstructed) # True ✓
# PCA: eigendecomposition of covariance matrix
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (len(X) - 1) # = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)
# eig_vals[-1] = largest = PC1's variance
# eig_vecs[:, -1] = PC1 direction
explained_ratio = eig_vals / eig_vals.sum() # explained variance per PC
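As a quick numerical check of the PCA claims above (a minimal sketch with synthetic data), projecting the centered data onto the eigenvectors yields uncorrelated coordinates whose variances are exactly the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # correlated data
Xc = X - X.mean(axis=0)

cov = np.cov(Xc.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigenvalues ascending

Z = Xc @ eig_vecs                          # project onto all PCs
proj_var = Z.var(axis=0, ddof=1)           # variance along each PC

# Variance of each projected coordinate equals the corresponding eigenvalue
print(np.allclose(proj_var, eig_vals))                            # True
# Projected coordinates are uncorrelated: covariance of Z is diagonal
print(np.allclose(np.cov(Z.T), np.diag(eig_vals), atol=1e-8))     # True
```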
PCA via Eigendecomposition
- PC1 = eigenvector with largest eigenvalue = direction of maximum variance
- PC2 = eigenvector with second largest eigenvalue, orthogonal to PC1
- Explained variance ratio = $\lambda_i / \sum_j \lambda_j$
Interview link: PCA
"What is the math behind PCA?" → Eigendecomposition of the covariance matrix of centered data. Eigenvectors = principal directions, eigenvalues = variance in those directions. Dimensionality reduction = keep only the top-k eigenvectors.
Singular Value Decomposition (SVD)
Definition
Any matrix $X \in \mathbb{R}^{m \times n}$ (not just square) can be decomposed as:
$$X = U \Sigma V^\top$$
In the compact form, $r = \min(m, n)$:
| Component | Size | Properties |
|---|---|---|
| $U$ | $m \times r$ | Left singular vectors (orthonormal columns) |
| $\Sigma$ | $r \times r$ | Diagonal: singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0$ |
| $V$ | $n \times r$ | Right singular vectors (orthonormal columns) = PCA eigenvectors |
Relationship to Eigendecomposition
$X^\top X = V \Sigma^2 V^\top$ — the right singular vectors of $X$ are the eigenvectors of $X^\top X$, and its eigenvalues are $\sigma_i^2$. PCA via SVD: take the SVD of the centered data matrix → principal components = columns of $V$. SVD is more numerically stable than eigendecomposition — sklearn's PCA uses SVD under the hood.
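This relationship can be verified numerically (a minimal sketch; eigenvectors are only defined up to sign, so the comparison uses the absolute dot product):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)        # ascending

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values relate to eigenvalues: lambda_i = sigma_i^2 / (n - 1)
print(np.allclose(sigma[::-1] ** 2 / (len(Xc) - 1), eig_vals))  # True

# PC1 from both routes agrees up to sign
pc1_eig = eig_vecs[:, -1]
pc1_svd = Vt[0]
print(np.isclose(abs(pc1_eig @ pc1_svd), 1.0))  # True
```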
Truncated SVD: Low-Rank Approximation
Keep only the top $k$ singular values:
$$X_k = U_k \Sigma_k V_k^\top$$
Eckart–Young theorem: this is the best rank-$k$ approximation of $X$ in Frobenius norm.
ML Connections
| Application | How SVD Is Used |
|---|---|
| PCA | SVD of centered data → columns of $V$ = PCs |
| Recommender systems | $R \approx U_k \Sigma_k V_k^\top$ — user/item latent factors |
| LSA (NLP) | TruncatedSVD on TF-IDF → topic extraction |
| Image compression | Keep top singular values → reconstruct with fewer parameters |
| LoRA (LLM fine-tuning) | $\Delta W = BA$, rank $r \ll d$ — low-rank update to frozen weights |
| Pseudo-inverse | $A^+ = V \Sigma^+ U^\top$ — least-squares solutions |
import numpy as np
# Full SVD
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
# Reconstruct: X ≈ U @ np.diag(sigma) @ Vt
# Truncated SVD (for sparse data — PCA alternative)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X_tfidf) # keeps sparsity
import numpy as np
# Full SVD
X = np.random.randn(100, 10)
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
# U: (100, 10), sigma: (10,), Vt: (10, 10)
# Reconstruct
X_reconstructed = U @ np.diag(sigma) @ Vt
np.allclose(X, X_reconstructed) # True ✓
# Truncated SVD: rank-k approximation
k = 3
X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(X - X_approx, 'fro')
total = np.linalg.norm(X, 'fro')
print(f"Rank-{k} captures {1 - error**2/total**2:.1%} of the squared Frobenius norm")
# Relationship: eigenvalues of X^T X = sigma^2
XtX_eigs = np.linalg.eigvalsh(X.T @ X)
sigma_squared = sigma**2
# XtX_eigs (sorted ascending) ≈ sigma_squared (sorted descending, reversed)
# TruncatedSVD for sparse data (NLP)
from sklearn.decomposition import TruncatedSVD
# svd = TruncatedSVD(n_components=50).fit_transform(X_tfidf_sparse)
# LoRA: low-rank weight update
d, r = 768, 16 # model dim, LoRA rank
B = np.random.randn(d, r) * 0.01 # d × r
A = np.random.randn(r, d) * 0.01 # r × d
delta_W = B @ A # rank-r update: (d × r) @ (r × d) = (d × d)
# Full W has d² = 589K params, LoRA has 2dr = 24K params (24x fewer)
Interview link: LoRA
LoRA (Low-Rank Adaptation) fine-tunes LLMs with low-rank matrices: $W' = W + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$. Only $B$ and $A$ are trained (far fewer params than a full $\Delta W$). Core insight: weight updates are often effectively low-rank → no need for a full-rank update matrix.
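A quick numerical check of the low-rank claim (a sketch with toy dimensions, not real model weights): the product of a $d \times r$ and an $r \times d$ matrix has rank at most $r$, which the singular values make visible:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 64, 4                      # toy dims (real LoRA: d ~ 768+, r ~ 8-64)
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W = B @ A                   # rank-r by construction

sigma = np.linalg.svd(delta_W, compute_uv=False)
print(np.linalg.matrix_rank(delta_W))   # 4
print(np.allclose(sigma[r:], 0))        # True: singular values beyond r vanish
```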
Matrix Calculus
Why Matrix Calculus for ML
Backpropagation = the chain rule applied to matrix operations. Understanding matrix derivatives lets you derive gradients by hand (a common interview task) and debug gradient issues.
Scalar-by-Vector Gradient
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is:
$$\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right]^\top$$
The gradient points in the direction in which $f$ increases fastest. Gradient descent: $x \leftarrow x - \eta \nabla f(x)$ — move in the opposite direction.
Common Gradients
| Function | Gradient | ML Usage |
|---|---|---|
| $f = a^\top x$ | $\nabla f = a$ | Linear model gradient |
| $f = x^\top A x$ | $\nabla f = (A + A^\top)x$ (= $2Ax$ if symmetric) | Quadratic form (regularization) |
| $f = \lVert y - X\beta \rVert^2$ | $\nabla f = -2X^\top(y - X\beta)$ | OLS gradient → set to 0 → normal equation |
| $f = \lVert \beta \rVert^2$ | $\nabla f = 2\beta$ | L2 regularization gradient |
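The table's gradient formulas can be checked numerically with a central-difference sketch (`num_grad` below is an illustrative helper, not a library function):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference numerical gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))           # general (non-symmetric) A
x = rng.standard_normal(4)

# f = x^T A x  ->  grad = (A + A^T) x
analytic = (A + A.T) @ x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), analytic, atol=1e-5))  # True
```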
OLS Derivation via Matrix Calculus
$$L(\beta) = \lVert y - X\beta \rVert^2, \quad \nabla L = -2X^\top(y - X\beta) = 0 \;\Rightarrow\; X^\top X \hat\beta = X^\top y \;\Rightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y$$
Being able to derive the OLS closed form by hand in an interview is a strong signal.
import numpy as np
# Gradient of f = a^T x → ∇f = a
a = np.array([2, -1, 3])
x = np.array([1, 2, 3])
# f = a @ x = 2 - 2 + 9 = 9
# gradient w.r.t. x = a = [2, -1, 3]
# Gradient of f = ||x||^2 = x^T x → ∇f = 2x
x = np.array([1, 2, 3])
grad = 2 * x # [2, 4, 6]
# OLS: verify gradient = 0 at optimal beta
X = np.random.randn(100, 3)
y_true = X @ np.array([2, -1, 0.5]) + np.random.randn(100) * 0.1
beta = np.linalg.solve(X.T @ X, X.T @ y_true)  # OLS closed form (solve is more stable than inv)
# Gradient at optimal beta should be ≈ 0
gradient = -2 * X.T @ (y_true - X @ beta)
print(f"Gradient at optimal beta: {gradient}") # ≈ [0, 0, 0]
# Ridge: gradient includes regularization term
lam = 1.0
# ∇L = -2X^T(y - Xβ) + 2λβ = 0
# → (X^TX + λI)β = X^Ty
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ y_true
Taylor Approximation
Scalar Taylor Expansion
Any smooth function can be approximated locally by a polynomial:
$$f(x) \approx f(a) + f'(a)(x - a) + \tfrac{1}{2} f''(a)(x - a)^2 + \dots$$
| Order | Approximation | What It Captures |
|---|---|---|
| 0th | $f(a)$ | Value only |
| 1st (linear) | $f(a) + f'(a)(x-a)$ | Slope — used in gradient descent |
| 2nd (quadratic) | $f(a) + f'(a)(x-a) + \tfrac{1}{2}f''(a)(x-a)^2$ | Curvature — used in Newton's method, XGBoost |
Multivariate Taylor Expansion
For $f: \mathbb{R}^n \to \mathbb{R}$:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2} \Delta x^\top H \Delta x$$
- 1st-order term → theoretical basis of gradient descent (local linear approximation)
- 2nd-order term → the Hessian $H$ supplies curvature information
ML Connections
| Application | Which Order | How |
|---|---|---|
| Gradient descent | 1st order | $x_{t+1} = x_t - \eta \nabla f(x_t)$ — move along the negative gradient (linear approx) |
| Newton's method | 2nd order | $x_{t+1} = x_t - H^{-1} \nabla f(x_t)$ — use curvature for a better step |
| XGBoost | 2nd order | Loss ≈ $\sum_i \big[ g_i f(x_i) + \tfrac{1}{2} h_i f(x_i)^2 \big]$ — gradient + Hessian for optimal leaf values |
| Adam optimizer | ≈ diagonal 2nd order | Second moment ≈ diagonal Hessian → adaptive lr per parameter |
| Natural gradient | 2nd order (Fisher) | $\theta \leftarrow \theta - \eta F^{-1} \nabla L$ — curvature in parameter space |
Interview link: XGBoost Second-Order
XGBoost takes a second-order Taylor expansion of the loss function: $L \approx \sum_i \big[ l(y_i, \hat y_i) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \big]$, where $g_i = \partial l / \partial \hat y_i$ (gradient) and $h_i = \partial^2 l / \partial \hat y_i^2$ (Hessian). Using both $g_i$ and $h_i$ to pick the best split and leaf value converges faster than vanilla gradient boosting, which uses only the gradient.
import numpy as np
# Taylor approximation of f(x) = exp(x) around x=0
x0 = 0
dx = 0.5
f_exact = np.exp(x0 + dx) # exact: 1.6487
f_0th = np.exp(x0) # 0th order: 1.0
f_1st = np.exp(x0) * (1 + dx) # 1st order: 1.5
f_2nd = np.exp(x0) * (1 + dx + dx**2/2) # 2nd order: 1.625
# Higher order → closer to exact
# Gradient descent = 1st order Taylor
# Newton's method = 2nd order Taylor
# f(x + Δx) ≈ f(x) + f'(x)Δx + ½f''(x)Δx²
# Minimize: set derivative w.r.t. Δx to 0:
# f'(x) + f''(x)Δx = 0 → Δx = -f'(x)/f''(x) → Newton step
# XGBoost: second-order approximation for each sample
# g_i = gradient of loss at current prediction
# h_i = Hessian (second derivative) of loss
# For squared error: g_i = 2(y_hat - y), h_i = 2
# For log loss: g_i = y_hat - y, h_i = y_hat * (1 - y_hat)
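The Newton step derived in the comments above can be run on a concrete 1-D example (a minimal sketch): minimize $f(x) = e^x - 2x$, whose minimum is at $x = \ln 2$.

```python
import numpy as np

f_prime  = lambda x: np.exp(x) - 2      # f(x) = exp(x) - 2x
f_second = lambda x: np.exp(x)

x = 0.0
for _ in range(6):
    x -= f_prime(x) / f_second(x)       # Newton step: Δx = -f'(x)/f''(x)

print(np.isclose(x, np.log(2)))         # True: converged to x = ln 2
```

Quadratic convergence means a handful of steps suffice, versus many more for plain gradient descent on the same function.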
Jacobian
Definition
For a vector function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all partial derivatives:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
ML Connections
| Where Jacobians Appear | Why |
|---|---|
| Backprop in RNNs | $\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}$ — product of Jacobians → vanishing/exploding gradients |
| Change of variables | Probability transforms with Jacobian determinant — normalizing flows, VAE |
| Neural network layers | Each layer's Jacobian describes how output changes with input |
Interview link: Vanishing Gradient in RNNs
The RNN gradient involves a product of Jacobians: $\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}$. If every Jacobian's spectral norm (largest singular value) is < 1, the product → 0 (vanishing); if > 1, the product → ∞ (exploding). LSTM's additive cell-state update avoids this multiplicative chain.
import numpy as np
# Jacobian: matrix of partial derivatives
# f: R^2 → R^2, f(x) = [x1^2 + x2, x1*x2]
# J = [[2*x1, 1], [x2, x1]]
x = np.array([3.0, 2.0])
J = np.array([[2*x[0], 1],
[x[1], x[0]]])
# J = [[6, 1], [2, 3]]
# Spectral norm = largest singular value (controls gradient magnitude)
spectral_norm = np.linalg.svd(J, compute_uv=False)[0] # largest singular value
# RNN vanishing gradient: product of Jacobians
# If spectral_norm < 1 for each J_i, product → 0
W_hh = np.random.randn(4, 4) * 0.5 # small weights
tanh_deriv = np.diag(np.random.uniform(0, 1, 4)) # tanh' ∈ (0, 1)
J_rnn = tanh_deriv @ W_hh
print(f"Spectral norm of J_rnn: {np.linalg.svd(J_rnn, compute_uv=False)[0]:.3f}")
# After 20 time steps:
J_product = np.eye(4)
for _ in range(20):
J_product = J_rnn @ J_product
print(f"Norm after 20 steps: {np.linalg.norm(J_product):.6f}") # → ≈ 0 (vanishing!)
# PyTorch autograd computes Jacobians automatically via .backward()
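The analytic Jacobian above can be cross-checked by finite differences (a minimal sketch; `num_jacobian` is an illustrative helper):

```python
import numpy as np

def num_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m, returned with shape (m, n)."""
    cols = []
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

f = lambda v: np.array([v[0]**2 + v[1], v[0] * v[1]])   # same f as above
x = np.array([3.0, 2.0])
J_analytic = np.array([[2 * x[0], 1.0], [x[1], x[0]]])  # [[6, 1], [2, 3]]

print(np.allclose(num_jacobian(f, x), J_analytic, atol=1e-6))  # True
```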
Hessian
Definition
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the symmetric $n \times n$ matrix of second derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
The Hessian describes the curvature of the loss surface.
Hessian and Optimization
| Hessian Property | What It Means |
|---|---|
| PD (all $\lambda_i > 0$) | Local minimum → locally convex |
| ND (all $\lambda_i < 0$) | Local maximum |
| Mixed signs | Saddle point — minimum in some directions, maximum in others |
| Condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ | Large $\kappa$ → ill-conditioned → gradient descent slow (narrow valley) |
ML Connections
| Where Hessian Appears | How |
|---|---|
| Convexity of loss | MSE: $H = 2X^\top X$ (PSD → convex). Neural nets: non-convex → saddle points |
| Newton's method | $\theta \leftarrow \theta - H^{-1} \nabla L$ — uses curvature for better updates |
| XGBoost | Uses second-order approximation (gradient + Hessian) for better splits |
| Natural gradient | $F^{-1} \nabla L$ — Fisher information ≈ expected Hessian |
| Adam optimizer | Second moment ≈ diagonal Hessian → adaptive learning rate per parameter |
Interview link: Neural Networks Are Not Convex
The Hessian of linear regression is $2X^\top X$ (PSD → convex → unique global minimum → GD is guaranteed to converge). Neural network Hessians have both positive and negative eigenvalues → non-convex → many local minima and saddle points. Empirically, though, the local minima SGD finds usually generalize well (the flat-minima hypothesis).
import numpy as np
# Hessian of OLS loss: H = 2 X^T X (constant — doesn't depend on beta)
X = np.random.randn(100, 3)
H_ols = 2 * X.T @ X # 3×3 matrix
# Check convexity: all eigenvalues ≥ 0?
eigs = np.linalg.eigvalsh(H_ols)
print(f"OLS Hessian eigenvalues: {eigs.round(2)}") # all > 0 → convex ✓
# Condition number: κ = λ_max / λ_min → large = ill-conditioned
kappa = eigs[-1] / eigs[0]
print(f"Condition number: {kappa:.1f}")
# Large κ → narrow valley → GD oscillates → slow convergence
# Solution: use Adam (adaptive lr per parameter) or preconditioning
# Numerical Hessian for arbitrary function
from scipy.optimize import approx_fprime
def loss(beta):
return np.sum((X @ beta - np.ones(100))**2)
def numerical_hessian(f, x, eps=1e-5):
n = len(x)
H = np.zeros((n, n))
for i in range(n):
def grad_i(x):
return approx_fprime(x, f, eps)[i]
H[i] = approx_fprime(x, grad_i, eps)
return H
H_num = numerical_hessian(loss, np.zeros(3))
np.allclose(H_num, H_ols, atol=0.1) # True ✓
Quadratic Forms
Definition
For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, the quadratic form is the scalar
$$Q(x) = x^\top A x = \sum_{i,j} A_{ij} x_i x_j$$
ML Connections
| Quadratic Form | Where |
|---|---|
| $(x - \mu)^\top \Sigma^{-1} (x - \mu)$ | Mahalanobis distance — distance accounting for covariance |
| $\lVert y - X\beta \rVert^2$ | MSE loss — quadratic in parameters |
| $\beta^\top \beta = \lVert \beta \rVert^2$ | L2 regularization — simplest quadratic form |
| $\tfrac{1}{2} \Delta x^\top H \Delta x$ | Second-order Taylor expansion of loss → curvature |
import numpy as np
# Quadratic form: x^T A x
A = np.array([[4, 1], [1, 3]])
x = np.array([1, 2])
quadratic = x @ A @ x # 1*4*1 + 1*1*2 + 2*1*1 + 2*3*2 = 4+2+2+12 = 20
# Mahalanobis distance: x^T Σ^{-1} x (accounts for covariance)
from scipy.spatial.distance import mahalanobis
mean = np.array([0, 0])
cov = np.array([[2, 1], [1, 3]])
x = np.array([1, 2])
d = mahalanobis(x, mean, np.linalg.inv(cov)) # ≠ Euclidean distance
# L2 regularization is simplest quadratic form: β^T β = ||β||^2
beta = np.array([0.5, -0.3, 0.8])
l2_penalty = beta @ beta # = 0.25 + 0.09 + 0.64 = 0.98
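Definiteness can also be read off quadratic forms (a minimal sketch): a symmetric $A$ is positive definite iff $x^\top A x > 0$ for every nonzero $x$, equivalently iff all eigenvalues are positive.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric matrix from the example above
rng = np.random.default_rng(4)

# Empirical check: x^T A x > 0 for many random nonzero x
xs = rng.standard_normal((1000, 2))
quad = np.einsum('ni,ij,nj->n', xs, A, xs)  # batch of quadratic forms
print((quad > 0).all())                     # True

# Equivalent eigenvalue check
print((np.linalg.eigvalsh(A) > 0).all())    # True
```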
Trace
Properties
| Property | Formula | Use |
|---|---|---|
| Cyclic | $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ | Simplify matrix derivatives |
| Frobenius norm | $\lVert A \rVert_F^2 = \mathrm{tr}(A^\top A)$ | Matrix distance, weight decay |
| Sum of eigenvalues | $\mathrm{tr}(A) = \sum_i \lambda_i$ | Total variance in PCA = tr(C) |
| Transpose | $\mathrm{tr}(A) = \mathrm{tr}(A^\top)$ | Symmetry in derivatives |
import numpy as np
A = np.array([[1, 2], [3, 4]])
# Trace = sum of diagonal elements = sum of eigenvalues
np.trace(A) # 1 + 4 = 5
np.linalg.eigvals(A).sum() # also 5 (A is not symmetric — use eigvals; eigvalsh is for symmetric matrices)
# Frobenius norm via trace
fro_norm = np.sqrt(np.trace(A.T @ A)) # same as np.linalg.norm(A, 'fro')
# Cyclic property: tr(ABC) = tr(BCA) = tr(CAB)
B = np.random.randn(2, 3)
C = np.random.randn(3, 2)
np.trace(A @ B @ C) # = np.trace(B @ C @ A) = np.trace(C @ A @ B)
# Total variance in PCA = tr(covariance matrix)
X = np.random.randn(100, 5)
cov = np.cov(X.T)
total_variance = np.trace(cov) # = sum of all eigenvalues = sum of all PC variances
Hadamard (Element-wise) Product
Unlike matrix multiplication, the Hadamard product is element-wise: $(A \odot B)_{ij} = A_{ij} B_{ij}$.
ML Connections
| Where Appears | Formula |
|---|---|
| LSTM gates | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ — gates control information flow element-wise |
| Dropout | $h' = h \odot m$, $m_i \sim \mathrm{Bernoulli}(1-p)$ — element-wise mask |
| Attention masking | Mask scores element-wise — set future positions to $-\infty$ before softmax |
| Backprop with ReLU | $\delta = \delta \odot \mathbb{1}[z > 0]$ — gradient flows only where the activation was positive |
import numpy as np
# Hadamard (element-wise) product: A ⊙ B
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
hadamard = A * B # [[5, 12], [21, 32]] — element-wise!
# vs matrix multiply:
matmul = A @ B # [[19, 22], [43, 50]] — very different!
# LSTM gate: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c_tilde
hidden_size = 4
f_t = np.array([0.9, 0.1, 0.8, 0.5]) # forget gate (sigmoid output)
c_prev = np.array([1.0, 2.0, -0.5, 3.0]) # previous cell state
i_t = np.array([0.2, 0.9, 0.1, 0.7]) # input gate
c_tilde = np.array([0.5, -1.0, 2.0, 0.3]) # candidate
c_new = f_t * c_prev + i_t * c_tilde # element-wise! (Hadamard)
# f_t=0.9 → keep 90% of c_prev[0]; f_t=0.1 → forget 90% of c_prev[1]
# Dropout mask: element-wise multiply
mask = np.random.binomial(1, 0.5, size=4) # e.g. [1, 0, 1, 0] (random)
h_dropped = c_new * mask / 0.5 # inverted dropout (scale by 1/(1-p))
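The attention-masking row from the table can be sketched in the same element-wise spirit (illustrative shapes, not tied to any specific library): set future positions to $-\infty$ before softmax so they receive zero weight.

```python
import numpy as np

T = 4
scores = np.random.randn(T, T)                      # raw attention scores
mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # True above diagonal = future positions
scores_masked = np.where(mask, -np.inf, scores)     # future positions -> -inf

# Row-wise softmax: exp(-inf) = 0, so future positions get zero weight
weights = np.exp(scores_masked - scores_masked.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(np.allclose(weights[mask], 0))                # True: no attention to the future
print(np.allclose(weights.sum(axis=1), 1))          # True: each row is a distribution
```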
Interview Signals
What interviewers listen for:
- You can explain the eigendecomposition / SVD math behind PCA
- You know how SVD is applied in recommender systems and LoRA
- You can derive the OLS gradient by hand and arrive at the normal equation
- You understand the link between Jacobians and vanishing gradients (RNNs)
- You know the connection between the Hessian and convexity / saddle points
Practice
Flashcards
What role does eigendecomposition play in PCA?
Eigendecomposition of the covariance matrix C: C = VΛVᵀ. Eigenvectors V = principal component directions; eigenvalues Λ = variance in each direction. Dimensionality reduction = keep only the top-k eigenvectors (largest eigenvalues). Explained variance ratio = λᵢ/Σλⱼ.
Quiz
The principal components in PCA are eigenvectors of which matrix?