Matrices & Linear Equations
Course Reference
Based on Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford). Covers Chapters 5-11.
Why This Matters for ML
This chapter connects abstract linear algebra to the core operations of ML: the OLS normal equation, the NN forward pass, RNN state transitions, the Ridge regularization trick, PCA's orthogonal projection. Mastering these lets you understand why ML works, not just how to apply it.
Linear Independence (Ch 5)
Linear Dependence
A set of vectors $v_1, \dots, v_k$ is linearly dependent if there exist scalars $\alpha_1, \dots, \alpha_k$, not all zero, such that:
$$\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k = 0$$
Intuition: if some vector can be written as a linear combination of the others, it is "redundant": it carries no new information.
Linearly independent = the only way to make the equation above hold is $\alpha_1 = \cdots = \alpha_k = 0$. Every vector contributes a new "direction".
Basis
A basis for a subspace is a minimal spanning set: a set of vectors that is linearly independent and spans the entire subspace.
- A basis of $\mathbb{R}^n$ has exactly $n$ vectors
- Standard basis: $e_1, \dots, e_n$, where $e_i$ has a 1 in position $i$ and 0 elsewhere
- A basis is not unique: the same subspace has infinitely many bases
Orthonormal Vectors
Vectors $a_1, \dots, a_k$ are orthonormal if:
$$a_i^T a_j = 0 \ (i \neq j), \qquad \|a_i\| = 1$$
Two properties: (1) mutually orthogonal (perpendicular, dot product = 0), (2) each has unit length (norm = 1).
Why this matters: an orthonormal basis makes computation extremely simple: projection becomes a dot product, and the inverse becomes the transpose.
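A quick numerical check of both claims (the matrix Q here is just an illustrative orthonormal basis of R², built from a 45° rotation):

```python
import numpy as np

# Orthonormal basis of R^2: the standard basis rotated by 45 degrees
Q = np.array([[1, -1],
              [1,  1]]) / np.sqrt(2)

# Inverse is just the transpose
print(np.allclose(np.linalg.inv(Q), Q.T))  # True

# Coordinates of x in this basis = one dot product per basis vector
x = np.array([3.0, 1.0])
coords = Q.T @ x            # projection coefficients
x_reconstructed = Q @ coords  # rebuild x from its coordinates
print(np.allclose(x, x_reconstructed))  # True
```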
Gram-Schmidt Process
Given linearly independent vectors $a_1, \dots, a_k$, Gram-Schmidt converts them into orthonormal vectors $q_1, \dots, q_k$:
- Normalize the first vector: $q_1 = a_1 / \|a_1\|$
- For each subsequent vector: subtract its projections onto the previous $q_i$'s, then normalize
Intuition: each step removes from the new vector the components along directions already covered, keeps only the new direction, then normalizes it to unit length.
ML Connections
| Concept | ML/DL Application |
|---|---|
| Linear dependence | Multicollinearity: dependent features → $X^TX$ singular → OLS fails |
| Basis | PCA finds orthonormal basis of max-variance directions; word embeddings span a "meaning space" |
| Orthonormal vectors | PCA components are orthonormal → uncorrelated → each captures unique variance |
| Gram-Schmidt | QR decomposition (used for numerically stable least squares) is based on Gram-Schmidt |
| Independence check | Feature engineering: if rank(X) < p → redundant features → remove or regularize |
import numpy as np
# Check linear independence via matrix rank
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([5, 7, 9]) # v3 = v1 + v2 → dependent!
A = np.column_stack([v1, v2, v3])
print(f"Rank: {np.linalg.matrix_rank(A)}") # 2, not 3 → linearly dependent
# Independent set
w1 = np.array([1, 0, 0])
w2 = np.array([0, 1, 0])
w3 = np.array([0, 0, 1])
B = np.column_stack([w1, w2, w3])
print(f"Rank: {np.linalg.matrix_rank(B)}") # 3 → independent
# Gram-Schmidt orthogonalization
def gram_schmidt(V):
    """V: matrix with vectors as columns. Returns Q (orthonormal columns)."""
    n, k = V.shape
    Q = np.zeros((n, k))
    for j in range(k):
        q = V[:, j].copy().astype(float)
        for i in range(j):
            q -= np.dot(Q[:, i], V[:, j]) * Q[:, i]  # subtract projection
        Q[:, j] = q / np.linalg.norm(q)  # normalize
    return Q
V = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q = gram_schmidt(V)
print("Q:\n", Q)
# Verify orthonormality: Q^T Q should be identity
print("Q^T Q:\n", np.round(Q.T @ Q, 10)) # [[1, 0], [0, 1]]
Matrices (Ch 6)
Matrix Basics
An $m \times n$ matrix is a 2D array with $m$ rows and $n$ columns. In ML, matrices are everywhere.
Special matrices:
- Zero matrix $0$: all entries 0
- Identity matrix $I$: diagonal = 1, rest = 0
- Transpose $A^T$: rows ↔ columns, $(A^T)_{ij} = A_{ji}$
- Symmetric: $A = A^T$ (covariance matrices are symmetric)
Matrix Operations
Addition: $C = A + B$ where $C_{ij} = A_{ij} + B_{ij}$ (same dimensions required)
Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$: treat the matrix as a long vector and take its L2 norm
Matrix-Vector Multiplication
$Ax$ = linear combination of the columns of $A$, weighted by the entries of $x$:
$$Ax = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n$$
Intuition: a matrix-vector multiply asks "using the columns of A as building blocks, with the entries of x as weights, what vector can we build?" This is why the output $Ax$ always lies in the column space of $A$.
ML Connections
| Concept | ML/DL Application |
|---|---|
| Data matrix | $X \in \mathbb{R}^{n \times p}$: rows = samples, columns = features |
| Weight matrix | NN layer: $z = Wx + b$ |
| Covariance matrix | $\Sigma = \frac{1}{n-1} X^T X$ (centered): symmetric, PSD |
| Attention scores | $QK^T$ → matrix of pairwise similarities |
| User-item matrix | Recommender systems: sparse matrix, factorized as $UV^T$ |
| Transpose | Backprop: gradient flows through $W^T$ (transposed weight matrix) |
import numpy as np
# Matrix basics
A = np.array([[1, 2], [3, 4], [5, 6]]) # 3×2
print(f"Shape: {A.shape}") # (3, 2)
print(f"Transpose: {A.T.shape}") # (2, 3)
# Identity matrix
I = np.eye(3)
v = np.array([1, 2, 3])
print(f"Iv = v? {np.allclose(I @ v, v)}")
# Matrix-vector multiplication as linear combination of columns
W = np.array([[1, 0, -1],
[0, 1, 2]]) # 2×3 weight matrix
x = np.array([3, 1, 2])
# Wx = 3*col1 + 1*col2 + 2*col3
result = W @ x # [3*1 + 1*0 + 2*(-1), 3*0 + 1*1 + 2*2] = [1, 5]
print(f"Wx = {result}")
# Frobenius norm
print(f"||W||_F = {np.linalg.norm(W, 'fro'):.4f}")
# Symmetric matrix: covariance
X = np.random.randn(100, 3)
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (len(X) - 1)
print(f"Covariance is symmetric: {np.allclose(cov, cov.T)}") # True
Matrix Examples (Ch 7)
Geometric Transformations
Matrices represent geometric transformations: every matrix is an "operation on space".
| Transformation | Matrix (2D) | Effect |
|---|---|---|
| Rotation by $\theta$ | $\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$ | Rotates the vector, preserves length |
| Scaling by $s_1, s_2$ | $\begin{bmatrix} s_1 & 0 \\ 0 & s_2 \end{bmatrix}$ | Stretches/compresses along the axes |
| Reflection (x-axis) | $\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$ | Flips the y coordinate |
| Projection onto $u$ | $uu^T$ (for unit $u$) | Projects the vector onto the direction of $u$ |
Convolution as Matrix Multiplication
A 1D convolution can be written as multiplication by a Toeplitz matrix: the kernel entries repeat along the diagonals. This is why a CNN is still fundamentally matrix multiplication, just with a weight matrix of special structure (sparse + shared weights).
Selector Matrices
A selector matrix picks specific rows from a vector or matrix: each of its rows is a standard basis vector $e_i^T$.
Selector matrices can select, reorder, and even duplicate rows.
Applications in ML:
- Embedding lookup: nn.Embedding is essentially a selector matrix times the embedding table. embedding[token_id] = one-hot(token_id)ᵀ × W
- Sampling / masking: a random selector picks a subset of data points (mini-batch SGD)
- Permutation matrix: a special selector that reorders without duplication (orthogonal, det = ±1)
import numpy as np
# Selector matrix: pick rows 2, 0, 2 from x
x = np.array([10, 20, 30, 40])
A = np.array([[0, 0, 1, 0], # select x[2]
[1, 0, 0, 0], # select x[0]
[0, 0, 1, 0]]) # select x[2] again
print(A @ x) # [30, 10, 30]
# Embedding lookup IS selector matrix multiplication
vocab_size, embed_dim = 5, 3
W = np.random.randn(vocab_size, embed_dim) # embedding table
token_id = 2
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1
embedding = one_hot @ W # = W[token_id] — same as selector!
print(np.allclose(embedding, W[token_id])) # True
# Permutation matrix (special selector — reorder, no duplicates)
P = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]]) # cycle: 0→1→2→0
x = np.array([10, 20, 30])
print(P @ x) # [30, 10, 20]
print(f"Orthogonal? {np.allclose(P.T @ P, np.eye(3))}") # True
Incidence Matrix
An incidence matrix represents a graph as a matrix: rows = edges, columns = nodes.
Each row has one $-1$ and one $+1$ (representing an edge connecting two nodes).
Applications in ML:
- Graph Neural Networks (GNN): the adjacency matrix and incidence matrix describe the graph structure → message passing
- Network flow: supply chain optimization = flow on a graph
- Laplacian matrix: $L = A^T A$ (for incidence matrix $A$) = graph Laplacian → spectral clustering (eigenvalues of L → cluster structure)
import numpy as np
# Graph: 0 → 1, 0 → 2, 1 → 2 (3 nodes, 3 edges)
# Incidence matrix: edges × nodes
A_inc = np.array([[-1, 1, 0], # edge 0→1
[-1, 0, 1], # edge 0→2
[ 0, -1, 1]]) # edge 1→2
# Graph Laplacian: L = A^T A
L = A_inc.T @ A_inc
print("Laplacian:\n", L)
# [[ 2, -1, -1],
# [-1, 2, -1],
# [-1, -1, 2]]
# Spectral clustering: eigenvalues of Laplacian
eigs = np.linalg.eigvalsh(L)
print(f"Laplacian eigenvalues: {eigs.round(2)}")
# Smallest eigenvalue always 0 (connected graph has exactly one 0)
# Number of zero eigenvalues = number of connected components
# Second smallest (Fiedler value) → used for spectral clustering
Interview link: Spectral Clustering
The eigenvectors of the graph Laplacian $L = D - A$ (D = degree matrix, A = adjacency matrix) encode cluster structure. The eigenvector of the second smallest eigenvalue (the Fiedler vector) → the optimal cut for bipartitioning the graph. This is the mathematical foundation of spectral clustering.
ML Connections
| Concept | ML/DL Application |
|---|---|
| NN layers | Each layer = geometric transformation of feature space |
| CNN convolution | Toeplitz-structured matrix multiplication (sparse + weight sharing) |
| Selector / Embedding | nn.Embedding = selector matrix × embedding table |
| Incidence / Laplacian | Graph structure → GNN message passing, spectral clustering |
| PCA | Rotation to align with max-variance directions |
| OLS | $\hat{y} = X\hat{\beta}$ = projection of y onto the column space of X |
| Data augmentation | Rotation, scaling, reflection matrices applied to images |
import numpy as np
# 2D Rotation matrix
theta = np.pi / 4 # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
[np.sin(theta), np.cos(theta)]])
v = np.array([1, 0])
v_rotated = R @ v
print(f"Rotated: {v_rotated}") # [0.707, 0.707]
print(f"Length preserved: {np.linalg.norm(v):.3f} → {np.linalg.norm(v_rotated):.3f}")
# Scaling matrix
S = np.array([[2, 0],
[0, 0.5]])
print(f"Scaled: {S @ v}") # [2, 0] — stretched along x, shrunk along y
# Projection matrix onto direction u
u = np.array([1, 1]) / np.sqrt(2) # unit vector at 45°
P = np.outer(u, u) # uu^T (since u is unit, no denominator needed)
w = np.array([3, 1])
w_proj = P @ w
print(f"Projection of {w} onto u: {w_proj}") # [2, 2]
# Projection is idempotent: P^2 = P
print(f"P² = P? {np.allclose(P @ P, P)}") # True
# Convolution as matrix multiplication (Toeplitz)
signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])
# Build Toeplitz matrix for "valid" convolution
n, k = len(signal), len(kernel)
T = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    T[i, i:i+k] = kernel[::-1]  # convolution flips the kernel
conv_result = T @ signal
print(f"Conv via matmul: {conv_result}")  # [2, 2, 2]
print(f"Conv via np: {np.convolve(signal, kernel, 'valid')}")  # same
Linear Equations (Ch 8)
The System $Ax = b$
The fundamental question: given $A$ and $b$, find $x$ such that $Ax = b$.
Whether a solution exists depends on whether $b$ lies in the column space of $A$ (i.e., whether $b$ can be written as a linear combination of the columns of $A$).
Three Cases
| Case | Condition | Solutions | ML Example |
|---|---|---|---|
| Exactly determined | $n = p$, full rank | Unique: $x = A^{-1}b$ | Square system, rare in ML |
| Over-determined | $n > p$ (more equations) | No exact solution → least squares | OLS: n samples > p features |
| Under-determined | $n < p$ (more unknowns) | Infinite solutions → need regularization | p > n: high-dimensional, use Ridge/Lasso |
Normal Equation (OLS)
For over-determined $Ax \approx b$, the least squares solution minimizes $\|Ax - b\|^2$:
$$\hat{x} = (A^T A)^{-1} A^T b$$
Intuition: $A\hat{x}$ is the orthogonal projection of $b$ onto the column space of $A$. The residual $r = b - A\hat{x}$ is perpendicular to every column of $A$ ($A^T r = 0$).
Ridge Regression: Fixing Singularity
When $X^T X$ is singular (multicollinearity or $p > n$), add $\lambda I$:
$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$$
$\lambda I$ raises every eigenvalue to at least $\lambda$ → always invertible → always a unique solution.
ML Connections
| Concept | ML/DL Application |
|---|---|
| Over-determined | OLS regression — more data points than parameters |
| Under-determined | p > n — infinite solutions → need regularization (Ridge, Lasso) |
| Normal equation | OLS closed-form: $\hat{\beta} = (X^T X)^{-1} X^T y$ |
| Ridge | $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$: ensures invertibility |
| Lasso | No closed form — but conceptually: L1 constraint selects sparse solution from infinite set |
import numpy as np
# Exactly determined: Ax = b with square A
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])
x = np.linalg.solve(A, b)
print(f"Solution: {x}")
print(f"Verify: Ax = {A @ x}")
# Over-determined: least squares (n > p)
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + 0.5 * np.random.randn(n)
# Normal equation
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"True beta: {beta_true}")
print(f"Estimated beta: {np.round(beta_hat, 3)}")
# Equivalently using lstsq (numerically more stable)
beta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(f"lstsq beta: {np.round(beta_lstsq, 3)}")
# Under-determined: p > n → infinite solutions → Ridge
n_small, p_large = 10, 50
X_wide = np.random.randn(n_small, p_large)
y_small = np.random.randn(n_small)
# OLS fails: X^T X is 50×50 but rank ≤ 10
print(f"Rank of X^T X: {np.linalg.matrix_rank(X_wide.T @ X_wide)}") # 10
# Ridge regression: always works
lam = 1.0
beta_ridge = np.linalg.inv(X_wide.T @ X_wide + lam * np.eye(p_large)) @ X_wide.T @ y_small
print(f"Ridge solution norm: {np.linalg.norm(beta_ridge):.4f}")
Interview point: Normal Equation vs Gradient Descent
"When do you use the normal equation, and when gradient descent?" → Normal equation: $O(p^3)$ complexity (inverting $X^TX$), fast when p is small. Gradient descent: $O(np)$ per iteration, better when p is large. A common rule of thumb is to switch to GD around $p > 10^4$. Also, the normal equation needs $X^TX$ to be invertible; GD does not.
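To make the trade-off concrete, here is a minimal sketch (the problem sizes and learning rate are illustrative choices, not from the text) showing plain gradient descent on the OLS loss converging to the normal-equation solution:

```python
import numpy as np

np.random.seed(0)
n, p = 200, 5
X = np.random.randn(n, p)
y = X @ np.arange(1.0, p + 1) + 0.1 * np.random.randn(n)

# Closed form: one O(p^3) solve
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: O(np) per iteration
beta_gd = np.zeros(p)
lr = 0.01
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ beta_gd - y)  # gradient of mean squared error
    beta_gd -= lr * grad

print(np.allclose(beta_ne, beta_gd, atol=1e-4))  # True — same minimizer
```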
Linear Dynamical Systems (Ch 9)
State-Space Model
A linear dynamical system describes how a state evolves over time:
$$x_{t+1} = A x_t + B u_t$$
where $x_t$ is the state vector, $A$ is the dynamics matrix, $u_t$ is the input, and $B$ maps input to state.
Intuition: $A$ determines how the system evolves on its own; $B u_t$ is the influence of external input.
Stability via Eigenvalues
System stability depends on eigenvalues of :
| Condition | Behavior | ML Analogy |
|---|---|---|
| All $\lvert\lambda_i\rvert < 1$ | State converges to 0 → stable | Vanishing gradients in RNN |
| Some $\lvert\lambda_i\rvert > 1$ | State explodes → unstable | Exploding gradients in RNN |
| All $\lvert\lambda_i\rvert = 1$ | State oscillates → marginally stable | Ideal gradient flow |
ML Connections
| Concept | ML/DL Application |
|---|---|
| RNN hidden state | $h_t = \tanh(W_h h_{t-1} + W_x x_t)$: a linear dynamical system + nonlinearity |
| Vanishing/exploding gradients | Eigenvalues of $W_h$: if all $\lvert\lambda\rvert < 1$ → vanishing; if any $\lvert\lambda\rvert > 1$ → exploding |
| Markov chains | $\pi_{t+1} = P^T \pi_t$: the stationary distribution is the eigenvector for $\lambda = 1$ |
| Kalman filter | State estimation: predict with dynamics model, update with observation |
| LSTM/GRU | Gates control eigenvalue spectrum → prevent vanishing/exploding |
import numpy as np
# Simulate a linear dynamical system: x_{t+1} = A x_t
# Stable system (eigenvalues < 1)
A_stable = np.array([[0.8, 0.1],
[-0.1, 0.7]])
eigenvalues = np.linalg.eigvals(A_stable)
print(f"Eigenvalues: {eigenvalues}")
print(f"Magnitudes: {np.abs(eigenvalues)}") # Both < 1 → stable
x = np.array([5.0, 3.0])
trajectory = [x.copy()]
for t in range(50):
    x = A_stable @ x
    trajectory.append(x.copy())
trajectory = np.array(trajectory)
print(f"After 50 steps: {trajectory[-1]}") # → [0, 0]
# Unstable system (eigenvalue > 1)
A_unstable = np.array([[1.1, 0.0],
[0.0, 0.9]])
eigs = np.linalg.eigvals(A_unstable)
print(f"\nUnstable eigenvalues: {eigs}") # 1.1 > 1 → explodes
x = np.array([1.0, 1.0])
for t in range(20):
    x = A_unstable @ x
print(f"After 20 steps: {x}") # First component explodes
# RNN analogy: repeated multiplication by W_h
# If max |eigenvalue(W_h)| > 1, gradients explode
W_h = np.random.randn(3, 3) * 0.5 # small weights → stable
eigs_rnn = np.linalg.eigvals(W_h)
print(f"\nRNN W_h eigenvalue magnitudes: {np.abs(eigs_rnn)}")
print(f"Gradient will {'vanish' if np.max(np.abs(eigs_rnn)) < 1 else 'explode'}")
# Markov chain: stationary distribution
P = np.array([[0.7, 0.3],
[0.4, 0.6]])
eigs_mc, vecs_mc = np.linalg.eig(P.T)
# Stationary distribution = eigenvector for eigenvalue 1
idx = np.argmin(np.abs(eigs_mc - 1.0))
pi = np.abs(vecs_mc[:, idx])
pi = pi / pi.sum()
print(f"\nStationary distribution: {pi}") # [4/7, 3/7] ≈ [0.571, 0.429]
print(f"Verify: pi @ P = {pi @ P}") # Should equal pi
Application Examples
Population Dynamics: predator-prey relationships between species modeled as a linear system. Each component of the population vector = the count of one species. The matrix encodes growth rates and inter-species interactions.
Epidemic Dynamics (SIR Model): state vector $(S, I, R)$ (Susceptible, Infected, Recovered). The transition matrix describes infection and recovery rates. A linear approximation is reasonable in the early stage; large outbreaks need a nonlinear model.
Motion of a Mass: position + velocity = state vector $(p_t, v_t)$. $A$ encodes the physics (Newton's second law, discretized). Corresponds directly to the state space of continuous control in RL (robotics).
Supply Chain Dynamics: each node = warehouse/store, state = inventory level. $A$ describes shipping between nodes, $u_t$ = production/demand. Optimal control → minimize inventory cost → linear quadratic regulator (LQR).
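A minimal predator-prey sketch in the same style (the coefficients are invented for illustration, not fitted to any data):

```python
import numpy as np

# Toy linearized predator-prey model: state = [prey, predator]
A_pp = np.array([[1.04, -0.04],   # prey grows, reduced by predators
                 [0.02,  0.96]])  # predators decay, boosted by prey
x = np.array([100.0, 20.0])
for t in range(50):
    x = A_pp @ x
print(f"After 50 steps: prey={x[0]:.1f}, predators={x[1]:.1f}")

# Eigenvalue magnitudes determine the long-run behavior
eigs = np.linalg.eigvals(A_pp)
print(f"Eigenvalue magnitudes: {np.abs(eigs).round(3)}")
```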
import numpy as np
# SIR Epidemic Model (linearized)
# State: [S, I, R] (Susceptible, Infected, Recovered)
beta = 0.3 # infection rate
gamma = 0.1 # recovery rate
# Transition matrix (simplified, linearized around disease-free state)
A_sir = np.array([[1-beta, 0, 0],
[beta, 1-gamma, 0],
[0, gamma, 1]])
x = np.array([0.99, 0.01, 0.0]) # 1% initially infected
for t in range(30):
    x = A_sir @ x
    x = np.clip(x, 0, 1)  # ensure non-negative
    x = x / x.sum()  # normalize (conservation)
print(f"After 30 days: S={x[0]:.3f}, I={x[1]:.3f}, R={x[2]:.3f}")
# Motion of a mass: state = [position, velocity]
dt = 0.1
A_motion = np.array([[1, dt], # x_{t+1} = x_t + v_t * dt
[0, 0.99]]) # v_{t+1} = 0.99 * v_t (friction)
x_motion = np.array([0.0, 1.0]) # start at origin, velocity = 1
for t in range(100):
    x_motion = A_motion @ x_motion
print(f"Final position: {x_motion[0]:.2f}, velocity: {x_motion[1]:.4f}")
# Position accumulates, velocity decays (stable because eigenvalues < 1)
Matrix Multiplication (Ch 10)
Matrix-Matrix Multiplication
For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$:
$$C = AB, \qquad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Inner dimensions must match: $(m \times n)(n \times p) = (m \times p)$
Intuition: $C_{ij}$ = dot product of row $i$ of $A$ and column $j$ of $B$.
Key Properties
- Not commutative: $AB \neq BA$ in general
- Associative: $(AB)C = A(BC)$
- Composition: $ABx$ corresponds to applying transformation $B$ first, then $A$
- Transpose: $(AB)^T = B^T A^T$
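Two of these properties are easy to check numerically:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(np.allclose(A @ B, B @ A))          # False — not commutative
print(np.allclose((A @ B).T, B.T @ A.T))  # True — transpose reverses the order
```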
Matrix Power
$A^k$ = applying the transformation $A$ repeatedly, $k$ times.
PageRank: the importance vector satisfies $r = Mr$, where $M$ is the link matrix. Power iteration repeatedly does the matrix-vector multiply $r \leftarrow Mr$ until convergence.
QR Factorization
Any matrix $A$ with linearly independent columns can be decomposed as:
$$A = QR$$
where $Q$ is orthogonal ($Q^T Q = I$) and $R$ is upper triangular.
Use: numerically stable least squares. Instead of forming $A^T A$ (which squares the condition number), solve via QR directly.
ML Connections
| Concept | ML/DL Application |
|---|---|
| Chain of matmuls | NN forward pass: $\sigma(W_2\,\sigma(W_1 x))$, a chain of matrix multiplies |
| Matrix power | PageRank (power iteration), Markov chain convergence |
| Attention | $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$: matmul of queries with keys, then with values |
| QR decomposition | Numerically stable OLS, Gram-Schmidt in disguise |
| Computational cost | $(m \times n)(n \times p)$ matmul: $O(mnp)$; Transformer attention: $O(n^2 d)$ |
import numpy as np
# Matrix multiplication: shape rules
A = np.random.randn(3, 4) # 3×4
B = np.random.randn(4, 2) # 4×2
C = A @ B # 3×2
print(f"({A.shape}) @ ({B.shape}) = {C.shape}")
# Composition: AB means "apply B then A"
# Rotate then scale = Scale_matrix @ Rotate_matrix
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
[np.sin(theta), np.cos(theta)]])
S = np.array([[2, 0],
[0, 0.5]])
combined = S @ R # first rotate, then scale
v = np.array([1, 0])
print(f"Rotate then scale: {combined @ v}")
print(f"Same as: {S @ (R @ v)}") # associative
# Matrix power: PageRank-style
M = np.array([[0, 0.5, 0],
[1, 0, 0.5],
[0, 0.5, 0.5]])
r = np.array([1/3, 1/3, 1/3]) # initial uniform
for i in range(100):
    r = M @ r
    r = r / r.sum()  # normalize
print(f"\nPageRank converged: {np.round(r, 4)}")
# QR decomposition
X = np.random.randn(50, 3)
Q, R_mat = np.linalg.qr(X)
print(f"\nQ shape: {Q.shape}, R shape: {R_mat.shape}")
print(f"Q is orthogonal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"X = QR? {np.allclose(X, Q @ R_mat)}")
# Solve least squares via QR (more stable than normal equation)
y = np.random.randn(50)
# From Xb = y → QRb = y → Rb = Q^T y
beta_qr = np.linalg.solve(R_mat, Q.T @ y)
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y # normal equation
print(f"QR solution ≈ Normal eq: {np.allclose(beta_qr, beta_ne)}")
Matrix Inverses (Ch 11)
Left and Right Inverses
For non-square matrices:
- Left inverse of $A$ ($m \times n$, tall, $m > n$): $B$ such that $BA = I_n$
- Right inverse of $A$ ($m \times n$, wide, $m < n$): $C$ such that $AC = I_m$
A left inverse exists → the over-determined system has a unique least-squares solution. A right inverse exists → the under-determined system has a (non-unique) solution.
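The right-inverse case can be checked directly; a small sketch using the standard full-row-rank construction $A^T(AA^T)^{-1}$:

```python
import numpy as np

np.random.seed(0)
# Wide matrix (under-determined): right inverse exists when rows are independent
A = np.random.randn(3, 7)               # m < n, full row rank with prob. 1
A_right = A.T @ np.linalg.inv(A @ A.T)  # right inverse
print(np.allclose(A @ A_right, np.eye(3)))  # True

# It produces *a* solution of Ax = b (the minimum-norm one)
b = np.random.randn(3)
x = A_right @ b
print(np.allclose(A @ x, b))  # True
```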
Square Matrix Inverse
$A^{-1}$ exists if and only if $A$ is square ($n \times n$) and:
- Full rank: $\text{rank}(A) = n$
- $\det(A) \neq 0$
- The columns of $A$ are linearly independent
These three conditions are all equivalent for square matrices.
Pseudo-Inverse (Moore-Penrose)
For any matrix $A$ (even non-square or singular), the pseudo-inverse $A^\dagger$ always exists:
- If $A$ has full column rank: $A^\dagger = (A^T A)^{-1} A^T$ (left inverse)
- If $A$ has full row rank: $A^\dagger = A^T (A A^T)^{-1}$ (right inverse)
- General: computed via SVD: $A^\dagger = V \Sigma^\dagger U^T$
np.linalg.lstsq and np.linalg.pinv are built on the pseudo-inverse (via SVD) under the hood.
ML Connections
| Concept | ML/DL Application |
|---|---|
| Normal equation | OLS $(X^TX)^{-1}X^Ty$: requires invertible $X^TX$ |
| Ridge | $(X^TX + \lambda I)^{-1}$: guarantees invertibility |
| Pseudo-inverse | np.linalg.lstsq: works even when $X^TX$ is singular |
| Mahalanobis distance | $\sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$: needs $\Sigma^{-1}$ |
| Inverse in backprop | Computing Hessian inverse for second-order methods (Newton, natural gradient) |
import numpy as np
# Matrix inverse
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(f"A @ A_inv = I? {np.allclose(A @ A_inv, np.eye(2))}")
# Solving Ax = b via inverse
b = np.array([1, 2])
x = A_inv @ b
print(f"Solution: {x}")
print(f"Verify: {A @ x}") # Should be [1, 2]
# Pseudo-inverse for non-square matrix
X = np.random.randn(20, 3) # tall matrix
X_pinv = np.linalg.pinv(X)
print(f"\nX shape: {X.shape}, X+ shape: {X_pinv.shape}")
print(f"X+ @ X ≈ I? {np.allclose(X_pinv @ X, np.eye(3))}") # Left inverse
# lstsq uses pseudo-inverse internally
y = np.random.randn(20)
beta_pinv = X_pinv @ y
beta_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"pinv ≈ lstsq: {np.allclose(beta_pinv, beta_lstsq)}")
# Ridge vs OLS: Ridge always invertible
np.random.seed(0)
n, p = 10, 50 # p > n → OLS fails
X_wide = np.random.randn(n, p)
y_wide = np.random.randn(n)
# OLS: X^T X is singular
rank_XtX = np.linalg.matrix_rank(X_wide.T @ X_wide)
print(f"\nX^T X rank: {rank_XtX} (need {p} for invertibility)")
# Ridge: always works
lam = 1.0
beta_ridge = np.linalg.inv(X_wide.T @ X_wide + lam * np.eye(p)) @ X_wide.T @ y_wide
print(f"Ridge rank (X^TX + λI): {np.linalg.matrix_rank(X_wide.T @ X_wide + lam * np.eye(p))}") # 50
# Pseudo-inverse also works
beta_pinv_wide = np.linalg.pinv(X_wide) @ y_wide
print(f"Pseudo-inverse gives minimum-norm solution: ||beta||={np.linalg.norm(beta_pinv_wide):.4f}")
Rank, Determinant & Positive Definiteness
Matrix Rank
Rank = number of linearly independent columns (= number of independent rows).
- Full rank: $\text{rank}(A) = \min(m, n)$: no redundant columns/rows
- Rank deficient: $\text{rank}(A) < \min(m, n)$: the matrix is singular (if square)
Intuition: the rank tells you how many dimensions the matrix actually "uses". Rank-deficient = some dimensions are redundant.
Determinant
For a square matrix $A$, the determinant $\det(A)$ is a scalar summary of how $A$ transforms space.
Geometric interpretation: $\lvert\det(A)\rvert$ = the factor by which $A$ scales volumes.
- $\det(A) = 0$: squashes space to a lower dimension (information lost)
- $\det(A) > 0$: preserves orientation
- $\det(A) < 0$: flips orientation
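A quick numeric check of the volume interpretation (the shoelace-formula helper here is my own illustration):

```python
import numpy as np

A = np.array([[2, 1],
              [0, 3]])
print(np.linalg.det(A))  # 6.0 — the unit square maps to a parallelogram of area 6

# Check: apply A to the unit square's corners, then compute the image's area
corners = np.array([[0, 0], [1, 0], [1, 1], [0, 1]]).T  # columns = corners
x_, y_ = A @ corners
# Shoelace formula for the area of a polygon given its vertices in order
area = 0.5 * abs(np.dot(x_, np.roll(y_, -1)) - np.dot(y_, np.roll(x_, -1)))
print(area)  # 6.0 — matches |det(A)|
```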
Positive Definiteness
A symmetric matrix $A$ is positive definite (PD) if:
$$x^T A x > 0 \quad \text{for all } x \neq 0$$
Equivalent conditions (all say the same thing):
- All eigenvalues $\lambda_i > 0$
- All leading principal minors $> 0$
- Cholesky decomposition $A = LL^T$ exists
- $A = B^T B$ for some full-rank $B$
Positive semi-definite (PSD): $x^T A x \geq 0$ (eigenvalues $\geq 0$)
ML Connections
| Concept | ML/DL Application |
|---|---|
| Rank < p | Multicollinearity → $X^TX$ singular → OLS fails |
| Low-rank approx | PCA (top-$k$ components), matrix factorization (recommenders), LoRA (LLM fine-tuning) |
| Covariance PSD | $X^TX$ is always PSD → eigenvalues $\geq 0$ → PCA works |
| Kernel matrix PD | Mercer's theorem: valid kernel ⟺ kernel matrix is PSD |
| Hessian PD | Hessian PD everywhere → loss is strictly convex → unique global minimum |
| Determinant | Gaussian log-likelihood involves $\log\det\Sigma$; singular $\Sigma$ → $\log\det\Sigma \to -\infty$ → problem |
| Cholesky | Fast sampling from a multivariate Gaussian: $\Sigma = LL^T$, $x = \mu + Lz$ |
import numpy as np
# Matrix rank
A = np.array([[1, 2, 3],
[4, 5, 6],
[5, 7, 9]]) # row 3 = row 1 + row 2
print(f"Rank: {np.linalg.matrix_rank(A)}") # 2 (rank deficient)
# Full rank matrix
B = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 10]]) # independent rows
print(f"Rank: {np.linalg.matrix_rank(B)}") # 3
# Determinant
print(f"\ndet(A) = {np.linalg.det(A):.6f}") # ≈ 0 (singular)
print(f"det(B) = {np.linalg.det(B):.6f}") # ≠ 0 (invertible)
# Positive definiteness check via eigenvalues
M = np.array([[4, 2],
[2, 3]])
eigvals = np.linalg.eigvalsh(M) # eigvalsh for symmetric matrices
print(f"\nEigenvalues: {eigvals}")
print(f"Positive definite: {np.all(eigvals > 0)}") # True
# Covariance matrix is always PSD
X = np.random.randn(100, 5)
X_c = X - X.mean(axis=0)
cov = (X_c.T @ X_c) / (len(X) - 1)
cov_eigs = np.linalg.eigvalsh(cov)
print(f"\nCovariance eigenvalues: {np.round(cov_eigs, 4)}")
print(f"All >= 0 (PSD): {np.all(cov_eigs >= -1e-10)}") # True
# Cholesky decomposition (only works for PD matrices)
L = np.linalg.cholesky(cov)
print(f"\nCholesky L shape: {L.shape}")
print(f"LL^T = cov? {np.allclose(L @ L.T, cov)}")
# Sampling from multivariate Gaussian using Cholesky
mu = np.zeros(5)
z = np.random.randn(5)
sample = mu + L @ z # Much faster than np.random.multivariate_normal for repeated sampling
# Check: not PD matrix → Cholesky fails
M_not_pd = np.array([[1, 2], [2, 1]]) # eigenvalues: 3, -1
eigs_bad = np.linalg.eigvalsh(M_not_pd)
print(f"\nNot PD eigenvalues: {eigs_bad}") # [-1, 3]
try:
    np.linalg.cholesky(M_not_pd)
except np.linalg.LinAlgError as e:
    print(f"Cholesky failed: {e}")
Interview point: Why Ridge Always Works
"Why can Ridge regression always be solved?" → OLS needs $(X^TX)^{-1}$, but $X^TX$ may be singular. Ridge adds $\lambda I$: $(X^TX + \lambda I)$. If an eigenvalue of $X^TX$ is $\lambda_i$, adding $\lambda I$ turns it into $\lambda_i + \lambda$. Since $X^TX$ is PSD, $\lambda_i \geq 0$, so all eigenvalues are $\geq \lambda > 0$ → positive definite → always invertible.
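The eigenvalue-shift argument can be verified numerically (the sizes here are arbitrary):

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(10, 50)  # p > n → X^T X singular
lam = 1.0
G = X.T @ X
eigs = np.linalg.eigvalsh(G)                      # eigenvalues, ascending
eigs_ridge = np.linalg.eigvalsh(G + lam * np.eye(50))
print(eigs.min())        # ≈ 0 → singular
print(eigs_ridge.min())  # ≈ 1.0 → positive definite, invertible
print(np.allclose(eigs_ridge, eigs + lam))  # True — every eigenvalue shifted by λ
```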
Practice
Flashcards
What is linear independence? Why does it matter in ML?
Vectors are linearly independent when none of them can be written as a linear combination of the others: the only way to make α₁v₁ + ... + αₖvₖ = 0 is all αᵢ = 0. In ML: if features are linearly dependent → multicollinearity → X^TX singular → OLS cannot be solved. Use Ridge or drop features.
Quiz
Three vectors v₁, v₂, v₃ in R³, with rank([v₁ v₂ v₃]) = 2. Which of the following is correct?