Matrices & Linear Equations

Course Reference

Based on Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford). Covers Chapters 5-11.

Why This Matters for ML

This chapter connects linear algebra from abstract concepts to the core operations of ML — the OLS normal equation, the NN forward pass, RNN state transitions, the Ridge regularization trick, PCA's orthogonal projection. Mastering these lets you understand why ML works, not just how to use it.


Linear Independence (Ch 5)

Linear Dependence

A set of vectors \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k is linearly dependent if there exist scalars \alpha_1, \ldots, \alpha_k, not all zero, such that:

\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0}

Intuition: if one vector can be written as a linear combination of the others, it is "redundant" — it adds no new information.

Linearly independent = the only way to satisfy the equation above is with every \alpha_i = 0. Each vector contributes a new "direction".

Basis

A basis for a subspace is a minimal spanning set — a set of vectors that is linearly independent and spans the entire subspace.

  • A basis of \mathbb{R}^n has exactly n vectors
  • Standard basis: \mathbf{e}_1 = [1,0,\ldots,0]^\top, \mathbf{e}_2 = [0,1,\ldots,0]^\top, \ldots
  • A basis is not unique — the same subspace has infinitely many bases

Orthonormal Vectors

Vectors \mathbf{q}_1, \ldots, \mathbf{q}_k are orthonormal if:

\mathbf{q}_i^\top \mathbf{q}_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}

Two properties: (1) mutually orthogonal (perpendicular, dot product = 0), and (2) each has unit length (norm = 1).

Why it matters: an orthonormal basis makes computation extremely simple — projection reduces to a dot product, and the inverse is just the transpose.
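Both simplifications are easy to verify numerically. A minimal sketch, using a made-up 45° orthonormal basis of R²: coordinates in the basis come from dot products, and for a square orthonormal (orthogonal) matrix the inverse equals the transpose.

```python
import numpy as np

# An orthonormal basis of R^2 (axes rotated by 45 degrees)
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([-1.0, 1.0]) / np.sqrt(2)
Q = np.column_stack([q1, q2])

x = np.array([3.0, 1.0])

# Coordinates of x in this basis: just dot products, no linear solve needed
coords = np.array([q1 @ x, q2 @ x])
print(np.allclose(Q @ coords, x))          # True — x reconstructed from coordinates

# Inverse is the transpose
print(np.allclose(np.linalg.inv(Q), Q.T))  # True
```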

Gram-Schmidt Process

Given linearly independent vectors, Gram-Schmidt turns them into orthonormal vectors:

  1. \mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|
  2. For each subsequent vector: subtract projections onto previous \mathbf{q}'s, then normalize
\tilde{\mathbf{q}}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} (\mathbf{q}_j^\top \mathbf{v}_k) \mathbf{q}_j, \quad \mathbf{q}_k = \frac{\tilde{\mathbf{q}}_k}{\|\tilde{\mathbf{q}}_k\|}

Intuition: each step removes from the new vector the components along directions already covered, keeps only the new direction, and normalizes it to unit length.

ML Connections

  • Linear dependence: Multicollinearity — features are dependent → X^\top X singular → OLS fails
  • Basis: PCA finds an orthonormal basis of max-variance directions; word embeddings span a "meaning space"
  • Orthonormal vectors: PCA components are orthonormal → uncorrelated → each captures unique variance
  • Gram-Schmidt: QR decomposition (used for numerically stable least squares) is based on Gram-Schmidt
  • Independence check: Feature engineering — if rank(X) < p → redundant features → remove or regularize

import numpy as np

# Check linear independence via matrix rank
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([5, 7, 9])  # v3 = v1 + v2 → dependent!

A = np.column_stack([v1, v2, v3])
print(f"Rank: {np.linalg.matrix_rank(A)}")  # 2, not 3 → linearly dependent

# Independent set
w1 = np.array([1, 0, 0])
w2 = np.array([0, 1, 0])
w3 = np.array([0, 0, 1])
B = np.column_stack([w1, w2, w3])
print(f"Rank: {np.linalg.matrix_rank(B)}")  # 3 → independent

# Gram-Schmidt orthogonalization
def gram_schmidt(V):
    """V: matrix with vectors as columns. Returns Q (orthonormal columns)."""
    n, k = V.shape
    Q = np.zeros((n, k))
    for j in range(k):
        q = V[:, j].copy().astype(float)
        for i in range(j):
            q -= np.dot(Q[:, i], V[:, j]) * Q[:, i]  # subtract projection
        Q[:, j] = q / np.linalg.norm(q)               # normalize
    return Q

V = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q = gram_schmidt(V)
print("Q:\n", Q)

# Verify orthonormality: Q^T Q should be identity
print("Q^T Q:\n", np.round(Q.T @ Q, 10))  # [[1, 0], [0, 1]]

Matrices (Ch 6)

Matrix Basics

A matrix A \in \mathbb{R}^{m \times n} is a 2D array with m rows and n columns. In ML, matrices are everywhere.

Special matrices:

  • Zero matrix \mathbf{0}_{m \times n}: all entries 0
  • Identity matrix I_n: diagonal = 1, rest = 0; AI = IA = A
  • Transpose A^\top: rows ↔ columns, (A^\top)_{ij} = A_{ji}
  • Symmetric: A = A^\top (a covariance matrix is symmetric)

Matrix Operations

Addition: C = A + B where C_{ij} = A_{ij} + B_{ij} (same dimensions required)

Frobenius norm: treat the matrix as a vector and take its L2 norm

\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{tr}(A^\top A)}

Matrix-Vector Multiplication

A\mathbf{x} = linear combination of columns of A, weighted by entries of \mathbf{x}:

A\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n

Intuition: a matrix-vector multiply asks "using the columns of A as a basis and the entries of x as weights, what vector can we build?" This is why the output Ax always lies in the column space of A.

ML Connections

  • Data matrix X: n \times p — rows = samples, columns = features
  • Weight matrix W: NN layer — \mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})
  • Covariance matrix: \Sigma = \frac{1}{n-1}X^\top X (centered) — symmetric, PSD
  • Attention scores: QK^\top — n \times n matrix of pairwise similarities
  • User-item matrix: Recommender systems — sparse matrix, factorized as UV^\top
  • Transpose: Backprop — gradients flow through W^\top (transposed weight matrix)

import numpy as np

# Matrix basics
A = np.array([[1, 2], [3, 4], [5, 6]])  # 3×2
print(f"Shape: {A.shape}")        # (3, 2)
print(f"Transpose: {A.T.shape}")  # (2, 3)

# Identity matrix
I = np.eye(3)
print(f"AI = A? {np.allclose(I @ np.array([[1],[2],[3]]), np.array([[1],[2],[3]]))}")

# Matrix-vector multiplication as linear combination of columns
W = np.array([[1, 0, -1],
              [0, 1,  2]])  # 2×3 weight matrix
x = np.array([3, 1, 2])
# Wx = 3*col1 + 1*col2 + 2*col3
result = W @ x  # [3*1 + 1*0 + 2*(-1), 3*0 + 1*1 + 2*2] = [1, 5]
print(f"Wx = {result}")

# Frobenius norm
print(f"||W||_F = {np.linalg.norm(W, 'fro'):.4f}")

# Symmetric matrix: covariance
X = np.random.randn(100, 3)
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (len(X) - 1)
print(f"Covariance is symmetric: {np.allclose(cov, cov.T)}")  # True

Matrix Examples (Ch 7)

Geometric Transformations

Matrices represent geometric transformations — each matrix is an "operation on space".

  • Rotation by \theta: \begin{bmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{bmatrix} — rotates a vector, preserving its length
  • Scaling by s_x, s_y: \begin{bmatrix}s_x & 0 \\ 0 & s_y\end{bmatrix} — stretches/compresses along the axes
  • Reflection (x-axis): \begin{bmatrix}1 & 0 \\ 0 & -1\end{bmatrix} — flips the y coordinate
  • Projection onto \mathbf{u}: \frac{\mathbf{u}\mathbf{u}^\top}{\mathbf{u}^\top\mathbf{u}} — projects a vector onto the direction of \mathbf{u}

Convolution as Matrix Multiplication

A 1D convolution can be written as a Toeplitz matrix — the kernel entries are laid out along the diagonals. This is why a CNN is still fundamentally matrix multiplication, just with a specially structured weight matrix (sparse + shared weights).

Selector Matrices

A selector matrix picks specific rows from a vector or matrix — each row is a standard basis vector \mathbf{e}_i^\top:

\mathbf{A} = \begin{bmatrix} \mathbf{e}_3^\top \\ \mathbf{e}_1^\top \\ \mathbf{e}_3^\top \end{bmatrix} \implies \mathbf{A}\mathbf{x} = \begin{bmatrix} x_3 \\ x_1 \\ x_3 \end{bmatrix}

Selector matrices can select, reorder, and even duplicate rows.

Applications in ML:

  • Embedding lookup: nn.Embedding is essentially a selector matrix times the embedding table: embedding[token_id] = one-hot(token_id)ᵀ × W
  • Sampling / masking: a random selector picks a subset of data points (mini-batch SGD)
  • Permutation matrix: a special selector that reorders without duplication (orthogonal, det = ±1)
import numpy as np

# Selector matrix: pick rows 2, 0, 2 from x
x = np.array([10, 20, 30, 40])
A = np.array([[0, 0, 1, 0],   # select x[2]
              [1, 0, 0, 0],   # select x[0]
              [0, 0, 1, 0]])  # select x[2] again
print(A @ x)  # [30, 10, 30]

# Embedding lookup IS selector matrix multiplication
vocab_size, embed_dim = 5, 3
W = np.random.randn(vocab_size, embed_dim)  # embedding table
token_id = 2
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1
embedding = one_hot @ W  # = W[token_id] — same as selector!
print(np.allclose(embedding, W[token_id]))  # True

# Permutation matrix (special selector — reorder, no duplicates)
P = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])  # cycle: 0→1→2→0
x = np.array([10, 20, 30])
print(P @ x)  # [30, 10, 20]
print(f"Orthogonal? {np.allclose(P.T @ P, np.eye(3))}")  # True

Incidence Matrix

An incidence matrix represents a graph as a matrix — rows = edges, columns = nodes:

A_{ij} = \begin{cases} -1 & \text{if edge } i \text{ starts at node } j \\ +1 & \text{if edge } i \text{ ends at node } j \\ 0 & \text{otherwise} \end{cases}

Each row has one -1 and one +1 (representing one edge connecting two nodes).

Applications in ML:

  • Graph Neural Networks (GNN): the adjacency matrix and incidence matrix describe the graph structure → message passing
  • Network flow: supply chain optimization = flow on a graph
  • Laplacian matrix: \mathbf{L} = \mathbf{A}^\top\mathbf{A} = graph Laplacian → spectral clustering (eigenvalues of L → cluster structure)
import numpy as np

# Graph: 0 → 1, 0 → 2, 1 → 2 (3 nodes, 3 edges)
# Incidence matrix: edges × nodes
A_inc = np.array([[-1,  1,  0],   # edge 0→1
                  [-1,  0,  1],   # edge 0→2
                  [ 0, -1,  1]])  # edge 1→2

# Graph Laplacian: L = A^T A
L = A_inc.T @ A_inc
print("Laplacian:\n", L)
# [[ 2, -1, -1],
#  [-1,  2, -1],
#  [-1, -1,  2]]

# Spectral clustering: eigenvalues of Laplacian
eigs = np.linalg.eigvalsh(L)
print(f"Laplacian eigenvalues: {eigs.round(2)}")
# Smallest eigenvalue always 0 (connected graph has exactly one 0)
# Number of zero eigenvalues = number of connected components
# Second smallest (Fiedler value) → used for spectral clustering

Interview Connection: Spectral Clustering

The eigenvectors of the graph Laplacian \mathbf{L} = \mathbf{D} - \mathbf{A} (D = degree matrix, A = adjacency matrix) encode cluster structure. The eigenvector for the second smallest eigenvalue (the Fiedler vector) gives the optimal cut for bipartitioning the graph. This is the mathematical foundation of spectral clustering.
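A minimal sketch of that idea, on a made-up graph: two triangles joined by a single edge. Splitting nodes by the sign of the Fiedler vector's entries recovers the two clusters.

```python
import numpy as np

# Adjacency matrix: triangles {0,1,2} and {3,4,5} joined by edge (2,3)
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]      # eigenvector of the second smallest eigenvalue

# Sign of the Fiedler entries partitions the graph into its two clusters
labels = (fiedler > 0).astype(int)
print(labels)  # nodes {0,1,2} get one label, {3,4,5} the other
```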

ML Connections

  • NN layers: each layer = a geometric transformation of the feature space
  • CNN convolution: Toeplitz-structured matrix multiplication (sparse + weight sharing)
  • Selector / Embedding: nn.Embedding = selector matrix × embedding table
  • Incidence / Laplacian: graph structure → GNN message passing, spectral clustering
  • PCA: rotation to align with max-variance directions
  • OLS: \hat{y} = X(X^\top X)^{-1}X^\top y = projection of y onto the column space of X
  • Data augmentation: rotation, scaling, reflection matrices applied to images

import numpy as np

# 2D Rotation matrix
theta = np.pi / 4  # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v = np.array([1, 0])
v_rotated = R @ v
print(f"Rotated: {v_rotated}")  # [0.707, 0.707]
print(f"Length preserved: {np.linalg.norm(v):.3f} → {np.linalg.norm(v_rotated):.3f}")

# Scaling matrix
S = np.array([[2, 0],
              [0, 0.5]])
print(f"Scaled: {S @ v}")  # [2, 0] — stretched along x, shrunk along y

# Projection matrix onto direction u
u = np.array([1, 1]) / np.sqrt(2)  # unit vector at 45°
P = np.outer(u, u)  # uu^T (since u is unit, no denominator needed)
w = np.array([3, 1])
w_proj = P @ w
print(f"Projection of {w} onto u: {w_proj}")  # [2, 2]

# Projection is idempotent: P^2 = P
print(f"P² = P? {np.allclose(P @ P, P)}")  # True

# Convolution as matrix multiplication (Toeplitz)
signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

# Build Toeplitz matrix for "valid" convolution
n, k = len(signal), len(kernel)
T = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    T[i, i:i+k] = kernel

conv_result = T @ signal
print(f"Conv via matmul: {conv_result}")        # [-2, -2, -2]
print(f"Conv via np:     {np.convolve(signal, kernel, 'valid')}")  # same

Linear Equations (Ch 8)

System Ax = b

The fundamental question: given A \in \mathbb{R}^{m \times n} and \mathbf{b} \in \mathbb{R}^m, find \mathbf{x} \in \mathbb{R}^n such that A\mathbf{x} = \mathbf{b}.

Whether a solution exists depends on whether \mathbf{b} lies in the column space of A (i.e., whether \mathbf{b} can be formed as a linear combination of the columns of A).

Three Cases

  • Exactly determined: m = n, full rank — unique solution \mathbf{x} = A^{-1}\mathbf{b}. ML example: square system, rare in ML
  • Over-determined: m > n (more equations) — no exact solution → least squares \min \|A\mathbf{x} - \mathbf{b}\|^2. ML example: OLS with n samples > p features
  • Under-determined: m < n (more unknowns) — infinitely many solutions → need regularization. ML example: p > n high-dimensional data, use Ridge/Lasso

Normal Equation (OLS)

For over-determined X\boldsymbol{\beta} = \mathbf{y}, the least squares solution minimizes \|X\boldsymbol{\beta} - \mathbf{y}\|^2:

X^\top X \boldsymbol{\beta} = X^\top \mathbf{y} \quad \Rightarrow \quad \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}

Intuition: \hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} is the orthogonal projection of \mathbf{y} onto the column space of X. The residual \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} is perpendicular to every column of X (X^\top \mathbf{e} = \mathbf{0}).
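The orthogonality property is easy to check numerically on synthetic data (coefficients and noise level below are invented): after fitting, X^\top \mathbf{e} should be zero up to floating-point error.

```python
import numpy as np

np.random.seed(0)
n, p = 100, 3
X = np.random.randn(n, p)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * np.random.randn(n)

# Normal equation fit
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
residual = y - X @ beta_hat

# X^T e = 0: the residual is orthogonal to every column of X
print(np.allclose(X.T @ residual, np.zeros(p)))  # True
```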

Ridge Regression: Fixing Singularity

When X^\top X is singular (multicollinearity or p > n), add \lambda I:

\hat{\boldsymbol{\beta}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}

Adding \lambda I makes every eigenvalue at least \lambda > 0 → always invertible → always a unique solution.

ML Connections

  • Over-determined Ax = b: OLS regression — more data points than parameters
  • Under-determined Ax = b: p > n — infinitely many solutions → need regularization (Ridge, Lasso)
  • Normal equation: OLS closed form — \hat{\beta} = (X^\top X)^{-1}X^\top y
  • Ridge: (X^\top X + \lambda I)^{-1}X^\top y — \lambda I ensures invertibility
  • Lasso: no closed form — but conceptually, the L1 constraint selects a sparse solution from the infinite set

import numpy as np

# Exactly determined: Ax = b with square A
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])
x = np.linalg.solve(A, b)
print(f"Solution: {x}")
print(f"Verify: Ax = {A @ x}")

# Over-determined: least squares (n > p)
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Normal equation
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"True beta:      {beta_true}")
print(f"Estimated beta: {np.round(beta_hat, 3)}")

# Equivalently using lstsq (numerically more stable)
beta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(f"lstsq beta:     {np.round(beta_lstsq, 3)}")

# Under-determined: p > n → infinite solutions → Ridge
n_small, p_large = 10, 50
X_wide = np.random.randn(n_small, p_large)
y_small = np.random.randn(n_small)

# OLS fails: X^T X is 50×50 but rank ≤ 10
print(f"Rank of X^T X: {np.linalg.matrix_rank(X_wide.T @ X_wide)}")  # 10

# Ridge regression: always works
lam = 1.0
beta_ridge = np.linalg.inv(X_wide.T @ X_wide + lam * np.eye(p_large)) @ X_wide.T @ y_small
print(f"Ridge solution norm: {np.linalg.norm(beta_ridge):.4f}")

Interview Point: Normal Equation vs Gradient Descent

"When do you use the normal equation, and when gradient descent?" → Normal equation: O(p^3) complexity, fast when p is small. Gradient descent: O(np) per iteration, better when p is large. A common rule of thumb is to switch to GD once p > 10000. Also, the normal equation requires X^\top X to be invertible; GD does not.
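A small sketch of the equivalence, on an invented well-conditioned toy problem: batch gradient descent on the least-squares loss converges to the same coefficients as the normal equation.

```python
import numpy as np

np.random.seed(0)
n, p = 200, 3
X = np.random.randn(n, p)
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * np.random.randn(n)

# Closed form: normal equation
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent on (1/n)||Xb - y||^2; gradient = (2/n) X^T (Xb - y)
beta_gd = np.zeros(p)
lr = 0.1
for _ in range(2000):
    grad = (2 / n) * X.T @ (X @ beta_gd - y)
    beta_gd -= lr * grad

print(np.allclose(beta_ne, beta_gd, atol=1e-6))  # True — both reach the same solution
```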


Linear Dynamical Systems (Ch 9)

State-Space Model

A linear dynamical system describes how a state evolves over time:

\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t

where \mathbf{x}_t is the state vector, A is the dynamics matrix, \mathbf{u}_t is the input, and B maps the input to the state.

Intuition: A determines how the system evolves on its own; B\mathbf{u}_t is the effect of external input.
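The simulations later in this section drive the system with A alone; this sketch (all numbers invented) adds the B u_t term. For a stable A, a constant input shifts the steady state away from zero.

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])          # stable: eigenvalues 0.9, 0.8
B = np.array([[1.0],
              [0.5]])
u = np.array([1.0])                 # constant external input

x = np.zeros(2)
for _ in range(500):
    x = A @ x + B @ u               # x_{t+1} = A x_t + B u_t

# Steady state solves x* = A x* + B u  →  x* = (I - A)^{-1} B u
x_star = np.linalg.inv(np.eye(2) - A) @ B @ u
print(np.allclose(x, x_star))       # True — converged to the nonzero steady state
```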

Stability via Eigenvalues

System stability depends on eigenvalues of AA:

  • All |\lambda_i| < 1: state converges to 0 → stable. ML analogy: vanishing gradients in RNN
  • Some |\lambda_i| > 1: state explodes → unstable. ML analogy: exploding gradients in RNN
  • |\lambda_i| = 1: state neither decays nor explodes → marginally stable. ML analogy: ideal gradient flow

ML Connections

  • RNN hidden state: \mathbf{h}_t = f(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t) — a linear dynamical system plus a nonlinearity
  • Vanishing/exploding gradients: eigenvalues of W_h — if all < 1 → vanishing; if any > 1 → exploding
  • Markov chains: \boldsymbol{\pi}_{t+1} = P^\top \boldsymbol{\pi}_t — the stationary distribution is the eigenvector for \lambda = 1
  • Kalman filter: state estimation — predict with the dynamics model, update with the observation
  • LSTM/GRU: gates control the effective eigenvalue spectrum → prevent vanishing/exploding

import numpy as np

# Simulate a linear dynamical system: x_{t+1} = A x_t
# Stable system (eigenvalues < 1)
A_stable = np.array([[0.8, 0.1],
                      [-0.1, 0.7]])
eigenvalues = np.linalg.eigvals(A_stable)
print(f"Eigenvalues: {eigenvalues}")
print(f"Magnitudes:  {np.abs(eigenvalues)}")  # Both < 1 → stable

x = np.array([5.0, 3.0])
trajectory = [x.copy()]
for t in range(50):
    x = A_stable @ x
    trajectory.append(x.copy())
trajectory = np.array(trajectory)
print(f"After 50 steps: {trajectory[-1]}")  # → [0, 0]

# Unstable system (eigenvalue > 1)
A_unstable = np.array([[1.1, 0.0],
                        [0.0, 0.9]])
eigs = np.linalg.eigvals(A_unstable)
print(f"\nUnstable eigenvalues: {eigs}")  # 1.1 > 1 → explodes

x = np.array([1.0, 1.0])
for t in range(20):
    x = A_unstable @ x
print(f"After 20 steps: {x}")  # First component explodes

# RNN analogy: repeated multiplication by W_h
# If max |eigenvalue(W_h)| > 1, gradients explode
W_h = np.random.randn(3, 3) * 0.5  # small weights → stable
eigs_rnn = np.linalg.eigvals(W_h)
print(f"\nRNN W_h eigenvalue magnitudes: {np.abs(eigs_rnn)}")
print(f"Gradient will {'vanish' if np.max(np.abs(eigs_rnn)) < 1 else 'explode'}")

# Markov chain: stationary distribution
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
eigs_mc, vecs_mc = np.linalg.eig(P.T)
# Stationary distribution = eigenvector for eigenvalue 1
idx = np.argmin(np.abs(eigs_mc - 1.0))
pi = np.abs(vecs_mc[:, idx])
pi = pi / pi.sum()
print(f"\nStationary distribution: {pi}")  # [4/7, 3/7] ≈ [0.571, 0.429]
print(f"Verify: pi @ P = {pi @ P}")       # Should equal pi

Application Examples

Population Dynamics: predator-prey relationships between species modeled as a linear system. Each component of the population vector \mathbf{x}_t = the count of one species. The matrix A encodes growth rates and inter-species interactions.

Epidemic Dynamics (SIR Model): state vector \mathbf{x}_t = [S_t, I_t, R_t] (Susceptible, Infected, Recovered). The transition matrix describes infection and recovery rates. The linear approximation is reasonable in the early stage; at scale a nonlinear model is needed.

Motion of a Mass: position + velocity = state vector [x, v]^\top. A encodes the physics (Newton's second law, discretized). This corresponds directly to the state space of continuous control (robotics) in RL.

Supply Chain Dynamics: each node = a warehouse/store, state = inventory level. A describes shipping between nodes, \mathbf{u} = production/demand. Optimal control → minimize inventory cost → linear quadratic regulator (LQR).
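The SIR and mass examples are coded below; a similarly minimal population-dynamics sketch (all rates invented for illustration) looks like this. Complex eigenvalues of A are the signature of predator-prey oscillation.

```python
import numpy as np

# Two species: x = [prey, predator]; linearized interaction model
A_pop = np.array([[1.05, -0.10],   # prey grow 5%, reduced by predators
                  [0.04,  0.95]])  # predators decay 5%, grow by eating prey

x = np.array([100.0, 20.0])
for _ in range(50):
    x = A_pop @ x                  # x_{t+1} = A x_t
print(f"After 50 steps: prey={x[0]:.1f}, predators={x[1]:.1f}")

# Long-run behavior is governed by the eigenvalues of A_pop:
# a complex-conjugate pair means the populations oscillate
eigs = np.linalg.eigvals(A_pop)
print(f"Eigenvalue magnitudes: {np.abs(eigs).round(3)}")
```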

import numpy as np

# SIR Epidemic Model (linearized)
# State: [S, I, R] (Susceptible, Infected, Recovered)
beta = 0.3   # infection rate
gamma = 0.1  # recovery rate

# Transition matrix (simplified, linearized around disease-free state)
A_sir = np.array([[1-beta, 0,     0],
                  [beta,   1-gamma, 0],
                  [0,      gamma,  1]])

x = np.array([0.99, 0.01, 0.0])  # 1% initially infected
for t in range(30):
    x = A_sir @ x
    x = np.clip(x, 0, 1)  # ensure non-negative
    x = x / x.sum()        # normalize (conservation)
print(f"After 30 days: S={x[0]:.3f}, I={x[1]:.3f}, R={x[2]:.3f}")

# Motion of a mass: state = [position, velocity]
dt = 0.1
A_motion = np.array([[1, dt],    # x_{t+1} = x_t + v_t * dt
                     [0, 0.99]]) # v_{t+1} = 0.99 * v_t (friction)
x_motion = np.array([0.0, 1.0])  # start at origin, velocity = 1
for t in range(100):
    x_motion = A_motion @ x_motion
print(f"Final position: {x_motion[0]:.2f}, velocity: {x_motion[1]:.4f}")
# Position accumulates, velocity decays (stable because eigenvalues < 1)

Matrix Multiplication (Ch 10)

Matrix-Matrix Multiplication

For A \in \mathbb{R}^{m \times n} and B \in \mathbb{R}^{n \times p}:

C = AB \in \mathbb{R}^{m \times p}, \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

Inner dimensions must match: (m \times n) \times (n \times p) = (m \times p)

Intuition: C_{ij} = dot product of row i of A and column j of B.

Key Properties

  • Not commutative: AB \neq BA in general
  • Associative: (AB)C = A(BC)
  • Composition: AB corresponds to applying transformation B first, then A
  • Transpose: (AB)^\top = B^\top A^\top

Matrix Power

A^k = \underbrace{A \cdot A \cdots A}_{k \text{ times}} — applying transformation A repeatedly, k times.

PageRank: importance vector \mathbf{r} = \lim_{k \to \infty} M^k \mathbf{r}_0 where M is the link matrix. Power iteration is just repeated matrix-vector multiplication until convergence.

QR Factorization

Any matrix A with linearly independent columns can be decomposed as:

A = QR

where Q has orthonormal columns (Q^\top Q = I) and R is upper triangular.

Why it matters: numerically stable least squares — instead of forming X^\top X (which squares the condition number), solve directly via QR.

ML Connections

  • Chain of matmuls: NN forward pass — \mathbf{h}_L = W_L \sigma(\cdots W_2 \sigma(W_1 \mathbf{x}))
  • Matrix power: PageRank (power iteration), Markov chain convergence
  • QK^\top V: attention — matmul of queries with keys, then with values
  • QR decomposition: numerically stable OLS; Gram-Schmidt in disguise
  • Computational cost: (m \times n) \times (n \times p) costs O(mnp) — Transformer attention: O(n^2 d)

import numpy as np

# Matrix multiplication: shape rules
A = np.random.randn(3, 4)  # 3×4
B = np.random.randn(4, 2)  # 4×2
C = A @ B                   # 3×2
print(f"({A.shape}) @ ({B.shape}) = {C.shape}")

# Composition: AB means "apply B then A"
# Rotate then scale = Scale_matrix @ Rotate_matrix
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[2, 0],
              [0, 0.5]])
combined = S @ R  # first rotate, then scale
v = np.array([1, 0])
print(f"Rotate then scale: {combined @ v}")
print(f"Same as: {S @ (R @ v)}")  # associative

# Matrix power: PageRank-style
M = np.array([[0, 0.5, 0],
              [1, 0, 0.5],
              [0, 0.5, 0.5]])
r = np.array([1/3, 1/3, 1/3])  # initial uniform
for i in range(100):
    r = M @ r
    r = r / r.sum()  # normalize
print(f"\nPageRank converged: {np.round(r, 4)}")

# QR decomposition
X = np.random.randn(50, 3)
Q, R_mat = np.linalg.qr(X)
print(f"\nQ shape: {Q.shape}, R shape: {R_mat.shape}")
print(f"Q is orthogonal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"A = QR? {np.allclose(X, Q @ R_mat)}")

# Solve least squares via QR (more stable than normal equation)
y = np.random.randn(50)
# From Xb = y → QRb = y → Rb = Q^T y
beta_qr = np.linalg.solve(R_mat, Q.T @ y)
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y  # normal equation
print(f"QR solution ≈ Normal eq: {np.allclose(beta_qr, beta_ne)}")

Matrix Inverses (Ch 11)

Left and Right Inverses

For non-square matrices:

  • Left inverse of A (m > n, tall): A_L^{-1} = (A^\top A)^{-1} A^\top such that A_L^{-1} A = I_n
  • Right inverse of A (m < n, wide): A_R^{-1} = A^\top (A A^\top)^{-1} such that A A_R^{-1} = I_m

A left inverse exists → the over-determined system has a unique least-squares solution. A right inverse exists → the under-determined system has a (non-unique) solution.
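A quick numeric check of both formulas, on random matrices (which are full rank almost surely, so the inverses in the formulas exist):

```python
import numpy as np

np.random.seed(1)

# Tall matrix (m > n): left inverse
A_tall = np.random.randn(6, 3)
A_left = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T
print(np.allclose(A_left @ A_tall, np.eye(3)))   # True: A_L^{-1} A = I_n

# Wide matrix (m < n): right inverse
A_wide = np.random.randn(3, 6)
A_right = A_wide.T @ np.linalg.inv(A_wide @ A_wide.T)
print(np.allclose(A_wide @ A_right, np.eye(3)))  # True: A A_R^{-1} = I_m
```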

Square Matrix Inverse

A^{-1} exists if and only if:

  1. A is square (n \times n)
  2. Full rank: \text{rank}(A) = n
  3. \det(A) \neq 0

For a square matrix, conditions 2 and 3 are equivalent.

A^{-1}A = AA^{-1} = I

Pseudo-Inverse (Moore-Penrose)

For any matrix A (even non-square or singular), the pseudo-inverse A^+ always exists:

  • If A has full column rank: A^+ = (A^\top A)^{-1} A^\top (left inverse)
  • If A has full row rank: A^+ = A^\top (A A^\top)^{-1} (right inverse)
  • General case: computed via SVD — A = U\Sigma V^\top \Rightarrow A^+ = V\Sigma^+ U^\top

np.linalg.lstsq and np.linalg.pinv are built on the pseudo-inverse.
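The SVD construction can be verified directly: invert the nonzero singular values, transpose the shape, and compare against np.linalg.pinv. A minimal sketch:

```python
import numpy as np

np.random.seed(2)
A = np.random.randn(5, 3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: reciprocal of the nonzero singular values (with a tolerance cutoff)
s_pinv = np.where(s > 1e-10, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_pinv) @ U.T   # A^+ = V Sigma^+ U^T

print(np.allclose(A_pinv, np.linalg.pinv(A)))  # True
```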

ML Connections

  • (X^\top X)^{-1}: OLS normal equation — requires X^\top X invertible
  • (X^\top X + \lambda I)^{-1}: Ridge — \lambda I guarantees invertibility
  • Pseudo-inverse X^+: np.linalg.lstsq — works even when X^\top X is singular
  • Mahalanobis distance: d = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})} — needs \Sigma^{-1}
  • Second-order optimization: computing the Hessian inverse (Newton's method, natural gradient)

import numpy as np

# Matrix inverse
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(f"A @ A_inv = I? {np.allclose(A @ A_inv, np.eye(2))}")

# Solving Ax = b via inverse
b = np.array([1, 2])
x = A_inv @ b
print(f"Solution: {x}")
print(f"Verify: {A @ x}")  # Should be [1, 2]

# Pseudo-inverse for non-square matrix
X = np.random.randn(20, 3)   # tall matrix
X_pinv = np.linalg.pinv(X)
print(f"\nX shape: {X.shape}, X+ shape: {X_pinv.shape}")
print(f"X+ @ X ≈ I? {np.allclose(X_pinv @ X, np.eye(3))}")  # Left inverse

# lstsq uses pseudo-inverse internally
y = np.random.randn(20)
beta_pinv = X_pinv @ y
beta_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"pinv ≈ lstsq: {np.allclose(beta_pinv, beta_lstsq)}")

# Ridge vs OLS: Ridge always invertible
np.random.seed(0)
n, p = 10, 50  # p > n → OLS fails
X_wide = np.random.randn(n, p)
y_wide = np.random.randn(n)

# OLS: X^T X is singular
rank_XtX = np.linalg.matrix_rank(X_wide.T @ X_wide)
print(f"\nX^T X rank: {rank_XtX} (need {p} for invertibility)")

# Ridge: always works
lam = 1.0
beta_ridge = np.linalg.inv(X_wide.T @ X_wide + lam * np.eye(p)) @ X_wide.T @ y_wide
print(f"Ridge rank (X^TX + λI): {np.linalg.matrix_rank(X_wide.T @ X_wide + lam * np.eye(p))}")  # 50

# Pseudo-inverse also works
beta_pinv_wide = np.linalg.pinv(X_wide) @ y_wide
print(f"Pseudo-inverse gives minimum-norm solution: ||beta||={np.linalg.norm(beta_pinv_wide):.4f}")

Rank, Determinant & Positive Definiteness

Matrix Rank

Rank = number of linearly independent columns (= number of linearly independent rows).

\text{rank}(A) \leq \min(m, n)

  • Full rank: \text{rank}(A) = \min(m, n) — no redundant columns/rows
  • Rank deficient: \text{rank}(A) < \min(m, n) — the matrix is singular (if square)

Intuition: the rank tells you how many dimensions the matrix actually "uses". Rank deficient = some dimensions are redundant.

Determinant

For a square matrix A \in \mathbb{R}^{n \times n}:

\det(A) = 0 \iff A \text{ is singular (not invertible)}

Geometric interpretation: |\det(A)| = the factor by which A scales volumes.

  • \det = 0: squashes space to a lower dimension (information lost)
  • \det > 0: preserves orientation
  • \det < 0: flips orientation

Positive Definiteness

A symmetric matrix A is positive definite (PD) if:

\mathbf{x}^\top A \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}

Equivalent conditions (all say the same thing):

  • All eigenvalues > 0
  • All leading principal minors > 0
  • Cholesky decomposition A = LL^\top exists
  • A = B^\top B for some full-rank B

Positive semi-definite (PSD): \mathbf{x}^\top A \mathbf{x} \geq 0 (eigenvalues \geq 0)

ML Connections

  • Rank < p: multicollinearity → X^\top X singular → OLS fails
  • Low-rank approximation: PCA (k components), matrix factorization (recommenders), LoRA (LLM fine-tuning)
  • Covariance PSD: \Sigma = \frac{1}{n-1}X^\top X is always PSD → eigenvalues \geq 0 → PCA works
  • Kernel matrix PSD: Mercer's theorem — a valid kernel \Leftrightarrow the kernel matrix is PSD
  • Hessian PD: H \succ 0 everywhere \Rightarrow the loss is strictly convex → unique global minimum
  • Determinant: Gaussian log-likelihood involves \log\det(\Sigma); singular \Sigma → \det = 0 → problem
  • Cholesky: fast sampling from a multivariate Gaussian — \mathbf{x} = \boldsymbol{\mu} + L\mathbf{z}, \mathbf{z} \sim \mathcal{N}(0,I)

import numpy as np

# Matrix rank
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])  # row 3 = row 1 + row 2
print(f"Rank: {np.linalg.matrix_rank(A)}")  # 2 (rank deficient)

# Full rank matrix
B = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 10]])  # independent rows
print(f"Rank: {np.linalg.matrix_rank(B)}")  # 3

# Determinant
print(f"\ndet(A) = {np.linalg.det(A):.6f}")  # ≈ 0 (singular)
print(f"det(B) = {np.linalg.det(B):.6f}")    # ≠ 0 (invertible)

# Positive definiteness check via eigenvalues
M = np.array([[4, 2],
              [2, 3]])
eigvals = np.linalg.eigvalsh(M)  # eigvalsh for symmetric matrices
print(f"\nEigenvalues: {eigvals}")
print(f"Positive definite: {np.all(eigvals > 0)}")  # True

# Covariance matrix is always PSD
X = np.random.randn(100, 5)
X_c = X - X.mean(axis=0)
cov = (X_c.T @ X_c) / (len(X) - 1)
cov_eigs = np.linalg.eigvalsh(cov)
print(f"\nCovariance eigenvalues: {np.round(cov_eigs, 4)}")
print(f"All >= 0 (PSD): {np.all(cov_eigs >= -1e-10)}")  # True

# Cholesky decomposition (only works for PD matrices)
L = np.linalg.cholesky(cov)
print(f"\nCholesky L shape: {L.shape}")
print(f"LL^T = cov? {np.allclose(L @ L.T, cov)}")

# Sampling from multivariate Gaussian using Cholesky
mu = np.zeros(5)
z = np.random.randn(5)
sample = mu + L @ z  # Much faster than np.random.multivariate_normal for repeated sampling

# Check: not PD matrix → Cholesky fails
M_not_pd = np.array([[1, 2], [2, 1]])  # eigenvalues: 3, -1
eigs_bad = np.linalg.eigvalsh(M_not_pd)
print(f"\nNot PD eigenvalues: {eigs_bad}")  # [-1, 3]
try:
    np.linalg.cholesky(M_not_pd)
except np.linalg.LinAlgError as e:
    print(f"Cholesky failed: {e}")

Interview Point: Why Ridge Always Works

"Why can ridge regression always be solved?" → OLS needs (X^\top X)^{-1}, but X^\top X may be singular. Ridge adds \lambda I: (X^\top X + \lambda I). If \sigma_i is an eigenvalue of X^\top X, adding \lambda I shifts it to \sigma_i + \lambda. Since \lambda > 0 and \sigma_i \geq 0, all eigenvalues are > 0 → positive definite → always invertible.
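That eigenvalue shift is easy to verify numerically: the spectrum of X^\top X + \lambda I is exactly the spectrum of X^\top X moved up by \lambda, even when X^\top X is singular.

```python
import numpy as np

np.random.seed(3)
n, p = 10, 20            # p > n → X^T X is singular (rank <= 10)
X = np.random.randn(n, p)
lam = 0.5

eigs = np.linalg.eigvalsh(X.T @ X)              # some are (numerically) zero
eigs_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(p))

print(np.allclose(eigs_ridge, eigs + lam))      # True: spectrum shifted up by lambda
print(np.min(eigs_ridge) >= lam - 1e-8)         # True: every eigenvalue >= lambda
```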


Practice

Flashcards

What is linear independence? Why does it matter in ML?

Vectors are linearly independent when none of them can be written as a linear combination of the others; the only way to make α₁v₁ + ... + αₖvₖ = 0 is with every αᵢ = 0. In ML: if features are linearly dependent → multicollinearity → X^TX singular → OLS has no unique solution. Use Ridge or drop features.


Quiz


Three vectors v₁, v₂, v₃ in R³, with rank([v₁ v₂ v₃]) = 2. Which of the following is correct?
