Matrices & Linear Equations

Course Reference

Based on Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford). Covers Chapters 5-11.

Why This Matters for ML

This chapter connects linear algebra from abstract concepts to the core operations of ML — the OLS normal equation, the NN forward pass, RNN state transitions, the Ridge regularization trick, PCA's orthogonal projection. Mastering these lets you understand why ML works, not just how to use it.


Linear Independence (Ch 5)

Linear Dependence

A set of vectors \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k is linearly dependent if there exist scalars \alpha_1, \ldots, \alpha_k, not all zero, such that:

\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0}

Intuition: if one vector can be written as a linear combination of the others, it is "redundant" — it adds no new information.

Linearly independent = the only way to satisfy the equation above is with every \alpha_i = 0. Each vector contributes a new "direction".

Basis

A basis for a subspace is a minimal spanning set — a set of vectors that is linearly independent and spans the entire subspace.

  • A basis of \mathbb{R}^n has exactly n vectors
  • Standard basis: \mathbf{e}_1 = [1,0,\ldots,0]^\top, \mathbf{e}_2 = [0,1,\ldots,0]^\top, \ldots
  • A basis is not unique — the same subspace has infinitely many bases

Orthonormal Vectors

Vectors \mathbf{q}_1, \ldots, \mathbf{q}_k are orthonormal if:

\mathbf{q}_i^\top \mathbf{q}_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}

Two properties: (1) mutually orthogonal (perpendicular, dot product = 0), and (2) each has unit length (norm = 1).

Why it matters: an orthonormal basis makes computation extremely simple — projection reduces to a dot product, and the inverse is just the transpose.
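Both simplifications are easy to verify numerically. A minimal sketch, using a made-up 45° orthonormal basis of R²: coordinates in the basis come from dot products, and for a square orthonormal (orthogonal) matrix the inverse equals the transpose.

```python
import numpy as np

# An orthonormal basis of R^2 (axes rotated by 45 degrees)
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([-1.0, 1.0]) / np.sqrt(2)
Q = np.column_stack([q1, q2])

x = np.array([3.0, 1.0])

# Coordinates of x in this basis: just dot products, no linear solve needed
coords = np.array([q1 @ x, q2 @ x])
print(np.allclose(Q @ coords, x))          # True — x reconstructed from coordinates

# Inverse is the transpose
print(np.allclose(np.linalg.inv(Q), Q.T))  # True
```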

Gram-Schmidt Process

Given linearly independent vectors, Gram-Schmidt turns them into orthonormal vectors:

  1. \mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|
  2. For each subsequent vector: subtract projections onto previous \mathbf{q}'s, then normalize
\tilde{\mathbf{q}}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} (\mathbf{q}_j^\top \mathbf{v}_k) \mathbf{q}_j, \quad \mathbf{q}_k = \frac{\tilde{\mathbf{q}}_k}{\|\tilde{\mathbf{q}}_k\|}

Intuition: each step removes from the new vector the components along directions already covered, keeps only the new direction, and normalizes it to unit length.

ML Connections

  • Linear dependence: Multicollinearity — features are dependent → X^\top X singular → OLS fails
  • Basis: PCA finds an orthonormal basis of max-variance directions; word embeddings span a "meaning space"
  • Orthonormal vectors: PCA components are orthonormal → uncorrelated → each captures unique variance
  • Gram-Schmidt: QR decomposition (used for numerically stable least squares) is based on Gram-Schmidt
  • Independence check: Feature engineering — if rank(X) < p → redundant features → remove or regularize

import numpy as np

# Check linear independence via matrix rank
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([5, 7, 9])  # v3 = v1 + v2 → dependent!

A = np.column_stack([v1, v2, v3])
print(f"Rank: {np.linalg.matrix_rank(A)}")  # 2, not 3 → linearly dependent

# Independent set
w1 = np.array([1, 0, 0])
w2 = np.array([0, 1, 0])
w3 = np.array([0, 0, 1])
B = np.column_stack([w1, w2, w3])
print(f"Rank: {np.linalg.matrix_rank(B)}")  # 3 → independent

# Gram-Schmidt orthogonalization
def gram_schmidt(V):
    """V: matrix with vectors as columns. Returns Q (orthonormal columns)."""
    n, k = V.shape
    Q = np.zeros((n, k))
    for j in range(k):
        q = V[:, j].copy().astype(float)
        for i in range(j):
            q -= np.dot(Q[:, i], V[:, j]) * Q[:, i]  # subtract projection
        Q[:, j] = q / np.linalg.norm(q)               # normalize
    return Q

V = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q = gram_schmidt(V)
print("Q:\n", Q)

# Verify orthonormality: Q^T Q should be identity
print("Q^T Q:\n", np.round(Q.T @ Q, 10))  # [[1, 0], [0, 1]]

Matrices (Ch 6)

Matrix Basics

A matrix A \in \mathbb{R}^{m \times n} is a 2D array with m rows and n columns. In ML, matrices are everywhere.

Special matrices:

  • Zero matrix \mathbf{0}_{m \times n}: all entries 0
  • Identity matrix I_n: diagonal = 1, rest = 0; AI = IA = A
  • Transpose A^\top: rows ↔ columns, (A^\top)_{ij} = A_{ji}
  • Symmetric: A = A^\top (a covariance matrix is symmetric)

Matrix Operations

Addition: C = A + B where C_{ij} = A_{ij} + B_{ij} (same dimensions required)

Frobenius norm: treat the matrix as a vector and take its L2 norm

\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{tr}(A^\top A)}

Matrix-Vector Multiplication

A\mathbf{x} = linear combination of columns of A, weighted by entries of \mathbf{x}:

A\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n

Intuition: a matrix-vector multiply asks "using the columns of A as a basis and the entries of x as weights, what vector can we build?" This is why the output Ax always lies in the column space of A.

ML Connections

  • Data matrix X: n \times p — rows = samples, columns = features
  • Weight matrix W: NN layer — \mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})
  • Covariance matrix: \Sigma = \frac{1}{n-1}X^\top X (centered) — symmetric, PSD
  • Attention scores: QK^\top — n \times n matrix of pairwise similarities
  • User-item matrix: Recommender systems — sparse matrix, factorized as UV^\top
  • Transpose: Backprop — gradients flow through W^\top (transposed weight matrix)

import numpy as np

# Matrix basics
A = np.array([[1, 2], [3, 4], [5, 6]])  # 3×2
print(f"Shape: {A.shape}")        # (3, 2)
print(f"Transpose: {A.T.shape}")  # (2, 3)

# Identity matrix
I = np.eye(3)
print(f"AI = A? {np.allclose(I @ np.array([[1],[2],[3]]), np.array([[1],[2],[3]]))}")

# Matrix-vector multiplication as linear combination of columns
W = np.array([[1, 0, -1],
              [0, 1,  2]])  # 2×3 weight matrix
x = np.array([3, 1, 2])
# Wx = 3*col1 + 1*col2 + 2*col3
result = W @ x  # [3*1 + 1*0 + 2*(-1), 3*0 + 1*1 + 2*2] = [1, 5]
print(f"Wx = {result}")

# Frobenius norm
print(f"||W||_F = {np.linalg.norm(W, 'fro'):.4f}")

# Symmetric matrix: covariance
X = np.random.randn(100, 3)
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (len(X) - 1)
print(f"Covariance is symmetric: {np.allclose(cov, cov.T)}")  # True

Matrix Examples (Ch 7)

Geometric Transformations

Matrices represent geometric transformations — each matrix is an "operation on space".

  • Rotation by \theta: \begin{bmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{bmatrix} — rotates a vector, preserving its length
  • Scaling by s_x, s_y: \begin{bmatrix}s_x & 0 \\ 0 & s_y\end{bmatrix} — stretches/compresses along the axes
  • Reflection (x-axis): \begin{bmatrix}1 & 0 \\ 0 & -1\end{bmatrix} — flips the y coordinate
  • Projection onto \mathbf{u}: \frac{\mathbf{u}\mathbf{u}^\top}{\mathbf{u}^\top\mathbf{u}} — projects a vector onto the direction of \mathbf{u}

Convolution as Matrix Multiplication

A 1D convolution can be written as a Toeplitz matrix — the kernel entries are laid out along the diagonals. This is why a CNN is still fundamentally matrix multiplication, just with a specially structured weight matrix (sparse + shared weights).

Selector Matrices

A selector matrix picks specific rows from a vector or matrix — each row is a standard basis vector \mathbf{e}_i^\top:

\mathbf{A} = \begin{bmatrix} \mathbf{e}_3^\top \\ \mathbf{e}_1^\top \\ \mathbf{e}_3^\top \end{bmatrix} \implies \mathbf{A}\mathbf{x} = \begin{bmatrix} x_3 \\ x_1 \\ x_3 \end{bmatrix}

Selector matrices can select, reorder, and even duplicate rows.

Applications in ML:

  • Embedding lookup: nn.Embedding is essentially a selector matrix times the embedding table: embedding[token_id] = one-hot(token_id)ᵀ × W
  • Sampling / masking: a random selector picks a subset of data points (mini-batch SGD)
  • Permutation matrix: a special selector that reorders without duplication (orthogonal, det = ±1)
import numpy as np

# Selector matrix: pick rows 2, 0, 2 from x
x = np.array([10, 20, 30, 40])
A = np.array([[0, 0, 1, 0],   # select x[2]
              [1, 0, 0, 0],   # select x[0]
              [0, 0, 1, 0]])  # select x[2] again
print(A @ x)  # [30, 10, 30]

# Embedding lookup IS selector matrix multiplication
vocab_size, embed_dim = 5, 3
W = np.random.randn(vocab_size, embed_dim)  # embedding table
token_id = 2
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1
embedding = one_hot @ W  # = W[token_id] — same as selector!
print(np.allclose(embedding, W[token_id]))  # True

# Permutation matrix (special selector — reorder, no duplicates)
P = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])  # cycle: 0→1→2→0
x = np.array([10, 20, 30])
print(P @ x)  # [30, 10, 20]
print(f"Orthogonal? {np.allclose(P.T @ P, np.eye(3))}")  # True

Incidence Matrix

An incidence matrix represents a graph as a matrix — rows = edges, columns = nodes:

A_{ij} = \begin{cases} -1 & \text{if edge } i \text{ starts at node } j \\ +1 & \text{if edge } i \text{ ends at node } j \\ 0 & \text{otherwise} \end{cases}

Each row has one -1 and one +1 (representing one edge connecting two nodes).

Applications in ML:

  • Graph Neural Networks (GNN): the adjacency matrix and incidence matrix describe the graph structure → message passing
  • Network flow: supply chain optimization = flow on a graph
  • Laplacian matrix: \mathbf{L} = \mathbf{A}^\top\mathbf{A} = graph Laplacian → spectral clustering (eigenvalues of L → cluster structure)
import numpy as np

# Graph: 0 → 1, 0 → 2, 1 → 2 (3 nodes, 3 edges)
# Incidence matrix: edges × nodes
A_inc = np.array([[-1,  1,  0],   # edge 0→1
                  [-1,  0,  1],   # edge 0→2
                  [ 0, -1,  1]])  # edge 1→2

# Graph Laplacian: L = A^T A
L = A_inc.T @ A_inc
print("Laplacian:\n", L)
# [[ 2, -1, -1],
#  [-1,  2, -1],
#  [-1, -1,  2]]

# Spectral clustering: eigenvalues of Laplacian
eigs = np.linalg.eigvalsh(L)
print(f"Laplacian eigenvalues: {eigs.round(2)}")
# Smallest eigenvalue always 0 (connected graph has exactly one 0)
# Number of zero eigenvalues = number of connected components
# Second smallest (Fiedler value) → used for spectral clustering

Interview Connection: Spectral Clustering

The eigenvectors of the graph Laplacian \mathbf{L} = \mathbf{D} - \mathbf{A} (D = degree matrix, A = adjacency matrix) encode cluster structure. The eigenvector for the second smallest eigenvalue (the Fiedler vector) gives the optimal cut for bipartitioning the graph. This is the mathematical foundation of spectral clustering.
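A minimal sketch of that idea, on a made-up graph: two triangles joined by a single edge. Splitting nodes by the sign of the Fiedler vector's entries recovers the two clusters.

```python
import numpy as np

# Adjacency matrix: triangles {0,1,2} and {3,4,5} joined by edge (2,3)
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]      # eigenvector of the second smallest eigenvalue

# Sign of the Fiedler entries partitions the graph into its two clusters
labels = (fiedler > 0).astype(int)
print(labels)  # nodes {0,1,2} get one label, {3,4,5} the other
```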

ML Connections

  • NN layers: each layer = a geometric transformation of the feature space
  • CNN convolution: Toeplitz-structured matrix multiplication (sparse + weight sharing)
  • Selector / Embedding: nn.Embedding = selector matrix × embedding table
  • Incidence / Laplacian: graph structure → GNN message passing, spectral clustering
  • PCA: rotation to align with max-variance directions
  • OLS: \hat{y} = X(X^\top X)^{-1}X^\top y = projection of y onto the column space of X
  • Data augmentation: rotation, scaling, reflection matrices applied to images

import numpy as np

# 2D Rotation matrix
theta = np.pi / 4  # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v = np.array([1, 0])
v_rotated = R @ v
print(f"Rotated: {v_rotated}")  # [0.707, 0.707]
print(f"Length preserved: {np.linalg.norm(v):.3f} → {np.linalg.norm(v_rotated):.3f}")

# Scaling matrix
S = np.array([[2, 0],
              [0, 0.5]])
print(f"Scaled: {S @ v}")  # [2, 0] — stretched along x, shrunk along y

# Projection matrix onto direction u
u = np.array([1, 1]) / np.sqrt(2)  # unit vector at 45°
P = np.outer(u, u)  # uu^T (since u is unit, no denominator needed)
w = np.array([3, 1])
w_proj = P @ w
print(f"Projection of {w} onto u: {w_proj}")  # [2, 2]

# Projection is idempotent: P^2 = P
print(f"P² = P? {np.allclose(P @ P, P)}")  # True

# Convolution as matrix multiplication (Toeplitz)
signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

# Build Toeplitz matrix for "valid" convolution
n, k = len(signal), len(kernel)
T = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    T[i, i:i+k] = kernel

conv_result = T @ signal
print(f"Conv via matmul: {conv_result}")        # [-2, -2, -2]
print(f"Conv via np:     {np.convolve(signal, kernel, 'valid')}")  # same

Linear Equations (Ch 8)

System Ax = b

The fundamental question: given A \in \mathbb{R}^{m \times n} and \mathbf{b} \in \mathbb{R}^m, find \mathbf{x} \in \mathbb{R}^n such that A\mathbf{x} = \mathbf{b}.

Whether a solution exists depends on whether \mathbf{b} lies in the column space of A (i.e., whether \mathbf{b} can be formed as a linear combination of the columns of A).

Three Cases

  • Exactly determined: m = n, full rank — unique solution \mathbf{x} = A^{-1}\mathbf{b}. ML example: square system, rare in ML
  • Over-determined: m > n (more equations) — no exact solution → least squares \min \|A\mathbf{x} - \mathbf{b}\|^2. ML example: OLS with n samples > p features
  • Under-determined: m < n (more unknowns) — infinitely many solutions → need regularization. ML example: p > n high-dimensional data, use Ridge/Lasso

Normal Equation (OLS)

For over-determined X\boldsymbol{\beta} = \mathbf{y}, the least squares solution minimizes \|X\boldsymbol{\beta} - \mathbf{y}\|^2:

X^\top X \boldsymbol{\beta} = X^\top \mathbf{y} \quad \Rightarrow \quad \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}

Intuition: \hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} is the orthogonal projection of \mathbf{y} onto the column space of X. The residual \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} is perpendicular to every column of X (X^\top \mathbf{e} = \mathbf{0}).
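The orthogonality property is easy to check numerically on synthetic data (coefficients and noise level below are invented): after fitting, X^\top \mathbf{e} should be zero up to floating-point error.

```python
import numpy as np

np.random.seed(0)
n, p = 100, 3
X = np.random.randn(n, p)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * np.random.randn(n)

# Normal equation fit
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
residual = y - X @ beta_hat

# X^T e = 0: the residual is orthogonal to every column of X
print(np.allclose(X.T @ residual, np.zeros(p)))  # True
```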

Ridge Regression: Fixing Singularity

When X^\top X is singular (multicollinearity or p > n), add \lambda I:

\hat{\boldsymbol{\beta}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}

Adding \lambda I makes every eigenvalue at least \lambda > 0 → always invertible → always a unique solution.

ML Connections

  • Over-determined Ax = b: OLS regression — more data points than parameters
  • Under-determined Ax = b: p > n — infinitely many solutions → need regularization (Ridge, Lasso)
  • Normal equation: OLS closed form — \hat{\beta} = (X^\top X)^{-1}X^\top y
  • Ridge: (X^\top X + \lambda I)^{-1}X^\top y — \lambda I ensures invertibility
  • Lasso: no closed form — but conceptually, the L1 constraint selects a sparse solution from the infinite set

import numpy as np

# Exactly determined: Ax = b with square A
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])
x = np.linalg.solve(A, b)
print(f"Solution: {x}")
print(f"Verify: Ax = {A @ x}")

# Over-determined: least squares (n > p)
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Normal equation
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"True beta:      {beta_true}")
print(f"Estimated beta: {np.round(beta_hat, 3)}")

# Equivalently using lstsq (numerically more stable)
beta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(f"lstsq beta:     {np.round(beta_lstsq, 3)}")

# Under-determined: p > n → infinite solutions → Ridge
n_small, p_large = 10, 50
X_wide = np.random.randn(n_small, p_large)
y_small = np.random.randn(n_small)

# OLS fails: X^T X is 50×50 but rank ≤ 10
print(f"Rank of X^T X: {np.linalg.matrix_rank(X_wide.T @ X_wide)}")  # 10

# Ridge regression: always works
lam = 1.0
beta_ridge = np.linalg.inv(X_wide.T @ X_wide + lam * np.eye(p_large)) @ X_wide.T @ y_small
print(f"Ridge solution norm: {np.linalg.norm(beta_ridge):.4f}")

Interview Point: Normal Equation vs Gradient Descent

"When do you use the normal equation, and when gradient descent?" → Normal equation: O(p^3) complexity, fast when p is small. Gradient descent: O(np) per iteration, better when p is large. A common rule of thumb is to switch to GD once p > 10000. Also, the normal equation requires X^\top X to be invertible; GD does not.
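A small sketch of the equivalence, on an invented well-conditioned toy problem: batch gradient descent on the least-squares loss converges to the same coefficients as the normal equation.

```python
import numpy as np

np.random.seed(0)
n, p = 200, 3
X = np.random.randn(n, p)
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * np.random.randn(n)

# Closed form: normal equation
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent on (1/n)||Xb - y||^2; gradient = (2/n) X^T (Xb - y)
beta_gd = np.zeros(p)
lr = 0.1
for _ in range(2000):
    grad = (2 / n) * X.T @ (X @ beta_gd - y)
    beta_gd -= lr * grad

print(np.allclose(beta_ne, beta_gd, atol=1e-6))  # True — both reach the same solution
```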


Linear Dynamical Systems (Ch 9)

State-Space Model

A linear dynamical system describes how a state evolves over time:

\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t

where \mathbf{x}_t is the state vector, A is the dynamics matrix, \mathbf{u}_t is the input, and B maps the input to the state.

Intuition: A determines how the system evolves on its own; B\mathbf{u}_t is the effect of external input.
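The simulations later in this section drive the system with A alone; this sketch (all numbers invented) adds the B u_t term. For a stable A, a constant input shifts the steady state away from zero.

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])          # stable: eigenvalues 0.9, 0.8
B = np.array([[1.0],
              [0.5]])
u = np.array([1.0])                 # constant external input

x = np.zeros(2)
for _ in range(500):
    x = A @ x + B @ u               # x_{t+1} = A x_t + B u_t

# Steady state solves x* = A x* + B u  →  x* = (I - A)^{-1} B u
x_star = np.linalg.inv(np.eye(2) - A) @ B @ u
print(np.allclose(x, x_star))       # True — converged to the nonzero steady state
```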

Stability via Eigenvalues

System stability depends on eigenvalues of AA:

  • All |\lambda_i| < 1: state converges to 0 → stable. ML analogy: vanishing gradients in RNN
  • Some |\lambda_i| > 1: state explodes → unstable. ML analogy: exploding gradients in RNN
  • |\lambda_i| = 1: state neither decays nor explodes → marginally stable. ML analogy: ideal gradient flow

ML Connections

  • RNN hidden state: \mathbf{h}_t = f(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t) — a linear dynamical system plus a nonlinearity
  • Vanishing/exploding gradients: eigenvalues of W_h — if all < 1 → vanishing; if any > 1 → exploding
  • Markov chains: \boldsymbol{\pi}_{t+1} = P^\top \boldsymbol{\pi}_t — the stationary distribution is the eigenvector for \lambda = 1
  • Kalman filter: state estimation — predict with the dynamics model, update with the observation
  • LSTM/GRU: gates control the effective eigenvalue spectrum → prevent vanishing/exploding

import numpy as np

# Simulate a linear dynamical system: x_{t+1} = A x_t
# Stable system (eigenvalues < 1)
A_stable = np.array([[0.8, 0.1],
                      [-0.1, 0.7]])
eigenvalues = np.linalg.eigvals(A_stable)
print(f"Eigenvalues: {eigenvalues}")
print(f"Magnitudes:  {np.abs(eigenvalues)}")  # Both < 1 → stable

x = np.array([5.0, 3.0])
trajectory = [x.copy()]
for t in range(50):
    x = A_stable @ x
    trajectory.append(x.copy())
trajectory = np.array(trajectory)
print(f"After 50 steps: {trajectory[-1]}")  # → [0, 0]

# Unstable system (eigenvalue > 1)
A_unstable = np.array([[1.1, 0.0],
                        [0.0, 0.9]])
eigs = np.linalg.eigvals(A_unstable)
print(f"\nUnstable eigenvalues: {eigs}")  # 1.1 > 1 → explodes

x = np.array([1.0, 1.0])
for t in range(20):
    x = A_unstable @ x
print(f"After 20 steps: {x}")  # First component explodes

# RNN analogy: repeated multiplication by W_h
# If max |eigenvalue(W_h)| > 1, gradients explode
W_h = np.random.randn(3, 3) * 0.5  # small weights → stable
eigs_rnn = np.linalg.eigvals(W_h)
print(f"\nRNN W_h eigenvalue magnitudes: {np.abs(eigs_rnn)}")
print(f"Gradient will {'vanish' if np.max(np.abs(eigs_rnn)) < 1 else 'explode'}")

# Markov chain: stationary distribution
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
eigs_mc, vecs_mc = np.linalg.eig(P.T)
# Stationary distribution = eigenvector for eigenvalue 1
idx = np.argmin(np.abs(eigs_mc - 1.0))
pi = np.abs(vecs_mc[:, idx])
pi = pi / pi.sum()
print(f"\nStationary distribution: {pi}")  # [4/7, 3/7] ≈ [0.571, 0.429]
print(f"Verify: pi @ P = {pi @ P}")       # Should equal pi

Application Examples

Population Dynamics: predator-prey relationships between species modeled as a linear system. Each component of the population vector \mathbf{x}_t = the count of one species. The matrix A encodes growth rates and inter-species interactions.

Epidemic Dynamics (SIR Model): state vector \mathbf{x}_t = [S_t, I_t, R_t] (Susceptible, Infected, Recovered). The transition matrix describes infection and recovery rates. The linear approximation is reasonable in the early stage; at scale a nonlinear model is needed.

Motion of a Mass: position + velocity = state vector [x, v]^\top. A encodes the physics (Newton's second law, discretized). This corresponds directly to the state space of continuous control (robotics) in RL.

Supply Chain Dynamics: each node = a warehouse/store, state = inventory level. A describes shipping between nodes, \mathbf{u} = production/demand. Optimal control → minimize inventory cost → linear quadratic regulator (LQR).
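The SIR and mass examples are coded below; a similarly minimal population-dynamics sketch (all rates invented for illustration) looks like this. Complex eigenvalues of A are the signature of predator-prey oscillation.

```python
import numpy as np

# Two species: x = [prey, predator]; linearized interaction model
A_pop = np.array([[1.05, -0.10],   # prey grow 5%, reduced by predators
                  [0.04,  0.95]])  # predators decay 5%, grow by eating prey

x = np.array([100.0, 20.0])
for _ in range(50):
    x = A_pop @ x                  # x_{t+1} = A x_t
print(f"After 50 steps: prey={x[0]:.1f}, predators={x[1]:.1f}")

# Long-run behavior is governed by the eigenvalues of A_pop:
# a complex-conjugate pair means the populations oscillate
eigs = np.linalg.eigvals(A_pop)
print(f"Eigenvalue magnitudes: {np.abs(eigs).round(3)}")
```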

import numpy as np

# SIR Epidemic Model (linearized)
# State: [S, I, R] (Susceptible, Infected, Recovered)
beta = 0.3   # infection rate
gamma = 0.1  # recovery rate

# Transition matrix (simplified, linearized around disease-free state)
A_sir = np.array([[1-beta, 0,     0],
                  [beta,   1-gamma, 0],
                  [0,      gamma,  1]])

x = np.array([0.99, 0.01, 0.0])  # 1% initially infected
for t in range(30):
    x = A_sir @ x
    x = np.clip(x, 0, 1)  # ensure non-negative
    x = x / x.sum()        # normalize (conservation)
print(f"After 30 days: S={x[0]:.3f}, I={x[1]:.3f}, R={x[2]:.3f}")

# Motion of a mass: state = [position, velocity]
dt = 0.1
A_motion = np.array([[1, dt],    # x_{t+1} = x_t + v_t * dt
                     [0, 0.99]]) # v_{t+1} = 0.99 * v_t (friction)
x_motion = np.array([0.0, 1.0])  # start at origin, velocity = 1
for t in range(100):
    x_motion = A_motion @ x_motion
print(f"Final position: {x_motion[0]:.2f}, velocity: {x_motion[1]:.4f}")
# Position accumulates, velocity decays (stable because eigenvalues < 1)

Matrix Multiplication (Ch 10)

Matrix-Matrix Multiplication

For A \in \mathbb{R}^{m \times n} and B \in \mathbb{R}^{n \times p}:

C = AB \in \mathbb{R}^{m \times p}, \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

Inner dimensions must match: (m \times n) \times (n \times p) = (m \times p)

Intuition: C_{ij} = dot product of row i of A and column j of B.

Key Properties

  • Not commutative: AB \neq BA in general
  • Associative: (AB)C = A(BC)
  • Composition: AB corresponds to applying transformation B first, then A
  • Transpose: (AB)^\top = B^\top A^\top

Matrix Power

A^k = \underbrace{A \cdot A \cdots A}_{k \text{ times}} — applying transformation A repeatedly, k times.

PageRank: importance vector \mathbf{r} = \lim_{k \to \infty} M^k \mathbf{r}_0 where M is the link matrix. Power iteration is just repeated matrix-vector multiplication until convergence.

QR Factorization

Any matrix A with linearly independent columns can be decomposed as:

A = QR

where Q has orthonormal columns (Q^\top Q = I) and R is upper triangular.

Why it matters: numerically stable least squares — instead of forming X^\top X (which squares the condition number), solve directly via QR.

ML Connections

  • Chain of matmuls: NN forward pass — \mathbf{h}_L = W_L \sigma(\cdots W_2 \sigma(W_1 \mathbf{x}))
  • Matrix power: PageRank (power iteration), Markov chain convergence
  • QK^\top V: attention — matmul of queries with keys, then with values
  • QR decomposition: numerically stable OLS; Gram-Schmidt in disguise
  • Computational cost: (m \times n) \times (n \times p) costs O(mnp) — Transformer attention: O(n^2 d)

import numpy as np

# Matrix multiplication: shape rules
A = np.random.randn(3, 4)  # 3×4
B = np.random.randn(4, 2)  # 4×2
C = A @ B                   # 3×2
print(f"({A.shape}) @ ({B.shape}) = {C.shape}")

# Composition: AB means "apply B then A"
# Rotate then scale = Scale_matrix @ Rotate_matrix
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[2, 0],
              [0, 0.5]])
combined = S @ R  # first rotate, then scale
v = np.array([1, 0])
print(f"Rotate then scale: {combined @ v}")
print(f"Same as: {S @ (R @ v)}")  # associative

# Matrix power: PageRank-style
M = np.array([[0, 0.5, 0],
              [1, 0, 0.5],
              [0, 0.5, 0.5]])
r = np.array([1/3, 1/3, 1/3])  # initial uniform
for i in range(100):
    r = M @ r
    r = r / r.sum()  # normalize
print(f"\nPageRank converged: {np.round(r, 4)}")

# QR decomposition
X = np.random.randn(50, 3)
Q, R_mat = np.linalg.qr(X)
print(f"\nQ shape: {Q.shape}, R shape: {R_mat.shape}")
print(f"Q is orthogonal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"A = QR? {np.allclose(X, Q @ R_mat)}")

# Solve least squares via QR (more stable than normal equation)
y = np.random.randn(50)
# From Xb = y → QRb = y → Rb = Q^T y
beta_qr = np.linalg.solve(R_mat, Q.T @ y)
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y  # normal equation
print(f"QR solution ≈ Normal eq: {np.allclose(beta_qr, beta_ne)}")

Matrix Inverses (Ch 11)

Left and Right Inverses

For non-square matrices:

  • Left inverse of A (m > n, tall): A_L^{-1} = (A^\top A)^{-1} A^\top such that A_L^{-1} A = I_n
  • Right inverse of A (m < n, wide): A_R^{-1} = A^\top (A A^\top)^{-1} such that A A_R^{-1} = I_m

A left inverse exists → the over-determined system has a unique least-squares solution. A right inverse exists → the under-determined system has a (non-unique) solution.
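A quick numeric check of both formulas, on random matrices (which are full rank almost surely, so the inverses in the formulas exist):

```python
import numpy as np

np.random.seed(1)

# Tall matrix (m > n): left inverse
A_tall = np.random.randn(6, 3)
A_left = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T
print(np.allclose(A_left @ A_tall, np.eye(3)))   # True: A_L^{-1} A = I_n

# Wide matrix (m < n): right inverse
A_wide = np.random.randn(3, 6)
A_right = A_wide.T @ np.linalg.inv(A_wide @ A_wide.T)
print(np.allclose(A_wide @ A_right, np.eye(3)))  # True: A A_R^{-1} = I_m
```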

Square Matrix Inverse

A^{-1} exists if and only if:

  1. A is square (n \times n)
  2. Full rank: \text{rank}(A) = n
  3. \det(A) \neq 0

For a square matrix, conditions 2 and 3 are equivalent.

A^{-1}A = AA^{-1} = I

Pseudo-Inverse (Moore-Penrose)

For any matrix A (even non-square or singular), the pseudo-inverse A^+ always exists:

  • If A has full column rank: A^+ = (A^\top A)^{-1} A^\top (left inverse)
  • If A has full row rank: A^+ = A^\top (A A^\top)^{-1} (right inverse)
  • General case: computed via SVD — A = U\Sigma V^\top \Rightarrow A^+ = V\Sigma^+ U^\top

np.linalg.lstsq and np.linalg.pinv are built on the pseudo-inverse.
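The SVD construction can be verified directly: invert the nonzero singular values, transpose the shape, and compare against np.linalg.pinv. A minimal sketch:

```python
import numpy as np

np.random.seed(2)
A = np.random.randn(5, 3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: reciprocal of the nonzero singular values (with a tolerance cutoff)
s_pinv = np.where(s > 1e-10, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_pinv) @ U.T   # A^+ = V Sigma^+ U^T

print(np.allclose(A_pinv, np.linalg.pinv(A)))  # True
```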

ML Connections

  • (X^\top X)^{-1}: OLS normal equation — requires X^\top X invertible
  • (X^\top X + \lambda I)^{-1}: Ridge — \lambda I guarantees invertibility
  • Pseudo-inverse X^+: np.linalg.lstsq — works even when X^\top X is singular
  • Mahalanobis distance: d = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})} — needs \Sigma^{-1}
  • Second-order optimization: computing the Hessian inverse (Newton's method, natural gradient)

import numpy as np

# Matrix inverse
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(f"A @ A_inv = I? {np.allclose(A @ A_inv, np.eye(2))}")

# Solving Ax = b via inverse
b = np.array([1, 2])
x = A_inv @ b
print(f"Solution: {x}")
print(f"Verify: {A @ x}")  # Should be [1, 2]

# Pseudo-inverse for non-square matrix
X = np.random.randn(20, 3)   # tall matrix
X_pinv = np.linalg.pinv(X)
print(f"\nX shape: {X.shape}, X+ shape: {X_pinv.shape}")
print(f"X+ @ X ≈ I? {np.allclose(X_pinv @ X, np.eye(3))}")  # Left inverse

# lstsq uses pseudo-inverse internally
y = np.random.randn(20)
beta_pinv = X_pinv @ y
beta_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"pinv ≈ lstsq: {np.allclose(beta_pinv, beta_lstsq)}")

# Ridge vs OLS: Ridge always invertible
np.random.seed(0)
n, p = 10, 50  # p > n → OLS fails
X_wide = np.random.randn(n, p)
y_wide = np.random.randn(n)

# OLS: X^T X is singular
rank_XtX = np.linalg.matrix_rank(X_wide.T @ X_wide)
print(f"\nX^T X rank: {rank_XtX} (need {p} for invertibility)")

# Ridge: always works
lam = 1.0
beta_ridge = np.linalg.inv(X_wide.T @ X_wide + lam * np.eye(p)) @ X_wide.T @ y_wide
print(f"Ridge rank (X^TX + λI): {np.linalg.matrix_rank(X_wide.T @ X_wide + lam * np.eye(p))}")  # 50

# Pseudo-inverse also works
beta_pinv_wide = np.linalg.pinv(X_wide) @ y_wide
print(f"Pseudo-inverse gives minimum-norm solution: ||beta||={np.linalg.norm(beta_pinv_wide):.4f}")

Rank, Determinant & Positive Definiteness

Matrix Rank

Rank = number of linearly independent columns (= number of linearly independent rows).

\text{rank}(A) \leq \min(m, n)

  • Full rank: \text{rank}(A) = \min(m, n) — no redundant columns/rows
  • Rank deficient: \text{rank}(A) < \min(m, n) — the matrix is singular (if square)

Intuition: the rank tells you how many dimensions the matrix actually "uses". Rank deficient = some dimensions are redundant.

Determinant

For a square matrix A \in \mathbb{R}^{n \times n}:

\det(A) = 0 \iff A \text{ is singular (not invertible)}

Geometric interpretation: |\det(A)| = the factor by which A scales volumes.

  • \det = 0: squashes space to a lower dimension (information lost)
  • \det > 0: preserves orientation
  • \det < 0: flips orientation

Positive Definiteness

A symmetric matrix A is positive definite (PD) if:

\mathbf{x}^\top A \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}

Equivalent conditions (all say the same thing):

  • All eigenvalues > 0
  • All leading principal minors > 0
  • Cholesky decomposition A = LL^\top exists
  • A = B^\top B for some full-rank B

Positive semi-definite (PSD): \mathbf{x}^\top A \mathbf{x} \geq 0 (eigenvalues \geq 0)

ML Connections

  • Rank < p: multicollinearity → X^\top X singular → OLS fails
  • Low-rank approximation: PCA (k components), matrix factorization (recommenders), LoRA (LLM fine-tuning)
  • Covariance PSD: \Sigma = \frac{1}{n-1}X^\top X is always PSD → eigenvalues \geq 0 → PCA works
  • Kernel matrix PSD: Mercer's theorem — a valid kernel \Leftrightarrow the kernel matrix is PSD
  • Hessian PD: H \succ 0 everywhere \Rightarrow the loss is strictly convex → unique global minimum
  • Determinant: Gaussian log-likelihood involves \log\det(\Sigma); singular \Sigma → \det = 0 → problem
  • Cholesky: fast sampling from a multivariate Gaussian — \mathbf{x} = \boldsymbol{\mu} + L\mathbf{z}, \mathbf{z} \sim \mathcal{N}(0,I)

import numpy as np

# Matrix rank
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])  # row 3 = row 1 + row 2
print(f"Rank: {np.linalg.matrix_rank(A)}")  # 2 (rank deficient)

# Full rank matrix
B = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 10]])  # independent rows
print(f"Rank: {np.linalg.matrix_rank(B)}")  # 3

# Determinant
print(f"\ndet(A) = {np.linalg.det(A):.6f}")  # ≈ 0 (singular)
print(f"det(B) = {np.linalg.det(B):.6f}")    # ≠ 0 (invertible)

# Positive definiteness check via eigenvalues
M = np.array([[4, 2],
              [2, 3]])
eigvals = np.linalg.eigvalsh(M)  # eigvalsh for symmetric matrices
print(f"\nEigenvalues: {eigvals}")
print(f"Positive definite: {np.all(eigvals > 0)}")  # True

# Covariance matrix is always PSD
X = np.random.randn(100, 5)
X_c = X - X.mean(axis=0)
cov = (X_c.T @ X_c) / (len(X) - 1)
cov_eigs = np.linalg.eigvalsh(cov)
print(f"\nCovariance eigenvalues: {np.round(cov_eigs, 4)}")
print(f"All >= 0 (PSD): {np.all(cov_eigs >= -1e-10)}")  # True

# Cholesky decomposition (only works for PD matrices)
L = np.linalg.cholesky(cov)
print(f"\nCholesky L shape: {L.shape}")
print(f"LL^T = cov? {np.allclose(L @ L.T, cov)}")

# Sampling from multivariate Gaussian using Cholesky
mu = np.zeros(5)
z = np.random.randn(5)
sample = mu + L @ z  # Much faster than np.random.multivariate_normal for repeated sampling

# Check: not PD matrix → Cholesky fails
M_not_pd = np.array([[1, 2], [2, 1]])  # eigenvalues: 3, -1
eigs_bad = np.linalg.eigvalsh(M_not_pd)
print(f"\nNot PD eigenvalues: {eigs_bad}")  # [-1, 3]
try:
    np.linalg.cholesky(M_not_pd)
except np.linalg.LinAlgError as e:
    print(f"Cholesky failed: {e}")

Interview Point: Why Ridge Always Works

"Why can ridge regression always be solved?" → OLS needs (X^\top X)^{-1}, but X^\top X may be singular. Ridge adds \lambda I: (X^\top X + \lambda I). If \sigma_i is an eigenvalue of X^\top X, adding \lambda I shifts it to \sigma_i + \lambda. Since \lambda > 0 and \sigma_i \geq 0, all eigenvalues are > 0 → positive definite → always invertible.
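That eigenvalue shift is easy to verify numerically: the spectrum of X^\top X + \lambda I is exactly the spectrum of X^\top X moved up by \lambda, even when X^\top X is singular.

```python
import numpy as np

np.random.seed(3)
n, p = 10, 20            # p > n → X^T X is singular (rank <= 10)
X = np.random.randn(n, p)
lam = 0.5

eigs = np.linalg.eigvalsh(X.T @ X)              # some are (numerically) zero
eigs_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(p))

print(np.allclose(eigs_ridge, eigs + lam))      # True: spectrum shifted up by lambda
print(np.min(eigs_ridge) >= lam - 1e-8)         # True: every eigenvalue >= lambda
```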


Practice

Flashcards

What is linear independence? Why does it matter in ML?

Vectors are linearly independent when none of them can be written as a linear combination of the others; the only way to make α₁v₁ + ... + αₖvₖ = 0 is with every αᵢ = 0. In ML: if features are linearly dependent → multicollinearity → X^TX singular → OLS has no unique solution. Use Ridge or drop features.


Quiz


Three vectors v₁, v₂, v₃ in R³, with rank([v₁ v₂ v₃]) = 2. Which of the following is correct?
