Decompositions & Matrix Calculus

Course Reference

Extends Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford) with decompositions and matrix calculus essential for ML/DL.

Why This Matters

Eigendecomposition is the mathematical foundation of PCA, SVD is at the core of recommender systems and LoRA, and matrix calculus is the language of backpropagation. Understanding these lets you reason about ML algorithms from first principles instead of just calling APIs.

Eigendecomposition

Definition

For a square matrix \mathbf{A}, an eigenvector \mathbf{v} and eigenvalue \lambda satisfy:

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}

Intuition: \mathbf{A} acting on \mathbf{v} only scales it (the direction is unchanged); \lambda is the scaling factor.

Eigendecomposition

For a symmetric matrix \mathbf{A}:

\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top
  • \mathbf{V}: columns = eigenvectors (orthonormal when \mathbf{A} is symmetric)
  • \boldsymbol{\Lambda}: diagonal matrix of eigenvalues

ML Connections

Application | How Eigendecomposition Is Used
PCA | Eigenvectors of the covariance matrix = principal components; eigenvalues = explained variance
PageRank | Dominant eigenvector of the transition matrix = page importance scores
Spectral clustering | Eigenvectors of the graph Laplacian = cluster indicators
Markov chains (RL) | Eigenvector with \lambda = 1 = stationary distribution
Convexity check | Eigenvalues of the Hessian: all > 0 → convex; mixed signs → saddle point
import numpy as np

# Eigendecomposition
A = np.array([[4, 2], [2, 3]])
eigenvalues, eigenvectors = np.linalg.eigh(A)  # eigh for symmetric (real eigenvalues, sorted)
# eigenvalues: [1.44, 5.56]
# eigenvectors: columns are the eigenvectors

# Verify: A @ v = λ * v
v = eigenvectors[:, 1]  # eigenvector for largest eigenvalue
lam = eigenvalues[1]
np.allclose(A @ v, lam * v)  # True ✓

# Reconstruct: A = V Λ V^T
A_reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
np.allclose(A, A_reconstructed)  # True ✓

# PCA: eigendecomposition of covariance matrix
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (len(X) - 1)  # = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)
# eig_vals[-1] = largest = PC1's variance
# eig_vecs[:, -1] = PC1 direction
explained_ratio = eig_vals / eig_vals.sum()  # explained variance per PC

PCA via Eigendecomposition

For centered data \mathbf{X}:

\mathbf{C} = \frac{1}{n-1}\mathbf{X}^\top\mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top
  • PC1 = eigenvector with the largest eigenvalue = direction of maximum variance
  • PC2 = eigenvector with the second largest eigenvalue, orthogonal to PC1
  • Explained variance ratio = \lambda_i / \sum_j \lambda_j

Interview Link: PCA

"What is the math behind PCA?" → Eigendecomposition of the covariance matrix of the centered data. Eigenvectors = principal directions; eigenvalues = variance in those directions. Dimensionality reduction = keep only the top-k eigenvectors.

Singular Value Decomposition (SVD)

Definition

Any matrix \mathbf{A} \in \mathbb{R}^{m \times n} (not just square) can be decomposed as:

\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top

Component | Size | Properties
\mathbf{U} | m \times m | Left singular vectors (orthonormal)
\boldsymbol{\Sigma} | m \times n | Diagonal: singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq 0
\mathbf{V} | n \times n | Right singular vectors (orthonormal) = PCA eigenvectors

Relationship to Eigendecomposition

\mathbf{A}^\top\mathbf{A} = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^\top — the eigenvalues of \mathbf{A}^\top\mathbf{A} are \sigma_i^2. For centered data, the covariance eigenvalues are \lambda_i = \sigma_i^2 / (n-1).

PCA via SVD: \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top → principal components = columns of \mathbf{V}. SVD is more numerically stable than eigendecomposition — sklearn's PCA uses SVD under the hood.
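A quick numerical check of this equivalence — covariance eigendecomposition and SVD of the centered data give the same principal components (a sketch; the data shape and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)               # ascending order

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)  # sigma descending

# Covariance eigenvalues = sigma^2 / (n - 1)
assert np.allclose(eig_vals[::-1], sigma**2 / (len(Xc) - 1))

# Eigenvectors match rows of Vt up to sign
for i in range(4):
    v_eig = eig_vecs[:, -1 - i]   # i-th largest eigenvalue's eigenvector
    v_svd = Vt[i]
    assert np.isclose(abs(v_eig @ v_svd), 1.0)
```

The sign ambiguity (each eigenvector is defined only up to ±1) is why the check compares |v_eig · v_svd| rather than the vectors directly.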

Truncated SVD: Low-Rank Approximation

Keep only the top k singular values:

\mathbf{A} \approx \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top

Eckart–Young theorem: this is the best rank-k approximation of \mathbf{A} in Frobenius norm.
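Eckart–Young can be sanity-checked numerically: the truncation error equals the energy in the discarded singular values, and any other rank-k matrix (here a random one) does at least as badly (a minimal sketch; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err_svd = np.linalg.norm(A - A_svd, 'fro')

# Optimal error = sqrt(sum of the discarded sigma^2)
assert np.isclose(err_svd, np.sqrt((s[k:] ** 2).sum()))

# A random rank-k matrix is a (much) worse approximation
B = rng.normal(size=(50, k)) @ rng.normal(size=(k, 30))
err_rand = np.linalg.norm(A - B, 'fro')
assert err_rand >= err_svd
```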

ML Connections

Application | How SVD Is Used
PCA | \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top → columns of \mathbf{V} = PCs
Recommender systems | \mathbf{R} \approx \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top — user/item latent factors
LSA (NLP) | TruncatedSVD on TF-IDF → topic extraction
Image compression | Keep top k singular values → reconstruct with fewer parameters
LoRA (LLM fine-tuning) | \Delta\mathbf{W} = \mathbf{B}\mathbf{A} — low-rank update to frozen weights
Pseudo-inverse | \mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^\top
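The pseudo-inverse row can be verified directly against np.linalg.pinv — Σ⁺ just inverts the nonzero singular values (a sketch; the 6×3 matrix is an arbitrary full-column-rank example):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))          # tall matrix, full column rank (a.s.)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A^+ = V Σ^+ U^T, where Σ^+ inverts the nonzero singular values
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# The pseudo-inverse gives the least-squares solution to Ax ≈ b
b = rng.normal(size=6)
x = A_pinv @ b
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```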
import numpy as np

# Full SVD
X = np.random.randn(100, 10)
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
# U: (100, 10), sigma: (10,), Vt: (10, 10)

# Reconstruct
X_reconstructed = U @ np.diag(sigma) @ Vt
np.allclose(X, X_reconstructed)  # True ✓

# Truncated SVD: rank-k approximation
k = 3
X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(X - X_approx, 'fro')
total = np.linalg.norm(X, 'fro')
print(f"Rank-{k} keeps {1 - error**2/total**2:.1%} of squared Frobenius norm")

# Relationship: eigenvalues of X^T X = sigma^2
XtX_eigs = np.linalg.eigvalsh(X.T @ X)
sigma_squared = sigma**2
# XtX_eigs (sorted ascending) ≈ sigma_squared (sorted descending, reversed)

# TruncatedSVD for sparse data (NLP)
from sklearn.decomposition import TruncatedSVD
# svd = TruncatedSVD(n_components=50).fit_transform(X_tfidf_sparse)

# LoRA: low-rank weight update
d, r = 768, 16  # model dim, LoRA rank
B = np.random.randn(d, r) * 0.01  # d × r
A = np.random.randn(r, d) * 0.01  # r × d
delta_W = B @ A  # rank-r update: (d × r) @ (r × d) = (d × d)
# Full W has d² = 589K params, LoRA has 2dr = 24K params (24x fewer)

Interview Link: LoRA

LoRA (Low-Rank Adaptation) fine-tunes LLMs with low-rank matrices: \mathbf{W}' = \mathbf{W} + \mathbf{BA}, where \mathbf{B} \in \mathbb{R}^{d \times r}, \mathbf{A} \in \mathbb{R}^{r \times d}, r \ll d. Only \mathbf{B} and \mathbf{A} are trained (far fewer parameters than the full \mathbf{W}). Core insight: weight updates are often approximately low-rank, so a full-rank update matrix is unnecessary.

Matrix Calculus

Why Matrix Calculus for ML

Backpropagation = the chain rule applied to matrix operations. Understanding matrix derivatives lets you derive gradients by hand (a common interview task) and debug gradient issues.

Scalar-by-Vector Gradient

For a scalar function f: \mathbb{R}^n \to \mathbb{R}, the gradient is:

\nabla_\mathbf{x} f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The gradient points in the direction in which f increases fastest. Gradient descent: \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f → move in the opposite direction.
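The update rule can be run end-to-end on the OLS loss — a minimal sketch (the learning rate, iteration count, and synthetic data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Gradient descent on f(beta) = ||X beta - y||^2; grad = 2 X^T (X beta - y)
beta = np.zeros(3)
eta = 0.001
for _ in range(2000):
    grad = 2 * X.T @ (X @ beta - y)
    beta -= eta * grad           # step opposite to the gradient

# Converges to the least-squares solution
assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-4)
```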

Common Gradients

Function | Gradient | ML Usage
f = \mathbf{a}^\top\mathbf{x} | \nabla_\mathbf{x} f = \mathbf{a} | Linear model gradient
f = \mathbf{x}^\top\mathbf{A}\mathbf{x} | \nabla_\mathbf{x} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x} (= 2\mathbf{A}\mathbf{x} if symmetric) | Quadratic form (regularization)
f = \|\mathbf{Ax} - \mathbf{b}\|^2 | \nabla_\mathbf{x} f = 2\mathbf{A}^\top(\mathbf{Ax} - \mathbf{b}) | OLS gradient → set to 0 → normal equation
f = \|\mathbf{x}\|^2 = \mathbf{x}^\top\mathbf{x} | \nabla_\mathbf{x} f = 2\mathbf{x} | L2 regularization gradient
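These identities are easy to verify with central finite differences — e.g. the quadratic-form gradient (\mathbf{A} + \mathbf{A}^\top)\mathbf{x} for a non-symmetric \mathbf{A} (a sketch; the matrix and point are arbitrary):

```python
import numpy as np

# Numerical check of ∇(x^T A x) = (A + A^T) x for a non-symmetric A
A = np.array([[4.0, 2.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])

analytic = (A + A.T) @ x

eps = 1e-6
numeric = np.zeros(2)
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    f_plus = (x + e) @ A @ (x + e)
    f_minus = (x - e) @ A @ (x - e)
    numeric[i] = (f_plus - f_minus) / (2 * eps)  # central difference

assert np.allclose(analytic, numeric, atol=1e-6)
```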

OLS Derivation via Matrix Calculus

\mathcal{L}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

\nabla_{\boldsymbol{\beta}} \mathcal{L} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0

\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y} \implies \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

Being able to derive the OLS closed form by hand is a strong signal in interviews.

import numpy as np

# Gradient of f = a^T x → ∇f = a
a = np.array([2, -1, 3])
x = np.array([1, 2, 3])
# f = a @ x = 2 - 2 + 9 = 9
# gradient w.r.t. x = a = [2, -1, 3]

# Gradient of f = ||x||^2 = x^T x → ∇f = 2x
x = np.array([1, 2, 3])
grad = 2 * x  # [2, 4, 6]

# OLS: verify gradient = 0 at optimal beta
X = np.random.randn(100, 3)
y_true = X @ np.array([2, -1, 0.5]) + np.random.randn(100) * 0.1
beta = np.linalg.inv(X.T @ X) @ X.T @ y_true  # OLS closed form

# Gradient at optimal beta should be ≈ 0
gradient = -2 * X.T @ (y_true - X @ beta)
print(f"Gradient at optimal beta: {gradient}")  # ≈ [0, 0, 0]

# Ridge: gradient includes regularization term
lam = 1.0
# ∇L = -2X^T(y - Xβ) + 2λβ = 0
# → (X^TX + λI)β = X^Ty
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ y_true

Taylor Approximation

Scalar Taylor Expansion

Any smooth function can be approximated locally by a polynomial:

f(x + \Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)(\Delta x)^2 + \cdots

Order | Approximation | What It Captures
0th | f(x) | Value only
1st (linear) | f(x) + f'(x)\Delta x | Slope — used in gradient descent
2nd (quadratic) | f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)(\Delta x)^2 | Curvature — used in Newton's method, XGBoost

Multivariate Taylor Expansion

For f: \mathbb{R}^n \to \mathbb{R}:

f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \Delta\mathbf{x} + \frac{1}{2}\Delta\mathbf{x}^\top \mathbf{H}(\mathbf{x}) \Delta\mathbf{x}
  • 1st-order term \nabla f^\top \Delta\mathbf{x} — the theoretical basis of gradient descent (local linear approximation)
  • 2nd-order term \frac{1}{2}\Delta\mathbf{x}^\top \mathbf{H} \Delta\mathbf{x} — the Hessian \mathbf{H} supplies curvature information

ML Connections

Application | Which Order | How
Gradient descent | 1st order | \theta \leftarrow \theta - \eta\nabla f — move along the negative gradient (linear approximation)
Newton's method | 2nd order | \theta \leftarrow \theta - \mathbf{H}^{-1}\nabla f — use curvature for a better step
XGBoost | 2nd order | Loss ≈ \sum_i [g_i f(\mathbf{x}_i) + \frac{1}{2}h_i f(\mathbf{x}_i)^2] — gradient g + Hessian h for optimal leaf values
Adam optimizer | ≈ diagonal 2nd order | Second moment v_t ≈ diagonal curvature → adaptive lr per parameter
Natural gradient | 2nd order (Fisher) | \theta \leftarrow \theta - \mathbf{F}^{-1}\nabla f — curvature in parameter space
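The difference between the 1st- and 2nd-order updates is easiest to see on a quadratic, where Newton's method lands on the minimizer in a single step while gradient descent needs many small ones (a sketch with an arbitrary SPD Hessian):

```python
import numpy as np

# f(x) = 0.5 x^T H x - b^T x, so grad = H x - b and the minimizer solves H x = b
H = np.array([[10.0, 2.0], [2.0, 1.0]])   # SPD, somewhat ill-conditioned
b = np.array([1.0, -1.0])
x = np.array([5.0, 5.0])                   # arbitrary start

# Newton: x <- x - H^{-1} grad — exact in one step for a quadratic
grad = H @ x - b
x_newton = x - np.linalg.solve(H, grad)
assert np.allclose(x_newton, np.linalg.solve(H, b))

# Gradient descent from the same start needs many steps
x_gd = np.array([5.0, 5.0])
eta = 0.05                                  # must satisfy eta < 2 / lambda_max
for _ in range(500):
    x_gd -= eta * (H @ x_gd - b)
assert np.allclose(x_gd, np.linalg.solve(H, b), atol=1e-3)
```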

Interview Link: XGBoost Second-Order

XGBoost takes a second-order Taylor expansion of the loss: \mathcal{L} \approx \sum_i [g_i f(\mathbf{x}_i) + \frac{1}{2}h_i f(\mathbf{x}_i)^2], where g_i = \partial L / \partial \hat{y} (gradient) and h_i = \partial^2 L / \partial \hat{y}^2 (Hessian). Using both g and h to choose splits and leaf values converges faster than vanilla gradient boosting, which uses only the gradient.

import numpy as np

# Taylor approximation of f(x) = exp(x) around x=0
x0 = 0
dx = 0.5

f_exact = np.exp(x0 + dx)           # exact: 1.6487
f_0th = np.exp(x0)                   # 0th order: 1.0
f_1st = np.exp(x0) * (1 + dx)       # 1st order: 1.5
f_2nd = np.exp(x0) * (1 + dx + dx**2/2)  # 2nd order: 1.625
# Higher order → closer to exact

# Gradient descent = 1st order Taylor
# Newton's method = 2nd order Taylor
# f(x + Δx) ≈ f(x) + f'(x)Δx + ½f''(x)Δx²
# Minimize: set derivative w.r.t. Δx to 0:
# f'(x) + f''(x)Δx = 0 → Δx = -f'(x)/f''(x) → Newton step

# XGBoost: second-order approximation for each sample
# g_i = gradient of loss at current prediction
# h_i = Hessian (second derivative) of loss
# For squared error: g_i = 2(y_hat - y), h_i = 2
# For log loss: g_i = y_hat - y, h_i = y_hat * (1 - y_hat)
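Following the g/h comments above, the regularized optimal leaf value from the XGBoost objective, w* = -Σg / (Σh + λ), can be sketched for squared error (the numbers are illustrative):

```python
import numpy as np

# Squared error loss L = (y_hat - y)^2 → g = 2(y_hat - y), h = 2
y = np.array([3.0, 3.5, 2.5, 10.0])
y_hat = np.zeros(4)                  # current prediction for this leaf
g = 2 * (y_hat - y)
h = 2 * np.ones(4)

lam = 1.0
w_star = -g.sum() / (h.sum() + lam)  # regularized optimal leaf weight

# With lam = 0 this reduces to the mean residual (the plain GBM leaf value)
w_unreg = -g.sum() / h.sum()
assert np.isclose(w_unreg, (y - y_hat).mean())
```

Regularization λ shrinks the leaf value toward 0 — one reason XGBoost overfits less than unregularized boosting.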

Jacobian

Definition

For a vector function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, the Jacobian is the matrix of all partial derivatives:

\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}

ML Connections

Where Jacobians Appear | Why
Backprop in RNNs | \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}(\sigma'(\mathbf{z}_t)) \cdot \mathbf{W}_{hh} — product of Jacobians → vanishing/exploding gradients
Change of variables | Probability transforms with the Jacobian determinant — normalizing flows, VAE
Neural network layers | Each layer's Jacobian describes how the output changes with the input

Interview Link: Vanishing Gradient in RNNs

RNN gradients involve a product of Jacobians: \prod_{i=k}^{t} \mathbf{J}_i. If every \mathbf{J}_i has spectral norm (largest singular value) < 1, the product → 0 (vanishing); if > 1, the product → ∞ (exploding). The LSTM's additive cell-state update avoids this multiplicative chain.

import numpy as np

# Jacobian: matrix of partial derivatives
# f: R^2 → R^2, f(x) = [x1^2 + x2, x1*x2]
# J = [[2*x1, 1], [x2, x1]]
x = np.array([3.0, 2.0])
J = np.array([[2*x[0], 1],
              [x[1],   x[0]]])
# J = [[6, 1], [2, 3]]

# Spectral norm = largest singular value (controls gradient magnitude)
spectral_norm = np.linalg.svd(J, compute_uv=False)[0]  # largest singular value

# RNN vanishing gradient: product of Jacobians
# If spectral_norm < 1 for each J_i, product → 0
W_hh = np.random.randn(4, 4) * 0.5  # small weights
tanh_deriv = np.diag(np.random.uniform(0, 1, 4))  # tanh' ∈ (0, 1)
J_rnn = tanh_deriv @ W_hh
print(f"Spectral norm of J_rnn: {np.linalg.svd(J_rnn, compute_uv=False)[0]:.3f}")

# After 20 time steps:
J_product = np.eye(4)
for _ in range(20):
    J_product = J_rnn @ J_product
print(f"Norm after 20 steps: {np.linalg.norm(J_product):.6f}")  # ≈ 0 when spectral norm < 1 (vanishing)

# PyTorch autograd computes Jacobians automatically via .backward()

Hessian

Definition

For a scalar function f: \mathbb{R}^n \to \mathbb{R}, the Hessian is the matrix of second derivatives:

\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

The Hessian describes the curvature of the loss surface.

Hessian and Optimization

Hessian Property | What It Means
PD (all \lambda > 0) | Local minimum → locally convex
ND (all \lambda < 0) | Local maximum
Mixed signs | Saddle point — minimum in some directions, maximum in others
Condition number \kappa = \lambda_{\max}/\lambda_{\min} | Large → ill-conditioned → gradient descent is slow (narrow valley)

ML Connections

Where Hessian Appears | How
Convexity of loss | MSE: \mathbf{H} = 2\mathbf{X}^\top\mathbf{X} (PSD → convex). Neural nets: non-convex → saddle points
Newton's method | \theta \leftarrow \theta - \mathbf{H}^{-1}\nabla f — uses curvature for better updates
XGBoost | Uses a second-order approximation (gradient + Hessian) for better splits
Natural gradient | \theta \leftarrow \theta - \mathbf{F}^{-1}\nabla f — Fisher information ≈ expected Hessian
Adam optimizer | Second moment v_t ≈ diagonal curvature → adaptive learning rate per parameter

Interview Link: Neural Networks Are Not Convex

The Hessian of linear regression is 2\mathbf{X}^\top\mathbf{X} (PSD → convex → unique global minimum → GD is guaranteed to converge). A neural network's Hessian has both positive and negative eigenvalues → non-convex → many local minima and saddle points. Empirically, though, the local minima SGD finds usually generalize well (the flat-minima hypothesis).
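A concrete saddle: f(x, y) = x² − y² has zero gradient at the origin but a mixed-sign Hessian, so the origin is neither a minimum nor a maximum (a minimal check):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 is constant: diag(2, -2)
H = np.array([[2.0, 0.0], [0.0, -2.0]])
eigs = np.linalg.eigvalsh(H)
assert (eigs > 0).any() and (eigs < 0).any()   # mixed signs → saddle point

# The gradient vanishes at the origin, yet moving along x increases f
# and moving along y decreases it
f = lambda x, y: x**2 - y**2
assert f(0.1, 0.0) > f(0.0, 0.0) > f(0.0, 0.1)
```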

import numpy as np

# Hessian of OLS loss: H = 2 X^T X (constant — doesn't depend on beta)
X = np.random.randn(100, 3)
H_ols = 2 * X.T @ X  # 3×3 matrix

# Check convexity: all eigenvalues ≥ 0?
eigs = np.linalg.eigvalsh(H_ols)
print(f"OLS Hessian eigenvalues: {eigs.round(2)}")  # all > 0 → convex ✓

# Condition number: κ = λ_max / λ_min → large = ill-conditioned
kappa = eigs[-1] / eigs[0]
print(f"Condition number: {kappa:.1f}")
# Large κ → narrow valley → GD oscillates → slow convergence
# Solution: use Adam (adaptive lr per parameter) or preconditioning

# Numerical Hessian for arbitrary function
from scipy.optimize import approx_fprime

def loss(beta):
    return np.sum((X @ beta - np.ones(100))**2)

def numerical_hessian(f, x, eps=1e-5):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        def grad_i(x):
            return approx_fprime(x, f, eps)[i]
        H[i] = approx_fprime(x, grad_i, eps)
    return H

H_num = numerical_hessian(loss, np.zeros(3))
np.allclose(H_num, H_ols, atol=0.1)  # True ✓

Quadratic Forms

Definition

q(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x} = \sum_{i,j} A_{ij} x_i x_j

ML Connections

Quadratic Form | Where
\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} | Mahalanobis distance — distance accounting for covariance
\|\mathbf{Ax} - \mathbf{b}\|^2 = \mathbf{x}^\top\mathbf{A}^\top\mathbf{A}\mathbf{x} - 2\mathbf{b}^\top\mathbf{A}\mathbf{x} + \|\mathbf{b}\|^2 | MSE loss — quadratic in parameters
\boldsymbol{\beta}^\top\boldsymbol{\beta} = \|\boldsymbol{\beta}\|^2 | L2 regularization — simplest quadratic form
\mathbf{x}^\top\mathbf{H}\mathbf{x} | Second-order Taylor expansion of the loss → curvature
import numpy as np

# Quadratic form: x^T A x
A = np.array([[4, 1], [1, 3]])
x = np.array([1, 2])
quadratic = x @ A @ x  # 1*4*1 + 1*1*2 + 2*1*1 + 2*3*2 = 4+2+2+12 = 20

# Mahalanobis distance: x^T Σ^{-1} x (accounts for covariance)
from scipy.spatial.distance import mahalanobis
mean = np.array([0, 0])
cov = np.array([[2, 1], [1, 3]])
x = np.array([1, 2])
d = mahalanobis(x, mean, np.linalg.inv(cov))  # ≠ Euclidean distance

# L2 regularization is simplest quadratic form: β^T β = ||β||^2
beta = np.array([0.5, -0.3, 0.8])
l2_penalty = beta @ beta  # = 0.25 + 0.09 + 0.64 = 0.98

Trace

Properties

\text{tr}(\mathbf{A}) = \sum_i A_{ii} = \sum_i \lambda_i

Property | Formula | Use
Cyclic | \text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{BCA}) = \text{tr}(\mathbf{CAB}) | Simplify matrix derivatives
Frobenius norm | \|\mathbf{A}\|_F^2 = \text{tr}(\mathbf{A}^\top\mathbf{A}) | Matrix distance, weight decay
Sum of eigenvalues | \text{tr}(\mathbf{A}) = \sum \lambda_i | Total variance in PCA = tr(C)
Transpose | \text{tr}(\mathbf{A}^\top) = \text{tr}(\mathbf{A}) | Symmetry in derivatives
import numpy as np

A = np.array([[1, 2], [3, 4]])

# Trace = sum of diagonal elements = sum of eigenvalues
np.trace(A)  # 1 + 4 = 5
np.linalg.eigvals(A).sum()  # also 5 (eigvalsh is only valid for symmetric matrices)

# Frobenius norm via trace
fro_norm = np.sqrt(np.trace(A.T @ A))  # same as np.linalg.norm(A, 'fro')

# Cyclic property: tr(ABC) = tr(BCA) = tr(CAB)
B = np.random.randn(2, 3)
C = np.random.randn(3, 2)
np.trace(A @ B @ C)  # = np.trace(B @ C @ A) = np.trace(C @ A @ B)

# Total variance in PCA = tr(covariance matrix)
X = np.random.randn(100, 5)
cov = np.cov(X.T)
total_variance = np.trace(cov)  # = sum of all eigenvalues = sum of all PC variances

Hadamard (Element-wise) Product

(\mathbf{A} \odot \mathbf{B})_{ij} = A_{ij} \cdot B_{ij}

Unlike matrix multiplication, the Hadamard product is element-wise.

ML Connections

Where \odot Appears | Formula
LSTM gates | \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t — gates control element-wise
Dropout | \tilde{\mathbf{h}} = \mathbf{h} \odot \mathbf{m}, \mathbf{m} \sim \text{Bernoulli} — element-wise mask
Attention masking | Mask \odot scores — set future positions to -\infty
Backprop with ReLU | \delta^{[l]} = (\mathbf{W}^{[l+1]\top}\delta^{[l+1]}) \odot \sigma'(\mathbf{z}^{[l]})
import numpy as np

# Hadamard (element-wise) product: A ⊙ B
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
hadamard = A * B  # [[5, 12], [21, 32]] — element-wise!
# vs matrix multiply:
matmul = A @ B    # [[19, 22], [43, 50]] — very different!

# LSTM gate: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c_tilde
hidden_size = 4
f_t = np.array([0.9, 0.1, 0.8, 0.5])  # forget gate (sigmoid output)
c_prev = np.array([1.0, 2.0, -0.5, 3.0])  # previous cell state
i_t = np.array([0.2, 0.9, 0.1, 0.7])  # input gate
c_tilde = np.array([0.5, -1.0, 2.0, 0.3])  # candidate

c_new = f_t * c_prev + i_t * c_tilde  # element-wise! (Hadamard)
# f_t=0.9 → keep 90% of c_prev[0]; f_t=0.1 → forget 90% of c_prev[1]

# Dropout mask: element-wise multiply
mask = np.random.binomial(1, 0.5, size=4)  # e.g. [1, 0, 1, 0]
h_dropped = c_new * mask / 0.5  # inverted dropout (scale by 1/(1-p))

Interview Signals

What interviewers listen for:

  • You can explain the eigendecomposition / SVD math behind PCA
  • You know how SVD is used in recommender systems and LoRA
  • You can derive the OLS gradient by hand and arrive at the normal equation
  • You understand the link between Jacobians and vanishing gradients (RNNs)
  • You know the connection between the Hessian and convexity/saddle points

Practice

Flashcards


What role does eigendecomposition play in PCA?

Eigendecomposition of the covariance matrix C: C = VΛVᵀ. Eigenvectors V = principal component directions; eigenvalues Λ = variance in each direction. Dimensionality reduction = keep only the top-k eigenvectors (those with the largest eigenvalues). Explained variance ratio = λᵢ/Σλⱼ.


Quiz


PCA's principal components are the eigenvectors of what matrix?
