Decompositions & Matrix Calculus
Course Reference
Extends Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford) with decompositions and matrix calculus essential for ML/DL.
Why This Matters
Eigendecomposition is the mathematical foundation of PCA, SVD is the core of recommender systems and LoRA, and matrix calculus is the language of backpropagation. Understanding these lets you reason about ML algorithms from first principles instead of just calling APIs.
Eigendecomposition
Definition
For a square matrix $A \in \mathbb{R}^{n \times n}$, an eigenvector $v \neq 0$ and eigenvalue $\lambda$ satisfy:
$$Av = \lambda v$$
Intuition: $A$ acting on $v$ only scales it (the direction is unchanged); $\lambda$ is the scaling factor.
Eigendecomposition
For a symmetric matrix $A$:
$$A = V \Lambda V^\top$$
- $V$: columns = eigenvectors (orthonormal when $A$ is symmetric)
- $\Lambda$: diagonal matrix of eigenvalues
ML Connections
| Application | How Eigendecomposition Is Used |
|---|---|
| PCA | Eigenvectors of covariance matrix = principal components; eigenvalues = explained variance |
| PageRank | Dominant eigenvector of transition matrix = page importance scores |
| Spectral clustering | Eigenvectors of graph Laplacian = cluster indicators |
| Markov chains (RL) | Eigenvector of the transition matrix with $\lambda = 1$ = stationary distribution |
| Convexity check | Eigenvalues of Hessian: all $\lambda_i > 0$ → convex, mixed signs → saddle points |
import numpy as np
A = np.array([[4, 2], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)  # general matrix: possibly complex, unsorted
# For symmetric matrices (e.g. covariance), use eigh (guaranteed real, sorted ascending)
X = np.random.randn(100, 5)  # data matrix: rows = samples, columns = features
cov = np.cov(X.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# eigenvalues sorted ascending; last = largest = PC1
import numpy as np
# Eigendecomposition
A = np.array([[4, 2], [2, 3]])
eigenvalues, eigenvectors = np.linalg.eigh(A) # eigh for symmetric (real eigenvalues, sorted)
# eigenvalues: [1.44, 5.56]  (= (7 ∓ √17)/2)
# eigenvectors: columns are the eigenvectors
# Verify: A @ v = λ * v
v = eigenvectors[:, 1] # eigenvector for largest eigenvalue
lam = eigenvalues[1]
np.allclose(A @ v, lam * v) # True ✓
# Reconstruct: A = V Λ V^T
A_reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
np.allclose(A, A_reconstructed) # True ✓
# PCA: eigendecomposition of covariance matrix
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (len(X) - 1) # = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)
# eig_vals[-1] = largest = PC1's variance
# eig_vecs[:, -1] = PC1 direction
explained_ratio = eig_vals / eig_vals.sum() # explained variance per PC
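As a quick numerical check of the PCA claims above (a minimal sketch with synthetic data), projecting the centered data onto the eigenvectors yields uncorrelated coordinates whose variances are exactly the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # correlated data
Xc = X - X.mean(axis=0)

cov = np.cov(Xc.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigenvalues ascending

Z = Xc @ eig_vecs                          # project onto all PCs
proj_var = Z.var(axis=0, ddof=1)           # variance along each PC

# Variance of each projected coordinate equals the corresponding eigenvalue
print(np.allclose(proj_var, eig_vals))                            # True
# Projected coordinates are uncorrelated: covariance of Z is diagonal
print(np.allclose(np.cov(Z.T), np.diag(eig_vals), atol=1e-8))     # True
```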
PCA via Eigendecomposition
- PC1 = eigenvector with largest eigenvalue = direction of maximum variance
- PC2 = eigenvector with second largest eigenvalue, orthogonal to PC1
- Explained variance ratio = $\lambda_i / \sum_j \lambda_j$
Interview link: PCA
"What is the math behind PCA?" → Eigendecomposition of the covariance matrix of centered data. Eigenvectors = principal directions, eigenvalues = variance in those directions. Dimensionality reduction = keep only the top-k eigenvectors.
Singular Value Decomposition (SVD)
Definition
Any matrix $X \in \mathbb{R}^{m \times n}$ (not just square) can be decomposed as:
$$X = U \Sigma V^\top$$
In the compact form, $r = \min(m, n)$:
| Component | Size | Properties |
|---|---|---|
| $U$ | $m \times r$ | Left singular vectors (orthonormal columns) |
| $\Sigma$ | $r \times r$ | Diagonal: singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0$ |
| $V$ | $n \times r$ | Right singular vectors (orthonormal columns) = PCA eigenvectors |
Relationship to Eigendecomposition
$X^\top X = V \Sigma^2 V^\top$ — the right singular vectors of $X$ are the eigenvectors of $X^\top X$, and its eigenvalues are $\sigma_i^2$. PCA via SVD: take the SVD of the centered data matrix → principal components = columns of $V$. SVD is more numerically stable than eigendecomposition — sklearn's PCA uses SVD under the hood.
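This relationship can be verified numerically (a minimal sketch; eigenvectors are only defined up to sign, so the comparison uses the absolute dot product):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)        # ascending

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values relate to eigenvalues: lambda_i = sigma_i^2 / (n - 1)
print(np.allclose(sigma[::-1] ** 2 / (len(Xc) - 1), eig_vals))  # True

# PC1 from both routes agrees up to sign
pc1_eig = eig_vecs[:, -1]
pc1_svd = Vt[0]
print(np.isclose(abs(pc1_eig @ pc1_svd), 1.0))  # True
```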
Truncated SVD: Low-Rank Approximation
Keep only the top $k$ singular values:
$$X_k = U_k \Sigma_k V_k^\top$$
Eckart–Young theorem: this is the best rank-$k$ approximation of $X$ in Frobenius norm.
ML Connections
| Application | How SVD Is Used |
|---|---|
| PCA | SVD of centered data → columns of $V$ = PCs |
| Recommender systems | $R \approx U_k \Sigma_k V_k^\top$ — user/item latent factors |
| LSA (NLP) | TruncatedSVD on TF-IDF → topic extraction |
| Image compression | Keep top singular values → reconstruct with fewer parameters |
| LoRA (LLM fine-tuning) | $\Delta W = BA$, rank $r \ll d$ — low-rank update to frozen weights |
| Pseudo-inverse | $A^+ = V \Sigma^+ U^\top$ — least-squares solutions |
import numpy as np
# Full SVD
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
# Reconstruct: X ≈ U @ np.diag(sigma) @ Vt
# Truncated SVD (for sparse data — PCA alternative)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X_tfidf) # keeps sparsity
import numpy as np
# Full SVD
X = np.random.randn(100, 10)
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
# U: (100, 10), sigma: (10,), Vt: (10, 10)
# Reconstruct
X_reconstructed = U @ np.diag(sigma) @ Vt
np.allclose(X, X_reconstructed) # True ✓
# Truncated SVD: rank-k approximation
k = 3
X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(X - X_approx, 'fro')
total = np.linalg.norm(X, 'fro')
print(f"Rank-{k} captures {1 - error**2/total**2:.1%} of the squared Frobenius norm")
# Relationship: eigenvalues of X^T X = sigma^2
XtX_eigs = np.linalg.eigvalsh(X.T @ X)
sigma_squared = sigma**2
# XtX_eigs (sorted ascending) ≈ sigma_squared (sorted descending, reversed)
# TruncatedSVD for sparse data (NLP)
from sklearn.decomposition import TruncatedSVD
# svd = TruncatedSVD(n_components=50).fit_transform(X_tfidf_sparse)
# LoRA: low-rank weight update
d, r = 768, 16 # model dim, LoRA rank
B = np.random.randn(d, r) * 0.01 # d × r
A = np.random.randn(r, d) * 0.01 # r × d
delta_W = B @ A # rank-r update: (d × r) @ (r × d) = (d × d)
# Full W has d² = 589K params, LoRA has 2dr = 24K params (24x fewer)
Interview link: LoRA
LoRA (Low-Rank Adaptation) fine-tunes LLMs with low-rank matrices: $W' = W + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$. Only $B$ and $A$ are trained (far fewer params than a full $\Delta W$). Core insight: weight updates are often effectively low-rank → no need for a full-rank update matrix.
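A quick numerical check of the low-rank claim (a sketch with toy dimensions, not real model weights): the product of a $d \times r$ and an $r \times d$ matrix has rank at most $r$, which the singular values make visible:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 64, 4                      # toy dims (real LoRA: d ~ 768+, r ~ 8-64)
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W = B @ A                   # rank-r by construction

sigma = np.linalg.svd(delta_W, compute_uv=False)
print(np.linalg.matrix_rank(delta_W))   # 4
print(np.allclose(sigma[r:], 0))        # True: singular values beyond r vanish
```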
Matrix Calculus
Why Matrix Calculus for ML
Backpropagation = the chain rule applied to matrix operations. Understanding matrix derivatives lets you derive gradients by hand (a common interview task) and debug gradient issues.
Scalar-by-Vector Gradient
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is:
$$\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right]^\top$$
The gradient points in the direction in which $f$ increases fastest. Gradient descent: $x \leftarrow x - \eta \nabla f(x)$ — move in the opposite direction.
Common Gradients
| Function | Gradient | ML Usage |
|---|---|---|
| $f = a^\top x$ | $\nabla f = a$ | Linear model gradient |
| $f = x^\top A x$ | $\nabla f = (A + A^\top)x$ (= $2Ax$ if symmetric) | Quadratic form (regularization) |
| $f = \lVert y - X\beta \rVert^2$ | $\nabla f = -2X^\top(y - X\beta)$ | OLS gradient → set to 0 → normal equation |
| $f = \lVert \beta \rVert^2$ | $\nabla f = 2\beta$ | L2 regularization gradient |
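The table's gradient formulas can be checked numerically with a central-difference sketch (`num_grad` below is an illustrative helper, not a library function):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference numerical gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))           # general (non-symmetric) A
x = rng.standard_normal(4)

# f = x^T A x  ->  grad = (A + A^T) x
analytic = (A + A.T) @ x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), analytic, atol=1e-5))  # True
```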
OLS Derivation via Matrix Calculus
$$L(\beta) = \lVert y - X\beta \rVert^2, \quad \nabla L = -2X^\top(y - X\beta) = 0 \;\Rightarrow\; X^\top X \hat\beta = X^\top y \;\Rightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y$$
Being able to derive the OLS closed form by hand in an interview is a strong signal.
import numpy as np
# Gradient of f = a^T x → ∇f = a
a = np.array([2, -1, 3])
x = np.array([1, 2, 3])
# f = a @ x = 2 - 2 + 9 = 9
# gradient w.r.t. x = a = [2, -1, 3]
# Gradient of f = ||x||^2 = x^T x → ∇f = 2x
x = np.array([1, 2, 3])
grad = 2 * x # [2, 4, 6]
# OLS: verify gradient = 0 at optimal beta
X = np.random.randn(100, 3)
y_true = X @ np.array([2, -1, 0.5]) + np.random.randn(100) * 0.1
beta = np.linalg.solve(X.T @ X, X.T @ y_true)  # OLS closed form (solve is more stable than inv)
# Gradient at optimal beta should be ≈ 0
gradient = -2 * X.T @ (y_true - X @ beta)
print(f"Gradient at optimal beta: {gradient}") # ≈ [0, 0, 0]
# Ridge: gradient includes regularization term
lam = 1.0
# ∇L = -2X^T(y - Xβ) + 2λβ = 0
# → (X^TX + λI)β = X^Ty
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ y_true
Taylor Approximation
Scalar Taylor Expansion
Any smooth function can be approximated locally by a polynomial:
$$f(x) \approx f(a) + f'(a)(x - a) + \tfrac{1}{2} f''(a)(x - a)^2 + \dots$$
| Order | Approximation | What It Captures |
|---|---|---|
| 0th | $f(a)$ | Value only |
| 1st (linear) | $f(a) + f'(a)(x-a)$ | Slope — used in gradient descent |
| 2nd (quadratic) | $f(a) + f'(a)(x-a) + \tfrac{1}{2}f''(a)(x-a)^2$ | Curvature — used in Newton's method, XGBoost |
Multivariate Taylor Expansion
For $f: \mathbb{R}^n \to \mathbb{R}$:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2} \Delta x^\top H \Delta x$$
- 1st-order term → theoretical basis of gradient descent (local linear approximation)
- 2nd-order term → the Hessian $H$ supplies curvature information
ML Connections
| Application | Which Order | How |
|---|---|---|
| Gradient descent | 1st order | $x_{t+1} = x_t - \eta \nabla f(x_t)$ — move along the negative gradient (linear approx) |
| Newton's method | 2nd order | $x_{t+1} = x_t - H^{-1} \nabla f(x_t)$ — use curvature for a better step |
| XGBoost | 2nd order | Loss ≈ $\sum_i \big[ g_i f(x_i) + \tfrac{1}{2} h_i f(x_i)^2 \big]$ — gradient + Hessian for optimal leaf values |
| Adam optimizer | ≈ diagonal 2nd order | Second moment ≈ diagonal Hessian → adaptive lr per parameter |
| Natural gradient | 2nd order (Fisher) | $\theta \leftarrow \theta - \eta F^{-1} \nabla L$ — curvature in parameter space |
Interview link: XGBoost Second-Order
XGBoost takes a second-order Taylor expansion of the loss function: $L \approx \sum_i \big[ l(y_i, \hat y_i) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \big]$, where $g_i = \partial l / \partial \hat y_i$ (gradient) and $h_i = \partial^2 l / \partial \hat y_i^2$ (Hessian). Using both $g_i$ and $h_i$ to pick the best split and leaf value converges faster than vanilla gradient boosting, which uses only the gradient.
import numpy as np
# Taylor approximation of f(x) = exp(x) around x=0
x0 = 0
dx = 0.5
f_exact = np.exp(x0 + dx) # exact: 1.6487
f_0th = np.exp(x0) # 0th order: 1.0
f_1st = np.exp(x0) * (1 + dx) # 1st order: 1.5
f_2nd = np.exp(x0) * (1 + dx + dx**2/2) # 2nd order: 1.625
# Higher order → closer to exact
# Gradient descent = 1st order Taylor
# Newton's method = 2nd order Taylor
# f(x + Δx) ≈ f(x) + f'(x)Δx + ½f''(x)Δx²
# Minimize: set derivative w.r.t. Δx to 0:
# f'(x) + f''(x)Δx = 0 → Δx = -f'(x)/f''(x) → Newton step
# XGBoost: second-order approximation for each sample
# g_i = gradient of loss at current prediction
# h_i = Hessian (second derivative) of loss
# For squared error: g_i = 2(y_hat - y), h_i = 2
# For log loss: g_i = y_hat - y, h_i = y_hat * (1 - y_hat)
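The Newton step derived in the comments above can be run on a concrete 1-D example (a minimal sketch): minimize $f(x) = e^x - 2x$, whose minimum is at $x = \ln 2$.

```python
import numpy as np

f_prime  = lambda x: np.exp(x) - 2      # f(x) = exp(x) - 2x
f_second = lambda x: np.exp(x)

x = 0.0
for _ in range(6):
    x -= f_prime(x) / f_second(x)       # Newton step: Δx = -f'(x)/f''(x)

print(np.isclose(x, np.log(2)))         # True: converged to x = ln 2
```

Quadratic convergence means a handful of steps suffice, versus many more for plain gradient descent on the same function.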
Jacobian
Definition
For a vector function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all partial derivatives:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
ML Connections
| Where Jacobians Appear | Why |
|---|---|
| Backprop in RNNs | $\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}$ — product of Jacobians → vanishing/exploding gradients |
| Change of variables | Probability transforms with Jacobian determinant — normalizing flows, VAE |
| Neural network layers | Each layer's Jacobian describes how output changes with input |
Interview link: Vanishing Gradient in RNNs
The RNN gradient involves a product of Jacobians: $\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}$. If every Jacobian's spectral norm (largest singular value) is < 1, the product → 0 (vanishing); if > 1, the product → ∞ (exploding). LSTM's additive cell-state update avoids this multiplicative chain.
import numpy as np
# Jacobian: matrix of partial derivatives
# f: R^2 → R^2, f(x) = [x1^2 + x2, x1*x2]
# J = [[2*x1, 1], [x2, x1]]
x = np.array([3.0, 2.0])
J = np.array([[2*x[0], 1],
[x[1], x[0]]])
# J = [[6, 1], [2, 3]]
# Spectral norm = largest singular value (controls gradient magnitude)
spectral_norm = np.linalg.svd(J, compute_uv=False)[0] # largest singular value
# RNN vanishing gradient: product of Jacobians
# If spectral_norm < 1 for each J_i, product → 0
W_hh = np.random.randn(4, 4) * 0.5 # small weights
tanh_deriv = np.diag(np.random.uniform(0, 1, 4)) # tanh' ∈ (0, 1)
J_rnn = tanh_deriv @ W_hh
print(f"Spectral norm of J_rnn: {np.linalg.svd(J_rnn, compute_uv=False)[0]:.3f}")
# After 20 time steps:
J_product = np.eye(4)
for _ in range(20):
J_product = J_rnn @ J_product
print(f"Norm after 20 steps: {np.linalg.norm(J_product):.6f}") # → ≈ 0 (vanishing!)
# PyTorch autograd computes Jacobians automatically via .backward()
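The analytic Jacobian above can be cross-checked by finite differences (a minimal sketch; `num_jacobian` is an illustrative helper):

```python
import numpy as np

def num_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m, returned with shape (m, n)."""
    cols = []
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

f = lambda v: np.array([v[0]**2 + v[1], v[0] * v[1]])   # same f as above
x = np.array([3.0, 2.0])
J_analytic = np.array([[2 * x[0], 1.0], [x[1], x[0]]])  # [[6, 1], [2, 3]]

print(np.allclose(num_jacobian(f, x), J_analytic, atol=1e-6))  # True
```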
Hessian
Definition
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the symmetric $n \times n$ matrix of second derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
The Hessian describes the curvature of the loss surface.
Hessian and Optimization
| Hessian Property | What It Means |
|---|---|
| PD (all $\lambda_i > 0$) | Local minimum → locally convex |
| ND (all $\lambda_i < 0$) | Local maximum |
| Mixed signs | Saddle point — minimum in some directions, maximum in others |
| Condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ | Large $\kappa$ → ill-conditioned → gradient descent slow (narrow valley) |
ML Connections
| Where Hessian Appears | How |
|---|---|
| Convexity of loss | MSE: $H = 2X^\top X$ (PSD → convex). Neural nets: non-convex → saddle points |
| Newton's method | $\theta \leftarrow \theta - H^{-1} \nabla L$ — uses curvature for better updates |
| XGBoost | Uses second-order approximation (gradient + Hessian) for better splits |
| Natural gradient | $F^{-1} \nabla L$ — Fisher information ≈ expected Hessian |
| Adam optimizer | Second moment ≈ diagonal Hessian → adaptive learning rate per parameter |
Interview link: Neural Networks Are Not Convex
The Hessian of linear regression is $2X^\top X$ (PSD → convex → unique global minimum → GD is guaranteed to converge). Neural network Hessians have both positive and negative eigenvalues → non-convex → many local minima and saddle points. Empirically, though, the local minima SGD finds usually generalize well (the flat-minima hypothesis).
import numpy as np
# Hessian of OLS loss: H = 2 X^T X (constant — doesn't depend on beta)
X = np.random.randn(100, 3)
H_ols = 2 * X.T @ X # 3×3 matrix
# Check convexity: all eigenvalues ≥ 0?
eigs = np.linalg.eigvalsh(H_ols)
print(f"OLS Hessian eigenvalues: {eigs.round(2)}") # all > 0 → convex ✓
# Condition number: κ = λ_max / λ_min → large = ill-conditioned
kappa = eigs[-1] / eigs[0]
print(f"Condition number: {kappa:.1f}")
# Large κ → narrow valley → GD oscillates → slow convergence
# Solution: use Adam (adaptive lr per parameter) or preconditioning
# Numerical Hessian for arbitrary function
from scipy.optimize import approx_fprime
def loss(beta):
return np.sum((X @ beta - np.ones(100))**2)
def numerical_hessian(f, x, eps=1e-5):
n = len(x)
H = np.zeros((n, n))
for i in range(n):
def grad_i(x):
return approx_fprime(x, f, eps)[i]
H[i] = approx_fprime(x, grad_i, eps)
return H
H_num = numerical_hessian(loss, np.zeros(3))
np.allclose(H_num, H_ols, atol=0.1) # True ✓
Quadratic Forms
Definition
For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, the quadratic form is the scalar
$$Q(x) = x^\top A x = \sum_{i,j} A_{ij} x_i x_j$$
ML Connections
| Quadratic Form | Where |
|---|---|
| $(x - \mu)^\top \Sigma^{-1} (x - \mu)$ | Mahalanobis distance — distance accounting for covariance |
| $\lVert y - X\beta \rVert^2$ | MSE loss — quadratic in parameters |
| $\beta^\top \beta = \lVert \beta \rVert^2$ | L2 regularization — simplest quadratic form |
| $\tfrac{1}{2} \Delta x^\top H \Delta x$ | Second-order Taylor expansion of loss → curvature |
import numpy as np
# Quadratic form: x^T A x
A = np.array([[4, 1], [1, 3]])
x = np.array([1, 2])
quadratic = x @ A @ x # 1*4*1 + 1*1*2 + 2*1*1 + 2*3*2 = 4+2+2+12 = 20
# Mahalanobis distance: x^T Σ^{-1} x (accounts for covariance)
from scipy.spatial.distance import mahalanobis
mean = np.array([0, 0])
cov = np.array([[2, 1], [1, 3]])
x = np.array([1, 2])
d = mahalanobis(x, mean, np.linalg.inv(cov)) # ≠ Euclidean distance
# L2 regularization is simplest quadratic form: β^T β = ||β||^2
beta = np.array([0.5, -0.3, 0.8])
l2_penalty = beta @ beta # = 0.25 + 0.09 + 0.64 = 0.98
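Definiteness can also be read off quadratic forms (a minimal sketch): a symmetric $A$ is positive definite iff $x^\top A x > 0$ for every nonzero $x$, equivalently iff all eigenvalues are positive.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric matrix from the example above
rng = np.random.default_rng(4)

# Empirical check: x^T A x > 0 for many random nonzero x
xs = rng.standard_normal((1000, 2))
quad = np.einsum('ni,ij,nj->n', xs, A, xs)  # batch of quadratic forms
print((quad > 0).all())                     # True

# Equivalent eigenvalue check
print((np.linalg.eigvalsh(A) > 0).all())    # True
```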
Trace
Properties
| Property | Formula | Use |
|---|---|---|
| Cyclic | $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ | Simplify matrix derivatives |
| Frobenius norm | $\lVert A \rVert_F^2 = \mathrm{tr}(A^\top A)$ | Matrix distance, weight decay |
| Sum of eigenvalues | $\mathrm{tr}(A) = \sum_i \lambda_i$ | Total variance in PCA = tr(C) |
| Transpose | $\mathrm{tr}(A) = \mathrm{tr}(A^\top)$ | Symmetry in derivatives |
import numpy as np
A = np.array([[1, 2], [3, 4]])
# Trace = sum of diagonal elements = sum of eigenvalues
np.trace(A) # 1 + 4 = 5
np.linalg.eigvals(A).sum() # also 5 (A is not symmetric — use eigvals; eigvalsh is for symmetric matrices)
# Frobenius norm via trace
fro_norm = np.sqrt(np.trace(A.T @ A)) # same as np.linalg.norm(A, 'fro')
# Cyclic property: tr(ABC) = tr(BCA) = tr(CAB)
B = np.random.randn(2, 3)
C = np.random.randn(3, 2)
np.trace(A @ B @ C) # = np.trace(B @ C @ A) = np.trace(C @ A @ B)
# Total variance in PCA = tr(covariance matrix)
X = np.random.randn(100, 5)
cov = np.cov(X.T)
total_variance = np.trace(cov) # = sum of all eigenvalues = sum of all PC variances
Hadamard (Element-wise) Product
Unlike matrix multiplication, the Hadamard product is element-wise: $(A \odot B)_{ij} = A_{ij} B_{ij}$.
ML Connections
| Where Appears | Formula |
|---|---|
| LSTM gates | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ — gates control information flow element-wise |
| Dropout | $h' = h \odot m$, $m_i \sim \mathrm{Bernoulli}(1-p)$ — element-wise mask |
| Attention masking | Mask scores element-wise — set future positions to $-\infty$ before softmax |
| Backprop with ReLU | $\delta = \delta \odot \mathbb{1}[z > 0]$ — gradient flows only where the activation was positive |
import numpy as np
# Hadamard (element-wise) product: A ⊙ B
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
hadamard = A * B # [[5, 12], [21, 32]] — element-wise!
# vs matrix multiply:
matmul = A @ B # [[19, 22], [43, 50]] — very different!
# LSTM gate: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c_tilde
hidden_size = 4
f_t = np.array([0.9, 0.1, 0.8, 0.5]) # forget gate (sigmoid output)
c_prev = np.array([1.0, 2.0, -0.5, 3.0]) # previous cell state
i_t = np.array([0.2, 0.9, 0.1, 0.7]) # input gate
c_tilde = np.array([0.5, -1.0, 2.0, 0.3]) # candidate
c_new = f_t * c_prev + i_t * c_tilde # element-wise! (Hadamard)
# f_t=0.9 → keep 90% of c_prev[0]; f_t=0.1 → forget 90% of c_prev[1]
# Dropout mask: element-wise multiply
mask = np.random.binomial(1, 0.5, size=4) # e.g. [1, 0, 1, 0] (random)
h_dropped = c_new * mask / 0.5 # inverted dropout (scale by 1/(1-p))
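The attention-masking row from the table can be sketched in the same element-wise spirit (illustrative shapes, not tied to any specific library): set future positions to $-\infty$ before softmax so they receive zero weight.

```python
import numpy as np

T = 4
scores = np.random.randn(T, T)                      # raw attention scores
mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # True above diagonal = future positions
scores_masked = np.where(mask, -np.inf, scores)     # future positions -> -inf

# Row-wise softmax: exp(-inf) = 0, so future positions get zero weight
weights = np.exp(scores_masked - scores_masked.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(np.allclose(weights[mask], 0))                # True: no attention to the future
print(np.allclose(weights.sum(axis=1), 1))          # True: each row is a distribution
```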
Interview Signals
What interviewers listen for:
- You can explain the eigendecomposition / SVD math behind PCA
- You know how SVD is applied in recommender systems and LoRA
- You can derive the OLS gradient by hand and arrive at the normal equation
- You understand the link between Jacobians and vanishing gradients (RNNs)
- You know the connection between the Hessian and convexity / saddle points
Practice
Flashcards
What role does eigendecomposition play in PCA?
Eigendecomposition of the covariance matrix C: C = VΛVᵀ. Eigenvectors V = principal component directions; eigenvalues Λ = variance in each direction. Dimensionality reduction = keep only the top-k eigenvectors (largest eigenvalues). Explained variance ratio = λᵢ/Σλⱼ.
Quiz
The principal components in PCA are eigenvectors of which matrix?