Vectors, Norms & Distance
Course Reference
This section is based on Introduction to Applied Linear Algebra — Vectors, Matrices, and Least Squares by Stephen Boyd and Lieven Vandenberghe (Stanford University). Covers Chapters 1-4.
Why Linear Algebra for ML?
Linear algebra is the language of ML — data points are vectors, similarity is a dot product, regularization is a norm, and clustering is distance minimization. These foundational concepts appear in every corner of ML.
Vectors
Vectors in ML
A vector is an ordered list of numbers. In ML:
| What | Vector Representation |
|---|---|
| A data point | Feature vector |
| Model weights | Weight vector |
| Word embedding | Dense vector (Word2Vec) |
| Gradient | Vector of partial derivatives of the loss |
Dot Product
| ML Application | How Dot Product Is Used |
|---|---|
| Linear regression | prediction = dot product of weights and features |
| Logistic regression | dot product → sigmoid → probability |
| Attention | query-key dot product = relevance |
| Cosine similarity | embedding similarity |
| SVM | decision boundary defined by a dot product |
Cosine Similarity
cos θ = aᵀb / (‖a‖ ‖b‖). Range: [−1, 1]. 1 = same direction, 0 = orthogonal, −1 = opposite.
In NLP, cosine similarity is the standard metric for embedding similarity — because it is unaffected by vector magnitude (it depends only on direction).
import numpy as np
# Vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Dot product
np.dot(a, b) # 32 (= 1*4 + 2*5 + 3*6)
a @ b # 32 (@ operator = dot product for 1D)
# Cosine similarity
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) # 0.974
# Linear regression prediction: y_hat = w^T x + b
w = np.array([0.5, -0.3, 0.8])
x = np.array([2.0, 1.5, 3.0])
y_hat = w @ x + 0.1 # dot product + bias
Norms
Norm Definitions
| Norm | Formula | Geometry | ML Connection |
|---|---|---|---|
| L1 | ‖x‖₁ = Σᵢ \|xᵢ\| | Diamond | Lasso — L1 penalty promotes sparsity (feature selection) |
| L2 | ‖x‖₂ = √(Σᵢ xᵢ²) | Circle | Ridge — L2 penalty shrinks weights uniformly |
| L∞ | ‖x‖∞ = maxᵢ \|xᵢ\| | Square | Adversarial robustness — max perturbation per dimension |
| Frobenius | ‖A‖F = √(Σᵢⱼ aᵢⱼ²) | — | Matrix norm — equivalent to L2 on the vectorized matrix |
Why L1 Gives Sparsity but L2 Doesn't
Geometric intuition: the L1 constraint region is a diamond (corners on the axes), so the loss contours tend to hit a corner, where some wᵢ = 0 exactly. The L2 constraint region is a circle (no corners), so coefficients shrink but never become exactly zero.
This is the fundamental reason Lasso performs feature selection and Ridge does not.
import numpy as np
x = np.array([3, -4])
np.linalg.norm(x, ord=1) # L1: |3| + |-4| = 7
np.linalg.norm(x, ord=2) # L2: sqrt(9 + 16) = 5
np.linalg.norm(x, ord=np.inf) # L∞: max(|3|, |-4|) = 4
Interview Link: Regularization
"What's the difference between L1 and L2 regularization?" → Essentially the L1 vs L2 norm of the weight vector used as a penalty. The L1 norm's diamond-shaped constraint → corner solutions → sparsity. The L2 norm's circular constraint → uniform shrinkage → no exact zeros.
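The corner-solution intuition can be checked numerically via the closed-form one-dimensional solutions of the two penalized problems. This is a minimal sketch; `soft_threshold` and `ridge_shrink` are illustrative names for the proximal operators of the L1 and squared-L2 penalties:

```python
import numpy as np

def soft_threshold(w, lam):
    # argmin_z 0.5*(z - w)^2 + lam*|z|  (L1 / Lasso proximal operator)
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ridge_shrink(w, lam):
    # argmin_z 0.5*(z - w)^2 + lam*z^2  (squared-L2 / Ridge proximal operator)
    return w / (1.0 + 2.0 * lam)

w = np.array([0.9, -0.2, 0.05, -1.5])
print(soft_threshold(w, 0.3))  # the two small weights land exactly at 0
print(ridge_shrink(w, 0.3))    # every weight shrinks by 1/1.6, none becomes 0
```

The L1 operator subtracts a constant and clips at zero (exact zeros); the L2 operator only rescales (never exactly zero).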
import numpy as np
# Regularization in practice
weights = np.array([0.5, -0.3, 0.0, 0.8, 0.0, -0.1])
l1_penalty = np.linalg.norm(weights, ord=1) # 1.7 (Lasso)
l2_penalty = np.linalg.norm(weights, ord=2) # ≈ 0.995 (Ridge)
# Frobenius norm for a matrix
W = np.random.randn(3, 4)
np.linalg.norm(W, 'fro') # sqrt(sum of all elements squared)
Key Inequalities
These inequalities appear repeatedly in ML theory and interviews:
Triangle Inequality
For any vectors a and b: ‖a + b‖ ≤ ‖a‖ + ‖b‖.
Intuition: the total distance of a two-leg path is at least the straight-line distance.
ML connections:
- Metric space property: any valid distance function must satisfy the triangle inequality. This is why the distance metrics used by KNN, K-Means, and DBSCAN must satisfy it.
- Cosine similarity is NOT a metric — it does not satisfy the triangle inequality (and cosine distance = 1 − cosine does not fully satisfy it either).
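A concrete counterexample (my own construction, not from the source) makes the second point explicit — cosine distance between two 45° neighbors can sum to less than the distance across 90°:

```python
import numpy as np

def cos_dist(u, v):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # halfway (45°) between a and c
c = np.array([0.0, 1.0])

# d(a, c) = 1.0, but d(a, b) + d(b, c) ≈ 0.586 → triangle inequality fails
print(cos_dist(a, c), cos_dist(a, b) + cos_dist(b, c))
```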
import numpy as np
a = np.array([3, 0])
b = np.array([0, 4])
# Triangle inequality: ||a + b|| <= ||a|| + ||b||
lhs = np.linalg.norm(a + b)                  # ||(3, 4)|| = 5.0
rhs = np.linalg.norm(a) + np.linalg.norm(b)  # 3 + 4 = 7.0
print(lhs, "<=", rhs)  # 5.0 <= 7.0 ✓
# For a distance metric d, the triangle inequality means:
# d(x, z) <= d(x, y) + d(y, z)
# This enables efficient search: if d(x, y) is large and d(y, z) is small,
# then d(x, z) >= d(x, y) - d(y, z) → prune the search space
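That pruning idea can be made concrete with a single precomputed pivot. This is a toy sketch — the name `nearest_with_pruning` and the one-pivot scheme are illustrative, not a standard library API:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 8))
pivot = points[0]
d_pivot = np.linalg.norm(points - pivot, axis=1)  # d(pivot, z) precomputed once

def nearest_with_pruning(x):
    d_xp = np.linalg.norm(x - pivot)  # d(x, pivot)
    best_i, best_d, skipped = -1, np.inf, 0
    for i in range(len(points)):
        # Reverse triangle inequality: d(x, z) >= |d(x, pivot) - d(pivot, z)|
        if abs(d_xp - d_pivot[i]) >= best_d:
            skipped += 1   # lower bound >= current best → cannot win, skip
            continue
        d = np.linalg.norm(x - points[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, skipped

x = rng.normal(size=8)
best_i, skipped = nearest_with_pruning(x)
assert best_i == np.linalg.norm(points - x, axis=1).argmin()  # same answer, less work
```

Because the bound is a true lower bound, pruning never changes the nearest neighbor — it only skips full distance computations. This is the core trick behind metric trees (ball trees, VP-trees).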
Cauchy-Schwarz Inequality
For any vectors a and b: |aᵀb| ≤ ‖a‖ ‖b‖. Equality holds if and only if a and b are linearly dependent (parallel).
Consequences:
1. Cosine similarity is bounded: |cos θ| ≤ 1 — because the numerator |aᵀb| is at most the denominator ‖a‖ ‖b‖ (by Cauchy-Schwarz)
2. Correlation coefficient is bounded: |ρ| ≤ 1 — Pearson correlation is essentially the cosine similarity of centered data
3. Variance of a linear combination: Var(aX + bY) = a²Var(X) + 2ab·Cov(X, Y) + b²Var(Y) — Cauchy-Schwarz gives |Cov(X, Y)| ≤ σ_X σ_Y, which ensures this is always non-negative
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Cauchy-Schwarz: |a·b| ≤ ||a|| * ||b||
lhs = abs(np.dot(a, b)) # |32| = 32
rhs = np.linalg.norm(a) * np.linalg.norm(b) # 3.74 * 8.77 = 32.83
print(f"|a·b| = {lhs:.2f} ≤ ||a||·||b|| = {rhs:.2f}") # 32 ≤ 32.83 ✓
# Equality when vectors are parallel (linearly dependent)
c = 2 * a # c = [2, 4, 6], parallel to a
abs(np.dot(a, c)) # 28
np.linalg.norm(a) * np.linalg.norm(c) # 28.0 — equality! ✓
# This is WHY cosine similarity ∈ [-1, 1]:
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# = 32 / 32.83 = 0.974, always in [-1, 1] by Cauchy-Schwarz
Correlation Coefficient (from Linear Algebra Perspective)
Pearson correlation is the cosine similarity of centered vectors. With x̃ = x − x̄ and ỹ = y − ȳ:
ρ(x, y) = x̃ᵀỹ / (‖x̃‖ ‖ỹ‖)
|ρ| ≤ 1 by Cauchy-Schwarz. |ρ| = 1 if and only if x and y are linearly related.
| Property | Meaning | ML Implication |
|---|---|---|
| ρ = 1 | Perfect positive linear relationship | Features are redundant (multicollinearity) |
| ρ = 0 | No linear relationship | But there may still be a nonlinear relationship! |
| ρ = −1 | Perfect negative linear relationship | Features carry the same info in opposite directions |
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
# Method 1: np.corrcoef
corr_matrix = np.corrcoef(x, y) # [[1, ρ], [ρ, 1]]
rho = corr_matrix[0, 1]
# Method 2: cosine similarity of centered vectors
x_c = x - x.mean()
y_c = y - y.mean()
rho_manual = np.dot(x_c, y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))
# rho ≈ rho_manual ✓
# Correlation matrix of features = cosine similarity matrix of centered features
X = np.random.randn(100, 5)
corr = np.corrcoef(X.T) # 5×5 correlation matrix
# High off-diagonal values → multicollinearity → VIF high → unstable OLS
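The VIF mentioned in the comment can be computed directly from the correlation matrix, using the identity VIF_j = [R⁻¹]_jj for standardized features. A minimal sketch with illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent of both

X = np.column_stack([x1, x2, x3])
R = np.corrcoef(X.T)                 # 3×3 correlation matrix
vif = np.diag(np.linalg.inv(R))      # VIF_j = [R^-1]_jj

print(vif)  # VIF for x1, x2 is large (collinearity); VIF for x3 ≈ 1
```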
Interview Classic: Correlation ≠ Independence
ρ = 0 only means there is no linear relationship. Example: for X symmetric around 0 (say X ∈ {−2, −1, 0, 1, 2}) and Y = X², ρ(X, Y) = 0, yet Y is completely determined by X (nonlinear dependence). Independence → ρ = 0, but ρ = 0 does not imply independence.
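The classic Y = X² counterexample can be verified in a few lines:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2          # y is a deterministic (nonlinear) function of x

rho = np.corrcoef(x, y)[0, 1]
print(rho)          # ≈ 0 — zero correlation despite complete dependence
```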
K-Means: Clustering as Distance Minimization
Mathematically, K-Means is a distance minimization problem — it directly ties together vectors, norms, and distance:
- Assign step: each point goes to the centroid with the smallest L2 distance (Voronoi tessellation)
- Update step: centroid = mean of assigned points (because the mean minimizes the sum of squared L2 distances)
- With L1 distance, the optimal center is the coordinate-wise median → K-Medians (K-Medoids further restricts centers to actual data points)
import numpy as np
# K-Means from scratch using vector operations
def kmeans(X, K, max_iters=100):
    # Initialize centroids by sampling K distinct points
    centroids = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(max_iters):
        # Assign: L2 distance from each point to each centroid → [n, K]
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)  # nearest centroid
        # Update: new centroid = mean of assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
X = np.random.randn(200, 2)
labels, centroids = kmeans(X, K=3)
Every step uses the concepts from this page: vector subtraction (x − c), the L2 norm (‖x − c‖₂), and the mean as the squared-L2 minimizer.
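The update-step claims (mean minimizes squared L2, median minimizes L1) can be verified on a small 1-D example by brute-force search over a grid of candidate centers:

```python
import numpy as np

pts = np.array([0.0, 1.0, 2.0, 4.0, 10.0])       # 1-D points with an outlier
candidates = np.linspace(-5, 15, 2001)           # grid of candidate centers

sq_l2 = ((pts[None, :] - candidates[:, None]) ** 2).sum(axis=1)
l1 = np.abs(pts[None, :] - candidates[:, None]).sum(axis=1)

print(candidates[sq_l2.argmin()], pts.mean())    # both ≈ 3.4
print(candidates[l1.argmin()], np.median(pts))   # both ≈ 2.0
```

Note how the outlier at 10 pulls the squared-L2 minimizer (mean) but not the L1 minimizer (median) — the robustness argument for K-Medians.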
Interview Signals
What interviewers listen for:
- You can connect the dot product to ML predictions (linear regression, attention, cosine similarity)
- You know the geometric difference between the L1 and L2 norms and its effect on regularization
- You can explain why Cauchy-Schwarz keeps cosine similarity in [-1, 1]
- You understand that correlation = cosine of centered vectors
- You know that the Chebyshev inequality provides distribution-free bounds
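On the last point — assuming the vector form of the Chebyshev inequality from the book (the number k of entries with |xᵢ| ≥ a satisfies k ≤ ‖x‖²/a²), a quick numeric check:

```python
import numpy as np

x = np.array([0.1, -3.0, 0.2, 2.5, 0.05, -0.4])
a = 2.0

k = np.sum(np.abs(x) >= a)               # number of entries at least a in magnitude
bound = np.linalg.norm(x) ** 2 / a ** 2  # Chebyshev bound: ||x||^2 / a^2

print(k, bound)  # k = 2 ≤ bound ≈ 3.87, with no distributional assumption
```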
Practice
Flashcards
What are the most important applications of the dot product in ML?
(1) Linear regression: prediction = wᵀx + b. (2) Attention: score = qᵀk. (3) Cosine similarity: embedding similarity. (4) SVM: decision boundary wᵀx + b = 0. Nearly every 'score' in ML is some form of dot product.
Quiz
Which linear algebra operation does a linear regression prediction use?