Vectors, Norms & Distance

Course Reference

This section is based on Introduction to Applied Linear Algebra — Vectors, Matrices, and Least Squares by Stephen Boyd and Lieven Vandenberghe (Stanford University). Covers Chapters 1-4.

Why Linear Algebra for ML?

Linear algebra is the language of ML: data points are vectors, similarity is a dot product, regularization is a norm, and clustering is distance minimization. These fundamentals show up in every corner of ML.

Vectors

Vectors in ML

A vector $\mathbf{x} \in \mathbb{R}^n$ is an ordered list of $n$ numbers. In ML:

| What | Vector Representation |
| --- | --- |
| A data point | Feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ |
| Model weights | Weight vector $\mathbf{w} = [w_1, w_2, \ldots, w_n]$ |
| Word embedding | Dense vector $\mathbf{e} \in \mathbb{R}^{300}$ (Word2Vec) |
| Gradient | $\nabla_\theta \mathcal{L} = [\frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots]$ |

Dot Product

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$
| ML Application | How Dot Product Is Used |
| --- | --- |
| Linear regression | $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$: prediction is the dot product of weights and features |
| Logistic regression | $\sigma(\mathbf{w}^\top \mathbf{x})$: dot product → sigmoid → probability |
| Attention | $\text{score} = \mathbf{q}^\top \mathbf{k}$: query-key dot product measures relevance |
| Cosine similarity | $\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$: embedding similarity |
| SVM | $\mathbf{w}^\top \mathbf{x} + b = 0$: decision boundary defined by a dot product |

Cosine Similarity

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

Range: $[-1, 1]$. $1$ = same direction, $0$ = orthogonal, $-1$ = opposite.

In NLP, cosine similarity is the standard metric for embedding similarity because it is invariant to vector magnitude (it only looks at direction).

import numpy as np

# Vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Dot product
np.dot(a, b)          # 32  (= 1*4 + 2*5 + 3*6)
a @ b                 # 32  (@ operator = dot product for 1D)

# Cosine similarity
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 0.974

# Linear regression prediction: y_hat = w^T x + b
w = np.array([0.5, -0.3, 0.8])
x = np.array([2.0, 1.5, 3.0])
y_hat = w @ x + 0.1   # dot product + bias

Norms

Norm Definitions

| Norm | Formula | Geometry | ML Connection |
| --- | --- | --- | --- |
| L1 ($\lVert\mathbf{x}\rVert_1$) | $\sum_i \lvert x_i \rvert$ | Diamond | Lasso: L1 penalty promotes sparsity (feature selection) |
| L2 ($\lVert\mathbf{x}\rVert_2$) | $\sqrt{\sum_i x_i^2}$ | Circle | Ridge: L2 penalty shrinks weights uniformly |
| L∞ ($\lVert\mathbf{x}\rVert_\infty$) | $\max_i \lvert x_i \rvert$ | Square | Adversarial robustness: max perturbation per dimension |
| Frobenius ($\lVert\mathbf{A}\rVert_F$) | $\sqrt{\sum_{i,j} a_{ij}^2}$ | (matrix norm) | Equivalent to L2 on the vectorized matrix |

Why L1 Gives Sparsity but L2 Doesn't

Geometric intuition: the L1 constraint region is a diamond (corners on the axes), so the loss contours tend to touch the region at a corner, where some $w_j = 0$. The L2 constraint region is a circle (no corners), so coefficients shrink but never land exactly on zero.

This is the fundamental reason Lasso performs feature selection and Ridge does not.
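The same intuition shows up algebraically in one dimension: the L1-penalized solution applies soft-thresholding (small inputs snap to exactly zero), while the L2-penalized solution only rescales. A minimal sketch of the two closed-form proximal operators (`soft_threshold` and `ridge_shrink` are illustrative names, not library functions):

```python
import numpy as np

def soft_threshold(z, lam):
    # argmin_w 0.5*(w - z)**2 + lam*|w|  ->  exact zeros when |z| <= lam (L1 / Lasso)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    # argmin_w 0.5*(w - z)**2 + 0.5*lam*w**2  ->  uniform rescaling (L2 / Ridge)
    return z / (1.0 + lam)

z = np.array([3.0, 0.5, -0.2, -4.0])
print(soft_threshold(z, 1.0))  # [ 2.  0. -0. -3.] : small entries become exactly 0
print(ridge_shrink(z, 1.0))    # [ 1.5  0.25 -0.1 -2.] : all shrink, none hits 0
```

The "corner" of the diamond corresponds exactly to the flat zero region of `soft_threshold`.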

import numpy as np

x = np.array([3, -4])
np.linalg.norm(x, ord=1)   # L1: |3| + |-4| = 7
np.linalg.norm(x, ord=2)   # L2: sqrt(9 + 16) = 5
np.linalg.norm(x, ord=np.inf)  # L∞: max(|3|, |-4|) = 4

Interview Link: Regularization

"What is the difference between L1 and L2 regularization?" At its core, it is the L1 vs L2 norm of the weight vector used as the penalty. The L1 norm's diamond-shaped constraint leads to corner solutions and hence sparsity; the L2 norm's circular constraint shrinks weights uniformly, with no exact zeros.

import numpy as np

# Regularization in practice
weights = np.array([0.5, -0.3, 0.0, 0.8, 0.0, -0.1])
l1_penalty = np.linalg.norm(weights, ord=1)  # 1.7 (Lasso)
l2_penalty = np.linalg.norm(weights, ord=2)  # ≈ 0.995 (Ridge)

# Frobenius norm for a matrix
W = np.random.randn(3, 4)
np.linalg.norm(W, 'fro')  # sqrt(sum of all elements squared)

Key Inequalities

These inequalities appear repeatedly in ML theory and interviews:

Triangle Inequality

For any vectors a\mathbf{a} and b\mathbf{b}:

$$\|\mathbf{a} + \mathbf{b}\| \leq \|\mathbf{a}\| + \|\mathbf{b}\|$$

Intuition: the total length of a two-leg path is at least the straight-line distance.

ML connections:

  • Metric space property: any valid distance function must satisfy the triangle inequality. This is why the distance metrics used by KNN, K-Means, and DBSCAN must satisfy it.
  • Cosine similarity is NOT a metric: it violates the triangle inequality (and even cosine distance = 1 - cosine does not fully satisfy it).
import numpy as np

a = np.array([3, 0])
b = np.array([0, 4])
# ||a + b|| = ||(3,4)|| = 5
# ||a|| + ||b|| = 3 + 4 = 7
# 5 ≤ 7 ✓ (triangle inequality)

# For distance metric d, triangle inequality means:
# d(x, z) ≤ d(x, y) + d(y, z)
# This enables efficient search: if d(x,y) is large and d(y,z) is small,
# then d(x,z) must be at least d(x,y) - d(y,z) → prune search space
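The claim that cosine distance (1 - cosine similarity) is not a metric can be shown with a small counterexample; a sketch with three hand-picked unit vectors:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity; NOT a true metric
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0]) / np.sqrt(2)  # halfway (45°) between a and b

d_ab = cosine_distance(a, b)                              # 1.0  (90° apart)
d_via_c = cosine_distance(a, c) + cosine_distance(c, b)   # 2*(1 - cos 45°) ≈ 0.586
print(d_ab > d_via_c)  # True: d(a,b) > d(a,c) + d(c,b), triangle inequality fails
```

The angular distance $\arccos(\cos\theta)$ does satisfy the triangle inequality, which is one reason it is sometimes preferred in theory even though raw cosine distance dominates in practice.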

Cauchy-Schwarz Inequality

$$|\mathbf{a} \cdot \mathbf{b}| \leq \|\mathbf{a}\| \, \|\mathbf{b}\|$$

Equality holds if and only if $\mathbf{a}$ and $\mathbf{b}$ are linearly dependent (parallel).

Consequences:

1. Cosine similarity is bounded: $-1 \leq \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|} \leq 1$, because the numerator's absolute value is at most the denominator (by Cauchy-Schwarz).

2. Correlation coefficient is bounded: $-1 \leq \rho_{XY} \leq 1$; Pearson correlation is essentially the cosine similarity of centered data.

3. Variance of a linear combination: $\text{Var}(aX + bY) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\text{Cov}(X,Y) \geq 0$; Cauchy-Schwarz gives $|\text{Cov}(X,Y)| \leq \sigma_X \sigma_Y$, which guarantees the non-negativity.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Cauchy-Schwarz: |a·b| ≤ ||a|| * ||b||
lhs = abs(np.dot(a, b))            # |32| = 32
rhs = np.linalg.norm(a) * np.linalg.norm(b)  # 3.74 * 8.77 = 32.83
print(f"|a·b| = {lhs:.2f} ≤ ||a||·||b|| = {rhs:.2f}")  # 32 ≤ 32.83 ✓

# Equality when vectors are parallel (linearly dependent)
c = 2 * a  # c = [2, 4, 6], parallel to a
abs(np.dot(a, c))  # 28
np.linalg.norm(a) * np.linalg.norm(c)  # 28.0 — equality! ✓

# This is WHY cosine similarity ∈ [-1, 1]:
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# = 32 / 32.83 = 0.974, always in [-1, 1] by Cauchy-Schwarz
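The covariance consequence can also be checked numerically; a sketch on sample statistics (using np.cov so variance and covariance estimates share the same ddof):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.5 * x + rng.standard_normal(1000)

C = np.cov(x, y)  # [[Var(X), Cov(X,Y)], [Cov(X,Y), Var(Y)]]
var_x, var_y, cov_xy = C[0, 0], C[1, 1], C[0, 1]

# Cauchy-Schwarz on centered samples: |Cov(X,Y)| <= sigma_X * sigma_Y
assert abs(cov_xy) <= np.sqrt(var_x * var_y)

# Var(aX + bY) from the formula matches the direct estimate and is >= 0
a, b = 2.0, -3.0
var_formula = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy
var_direct = np.var(a * x + b * y, ddof=1)
assert np.isclose(var_formula, var_direct) and var_formula >= 0
```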

Correlation Coefficient (from Linear Algebra Perspective)

Pearson correlation is cosine similarity of centered vectors:

$$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{(\mathbf{x} - \bar{x})^\top(\mathbf{y} - \bar{y})}{\|\mathbf{x} - \bar{x}\| \, \|\mathbf{y} - \bar{y}\|} = \cos\theta$$

$\rho \in [-1, 1]$ by Cauchy-Schwarz. $|\rho| = 1$ if and only if $X$ and $Y$ are perfectly linearly related.

| Property | Meaning | ML Implication |
| --- | --- | --- |
| $\rho = 1$ | Perfect positive linear relationship | Features are redundant (multicollinearity) |
| $\rho = 0$ | No linear relationship | But there may still be a nonlinear relationship! |
| $\rho = -1$ | Perfect negative linear relationship | Features carry the same info in opposite directions |
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Method 1: np.corrcoef
corr_matrix = np.corrcoef(x, y)  # [[1, ρ], [ρ, 1]]
rho = corr_matrix[0, 1]

# Method 2: cosine similarity of centered vectors
x_c = x - x.mean()
y_c = y - y.mean()
rho_manual = np.dot(x_c, y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))
# rho ≈ rho_manual ✓

# Correlation matrix of features = cosine similarity matrix of centered features
X = np.random.randn(100, 5)
corr = np.corrcoef(X.T)  # 5×5 correlation matrix
# High off-diagonal values → multicollinearity → VIF high → unstable OLS

Interview Classic: Correlation ≠ Independence

$\rho = 0$ only means there is no linear relationship. Example: $X \sim \text{Uniform}(-1,1)$, $Y = X^2$: $\rho = 0$, yet $Y$ is completely determined by $X$ (nonlinear dependence). Independence implies $\rho = 0$, but $\rho = 0$ does not imply independence.
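The $Y = X^2$ example is easy to verify numerically; a sketch using a symmetric grid so the odd moments vanish exactly:

```python
import numpy as np

x = np.linspace(-1, 1, 1001)  # symmetric around 0, stands in for Uniform(-1, 1)
y = x ** 2                    # fully determined by x, but nonlinearly

rho = np.corrcoef(x, y)[0, 1]  # Cov(X, X^2) = E[X^3] = 0 by symmetry
print(abs(rho) < 1e-10)        # True: zero correlation despite perfect dependence
```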

K-Means: Clustering as Distance Minimization

At its mathematical core, K-Means is a distance minimization problem, directly connecting the concepts of vectors, norms, and distance:

$$\min_{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K} \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|_2^2$$
  • Assign step: each point is assigned to the centroid nearest in L2 distance (a Voronoi tessellation)
  • Update step: each centroid becomes the mean of its assigned points (because the mean minimizes squared L2 distance)
  • With L1 distance instead, the optimal center is the coordinate-wise median → K-Medians (a close relative of K-Medoids)
import numpy as np

# K-Means from scratch using vector operations
def kmeans(X, K, max_iters=100):
    # Initialize centroids randomly
    centroids = X[np.random.choice(len(X), K, replace=False)]

    for _ in range(max_iters):
        # Assign: compute L2 distance from each point to each centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)  # [n, K]
        labels = distances.argmin(axis=1)  # nearest centroid

        # Update: new centroid = mean of assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

X = np.random.randn(200, 2)
labels, centroids = kmeans(X, K=3)

Every step uses concepts from this page: vector subtraction ($\mathbf{x}_i - \boldsymbol{\mu}_k$), the L2 norm ($\|\cdot\|_2$), and the mean as the L2 minimizer.
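The update-step claim (mean minimizes squared L2 cost, median minimizes L1 cost) can be verified directly in one dimension; a sketch that brute-forces both objectives over a grid of candidate centers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 10.0])
candidates = np.linspace(0, 12, 1201)  # candidate center locations, step 0.01

# cost of placing the center at each candidate, under each objective
sq_cost = ((data[None, :] - candidates[:, None]) ** 2).sum(axis=1)  # squared L2
l1_cost = np.abs(data[None, :] - candidates[:, None]).sum(axis=1)   # L1

best_sq = candidates[sq_cost.argmin()]
best_l1 = candidates[l1_cost.argmin()]
print(best_sq, data.mean())       # both ≈ 4.0: mean minimizes squared L2
print(best_l1, np.median(data))   # any center in [2, 3] minimizes L1; median 2.5 is one
```

Note how the outlier 10 drags the mean (L2 center) to 4.0, while the L1-optimal region stays between the middle points: the same robustness argument behind K-Medians.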

Interview Signals

What interviewers listen for:

  • You can connect dot products to ML predictions (linear regression, attention, cosine similarity)
  • You know the geometric difference between the L1 and L2 norms and its effect on regularization
  • You can explain why Cauchy-Schwarz bounds cosine similarity to [-1, 1]
  • You understand that correlation is the cosine similarity of centered vectors
  • You know the triangle inequality is what makes a distance a valid metric (KNN, K-Means, DBSCAN)

Practice

Flashcards


What are the most important applications of the dot product in ML?

(1) Linear regression: prediction = wᵀx + b. (2) Attention: score = qᵀk. (3) Cosine similarity: embedding similarity. (4) SVM: decision boundary wᵀx + b = 0. Almost every "score" in ML is some kind of dot product.


Quiz


Which linear algebra operation does a linear regression prediction use?
