Vectors, Norms & Distance

Course Reference

This section is based on Introduction to Applied Linear Algebra — Vectors, Matrices, and Least Squares by Stephen Boyd and Lieven Vandenberghe (Stanford University). Covers Chapters 1-4.

Why Linear Algebra for ML?

Linear algebra is the language of ML: data points are vectors, similarity is a dot product, regularization is a norm, and clustering is distance minimization. These fundamentals show up in every corner of ML.

Vectors

Vectors in ML

A vector $\mathbf{x} \in \mathbb{R}^n$ is an ordered list of $n$ numbers. In ML:

| What | Vector Representation |
| --- | --- |
| A data point | Feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ |
| Model weights | Weight vector $\mathbf{w} = [w_1, w_2, \ldots, w_n]$ |
| Word embedding | Dense vector $\mathbf{e} \in \mathbb{R}^{300}$ (Word2Vec) |
| Gradient | $\nabla_\theta \mathcal{L} = [\frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots]$ |

Dot Product

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$
| ML Application | How Dot Product Is Used |
| --- | --- |
| Linear regression | $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$: prediction is the dot product of weights and features |
| Logistic regression | $\sigma(\mathbf{w}^\top \mathbf{x})$: dot product → sigmoid → probability |
| Attention | $\text{score} = \mathbf{q}^\top \mathbf{k}$: query-key dot product measures relevance |
| Cosine similarity | $\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$: embedding similarity |
| SVM | $\mathbf{w}^\top \mathbf{x} + b = 0$: decision boundary defined by a dot product |

Cosine Similarity

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

Range: $[-1, 1]$. $1$ = same direction, $0$ = orthogonal, $-1$ = opposite.

In NLP, cosine similarity is the standard metric for embedding similarity because it is invariant to vector magnitude (it only looks at direction).

import numpy as np

# Vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Dot product
np.dot(a, b)          # 32  (= 1*4 + 2*5 + 3*6)
a @ b                 # 32  (@ operator = dot product for 1D)

# Cosine similarity
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 0.974

# Linear regression prediction: y_hat = w^T x + b
w = np.array([0.5, -0.3, 0.8])
x = np.array([2.0, 1.5, 3.0])
y_hat = w @ x + 0.1   # dot product + bias

Norms

Norm Definitions

| Norm | Formula | Geometry | ML Connection |
| --- | --- | --- | --- |
| L1 ($\lVert\mathbf{x}\rVert_1$) | $\sum_i \lvert x_i \rvert$ | Diamond | Lasso: L1 penalty promotes sparsity (feature selection) |
| L2 ($\lVert\mathbf{x}\rVert_2$) | $\sqrt{\sum_i x_i^2}$ | Circle | Ridge: L2 penalty shrinks weights uniformly |
| L∞ ($\lVert\mathbf{x}\rVert_\infty$) | $\max_i \lvert x_i \rvert$ | Square | Adversarial robustness: max perturbation per dimension |
| Frobenius ($\lVert\mathbf{A}\rVert_F$) | $\sqrt{\sum_{i,j} a_{ij}^2}$ | (matrix norm) | Equivalent to L2 on the vectorized matrix |

Why L1 Gives Sparsity but L2 Doesn't

Geometric intuition: the L1 constraint region is a diamond (corners on the axes), so the loss contours tend to touch the region at a corner, where some $w_j = 0$. The L2 constraint region is a circle (no corners), so coefficients shrink but never land exactly on zero.

This is the fundamental reason Lasso performs feature selection and Ridge does not.
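The same intuition shows up algebraically in one dimension: the L1-penalized solution applies soft-thresholding (small inputs snap to exactly zero), while the L2-penalized solution only rescales. A minimal sketch of the two closed-form proximal operators (`soft_threshold` and `ridge_shrink` are illustrative names, not library functions):

```python
import numpy as np

def soft_threshold(z, lam):
    # argmin_w 0.5*(w - z)**2 + lam*|w|  ->  exact zeros when |z| <= lam (L1 / Lasso)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    # argmin_w 0.5*(w - z)**2 + 0.5*lam*w**2  ->  uniform rescaling (L2 / Ridge)
    return z / (1.0 + lam)

z = np.array([3.0, 0.5, -0.2, -4.0])
print(soft_threshold(z, 1.0))  # [ 2.  0. -0. -3.] : small entries become exactly 0
print(ridge_shrink(z, 1.0))    # [ 1.5  0.25 -0.1 -2.] : all shrink, none hits 0
```

The "corner" of the diamond corresponds exactly to the flat zero region of `soft_threshold`.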

import numpy as np

x = np.array([3, -4])
np.linalg.norm(x, ord=1)   # L1: |3| + |-4| = 7
np.linalg.norm(x, ord=2)   # L2: sqrt(9 + 16) = 5
np.linalg.norm(x, ord=np.inf)  # L∞: max(|3|, |-4|) = 4

Interview Link: Regularization

"What is the difference between L1 and L2 regularization?" At its core, it is the L1 vs L2 norm of the weight vector used as the penalty. The L1 norm's diamond-shaped constraint leads to corner solutions and hence sparsity; the L2 norm's circular constraint shrinks weights uniformly, with no exact zeros.

import numpy as np

# Regularization in practice
weights = np.array([0.5, -0.3, 0.0, 0.8, 0.0, -0.1])
l1_penalty = np.linalg.norm(weights, ord=1)  # 1.7 (Lasso)
l2_penalty = np.linalg.norm(weights, ord=2)  # ≈ 0.995 (Ridge)

# Frobenius norm for a matrix
W = np.random.randn(3, 4)
np.linalg.norm(W, 'fro')  # sqrt(sum of all elements squared)

Key Inequalities

These inequalities appear repeatedly in ML theory and interviews:

Triangle Inequality

For any vectors a\mathbf{a} and b\mathbf{b}:

$$\|\mathbf{a} + \mathbf{b}\| \leq \|\mathbf{a}\| + \|\mathbf{b}\|$$

Intuition: the total length of a two-leg path is at least the straight-line distance.

ML connections:

  • Metric space property: any valid distance function must satisfy the triangle inequality. This is why the distance metrics used by KNN, K-Means, and DBSCAN must satisfy it.
  • Cosine similarity is NOT a metric: it violates the triangle inequality (and even cosine distance = 1 - cosine does not fully satisfy it).
import numpy as np

a = np.array([3, 0])
b = np.array([0, 4])
# ||a + b|| = ||(3,4)|| = 5
# ||a|| + ||b|| = 3 + 4 = 7
# 5 ≤ 7 ✓ (triangle inequality)

# For distance metric d, triangle inequality means:
# d(x, z) ≤ d(x, y) + d(y, z)
# This enables efficient search: if d(x,y) is large and d(y,z) is small,
# then d(x,z) must be at least d(x,y) - d(y,z) → prune search space
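The claim that cosine distance (1 - cosine similarity) is not a metric can be shown with a small counterexample; a sketch with three hand-picked unit vectors:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity; NOT a true metric
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0]) / np.sqrt(2)  # halfway (45°) between a and b

d_ab = cosine_distance(a, b)                              # 1.0  (90° apart)
d_via_c = cosine_distance(a, c) + cosine_distance(c, b)   # 2*(1 - cos 45°) ≈ 0.586
print(d_ab > d_via_c)  # True: d(a,b) > d(a,c) + d(c,b), triangle inequality fails
```

The angular distance $\arccos(\cos\theta)$ does satisfy the triangle inequality, which is one reason it is sometimes preferred in theory even though raw cosine distance dominates in practice.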

Cauchy-Schwarz Inequality

$$|\mathbf{a} \cdot \mathbf{b}| \leq \|\mathbf{a}\| \, \|\mathbf{b}\|$$

Equality holds if and only if $\mathbf{a}$ and $\mathbf{b}$ are linearly dependent (parallel).

Consequences:

1. Cosine similarity is bounded: $-1 \leq \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|} \leq 1$, because the numerator's absolute value is at most the denominator (by Cauchy-Schwarz).

2. Correlation coefficient is bounded: $-1 \leq \rho_{XY} \leq 1$; Pearson correlation is essentially the cosine similarity of centered data.

3. Variance of a linear combination: $\text{Var}(aX + bY) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\text{Cov}(X,Y) \geq 0$; Cauchy-Schwarz gives $|\text{Cov}(X,Y)| \leq \sigma_X \sigma_Y$, which guarantees the non-negativity.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Cauchy-Schwarz: |a·b| ≤ ||a|| * ||b||
lhs = abs(np.dot(a, b))            # |32| = 32
rhs = np.linalg.norm(a) * np.linalg.norm(b)  # 3.74 * 8.77 = 32.83
print(f"|a·b| = {lhs:.2f} ≤ ||a||·||b|| = {rhs:.2f}")  # 32 ≤ 32.83 ✓

# Equality when vectors are parallel (linearly dependent)
c = 2 * a  # c = [2, 4, 6], parallel to a
abs(np.dot(a, c))  # 28
np.linalg.norm(a) * np.linalg.norm(c)  # 28.0 — equality! ✓

# This is WHY cosine similarity ∈ [-1, 1]:
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# = 32 / 32.83 = 0.974, always in [-1, 1] by Cauchy-Schwarz
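The covariance consequence can also be checked numerically; a sketch on sample statistics (using np.cov so variance and covariance estimates share the same ddof):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.5 * x + rng.standard_normal(1000)

C = np.cov(x, y)  # [[Var(X), Cov(X,Y)], [Cov(X,Y), Var(Y)]]
var_x, var_y, cov_xy = C[0, 0], C[1, 1], C[0, 1]

# Cauchy-Schwarz on centered samples: |Cov(X,Y)| <= sigma_X * sigma_Y
assert abs(cov_xy) <= np.sqrt(var_x * var_y)

# Var(aX + bY) from the formula matches the direct estimate and is >= 0
a, b = 2.0, -3.0
var_formula = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy
var_direct = np.var(a * x + b * y, ddof=1)
assert np.isclose(var_formula, var_direct) and var_formula >= 0
```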

Correlation Coefficient (from Linear Algebra Perspective)

Pearson correlation is cosine similarity of centered vectors:

$$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{(\mathbf{x} - \bar{x})^\top(\mathbf{y} - \bar{y})}{\|\mathbf{x} - \bar{x}\| \, \|\mathbf{y} - \bar{y}\|} = \cos\theta$$

$\rho \in [-1, 1]$ by Cauchy-Schwarz. $|\rho| = 1$ if and only if $X$ and $Y$ are perfectly linearly related.

| Property | Meaning | ML Implication |
| --- | --- | --- |
| $\rho = 1$ | Perfect positive linear relationship | Features are redundant (multicollinearity) |
| $\rho = 0$ | No linear relationship | But there may still be a nonlinear relationship! |
| $\rho = -1$ | Perfect negative linear relationship | Features carry the same info in opposite directions |
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Method 1: np.corrcoef
corr_matrix = np.corrcoef(x, y)  # [[1, ρ], [ρ, 1]]
rho = corr_matrix[0, 1]

# Method 2: cosine similarity of centered vectors
x_c = x - x.mean()
y_c = y - y.mean()
rho_manual = np.dot(x_c, y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))
# rho ≈ rho_manual ✓

# Correlation matrix of features = cosine similarity matrix of centered features
X = np.random.randn(100, 5)
corr = np.corrcoef(X.T)  # 5×5 correlation matrix
# High off-diagonal values → multicollinearity → VIF high → unstable OLS

Interview Classic: Correlation ≠ Independence

$\rho = 0$ only means there is no linear relationship. Example: $X \sim \text{Uniform}(-1,1)$, $Y = X^2$: $\rho = 0$, yet $Y$ is completely determined by $X$ (nonlinear dependence). Independence implies $\rho = 0$, but $\rho = 0$ does not imply independence.
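The $Y = X^2$ example is easy to verify numerically; a sketch using a symmetric grid so the odd moments vanish exactly:

```python
import numpy as np

x = np.linspace(-1, 1, 1001)  # symmetric around 0, stands in for Uniform(-1, 1)
y = x ** 2                    # fully determined by x, but nonlinearly

rho = np.corrcoef(x, y)[0, 1]  # Cov(X, X^2) = E[X^3] = 0 by symmetry
print(abs(rho) < 1e-10)        # True: zero correlation despite perfect dependence
```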

K-Means: Clustering as Distance Minimization

At its mathematical core, K-Means is a distance minimization problem, directly connecting the concepts of vectors, norms, and distance:

$$\min_{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K} \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|_2^2$$
  • Assign step: each point is assigned to the centroid nearest in L2 distance (a Voronoi tessellation)
  • Update step: each centroid becomes the mean of its assigned points (because the mean minimizes squared L2 distance)
  • With L1 distance instead, the optimal center is the coordinate-wise median → K-Medians (a close relative of K-Medoids)
import numpy as np

# K-Means from scratch using vector operations
def kmeans(X, K, max_iters=100):
    # Initialize centroids randomly
    centroids = X[np.random.choice(len(X), K, replace=False)]

    for _ in range(max_iters):
        # Assign: compute L2 distance from each point to each centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)  # [n, K]
        labels = distances.argmin(axis=1)  # nearest centroid

        # Update: new centroid = mean of assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

X = np.random.randn(200, 2)
labels, centroids = kmeans(X, K=3)

Every step uses concepts from this page: vector subtraction ($\mathbf{x}_i - \boldsymbol{\mu}_k$), the L2 norm ($\|\cdot\|_2$), and the mean as the L2 minimizer.
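The update-step claim (mean minimizes squared L2 cost, median minimizes L1 cost) can be verified directly in one dimension; a sketch that brute-forces both objectives over a grid of candidate centers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 10.0])
candidates = np.linspace(0, 12, 1201)  # candidate center locations, step 0.01

# cost of placing the center at each candidate, under each objective
sq_cost = ((data[None, :] - candidates[:, None]) ** 2).sum(axis=1)  # squared L2
l1_cost = np.abs(data[None, :] - candidates[:, None]).sum(axis=1)   # L1

best_sq = candidates[sq_cost.argmin()]
best_l1 = candidates[l1_cost.argmin()]
print(best_sq, data.mean())       # both ≈ 4.0: mean minimizes squared L2
print(best_l1, np.median(data))   # any center in [2, 3] minimizes L1; median 2.5 is one
```

Note how the outlier 10 drags the mean (L2 center) to 4.0, while the L1-optimal region stays between the middle points: the same robustness argument behind K-Medians.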

Interview Signals

What interviewers listen for:

  • You can connect dot products to ML predictions (linear regression, attention, cosine similarity)
  • You know the geometric difference between the L1 and L2 norms and its effect on regularization
  • You can explain why Cauchy-Schwarz bounds cosine similarity to [-1, 1]
  • You understand that correlation is the cosine similarity of centered vectors
  • You know the triangle inequality is what makes a distance a valid metric (KNN, K-Means, DBSCAN)

Practice

Flashcards


What are the most important applications of the dot product in ML?

(1) Linear regression: prediction = wᵀx + b. (2) Attention: score = qᵀk. (3) Cosine similarity: embedding similarity. (4) SVM: decision boundary wᵀx + b = 0. Almost every "score" in ML is some kind of dot product.


Quiz


Which linear algebra operation does a linear regression prediction use?
