Least Squares & Applications

Course Reference

Based on Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford). Covers Chapters 12-19.

Why This Matters

Least squares is the most fundamental optimization framework in ML — linear regression is least squares, Ridge/Lasso are regularized least squares, and neural network training is nonlinear least squares. Understanding these lets you see the core of ML through one unified mathematical language.

The Least Squares Problem (Ch 12)

Problem Formulation

The least squares problem: find $\mathbf{x}$ that minimizes the sum of squared residuals:

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$$

where $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{x} \in \mathbb{R}^n$.

Expanding: $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = \mathbf{x}^\top \mathbf{A}^\top \mathbf{A} \mathbf{x} - 2\mathbf{b}^\top \mathbf{A} \mathbf{x} + \mathbf{b}^\top \mathbf{b}$

This is a quadratic function of $\mathbf{x}$, and $\mathbf{A}^\top \mathbf{A}$ is PSD → convex → a unique global minimum exists (if $\mathbf{A}$ has full column rank).

Solution via Normal Equation

Take the gradient and set it to zero:

$$\nabla_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{0}$$

This gives the normal equation:

$$\mathbf{A}^\top \mathbf{A} \mathbf{x} = \mathbf{A}^\top \mathbf{b}$$

If $\mathbf{A}^\top \mathbf{A}$ is invertible (i.e., $\mathbf{A}$ has full column rank):

$$\hat{\mathbf{x}} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \mathbf{b}$$

$(\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top$ is exactly the pseudo-inverse $\mathbf{A}^+$ of $\mathbf{A}$.

Geometric Interpretation

The most important intuition: the least squares solution $\hat{\mathbf{x}}$ makes $\mathbf{A}\hat{\mathbf{x}}$ the orthogonal projection of $\mathbf{b}$ onto the column space of $\mathbf{A}$.

The residual $\mathbf{r} = \mathbf{b} - \mathbf{A}\hat{\mathbf{x}}$ is orthogonal to the column space of $\mathbf{A}$:

$$\mathbf{A}^\top (\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0}$$

This is the geometric meaning of the normal equation: the residual must be orthogonal to the column space for the projection to be the closest point.

import numpy as np

# Problem: fit y = a + b*x to data
x_data = np.array([1, 2, 3, 4, 5], dtype=float)
y_data = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix: [1, x]
A = np.column_stack([np.ones_like(x_data), x_data])
b = y_data

# Method 1: Normal equation (explicit)
x_hat = np.linalg.inv(A.T @ A) @ A.T @ b
print(f"Normal equation: intercept={x_hat[0]:.3f}, slope={x_hat[1]:.3f}")

# Method 2: np.linalg.lstsq (numerically stable, uses SVD internally)
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(f"lstsq: intercept={x_lstsq[0]:.3f}, slope={x_lstsq[1]:.3f}")

# Method 3: np.linalg.solve (solve normal equation directly)
x_solve = np.linalg.solve(A.T @ A, A.T @ b)

# Verify all three give the same answer
assert np.allclose(x_hat, x_lstsq)
assert np.allclose(x_hat, x_solve)

# Verify residual orthogonality: A^T (b - Ax) = 0
residual = b - A @ x_hat
orthogonality = A.T @ residual
print(f"A^T * residual = {orthogonality}")  # ≈ [0, 0] (machine precision)
assert np.allclose(orthogonality, 0, atol=1e-10)  # True ✓

Interview Connection: Normal Equation vs Gradient Descent

"When do you use the normal equation vs gradient descent?" → Normal equation: $O(n^3)$ to compute the inverse, good when there are few features (small $n$). Gradient descent: $O(kn^2)$, good when there are many features or the data doesn't fit in memory. Sklearn's LinearRegression uses SVD-based lstsq under the hood (it does not invert $\mathbf{A}^\top\mathbf{A}$ directly).
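A toy comparison (the problem here is made up for the sketch): gradient descent on the same objective, with step size below $1/L$, reaches the same minimizer as the normal equation:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(50, 3)
b = np.random.randn(50)

# Closed form via the normal equation: O(n^3), one shot
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient descent: many cheap steps
# gradient of ||Ax - b||^2 is 2 A^T (Ax - b); Lipschitz constant L = 2*sigma_max(A)^2
lr = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
x = np.zeros(3)
for _ in range(5000):
    x -= lr * 2 * A.T @ (A @ x - b)

assert np.allclose(x, x_ne, atol=1e-8)  # both reach the same minimizer
```

The step size $1/L$ guarantees convergence because the objective's gradient is $L$-Lipschitz; in practice one would use a line search or a library routine.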

Least Squares Data Fitting (Ch 13)

Polynomial Fitting

Polynomial regression is still linear least squares at heart — the features just become $1, x, x^2, \ldots, x^p$.

Design matrix for a degree-$p$ polynomial:

$$\mathbf{A} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ 1 & x_2 & x_2^2 & \cdots & x_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^p \end{bmatrix}$$

The model is a linear function of the parameters → solvable with the normal equation. The prediction is a nonlinear function of the input, but the fit is linear in the parameters.

Overfitting and Model Complexity

A high-degree polynomial can perfectly fit the training data (a degree n-1 polynomial passes through any n points) but generalizes poorly.

The core tradeoff:

  • Underfitting (degree too low): high bias, low variance → both training and test error are large
  • Overfitting (degree too high): low bias, high variance → small training error but large test error
  • Sweet spot: bias-variance balance → choose the degree by cross-validation

ML Connections

| Concept | ML Connection |
| --- | --- |
| Polynomial fitting | Polynomial regression — design matrix with polynomial features |
| Basis expansion | Feature engineering: create nonlinear features, then linear fit. Kernel trick in SVM: implicit high-dim features |
| Overfitting | Universal problem — regularization, early stopping, and dropout are all countermeasures |
| Cross-validation | Standard model selection — K-fold CV for hyperparameter tuning |
| Design matrix | Feature matrix X in sklearn — each row = sample, each column = feature |

import numpy as np

np.random.seed(42)

# True function: y = 0.5 + 2x - 0.3x^2 + noise
n = 30
x = np.linspace(0, 5, n)
y_true = 0.5 + 2 * x - 0.3 * x**2
y = y_true + np.random.randn(n) * 0.5

# Polynomial fitting with different degrees
for degree in [1, 3, 15]:
    # Build design matrix: [1, x, x^2, ..., x^degree]
    A = np.column_stack([x**k for k in range(degree + 1)])
    coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
    y_pred = A @ coeffs
    train_error = np.mean((y - y_pred)**2)
    print(f"Degree {degree:2d}: train MSE = {train_error:.4f}, coeffs range = [{coeffs.min():.2f}, {coeffs.max():.2f}]")

# Overfitting demo: train/test split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x_train, x_test = x[:20].reshape(-1, 1), x[20:].reshape(-1, 1)
y_train, y_test = y[:20], y[20:]

for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = np.mean((model.predict(x_train) - y_train)**2)
    test_mse = np.mean((model.predict(x_test) - y_test)**2)
    print(f"Degree {degree:2d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
    # High degree: train MSE ≈ 0 but test MSE explodes → overfitting!

# Cross-validation for degree selection
best_degree, best_cv = 1, -np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x.reshape(-1, 1), y, cv=5, scoring='neg_mean_squared_error')
    cv_mse = -scores.mean()
    if scores.mean() > best_cv:
        best_cv = scores.mean()
        best_degree = degree
    print(f"Degree {degree}: CV MSE = {cv_mse:.3f}")
print(f"Best degree by CV: {best_degree}")

Interview Connection: Polynomial Regression Is Not Nonlinear Regression

Common misconception: polynomial regression is not nonlinear regression. It is linear regression with polynomial features. "Linear" refers to a linear combination of the parameters, not a linear function of the input. $y = w_0 + w_1 x + w_2 x^2$ is linear in $w_0, w_1, w_2$, so it can be solved with the normal equation.

Least Squares Classification (Ch 14)

Binary Classification via Least Squares

Forcing classification into the least squares mold: set the labels to $+1$ and $-1$, then minimize $\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2$.

Decision boundary: $\mathbf{w}^\top \mathbf{x} = 0$

Intuition: least squares tries to make the predictions $\mathbf{X}\mathbf{w}$ as close as possible to the target values (+1 or -1). The problems:

Why LS Classification is Suboptimal

| Issue | Explanation |
| --- | --- |
| Unbounded predictions | LS predictions can be any real number, not a probability. Correctly classified points with high confidence have large residuals → they get penalized anyway |
| Outlier sensitivity | Correctly classified points far from the boundary also produce large residuals → they distort the decision boundary |
| Squared loss for classification | Squared loss is not a natural loss for classification — correct points far from the boundary should not incur large loss |

Better Alternatives

$$\text{LS: minimize } \sum (w^\top x_i - y_i)^2 \quad \text{vs} \quad \text{Logistic: minimize } \sum \log(1 + e^{-y_i w^\top x_i})$$

The benefit of logistic loss: for correctly classified points with high confidence, the loss approaches 0 (they are not penalized).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

np.random.seed(42)

# Generate binary classification data
X, y_01 = make_classification(n_samples=200, n_features=2, n_redundant=0,
                               n_informative=2, random_state=42, n_clusters_per_class=1)
y_pm1 = 2 * y_01 - 1  # Convert to {-1, +1}

# Split
X_train, X_test = X[:150], X[150:]
y_train, y_test = y_pm1[:150], y_pm1[150:]
y_train_01, y_test_01 = y_01[:150], y_01[150:]

# LS Classifier: minimize ||Xw - y||^2
# Add bias column
X_train_b = np.column_stack([np.ones(len(X_train)), X_train])
X_test_b = np.column_stack([np.ones(len(X_test)), X_test])

w_ls = np.linalg.lstsq(X_train_b, y_train, rcond=None)[0]
y_pred_ls = np.sign(X_test_b @ w_ls)
acc_ls = accuracy_score(y_test, y_pred_ls)

# Logistic Regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train_01)
acc_lr = accuracy_score(y_test_01, lr.predict(X_test))

print(f"LS Classifier accuracy:       {acc_ls:.3f}")
print(f"Logistic Regression accuracy:  {acc_lr:.3f}")

# Show why LS is suboptimal: predictions are unbounded
ls_predictions = X_test_b @ w_ls
print(f"LS prediction range: [{ls_predictions.min():.2f}, {ls_predictions.max():.2f}]")
print("LS predictions are unbounded — not probabilities!")

# Logistic outputs proper probabilities
lr_probs = lr.predict_proba(X_test)[:, 1]
print(f"Logistic prob range: [{lr_probs.min():.3f}, {lr_probs.max():.3f}]")

# Add outliers to see LS sensitivity
X_outlier = np.array([[10, 10], [12, 12], [15, 15]])
y_outlier = np.array([1, 1, 1])  # Correct class but far away
X_train2 = np.vstack([X_train, X_outlier])
y_train2 = np.concatenate([y_train, y_outlier])
X_train2_b = np.column_stack([np.ones(len(X_train2)), X_train2])

w_ls2 = np.linalg.lstsq(X_train2_b, y_train2, rcond=None)[0]
y_pred_ls2 = np.sign(X_test_b @ w_ls2)
acc_ls2 = accuracy_score(y_test, y_pred_ls2)
print(f"LS accuracy with outliers: {acc_ls2:.3f}")
# Outliers distort the LS boundary → accuracy degrades

Interview Connection: Why Not Use MSE for Classification?

"Why isn't MSE loss used for classification?" → MSE gives a large penalty even to correctly classified points far from the decision boundary (their predictions are far from +1/-1). Logistic/cross-entropy loss approaches 0 for confident correct predictions. Moreover, with a sigmoid output the gradient of MSE loss vanishes in high-confidence regions (gradient vanishing), which hurts optimization.
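A quick numeric sketch of the gradient-vanishing claim (the numbers are illustrative): with a sigmoid output and target $y = 1$, a confidently wrong logit gives an almost-zero MSE gradient but a full-strength cross-entropy gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0          # logit: a confidently WRONG prediction for target y = 1
p = sigmoid(z)     # ~ 4.5e-5

# Gradient of squared loss (p - y)^2 w.r.t. the logit z:
#   d/dz (p - 1)^2 = 2 (p - 1) * p * (1 - p)   -> vanishes as p -> 0 or 1
grad_mse = 2 * (p - 1) * p * (1 - p)

# Gradient of cross-entropy -log(p) w.r.t. the logit z:  p - 1
grad_ce = p - 1

print(f"MSE gradient:           {grad_mse:.1e}")  # nearly zero: learning stalls
print(f"Cross-entropy gradient: {grad_ce:.1e}")   # still around -1: strong signal
```

So with MSE the model gets almost no gradient signal precisely when it is confidently wrong, which is the worst possible place to stall.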

Regularized Data Fitting (Ch 15) — Multi-Objective Least Squares

The Core Tradeoff

Regularization is multi-objective optimization: minimize two objectives simultaneously:

  1. Data fidelity: $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ — fit data well
  2. Regularization: $\|\mathbf{x}\|^2$ (or $\|\mathbf{x}\|_1$) — keep model simple

Ridge Regression (Tikhonov Regularization, L2)

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|^2$$

Closed-form solution:

$$\hat{\mathbf{x}}_{\text{ridge}} = (\mathbf{A}^\top\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A}^\top\mathbf{b}$$

The effect of $\lambda$:

  • $\lambda = 0$: reduces to OLS
  • $\lambda \to \infty$: $\hat{\mathbf{x}} \to \mathbf{0}$ (infinite shrinkage)
  • $\lambda\mathbf{I}$ makes $\mathbf{A}^\top\mathbf{A} + \lambda\mathbf{I}$ always invertible → fixes multicollinearity

Lasso (L1 Regularization)

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|_1$$

No closed form (the L1 norm is not differentiable at 0) → solved with coordinate descent or ISTA.

Key difference: L1 produces sparse solutions (many coefficients exactly 0) → automatic feature selection.
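A minimal ISTA sketch (the data and $\lambda$ here are made up for the demo): each iteration takes a gradient step on $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2$ and then soft-thresholds, and the thresholding step is what creates the exact zeros:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(60, 10)
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]            # sparse ground truth
b = A @ x_true + 0.05 * np.random.randn(60)

lam = 5.0                                 # L1 penalty weight
t = 1.0 / (2 * np.linalg.norm(A, 2)**2)   # step size 1/L; gradient is 2 A^T (Ax - b)

def soft_threshold(v, thr):
    # prox operator of thr * ||.||_1: shrink toward 0, zeroing small entries
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

x = np.zeros(10)
for _ in range(2000):
    x = soft_threshold(x - t * 2 * A.T @ (A @ x - b), t * lam)

print("nonzero coefficients:", np.flatnonzero(np.abs(x) > 1e-8))
# the exact zeros come from the thresholding step, not from rounding
```

Coordinate descent (what sklearn's Lasso uses) reaches the same minimizer; ISTA is just the simplest method to write down.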

Elastic Net

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda_1\|\mathbf{x}\|_1 + \lambda_2\|\mathbf{x}\|^2$$

Combines the sparsity of L1 with the stability of L2 (Lasso is unstable with correlated features; Elastic Net is more robust).

Bayesian Interpretation

| Regularization | Prior Distribution | MAP Estimation |
| --- | --- | --- |
| Ridge (L2) | $w_j \sim \mathcal{N}(0, \sigma_w^2)$ — Gaussian prior | MAP with Gaussian prior = Ridge |
| Lasso (L1) | $w_j \sim \text{Laplace}(0, b)$ — Laplace prior (sharp peak at 0) | MAP with Laplace prior = Lasso |
| No regularization | Uniform (improper) prior | MLE = OLS |

The Bayesian view: regularization = incorporating prior belief about the parameters. Ridge believes the weights come from a Gaussian (not too large, but not exactly 0). Lasso believes the weights come from a Laplace distribution (many exactly 0, a few nonzero).
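The Ridge case can be derived in one line. Assuming likelihood $\mathbf{b}\mid\mathbf{x} \sim \mathcal{N}(\mathbf{A}\mathbf{x}, \sigma^2\mathbf{I})$ and prior $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2\mathbf{I})$, minimizing the negative log-posterior gives:

```latex
\hat{\mathbf{x}}_{\text{MAP}}
= \arg\min_{\mathbf{x}}\left[\tfrac{1}{2\sigma^2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2
  + \tfrac{1}{2\sigma_w^2}\|\mathbf{x}\|^2\right]
= \arg\min_{\mathbf{x}}\left[\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2
  + \lambda\|\mathbf{x}\|^2\right],
\qquad \lambda = \frac{\sigma^2}{\sigma_w^2}
```

A tighter prior (smaller $\sigma_w^2$) means a larger $\lambda$ — stronger shrinkage.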

Bias-Variance Tradeoff

$$\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

  • $\lambda = 0$ (OLS): zero bias, high variance → overfitting
  • $\lambda$ large: high bias, low variance → underfitting
  • Optimal $\lambda$: minimizes total MSE → chosen by cross-validation

ML Connections

| Concept | ML Application |
| --- | --- |
| Ridge | sklearn Ridge, weight decay in neural nets ($\lambda\|\mathbf{w}\|^2$ in loss) |
| Lasso | Feature selection, sparse models, compressed sensing |
| Elastic Net | Genomics (correlated genes), text (correlated words) |
| Regularization path | Plot coefficients vs $\lambda$ → see which features survive |
| CV for lambda | sklearn RidgeCV, LassoCV — automated $\lambda$ selection |
| Weight decay | Same as L2 regularization — standard in all neural network training |

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

np.random.seed(42)

# Generate data with many features (some irrelevant)
n, p = 100, 20
X = np.random.randn(n, p)
true_w = np.zeros(p)
true_w[:5] = [3, -2, 1.5, -1, 0.5]  # Only first 5 features matter
y = X @ true_w + np.random.randn(n) * 0.5

# Compare OLS vs Ridge vs Lasso
models = {
    'OLS': LinearRegression(),
    'Ridge (lambda=1)': Ridge(alpha=1.0),
    'Lasso (lambda=0.1)': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    coefs = model.coef_
    nonzero = np.sum(np.abs(coefs) > 0.01)
    cv = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{name:25s}: nonzero={nonzero:2d}/{p}, CV MSE={-cv.mean():.3f}")

# Lasso produces sparse solution: only ~5 nonzero coefficients ✓
print(f"\nTrue weights:  {true_w.round(2)}")
print(f"Lasso weights: {Lasso(alpha=0.1).fit(X, y).coef_.round(2)}")

# Ridge closed-form: x = (A^T A + lambda I)^{-1} A^T b
lam = 1.0
x_ridge_manual = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
x_ridge_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
assert np.allclose(x_ridge_manual, x_ridge_sklearn, atol=1e-6)

# Cross-validation for lambda selection
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)
ridge_cv.fit(X, y)
print(f"\nBest Ridge lambda by CV: {ridge_cv.alpha_:.4f}")

lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, random_state=42)
lasso_cv.fit(X, y)
print(f"Best Lasso lambda by CV: {lasso_cv.alpha_:.4f}")

# Regularization path: how coefficients change with lambda
alphas = np.logspace(-3, 2, 50)
ridge_coefs = []
for a in alphas:
    model = Ridge(alpha=a, fit_intercept=False)
    model.fit(X, y)
    ridge_coefs.append(model.coef_.copy())
ridge_coefs = np.array(ridge_coefs)
# As lambda increases: all coefficients shrink toward 0 (but never exactly 0)
print(f"\nRidge path: at lambda=0.001, max|coef|={np.abs(ridge_coefs[0]).max():.3f}")
print(f"Ridge path: at lambda=100,   max|coef|={np.abs(ridge_coefs[-1]).max():.3f}")

Interview Connection: Weight Decay = L2 Regularization

"How does neural network weight decay relate to Ridge regression?" → They are the same thing. The weight_decay parameter of a PyTorch optimizer effectively adds $\lambda\|\mathbf{w}\|^2$ to the loss. AdamW decouples weight decay from Adam's adaptive learning rate (decoupled weight decay), which works better.
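A numpy sketch of the equivalence (ignoring PyTorch's exact factor conventions; data is made up): the same update written once as the gradient of the ridge loss and once as an LS gradient plus a decay term produces identical iterates, both matching the ridge closed form:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(40, 5)
b = np.random.randn(40)
lam, lr, steps = 0.5, 0.005, 4000

# (1) Gradient descent on the explicit ridge loss ||Aw - b||^2 + lam*||w||^2
w_ridge = np.zeros(5)
for _ in range(steps):
    w_ridge -= lr * (2 * A.T @ (A @ w_ridge - b) + 2 * lam * w_ridge)

# (2) Plain LS gradient plus a separate "weight decay" term in the update rule
w_decay = np.zeros(5)
for _ in range(steps):
    grad = 2 * A.T @ (A @ w_decay - b)              # gradient of the unregularized loss
    w_decay -= lr * grad + lr * 2 * lam * w_decay   # decay folded into the step

# Both match the ridge closed form
w_closed = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)
assert np.allclose(w_ridge, w_decay)
assert np.allclose(w_ridge, w_closed, atol=1e-6)
```

The two loops compute the same arithmetic, which is exactly the point: "decay the weights a little each step" is gradient descent on a ridge objective.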

Control via Least Squares

The control problem: given a dynamical system $\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t$, find an input sequence $\mathbf{u}_0, \ldots, \mathbf{u}_{T-1}$ that drives the state to a target.

This can be formulated as least squares: minimize $\sum_t \|\mathbf{x}_t - \mathbf{x}_{\text{target}}\|^2 + \rho\|\mathbf{u}_t\|^2$. The first term pushes the state toward the target; the second penalizes large inputs (saving energy / avoiding extreme actions).

Connection to RL: reward shaping + action penalties in RL are essentially multi-objective least squares. Large $\rho$ → conservative control (small actions); small $\rho$ → aggressive control.

Estimation and Inversion

The estimation problem: recover the true signal $\mathbf{x}$ from noisy observations $\mathbf{y} = A\mathbf{x} + \mathbf{v}$.

$$\hat{\mathbf{x}} = \arg\min \|\mathbf{y} - A\mathbf{x}\|^2 + \lambda\|\mathbf{x}\|^2$$

The first term fits the observations; the second regularizes (prevents noise amplification). Same spirit as denoising — a tradeoff between noise reduction and signal preservation.

Applications in ML: denoising autoencoders, signal processing, inverse problems (CT scan reconstruction).

import numpy as np

# Control: steer system from x0 to target using least squares
A = np.array([[1, 0.1], [0, 1]])  # simple dynamics
B = np.array([[0], [0.1]])        # input affects velocity

T = 10  # time horizon
n, m = 2, 1
x0 = np.array([0, 0])
x_target = np.array([1, 0])  # reach position 1, stop

# Final state is linear in the inputs: x_T = A^T x0 + sum_k A^(T-1-k) B u_k
# (A^T here means the T-th power of A). Stack the inputs into one vector u and
# solve the regularized LS problem: min ||G u - (x_target - A^T x0)||^2 + rho*||u||^2
rho = 0.1
G = np.hstack([np.linalg.matrix_power(A, T - 1 - k) @ B for k in range(T)])  # (n, T*m)
c = x_target - np.linalg.matrix_power(A, T) @ x0
u = np.linalg.solve(G.T @ G + rho * np.eye(T * m), G.T @ c)
x_final = np.linalg.matrix_power(A, T) @ x0 + G @ u
print(f"Final state: {x_final.round(3)}")  # larger rho -> smaller inputs, further from target

# Estimation: recover signal from noisy observation
x_true = np.array([1.0, 2.0, 3.0])
A_obs = np.random.randn(10, 3)  # observation matrix (more observations than unknowns)
noise = np.random.randn(10) * 0.5
y_obs = A_obs @ x_true + noise

# Simple LS estimate
x_hat = np.linalg.lstsq(A_obs, y_obs, rcond=None)[0]
print(f"True:      {x_true}")
print(f"Estimated: {x_hat.round(3)}")  # close to true

# Regularized estimate (Ridge) — more robust to noise
lam = 0.1
x_hat_reg = np.linalg.inv(A_obs.T @ A_obs + lam * np.eye(3)) @ A_obs.T @ y_obs
print(f"Ridge est: {x_hat_reg.round(3)}")

Constrained Least Squares (Ch 16-17)

Problem Formulation

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 \quad \text{subject to} \quad \mathbf{C}\mathbf{x} = \mathbf{d}$$

Least squares with equality constraints. Solve via the Lagrangian:

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\nu}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \boldsymbol{\nu}^\top(\mathbf{C}\mathbf{x} - \mathbf{d})$$

KKT Conditions

Setting the gradient to zero gives the KKT system:

$$\begin{bmatrix} 2\mathbf{A}^\top\mathbf{A} & \mathbf{C}^\top \\ \mathbf{C} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \boldsymbol{\nu} \end{bmatrix} = \begin{bmatrix} 2\mathbf{A}^\top\mathbf{b} \\ \mathbf{d} \end{bmatrix}$$

Solving this block system yields the optimal $\mathbf{x}$ and the Lagrange multipliers $\boldsymbol{\nu}$ simultaneously.

Intuition: $\boldsymbol{\nu}$ tells you the sensitivity of the optimal value to the constraint — how much the objective would improve if the constraint were slightly relaxed.

ML Connections

| Concept | ML Application |
| --- | --- |
| Constrained optimization | SVM: minimize $\|\mathbf{w}\|^2$ s.t. $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ (constrained QP) |
| Lagrange multipliers | Support vectors = points where multiplier $\alpha_i > 0$ (active constraints) |
| Portfolio optimization | Minimize variance s.t. expected return = target (Markowitz) |
| Equality constraints | Softmax normalization ($\sum p_i = 1$), probability simplex constraints |
| KKT conditions | Foundation of convex optimization — used in interior-point methods |

import numpy as np
from scipy.optimize import minimize

np.random.seed(42)

# Constrained LS: minimize ||Ax - b||^2 subject to Cx = d
A = np.random.randn(10, 4)
b = np.random.randn(10)
C = np.array([[1, 1, 1, 1]])  # constraint: sum of x_i = 1
d = np.array([1.0])

# Method 1: Solve KKT system directly
n = A.shape[1]
m = C.shape[0]
KKT = np.block([
    [2 * A.T @ A, C.T],
    [C, np.zeros((m, m))]
])
rhs = np.concatenate([2 * A.T @ b, d])
sol = np.linalg.solve(KKT, rhs)
x_kkt = sol[:n]
nu = sol[n:]  # Lagrange multiplier
print(f"KKT solution: x = {x_kkt.round(4)}")
print(f"Sum of x = {x_kkt.sum():.6f}")  # = 1.0 ✓
print(f"Lagrange multiplier nu = {nu.round(4)}")

# Method 2: scipy.optimize with constraint
def objective(x):
    return np.sum((A @ x - b)**2)

constraint = {'type': 'eq', 'fun': lambda x: C @ x - d}
result = minimize(objective, np.zeros(4), constraints=constraint)
print(f"scipy solution: x = {result.x.round(4)}")
assert np.allclose(x_kkt, result.x, atol=1e-4)

# Portfolio optimization example
# 5 assets, minimize variance s.t. expected return = target, weights sum to 1
n_assets = 5
returns = np.array([0.10, 0.12, 0.08, 0.15, 0.11])  # expected returns
cov_matrix = np.array([
    [0.04, 0.006, 0.002, 0.01, 0.003],
    [0.006, 0.09, 0.004, 0.015, 0.005],
    [0.002, 0.004, 0.01, 0.003, 0.002],
    [0.01, 0.015, 0.003, 0.16, 0.008],
    [0.003, 0.005, 0.002, 0.008, 0.025],
])

target_return = 0.12

def portfolio_variance(w):
    return w @ cov_matrix @ w

constraints = [
    {'type': 'eq', 'fun': lambda w: np.sum(w) - 1},         # weights sum to 1
    {'type': 'eq', 'fun': lambda w: w @ returns - target_return},  # target return
]
bounds = [(0, 1)] * n_assets  # no short selling

result = minimize(portfolio_variance, np.ones(n_assets) / n_assets,
                  constraints=constraints, bounds=bounds, method='SLSQP')
print(f"\nOptimal portfolio weights: {result.x.round(4)}")
print(f"Portfolio return: {result.x @ returns:.4f}")
print(f"Portfolio std:    {np.sqrt(result.fun):.4f}")

Linear Quadratic Regulator (LQR)

LQR is the classic application of constrained optimization in control:

$$\min_{\mathbf{u}_0, \ldots, \mathbf{u}_{T-1}} \sum_{t=0}^{T} \mathbf{x}_t^\top Q \mathbf{x}_t + \sum_{t=0}^{T-1} \mathbf{u}_t^\top R \mathbf{u}_t \quad \text{s.t. } \mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t$$

  • $Q$: penalizes state deviation (PSD matrix — which states must stay near 0)
  • $R$: penalizes control effort (PD matrix — avoids extreme actions)
  • The solution is linear feedback: $\mathbf{u}_t = -K\mathbf{x}_t$ (the optimal gain matrix $K$ is determined by the Riccati equation)

ML connection: continuous control in RL (robotics) is essentially a nonlinear generalization of LQR. Policy gradient methods learn $K$ when the dynamics are unknown. Model-based RL uses a learned dynamics model + LQR as a planning module.
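A minimal finite-horizon LQR sketch (the dynamics matrices are illustrative, reusing the double integrator from the control example): a backward Riccati recursion computes the gains $K_t$, then a forward rollout applies the linear feedback $\mathbf{u}_t = -K_t\mathbf{x}_t$:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])  # double-integrator dynamics
B = np.array([[0.0], [0.1]])
Q = np.eye(2)           # state cost (PSD)
R = np.array([[0.1]])   # control cost (PD)
T = 50                  # horizon

# Backward Riccati recursion: P_T = Q, then
#   K_t = (R + B^T P B)^{-1} B^T P A,   P_t = Q + A^T P (A - B K_t)
P = Q.copy()
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()          # gains[t] is the feedback gain for step t

# Forward rollout: linear feedback drives the state toward the origin
x = np.array([1.0, 0.0])
for t in range(T):
    u = -gains[t] @ x
    x = A @ x + B @ u
print(f"State after {T} steps: {x.round(4)}")
```

For the infinite-horizon gain, iterate the recursion to convergence (or use scipy.linalg.solve_discrete_are).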

Linear Quadratic State Estimation (Kalman Filter)

The Kalman filter = optimal state estimation for linear systems with Gaussian noise:

$$\text{Predict: } \hat{\mathbf{x}}_{t|t-1} = A\hat{\mathbf{x}}_{t-1} + B\mathbf{u}_{t-1} \qquad \text{Update: } \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t(\mathbf{y}_t - C\hat{\mathbf{x}}_{t|t-1})$$

$K_t$ = the Kalman gain — it balances prediction vs observation. High prediction uncertainty → trust the observation more. High observation noise → trust the prediction more.

ML connections:

| Application | Connection |
| --- | --- |
| Time series | Kalman filter = Bayesian state estimation for linear time series |
| Sensor fusion | Combine GPS + IMU readings optimally |
| Object tracking | Predict + correct object position in video |
| RNN analogy | Kalman: predict → observe → update state. RNN: predict → input → update hidden state |

import numpy as np

# Simple Kalman Filter: track position from noisy measurements
A = np.array([[1, 1], [0, 1]])     # state transition: [pos, vel]
C = np.array([[1, 0]])              # observe position only
Q_noise = np.eye(2) * 0.01         # process noise
R_noise = np.array([[1.0]])         # measurement noise

# Initialize
x_hat = np.array([0.0, 0.0])      # initial state estimate
P = np.eye(2)                       # initial covariance

# True trajectory
x_true = np.array([0.0, 1.0])      # start at 0, velocity 1
measurements = []
estimates = []

for t in range(20):
    # True state evolution
    x_true = A @ x_true + np.random.multivariate_normal([0,0], Q_noise)
    z = C @ x_true + np.random.normal(0, np.sqrt(R_noise[0,0]))

    # Predict
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q_noise

    # Update (Kalman gain)
    S = C @ P_pred @ C.T + R_noise
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_hat = x_pred + K.flatten() * (z - C @ x_pred)
    P = (np.eye(2) - K @ C) @ P_pred

    estimates.append(x_hat.copy())

print(f"True final state:      {x_true.round(2)}")
print(f"Estimated final state: {x_hat.round(2)}")

LQR + Kalman = LQG

LQR (optimal control) + Kalman filter (optimal estimation) = LQG (Linear Quadratic Gaussian) — the separation principle lets you design the controller and the estimator independently. This is a fundamental result of control theory and the theoretical basis of model-based RL.

Nonlinear Least Squares (Ch 18-19)

Problem Formulation

When the model is nonlinear in parameters:

$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{m} (f(\mathbf{x}_i; \boldsymbol{\theta}) - y_i)^2 = \|\mathbf{f}(\boldsymbol{\theta}) - \mathbf{y}\|^2$$

The normal equation doesn't apply (because $f$ is not linear in $\boldsymbol{\theta}$) → iterative methods are needed.

Gauss-Newton Algorithm

Core idea: linearize $\mathbf{f}(\boldsymbol{\theta})$ around the current $\boldsymbol{\theta}_k$ (Taylor expansion):

$$\mathbf{f}(\boldsymbol{\theta}) \approx \mathbf{f}(\boldsymbol{\theta}_k) + \mathbf{J}_k (\boldsymbol{\theta} - \boldsymbol{\theta}_k)$$

where $\mathbf{J}_k$ is the Jacobian of $\mathbf{f}$ at $\boldsymbol{\theta}_k$.

Substituting into the least squares problem turns it into linear least squares:

$$\min_{\Delta\boldsymbol{\theta}} \|\mathbf{J}_k \Delta\boldsymbol{\theta} - (\mathbf{y} - \mathbf{f}(\boldsymbol{\theta}_k))\|^2$$

Normal equation for the update: $\mathbf{J}_k^\top \mathbf{J}_k \Delta\boldsymbol{\theta} = \mathbf{J}_k^\top (\mathbf{y} - \mathbf{f}(\boldsymbol{\theta}_k))$

Then set $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \Delta\boldsymbol{\theta}$ and repeat.
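The loop above fits in a few lines. A hand-rolled sketch on an illustrative exponential model (data and starting point are made up for the demo):

```python
import numpy as np

# Gauss-Newton for y = a * exp(-b * x)
np.random.seed(0)
x = np.linspace(0, 4, 40)
y = 2.0 * np.exp(-0.7 * x) + 0.01 * np.random.randn(40)

theta = np.array([1.5, 0.5])   # initial guess for (a, b)
for _ in range(20):
    a, b = theta
    f = a * np.exp(-b * x)                          # current predictions f(theta)
    J = np.column_stack([np.exp(-b * x),            # df/da
                         -a * x * np.exp(-b * x)])  # df/db
    # Linearized LS step: solve J^T J delta = J^T (y - f)
    delta = np.linalg.solve(J.T @ J, J.T @ (y - f))
    theta = theta + delta

print(f"Estimated (a, b): {theta.round(3)}")  # close to the true (2.0, 0.7)
```

Each iteration solves one linear least squares problem; for a poor starting point the step can overshoot, which is exactly what Levenberg-Marquardt's damping (next section) fixes.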

Levenberg-Marquardt

Gauss-Newton + regularization (a trust region):

$$(\mathbf{J}^\top \mathbf{J} + \mu\mathbf{I})\Delta\boldsymbol{\theta} = \mathbf{J}^\top(\mathbf{y} - \mathbf{f}(\boldsymbol{\theta}))$$

  • Large $\mu$ → close to gradient descent (small, careful steps)
  • Small $\mu$ → close to Gauss-Newton (aggressive steps)
  • $\mu$ is adapted automatically: good step → decrease $\mu$ (more aggressive); bad step → increase $\mu$ (more cautious)

Connection to Neural Networks

Neural network training is essentially nonlinear least squares (when using MSE loss):

$$\min_{\boldsymbol{\theta}} \|\text{NN}(\mathbf{X}; \boldsymbol{\theta}) - \mathbf{y}\|^2$$
  • Backpropagation = computing gradient of this nonlinear LS problem
  • SGD = simplified first-order alternative (gradient only, no Jacobian/Hessian information)
  • Natural gradient / K-FAC = use curvature information similar to Gauss-Newton

Logistic regression is also a special case of nonlinear LS: $f(\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x})$ (the sigmoid makes it nonlinear in $\mathbf{w}$), though in practice it is trained with cross-entropy loss rather than squared loss.

ML Connections

| Concept | ML Application |
| --- | --- |
| Nonlinear LS | Neural network training with MSE loss |
| Gauss-Newton | Foundation for second-order optimizers (K-FAC, natural gradient) |
| Levenberg-Marquardt | scipy curve fitting, robotics, computer vision (bundle adjustment) |
| Jacobian | Backpropagation computes products of Jacobians through layers |
| Logistic regression | Nonlinear LS with sigmoid (but usually trained with cross-entropy) |
| Trust region methods | More robust than line search for non-convex problems |

import numpy as np
from scipy.optimize import least_squares, curve_fit

np.random.seed(42)

# Example 1: Nonlinear curve fitting
# True model: y = a * exp(-b * x) + c
def model(x, a, b, c):
    return a * np.exp(-b * x) + c

x_data = np.linspace(0, 5, 50)
y_true = model(x_data, 2.5, 1.3, 0.5)
y_data = y_true + np.random.randn(50) * 0.1

# Method 1: scipy.optimize.curve_fit (uses Levenberg-Marquardt)
popt, pcov = curve_fit(model, x_data, y_data, p0=[1.0, 1.0, 0.0])
print(f"curve_fit: a={popt[0]:.3f}, b={popt[1]:.3f}, c={popt[2]:.3f}")
print(f"True:      a=2.500, b=1.300, c=0.500")

# Method 2: scipy.optimize.least_squares (more control)
def residuals(params):
    a, b, c = params
    return model(x_data, a, b, c) - y_data

result = least_squares(residuals, x0=[1.0, 1.0, 0.0], method='lm')  # 'lm' = Levenberg-Marquardt
print(f"least_squares: a={result.x[0]:.3f}, b={result.x[1]:.3f}, c={result.x[2]:.3f}")
print(f"Cost (sum of squared residuals): {result.cost:.4f}")
print(f"Optimality (gradient norm): {result.optimality:.2e}")

# Example 2: Logistic regression as nonlinear LS
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Generate binary classification data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)

# Logistic regression via nonlinear least squares (minimize ||sigmoid(Xw) - y||^2)
def logistic_residuals(params):
    w = params[:2]
    b = params[2]
    pred = sigmoid(X @ w + b)
    return pred - y

result_logistic = least_squares(logistic_residuals, x0=np.zeros(3))
w_nls = result_logistic.x[:2]
b_nls = result_logistic.x[2]
pred_nls = (sigmoid(X @ w_nls + b_nls) > 0.5).astype(int)
acc_nls = np.mean(pred_nls == y)
print(f"\nLogistic via nonlinear LS: accuracy={acc_nls:.3f}")

# Compare with sklearn LogisticRegression (uses cross-entropy, not squared loss)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42)
lr.fit(X, y)
acc_lr = lr.score(X, y)
print(f"sklearn LogisticRegression:  accuracy={acc_lr:.3f}")

# Example 3: Neural network connection
# Simple 1-hidden-layer network: f(x) = W2 * relu(W1 * x + b1) + b2
# This is nonlinear LS when using MSE loss

def nn_predict(params, X):
    """Simple neural net: 2 -> 4 -> 1"""
    W1 = params[:8].reshape(4, 2)
    b1 = params[8:12]
    W2 = params[12:16].reshape(1, 4)
    b2 = params[16]
    h = np.maximum(0, X @ W1.T + b1)  # ReLU
    return (h @ W2.T + b2).flatten()

def nn_residuals(params):
    return nn_predict(params, X) - y

# Solve with Levenberg-Marquardt-like method
result_nn = least_squares(nn_residuals, x0=np.random.randn(17) * 0.1,
                          method='trf', max_nfev=1000)
pred_nn = (nn_predict(result_nn.x, X) > 0.5).astype(int)
acc_nn = np.mean(pred_nn == y)
print(f"Simple NN via nonlinear LS:  accuracy={acc_nn:.3f}")
print("Neural net training = nonlinear least squares (with MSE loss)")

Interview Connection: Second-Order vs First-Order Optimization

"What's the difference between SGD and Newton's method?" → SGD uses only the gradient (first-order); each step is simple but convergence is slow. Newton/Gauss-Newton use Hessian/Jacobian information (second-order); each step is expensive but convergence is fast. Adam uses a running average of squared gradients to approximate a diagonal Hessian → a compromise between first- and second-order.

Constrained Nonlinear Least Squares

When the optimization has both nonlinear objective and constraints:

$$\min_{\boldsymbol{\theta}} \|\mathbf{f}(\boldsymbol{\theta})\|^2 \quad \text{s.t. } g(\boldsymbol{\theta}) = 0, \quad h(\boldsymbol{\theta}) \leq 0$$

Penalty Method

Convert constrained problem into unconstrained by adding penalty:

$$\min_{\boldsymbol{\theta}} \|\mathbf{f}(\boldsymbol{\theta})\|^2 + \mu \|g(\boldsymbol{\theta})\|^2$$

The larger $\mu$, the heavier the penalty for constraint violation → the solution gets closer to feasible, but the optimization becomes more ill-conditioned.

In practice, increase $\mu$ progressively: start with a small $\mu$ (easy optimization) and grow it until the constraint is nearly satisfied.

ML connection: regularization is a penalty method — the original problem is constrained ($\|\mathbf{w}\| \leq t$), and the penalty $\lambda\|\mathbf{w}\|^2$ turns it into an unconstrained problem (Ridge regression).

Augmented Lagrangian

Combines Lagrange multipliers with the penalty method — better convergence:

$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \|\mathbf{f}(\boldsymbol{\theta})\|^2 + \boldsymbol{\lambda}^\top g(\boldsymbol{\theta}) + \frac{\mu}{2}\|g(\boldsymbol{\theta})\|^2$$

Alternately update $\boldsymbol{\theta}$ (minimize the augmented Lagrangian) and $\boldsymbol{\lambda}$ (dual update: $\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \mu g(\boldsymbol{\theta})$).

Better than the pure penalty method: you reach the exact solution without needing $\mu \to \infty$ (the Lagrange multiplier provides the additional degree of freedom).

ML connection: ADMM (Alternating Direction Method of Multipliers) is a generalization of the augmented Lagrangian — used in distributed optimization and large-scale ML.

Nonlinear Control

Extend LQR to nonlinear dynamics: $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)$ (linearity no longer assumed).

Methods:

  • iLQR (iterative LQR): Linearize dynamics around current trajectory → solve LQR → update trajectory → repeat
  • Model Predictive Control (MPC): Online re-solve at each time step with updated state
  • Policy gradient (RL): Learn control policy without knowing dynamics model

iLQR and MPC are the core of model-based RL — once you know (or learn) a dynamics model, you find the optimal policy by optimization rather than trial and error.

from scipy.optimize import minimize
import numpy as np

# Penalty method: minimize x^2 subject to x = 3
# Unconstrained: minimize x^2 + mu * (x - 3)^2
for mu in [0.1, 1, 10, 100, 1000]:
    result = minimize(lambda x: x[0]**2 + mu * (x[0] - 3)**2, x0=[0])
    x_opt = result.x[0]
    print(f"mu={mu:>6}: x={x_opt:.4f}, constraint violation={abs(x_opt-3):.4f}")
    # As mu increases: x → 3 (constraint satisfied), but conditioning worsens

# Augmented Lagrangian: better convergence
x = 0.0
lam = 0.0  # Lagrange multiplier
mu = 1.0
for iteration in range(10):
    # Minimize augmented Lagrangian w.r.t. x
    result = minimize(lambda x: x[0]**2 + lam*(x[0]-3) + mu/2*(x[0]-3)**2, x0=[x])
    x = result.x[0]
    # Dual update
    lam = lam + mu * (x - 3)
    if abs(x - 3) < 1e-6:
        break
print(f"\nAugmented Lagrangian: x={x:.6f} (should be close to 3)")

Interview Signals

What interviewers listen for:

  • You can derive the OLS closed-form solution from the normal equation
  • You understand the geometric meaning of residual orthogonality (projection onto the column space)
  • You can explain Ridge vs Lasso (L2 shrinkage vs L1 sparsity) and the Bayesian interpretation
  • You know that polynomial regression is still linear regression (linear in parameters)
  • You understand the flaws of LS classification and why logistic loss is better
  • You can connect neural network training to nonlinear least squares
  • You know that weight decay = L2 regularization

Practice

Flashcards


What is the normal equation for least squares, and how is it derived?

Normal equation: $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$, solved by $\mathbf{x} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}$. Derivation: take the gradient of $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2$ and set it to zero → $2\mathbf{A}^\top(\mathbf{A}\mathbf{x}-\mathbf{b}) = \mathbf{0}$ → $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$. Geometric meaning: the residual $\mathbf{b}-\mathbf{A}\mathbf{x}$ must be orthogonal to the column space of $\mathbf{A}$.


Quiz


What is the geometric meaning of the normal equation $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$?

