Least Squares & Applications

Course Reference

Based on Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford). Covers Chapters 12-19.

Why This Matters

Least squares is the most fundamental optimization framework in ML — linear regression is least squares, Ridge/Lasso are regularized least squares, and neural network training is nonlinear least squares. Understanding these lets you see the core of ML through one unified mathematical language.

The Least Squares Problem (Ch 12)

Problem Formulation

The least squares problem: find $\mathbf{x}$ that minimizes the sum of squared residuals:

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$$

where $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{x} \in \mathbb{R}^n$.

Expanding: $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = \mathbf{x}^\top \mathbf{A}^\top \mathbf{A} \mathbf{x} - 2\mathbf{b}^\top \mathbf{A} \mathbf{x} + \mathbf{b}^\top \mathbf{b}$

This is a quadratic function of $\mathbf{x}$, and $\mathbf{A}^\top \mathbf{A}$ is PSD → convex → a unique global minimum exists (if $\mathbf{A}$ has full column rank).

Solution via Normal Equation

Take the gradient and set it to zero:

$$\nabla_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{0}$$

This gives the normal equation:

$$\mathbf{A}^\top \mathbf{A} \mathbf{x} = \mathbf{A}^\top \mathbf{b}$$

If $\mathbf{A}^\top \mathbf{A}$ is invertible (i.e., $\mathbf{A}$ has full column rank):

$$\hat{\mathbf{x}} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \mathbf{b}$$

$(\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top$ is exactly the pseudo-inverse $\mathbf{A}^+$ of $\mathbf{A}$.

Geometric Interpretation

The most important intuition: the least squares solution $\hat{\mathbf{x}}$ makes $\mathbf{A}\hat{\mathbf{x}}$ the orthogonal projection of $\mathbf{b}$ onto the column space of $\mathbf{A}$.

The residual $\mathbf{r} = \mathbf{b} - \mathbf{A}\hat{\mathbf{x}}$ is orthogonal to the column space of $\mathbf{A}$:

$$\mathbf{A}^\top (\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0}$$

This is the geometric meaning of the normal equation: the residual must be orthogonal to the column space for the projection to be the closest point.

import numpy as np

# Problem: fit y = a + b*x to data
x_data = np.array([1, 2, 3, 4, 5], dtype=float)
y_data = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix: [1, x]
A = np.column_stack([np.ones_like(x_data), x_data])
b = y_data

# Method 1: Normal equation (explicit)
x_hat = np.linalg.inv(A.T @ A) @ A.T @ b
print(f"Normal equation: intercept={x_hat[0]:.3f}, slope={x_hat[1]:.3f}")

# Method 2: np.linalg.lstsq (numerically stable, uses SVD internally)
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(f"lstsq: intercept={x_lstsq[0]:.3f}, slope={x_lstsq[1]:.3f}")

# Method 3: np.linalg.solve (solve normal equation directly)
x_solve = np.linalg.solve(A.T @ A, A.T @ b)

# Verify all three give the same answer
assert np.allclose(x_hat, x_lstsq)
assert np.allclose(x_hat, x_solve)

# Verify residual orthogonality: A^T (b - Ax) = 0
residual = b - A @ x_hat
orthogonality = A.T @ residual
print(f"A^T * residual = {orthogonality}")  # ≈ [0, 0] (machine precision)
assert np.allclose(orthogonality, 0, atol=1e-10)  # True ✓

Interview Connection: Normal Equation vs Gradient Descent

"When do you use the normal equation vs gradient descent?" → Normal equation: $O(n^3)$ to compute the inverse, good when there are few features (small $n$). Gradient descent: $O(kn^2)$, good when there are many features or the data doesn't fit in memory. Sklearn's LinearRegression uses SVD-based lstsq under the hood (it does not invert $\mathbf{A}^\top\mathbf{A}$ directly).
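A toy comparison (the problem here is made up for the sketch): gradient descent on the same objective, with step size below $1/L$, reaches the same minimizer as the normal equation:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(50, 3)
b = np.random.randn(50)

# Closed form via the normal equation: O(n^3), one shot
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient descent: many cheap steps
# gradient of ||Ax - b||^2 is 2 A^T (Ax - b); Lipschitz constant L = 2*sigma_max(A)^2
lr = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
x = np.zeros(3)
for _ in range(5000):
    x -= lr * 2 * A.T @ (A @ x - b)

assert np.allclose(x, x_ne, atol=1e-8)  # both reach the same minimizer
```

The step size $1/L$ guarantees convergence because the objective's gradient is $L$-Lipschitz; in practice one would use a line search or a library routine.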

Least Squares Data Fitting (Ch 13)

Polynomial Fitting

Polynomial regression is still linear least squares at heart — the features just become $1, x, x^2, \ldots, x^p$.

Design matrix for a degree-$p$ polynomial:

$$\mathbf{A} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ 1 & x_2 & x_2^2 & \cdots & x_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^p \end{bmatrix}$$

The model is a linear function of the parameters → solvable with the normal equation. The prediction is a nonlinear function of the input, but the fit is linear in the parameters.

Overfitting and Model Complexity

A high-degree polynomial can perfectly fit the training data (a degree n-1 polynomial passes through any n points) but generalizes poorly.

The core tradeoff:

  • Underfitting (degree too low): high bias, low variance → both training and test error are large
  • Overfitting (degree too high): low bias, high variance → small training error but large test error
  • Sweet spot: bias-variance balance → choose the degree by cross-validation

ML Connections

| Concept | ML Connection |
| --- | --- |
| Polynomial fitting | Polynomial regression — design matrix with polynomial features |
| Basis expansion | Feature engineering: create nonlinear features, then linear fit. Kernel trick in SVM: implicit high-dim features |
| Overfitting | Universal problem — regularization, early stopping, and dropout are all countermeasures |
| Cross-validation | Standard model selection — K-fold CV for hyperparameter tuning |
| Design matrix | Feature matrix X in sklearn — each row = sample, each column = feature |

import numpy as np

np.random.seed(42)

# True function: y = 0.5 + 2x - 0.3x^2 + noise
n = 30
x = np.linspace(0, 5, n)
y_true = 0.5 + 2 * x - 0.3 * x**2
y = y_true + np.random.randn(n) * 0.5

# Polynomial fitting with different degrees
for degree in [1, 3, 15]:
    # Build design matrix: [1, x, x^2, ..., x^degree]
    A = np.column_stack([x**k for k in range(degree + 1)])
    coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
    y_pred = A @ coeffs
    train_error = np.mean((y - y_pred)**2)
    print(f"Degree {degree:2d}: train MSE = {train_error:.4f}, coeffs range = [{coeffs.min():.2f}, {coeffs.max():.2f}]")

# Overfitting demo: train/test split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x_train, x_test = x[:20].reshape(-1, 1), x[20:].reshape(-1, 1)
y_train, y_test = y[:20], y[20:]

for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = np.mean((model.predict(x_train) - y_train)**2)
    test_mse = np.mean((model.predict(x_test) - y_test)**2)
    print(f"Degree {degree:2d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
    # High degree: train MSE ≈ 0 but test MSE explodes → overfitting!

# Cross-validation for degree selection
best_degree, best_cv = 1, -np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x.reshape(-1, 1), y, cv=5, scoring='neg_mean_squared_error')
    cv_mse = -scores.mean()
    if scores.mean() > best_cv:
        best_cv = scores.mean()
        best_degree = degree
    print(f"Degree {degree}: CV MSE = {cv_mse:.3f}")
print(f"Best degree by CV: {best_degree}")

Interview Connection: Polynomial Regression Is Not Nonlinear Regression

Common misconception: polynomial regression is not nonlinear regression. It is linear regression with polynomial features. "Linear" refers to a linear combination of the parameters, not a linear function of the input. $y = w_0 + w_1 x + w_2 x^2$ is linear in $w_0, w_1, w_2$, so it can be solved with the normal equation.

Least Squares Classification (Ch 14)

Binary Classification via Least Squares

Forcing classification into the least squares mold: set the labels to $+1$ and $-1$, then minimize $\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2$.

Decision boundary: $\mathbf{w}^\top \mathbf{x} = 0$

Intuition: least squares tries to make the predictions $\mathbf{X}\mathbf{w}$ as close as possible to the target values (+1 or -1). The problems:

Why LS Classification is Suboptimal

| Issue | Explanation |
| --- | --- |
| Unbounded predictions | LS predictions can be any real number, not a probability. Correctly classified points with high confidence have large residuals → they get penalized anyway |
| Outlier sensitivity | Correctly classified points far from the boundary also produce large residuals → they distort the decision boundary |
| Squared loss for classification | Squared loss is not a natural loss for classification — correct points far from the boundary should not incur large loss |

Better Alternatives

$$\text{LS: minimize } \sum (w^\top x_i - y_i)^2 \quad \text{vs} \quad \text{Logistic: minimize } \sum \log(1 + e^{-y_i w^\top x_i})$$

The benefit of logistic loss: for correctly classified points with high confidence, the loss approaches 0 (they are not penalized).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

np.random.seed(42)

# Generate binary classification data
X, y_01 = make_classification(n_samples=200, n_features=2, n_redundant=0,
                               n_informative=2, random_state=42, n_clusters_per_class=1)
y_pm1 = 2 * y_01 - 1  # Convert to {-1, +1}

# Split
X_train, X_test = X[:150], X[150:]
y_train, y_test = y_pm1[:150], y_pm1[150:]
y_train_01, y_test_01 = y_01[:150], y_01[150:]

# LS Classifier: minimize ||Xw - y||^2
# Add bias column
X_train_b = np.column_stack([np.ones(len(X_train)), X_train])
X_test_b = np.column_stack([np.ones(len(X_test)), X_test])

w_ls = np.linalg.lstsq(X_train_b, y_train, rcond=None)[0]
y_pred_ls = np.sign(X_test_b @ w_ls)
acc_ls = accuracy_score(y_test, y_pred_ls)

# Logistic Regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train_01)
acc_lr = accuracy_score(y_test_01, lr.predict(X_test))

print(f"LS Classifier accuracy:       {acc_ls:.3f}")
print(f"Logistic Regression accuracy:  {acc_lr:.3f}")

# Show why LS is suboptimal: predictions are unbounded
ls_predictions = X_test_b @ w_ls
print(f"LS prediction range: [{ls_predictions.min():.2f}, {ls_predictions.max():.2f}]")
print("LS predictions are unbounded — not probabilities!")

# Logistic outputs proper probabilities
lr_probs = lr.predict_proba(X_test)[:, 1]
print(f"Logistic prob range: [{lr_probs.min():.3f}, {lr_probs.max():.3f}]")

# Add outliers to see LS sensitivity
X_outlier = np.array([[10, 10], [12, 12], [15, 15]])
y_outlier = np.array([1, 1, 1])  # Correct class but far away
X_train2 = np.vstack([X_train, X_outlier])
y_train2 = np.concatenate([y_train, y_outlier])
X_train2_b = np.column_stack([np.ones(len(X_train2)), X_train2])

w_ls2 = np.linalg.lstsq(X_train2_b, y_train2, rcond=None)[0]
y_pred_ls2 = np.sign(X_test_b @ w_ls2)
acc_ls2 = accuracy_score(y_test, y_pred_ls2)
print(f"LS accuracy with outliers: {acc_ls2:.3f}")
# Outliers distort the LS boundary → accuracy degrades

Interview Connection: Why Not Use MSE for Classification?

"Why isn't MSE loss used for classification?" → MSE gives a large penalty even to correctly classified points far from the decision boundary (their predictions are far from +1/-1). Logistic/cross-entropy loss approaches 0 for confident correct predictions. Moreover, with a sigmoid output the gradient of MSE loss vanishes in high-confidence regions (gradient vanishing), which hurts optimization.
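A quick numeric sketch of the gradient-vanishing claim (the numbers are illustrative): with a sigmoid output and target $y = 1$, a confidently wrong logit gives an almost-zero MSE gradient but a full-strength cross-entropy gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0          # logit: a confidently WRONG prediction for target y = 1
p = sigmoid(z)     # ~ 4.5e-5

# Gradient of squared loss (p - y)^2 w.r.t. the logit z:
#   d/dz (p - 1)^2 = 2 (p - 1) * p * (1 - p)   -> vanishes as p -> 0 or 1
grad_mse = 2 * (p - 1) * p * (1 - p)

# Gradient of cross-entropy -log(p) w.r.t. the logit z:  p - 1
grad_ce = p - 1

print(f"MSE gradient:           {grad_mse:.1e}")  # nearly zero: learning stalls
print(f"Cross-entropy gradient: {grad_ce:.1e}")   # still around -1: strong signal
```

So with MSE the model gets almost no gradient signal precisely when it is confidently wrong, which is the worst possible place to stall.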

Regularized Data Fitting (Ch 15) — Multi-Objective Least Squares

The Core Tradeoff

Regularization is multi-objective optimization: minimize two objectives simultaneously:

  1. Data fidelity: $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ — fit data well
  2. Regularization: $\|\mathbf{x}\|^2$ (or $\|\mathbf{x}\|_1$) — keep model simple

Ridge Regression (Tikhonov Regularization, L2)

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|^2$$

Closed-form solution:

$$\hat{\mathbf{x}}_{\text{ridge}} = (\mathbf{A}^\top\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A}^\top\mathbf{b}$$

The effect of $\lambda$:

  • $\lambda = 0$: reduces to OLS
  • $\lambda \to \infty$: $\hat{\mathbf{x}} \to \mathbf{0}$ (infinite shrinkage)
  • $\lambda\mathbf{I}$ makes $\mathbf{A}^\top\mathbf{A} + \lambda\mathbf{I}$ always invertible → fixes multicollinearity

Lasso (L1 Regularization)

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|_1$$

No closed form (the L1 norm is not differentiable at 0) → solved with coordinate descent or ISTA.

Key difference: L1 produces sparse solutions (many coefficients exactly 0) → automatic feature selection.
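A minimal ISTA sketch (the data and $\lambda$ here are made up for the demo): each iteration takes a gradient step on $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2$ and then soft-thresholds, and the thresholding step is what creates the exact zeros:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(60, 10)
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]            # sparse ground truth
b = A @ x_true + 0.05 * np.random.randn(60)

lam = 5.0                                 # L1 penalty weight
t = 1.0 / (2 * np.linalg.norm(A, 2)**2)   # step size 1/L; gradient is 2 A^T (Ax - b)

def soft_threshold(v, thr):
    # prox operator of thr * ||.||_1: shrink toward 0, zeroing small entries
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

x = np.zeros(10)
for _ in range(2000):
    x = soft_threshold(x - t * 2 * A.T @ (A @ x - b), t * lam)

print("nonzero coefficients:", np.flatnonzero(np.abs(x) > 1e-8))
# the exact zeros come from the thresholding step, not from rounding
```

Coordinate descent (what sklearn's Lasso uses) reaches the same minimizer; ISTA is just the simplest method to write down.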

Elastic Net

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda_1\|\mathbf{x}\|_1 + \lambda_2\|\mathbf{x}\|^2$$

Combines the sparsity of L1 with the stability of L2 (Lasso is unstable with correlated features; Elastic Net is more robust).

Bayesian Interpretation

| Regularization | Prior Distribution | MAP Estimation |
| --- | --- | --- |
| Ridge (L2) | $w_j \sim \mathcal{N}(0, \sigma_w^2)$ — Gaussian prior | MAP with Gaussian prior = Ridge |
| Lasso (L1) | $w_j \sim \text{Laplace}(0, b)$ — Laplace prior (sharp peak at 0) | MAP with Laplace prior = Lasso |
| No regularization | Uniform (improper) prior | MLE = OLS |

The Bayesian view: regularization = incorporating prior belief about the parameters. Ridge believes the weights come from a Gaussian (not too large, but not exactly 0). Lasso believes the weights come from a Laplace distribution (many exactly 0, a few nonzero).
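The Ridge case can be derived in one line. Assuming likelihood $\mathbf{b}\mid\mathbf{x} \sim \mathcal{N}(\mathbf{A}\mathbf{x}, \sigma^2\mathbf{I})$ and prior $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2\mathbf{I})$, minimizing the negative log-posterior gives:

```latex
\hat{\mathbf{x}}_{\text{MAP}}
= \arg\min_{\mathbf{x}}\left[\tfrac{1}{2\sigma^2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2
  + \tfrac{1}{2\sigma_w^2}\|\mathbf{x}\|^2\right]
= \arg\min_{\mathbf{x}}\left[\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2
  + \lambda\|\mathbf{x}\|^2\right],
\qquad \lambda = \frac{\sigma^2}{\sigma_w^2}
```

A tighter prior (smaller $\sigma_w^2$) means a larger $\lambda$ — stronger shrinkage.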

Bias-Variance Tradeoff

$$\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

  • $\lambda = 0$ (OLS): zero bias, high variance → overfitting
  • $\lambda$ large: high bias, low variance → underfitting
  • Optimal $\lambda$: minimizes total MSE → chosen by cross-validation

ML Connections

| Concept | ML Application |
| --- | --- |
| Ridge | sklearn Ridge, weight decay in neural nets ($\lambda\|\mathbf{w}\|^2$ in loss) |
| Lasso | Feature selection, sparse models, compressed sensing |
| Elastic Net | Genomics (correlated genes), text (correlated words) |
| Regularization path | Plot coefficients vs $\lambda$ → see which features survive |
| CV for lambda | sklearn RidgeCV, LassoCV — automated $\lambda$ selection |
| Weight decay | Same as L2 regularization — standard in all neural network training |

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

np.random.seed(42)

# Generate data with many features (some irrelevant)
n, p = 100, 20
X = np.random.randn(n, p)
true_w = np.zeros(p)
true_w[:5] = [3, -2, 1.5, -1, 0.5]  # Only first 5 features matter
y = X @ true_w + np.random.randn(n) * 0.5

# Compare OLS vs Ridge vs Lasso
models = {
    'OLS': LinearRegression(),
    'Ridge (lambda=1)': Ridge(alpha=1.0),
    'Lasso (lambda=0.1)': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    coefs = model.coef_
    nonzero = np.sum(np.abs(coefs) > 0.01)
    cv = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{name:25s}: nonzero={nonzero:2d}/{p}, CV MSE={-cv.mean():.3f}")

# Lasso produces sparse solution: only ~5 nonzero coefficients ✓
print(f"\nTrue weights:  {true_w.round(2)}")
print(f"Lasso weights: {Lasso(alpha=0.1).fit(X, y).coef_.round(2)}")

# Ridge closed-form: x = (A^T A + lambda I)^{-1} A^T b
lam = 1.0
x_ridge_manual = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
x_ridge_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
assert np.allclose(x_ridge_manual, x_ridge_sklearn, atol=1e-6)

# Cross-validation for lambda selection
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)
ridge_cv.fit(X, y)
print(f"\nBest Ridge lambda by CV: {ridge_cv.alpha_:.4f}")

lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, random_state=42)
lasso_cv.fit(X, y)
print(f"Best Lasso lambda by CV: {lasso_cv.alpha_:.4f}")

# Regularization path: how coefficients change with lambda
alphas = np.logspace(-3, 2, 50)
ridge_coefs = []
for a in alphas:
    model = Ridge(alpha=a, fit_intercept=False)
    model.fit(X, y)
    ridge_coefs.append(model.coef_.copy())
ridge_coefs = np.array(ridge_coefs)
# As lambda increases: all coefficients shrink toward 0 (but never exactly 0)
print(f"\nRidge path: at lambda=0.001, max|coef|={np.abs(ridge_coefs[0]).max():.3f}")
print(f"Ridge path: at lambda=100,   max|coef|={np.abs(ridge_coefs[-1]).max():.3f}")

Interview Connection: Weight Decay = L2 Regularization

"How does neural network weight decay relate to Ridge regression?" → They are the same thing. The weight_decay parameter of a PyTorch optimizer effectively adds $\lambda\|\mathbf{w}\|^2$ to the loss. AdamW decouples weight decay from Adam's adaptive learning rate (decoupled weight decay), which works better.
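A numpy sketch of the equivalence (ignoring PyTorch's exact factor conventions; data is made up): the same update written once as the gradient of the ridge loss and once as an LS gradient plus a decay term produces identical iterates, both matching the ridge closed form:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(40, 5)
b = np.random.randn(40)
lam, lr, steps = 0.5, 0.005, 4000

# (1) Gradient descent on the explicit ridge loss ||Aw - b||^2 + lam*||w||^2
w_ridge = np.zeros(5)
for _ in range(steps):
    w_ridge -= lr * (2 * A.T @ (A @ w_ridge - b) + 2 * lam * w_ridge)

# (2) Plain LS gradient plus a separate "weight decay" term in the update rule
w_decay = np.zeros(5)
for _ in range(steps):
    grad = 2 * A.T @ (A @ w_decay - b)              # gradient of the unregularized loss
    w_decay -= lr * grad + lr * 2 * lam * w_decay   # decay folded into the step

# Both match the ridge closed form
w_closed = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)
assert np.allclose(w_ridge, w_decay)
assert np.allclose(w_ridge, w_closed, atol=1e-6)
```

The two loops compute the same arithmetic, which is exactly the point: "decay the weights a little each step" is gradient descent on a ridge objective.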

Control via Least Squares

The control problem: given a dynamical system $\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t$, find an input sequence $\mathbf{u}_0, \ldots, \mathbf{u}_{T-1}$ that drives the state to a target.

This can be formulated as least squares: minimize $\sum_t \|\mathbf{x}_t - \mathbf{x}_{\text{target}}\|^2 + \rho\|\mathbf{u}_t\|^2$. The first term pushes the state toward the target; the second penalizes large inputs (saving energy / avoiding extreme actions).

Connection to RL: reward shaping + action penalties in RL are essentially multi-objective least squares. Large $\rho$ → conservative control (small actions); small $\rho$ → aggressive control.

Estimation and Inversion

The estimation problem: recover the true signal $\mathbf{x}$ from noisy observations $\mathbf{y} = A\mathbf{x} + \mathbf{v}$.

$$\hat{\mathbf{x}} = \arg\min \|\mathbf{y} - A\mathbf{x}\|^2 + \lambda\|\mathbf{x}\|^2$$

The first term fits the observations; the second regularizes (prevents noise amplification). Same spirit as denoising — a tradeoff between noise reduction and signal preservation.

Applications in ML: denoising autoencoders, signal processing, inverse problems (CT scan reconstruction).

import numpy as np

# Control: steer system from x0 to target using least squares
A = np.array([[1, 0.1], [0, 1]])  # simple dynamics
B = np.array([[0], [0.1]])        # input affects velocity

T = 10  # time horizon
n, m = 2, 1
x0 = np.array([0, 0])
x_target = np.array([1, 0])  # reach position 1, stop

# Final state is linear in the inputs: x_T = A^T x0 + sum_k A^(T-1-k) B u_k
# (A^T here means the T-th power of A). Stack the inputs into one vector u and
# solve the regularized LS problem: min ||G u - (x_target - A^T x0)||^2 + rho*||u||^2
rho = 0.1
G = np.hstack([np.linalg.matrix_power(A, T - 1 - k) @ B for k in range(T)])  # (n, T*m)
c = x_target - np.linalg.matrix_power(A, T) @ x0
u = np.linalg.solve(G.T @ G + rho * np.eye(T * m), G.T @ c)
x_final = np.linalg.matrix_power(A, T) @ x0 + G @ u
print(f"Final state: {x_final.round(3)}")  # larger rho -> smaller inputs, further from target

# Estimation: recover signal from noisy observation
x_true = np.array([1.0, 2.0, 3.0])
A_obs = np.random.randn(10, 3)  # observation matrix (more observations than unknowns)
noise = np.random.randn(10) * 0.5
y_obs = A_obs @ x_true + noise

# Simple LS estimate
x_hat = np.linalg.lstsq(A_obs, y_obs, rcond=None)[0]
print(f"True:      {x_true}")
print(f"Estimated: {x_hat.round(3)}")  # close to true

# Regularized estimate (Ridge) — more robust to noise
lam = 0.1
x_hat_reg = np.linalg.inv(A_obs.T @ A_obs + lam * np.eye(3)) @ A_obs.T @ y_obs
print(f"Ridge est: {x_hat_reg.round(3)}")

Constrained Least Squares (Ch 16-17)

Problem Formulation

$$\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 \quad \text{subject to} \quad \mathbf{C}\mathbf{x} = \mathbf{d}$$

Least squares with equality constraints. Solve via the Lagrangian:

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\nu}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \boldsymbol{\nu}^\top(\mathbf{C}\mathbf{x} - \mathbf{d})$$

KKT Conditions

Setting the gradient to zero gives the KKT system:

$$\begin{bmatrix} 2\mathbf{A}^\top\mathbf{A} & \mathbf{C}^\top \\ \mathbf{C} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \boldsymbol{\nu} \end{bmatrix} = \begin{bmatrix} 2\mathbf{A}^\top\mathbf{b} \\ \mathbf{d} \end{bmatrix}$$

Solving this block system yields the optimal $\mathbf{x}$ and the Lagrange multipliers $\boldsymbol{\nu}$ simultaneously.

Intuition: $\boldsymbol{\nu}$ tells you the sensitivity of the optimal value to the constraint — how much the objective would improve if the constraint were slightly relaxed.

ML Connections

| Concept | ML Application |
| --- | --- |
| Constrained optimization | SVM: minimize $\|\mathbf{w}\|^2$ s.t. $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ (constrained QP) |
| Lagrange multipliers | Support vectors = points where multiplier $\alpha_i > 0$ (active constraints) |
| Portfolio optimization | Minimize variance s.t. expected return = target (Markowitz) |
| Equality constraints | Softmax normalization ($\sum p_i = 1$), probability simplex constraints |
| KKT conditions | Foundation of convex optimization — used in interior-point methods |

import numpy as np
from scipy.optimize import minimize

np.random.seed(42)

# Constrained LS: minimize ||Ax - b||^2 subject to Cx = d
A = np.random.randn(10, 4)
b = np.random.randn(10)
C = np.array([[1, 1, 1, 1]])  # constraint: sum of x_i = 1
d = np.array([1.0])

# Method 1: Solve KKT system directly
n = A.shape[1]
m = C.shape[0]
KKT = np.block([
    [2 * A.T @ A, C.T],
    [C, np.zeros((m, m))]
])
rhs = np.concatenate([2 * A.T @ b, d])
sol = np.linalg.solve(KKT, rhs)
x_kkt = sol[:n]
nu = sol[n:]  # Lagrange multiplier
print(f"KKT solution: x = {x_kkt.round(4)}")
print(f"Sum of x = {x_kkt.sum():.6f}")  # = 1.0 ✓
print(f"Lagrange multiplier nu = {nu.round(4)}")

# Method 2: scipy.optimize with constraint
def objective(x):
    return np.sum((A @ x - b)**2)

constraint = {'type': 'eq', 'fun': lambda x: C @ x - d}
result = minimize(objective, np.zeros(4), constraints=constraint)
print(f"scipy solution: x = {result.x.round(4)}")
assert np.allclose(x_kkt, result.x, atol=1e-4)

# Portfolio optimization example
# 5 assets, minimize variance s.t. expected return = target, weights sum to 1
n_assets = 5
returns = np.array([0.10, 0.12, 0.08, 0.15, 0.11])  # expected returns
cov_matrix = np.array([
    [0.04, 0.006, 0.002, 0.01, 0.003],
    [0.006, 0.09, 0.004, 0.015, 0.005],
    [0.002, 0.004, 0.01, 0.003, 0.002],
    [0.01, 0.015, 0.003, 0.16, 0.008],
    [0.003, 0.005, 0.002, 0.008, 0.025],
])

target_return = 0.12

def portfolio_variance(w):
    return w @ cov_matrix @ w

constraints = [
    {'type': 'eq', 'fun': lambda w: np.sum(w) - 1},         # weights sum to 1
    {'type': 'eq', 'fun': lambda w: w @ returns - target_return},  # target return
]
bounds = [(0, 1)] * n_assets  # no short selling

result = minimize(portfolio_variance, np.ones(n_assets) / n_assets,
                  constraints=constraints, bounds=bounds, method='SLSQP')
print(f"\nOptimal portfolio weights: {result.x.round(4)}")
print(f"Portfolio return: {result.x @ returns:.4f}")
print(f"Portfolio std:    {np.sqrt(result.fun):.4f}")

Linear Quadratic Regulator (LQR)

LQR is the classic application of constrained optimization in control:

$$\min_{\mathbf{u}_0, \ldots, \mathbf{u}_{T-1}} \sum_{t=0}^{T} \mathbf{x}_t^\top Q \mathbf{x}_t + \sum_{t=0}^{T-1} \mathbf{u}_t^\top R \mathbf{u}_t \quad \text{s.t. } \mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t$$

  • $Q$: penalizes state deviation (PSD matrix — which states must stay near 0)
  • $R$: penalizes control effort (PD matrix — avoids extreme actions)
  • The solution is linear feedback: $\mathbf{u}_t = -K\mathbf{x}_t$ (the optimal gain matrix $K$ is determined by the Riccati equation)

ML connection: continuous control in RL (robotics) is essentially a nonlinear generalization of LQR. Policy gradient methods learn $K$ when the dynamics are unknown. Model-based RL uses a learned dynamics model + LQR as a planning module.
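A minimal finite-horizon LQR sketch (the dynamics matrices are illustrative, reusing the double integrator from the control example): a backward Riccati recursion computes the gains $K_t$, then a forward rollout applies the linear feedback $\mathbf{u}_t = -K_t\mathbf{x}_t$:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])  # double-integrator dynamics
B = np.array([[0.0], [0.1]])
Q = np.eye(2)           # state cost (PSD)
R = np.array([[0.1]])   # control cost (PD)
T = 50                  # horizon

# Backward Riccati recursion: P_T = Q, then
#   K_t = (R + B^T P B)^{-1} B^T P A,   P_t = Q + A^T P (A - B K_t)
P = Q.copy()
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()          # gains[t] is the feedback gain for step t

# Forward rollout: linear feedback drives the state toward the origin
x = np.array([1.0, 0.0])
for t in range(T):
    u = -gains[t] @ x
    x = A @ x + B @ u
print(f"State after {T} steps: {x.round(4)}")
```

For the infinite-horizon gain, iterate the recursion to convergence (or use scipy.linalg.solve_discrete_are).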

Linear Quadratic State Estimation (Kalman Filter)

The Kalman filter = optimal state estimation for linear systems with Gaussian noise:

$$\text{Predict: } \hat{\mathbf{x}}_{t|t-1} = A\hat{\mathbf{x}}_{t-1} + B\mathbf{u}_{t-1} \qquad \text{Update: } \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t(\mathbf{y}_t - C\hat{\mathbf{x}}_{t|t-1})$$

$K_t$ = the Kalman gain — it balances prediction vs observation. High prediction uncertainty → trust the observation more. High observation noise → trust the prediction more.

ML connections:

| Application | Connection |
| --- | --- |
| Time series | Kalman filter = Bayesian state estimation for linear time series |
| Sensor fusion | Combine GPS + IMU readings optimally |
| Object tracking | Predict + correct object position in video |
| RNN analogy | Kalman: predict → observe → update state. RNN: predict → input → update hidden state |

import numpy as np

# Simple Kalman Filter: track position from noisy measurements
A = np.array([[1, 1], [0, 1]])     # state transition: [pos, vel]
C = np.array([[1, 0]])              # observe position only
Q_noise = np.eye(2) * 0.01         # process noise
R_noise = np.array([[1.0]])         # measurement noise

# Initialize
x_hat = np.array([0.0, 0.0])      # initial state estimate
P = np.eye(2)                       # initial covariance

# True trajectory
x_true = np.array([0.0, 1.0])      # start at 0, velocity 1
measurements = []
estimates = []

for t in range(20):
    # True state evolution
    x_true = A @ x_true + np.random.multivariate_normal([0,0], Q_noise)
    z = C @ x_true + np.random.normal(0, np.sqrt(R_noise[0,0]))

    # Predict
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q_noise

    # Update (Kalman gain)
    S = C @ P_pred @ C.T + R_noise
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_hat = x_pred + K.flatten() * (z - C @ x_pred)
    P = (np.eye(2) - K @ C) @ P_pred

    estimates.append(x_hat.copy())

print(f"True final state:      {x_true.round(2)}")
print(f"Estimated final state: {x_hat.round(2)}")

LQR + Kalman = LQG

LQR (optimal control) + Kalman filter (optimal estimation) = LQG (Linear Quadratic Gaussian) — the separation principle lets you design the controller and the estimator independently. This is a fundamental result of control theory and the theoretical basis of model-based RL.

Nonlinear Least Squares (Ch 18-19)

Problem Formulation

When the model is nonlinear in parameters:

$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{m} (f(\mathbf{x}_i; \boldsymbol{\theta}) - y_i)^2 = \|\mathbf{f}(\boldsymbol{\theta}) - \mathbf{y}\|^2$$

The normal equation doesn't apply (because $f$ is not linear in $\boldsymbol{\theta}$) → iterative methods are needed.

Gauss-Newton Algorithm

Core idea: linearize $\mathbf{f}(\boldsymbol{\theta})$ around the current $\boldsymbol{\theta}_k$ (Taylor expansion):

$$\mathbf{f}(\boldsymbol{\theta}) \approx \mathbf{f}(\boldsymbol{\theta}_k) + \mathbf{J}_k (\boldsymbol{\theta} - \boldsymbol{\theta}_k)$$

where $\mathbf{J}_k$ is the Jacobian of $\mathbf{f}$ at $\boldsymbol{\theta}_k$.

Substituting into the least squares problem turns it into linear least squares:

$$\min_{\Delta\boldsymbol{\theta}} \|\mathbf{J}_k \Delta\boldsymbol{\theta} - (\mathbf{y} - \mathbf{f}(\boldsymbol{\theta}_k))\|^2$$

Normal equation for the update: $\mathbf{J}_k^\top \mathbf{J}_k \Delta\boldsymbol{\theta} = \mathbf{J}_k^\top (\mathbf{y} - \mathbf{f}(\boldsymbol{\theta}_k))$

Then set $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \Delta\boldsymbol{\theta}$ and repeat.
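The loop above fits in a few lines. A hand-rolled sketch on an illustrative exponential model (data and starting point are made up for the demo):

```python
import numpy as np

# Gauss-Newton for y = a * exp(-b * x)
np.random.seed(0)
x = np.linspace(0, 4, 40)
y = 2.0 * np.exp(-0.7 * x) + 0.01 * np.random.randn(40)

theta = np.array([1.5, 0.5])   # initial guess for (a, b)
for _ in range(20):
    a, b = theta
    f = a * np.exp(-b * x)                          # current predictions f(theta)
    J = np.column_stack([np.exp(-b * x),            # df/da
                         -a * x * np.exp(-b * x)])  # df/db
    # Linearized LS step: solve J^T J delta = J^T (y - f)
    delta = np.linalg.solve(J.T @ J, J.T @ (y - f))
    theta = theta + delta

print(f"Estimated (a, b): {theta.round(3)}")  # close to the true (2.0, 0.7)
```

Each iteration solves one linear least squares problem; for a poor starting point the step can overshoot, which is exactly what Levenberg-Marquardt's damping (next section) fixes.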

Levenberg-Marquardt

Gauss-Newton + regularization (a trust region):

$$(\mathbf{J}^\top \mathbf{J} + \mu\mathbf{I})\Delta\boldsymbol{\theta} = \mathbf{J}^\top(\mathbf{y} - \mathbf{f}(\boldsymbol{\theta}))$$

  • Large $\mu$ → close to gradient descent (small, careful steps)
  • Small $\mu$ → close to Gauss-Newton (aggressive steps)
  • $\mu$ is adapted automatically: good step → decrease $\mu$ (more aggressive); bad step → increase $\mu$ (more cautious)

Connection to Neural Networks

Neural network training is essentially nonlinear least squares (when using MSE loss):

$$\min_{\boldsymbol{\theta}} \|\text{NN}(\mathbf{X}; \boldsymbol{\theta}) - \mathbf{y}\|^2$$
  • Backpropagation = computing gradient of this nonlinear LS problem
  • SGD = simplified first-order alternative (gradient only, no Jacobian/Hessian information)
  • Natural gradient / K-FAC = use curvature information similar to Gauss-Newton

Logistic regression is also a special case of nonlinear LS: $f(\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x})$ (the sigmoid makes it nonlinear in $\mathbf{w}$), though in practice it is trained with cross-entropy loss rather than squared loss.

ML Connections

| Concept | ML Application |
| --- | --- |
| Nonlinear LS | Neural network training with MSE loss |
| Gauss-Newton | Foundation for second-order optimizers (K-FAC, natural gradient) |
| Levenberg-Marquardt | scipy curve fitting, robotics, computer vision (bundle adjustment) |
| Jacobian | Backpropagation computes products of Jacobians through layers |
| Logistic regression | Nonlinear LS with sigmoid (but usually trained with cross-entropy) |
| Trust region methods | More robust than line search for non-convex problems |

import numpy as np
from scipy.optimize import least_squares, curve_fit

np.random.seed(42)

# Example 1: Nonlinear curve fitting
# True model: y = a * exp(-b * x) + c
def model(x, a, b, c):
    return a * np.exp(-b * x) + c

x_data = np.linspace(0, 5, 50)
y_true = model(x_data, 2.5, 1.3, 0.5)
y_data = y_true + np.random.randn(50) * 0.1

# Method 1: scipy.optimize.curve_fit (uses Levenberg-Marquardt)
popt, pcov = curve_fit(model, x_data, y_data, p0=[1.0, 1.0, 0.0])
print(f"curve_fit: a={popt[0]:.3f}, b={popt[1]:.3f}, c={popt[2]:.3f}")
print(f"True:      a=2.500, b=1.300, c=0.500")

# Method 2: scipy.optimize.least_squares (more control)
def residuals(params):
    a, b, c = params
    return model(x_data, a, b, c) - y_data

result = least_squares(residuals, x0=[1.0, 1.0, 0.0], method='lm')  # 'lm' = Levenberg-Marquardt
print(f"least_squares: a={result.x[0]:.3f}, b={result.x[1]:.3f}, c={result.x[2]:.3f}")
print(f"Cost (sum of squared residuals): {result.cost:.4f}")
print(f"Optimality (gradient norm): {result.optimality:.2e}")

# Example 2: Logistic regression as nonlinear LS
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Generate binary classification data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)

# Logistic regression via nonlinear least squares (minimize ||sigmoid(Xw) - y||^2)
def logistic_residuals(params):
    w = params[:2]
    b = params[2]
    pred = sigmoid(X @ w + b)
    return pred - y

result_logistic = least_squares(logistic_residuals, x0=np.zeros(3))
w_nls = result_logistic.x[:2]
b_nls = result_logistic.x[2]
pred_nls = (sigmoid(X @ w_nls + b_nls) > 0.5).astype(int)
acc_nls = np.mean(pred_nls == y)
print(f"\nLogistic via nonlinear LS: accuracy={acc_nls:.3f}")

# Compare with sklearn LogisticRegression (uses cross-entropy, not squared loss)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42)
lr.fit(X, y)
acc_lr = lr.score(X, y)
print(f"sklearn LogisticRegression:  accuracy={acc_lr:.3f}")

# Example 3: Neural network connection
# Simple 1-hidden-layer network: f(x) = W2 * relu(W1 * x + b1) + b2
# This is nonlinear LS when using MSE loss

def nn_predict(params, X):
    """Simple neural net: 2 -> 4 -> 1"""
    W1 = params[:8].reshape(4, 2)
    b1 = params[8:12]
    W2 = params[12:16].reshape(1, 4)
    b2 = params[16]
    h = np.maximum(0, X @ W1.T + b1)  # ReLU
    return (h @ W2.T + b2).flatten()

def nn_residuals(params):
    return nn_predict(params, X) - y

# Solve with Levenberg-Marquardt-like method
result_nn = least_squares(nn_residuals, x0=np.random.randn(17) * 0.1,
                          method='trf', max_nfev=1000)
pred_nn = (nn_predict(result_nn.x, X) > 0.5).astype(int)
acc_nn = np.mean(pred_nn == y)
print(f"Simple NN via nonlinear LS:  accuracy={acc_nn:.3f}")
print("Neural net training = nonlinear least squares (with MSE loss)")

Interview Connection: Second-Order vs First-Order Optimization

"What's the difference between SGD and Newton's method?" → SGD uses only the gradient (first-order); each step is simple but convergence is slow. Newton/Gauss-Newton use Hessian/Jacobian information (second-order); each step is expensive but convergence is fast. Adam uses a running average of squared gradients to approximate a diagonal Hessian → a compromise between first- and second-order.

Constrained Nonlinear Least Squares

When the optimization has both nonlinear objective and constraints:

$$\min_{\boldsymbol{\theta}} \|\mathbf{f}(\boldsymbol{\theta})\|^2 \quad \text{s.t. } g(\boldsymbol{\theta}) = 0, \quad h(\boldsymbol{\theta}) \leq 0$$

Penalty Method

Convert constrained problem into unconstrained by adding penalty:

$$\min_{\boldsymbol{\theta}} \|\mathbf{f}(\boldsymbol{\theta})\|^2 + \mu \|g(\boldsymbol{\theta})\|^2$$

The larger $\mu$, the heavier the penalty for constraint violation → the solution gets closer to feasible, but the optimization becomes more ill-conditioned.

In practice, increase $\mu$ progressively: start with a small $\mu$ (easy optimization) and grow it until the constraint is nearly satisfied.

ML connection: regularization is a penalty method — the original problem is constrained ($\|\mathbf{w}\| \leq t$), and the penalty $\lambda\|\mathbf{w}\|^2$ turns it into an unconstrained problem (Ridge regression).

Augmented Lagrangian

Combines Lagrange multipliers with the penalty method — better convergence:

$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \|\mathbf{f}(\boldsymbol{\theta})\|^2 + \boldsymbol{\lambda}^\top g(\boldsymbol{\theta}) + \frac{\mu}{2}\|g(\boldsymbol{\theta})\|^2$$

Alternately update $\boldsymbol{\theta}$ (minimize the augmented Lagrangian) and $\boldsymbol{\lambda}$ (dual update: $\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \mu g(\boldsymbol{\theta})$).

Better than the pure penalty method: you reach the exact solution without needing $\mu \to \infty$ (the Lagrange multiplier provides the additional degree of freedom).

ML connection: ADMM (Alternating Direction Method of Multipliers) is a generalization of the augmented Lagrangian — used in distributed optimization and large-scale ML.

Nonlinear Control

Extend LQR to nonlinear dynamics: $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)$ (linearity no longer assumed).

Methods:

  • iLQR (iterative LQR): Linearize dynamics around current trajectory → solve LQR → update trajectory → repeat
  • Model Predictive Control (MPC): Online re-solve at each time step with updated state
  • Policy gradient (RL): Learn control policy without knowing dynamics model

iLQR and MPC are the core of model-based RL — once you know (or learn) a dynamics model, you find the optimal policy by optimization rather than trial and error.

from scipy.optimize import minimize
import numpy as np

# Penalty method: minimize x^2 subject to x = 3
# Unconstrained: minimize x^2 + mu * (x - 3)^2
for mu in [0.1, 1, 10, 100, 1000]:
    result = minimize(lambda x: x[0]**2 + mu * (x[0] - 3)**2, x0=[0])
    x_opt = result.x[0]
    print(f"mu={mu:>6}: x={x_opt:.4f}, constraint violation={abs(x_opt-3):.4f}")
    # As mu increases: x → 3 (constraint satisfied), but conditioning worsens

# Augmented Lagrangian: better convergence
x = 0.0
lam = 0.0  # Lagrange multiplier
mu = 1.0
for iteration in range(10):
    # Minimize augmented Lagrangian w.r.t. x
    result = minimize(lambda x: x[0]**2 + lam*(x[0]-3) + mu/2*(x[0]-3)**2, x0=[x])
    x = result.x[0]
    # Dual update
    lam = lam + mu * (x - 3)
    if abs(x - 3) < 1e-6:
        break
print(f"\nAugmented Lagrangian: x={x:.6f} (should be close to 3)")

Interview Signals

What interviewers listen for:

  • You can derive the OLS closed-form solution from the normal equation
  • You understand the geometric meaning of residual orthogonality (projection onto the column space)
  • You can explain Ridge vs Lasso (L2 shrinkage vs L1 sparsity) and the Bayesian interpretation
  • You know that polynomial regression is still linear regression (linear in parameters)
  • You understand the flaws of LS classification and why logistic loss is better
  • You can connect neural network training to nonlinear least squares
  • You know that weight decay = L2 regularization

Practice

Flashcards


What is the normal equation for least squares, and how is it derived?

Normal equation: $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$, solved by $\mathbf{x} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}$. Derivation: take the gradient of $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^2$ and set it to zero → $2\mathbf{A}^\top(\mathbf{A}\mathbf{x}-\mathbf{b}) = \mathbf{0}$ → $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$. Geometric meaning: the residual $\mathbf{b}-\mathbf{A}\mathbf{x}$ must be orthogonal to the column space of $\mathbf{A}$.


Quiz


What is the geometric meaning of the normal equation $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$?

