Least Squares & Applications
Course Reference
Based on Introduction to Applied Linear Algebra by Boyd & Vandenberghe (Stanford). Covers Chapters 12-19.
Why This Matters
Least squares is the most fundamental optimization framework in ML — linear regression is least squares, Ridge/Lasso are regularized least squares, and neural network training is nonlinear least squares. Understanding these lets you read the core of ML in one unified mathematical language.
The Least Squares Problem (Ch 12)
Problem Formulation
The least squares problem: find $\hat{x}$ that minimizes the sum of squared residuals: $\text{minimize } \|Ax - b\|^2$
where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$.
Expanding: $\|Ax - b\|^2 = x^T A^T A x - 2 b^T A x + b^T b$
This is a quadratic function of $x$, and $A^T A$ is PSD → convex → unique global minimum (if $A$ has full column rank).
Solution via Normal Equation
Take the gradient and set it to zero: $\nabla_x \|Ax - b\|^2 = 2A^T(Ax - b) = 0$
This gives the normal equation: $A^T A \hat{x} = A^T b$
If $A^T A$ is invertible (i.e., $A$ has full column rank): $\hat{x} = (A^T A)^{-1} A^T b$
$(A^T A)^{-1} A^T$ is exactly the pseudo-inverse $A^\dagger$ of $A$.
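A quick numerical check of this identity (the matrix here is a random full-column-rank example for illustration):

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(6, 3)  # tall random matrix; full column rank with probability 1

# For full column rank A, the Moore-Penrose pseudo-inverse equals (A^T A)^{-1} A^T
pinv_direct = np.linalg.pinv(A)
pinv_formula = np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(pinv_direct, pinv_formula)
```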
Geometric Interpretation
The most important intuition: the least squares solution makes $A\hat{x}$ the orthogonal projection of $b$ onto the column space of $A$.
The residual $r = b - A\hat{x}$ is orthogonal to the column space of $A$: $A^T(b - A\hat{x}) = 0$
This is the geometric meaning of the normal equation: the residual must be orthogonal to the column space for the projection to be the closest point.
import numpy as np
# Problem: fit y = a + b*x to data
x_data = np.array([1, 2, 3, 4, 5], dtype=float)
y_data = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
# Design matrix: [1, x]
A = np.column_stack([np.ones_like(x_data), x_data])
b = y_data
# Method 1: Normal equation (explicit)
x_hat = np.linalg.inv(A.T @ A) @ A.T @ b
print(f"Normal equation: intercept={x_hat[0]:.3f}, slope={x_hat[1]:.3f}")
# Method 2: np.linalg.lstsq (numerically stable, uses SVD internally)
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(f"lstsq: intercept={x_lstsq[0]:.3f}, slope={x_lstsq[1]:.3f}")
# Method 3: np.linalg.solve (solve normal equation directly)
x_solve = np.linalg.solve(A.T @ A, A.T @ b)
# Verify all three give the same answer
assert np.allclose(x_hat, x_lstsq)
assert np.allclose(x_hat, x_solve)
# Verify residual orthogonality: A^T (b - Ax) = 0
residual = b - A @ x_hat
orthogonality = A.T @ residual
print(f"A^T * residual = {orthogonality}") # ≈ [0, 0] (machine precision)
assert np.allclose(orthogonality, 0, atol=1e-10) # True ✓
Interview Connection: Normal Equation vs Gradient Descent
"When do you use the normal equation, and when gradient descent?" → Normal equation: computing the inverse costs $O(n^3)$, suitable when there are few features (small $n$). Gradient descent: $O(mn)$ per iteration, suitable when there are many features or the data does not fit in memory. Sklearn LinearRegression uses SVD-based lstsq under the hood (it does not invert $A^TA$ directly).
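To make the comparison concrete, here is a minimal gradient-descent sketch on a random least squares problem (step size and iteration count are illustrative, not tuned):

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(50, 3)
b = np.random.randn(50)

# Closed form via the normal equation
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient descent on ||Ax - b||^2: x <- x - eta * 2 A^T (Ax - b)
x = np.zeros(3)
eta = 0.001  # illustrative; must satisfy eta < 1 / lambda_max(A^T A) for convergence
for _ in range(5000):
    x -= eta * 2 * A.T @ (A @ x - b)

assert np.allclose(x, x_star, atol=1e-6)  # both methods agree
```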
Least Squares Data Fitting (Ch 13)
Polynomial Fitting
Polynomial regression is still linear least squares — the features simply become $1, x, x^2, \ldots, x^d$.
Design matrix for a degree-$d$ polynomial: the Vandermonde matrix, with row $i$ equal to $[1, x_i, x_i^2, \ldots, x_i^d]$.
The model is a linear function of the parameters → it can be solved with the normal equation. The prediction is a nonlinear function of the input, but the fit is linear in the parameters.
Overfitting and Model Complexity
A high-degree polynomial can fit the training data perfectly (n points → a degree n-1 polynomial passes through all of them), but generalizes poorly.
The core tradeoff:
- Underfitting (degree too low): high bias, low variance → both training and test error are large
- Overfitting (degree too high): low bias, high variance → small training error but large test error
- Sweet spot: bias-variance balance → choose the degree by cross-validation
ML Connections
| Concept | ML Connection |
|---|---|
| Polynomial fitting | Polynomial regression — design matrix with polynomial features |
| Basis expansion | Feature engineering: create nonlinear features, then linear fit. Kernel trick in SVM: implicit high-dim features |
| Overfitting | Universal problem — regularization, early stopping, and dropout are all countermeasures |
| Cross-validation | Standard model selection — K-fold CV for hyperparameter tuning |
| Design matrix | Feature matrix X in sklearn — each row = sample, each column = feature |
import numpy as np
np.random.seed(42)
# True function: y = 0.5 + 2x - 0.3x^2 + noise
n = 30
x = np.linspace(0, 5, n)
y_true = 0.5 + 2 * x - 0.3 * x**2
y = y_true + np.random.randn(n) * 0.5
# Polynomial fitting with different degrees
for degree in [1, 3, 15]:
    # Build design matrix: [1, x, x^2, ..., x^degree]
    A = np.column_stack([x**k for k in range(degree + 1)])
    coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
    y_pred = A @ coeffs
    train_error = np.mean((y - y_pred)**2)
    print(f"Degree {degree:2d}: train MSE = {train_error:.4f}, coeffs range = [{coeffs.min():.2f}, {coeffs.max():.2f}]")
# Overfitting demo: train/test split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
x_train, x_test = x[:20].reshape(-1, 1), x[20:].reshape(-1, 1)
y_train, y_test = y[:20], y[20:]
for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = np.mean((model.predict(x_train) - y_train)**2)
    test_mse = np.mean((model.predict(x_test) - y_test)**2)
    print(f"Degree {degree:2d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
# High degree: train MSE ≈ 0 but test MSE explodes → overfitting!
# Cross-validation for degree selection
best_degree, best_cv = 1, -np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x.reshape(-1, 1), y, cv=5, scoring='neg_mean_squared_error')
    cv_mse = -scores.mean()
    if scores.mean() > best_cv:
        best_cv = scores.mean()
        best_degree = degree
    print(f"Degree {degree}: CV MSE = {cv_mse:.3f}")
print(f"Best degree by CV: {best_degree}")
Interview Connection: Polynomial Regression Is Not Nonlinear Regression
A common misconception: polynomial regression is not nonlinear regression. It is linear regression with polynomial features. "Linear" refers to a linear combination of the parameters, not a linear function of the input. $y = \theta_0 + \theta_1 x + \theta_2 x^2$ is linear in $\theta$, so the normal equation applies.
Least Squares Classification (Ch 14)
Binary Classification via Least Squares
Force classification into least squares: set the labels to $+1$ or $-1$ and minimize $\|Ax - y\|^2$.
Decision boundary: $\hat{y} = \mathrm{sign}(x^T \hat{w})$, i.e., the hyperplane $x^T \hat{w} = 0$.
Intuition: least squares tries to make the predictions as close as possible to the target values (+1 or -1). The problems:
Why LS Classification is Suboptimal
| Issue | Explanation |
|---|---|
| Unbounded predictions | LS predictions can be any real number, not a probability. Correctly classified points with high confidence have large residuals → they get penalized anyway |
| Outlier sensitivity | Correctly classified points far from the boundary also produce large residuals → they distort the decision boundary |
| Squared loss for classification | Squared loss is not a natural loss for classification — correct points far from the boundary should not incur a large loss |
Better Alternatives
The advantage of the logistic loss: for correctly classified points with high confidence, the loss approaches 0 (they are not penalized).
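A quick numerical illustration of this point (the label and raw score values are chosen for the demo): for a correctly classified point with label $y=+1$ and a confident prediction $f(x)=5$,

```python
import numpy as np

y, f = 1.0, 5.0  # correct label, confident raw prediction

squared_loss = (f - y)**2                 # penalizes the confident correct point
logistic_loss = np.log1p(np.exp(-y * f))  # ~0 for confident correct predictions

print(f"squared loss:  {squared_loss:.4f}")   # 16.0000
print(f"logistic loss: {logistic_loss:.4f}")  # 0.0067
```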
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
np.random.seed(42)
# Generate binary classification data
X, y_01 = make_classification(n_samples=200, n_features=2, n_redundant=0,
                              n_informative=2, random_state=42, n_clusters_per_class=1)
y_pm1 = 2 * y_01 - 1 # Convert to {-1, +1}
# Split
X_train, X_test = X[:150], X[150:]
y_train, y_test = y_pm1[:150], y_pm1[150:]
y_train_01, y_test_01 = y_01[:150], y_01[150:]
# LS Classifier: minimize ||Xw - y||^2
# Add bias column
X_train_b = np.column_stack([np.ones(len(X_train)), X_train])
X_test_b = np.column_stack([np.ones(len(X_test)), X_test])
w_ls = np.linalg.lstsq(X_train_b, y_train, rcond=None)[0]
y_pred_ls = np.sign(X_test_b @ w_ls)
acc_ls = accuracy_score(y_test, y_pred_ls)
# Logistic Regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train_01)
acc_lr = accuracy_score(y_test_01, lr.predict(X_test))
print(f"LS Classifier accuracy: {acc_ls:.3f}")
print(f"Logistic Regression accuracy: {acc_lr:.3f}")
# Show why LS is suboptimal: predictions are unbounded
ls_predictions = X_test_b @ w_ls
print(f"LS prediction range: [{ls_predictions.min():.2f}, {ls_predictions.max():.2f}]")
print("LS predictions are unbounded — not probabilities!")
# Logistic outputs proper probabilities
lr_probs = lr.predict_proba(X_test)[:, 1]
print(f"Logistic prob range: [{lr_probs.min():.3f}, {lr_probs.max():.3f}]")
# Add outliers to see LS sensitivity
X_outlier = np.array([[10, 10], [12, 12], [15, 15]])
y_outlier = np.array([1, 1, 1]) # Correct class but far away
X_train2 = np.vstack([X_train, X_outlier])
y_train2 = np.concatenate([y_train, y_outlier])
X_train2_b = np.column_stack([np.ones(len(X_train2)), X_train2])
w_ls2 = np.linalg.lstsq(X_train2_b, y_train2, rcond=None)[0]
y_pred_ls2 = np.sign(X_test_b @ w_ls2)
acc_ls2 = accuracy_score(y_test, y_pred_ls2)
print(f"LS accuracy with outliers: {acc_ls2:.3f}")
# Outliers distort the LS boundary → accuracy degrades
Interview Connection: Why Not Use MSE for Classification?
"Why not use MSE loss for classification?" → MSE also gives a large penalty to correctly classified points far from the decision boundary (their predictions are far from +1/-1). Logistic/cross-entropy loss approaches 0 for confident correct predictions. Moreover, with a sigmoid output, the gradient of the MSE loss vanishes in high-confidence regions (gradient vanishing), which hurts optimization.
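The vanishing-gradient point can be checked directly. With a sigmoid output and squared loss $(\sigma(z)-y)^2$, the gradient w.r.t. $z$ is $2(\sigma(z)-y)\,\sigma(z)(1-\sigma(z))$, which goes to 0 for large $|z|$ even when the prediction is badly wrong. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mse_grad(z, y):
    # d/dz (sigmoid(z) - y)^2 = 2 (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return 2 * (s - y) * s * (1 - s)

def ce_grad(z, y):
    # d/dz of the cross-entropy loss with logits = sigmoid(z) - y (stays large when wrong)
    return sigmoid(z) - y

y = 1.0
for z in [-10.0, -2.0, 0.0]:  # z = -10: confidently WRONG prediction
    print(f"z={z:6.1f}: MSE grad={mse_grad(z, y):+.6f}, CE grad={ce_grad(z, y):+.6f}")
# At z=-10 the MSE gradient is ~0 (vanished) while the CE gradient is ~ -1
```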
Regularized Data Fitting (Ch 15) — Multi-Objective Least Squares
The Core Tradeoff
Regularization is multi-objective optimization: minimize two objectives simultaneously:
- Data fidelity: $\|Ax - b\|^2$ — fit the data well
- Regularization: $\lambda\|x\|_2^2$ (or $\lambda\|x\|_1$) — keep the model simple
Ridge Regression (Tikhonov Regularization, L2)
Closed-form solution: $\hat{x} = (A^T A + \lambda I)^{-1} A^T b$
The effect of $\lambda$:
- $\lambda \to 0$: reduces to OLS
- $\lambda \to \infty$: $\hat{x} \to 0$ (infinite shrinkage)
- Adding $\lambda I$ makes $A^T A + \lambda I$ always invertible → resolves multicollinearity
Lasso (L1 Regularization)
No closed form (the L1 norm is not differentiable at 0) → solve with coordinate descent or ISTA.
Key difference: L1 produces sparse solutions (many coefficients exactly 0) → automatic feature selection.
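The ISTA update mentioned above is short enough to sketch — a gradient step on the smooth term followed by soft-thresholding (the proximal step for the L1 penalty). Problem sizes and the penalty weight below are illustrative, not a tuned solver:

```python
import numpy as np

np.random.seed(0)
n, p = 100, 20
A = np.random.randn(n, p)
x_true = np.zeros(p)
x_true[:3] = [2.0, -1.5, 1.0]  # sparse ground truth
b = A @ x_true + 0.1 * np.random.randn(n)

lam = 5.0                             # L1 penalty weight (illustrative)
eta = 1 / np.linalg.norm(A.T @ A, 2)  # step size = 1 / Lipschitz constant of the gradient

x = np.zeros(p)
for _ in range(2000):
    # Gradient step on (1/2)||Ax - b||^2, then soft-threshold at eta * lam
    g = x - eta * A.T @ (A @ x - b)
    x = np.sign(g) * np.maximum(np.abs(g) - eta * lam, 0)

print("nonzero coefficients:", np.sum(np.abs(x) > 1e-6))  # sparse: most are exactly 0
print("recovered (first 5):", x[:5].round(2))
```

The soft-threshold sets small coordinates to exactly 0 — this is where Lasso's sparsity comes from mechanically.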
Elastic Net
Combines L1's sparsity with L2's stability (Lasso is unstable with correlated features; Elastic Net is more robust).
Bayesian Interpretation
| Regularization | Prior Distribution | MAP Estimation |
|---|---|---|
| Ridge (L2) | $w \sim \mathcal{N}(0, \tau^2 I)$ — Gaussian prior | MAP with Gaussian prior = Ridge |
| Lasso (L1) | $w \sim \mathrm{Laplace}(0, b)$ — Laplace prior (sharp peak at 0) | MAP with Laplace prior = Lasso |
| No regularization | Uniform (improper) prior | MLE = OLS |
The Bayesian view: regularization = incorporating prior belief about the parameters. Ridge believes the weights come from a Gaussian (not too large, but not exactly 0). Lasso believes the weights come from a Laplace distribution (many exactly 0, a few nonzero).
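The MAP = Ridge equivalence can be checked numerically. A sketch, with illustrative noise std $\sigma$ and prior std $\tau$, using $\lambda = \sigma^2/\tau^2$:

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(0)
A = np.random.randn(40, 3)
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0  # noise std, prior std (illustrative)
b = A @ w_true + sigma * np.random.randn(40)

# Negative log posterior with Gaussian likelihood and Gaussian prior:
# (1 / (2 sigma^2)) ||Aw - b||^2 + (1 / (2 tau^2)) ||w||^2
def neg_log_posterior(w):
    return np.sum((A @ w - b)**2) / (2 * sigma**2) + np.sum(w**2) / (2 * tau**2)

w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Ridge closed form with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ b)

assert np.allclose(w_map, w_ridge, atol=1e-3)  # MAP estimate = Ridge estimate
```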
Bias-Variance Tradeoff
- $\lambda = 0$ (OLS): zero bias, high variance → overfitting
- $\lambda$ large: high bias, low variance → underfitting
- Optimal $\lambda$: minimizes total MSE → choose by cross-validation
ML Connections
| Concept | ML Application |
|---|---|
| Ridge | sklearn Ridge, weight decay in neural nets ($\lambda\|w\|^2$ in the loss) |
| Lasso | Feature selection, sparse models, compressed sensing |
| Elastic Net | Genomics (correlated genes), text (correlated words) |
| Regularization path | Plot coefficients vs $\lambda$ → see which features survive |
| CV for lambda | sklearn RidgeCV, LassoCV — automated selection |
| Weight decay | Same as L2 regularization — standard in all neural network training |
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
np.random.seed(42)
# Generate data with many features (some irrelevant)
n, p = 100, 20
X = np.random.randn(n, p)
true_w = np.zeros(p)
true_w[:5] = [3, -2, 1.5, -1, 0.5] # Only first 5 features matter
y = X @ true_w + np.random.randn(n) * 0.5
# Compare OLS vs Ridge vs Lasso
models = {
'OLS': LinearRegression(),
'Ridge (lambda=1)': Ridge(alpha=1.0),
'Lasso (lambda=0.1)': Lasso(alpha=0.1),
'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    coefs = model.coef_
    nonzero = np.sum(np.abs(coefs) > 0.01)
    cv = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{name:25s}: nonzero={nonzero:2d}/{p}, CV MSE={-cv.mean():.3f}")
# Lasso produces a sparse solution: only ~5 nonzero coefficients ✓
print(f"\nTrue weights: {true_w.round(2)}")
print(f"Lasso weights: {Lasso(alpha=0.1).fit(X, y).coef_.round(2)}")
# Ridge closed-form: x = (A^T A + lambda I)^{-1} A^T b
lam = 1.0
x_ridge_manual = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
x_ridge_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
assert np.allclose(x_ridge_manual, x_ridge_sklearn, atol=1e-6)
# Cross-validation for lambda selection
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)
ridge_cv.fit(X, y)
print(f"\nBest Ridge lambda by CV: {ridge_cv.alpha_:.4f}")
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, random_state=42)
lasso_cv.fit(X, y)
print(f"Best Lasso lambda by CV: {lasso_cv.alpha_:.4f}")
# Regularization path: how coefficients change with lambda
alphas = np.logspace(-3, 2, 50)
ridge_coefs = []
for a in alphas:
    model = Ridge(alpha=a, fit_intercept=False)
    model.fit(X, y)
    ridge_coefs.append(model.coef_.copy())
ridge_coefs = np.array(ridge_coefs)
# As lambda increases: all coefficients shrink toward 0 (but never exactly 0)
print(f"\nRidge path: at lambda=0.001, max|coef|={np.abs(ridge_coefs[0]).max():.3f}")
print(f"Ridge path: at lambda=100, max|coef|={np.abs(ridge_coefs[-1]).max():.3f}")
Interview Connection: Weight Decay = L2 Regularization
"What is the relationship between weight decay in neural networks and Ridge regression?" → They are exactly the same. The weight_decay parameter of a PyTorch optimizer adds $\lambda\|w\|^2$ to the loss. AdamW separates weight decay from Adam's adaptive learning rate (decoupled weight decay), which works better.
Control via Least Squares
The control problem: given a dynamical system $x_{t+1} = Ax_t + Bu_t$, find an input sequence $u_0, \ldots, u_{T-1}$ that drives the state to a target.
This can be formulated as least squares: minimize $\|x_T - x^{\mathrm{des}}\|^2 + \rho \sum_t \|u_t\|^2$. The first term pushes the state toward the target; the second penalizes large inputs (save energy / avoid extreme actions).
Connection to RL: reward shaping + action penalties in RL are essentially multi-objective least squares. Large $\rho$ → conservative control (small actions); small $\rho$ → aggressive control.
Estimation and Inversion
The estimation problem: recover the true signal $x$ from noisy observations $y = Ax + v$.
Minimize $\|Ax - y\|^2 + \lambda\|x\|^2$: the first term fits the observations, the second regularizes (avoids noise amplification). Same spirit as denoising — a tradeoff between noise reduction and signal preservation.
Applications in ML: denoising autoencoders, signal processing, inverse problems (CT scan reconstruction).
import numpy as np
# Control: steer system from x0 to target using least squares
A = np.array([[1, 0.1], [0, 1]]) # simple dynamics
B = np.array([[0], [0.1]]) # input affects velocity
T = 10 # time horizon
n, m = 2, 1
x0 = np.array([0, 0])
x_target = np.array([1, 0]) # reach position 1, stop
# Stack all time steps into one LS problem:
# x_T = A^T x0 + sum_k A^(T-k-1) B u_k = A^T x0 + G u, with G = [A^(T-1)B, ..., AB, B]
# Minimize ||x_T - target||^2 + rho * sum ||u_k||^2  ->  ridge-type closed form
rho = 0.1
G = np.hstack([np.linalg.matrix_power(A, T - 1 - k) @ B for k in range(T)])
c = np.linalg.matrix_power(A, T) @ x0 - x_target  # final-state residual with zero input
u = np.linalg.solve(G.T @ G + rho * np.eye(T * m), -G.T @ c)
x_final = np.linalg.matrix_power(A, T) @ x0 + G @ u
print(f"Final state: {x_final.round(3)} (target: {x_target})")
# Estimation: recover signal from noisy observation
x_true = np.array([1.0, 2.0, 3.0])
A_obs = np.random.randn(10, 3) # observation matrix (more observations than unknowns)
noise = np.random.randn(10) * 0.5
y_obs = A_obs @ x_true + noise
# Simple LS estimate
x_hat = np.linalg.lstsq(A_obs, y_obs, rcond=None)[0]
print(f"True: {x_true}")
print(f"Estimated: {x_hat.round(3)}") # close to true
# Regularized estimate (Ridge) — more robust to noise
lam = 0.1
x_hat_reg = np.linalg.inv(A_obs.T @ A_obs + lam * np.eye(3)) @ A_obs.T @ y_obs
print(f"Ridge est: {x_hat_reg.round(3)}")
Constrained Least Squares (Ch 16-17)
Problem Formulation
Least squares with equality constraints: minimize $\|Ax - b\|^2$ subject to $Cx = d$. Solve via the Lagrangian: $L(x, \nu) = \|Ax - b\|^2 + \nu^T(Cx - d)$
KKT Conditions
Setting the gradients to zero gives the KKT system:
$$\begin{bmatrix} 2A^TA & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} \hat{x} \\ \nu \end{bmatrix} = \begin{bmatrix} 2A^Tb \\ d \end{bmatrix}$$
Solving this block system yields the optimal $\hat{x}$ and the Lagrange multipliers $\nu$ simultaneously.
Intuition: $\nu$ tells you the sensitivity of the optimal value to the constraint — how much the objective can improve if the constraint is slightly relaxed.
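The sensitivity interpretation can be verified with a finite difference: with Lagrangian $\|Ax-b\|^2 + \nu^T(Cx - d)$, the optimal value satisfies $dp^*/dd = -\nu$. A small sketch (the data is random, for illustration only):

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(8, 3)
b = np.random.randn(8)
C = np.array([[1.0, 1.0, 1.0]])  # constraint: x1 + x2 + x3 = d
d = np.array([1.0])

def solve_eq_ls(d):
    """Solve min ||Ax-b||^2 s.t. Cx = d via the KKT system; return (x, nu, optimal value)."""
    n, m = A.shape[1], C.shape[0]
    KKT = np.block([[2 * A.T @ A, C.T], [C, np.zeros((m, m))]])
    sol = np.linalg.solve(KKT, np.concatenate([2 * A.T @ b, d]))
    x, nu = sol[:n], sol[n:]
    return x, nu, np.sum((A @ x - b)**2)

_, nu, p0 = solve_eq_ls(d)
eps = 1e-4
_, _, p1 = solve_eq_ls(d + eps)

# Finite difference of the optimal value vs the multiplier prediction -nu
print(f"dp/dd (finite diff): {(p1 - p0) / eps:.4f}")
print(f"-nu:                 {-nu[0]:.4f}")  # the two should match
```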
ML Connections
| Concept | ML Application |
|---|---|
| Constrained optimization | SVM: minimize $\|w\|^2$ s.t. $y_i(w^Tx_i + b) \ge 1$ (constrained QP) |
| Lagrange multipliers | Support vectors = points whose multiplier $\alpha_i > 0$ (active constraints) |
| Portfolio optimization | Minimize variance s.t. expected return = target (Markowitz) |
| Equality constraints | Softmax normalization ($\sum_i p_i = 1$), probability simplex constraints |
| KKT conditions | Foundation of convex optimization — used in interior-point methods |
import numpy as np
from scipy.optimize import minimize
np.random.seed(42)
# Constrained LS: minimize ||Ax - b||^2 subject to Cx = d
A = np.random.randn(10, 4)
b = np.random.randn(10)
C = np.array([[1, 1, 1, 1]]) # constraint: sum of x_i = 1
d = np.array([1.0])
# Method 1: Solve KKT system directly
n = A.shape[1]
m = C.shape[0]
KKT = np.block([
    [2 * A.T @ A, C.T],
    [C, np.zeros((m, m))]
])
rhs = np.concatenate([2 * A.T @ b, d])
sol = np.linalg.solve(KKT, rhs)
x_kkt = sol[:n]
nu = sol[n:] # Lagrange multiplier
print(f"KKT solution: x = {x_kkt.round(4)}")
print(f"Sum of x = {x_kkt.sum():.6f}") # = 1.0 ✓
print(f"Lagrange multiplier nu = {nu.round(4)}")
# Method 2: scipy.optimize with constraint
def objective(x):
    return np.sum((A @ x - b)**2)
constraint = {'type': 'eq', 'fun': lambda x: C @ x - d}
result = minimize(objective, np.zeros(4), constraints=constraint)
print(f"scipy solution: x = {result.x.round(4)}")
assert np.allclose(x_kkt, result.x, atol=1e-4)
# Portfolio optimization example
# 5 assets, minimize variance s.t. expected return = target, weights sum to 1
n_assets = 5
returns = np.array([0.10, 0.12, 0.08, 0.15, 0.11]) # expected returns
cov_matrix = np.array([
    [0.04, 0.006, 0.002, 0.01, 0.003],
    [0.006, 0.09, 0.004, 0.015, 0.005],
    [0.002, 0.004, 0.01, 0.003, 0.002],
    [0.01, 0.015, 0.003, 0.16, 0.008],
    [0.003, 0.005, 0.002, 0.008, 0.025],
])
target_return = 0.12
def portfolio_variance(w):
    return w @ cov_matrix @ w
constraints = [
    {'type': 'eq', 'fun': lambda w: np.sum(w) - 1},  # weights sum to 1
    {'type': 'eq', 'fun': lambda w: w @ returns - target_return},  # target return
]
bounds = [(0, 1)] * n_assets # no short selling
result = minimize(portfolio_variance, np.ones(n_assets) / n_assets,
                  constraints=constraints, bounds=bounds, method='SLSQP')
print(f"\nOptimal portfolio weights: {result.x.round(4)}")
print(f"Portfolio return: {result.x @ returns:.4f}")
print(f"Portfolio std: {np.sqrt(result.fun):.4f}")
Linear Quadratic Regulator (LQR)
LQR is the classic application of constrained optimization in control: minimize $\sum_t \left(x_t^T Q x_t + u_t^T R u_t\right)$ subject to $x_{t+1} = Ax_t + Bu_t$.
- $Q$: penalizes state deviation (PSD matrix — which states should stay near 0)
- $R$: penalizes control effort (PD matrix — avoid extreme actions)
- The solution is linear feedback: $u_t = -K_t x_t$ (the optimal gain matrix $K_t$ is determined by the Riccati equation)
ML connection: continuous control in RL (robotics) is essentially the nonlinear generalization of LQR. Policy gradient methods learn $u = \pi_\theta(x)$ when the dynamics are unknown. Model-based RL uses a learned dynamics model + LQR as the planning module.
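A minimal sketch of the finite-horizon LQR solution via the backward Riccati recursion (the dynamics, cost matrices, and horizon below are illustrative choices):

```python
import numpy as np

# Double-integrator-style dynamics (illustrative)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)          # state cost
R = np.array([[0.1]])  # control cost
T = 50

# Backward Riccati recursion: P_T = Q, then
# K_t = (R + B^T P B)^{-1} B^T P A,  P_{t-1} = Q + A^T P (A - B K_t)
P = Q.copy()
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()  # gains[t] is now the feedback gain at time t

# Simulate the closed loop u_t = -K_t x_t from an initial deviation
x = np.array([1.0, 0.0])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
print(f"state after {T} steps: {x.round(4)}")  # driven toward 0
```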
Linear Quadratic State Estimation (Kalman Filter)
Kalman filter = optimal state estimation for linear systems with Gaussian noise: dynamics $x_{t+1} = Ax_t + w_t$, observations $y_t = Cx_t + v_t$, with Gaussian $w_t, v_t$.
$K$ = the Kalman gain — it balances prediction vs observation. High prediction uncertainty → trust the observation more. High observation noise → trust the prediction more.
ML connections:
| Application | Connection |
|---|---|
| Time series | Kalman filter = Bayesian state estimation for linear time series |
| Sensor fusion | Combine GPS + IMU readings optimally |
| Object tracking | Predict + correct object position in video |
| RNN analogy | Kalman: predict → observe → update state. RNN: predict → input → update hidden state |
import numpy as np
# Simple Kalman Filter: track position from noisy measurements
A = np.array([[1, 1], [0, 1]]) # state transition: [pos, vel]
C = np.array([[1, 0]]) # observe position only
Q_noise = np.eye(2) * 0.01 # process noise
R_noise = np.array([[1.0]]) # measurement noise
# Initialize
x_hat = np.array([0.0, 0.0]) # initial state estimate
P = np.eye(2) # initial covariance
# True trajectory
x_true = np.array([0.0, 1.0]) # start at 0, velocity 1
measurements = []
estimates = []
for t in range(20):
    # True state evolution
    x_true = A @ x_true + np.random.multivariate_normal([0, 0], Q_noise)
    z = C @ x_true + np.random.normal(0, np.sqrt(R_noise[0, 0]))
    measurements.append(z.copy())
    # Predict
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q_noise
    # Update (Kalman gain)
    S = C @ P_pred @ C.T + R_noise
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_hat = x_pred + K @ (z - C @ x_pred)
    P = (np.eye(2) - K @ C) @ P_pred
    estimates.append(x_hat.copy())
print(f"True final state: {x_true.round(2)}")
print(f"Estimated final state: {x_hat.round(2)}")
LQR + Kalman = LQG
LQR (optimal control) + Kalman filter (optimal estimation) = LQG (Linear Quadratic Gaussian) — the separation principle lets you design the controller and the estimator independently. This is a fundamental result of control theory, and the theoretical foundation of model-based RL.
Nonlinear Least Squares (Ch 18-19)
Problem Formulation
When the model is nonlinear in the parameters: minimize $\|f(x)\|^2$, where $f(x) = (f_1(x), \ldots, f_m(x))$ is a vector of nonlinear residuals.
The normal equation no longer applies (because $f$ is not linear in $x$) → iterative methods are needed.
Gauss-Newton Algorithm
Core idea: linearize $f$ around the current iterate $x^{(k)}$ (first-order Taylor expansion): $f(x) \approx f(x^{(k)}) + J(x^{(k)})(x - x^{(k)})$
where $J(x^{(k)})$ is the Jacobian of $f$ at $x^{(k)}$.
Substituting into the least squares problem → it becomes linear least squares: minimize $\|f(x^{(k)}) + J(x^{(k)})(x - x^{(k)})\|^2$
Normal equation for the update: $x^{(k+1)} = x^{(k)} - (J^TJ)^{-1}J^T f(x^{(k)})$
Then repeat until convergence.
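The iteration above can be hand-rolled for a small exponential model $f(t) = a e^{-bt}$ (an illustrative sketch on noiseless made-up data; a production solver would add damping as in Levenberg-Marquardt below):

```python
import numpy as np

t = np.linspace(0, 2, 30)
a_true, b_true = 2.0, 1.5
y = a_true * np.exp(-b_true * t)  # noiseless target, for clarity

def residual(p):
    a, b = p
    return a * np.exp(-b * t) - y

def jacobian(p):
    a, b = p
    e = np.exp(-b * t)
    # Columns: d r / d a = e^{-bt},  d r / d b = -a t e^{-bt}
    return np.column_stack([e, -a * t * e])

p = np.array([1.0, 1.0])  # starting guess
for _ in range(20):
    J, r = jacobian(p), residual(p)
    # Gauss-Newton step: solve J^T J delta = -J^T r
    delta = np.linalg.solve(J.T @ J, -J.T @ r)
    p = p + delta

print(f"estimated a={p[0]:.4f}, b={p[1]:.4f}")  # should converge to a=2, b=1.5
```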
Levenberg-Marquardt
Gauss-Newton + regularization (trust region): $x^{(k+1)} = x^{(k)} - (J^TJ + \lambda I)^{-1}J^T f(x^{(k)})$
- Large $\lambda$ → close to gradient descent (small, careful steps)
- Small $\lambda$ → close to Gauss-Newton (aggressive steps)
- $\lambda$ is adapted automatically: good step → decrease $\lambda$ (more aggressive); bad step → increase $\lambda$ (more cautious)
Connection to Neural Networks
Neural network training is essentially nonlinear least squares (when using MSE loss): minimize $\sum_i \|h_\theta(x_i) - y_i\|^2$ over the network parameters $\theta$:
- Backpropagation = computing gradient of this nonlinear LS problem
- SGD = simplified version of Gauss-Newton(only first-order, no Jacobian/Hessian)
- Natural gradient / K-FAC = use curvature information similar to Gauss-Newton
Logistic regression is also a special case of nonlinear LS: $\hat{y} = \sigma(w^Tx)$ (the sigmoid makes it nonlinear in $w$... but it is usually trained with cross-entropy loss rather than squared loss).
ML Connections
| Concept | ML Application |
|---|---|
| Nonlinear LS | Neural network training with MSE loss |
| Gauss-Newton | Foundation for second-order optimizers (K-FAC, natural gradient) |
| Levenberg-Marquardt | scipy curve fitting, robotics, computer vision (bundle adjustment) |
| Jacobian | Backpropagation computes product of Jacobians through layers |
| Logistic regression | Nonlinear LS with sigmoid (but usually trained with cross-entropy) |
| Trust region methods | More robust than line search for non-convex problems |
import numpy as np
from scipy.optimize import least_squares, curve_fit
np.random.seed(42)
# Example 1: Nonlinear curve fitting
# True model: y = a * exp(-b * x) + c
def model(x, a, b, c):
    return a * np.exp(-b * x) + c
x_data = np.linspace(0, 5, 50)
y_true = model(x_data, 2.5, 1.3, 0.5)
y_data = y_true + np.random.randn(50) * 0.1
# Method 1: scipy.optimize.curve_fit (uses Levenberg-Marquardt)
popt, pcov = curve_fit(model, x_data, y_data, p0=[1.0, 1.0, 0.0])
print(f"curve_fit: a={popt[0]:.3f}, b={popt[1]:.3f}, c={popt[2]:.3f}")
print(f"True: a=2.500, b=1.300, c=0.500")
# Method 2: scipy.optimize.least_squares (more control)
def residuals(params):
    a, b, c = params
    return model(x_data, a, b, c) - y_data
result = least_squares(residuals, x0=[1.0, 1.0, 0.0], method='lm') # 'lm' = Levenberg-Marquardt
print(f"least_squares: a={result.x[0]:.3f}, b={result.x[1]:.3f}, c={result.x[2]:.3f}")
print(f"Cost (sum of squared residuals): {result.cost:.4f}")
print(f"Optimality (gradient norm): {result.optimality:.2e}")
# Example 2: Logistic regression as nonlinear LS
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
# Generate binary classification data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)
# Logistic regression via nonlinear least squares (minimize ||sigmoid(Xw) - y||^2)
def logistic_residuals(params):
    w = params[:2]
    b = params[2]
    pred = sigmoid(X @ w + b)
    return pred - y
result_logistic = least_squares(logistic_residuals, x0=np.zeros(3))
w_nls = result_logistic.x[:2]
b_nls = result_logistic.x[2]
pred_nls = (sigmoid(X @ w_nls + b_nls) > 0.5).astype(int)
acc_nls = np.mean(pred_nls == y)
print(f"\nLogistic via nonlinear LS: accuracy={acc_nls:.3f}")
# Compare with sklearn LogisticRegression (uses cross-entropy, not squared loss)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42)
lr.fit(X, y)
acc_lr = lr.score(X, y)
print(f"sklearn LogisticRegression: accuracy={acc_lr:.3f}")
# Example 3: Neural network connection
# Simple 1-hidden-layer network: f(x) = W2 * relu(W1 * x + b1) + b2
# This is nonlinear LS when using MSE loss
def nn_predict(params, X):
    """Simple neural net: 2 -> 4 -> 1"""
    W1 = params[:8].reshape(4, 2)
    b1 = params[8:12]
    W2 = params[12:16].reshape(1, 4)
    b2 = params[16]
    h = np.maximum(0, X @ W1.T + b1)  # ReLU
    return (h @ W2.T + b2).flatten()
def nn_residuals(params):
    return nn_predict(params, X) - y
# Solve with Levenberg-Marquardt-like method
result_nn = least_squares(nn_residuals, x0=np.random.randn(17) * 0.1,
                          method='trf', max_nfev=1000)
pred_nn = (nn_predict(result_nn.x, X) > 0.5).astype(int)
acc_nn = np.mean(pred_nn == y)
print(f"Simple NN via nonlinear LS: accuracy={acc_nn:.3f}")
print("Neural net training = nonlinear least squares (with MSE loss)")
Interview Connection: Second-Order vs First-Order Optimization
"What is the difference between SGD and Newton's method?" → SGD uses only the gradient (first-order); each step is cheap but convergence is slow. Newton/Gauss-Newton use Hessian/Jacobian information (second-order); each step is expensive but convergence is fast. Adam uses a running average of squared gradients to approximate a diagonal Hessian → a compromise between first- and second-order.
Constrained Nonlinear Least Squares
When the optimization has both a nonlinear objective and constraints: minimize $\|f(x)\|^2$ subject to $g(x) = 0$.
Penalty Method
Convert the constrained problem into an unconstrained one by adding a penalty: minimize $\|f(x)\|^2 + \mu\|g(x)\|^2$
Larger $\mu$ → heavier penalty on constraint violation → the solution gets closer to feasible, but the optimization becomes more ill-conditioned.
In practice, progressively increase $\mu$: start with a small $\mu$ (easy optimization), then increase it until the constraint is nearly satisfied.
ML connection: regularization is a form of penalty method — the original problem is constrained ($\|x\| \le t$), and the penalty turns it into an unconstrained one (Ridge regression).
Augmented Lagrangian
Combines Lagrange multipliers with the penalty method — better convergence: $L_\mu(x, \lambda) = \|f(x)\|^2 + \lambda^T g(x) + \frac{\mu}{2}\|g(x)\|^2$
Alternately update $x$ (minimize the augmented Lagrangian) and $\lambda$ (dual update: $\lambda \leftarrow \lambda + \mu\, g(x)$).
Better than the pure penalty method: you get an exact solution without needing $\mu \to \infty$ (the Lagrange multiplier provides an additional degree of freedom).
ML connection: ADMM (Alternating Direction Method of Multipliers) is a generalization of the augmented Lagrangian — used in distributed optimization and large-scale ML.
Nonlinear Control
Extend LQR to nonlinear dynamics: $x_{t+1} = f(x_t, u_t)$ (no longer assumed linear).
Methods:
- iLQR (iterative LQR): Linearize dynamics around current trajectory → solve LQR → update trajectory → repeat
- Model Predictive Control (MPC): Online re-solve at each time step with updated state
- Policy gradient (RL): Learn control policy without knowing dynamics model
iLQR and MPC are at the core of model-based RL — once you know (or learn) a dynamics model, you find the optimal policy by optimization rather than trial and error.
from scipy.optimize import minimize
import numpy as np
# Penalty method: minimize x^2 subject to x = 3
# Unconstrained: minimize x^2 + mu * (x - 3)^2
for mu in [0.1, 1, 10, 100, 1000]:
    result = minimize(lambda x: x[0]**2 + mu * (x[0] - 3)**2, x0=[0])
    x_opt = result.x[0]
    print(f"mu={mu:>6}: x={x_opt:.4f}, constraint violation={abs(x_opt-3):.4f}")
# As mu increases: x → 3 (constraint satisfied), but conditioning worsens
# Augmented Lagrangian: better convergence
x = 0.0
lam = 0.0 # Lagrange multiplier
mu = 1.0
for iteration in range(10):
    # Minimize augmented Lagrangian w.r.t. x
    result = minimize(lambda x: x[0]**2 + lam*(x[0]-3) + mu/2*(x[0]-3)**2, x0=[x])
    x = result.x[0]
    # Dual update
    lam = lam + mu * (x - 3)
    if abs(x - 3) < 1e-6:
        break
print(f"\nAugmented Lagrangian: x={x:.6f} (should be close to 3)")
Interview Signals
What interviewers listen for:
- You can derive the OLS closed-form solution from the normal equation
- You understand the geometric meaning of residual orthogonality (projection onto the column space)
- You can explain the difference between Ridge and Lasso (L2 shrinkage vs L1 sparsity) and the Bayesian interpretation
- You know that polynomial regression is still linear regression (linear in the parameters)
- You understand the flaws of LS classification and why the logistic loss is better
- You can connect neural network training to nonlinear least squares
- You know that weight decay = L2 regularization
Practice
Flashcards
What is the normal equation for least squares? How is it derived?
Normal equation: $A^TA\hat{x} = A^Tb$, with solution $\hat{x} = (A^TA)^{-1}A^Tb$. Derivation: take the gradient of $\|Ax-b\|^2$ and set it to zero → $2A^T(Ax-b) = 0$ → $A^TA\hat{x} = A^Tb$. Geometric meaning: the residual $b - A\hat{x}$ must be orthogonal to the column space of $A$.
Quiz
What is the geometric meaning of the normal equation $A^TA\hat{x} = A^Tb$?