Regression

Interview Essentials

Regression is the most common ML topic in DS interviews. Interviewers expect more than "fit a line": you should be able to discuss assumptions, diagnostics, regularization tradeoffs, and when regression breaks down.

Ordinary Least Squares (OLS)

Linear regression models the relationship between a response $y$ and features $\mathbf{x}$:

$$y = \mathbf{x}^\top \boldsymbol{\beta} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

The OLS estimator minimizes the sum of squared residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

The closed-form solution exists only when $\mathbf{X}^\top \mathbf{X}$ is invertible. When it is not (multicollinearity, or $p > n$), regularization is needed.
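The closed form is easy to check numerically. A minimal sketch on synthetic data (the true coefficients and noise scale below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])  # made-up ground truth
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, prefer lstsq (SVD-based, stable even when X'X is near-singular)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat.round(2))  # close to [2.0, -1.0, 0.5]
```

Solving the normal equations directly is fine for illustration, but `lstsq` (or a QR/SVD route) is what libraries actually use, precisely because $\mathbf{X}^\top \mathbf{X}$ can be ill-conditioned.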

The Five OLS Assumptions (Gauss-Markov)

Under these assumptions, OLS is the Best Linear Unbiased Estimator (BLUE):

  1. Linearity: $y$ is linear in the parameters (not necessarily in the features — polynomial regression is still linear in $\boldsymbol{\beta}$)
  2. Independence: observations are independent of each other
  3. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ is constant across all $i$
  4. No perfect multicollinearity: no feature is an exact linear combination of the others
  5. Exogeneity: $E[\epsilon \mid \mathbf{X}] = 0$ — errors are uncorrelated with the features

Normality Is NOT a Gauss-Markov Assumption

Normality of errors is not required for OLS to be BLUE. Normality is only needed for exact t-tests and F-tests in small samples; in large samples the CLT makes inference valid even with non-normal errors. This is one of the most common interview misconceptions.

Geometric Interpretation

The geometric intuition for OLS: $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to every feature.

This is why $\mathbf{X}^\top \mathbf{e} = \mathbf{0}$ — residuals being uncorrelated with the features is a mechanical consequence of OLS, not an assumption.
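This orthogonality is easy to confirm numerically — a small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)

# Fit OLS and form the residual vector e = y - X beta_hat
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat

# X'e is (numerically) zero: residuals are orthogonal to every column of X
print(X.T @ residuals)  # ~ [0, 0] up to floating-point error
```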

MLE Connection

If errors are normally distributed, the MLE for $\boldsymbol{\beta}$ is exactly the OLS estimator: minimizing squared error is equivalent to maximizing the Gaussian likelihood. This is also why MSE is the default loss function for regression.

Goodness of Fit

R-squared

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

$R^2$ measures the proportion of variance explained by the model. Adding more features can only increase $R^2$ (or leave it unchanged), even if the features are pure noise.

Adjusted R-squared

$$R^2_{\text{adj}} = 1 - \frac{SS_{\text{res}} / (n - p - 1)}{SS_{\text{tot}} / (n - 1)}$$

It penalizes model complexity via degrees of freedom. Adding a useless feature lowers adjusted $R^2$, which makes it better suited for model comparison than plain $R^2$.
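A quick sketch of the difference, appending pure-noise features to a one-feature model (the data-generating process is made up; adjusted $R^2$ typically drops here while $R^2$ can only rise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    # p = number of features (excluding the intercept)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 1))
y = 2 * X[:, 0] + rng.normal(size=n)

r2_base = LinearRegression().fit(X, y).score(X, y)

# Append 10 pure-noise features: R-squared can only go up
X_noisy = np.hstack([X, rng.normal(size=(n, 10))])
r2_noisy = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

print(f"R2: {r2_base:.3f} -> {r2_noisy:.3f}")
print(f"adjusted R2: {adjusted_r2(r2_base, n, 1):.3f} -> {adjusted_r2(r2_noisy, n, 11):.3f}")
```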

Other Metrics

| Metric | Formula | When to Use |
| --- | --- | --- |
| $R^2$ | $1 - SS_{\text{res}}/SS_{\text{tot}}$ | Quick measure of explained variance (but fooled by feature count) |
| Adjusted $R^2$ | Penalized by $p$ | Comparing models with different numbers of features |
| AIC / BIC | $-2\ln L + k \cdot p$ | Model selection (BIC penalizes complexity more heavily) |
| Cross-validated MSE | Out-of-sample error | Most reliable estimate of generalization |
| RMSE | $\sqrt{\text{MSE}}$ | Same units as the target, easy to interpret |

Interview Tip

When asked "How do you evaluate a regression model?", don't stop at $R^2$. Mention residual plots, cross-validated MSE, adjusted $R^2$, and whether the assumptions hold. Anscombe's quartet is the classic example of datasets with identical $R^2$ but completely different relationships.

Residual Analysis

A well-fitted model should have residuals that:

  • Show no pattern when plotted against predicted values (checks linearity + homoscedasticity)
  • Are approximately normally distributed (check with a Q-Q plot)
  • Show no autocorrelation (Durbin-Watson test for time series)

Residual Patterns and Diagnoses

| Pattern | Diagnosis | Fix |
| --- | --- | --- |
| Funnel shape (spread grows with fitted values) | Heteroscedasticity | Log-transform the target, WLS, robust SEs |
| Curved pattern | Non-linearity | Polynomial features, splines, nonlinear model |
| Clusters | Missing categorical variable | Add the categorical feature |
| Outliers with high leverage | Influential points | Check Cook's distance, investigate data quality |

Cook's Distance

Measures how much each observation influences the fitted model:

$$D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot \text{MSE}}$$

where $\hat{y}_{j(i)}$ is the prediction with observation $i$ removed. Common rules of thumb flag $D_i > 1$ or $D_i > 4/n$ as influential.

How to use this in an interview: "I'd first scan the residual plot for patterns, use Cook's distance to find influential points, then decide whether to fix the model or address a data-quality issue."

Multicollinearity

When features are highly correlated, OLS coefficients become unstable — small changes in the data cause large swings in $\hat{\boldsymbol{\beta}}$.

Detection: Variance Inflation Factor (VIF)

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ from regressing feature $j$ on all other features.

| VIF | Interpretation |
| --- | --- |
| 1 | No collinearity |
| 1–5 | Low, usually acceptable |
| 5–10 | Moderate, worth investigating |
| > 10 | High, coefficients likely unreliable |

Consequences and Solutions

Consequences: coefficients are still unbiased but have high variance. Individual p-values become unreliable, though overall prediction may remain good.

Solutions:

  • Remove one of the correlated features
  • Ridge regression — shrinks correlated coefficients, always invertible
  • PCA — combine correlated features into orthogonal components
  • Domain knowledge — decide which feature is more meaningful

Common Interview Follow-up

"Does multicollinearity make OLS biased?" — No. The coefficients remain unbiased, but their variance is large. If you only care about prediction (not coefficient interpretation), multicollinearity is not a problem. If you need feature importance or inference, you must address it.

Regularization

Ridge Regression (L2)

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Ridge shrinks coefficients toward zero but never sets them exactly to zero. $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})$ is always invertible, which makes Ridge a clean fix for multicollinearity.

Bayesian interpretation: a Gaussian prior on $\boldsymbol{\beta}$ → the MAP estimate is the Ridge solution.

Lasso Regression (L1)

$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Lasso performs automatic feature selection by driving some coefficients to exactly zero.

Why does L1 give exact zeros but L2 doesn't? Geometric intuition: the L1 constraint region is a diamond with corners on the coordinate axes, so the loss contours are likely to touch a corner (where some $\beta_j = 0$). The L2 constraint region is a circle with no corners.
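The same intuition shows up algebraically: for an orthonormal design ($\mathbf{X}^\top\mathbf{X} = \mathbf{I}$), both penalized problems have closed-form solutions acting on each OLS coefficient independently — Ridge rescales, Lasso soft-thresholds. A sketch (the coefficients in `b` are made up):

```python
import numpy as np

def ridge_shrink(b, lam):
    # Orthonormal-design Ridge solution: pure rescaling, never exactly zero
    return b / (1 + lam)

def lasso_soft_threshold(b, lam):
    # Orthonormal-design Lasso solution for the objective sum (y - Xb)^2 + lam * |b|:
    # soft-thresholding at lam/2 sets small coefficients to exactly zero
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0)

b = np.array([3.0, 0.4, -0.2])  # hypothetical OLS coefficients
print(ridge_shrink(b, 1.0))          # [1.5, 0.2, -0.1] -> all shrunk, none zero
print(lasso_soft_threshold(b, 1.0))  # [2.5, 0.0, 0.0] -> small ones exactly zero
```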

Bayesian interpretation: a Laplace prior on $\boldsymbol{\beta}$ → the MAP estimate is the Lasso solution.

Elastic Net

$$\hat{\boldsymbol{\beta}}_{\text{EN}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \left( \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 \right)$$

Combines the L1 and L2 penalties. When features are correlated in groups, Lasso tends to pick one at random and drop the rest; Elastic Net keeps the whole group.

Choosing Regularization

| Scenario | Best Choice | Why |
| --- | --- | --- |
| Many features, all potentially relevant | Ridge | Shrinks all, doesn't discard |
| Want feature selection, believe the model is sparse | Lasso | Drives irrelevant coefficients to 0 |
| Correlated feature groups | Elastic Net | Keeps grouped features together |
| $p > n$ | Ridge or Elastic Net | Lasso selects at most $n$ features |

Tuning $\lambda$

Use cross-validation to find the optimal $\lambda$:

```python
from sklearn.linear_model import RidgeCV, LassoCV

# RidgeCV automatically tunes alpha (sklearn's name for lambda) via cross-validation
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10, 100], cv=5).fit(X, y)
print(f"Best alpha: {ridge.alpha_}")

# LassoCV uses the coordinate descent path
lasso = LassoCV(cv=5).fit(X, y)
print(f"Best alpha: {lasso.alpha_}")
print(f"Non-zero features: {(lasso.coef_ != 0).sum()} / {len(lasso.coef_)}")
```

Polynomial Regression

To capture non-linear relationships while staying within the linear regression framework:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \epsilon$$

This is still linear in the parameters $\boldsymbol{\beta}$, so OLS applies directly. The danger is overfitting — as the degree $d$ approaches $n$, the model memorizes the training data.
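A quick sketch of that danger, fitting polynomials of increasing degree to noisy sine data (the dataset and degrees are arbitrary choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=80)

train_r2, cv_r2 = {}, {}
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2[degree] = model.fit(x, y).score(x, y)
    cv_r2[degree] = cross_val_score(model, x, y, cv=5).mean()
    print(f"degree={degree:2d}  train R2={train_r2[degree]:.2f}  CV R2={cv_r2[degree]:.2f}")
# Training R2 only rises with degree; out-of-sample R2 peaks near the true
# complexity, then degrades as the model starts fitting noise
```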

Logistic Regression

Logistic regression often comes up alongside linear regression in interviews. Despite the name, it is a classification model, not a regression model:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{x}^\top \boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}}$$

Key differences from linear regression:

| Aspect | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Output | Continuous value | Probability in $[0, 1]$ |
| Loss function | MSE (squared error) | Cross-entropy (log loss) |
| Optimization | Closed-form (OLS) | Iterative (gradient descent, Newton) |
| Interpretation | $\beta_j$ = change in $y$ per unit change in $x_j$ | $\beta_j$ = change in log-odds per unit change |
| Assumptions | Linear relationship | Linear in log-odds (not in probability) |

Odds and Log-Odds

$$\text{Odds} = \frac{P(y=1)}{1 - P(y=1)}, \quad \text{Log-odds (logit)} = \ln\left(\frac{P}{1-P}\right) = \mathbf{x}^\top \boldsymbol{\beta}$$

Coefficient interpretation: $\beta_j = 0.5$ means a one-unit increase in $x_j$ multiplies the odds by $e^{0.5} \approx 1.65$ (a 65% increase).
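A numeric check of the multiplicative-odds interpretation, using a hypothetical fitted model (both coefficients below are made up):

```python
import numpy as np

# Hypothetical fitted model: log-odds = -2 + 0.5 * x
def predict_proba(x):
    return 1 / (1 + np.exp(-(-2 + 0.5 * x)))

p1, p2 = predict_proba(1.0), predict_proba(2.0)
odds1, odds2 = p1 / (1 - p1), p2 / (1 - p2)

# A one-unit increase in x multiplies the odds by e^0.5, regardless of baseline
print(round(odds2 / odds1, 4), round(np.exp(0.5), 4))  # 1.6487 1.6487
```

Note that the *probability* change is not constant — it depends on where you start on the sigmoid — which is exactly why the linear-probability reading is wrong.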

Classic Interview Question

"How do you interpret a logistic regression coefficient?" — Not as "a one-unit increase in $x$ raises the probability of $y$ by $\beta$" (that is the linear probability model). Correct interpretation: a one-unit increase in $x$ raises the log-odds by $\beta$, equivalently multiplying the odds by $e^\beta$.

Bias-Variance Tradeoff

Use this interactive tool to see how model complexity affects the bias-variance tradeoff:

[Interactive widget: a polynomial-degree slider (1 = simple, 20 = complex) plotting training vs. validation error, the train/validation gap as a variance proxy, and the Bias² + Variance + Noise decomposition]

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

Regularization increases bias (the model is more constrained) but reduces variance (the model is more stable). The optimal model balances the two.

Real-World Use Cases

Case 1: House Price Prediction

Challenge: house prices are right-skewed and positive-only → use a log transform.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Log-transform target (house prices are approximately log-normal)
y_log = np.log1p(y)  # log(1 + y) handles y = 0

ridge = Ridge(alpha=10)
scores = cross_val_score(ridge, X, y_log, cv=5, scoring="neg_mean_squared_error")
# Predict on the original scale with np.expm1(model.predict(X_new))
```

Interview follow-ups:

  • "Why use a log transform?" — House prices are approximately log-normal; after the transform, residuals are closer to normal and homoscedastic
  • "How does feature interpretation change?" — A coefficient of 0.1 on the log scale ≈ a 10% increase in price
  • "What about categorical features?" — One-hot encoding, but watch the dummy variable trap ($k$ categories need only $k-1$ dummies)
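The dummy variable trap from the last follow-up can be illustrated with pandas (the `city` column is a made-up example):

```python
import pandas as pd

# Hypothetical categorical feature with k = 3 levels
df = pd.DataFrame({"city": ["Taipei", "Tokyo", "Seoul", "Taipei"]})

# drop_first=True keeps k-1 dummies: with an intercept in the model,
# all k dummies would sum to 1 and be perfectly collinear with it
dummies = pd.get_dummies(df["city"], drop_first=True)
print(dummies.shape[1])  # 2 columns for 3 categories
```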

Case 2: 信用卡詐欺偵測

Challenge: Binary outcome → Logistic regression。但 class imbalance 嚴重(0.1% fraud)。

```python
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights classes to adjust for imbalance
model = LogisticRegression(C=1.0, class_weight="balanced")
# Roughly equivalent to oversampling the minority class
# C is the inverse regularization strength (smaller C = more regularization)
```

Interview follow-ups:

  • "Why not use linear regression for classification?" — Predictions can fall outside [0, 1], and MSE is a poor fit for binary targets
  • "L1 or L2 regularization?" — L1 if you want feature selection (which features are most associated with fraud); L2 if you want to keep all features

Case 3: Predicting Customer Spend

Challenge: the target has many zeros (many customers never spend) → two-part model.

Part 1: logistic regression predicts whether a customer spends ($P(y > 0)$). Part 2: linear regression (on the log scale) predicts how much ($E[y \mid y > 0]$).

$$E[y] = P(y > 0) \times E[y \mid y > 0]$$

This is more sensible than running a single regression on the zero-inflated data, because "whether to spend" and "how much to spend" are two different decision mechanisms.
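A minimal two-part sketch on simulated data (the data-generating process is invented; the back-transform below ignores the log-normal mean correction $e^{\sigma^2/2}$ for simplicity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))

# Simulate a zero-inflated target: a purchase decision, then a log-normal amount
buys = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
amount = np.where(buys, np.exp(1 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)), 0.0)

# Part 1: P(y > 0) via logistic regression
clf = LogisticRegression().fit(X, amount > 0)
p_buy = clf.predict_proba(X)[:, 1]

# Part 2: E[y | y > 0] via linear regression on log(amount), spenders only
mask = amount > 0
reg = LinearRegression().fit(X[mask], np.log(amount[mask]))
amount_given_buy = np.exp(reg.predict(X))  # naive back-transform

# Combine: E[y] = P(y > 0) * E[y | y > 0]
expected_spend = p_buy * amount_given_buy
print(expected_spend[:3].round(2))
```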

Hands-on: Regression in Python

OLS vs Ridge vs Lasso

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Correlated features + noise features
n = 200
X1 = np.random.randn(n)
X2 = X1 * 0.9 + np.random.randn(n) * 0.3  # correlated with X1
X3 = np.random.randn(n)
X_noise = np.random.randn(n, 5)           # irrelevant features
X = np.column_stack([X1, X2, X3, X_noise])
y = 3 * X1 + 2 * X3 + np.random.randn(n) * 0.5  # only X1, X3 matter

models = {
    "OLS":   LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    model.fit(X, y)
    print(f"{name}: CV R2={scores.mean():.3f}, coefs={model.coef_.round(2)}")

# OLS:   unstable coefficients for correlated X1/X2
# Ridge: shrinks all coefficients, but keeps all nonzero
# Lasso: drives noise coefficients to exactly 0 (feature selection)
# True:  [3.00, 0.00, 2.00, 0, 0, 0, 0, 0]
```

Diagnostics with statsmodels

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
print(model.summary())
# Look at: coefficients, p-values, R2, F-statistic, Durbin-Watson

# VIF check for multicollinearity (compute on the design matrix with the
# constant included; skip index 0, which is the constant itself)
for i in range(1, X_const.shape[1]):
    vif = variance_inflation_factor(X_const, i)
    print(f"Feature {i - 1}: VIF = {vif:.2f}")
# VIF > 10 -> multicollinearity problem
```

Interview Signals

What interviewers listen for:

  • You can name the five OLS assumptions and know that normality is not one of them
  • You know the limitations of $R^2$ and don't judge model quality by it alone
  • You can explain the geometric difference between Ridge and Lasso (circle vs. diamond) and their Bayesian interpretations
  • You know multicollinearity does not affect unbiasedness, only variance
  • You can correctly interpret logistic regression coefficients (log-odds, not probability)

Practice

Flashcards


Why does adding more features always increase R² but not necessarily adjusted R²?

R² measures total variance explained and can only increase with more features (even noise). Adjusted R² penalizes by degrees of freedom — if a new feature's improvement is too small, adjusted R² decreases.


Quiz


Which regularization method performs automatic feature selection by setting coefficients exactly to zero?
