Regression

Interview Essentials

Regression is the most common ML topic in DS interviews. Interviewers expect more than "fit a line": you should be able to discuss assumptions, diagnostics, regularization tradeoffs, and when regression breaks down.

Ordinary Least Squares (OLS)

Linear regression models the relationship between a response $y$ and features $\mathbf{x}$:

$$y = \mathbf{x}^\top \boldsymbol{\beta} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

The OLS estimator minimizes the sum of squared residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

The closed-form solution exists only when $\mathbf{X}^\top \mathbf{X}$ is invertible. When it is not (multicollinearity, or $p > n$), regularization is needed.
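The closed form is easy to check numerically. A minimal sketch on synthetic data (the true coefficients and noise scale below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])  # made-up ground truth
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, prefer lstsq (SVD-based, stable even when X'X is near-singular)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat.round(2))  # close to [2.0, -1.0, 0.5]
```

Solving the normal equations directly is fine for illustration, but `lstsq` (or a QR/SVD route) is what libraries actually use, precisely because $\mathbf{X}^\top \mathbf{X}$ can be ill-conditioned.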

The Five OLS Assumptions (Gauss-Markov)

Under these assumptions, OLS is the Best Linear Unbiased Estimator (BLUE):

  1. Linearity: $y$ is linear in the parameters (not necessarily in the features — polynomial regression is still linear in $\boldsymbol{\beta}$)
  2. Independence: observations are independent of each other
  3. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ is constant across all $i$
  4. No perfect multicollinearity: no feature is an exact linear combination of the others
  5. Exogeneity: $E[\epsilon \mid \mathbf{X}] = 0$ — errors are uncorrelated with the features

Normality Is NOT a Gauss-Markov Assumption

Normality of errors is not required for OLS to be BLUE. Normality is only needed for exact t-tests and F-tests in small samples; in large samples the CLT makes inference valid even with non-normal errors. This is one of the most common interview misconceptions.

Geometric Interpretation

The geometric intuition for OLS: $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to every feature.

This is why $\mathbf{X}^\top \mathbf{e} = \mathbf{0}$ — residuals being uncorrelated with the features is a mechanical consequence of OLS, not an assumption.
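This orthogonality is easy to confirm numerically — a small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)

# Fit OLS and form the residual vector e = y - X beta_hat
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat

# X'e is (numerically) zero: residuals are orthogonal to every column of X
print(X.T @ residuals)  # ~ [0, 0] up to floating-point error
```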

MLE Connection

If errors are normally distributed, the MLE for $\boldsymbol{\beta}$ is exactly the OLS estimator: minimizing squared error is equivalent to maximizing the Gaussian likelihood. This is also why MSE is the default loss function for regression.

Goodness of Fit

R-squared

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

$R^2$ measures the proportion of variance explained by the model. Adding more features can only increase $R^2$ (or leave it unchanged), even if the features are pure noise.

Adjusted R-squared

$$R^2_{\text{adj}} = 1 - \frac{SS_{\text{res}} / (n - p - 1)}{SS_{\text{tot}} / (n - 1)}$$

It penalizes model complexity via degrees of freedom. Adding a useless feature lowers adjusted $R^2$, which makes it better suited for model comparison than plain $R^2$.
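A quick sketch of the difference, appending pure-noise features to a one-feature model (the data-generating process is made up; adjusted $R^2$ typically drops here while $R^2$ can only rise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    # p = number of features (excluding the intercept)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 1))
y = 2 * X[:, 0] + rng.normal(size=n)

r2_base = LinearRegression().fit(X, y).score(X, y)

# Append 10 pure-noise features: R-squared can only go up
X_noisy = np.hstack([X, rng.normal(size=(n, 10))])
r2_noisy = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

print(f"R2: {r2_base:.3f} -> {r2_noisy:.3f}")
print(f"adjusted R2: {adjusted_r2(r2_base, n, 1):.3f} -> {adjusted_r2(r2_noisy, n, 11):.3f}")
```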

Other Metrics

| Metric | Formula | When to Use |
| --- | --- | --- |
| $R^2$ | $1 - SS_{\text{res}}/SS_{\text{tot}}$ | Quick measure of explained variance (but fooled by feature count) |
| Adjusted $R^2$ | Penalized by $p$ | Comparing models with different numbers of features |
| AIC / BIC | $-2\ln L + k \cdot p$ | Model selection (BIC penalizes complexity more heavily) |
| Cross-validated MSE | Out-of-sample error | Most reliable estimate of generalization |
| RMSE | $\sqrt{\text{MSE}}$ | Same units as the target, easy to interpret |

Interview Tip

When asked "How do you evaluate a regression model?", don't stop at $R^2$. Mention residual plots, cross-validated MSE, adjusted $R^2$, and whether the assumptions hold. Anscombe's quartet is the classic example of datasets with identical $R^2$ but completely different relationships.

Residual Analysis

A well-fitted model should have residuals that:

  • Show no pattern when plotted against predicted values (checks linearity + homoscedasticity)
  • Are approximately normally distributed (check with a Q-Q plot)
  • Show no autocorrelation (Durbin-Watson test for time series)

Residual Patterns and Diagnoses

| Pattern | Diagnosis | Fix |
| --- | --- | --- |
| Funnel shape (spread grows with fitted values) | Heteroscedasticity | Log-transform the target, WLS, robust SEs |
| Curved pattern | Non-linearity | Polynomial features, splines, nonlinear model |
| Clusters | Missing categorical variable | Add the categorical feature |
| Outliers with high leverage | Influential points | Check Cook's distance, investigate data quality |

Cook's Distance

Measures how much each observation influences the fitted model:

$$D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot \text{MSE}}$$

where $\hat{y}_{j(i)}$ is the prediction with observation $i$ removed. Common rules of thumb flag $D_i > 1$ or $D_i > 4/n$ as influential.

How to use this in an interview: "I'd first scan the residual plot for patterns, use Cook's distance to find influential points, then decide whether to fix the model or address a data-quality issue."

Multicollinearity

When features are highly correlated, OLS coefficients become unstable — small changes in the data cause large swings in $\hat{\boldsymbol{\beta}}$.

Detection: Variance Inflation Factor (VIF)

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ from regressing feature $j$ on all other features.

| VIF | Interpretation |
| --- | --- |
| 1 | No collinearity |
| 1–5 | Low, usually acceptable |
| 5–10 | Moderate, worth investigating |
| > 10 | High, coefficients likely unreliable |

Consequences and Solutions

Consequences: coefficients are still unbiased but have high variance. Individual p-values become unreliable, though overall prediction may remain good.

Solutions:

  • Remove one of the correlated features
  • Ridge regression — shrinks correlated coefficients, always invertible
  • PCA — combine correlated features into orthogonal components
  • Domain knowledge — decide which feature is more meaningful

Common Interview Follow-up

"Does multicollinearity make OLS biased?" — No. The coefficients remain unbiased, but their variance is large. If you only care about prediction (not coefficient interpretation), multicollinearity is not a problem. If you need feature importance or inference, you must address it.

Regularization

Ridge Regression (L2)

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Ridge shrinks coefficients toward zero but never sets them exactly to zero. $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})$ is always invertible, which makes Ridge a clean fix for multicollinearity.

Bayesian interpretation: a Gaussian prior on $\boldsymbol{\beta}$ → the MAP estimate is the Ridge solution.

Lasso Regression (L1)

$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Lasso performs automatic feature selection by driving some coefficients to exactly zero.

Why does L1 give exact zeros but L2 doesn't? Geometric intuition: the L1 constraint region is a diamond with corners on the coordinate axes, so the loss contours are likely to touch a corner (where some $\beta_j = 0$). The L2 constraint region is a circle with no corners.
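The same intuition shows up algebraically: for an orthonormal design ($\mathbf{X}^\top\mathbf{X} = \mathbf{I}$), both penalized problems have closed-form solutions acting on each OLS coefficient independently — Ridge rescales, Lasso soft-thresholds. A sketch (the coefficients in `b` are made up):

```python
import numpy as np

def ridge_shrink(b, lam):
    # Orthonormal-design Ridge solution: pure rescaling, never exactly zero
    return b / (1 + lam)

def lasso_soft_threshold(b, lam):
    # Orthonormal-design Lasso solution for the objective sum (y - Xb)^2 + lam * |b|:
    # soft-thresholding at lam/2 sets small coefficients to exactly zero
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0)

b = np.array([3.0, 0.4, -0.2])  # hypothetical OLS coefficients
print(ridge_shrink(b, 1.0))          # [1.5, 0.2, -0.1] -> all shrunk, none zero
print(lasso_soft_threshold(b, 1.0))  # [2.5, 0.0, 0.0] -> small ones exactly zero
```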

Bayesian interpretation: a Laplace prior on $\boldsymbol{\beta}$ → the MAP estimate is the Lasso solution.

Elastic Net

$$\hat{\boldsymbol{\beta}}_{\text{EN}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \left( \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 \right)$$

Combines the L1 and L2 penalties. When features are correlated in groups, Lasso tends to pick one at random and drop the rest; Elastic Net keeps the whole group.

Choosing Regularization

| Scenario | Best Choice | Why |
| --- | --- | --- |
| Many features, all potentially relevant | Ridge | Shrinks all, doesn't discard |
| Want feature selection, believe the model is sparse | Lasso | Drives irrelevant coefficients to 0 |
| Correlated feature groups | Elastic Net | Keeps grouped features together |
| $p > n$ | Ridge or Elastic Net | Lasso selects at most $n$ features |

Tuning $\lambda$

Use cross-validation to find the optimal $\lambda$:

```python
from sklearn.linear_model import RidgeCV, LassoCV

# RidgeCV automatically tunes alpha (sklearn's name for lambda) via cross-validation
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10, 100], cv=5).fit(X, y)
print(f"Best alpha: {ridge.alpha_}")

# LassoCV uses the coordinate descent path
lasso = LassoCV(cv=5).fit(X, y)
print(f"Best alpha: {lasso.alpha_}")
print(f"Non-zero features: {(lasso.coef_ != 0).sum()} / {len(lasso.coef_)}")
```

Polynomial Regression

To capture non-linear relationships while staying within the linear regression framework:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \epsilon$$

This is still linear in the parameters $\boldsymbol{\beta}$, so OLS applies directly. The danger is overfitting — as the degree $d$ approaches $n$, the model memorizes the training data.
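A quick sketch of that danger, fitting polynomials of increasing degree to noisy sine data (the dataset and degrees are arbitrary choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=80)

train_r2, cv_r2 = {}, {}
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2[degree] = model.fit(x, y).score(x, y)
    cv_r2[degree] = cross_val_score(model, x, y, cv=5).mean()
    print(f"degree={degree:2d}  train R2={train_r2[degree]:.2f}  CV R2={cv_r2[degree]:.2f}")
# Training R2 only rises with degree; out-of-sample R2 peaks near the true
# complexity, then degrades as the model starts fitting noise
```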

Logistic Regression

Logistic regression often comes up alongside linear regression in interviews. Despite the name, it is a classification model, not a regression model:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{x}^\top \boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}}$$

Key differences from linear regression:

| Aspect | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Output | Continuous value | Probability in $[0, 1]$ |
| Loss function | MSE (squared error) | Cross-entropy (log loss) |
| Optimization | Closed-form (OLS) | Iterative (gradient descent, Newton) |
| Interpretation | $\beta_j$ = change in $y$ per unit change in $x_j$ | $\beta_j$ = change in log-odds per unit change |
| Assumptions | Linear relationship | Linear in log-odds (not in probability) |

Odds and Log-Odds

$$\text{Odds} = \frac{P(y=1)}{1 - P(y=1)}, \quad \text{Log-odds (logit)} = \ln\left(\frac{P}{1-P}\right) = \mathbf{x}^\top \boldsymbol{\beta}$$

Coefficient interpretation: $\beta_j = 0.5$ means a one-unit increase in $x_j$ multiplies the odds by $e^{0.5} \approx 1.65$ (a 65% increase).
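A numeric check of the multiplicative-odds interpretation, using a hypothetical fitted model (both coefficients below are made up):

```python
import numpy as np

# Hypothetical fitted model: log-odds = -2 + 0.5 * x
def predict_proba(x):
    return 1 / (1 + np.exp(-(-2 + 0.5 * x)))

p1, p2 = predict_proba(1.0), predict_proba(2.0)
odds1, odds2 = p1 / (1 - p1), p2 / (1 - p2)

# A one-unit increase in x multiplies the odds by e^0.5, regardless of baseline
print(round(odds2 / odds1, 4), round(np.exp(0.5), 4))  # 1.6487 1.6487
```

Note that the *probability* change is not constant — it depends on where you start on the sigmoid — which is exactly why the linear-probability reading is wrong.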

Classic Interview Question

"How do you interpret a logistic regression coefficient?" — Not as "a one-unit increase in $x$ raises the probability of $y$ by $\beta$" (that is the linear probability model). Correct interpretation: a one-unit increase in $x$ raises the log-odds by $\beta$, equivalently multiplying the odds by $e^\beta$.

Bias-Variance Tradeoff

Use this interactive tool to see how model complexity affects the bias-variance tradeoff:

[Interactive widget: a polynomial-degree slider (1 = simple, 20 = complex) plotting training vs. validation error, the train/validation gap as a variance proxy, and the Bias² + Variance + Noise decomposition]

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

Regularization increases bias (the model is more constrained) but reduces variance (the model is more stable). The optimal model balances the two.

Real-World Use Cases

Case 1: House Price Prediction

Challenge: house prices are right-skewed and positive-only → use a log transform.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Log-transform target (house prices are approximately log-normal)
y_log = np.log1p(y)  # log(1 + y) handles y = 0

ridge = Ridge(alpha=10)
scores = cross_val_score(ridge, X, y_log, cv=5, scoring="neg_mean_squared_error")
# Predict on the original scale with np.expm1(model.predict(X_new))
```

Interview follow-ups:

  • "Why use a log transform?" — House prices are approximately log-normal; after the transform, residuals are closer to normal and homoscedastic
  • "How does feature interpretation change?" — A coefficient of 0.1 on the log scale ≈ a 10% increase in price
  • "What about categorical features?" — One-hot encoding, but watch the dummy variable trap ($k$ categories need only $k-1$ dummies)
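The dummy variable trap from the last follow-up can be illustrated with pandas (the `city` column is a made-up example):

```python
import pandas as pd

# Hypothetical categorical feature with k = 3 levels
df = pd.DataFrame({"city": ["Taipei", "Tokyo", "Seoul", "Taipei"]})

# drop_first=True keeps k-1 dummies: with an intercept in the model,
# all k dummies would sum to 1 and be perfectly collinear with it
dummies = pd.get_dummies(df["city"], drop_first=True)
print(dummies.shape[1])  # 2 columns for 3 categories
```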

Case 2: 信用卡詐欺偵測

Challenge: Binary outcome → Logistic regression。但 class imbalance 嚴重(0.1% fraud)。

```python
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights classes to adjust for imbalance
model = LogisticRegression(C=1.0, class_weight="balanced")
# Roughly equivalent to oversampling the minority class
# C is the inverse regularization strength (smaller C = more regularization)
```

Interview follow-ups:

  • "Why not use linear regression for classification?" — Predictions can fall outside [0, 1], and MSE is a poor fit for binary targets
  • "L1 or L2 regularization?" — L1 if you want feature selection (which features are most associated with fraud); L2 if you want to keep all features

Case 3: Predicting Customer Spend

Challenge: the target has many zeros (many customers never spend) → two-part model.

Part 1: logistic regression predicts whether a customer spends ($P(y > 0)$). Part 2: linear regression (on the log scale) predicts how much ($E[y \mid y > 0]$).

$$E[y] = P(y > 0) \times E[y \mid y > 0]$$

This is more sensible than running a single regression on the zero-inflated data, because "whether to spend" and "how much to spend" are two different decision mechanisms.
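A minimal two-part sketch on simulated data (the data-generating process is invented; the back-transform below ignores the log-normal mean correction $e^{\sigma^2/2}$ for simplicity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))

# Simulate a zero-inflated target: a purchase decision, then a log-normal amount
buys = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
amount = np.where(buys, np.exp(1 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)), 0.0)

# Part 1: P(y > 0) via logistic regression
clf = LogisticRegression().fit(X, amount > 0)
p_buy = clf.predict_proba(X)[:, 1]

# Part 2: E[y | y > 0] via linear regression on log(amount), spenders only
mask = amount > 0
reg = LinearRegression().fit(X[mask], np.log(amount[mask]))
amount_given_buy = np.exp(reg.predict(X))  # naive back-transform

# Combine: E[y] = P(y > 0) * E[y | y > 0]
expected_spend = p_buy * amount_given_buy
print(expected_spend[:3].round(2))
```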

Hands-on: Regression in Python

OLS vs Ridge vs Lasso

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Correlated features + noise features
n = 200
X1 = np.random.randn(n)
X2 = X1 * 0.9 + np.random.randn(n) * 0.3  # correlated with X1
X3 = np.random.randn(n)
X_noise = np.random.randn(n, 5)           # irrelevant features
X = np.column_stack([X1, X2, X3, X_noise])
y = 3 * X1 + 2 * X3 + np.random.randn(n) * 0.5  # only X1, X3 matter

models = {
    "OLS":   LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    model.fit(X, y)
    print(f"{name}: CV R2={scores.mean():.3f}, coefs={model.coef_.round(2)}")

# OLS:   unstable coefficients for correlated X1/X2
# Ridge: shrinks all coefficients, but keeps all nonzero
# Lasso: drives noise coefficients to exactly 0 (feature selection)
# True:  [3.00, 0.00, 2.00, 0, 0, 0, 0, 0]
```

Diagnostics with statsmodels

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
print(model.summary())
# Look at: coefficients, p-values, R2, F-statistic, Durbin-Watson

# VIF check for multicollinearity (compute on the design matrix with the
# constant included; skip index 0, which is the constant itself)
for i in range(1, X_const.shape[1]):
    vif = variance_inflation_factor(X_const, i)
    print(f"Feature {i - 1}: VIF = {vif:.2f}")
# VIF > 10 -> multicollinearity problem
```

Interview Signals

What interviewers listen for:

  • You can name the five OLS assumptions and know that normality is not one of them
  • You know the limitations of $R^2$ and don't judge model quality by it alone
  • You can explain the geometric difference between Ridge and Lasso (circle vs. diamond) and their Bayesian interpretations
  • You know multicollinearity does not affect unbiasedness, only variance
  • You can correctly interpret logistic regression coefficients (log-odds, not probability)

Practice

Flashcards


Why does adding more features always increase R² but not necessarily adjusted R²?

R² measures total variance explained and can only increase with more features (even noise). Adjusted R² penalizes by degrees of freedom — if a new feature's improvement is too small, adjusted R² decreases.


Quiz


Which regularization method performs automatic feature selection by setting coefficients exactly to zero?
