Regression
Interview Essentials
Regression is the most common ML topic in DS interviews. Interviewers expect more than the ability to "fit a line": you should be able to discuss assumptions, diagnostics, regularization tradeoffs, and the situations where regression fails.
Ordinary Least Squares (OLS)
Linear regression models the relationship between a response $y$ and features $x_1, \dots, x_p$:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$

The OLS estimator minimizes the sum of squared residuals:

$$\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top y$$

The closed-form solution exists only when $X^\top X$ is invertible. When it is not (multicollinearity, or $p > n$), regularization is needed.
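As a sanity check, the closed-form solution can be computed directly with NumPy. This is a minimal sketch on made-up data; in practice `np.linalg.lstsq` is numerically safer than forming $X^\top X$ explicitly:

```python
import numpy as np

# Hypothetical toy data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(scale=0.1, size=100)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat.round(2))  # close to [2, 3]
```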
The Five OLS Assumptions (Gauss-Markov)
Under these assumptions, OLS is the Best Linear Unbiased Estimator (BLUE):
- Linearity: $y$ is linear in the parameters (not necessarily in the features; polynomial regression is still linear in $\beta$)
- Independence: observations are independent of each other
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ is constant across all observations
- No perfect multicollinearity: no feature is an exact linear combination of the others
- Exogeneity: $E[\varepsilon \mid X] = 0$, i.e. errors are uncorrelated with the features
Normality Is NOT a Gauss-Markov Assumption
Normality of errors is not required for OLS to be BLUE. Normality is only needed for exact t-tests and F-tests in small samples; in large samples the CLT makes inference valid even with non-normal errors. This is one of the most common misconceptions in interviews.
Geometric Interpretation
The geometric intuition behind OLS: $\hat{y} = X\hat{\beta}$ is the orthogonal projection of $y$ onto the column space of $X$. The residual vector $e = y - \hat{y}$ is orthogonal to every feature.
This is why $X^\top e = 0$: residuals being uncorrelated with the features is a mechanical consequence of OLS, not an assumption.
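This orthogonality is easy to verify numerically. The sketch below, on arbitrary simulated data, checks that $X^\top e = 0$ up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

# OLS fit via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat          # orthogonal projection of y onto col(X)
residuals = y - y_hat

# X'e = 0: residuals are orthogonal to every column of X
print(X.T @ residuals)  # all entries ~ 0 (up to floating point)
```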
MLE Connection
If errors are normally distributed, the MLE for $\beta$ is exactly the OLS estimator: minimizing squared error is equivalent to maximizing the Gaussian likelihood. This is also why MSE is the default loss function for regression.
Goodness of Fit
R-squared
$R^2 = 1 - SS_{res}/SS_{tot}$ measures the proportion of variance explained by the model. Adding more features can only increase $R^2$ (or leave it unchanged), even when the features are pure noise.
Adjusted R-squared
Penalizes model complexity by degrees of freedom:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

Adding a useless feature lowers adjusted $R^2$, which makes it better suited than $R^2$ for model comparison.
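A quick sketch of the relationship on simulated data (the feature count and coefficients here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)  # only the first feature matters

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R2={r2:.3f}, adjusted R2={adj_r2:.3f}")  # adjusted is always <= R2
```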
Other Metrics
| Metric | Formula | When to Use |
|---|---|---|
| $R^2$ | $1 - SS_{res}/SS_{tot}$ | Quick measure of explained variance (but inflated by feature count) |
| Adjusted $R^2$ | $1 - (1 - R^2)\frac{n-1}{n-p-1}$ | Comparing models with different numbers of features |
| AIC / BIC | $2k - 2\ln\hat{L}$ / $k\ln n - 2\ln\hat{L}$ | Model selection (BIC penalizes complexity more heavily) |
| Cross-validated MSE | Out-of-sample error | Most reliable estimate of generalization |
| RMSE | $\sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ | Same units as the target, easy to interpret |
Interview Tip
When asked "How do you evaluate a regression model?", don't stop at $R^2$. Mention residual plots, cross-validated MSE, adjusted $R^2$, and whether the assumptions hold. Anscombe's quartet is the classic example of datasets with identical $R^2$ but completely different relationships.
Residual Analysis
A well-fitted model should have residuals that:
- Show no pattern when plotted against predicted values (checks linearity and homoscedasticity)
- Are approximately normally distributed (check with a Q-Q plot)
- Show no autocorrelation (Durbin-Watson test for time series)
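A minimal sketch of these checks on simulated residuals (the data here stands in for a fitted model's residuals):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
residuals = rng.normal(size=200)  # stand-in for a model's residuals

# Durbin-Watson: ~2 means no autocorrelation; <2 positive, >2 negative
dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw:.2f}")

# Shapiro-Wilk as a quick numerical normality check (a Q-Q plot is the visual version)
stat, pvalue = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {pvalue:.3f}")
```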
Residual Patterns and Diagnoses
| Pattern | Diagnosis | Fix |
|---|---|---|
| Funnel shape (spread grows with fitted values) | Heteroscedasticity | Log-transform the target, WLS, robust SEs |
| Curved pattern | Non-linearity | Polynomial features, splines, nonlinear model |
| Clusters | Missing categorical variable | Add the categorical feature |
| Outliers with high leverage | Influential points | Check Cook's distance, investigate data quality |
Cook's Distance
Measures how much each observation influences the fitted model:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\, s^2}$$

where $\hat{y}_{j(i)}$ is the prediction for observation $j$ from the model refit without observation $i$. Common rules of thumb flag $D_i > 4/n$ or $D_i > 1$ as influential points.
How to use this in an interview: "I would first scan the residual plot for patterns, use Cook's distance to find influential points, and then decide whether to fix the model or address a data-quality issue."
Multicollinearity
When features are highly correlated, OLS coefficients become unstable: small changes in the data cause large swings in $\hat{\beta}$.
Detection: Variance Inflation Factor (VIF)
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the $R^2$ from regressing feature $j$ on all the other features.
| VIF | Interpretation |
|---|---|
| 1 | No collinearity |
| 1-5 | Low, usually acceptable |
| 5-10 | Moderate, worth investigating |
| > 10 | High, coefficients likely unreliable |
Consequences and Solutions
Consequences: coefficients are still unbiased but have high variance. Individual p-values become unreliable, yet overall prediction quality may still be fine.
Solutions:
- Remove one of the correlated features
- Ridge regression — shrinks correlated coefficients, always invertible
- PCA — combine correlated features into orthogonal components
- Domain knowledge — decide which feature is more meaningful
Common Interview Follow-up
"Does multicollinearity make OLS biased?" No: the coefficients remain unbiased, but their variance is large. If you only care about prediction (not coefficient interpretation), multicollinearity is not a problem. If you need feature importance or inference, you must deal with it.
Regularization
Ridge Regression (L2)
Ridge adds an L2 penalty to the OLS objective:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 = (X^\top X + \lambda I)^{-1} X^\top y$$

Ridge shrinks coefficients toward zero but never sets them exactly to zero. Since $X^\top X + \lambda I$ is always invertible, Ridge handles multicollinearity cleanly.
Bayesian interpretation: a Gaussian prior on $\beta$ makes the MAP estimate exactly the Ridge solution.
Lasso Regression (L1)
Lasso replaces the L2 penalty with an L1 penalty:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$$

Lasso performs automatic feature selection by driving some coefficients to exactly zero.
Why does L1 give exact zeros but L2 doesn't? Geometric intuition: the L1 constraint region is a diamond with corners on the coordinate axes, so the loss contours are likely to touch a corner (where some $\beta_j = 0$). The L2 constraint region is a circle with no corners.
Bayesian interpretation: a Laplace prior on $\beta$ makes the MAP estimate exactly the Lasso solution.
Elastic Net
Combines the L1 and L2 penalties. When features come in correlated groups, Lasso tends to pick one member arbitrarily and drop the rest; Elastic Net keeps the whole group.
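The grouping effect can be seen on simulated data with two nearly duplicate features (the data-generating process and penalty values below are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(5)
n = 200
z = rng.normal(size=n)
# Two nearly-identical features forming a correlated group
x1 = z + rng.normal(scale=0.01, size=n)
x2 = z + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * z + rng.normal(scale=0.5, size=n)

# Lasso tends to concentrate weight on one of the duplicated features;
# Elastic Net's L2 component spreads it across both
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Lasso coefs:      ", lasso.coef_.round(2))
print("Elastic Net coefs:", enet.coef_.round(2))
```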
Choosing Regularization
| Scenario | Best Choice | Why |
|---|---|---|
| Many features, all potentially relevant | Ridge | Shrinks all, doesn't discard |
| Want feature selection, believe sparse model | Lasso | Drives irrelevant coefficients to 0 |
| Correlated feature groups | Elastic Net | Keeps grouped features together |
| $p > n$ (more features than observations) | Ridge or Elastic Net | Lasso selects at most $n$ features |
Tuning
Use cross-validation to find the optimal $\lambda$ (called `alpha` in scikit-learn):
from sklearn.linear_model import RidgeCV, LassoCV
# RidgeCV automatically tunes lambda via cross-validation
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10, 100], cv=5).fit(X, y)
print(f"Best alpha: {ridge.alpha_}")
# LassoCV uses coordinate descent path
lasso = LassoCV(cv=5).fit(X, y)
print(f"Best alpha: {lasso.alpha_}")
print(f"Non-zero features: {(lasso.coef_ != 0).sum()} / {len(lasso.coef_)}")
Polynomial Regression
To capture non-linear relationships while staying within the linear regression framework:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \varepsilon$$

This is still linear in the parameters $\beta$, so OLS applies unchanged. The danger is overfitting: as the degree approaches $n$, the model memorizes the training data.
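A minimal polynomial-regression sketch with scikit-learn (the degree and the quadratic toy data are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.2, size=100)  # quadratic signal

# Still linear in the parameters: OLS on the expanded basis [1, x, x^2]
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(f"Train R2: {model.score(x, y):.3f}")
```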
Logistic Regression
Logistic regression is often asked alongside linear regression in interviews. It is a classification model despite the "regression" in its name:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-x^\top \beta}}$$
Key differences from linear regression:
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous value | Probability |
| Loss function | MSE (squared error) | Cross-entropy (log loss) |
| Optimization | Closed-form (OLS) | Iterative (gradient descent, Newton) |
| Interpretation | $\beta_j$ = change in $y$ per unit change in $x_j$ | $\beta_j$ = change in log-odds per unit change in $x_j$ |
| Assumptions | Linear relationship | Linear in log-odds (not in probability) |
Odds and Log-Odds
$$\text{odds} = \frac{p}{1 - p}, \qquad \log\frac{p}{1 - p} = x^\top \beta$$

Coefficient interpretation: $\beta_j = 0.5$ means a one-unit increase in $x_j$ multiplies the odds by $e^{0.5} \approx 1.65$ (a 65% increase).
Classic Interview Question
"How do you interpret a logistic regression coefficient?" It is not "when $x_j$ increases by 1, the probability of $y$ increases by $\beta_j$" (that is the linear probability model). The correct interpretation: a one-unit increase in $x_j$ raises the log-odds by $\beta_j$, or equivalently multiplies the odds by $e^{\beta_j}$.
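A quick numerical check of the odds-ratio interpretation on simulated data (the true coefficient of 0.5 is an assumption of this example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5000
x = rng.normal(size=(n, 1))
# Simulate labels with a true log-odds coefficient of 0.5
p = 1 / (1 + np.exp(-0.5 * x[:, 0]))
y = rng.binomial(1, p)

model = LogisticRegression(C=1e6).fit(x, y)  # large C ~ negligible regularization
beta = model.coef_[0, 0]
print(f"beta ~ {beta:.2f}, odds ratio per unit of x ~ {np.exp(beta):.2f}")
```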
Bias-Variance Tradeoff
[Interactive demo omitted: it plots training vs. validation error as model complexity varies, using the train/validation gap as a variance proxy, and decomposes the error as]

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$
Regularization increases bias (the model is more constrained) but reduces variance (the model is more stable). The optimal model balances the two.
Real-World Use Cases
Case 1: House Price Prediction
Challenge: house prices are right-skewed and strictly positive, so use a log transform.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Log-transform target (house prices are log-normal)
y_log = np.log1p(y) # log(1 + y) to handle y=0
ridge = Ridge(alpha=10)
scores = cross_val_score(ridge, X, y_log, cv=5, scoring="neg_mean_squared_error")
# Predict: exp(model.predict(X_new)) - 1 to get back to original scale
Interview follow-ups:
- "Why the log transform?" House prices are roughly log-normal; after the transform, residuals are closer to normal and homoscedastic
- "How does feature interpretation change?" A coefficient of 0.1 on the log scale ≈ a 10% increase in price
- "What about categorical features?" One-hot encoding, but beware the dummy variable trap ($k$ categories need only $k - 1$ dummies)
Case 2: Credit Card Fraud Detection
Challenge: binary outcome, so logistic regression. But class imbalance is severe (0.1% fraud).
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' adjusts for imbalance
model = LogisticRegression(C=1.0, class_weight="balanced")
# Equivalent to oversampling minority class
# C = inverse regularization strength (smaller C = more regularization)
Interview follow-ups:
- "Why not linear regression for classification?" Predictions can fall outside [0, 1], and MSE is a poor fit for binary targets
- "L1 or L2 regularization?" L1 if you want feature selection (which features relate most to fraud); L2 if you want to keep all features
Case 3: Predicting Customer Spend
Challenge: the target has many zeros (many customers never spend), so use a two-part model.
Part 1: logistic regression predicts whether they spend ($P(y > 0 \mid x)$). Part 2: linear regression on the log scale predicts how much ($E[\log y \mid y > 0, x]$).
This is more sensible than regressing directly on zero-inflated data, because "whether to spend" and "how much to spend" are two different decision mechanisms.
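A minimal two-part ("hurdle") sketch on simulated data; the data-generating process and coefficients below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(8)
n = 1000
X = rng.normal(size=(n, 2))
buys = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # whether the customer spends
amount = np.where(
    buys == 1,
    np.exp(1 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)),  # how much, log-normal
    0.0,
)

# Part 1: P(spend > 0)
clf = LogisticRegression().fit(X, buys)

# Part 2: E[log(amount) | spend > 0], fit only on spenders
mask = amount > 0
reg = LinearRegression().fit(X[mask], np.log(amount[mask]))

# Combined expected spend (ignoring the lognormal retransformation bias)
expected = clf.predict_proba(X)[:, 1] * np.exp(reg.predict(X))
print(f"Mean predicted spend: {expected.mean():.2f}, actual: {amount.mean():.2f}")
```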
Hands-on: Regression in Python
OLS vs Ridge vs Lasso
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
# Correlated features + noise features
n = 200
X1 = np.random.randn(n)
X2 = X1 * 0.9 + np.random.randn(n) * 0.3 # correlated with X1
X3 = np.random.randn(n)
X_noise = np.random.randn(n, 5) # irrelevant features
X = np.column_stack([X1, X2, X3, X_noise])
y = 3 * X1 + 2 * X3 + np.random.randn(n) * 0.5 # only X1, X3 matter
models = {
"OLS": LinearRegression(),
"Ridge": Ridge(alpha=1.0),
"Lasso": Lasso(alpha=0.1),
}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
model.fit(X, y)
print(f"{name}: CV R²={scores.mean():.3f}, coefs={model.coef_.round(2)}")
# OLS: unstable coefficients for correlated X1/X2
# Ridge: shrinks all coefficients, but keeps all nonzero
# Lasso: drives noise coefficients to exactly 0 (feature selection)
# True: [3.00, 0.00, 2.00, 0, 0, 0, 0, 0]
Diagnostics with statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
print(model.summary())
# Look at: coefficients, p-values, R², F-statistic, Durbin-Watson
# VIF check for multicollinearity
for i in range(X.shape[1]):
vif = variance_inflation_factor(X, i)
print(f"Feature {i}: VIF = {vif:.2f}")
# VIF > 10 → multicollinearity problem
Interview Signals
What interviewers listen for:
- You can name the five OLS assumptions and know that normality is not one of them
- You know the limitations of $R^2$ and don't judge model quality by it alone
- You can explain the geometric difference between Ridge and Lasso (circle vs. diamond) and their Bayesian interpretations
- You know multicollinearity does not affect unbiasedness, only variance
- You can correctly interpret logistic regression coefficients (log-odds, not probability)
Practice
Flashcards
Why does adding more features always increase R² but not necessarily adjusted R²?
R² measures total variance explained and can only increase with more features (even noise). Adjusted R² penalizes by degrees of freedom: if a new feature's improvement is too small, adjusted R² falls.
Quiz
Which regularization method performs automatic feature selection by setting coefficients exactly to zero?