Evaluation Metrics
Interview Essentials
Evaluation metrics are a guaranteed interview topic. Interviewers want to see whether you can pick the right metric for the business context, explain the tradeoffs, and diagnose model problems. Don't just memorize formulas; be able to say "in this scenario I would look at this metric, because...".
What You Should Understand
- Choose the right metric for the business context (not every problem calls for accuracy)
- Understand the precision/recall tradeoff and how to pick a threshold
- Know how ROC-AUC and PR-AUC differ on imbalanced data
- Explain why calibration matters and how to fix it
- Understand the types of data leakage and how to detect them
Classification Metrics
Confusion Matrix
All binary classification metrics derive from four counts:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN). The most intuitive metric, and often the most misleading: on a dataset that is 99% negatives, always predicting negative already achieves 99% accuracy.
Accuracy Paradox
Never use accuracy as the primary metric on an imbalanced dataset. A fraud detection model can reach 99.9% accuracy while catching 0% of the fraud, simply because it always predicts "not fraud".
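A minimal sketch of the paradox, with made-up fraud counts:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 0.1%-fraud dataset and a model that always predicts "not fraud"
y_true = np.array([0] * 999 + [1])
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.999
print(recall_score(y_true, y_pred))    # 0.0: catches no fraud at all
```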
Precision and Recall
Precision = TP / (TP + FP). Of all predicted positives, how many are truly positive?
Recall (Sensitivity, TPR) = TP / (TP + FN). Of all actual positives, how many did we catch?
The Precision-Recall Tradeoff
Lowering the classification threshold catches more positives (higher recall) but also produces more false alarms (lower precision). This tradeoff is fundamental; the two cannot be maximized simultaneously.
Threshold choice depends on business cost:
| Scenario | Prioritize | Cost Reasoning |
|---|---|---|
| Cancer screening | Recall | An FN (missed cancer) costs far more than an FP (one extra test) |
| Spam filtering | Precision | FPs (legitimate mail marked as spam) infuriate users |
| Fraud detection | Cost-weighted | Compare FP investigation cost vs FN fraud loss |
| Search ranking | Precision@K | The precision of the top 10 results determines the user experience |
| Content moderation | Recall | FNs (undetected harmful content) carry high reputational risk |
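The tradeoff can be traced directly with sklearn's `precision_recall_curve`; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and model scores (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold >= {t:.2f}: precision={p:.2f} recall={r:.2f}")
# Raising the threshold can only lower recall; precision tends to rise.
```

Pick the operating threshold where the precision/recall balance matches the FP/FN costs of your scenario.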
F1 Score and F-beta
F1 is the harmonic mean of precision and recall: F1 = 2 · P · R / (P + R).
Why the harmonic mean instead of the arithmetic mean? The harmonic mean punishes extreme imbalance: with precision = 1.0 and recall = 0.01, the arithmetic mean is 0.505 (looks decent), while the harmonic mean is ≈ 0.02 (correctly reflecting a nearly useless model).
F-beta generalizes F1 to weight recall β times as much as precision: Fβ = (1 + β²) · P · R / (β² · P + R)
- β > 1 (e.g. F2): care more about recall (FN costly: cancer, fraud)
- β < 1 (e.g. F0.5): care more about precision (FP costly: spam, content recommendation)
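As a quick check with sklearn's `fbeta_score` on toy labels (the counts are invented):

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy predictions with precision = 2/3 and recall = 1/2
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

f1 = f1_score(y_true, y_pred)                # 4/7 ≈ 0.571
f2 = fbeta_score(y_true, y_pred, beta=2)     # emphasizes the weaker recall → lower
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # emphasizes the stronger precision → higher
print(f1, f2, f05)
```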
Specificity and FPR
Specificity (TNR) = TN / (TN + FP)
False Positive Rate = FP / (FP + TN) = 1 − specificity
Threshold-Independent Metrics
ROC Curve and AUC
ROC plots TPR vs FPR at every threshold:
- Perfect: AUC = 1.0 (the curve hugs the top-left corner)
- Random: AUC = 0.5 (the diagonal line)
- Interpretation: AUC = probability that a random positive is scored higher than a random negative
Strengths: Threshold-independent, scale-invariant.
Weakness: can be misleading on imbalanced data. The FPR denominator is the huge pool of TNs, so FPR stays low even with many FPs, and the AUC looks high while performance on the positive class is poor.
PR Curve and PR-AUC
PR curve plots precision vs recall at every threshold:
- Perfect: PR-AUC = 1.0
- Random baseline: PR-AUC ≈ prevalence (the positive rate)
- Focuses entirely on the positive class; it is not inflated by a large number of TNs
ROC-AUC vs PR-AUC
On a dataset with a 1% positive rate, a model that ranks all positives near the top but also gives many negatives high scores can reach ROC-AUC ≈ 0.99 (because FPR stays low) while PR-AUC is poor (precision collapses at high recall).
Rule of thumb: positive class rare and you care about positive predictions → PR-AUC. Classes balanced and you care about the overall ranking → ROC-AUC.
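A sketch of this effect with synthetic scores (the score distributions and class sizes are assumptions, not from any real system):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# 1% positives; positives are scored higher on average but overlap
# heavily with the very large negative class
scores_neg = rng.normal(0.0, 1.0, 9900)
scores_pos = rng.normal(2.0, 1.0, 100)
y = np.r_[np.zeros(9900), np.ones(100)]
s = np.r_[scores_neg, scores_pos]

roc = roc_auc_score(y, s)           # high: FPR is diluted by the many TNs
ap = average_precision_score(y, s)  # much lower: precision collapses at high recall
print(roc, ap)
```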
Log Loss (Cross-Entropy)
Measures the quality of predicted probabilities: LogLoss = −(1/n) Σ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]
- Penalizes confident wrong predictions heavily (predicting 0.99 when the truth is 0 incurs a huge loss)
- Requires calibrated probabilities; a high AUC does not imply a low log loss
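A small illustration with `log_loss`; the probability vectors are invented:

```python
from sklearn.metrics import log_loss

y_true = [1, 1, 1, 0]
p_modest = [0.7, 0.7, 0.7, 0.3]             # mildly confident, all on the right side
p_overconfident = [0.99, 0.99, 0.99, 0.99]  # the last prediction is confidently wrong

loss_modest = log_loss(y_true, p_modest)          # -ln(0.7) ≈ 0.357
loss_over = log_loss(y_true, p_overconfident)     # ≈ 1.159
# The single confident mistake dominates: that one sample contributes -ln(0.01) ≈ 4.6
print(loss_modest, loss_over)
```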
Calibration
A model is well-calibrated if among all predictions of 70%, approximately 70% are actually positive.
Reliability Diagram
Plot predicted probability (binned) against the actual proportion of positives. A perfectly calibrated model lies on the diagonal.
Common miscalibration patterns:
- Overconfident: predictions are more extreme than reality (predicted 90%, but only 70% are actually positive); common in deep learning
- Underconfident: predictions are too conservative (predicted 60%, but 80% are actually positive)
Brier Score
Brier = (1/n) Σ (pᵢ − yᵢ)². Lower is better. It decomposes into calibration + resolution + uncertainty.
When Calibration Matters
| Need probabilities for... | Calibration critical? |
|---|---|
| Ranking / sorting | No: only the relative order matters |
| Setting ad bid prices | Yes: bid = predicted conversion rate × value |
| Computing expected loss | Yes: E[loss] = P(event) × cost |
| Displaying risk to users | Yes: "you have a 30% chance of..." must be accurate |
| Combining multiple models | Yes: probabilities must be on the same scale |
| A/B test metric | No: usually comparing means |
Calibration Methods
| Method | How It Works | When to Use |
|---|---|---|
| Platt Scaling | Fit logistic regression on model outputs | SVMs, neural networks |
| Isotonic Regression | Non-parametric monotone step function | More data available, need flexibility |
| Temperature Scaling | Divide logits by a learned temperature T | Deep learning (single parameter) |
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.linear_model import LogisticRegression
# Calibrate a model (any classifier works as the base estimator)
base_model = LogisticRegression()
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
# Check calibration
prob_true, prob_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
# Plot prob_pred vs prob_true (should be close to the diagonal)
Regression Metrics
MSE, RMSE, MAE
| Metric | Formula | Outlier Sensitivity | Interpretation |
|---|---|---|---|
| MSE | (1/n) Σ (yᵢ − ŷᵢ)² | High (squaring amplifies large errors) | Hard to interpret (squared units) |
| RMSE | √MSE | High | Same unit as y: "typical error magnitude" |
| MAE | (1/n) Σ \|yᵢ − ŷᵢ\| | Low (robust to outliers) | Same unit as y: "median-like error" |
Choosing between MSE and MAE:
- MSE punishes large errors more heavily → use MSE when big mistakes cost far more than small ones (electricity demand, inventory forecasting)
- MAE treats every unit of error equally → use MAE when you want a robust "average error"
- Minimizing MSE corresponds to predicting the mean; minimizing MAE corresponds to predicting the median
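The mean/median correspondence can be verified numerically; this brute-force search over constant predictions is just an illustration:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier

# Try every constant prediction on a fine grid and see which minimizes each loss
candidates = np.linspace(0.0, 100.0, 10001)
mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]

best_mse = candidates[np.argmin(mse)]  # the mean: 22.0, dragged up by the outlier
best_mae = candidates[np.argmin(mae)]  # the median: 3.0, robust to the outlier
print(best_mse, best_mae)
```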
MAPE and Its Limitations
MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ| gives relative error, but has two problems:
- Undefined when yᵢ = 0
- Asymmetric: over-prediction and under-prediction are penalized differently
R-Squared
- R² = 1 − SS_res / SS_tot. R² = 1: perfect. R² = 0: no better than predicting the mean
- Can be negative on test data: the model is worse than the mean baseline
- Adding more features can only increase training R² (even if they are pure noise) → use adjusted R² or cross-validated metrics
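A quick check of these three cases with `r2_score` (toy numbers):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_score(y_true, y_true)                      # 1.0
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))      # 0.0: the mean baseline
r2_bad = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))  # -3.0: worse than the mean
print(r2_perfect, r2_mean, r2_bad)
```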
Ranking Metrics
In recommender systems and search engines, ranking quality matters more than classification metrics:
NDCG (Normalized Discounted Cumulative Gain)
NDCG@K = DCG@K / IDCG@K, where DCG@K = Σᵢ₌₁..K gainᵢ / log₂(i + 1) (with gainᵢ either the raw relevance relᵢ or 2^relᵢ − 1) and IDCG is the DCG of the ideal (perfect) ranking.
Highly relevant items ranked near the top contribute more gain. The position discount 1 / log₂(i + 1) reflects that users are less and less likely to look further down the list.
MAP (Mean Average Precision)
Compute Precision@K at the rank of each retrieved relevant item and average these values to get the Average Precision for one query. MAP is the mean of AP across queries.
Hit Rate and MRR
| Metric | Formula | Use Case |
|---|---|---|
| Hit Rate@K | Fraction of users with at least one relevant item in the top K | Recommenders: "did the user click anything?" |
| MRR | (1/\|Q\|) Σ 1/rankᵢ, where rankᵢ is the position of the first relevant result | Search: "where does the first correct answer appear?" |
| Precision@K | Fraction of the top K that are relevant | Search / recommendation |
| Recall@K | Fraction of all relevant items that appear in the top K | Recommenders: "how many relevant items can we recover?" |
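sklearn has no one-liner for Hit Rate@K or MRR, so here is a minimal sketch; the helper names `hit_rate_at_k` and `mrr` are my own:

```python
import numpy as np

def hit_rate_at_k(recommended, relevant, k):
    """Fraction of users with at least one relevant item in their top-k list."""
    hits = [len(set(recs[:k]) & rel) > 0 for recs, rel in zip(recommended, relevant)]
    return float(np.mean(hits))

def mrr(recommended, relevant):
    """Mean reciprocal rank of the first relevant item (0 if none appears)."""
    rr = []
    for recs, rel in zip(recommended, relevant):
        rank = next((i + 1 for i, item in enumerate(recs) if item in rel), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

recommended = [["a", "b", "c"], ["d", "e", "f"]]  # per-user ranked lists
relevant = [{"b"}, {"f"}]                          # per-user relevant sets

hr2 = hit_rate_at_k(recommended, relevant, 2)  # 0.5: only the first user hits in top-2
m = mrr(recommended, relevant)                 # (1/2 + 1/3) / 2 ≈ 0.417
print(hr2, m)
```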
Ranking Metrics in Interviews
When asked how to evaluate a recommender or search system, don't just say AUC. Mention NDCG@K, Hit Rate@K, Precision@K: users only look at the first few results, so the quality of the full ranking matters far less than the top of the list.
Cross-Validation
K-Fold Cross-Validation
- Split the data into K folds (typically K = 5 or K = 10)
- Train on K − 1 folds, evaluate on the held-out fold
- Repeat K times
- Report the mean and standard deviation
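The procedure above in code, using `cross_val_score` on a bundled sklearn dataset (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="roc_auc"
)
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```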
Special Variants
| Variant | When to Use |
|---|---|
| Stratified K-Fold | Imbalanced classification: keeps the class ratio consistent across folds |
| Time Series Split | Temporal data: train on the past, test on the future (no future information leaks in) |
| Group K-Fold | Grouped data: all records from the same patient/user stay in the same fold |
| Leave-One-Out | Very small datasets: K = n, maximum training data per fold |
| Repeated K-Fold | More stable estimates: repeat K-Fold several times with different random splits |
The Fatal CV Mistake
Feature selection, hyperparameter tuning, and any data-dependent preprocessing must happen inside each fold, never before the split. Otherwise you have data leakage. The correct approach: bundle preprocessing and the model together with sklearn.pipeline.Pipeline.
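A sketch of the correct pattern; the scaler and feature selector here are arbitrary examples of data-dependent preprocessing:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# Because every data-dependent step lives inside the Pipeline, each CV fold
# refits the scaler and the feature selector on that fold's training data only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)
```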
Data Leakage
Common Sources
| Type | Description | Example |
|---|---|---|
| Feature leakage | A feature contains information unavailable at prediction time | Predicting "needs treatment" with features that include the treatment outcome |
| Train-test contamination | Test data influences training | Normalizing before the split (mean/std computed on test data too) |
| Temporal leakage | Using future data to predict the past | Using tomorrow's stock price as a feature to predict today's trading volume |
| Target leakage | A feature is derived from the target | Predicting churn with a feature that encodes the last active date |
Detection Signals
- Suspiciously high metrics: if the model looks too good to be true, there is almost certainly leakage
- An unexpected top feature: a seemingly irrelevant feature dominates importance → investigate whether it has an illegitimate link to the target
- Train ≈ test performance: train is usually better than test; if they match, the leaked information may be available in both
Metric Selection Decision Framework
A decision flow for the interview question "which metric would you use?":
| Step | Question | Action |
|---|---|---|
| 1 | Problem type? | Classification → precision/recall/AUC. Regression → RMSE/MAE. Ranking → NDCG/MAP. |
| 2 | Classes balanced? | Balanced → accuracy OK. Imbalanced → PR-AUC, F1. |
| 3 | Need probabilities? | Yes → check calibration (Brier score). No → AUC for ranking. |
| 4 | Which is costlier, FP or FN? | FN costly → optimize recall. FP costly → optimize precision. |
| 5 | Offline vs online? | Offline metrics do not always track online metrics; validate with an A/B test. |
| 6 | Baseline? | Always compare against a baseline (majority class, mean prediction, random). |
Real-World Use Cases
Case 1: Credit Card Fraud Detection
| Decision | Choice | Reasoning |
|---|---|---|
| Primary metric | PR-AUC | 99.9% negatives → both accuracy and ROC-AUC are inflated by TNs |
| Threshold tuning | Optimize F2 | An FN (missed fraud = direct monetary loss) costs more than an FP (blocked legitimate transaction = support cost) |
| Calibration | Required | P(fraud) decides whether to block the transaction or merely flag it for review |
| Online metric | Fraud loss rate + false block rate | A good offline PR-AUC does not guarantee lower fraud losses in production |
Case 2: House Price Prediction
| Decision | Choice | Reasoning |
|---|---|---|
| Primary metric | RMSE or MAE | Depends on the cost of large errors. For agents, MAE is more intuitive. For bank risk, RMSE (big misses are more dangerous) |
| Why not R²? | R² is not comparable across datasets | R² depends on the variance of the target, which differs across regions |
| Log transform | Use RMSLE | House prices are roughly log-normal → RMSE on the log scale is more sensible |
| Baseline | Mean/median of the training set | Any model must beat this baseline to add value |
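A small sketch of why RMSLE fits multiplicative errors; the prices are invented and every prediction is off by the same 20%:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical house prices, each over-predicted by exactly 20%
y_true = np.array([100_000.0, 500_000.0, 2_000_000.0])
y_pred = y_true * 1.2

rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))  # ≈ log(1.2) ≈ 0.182
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# RMSLE treats a 20% miss the same at every price level;
# RMSE is dominated by the error on the most expensive house.
print(rmsle, rmse)
```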
Case 3: Recommender System
| Decision | Choice | Reasoning |
|---|---|---|
| Offline metric | NDCG@10, Hit Rate@10 | Users only see the top 10 recommendations; the full ranking matters far less |
| Online metric | CTR, conversion rate, revenue per session | A good offline NDCG does not mean users actually buy |
| Diversity | Intra-list diversity | Recommending only similar items hurts the experience even when CTR is high |
| Cold start | Coverage rate for new items | Can the system recommend newly listed products? |
Hands-on: Metrics in Python
Classification Metrics on Imbalanced Data
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, average_precision_score,
classification_report,
)
# Imbalanced: 95% negative, 5% positive
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
accuracy_score(y_test, y_pred) # ~0.96 ← misleading!
precision_score(y_test, y_pred) # ~0.67
recall_score(y_test, y_pred) # ~0.40
f1_score(y_test, y_pred) # ~0.50
roc_auc_score(y_test, y_prob) # ~0.92
average_precision_score(y_test, y_prob) # ~0.45 ← most informative here
# Baseline (always predict 0): accuracy ≈ 0.95 → model barely beats it
Ranking Metrics
from sklearn.metrics import ndcg_score
import numpy as np
# True relevance scores and model predictions
y_true = np.array([[3, 2, 3, 0, 1, 2]]) # relevance labels
y_score = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]]) # model scores
ndcg_at_3 = ndcg_score(y_true, y_score, k=3)
ndcg_at_5 = ndcg_score(y_true, y_score, k=5)
# NDCG@3 close to 1.0 if top-3 items are the most relevant
Interview Signals
What interviewers listen for:
- You define the business problem and metric before talking about models
- You know why baselines matter (majority class, mean prediction)
- You choose metrics and thresholds based on FP/FN costs
- You can explain ROC-AUC vs PR-AUC and when calibration matters
- You know offline metrics ≠ online metrics and validate with A/B tests
Practice
Flashcards
Why is PR-AUC more meaningful than ROC-AUC on imbalanced data?
ROC-AUC's FPR denominator contains a huge number of TNs, so FPR stays low even with many FPs and the AUC gets inflated. PR-AUC looks only at precision and recall on the positive class, so it is not inflated by TNs. Rule of thumb: positive class rare → PR-AUC.
Quiz
Credit risk prediction (99.9% non-default). Which metric is more meaningful than accuracy?