Evaluation Metrics

Interview Essentials

Evaluation metrics are a guaranteed interview topic. Interviewers want to see that you can choose the right metric for the business context, explain the tradeoffs, and diagnose model problems. Don't just memorize formulas; be able to say "in this scenario I would watch this metric, because…".

What You Should Understand

  • Choose the right metric for the business context (not every problem is an accuracy problem)
  • Understand the precision/recall tradeoff and threshold selection
  • Know how ROC-AUC and PR-AUC differ on imbalanced data
  • Explain why calibration matters and how to fix it
  • Understand the types of data leakage and how to detect them

Classification Metrics

Confusion Matrix

All binary classification metrics derive from four counts:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |

Worked example: with TP = 45, FP = 5, FN = 5, TN = 45 (100 samples), every metric below works out to 90.0%:

| Metric | Formula | Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 90.0% |
| Precision | TP / (TP + FP) | 90.0% |
| Recall | TP / (TP + FN) | 90.0% |
| F1 Score | 2PR / (P + R) | 90.0% |
| Specificity | TN / (TN + FP) | 90.0% |

Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The most intuitive metric, and the one that misleads most often. On a dataset that is 99% negatives, always predicting negative already gives 99% accuracy.

Accuracy Paradox

Never use accuracy as the primary metric on an imbalanced dataset. A fraud detection model with 99.9% accuracy may catch 0% of the fraud, simply because it always predicts "not fraud".
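A quick way to see the paradox, using sklearn's DummyClassifier as the majority-class baseline (a minimal sketch with synthetic labels):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: ~1% positives; the features carry no signal at all
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = np.zeros((10_000, 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

accuracy_score(y, y_pred)  # ~0.99, looks impressive
recall_score(y, y_pred)    # 0.0, catches zero positives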

Precision and Recall

Precision — Of all predicted positives, how many are truly positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity, TPR) — Of all actual positives, how many did we catch?

$$\text{Recall} = \frac{TP}{TP + FN}$$

The Precision-Recall Tradeoff

Lowering the classification threshold catches more positives (higher recall) but also raises more false alarms (lower precision). This tradeoff is fundamental: you cannot maximize both at once. The sketch below makes it concrete.
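A minimal sketch with synthetic imbalanced data: sweep the threshold on a held-out set and watch precision and recall move in opposite directions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for t in (0.2, 0.5, 0.8):
    pred = (probs >= t).astype(int)
    # Lower threshold: recall rises, precision falls; higher threshold: the reverse
    print(f"t={t}: precision={precision_score(y_te, pred):.2f}, recall={recall_score(y_te, pred):.2f}")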

Threshold choice depends on business cost

| Scenario | Prioritize | Cost Reasoning |
|---|---|---|
| Cancer screening | Recall | An FN (missed cancer) costs far more than an FP (one extra test) |
| Spam filtering | Precision | An FP (legitimate mail marked as spam) makes users furious |
| Fraud detection | Cost-weighted | Compare FP investigation cost against FN fraud loss |
| Search ranking | Precision@K | Precision of the top 10 results drives the user experience |
| Content moderation | Recall | An FN (undetected harmful content) carries high reputational risk |

F1 Score and F-beta

F1 — harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

Why the harmonic mean instead of the arithmetic mean? The harmonic mean punishes extreme imbalance: with precision = 1.0 and recall = 0.01, the arithmetic mean is 0.505 (looks decent) while the harmonic mean is about 0.02, correctly reflecting that the model is nearly useless.

$F_\beta$ generalizes this to weight recall $\beta$ times as much as precision:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

  • $F_2$: cares more about recall (FN costly: cancer, fraud)
  • $F_{0.5}$: cares more about precision (FP costly: spam, content recommendation); see the example below
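sklearn exposes this directly as fbeta_score. A toy example where precision (2/3) exceeds recall (1/2), so the precision-weighted score is highest:

from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # precision = 2/3, recall = 1/2

f1_score(y_true, y_pred)               # ~0.571, balanced
fbeta_score(y_true, y_pred, beta=2)    # ~0.526, recall-heavy score drops
fbeta_score(y_true, y_pred, beta=0.5)  # 0.625, precision-heavy score rises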

Specificity and FPR

Specificity (TNR): $\frac{TN}{TN + FP} = 1 - \text{FPR}$

False Positive Rate: $\text{FPR} = \frac{FP}{FP + TN}$

Threshold-Independent Metrics

ROC Curve and AUC

ROC plots TPR vs FPR at every threshold:

  • Perfect: AUC = 1.0 (the curve hugs the top-left corner)
  • Random: AUC = 0.5 (the diagonal)
  • Interpretation: AUC is the probability that a randomly chosen positive is scored higher than a randomly chosen negative:

$$\text{AUC} = P(\hat{y}_{\text{pos}} > \hat{y}_{\text{neg}})$$

Strengths: Threshold-independent, scale-invariant.

Weakness: Can mislead on imbalanced data. The FPR denominator is dominated by the huge number of TNs, so FPR stays low even with many FPs, and AUC looks high while performance on the positive class is poor.
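The probabilistic interpretation above can be checked numerically: with synthetic scores, the fraction of (positive, negative) pairs where the positive outranks the negative matches roc_auc_score (a sketch; ties are negligible here because scores are continuous):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = rng.random(500) + 0.5 * y  # positives tend to score higher

pos, neg = scores[y == 1], scores[y == 0]
# Fraction of (positive, negative) pairs where the positive outranks the negative
pairwise = (pos[:, None] > neg[None, :]).mean()
print(pairwise, roc_auc_score(y, scores))  # the two values agree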

PR Curve and PR-AUC

PR curve plots precision vs recall at every threshold:

  • Perfect: PR-AUC = 1.0
  • Random baseline: PR-AUC ≈ prevalence (the positive rate)
  • Focuses entirely on the positive class: it is not inflated by a large TN count

ROC-AUC vs PR-AUC

On a dataset with a 1% positive rate, a model that ranks all positives near the top but also gives many negatives high scores can reach ROC-AUC ≈ 0.99 (because FPR stays low) while PR-AUC is poor (precision collapses at high recall).

Rule of thumb: if the positive class is rare and you care about positive predictions, use PR-AUC. If classes are balanced and you care about overall ranking, use ROC-AUC.

Log Loss (Cross-Entropy)

Measures the quality of predicted probabilities:

$$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$
  • Penalizes confident wrong predictions heavily (predicting 0.99 when the truth is 0 yields a huge loss), as shown below
  • Requires calibrated probabilities: a high AUC does not imply a low log loss
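A small illustration of how a single confident miss dominates the average (toy values chosen for the example):

from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
mostly_right = [0.1, 0.1, 0.9, 0.9]         # confident and correct everywhere
one_confident_miss = [0.99, 0.1, 0.9, 0.9]  # 0.99 on a true 0

log_loss(y_true, mostly_right)        # ~0.11
log_loss(y_true, one_confident_miss)  # ~1.23, one bad prediction blows up the average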

Calibration

A model is well-calibrated if among all predictions of 70%, approximately 70% are actually positive.

Reliability Diagram

Plot predicted probability (binned) against the actual fraction of positives. A perfectly calibrated model lies on the diagonal.

Common miscalibration patterns:

  • Overconfident: predictions are more extreme than reality (predicted 90% but only 70% are actually positive); common in deep learning
  • Underconfident: predictions are too conservative (predicted 60% but 80% are actually positive)

Brier Score

$$\text{Brier} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)^2$$

Lower is better. Decomposes into calibration + resolution + uncertainty.
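Computing it with sklearn; note that always predicting 0.5 gives the uninformative baseline of 0.25:

from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0]
brier_score_loss(y_true, [0.1, 0.9, 0.8, 0.2])  # 0.025, sharp and accurate
brier_score_loss(y_true, [0.5, 0.5, 0.5, 0.5])  # 0.25, uninformative baseline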

When Calibration Matters

| Need probabilities for… | Calibration critical? |
|---|---|
| Ranking / sorting | No: only the relative order matters |
| Setting ad bid prices | Yes: bid = predicted conversion rate × value |
| Computing expected loss | Yes: E[loss] = P(event) × cost |
| Displaying risk to users | Yes: "you have a 30% chance of…" must be accurate |
| Combining multiple models | Yes: probabilities must be on the same scale |
| A/B test metric | No: usually comparing means |

Calibration Methods

| Method | How It Works | When to Use |
|---|---|---|
| Platt Scaling | Fit a logistic regression on model outputs | SVMs, neural networks |
| Isotonic Regression | Non-parametric monotone step function | More data available, need flexibility |
| Temperature Scaling | Divide logits by a learned $T$ | Deep learning (single parameter) |
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Calibrate a model
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Check calibration
prob_true, prob_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
# Plot prob_pred vs prob_true — should be close to diagonal
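Temperature scaling itself is small enough to sketch by hand. The snippet below is a minimal sketch for the binary case, assuming you already have validation logits and labels (the ones here are synthetic and deliberately overconfident):

import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    # Negative log-likelihood of sigmoid(logits / T); binary case for brevity
    p = 1.0 / (1.0 + np.exp(-logits / T))
    eps = 1e-12
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Hypothetical validation set: logits are overconfident relative to the noise level
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
logits = 3.0 * (2 * labels - 1) + rng.normal(0, 3.0, size=1000)

res = minimize_scalar(nll, bounds=(0.05, 20), args=(logits, labels), method="bounded")
T = res.x  # T > 1 here: raw logits were overconfident; divide logits by T at inference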

Regression Metrics

MSE, RMSE, MAE

| Metric | Formula | Outlier Sensitivity | Interpretation |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | High (squaring amplifies large errors) | Hard to interpret (squared units) |
| RMSE | $\sqrt{\text{MSE}}$ | High | Same unit as $y$: "typical error magnitude" |
| MAE | $\frac{1}{n}\sum\|y_i - \hat{y}_i\|$ | Low (robust to outliers) | Same unit as $y$: "median-like error" |

Choosing between MSE and MAE

  • MSE penalizes large errors more heavily → use it when big misses cost far more than small ones (electricity demand, inventory forecasting)
  • MAE weighs all errors equally → use it when you want a robust "typical error"
  • Minimizing MSE corresponds to predicting the mean; minimizing MAE corresponds to predicting the median (outlier sensitivity demonstrated below)
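The outlier sensitivity is easy to demonstrate with one large miss and four perfect predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10, 12, 11, 10, 50])  # one outlier target
y_pred = np.array([10, 12, 11, 10, 11])  # model misses only the outlier

mean_absolute_error(y_true, y_pred)          # 7.8
np.sqrt(mean_squared_error(y_true, y_pred))  # ~17.4, RMSE is dominated by the single miss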

MAPE and Its Limitations

$$\text{MAPE} = \frac{100\%}{n}\sum\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Gives a relative error, but with two problems (both shown below):

  1. Undefined when $y_i = 0$
  2. Asymmetric: over-prediction and under-prediction are penalized differently
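Both failure modes in a few lines; mape here is a hypothetical helper written out for clarity:

import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

mape([100], [0])    # 100.0: under-prediction is capped at 100%
mape([100], [300])  # 200.0: over-prediction is unbounded
mape([0], [10])     # division by zero: returns inf with a RuntimeWarning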

R-Squared

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

  • $R^2 = 1$: perfect. $R^2 = 0$: no better than predicting the mean
  • Can be negative on test data, meaning the model is worse than the mean baseline (see below)
  • Adding more features can only increase training $R^2$, even with pure noise features → use adjusted $R^2$ or cross-validated metrics
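Negative test-set R² in action; predicting the test mean scores exactly zero:

from sklearn.metrics import r2_score

y_test = [3.0, 5.0, 7.0]
r2_score(y_test, [5.0, 5.0, 5.0])  # 0.0, same as predicting the test mean
r2_score(y_test, [7.0, 5.0, 3.0])  # -3.0, worse than the mean baseline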

Ranking Metrics

In recommender systems and search engines, ranking quality matters more than classification metrics:

NDCG (Normalized Discounted Cumulative Gain)

$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \qquad \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$

where IDCG is the DCG of the ideal (perfect) ranking.

Highly relevant items placed near the top contribute more gain. The position discount, $\log_2(i+1)$, reflects that users are less and less likely to look further down the list.
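The formula is short enough to compute by hand (a sketch; sklearn's ndcg_score, shown later in this section, does the same):

import numpy as np

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i + 1) for ranks i = 1..k
    return np.sum((2 ** rels - 1) / discounts)

ranked = [3, 2, 0, 1]                 # relevance labels in model-ranked order
ideal = sorted(ranked, reverse=True)  # [3, 2, 1, 0], the ideal ordering
ndcg = dcg_at_k(ranked, 4) / dcg_at_k(ideal, 4)  # ~0.99: near-ideal ordering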

MAP (Mean Average Precision)

$$\text{AP} = \frac{1}{|R|}\sum_{k=1}^{n} \text{Precision@k} \cdot \text{rel}(k)$$

Compute Precision@k at the rank of each relevant item as it is retrieved, then average over the relevant items. MAP is the mean of AP across queries.
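A minimal implementation makes the definition concrete (binary relevance, items given in model-ranked order):

import numpy as np

def average_precision(relevances):
    # Precision@k at each rank holding a relevant item, averaged over relevant items
    relevances = np.asarray(relevances)
    precision_at_k = np.cumsum(relevances) / np.arange(1, len(relevances) + 1)
    return (precision_at_k * relevances).sum() / relevances.sum()

# Two hypothetical queries (1 = relevant)
ap1 = average_precision([1, 0, 1, 0, 0])  # (1/1 + 2/3) / 2 ≈ 0.833
ap2 = average_precision([0, 1, 0, 0, 1])  # (1/2 + 2/5) / 2 = 0.45
map_score = np.mean([ap1, ap2])           # MAP ≈ 0.64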

Hit Rate and MRR

| Metric | Formula | Use Case |
|---|---|---|
| Hit Rate@K | Fraction of users with at least one relevant item in the top K | Recommenders: "did the user click anything?" |
| MRR | $\frac{1}{Q}\sum_i \frac{1}{\text{rank}_i}$ over $Q$ queries | Search: "at what position is the first correct answer?" |
| Precision@K | Fraction of the top K that is relevant | Search / recommendation |
| Recall@K | Fraction of all relevant items that appear in the top K | Recommenders: "how many relevant items do we recover?" |
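Both metrics reduce to a few lines once you know the rank of the first relevant item per query (hypothetical ranks below):

import numpy as np

first_relevant_rank = np.array([1, 3, 2, 10])  # per query: rank of the first correct result

mrr = np.mean(1.0 / first_relevant_rank)           # (1 + 1/3 + 1/2 + 1/10) / 4 ≈ 0.48
hit_rate_at_3 = np.mean(first_relevant_rank <= 3)  # 0.75: three of four queries hit the top 3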

Ranking Metrics in Interviews

When asked how to evaluate a recommender or search system, don't just say AUC. Mention NDCG@K, Hit Rate@K, and Precision@K: users only look at the first few results, so the overall ranking barely matters.

Cross-Validation

K-Fold Cross-Validation

  1. Split the data into $k$ folds (typically $k = 5$ or $10$)
  2. Train on $k-1$ folds, evaluate on the held-out fold
  3. Repeat $k$ times
  4. Report the mean and standard deviation

$$\text{CV Score} = \frac{1}{k}\sum_{i=1}^{k}\text{Score}_i$$
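In sklearn this is a single call; reporting the spread alongside the mean is the whole point of step 4:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="roc_auc")
print(f"{scores.mean():.3f} ± {scores.std():.3f}")  # mean and variability across folds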

Special Variants

| Variant | When to Use |
|---|---|
| Stratified K-Fold | Imbalanced classification: keeps each fold's class ratio consistent |
| Time Series Split | Temporal data: train on the past, test on the future (no future information leaks) |
| Group K-Fold | Grouped data: all records from the same patient/user stay in one fold |
| Leave-One-Out | Very small datasets: $k = n$, maximum training data per fold |
| Repeated K-Fold | More stable estimates: repeat K-Fold with different random splits |

The Fatal CV Mistake

Feature selection, hyperparameter tuning, and any other data-dependent preprocessing must happen inside each fold, never before the split. Otherwise you get data leakage. The fix: bundle preprocessing and model together with sklearn.pipeline.Pipeline, as sketched below.
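A sketch of the correct pattern: every data-dependent step lives inside the Pipeline, so each CV fold re-fits the preprocessing on its own training portion only.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and feature selector are re-fit inside each training fold:
# no statistics from the held-out fold leak into preprocessing
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")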

Data Leakage

Common Sources

| Type | Description | Example |
|---|---|---|
| Feature leakage | A feature contains information unavailable at prediction time | Predicting "needs treatment" with features that include the treatment outcome |
| Train-test contamination | Test data influences training | Normalizing before the split (mean/std computed using test data) |
| Temporal leakage | Using future data to predict the past | Using tomorrow's stock price as a feature to predict today's trading volume |
| Target leakage | A feature is derived from the target | Predicting churn with a feature that encodes the last active date |

Detection Signals

  • Suspiciously high metrics: if the model looks too good, there is almost certainly leakage
  • An unexpected top feature: when a seemingly irrelevant feature dominates importance, investigate whether it has an illegitimate link to the target
  • Train ≈ test performance: train is normally better than test; if they match, the leaked information may be available in both

Metric Selection Decision Framework

Decision flow for the interview question "which metric would you use?":

| Step | Question | Action |
|---|---|---|
| 1 | Problem type? | Classification → precision/recall/AUC. Regression → RMSE/MAE. Ranking → NDCG/MAP. |
| 2 | Classes balanced? | Balanced → accuracy is fine. Imbalanced → PR-AUC, F1. |
| 3 | Need probabilities? | Yes → check calibration (Brier score). No → AUC for ranking. |
| 4 | Which costs more, FP or FN? | FN costly → optimize recall. FP costly → optimize precision. |
| 5 | Offline vs online? | A good offline metric does not guarantee online gains; validate with an A/B test. |
| 6 | Baseline? | Always compare against a baseline (majority class, mean prediction, random). |

Real-World Use Cases

Case 1: Credit Card Fraud Detection

| Decision | Choice | Reasoning |
|---|---|---|
| Primary metric | PR-AUC | 99.9% negatives → both accuracy and ROC-AUC get inflated by TNs |
| Threshold tuning | Optimize F2 | An FN (missed fraud = direct monetary loss) costs more than an FP (blocked legitimate transaction = support cost) |
| Calibration | Required | P(fraud) decides whether to block the transaction or merely flag it for review |
| Online metric | Fraud loss rate + false block rate | Good offline PR-AUC does not guarantee lower fraud losses in production |

Case 2: House Price Prediction

| Decision | Choice | Reasoning |
|---|---|---|
| Primary metric | RMSE or MAE | Depends on the cost of large errors. For real-estate agents → MAE is more intuitive. For bank risk control → RMSE (big misses are more dangerous) |
| Why not $R^2$? | $R^2$ is not comparable across datasets | $R^2$ depends on the variance of the target, which differs from region to region |
| Log transform | Use RMSLE | House prices are roughly log-normal → RMSE on the log scale is more meaningful |
| Baseline | Mean/median of the training set | Any model must beat this baseline to add value |

Case 3: Recommender System

| Decision | Choice | Reasoning |
|---|---|---|
| Offline metric | NDCG@10, Hit Rate@10 | Users only see the top 10 recommendations; the overall ranking barely matters |
| Online metric | CTR, conversion rate, revenue per session | Good offline NDCG does not mean users actually bought anything |
| Diversity | Intra-list diversity | Recommending only similar items hurts the experience even when CTR is high |
| Cold start | Coverage rate for new items | Can the system surface newly listed products? |

Decision Boundary Explorer

See how different classifiers separate the same data with different decision boundaries (interactive demo; requires scikit-learn).

Hands-on: Metrics in Python

Classification Metrics on Imbalanced Data

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    classification_report,
)

# Imbalanced: 95% negative, 5% positive
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

accuracy_score(y_test, y_pred)            # ~0.96 ← misleading!
precision_score(y_test, y_pred)           # ~0.67
recall_score(y_test, y_pred)              # ~0.40
f1_score(y_test, y_pred)                  # ~0.50
roc_auc_score(y_test, y_prob)             # ~0.92
average_precision_score(y_test, y_prob)   # ~0.45 ← most informative here
# Baseline (always predict 0): accuracy ≈ 0.95 → model barely beats it

Ranking Metrics

from sklearn.metrics import ndcg_score
import numpy as np

# True relevance scores and model predictions
y_true = np.array([[3, 2, 3, 0, 1, 2]])  # relevance labels
y_score = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]])  # model scores

ndcg_at_3 = ndcg_score(y_true, y_score, k=3)
ndcg_at_5 = ndcg_score(y_true, y_score, k=5)
# NDCG@3 close to 1.0 if top-3 items are the most relevant

Interview Signals

What interviewers listen for:

  • You define the business problem and metric before talking about models
  • You know why baselines matter (majority class, mean prediction)
  • You choose metrics and thresholds based on FP/FN costs
  • You can explain ROC-AUC vs PR-AUC, and when calibration matters
  • You know offline metrics ≠ online metrics, and validate with A/B tests

Practice

Flashcards


Why is PR-AUC more meaningful than ROC-AUC on imbalanced data?

ROC-AUC's FPR denominator includes the large number of TNs, so FPR stays low even with many FPs and the AUC gets inflated. PR-AUC looks only at precision and recall on the positive class, so a large TN count cannot inflate it. Rule of thumb: rare positive class → PR-AUC.


Quiz


Credit risk prediction where 99.9% of borrowers do not default: which metric is more meaningful than accuracy?
