Evaluation Metrics

Interview Essentials

Evaluation metrics are a guaranteed interview topic. Interviewers want to see that you can choose the right metric for the business context, explain the tradeoffs, and diagnose model problems. Don't just memorize formulas; be able to say "in this scenario I would watch this metric, because…".

What You Should Understand

  • Choose the right metric for the business context (not every problem is an accuracy problem)
  • Understand the precision/recall tradeoff and threshold selection
  • Know how ROC-AUC and PR-AUC differ on imbalanced data
  • Explain why calibration matters and how to fix it
  • Understand the types of data leakage and how to detect them

Classification Metrics

Confusion Matrix

All binary classification metrics derive from four counts:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |

Worked example: with TP = 45, FP = 5, FN = 5, TN = 45 (100 samples), every metric below works out to 90.0%:

| Metric | Formula | Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 90.0% |
| Precision | TP / (TP + FP) | 90.0% |
| Recall | TP / (TP + FN) | 90.0% |
| F1 Score | 2PR / (P + R) | 90.0% |
| Specificity | TN / (TN + FP) | 90.0% |

Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The most intuitive metric, and the one that misleads most often. On a dataset that is 99% negatives, always predicting negative already gives 99% accuracy.

Accuracy Paradox

Never use accuracy as the primary metric on an imbalanced dataset. A fraud detection model with 99.9% accuracy may catch 0% of the fraud, simply because it always predicts "not fraud".
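A quick way to see the paradox, using sklearn's DummyClassifier as the majority-class baseline (a minimal sketch with synthetic labels):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: ~1% positives; the features carry no signal at all
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = np.zeros((10_000, 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

accuracy_score(y, y_pred)  # ~0.99, looks impressive
recall_score(y, y_pred)    # 0.0, catches zero positives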

Precision and Recall

Precision — Of all predicted positives, how many are truly positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity, TPR) — Of all actual positives, how many did we catch?

$$\text{Recall} = \frac{TP}{TP + FN}$$

The Precision-Recall Tradeoff

Lowering the classification threshold catches more positives (higher recall) but also raises more false alarms (lower precision). This tradeoff is fundamental: you cannot maximize both at once. The sketch below makes it concrete.
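A minimal sketch with synthetic imbalanced data: sweep the threshold on a held-out set and watch precision and recall move in opposite directions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for t in (0.2, 0.5, 0.8):
    pred = (probs >= t).astype(int)
    # Lower threshold: recall rises, precision falls; higher threshold: the reverse
    print(f"t={t}: precision={precision_score(y_te, pred):.2f}, recall={recall_score(y_te, pred):.2f}")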

Threshold choice depends on business cost

| Scenario | Prioritize | Cost Reasoning |
|---|---|---|
| Cancer screening | Recall | An FN (missed cancer) costs far more than an FP (one extra test) |
| Spam filtering | Precision | An FP (legitimate mail marked as spam) makes users furious |
| Fraud detection | Cost-weighted | Compare FP investigation cost against FN fraud loss |
| Search ranking | Precision@K | Precision of the top 10 results drives the user experience |
| Content moderation | Recall | An FN (undetected harmful content) carries high reputational risk |

F1 Score and F-beta

F1 — harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

Why the harmonic mean instead of the arithmetic mean? The harmonic mean punishes extreme imbalance: with precision = 1.0 and recall = 0.01, the arithmetic mean is 0.505 (looks decent) while the harmonic mean is about 0.02, correctly reflecting that the model is nearly useless.

$F_\beta$ generalizes this to weight recall $\beta$ times as much as precision:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

  • $F_2$: cares more about recall (FN costly: cancer, fraud)
  • $F_{0.5}$: cares more about precision (FP costly: spam, content recommendation); see the example below
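sklearn exposes this directly as fbeta_score. A toy example where precision (2/3) exceeds recall (1/2), so the precision-weighted score is highest:

from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # precision = 2/3, recall = 1/2

f1_score(y_true, y_pred)               # ~0.571, balanced
fbeta_score(y_true, y_pred, beta=2)    # ~0.526, recall-heavy score drops
fbeta_score(y_true, y_pred, beta=0.5)  # 0.625, precision-heavy score rises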

Specificity and FPR

Specificity (TNR): $\frac{TN}{TN + FP} = 1 - \text{FPR}$

False Positive Rate: $\text{FPR} = \frac{FP}{FP + TN}$

Threshold-Independent Metrics

ROC Curve and AUC

ROC plots TPR vs FPR at every threshold:

  • Perfect: AUC = 1.0 (the curve hugs the top-left corner)
  • Random: AUC = 0.5 (the diagonal)
  • Interpretation: AUC is the probability that a randomly chosen positive is scored higher than a randomly chosen negative:

$$\text{AUC} = P(\hat{y}_{\text{pos}} > \hat{y}_{\text{neg}})$$

Strengths: Threshold-independent, scale-invariant.

Weakness: Can mislead on imbalanced data. The FPR denominator is dominated by the huge number of TNs, so FPR stays low even with many FPs, and AUC looks high while performance on the positive class is poor.
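The probabilistic interpretation above can be checked numerically: with synthetic scores, the fraction of (positive, negative) pairs where the positive outranks the negative matches roc_auc_score (a sketch; ties are negligible here because scores are continuous):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = rng.random(500) + 0.5 * y  # positives tend to score higher

pos, neg = scores[y == 1], scores[y == 0]
# Fraction of (positive, negative) pairs where the positive outranks the negative
pairwise = (pos[:, None] > neg[None, :]).mean()
print(pairwise, roc_auc_score(y, scores))  # the two values agree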

PR Curve and PR-AUC

PR curve plots precision vs recall at every threshold:

  • Perfect: PR-AUC = 1.0
  • Random baseline: PR-AUC ≈ prevalence (the positive rate)
  • Focuses entirely on the positive class: it is not inflated by a large TN count

ROC-AUC vs PR-AUC

On a dataset with a 1% positive rate, a model that ranks all positives near the top but also gives many negatives high scores can reach ROC-AUC ≈ 0.99 (because FPR stays low) while PR-AUC is poor (precision collapses at high recall).

Rule of thumb: if the positive class is rare and you care about positive predictions, use PR-AUC. If classes are balanced and you care about overall ranking, use ROC-AUC.

Log Loss (Cross-Entropy)

Measures the quality of predicted probabilities:

$$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$
  • Penalizes confident wrong predictions heavily (predicting 0.99 when the truth is 0 yields a huge loss), as shown below
  • Requires calibrated probabilities: a high AUC does not imply a low log loss
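A small illustration of how a single confident miss dominates the average (toy values chosen for the example):

from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
mostly_right = [0.1, 0.1, 0.9, 0.9]         # confident and correct everywhere
one_confident_miss = [0.99, 0.1, 0.9, 0.9]  # 0.99 on a true 0

log_loss(y_true, mostly_right)        # ~0.11
log_loss(y_true, one_confident_miss)  # ~1.23, one bad prediction blows up the average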

Calibration

A model is well-calibrated if among all predictions of 70%, approximately 70% are actually positive.

Reliability Diagram

Plot predicted probability (binned) against the actual fraction of positives. A perfectly calibrated model lies on the diagonal.

Common miscalibration patterns:

  • Overconfident: predictions are more extreme than reality (predicted 90% but only 70% are actually positive); common in deep learning
  • Underconfident: predictions are too conservative (predicted 60% but 80% are actually positive)

Brier Score

$$\text{Brier} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)^2$$

Lower is better. Decomposes into calibration + resolution + uncertainty.
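Computing it with sklearn; note that always predicting 0.5 gives the uninformative baseline of 0.25:

from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0]
brier_score_loss(y_true, [0.1, 0.9, 0.8, 0.2])  # 0.025, sharp and accurate
brier_score_loss(y_true, [0.5, 0.5, 0.5, 0.5])  # 0.25, uninformative baseline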

When Calibration Matters

| Need probabilities for… | Calibration critical? |
|---|---|
| Ranking / sorting | No: only the relative order matters |
| Setting ad bid prices | Yes: bid = predicted conversion rate × value |
| Computing expected loss | Yes: E[loss] = P(event) × cost |
| Displaying risk to users | Yes: "you have a 30% chance of…" must be accurate |
| Combining multiple models | Yes: probabilities must be on the same scale |
| A/B test metric | No: usually comparing means |

Calibration Methods

| Method | How It Works | When to Use |
|---|---|---|
| Platt Scaling | Fit a logistic regression on model outputs | SVMs, neural networks |
| Isotonic Regression | Non-parametric monotone step function | More data available, need flexibility |
| Temperature Scaling | Divide logits by a learned $T$ | Deep learning (single parameter) |
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Calibrate a model
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Check calibration
prob_true, prob_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
# Plot prob_pred vs prob_true — should be close to diagonal
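Temperature scaling itself is small enough to sketch by hand. The snippet below is a minimal sketch for the binary case, assuming you already have validation logits and labels (the ones here are synthetic and deliberately overconfident):

import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    # Negative log-likelihood of sigmoid(logits / T); binary case for brevity
    p = 1.0 / (1.0 + np.exp(-logits / T))
    eps = 1e-12
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Hypothetical validation set: logits are overconfident relative to the noise level
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
logits = 3.0 * (2 * labels - 1) + rng.normal(0, 3.0, size=1000)

res = minimize_scalar(nll, bounds=(0.05, 20), args=(logits, labels), method="bounded")
T = res.x  # T > 1 here: raw logits were overconfident; divide logits by T at inference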

Regression Metrics

MSE, RMSE, MAE

| Metric | Formula | Outlier Sensitivity | Interpretation |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | High (squaring amplifies large errors) | Hard to interpret (squared units) |
| RMSE | $\sqrt{\text{MSE}}$ | High | Same unit as $y$: "typical error magnitude" |
| MAE | $\frac{1}{n}\sum\|y_i - \hat{y}_i\|$ | Low (robust to outliers) | Same unit as $y$: "median-like error" |

Choosing between MSE and MAE

  • MSE penalizes large errors more heavily → use it when big misses cost far more than small ones (electricity demand, inventory forecasting)
  • MAE weighs all errors equally → use it when you want a robust "typical error"
  • Minimizing MSE corresponds to predicting the mean; minimizing MAE corresponds to predicting the median (outlier sensitivity demonstrated below)
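The outlier sensitivity is easy to demonstrate with one large miss and four perfect predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10, 12, 11, 10, 50])  # one outlier target
y_pred = np.array([10, 12, 11, 10, 11])  # model misses only the outlier

mean_absolute_error(y_true, y_pred)          # 7.8
np.sqrt(mean_squared_error(y_true, y_pred))  # ~17.4, RMSE is dominated by the single miss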

MAPE and Its Limitations

$$\text{MAPE} = \frac{100\%}{n}\sum\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Gives a relative error, but with two problems (both shown below):

  1. Undefined when $y_i = 0$
  2. Asymmetric: over-prediction and under-prediction are penalized differently
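Both failure modes in a few lines; mape here is a hypothetical helper written out for clarity:

import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

mape([100], [0])    # 100.0: under-prediction is capped at 100%
mape([100], [300])  # 200.0: over-prediction is unbounded
mape([0], [10])     # division by zero: returns inf with a RuntimeWarning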

R-Squared

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

  • $R^2 = 1$: perfect. $R^2 = 0$: no better than predicting the mean
  • Can be negative on test data, meaning the model is worse than the mean baseline (see below)
  • Adding more features can only increase training $R^2$, even with pure noise features → use adjusted $R^2$ or cross-validated metrics
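Negative test-set R² in action; predicting the test mean scores exactly zero:

from sklearn.metrics import r2_score

y_test = [3.0, 5.0, 7.0]
r2_score(y_test, [5.0, 5.0, 5.0])  # 0.0, same as predicting the test mean
r2_score(y_test, [7.0, 5.0, 3.0])  # -3.0, worse than the mean baseline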

Ranking Metrics

In recommender systems and search engines, ranking quality matters more than classification metrics:

NDCG (Normalized Discounted Cumulative Gain)

$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \qquad \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$

where IDCG is the DCG of the ideal (perfect) ranking.

Highly relevant items placed near the top contribute more gain. The position discount, $\log_2(i+1)$, reflects that users are less and less likely to look further down the list.
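The formula is short enough to compute by hand (a sketch; sklearn's ndcg_score, shown later in this section, does the same):

import numpy as np

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i + 1) for ranks i = 1..k
    return np.sum((2 ** rels - 1) / discounts)

ranked = [3, 2, 0, 1]                 # relevance labels in model-ranked order
ideal = sorted(ranked, reverse=True)  # [3, 2, 1, 0], the ideal ordering
ndcg = dcg_at_k(ranked, 4) / dcg_at_k(ideal, 4)  # ~0.99: near-ideal ordering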

MAP (Mean Average Precision)

$$\text{AP} = \frac{1}{|R|}\sum_{k=1}^{n} \text{Precision@k} \cdot \text{rel}(k)$$

Compute Precision@k at the rank of each relevant item as it is retrieved, then average over the relevant items. MAP is the mean of AP across queries.
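A minimal implementation makes the definition concrete (binary relevance, items given in model-ranked order):

import numpy as np

def average_precision(relevances):
    # Precision@k at each rank holding a relevant item, averaged over relevant items
    relevances = np.asarray(relevances)
    precision_at_k = np.cumsum(relevances) / np.arange(1, len(relevances) + 1)
    return (precision_at_k * relevances).sum() / relevances.sum()

# Two hypothetical queries (1 = relevant)
ap1 = average_precision([1, 0, 1, 0, 0])  # (1/1 + 2/3) / 2 ≈ 0.833
ap2 = average_precision([0, 1, 0, 0, 1])  # (1/2 + 2/5) / 2 = 0.45
map_score = np.mean([ap1, ap2])           # MAP ≈ 0.64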

Hit Rate and MRR

| Metric | Formula | Use Case |
|---|---|---|
| Hit Rate@K | Fraction of users with at least one relevant item in the top K | Recommenders: "did the user click anything?" |
| MRR | $\frac{1}{Q}\sum_i \frac{1}{\text{rank}_i}$ over $Q$ queries | Search: "at what position is the first correct answer?" |
| Precision@K | Fraction of the top K that is relevant | Search / recommendation |
| Recall@K | Fraction of all relevant items that appear in the top K | Recommenders: "how many relevant items do we recover?" |
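Both metrics reduce to a few lines once you know the rank of the first relevant item per query (hypothetical ranks below):

import numpy as np

first_relevant_rank = np.array([1, 3, 2, 10])  # per query: rank of the first correct result

mrr = np.mean(1.0 / first_relevant_rank)           # (1 + 1/3 + 1/2 + 1/10) / 4 ≈ 0.48
hit_rate_at_3 = np.mean(first_relevant_rank <= 3)  # 0.75: three of four queries hit the top 3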

Ranking Metrics in Interviews

When asked how to evaluate a recommender or search system, don't just say AUC. Mention NDCG@K, Hit Rate@K, and Precision@K: users only look at the first few results, so the overall ranking barely matters.

Cross-Validation

K-Fold Cross-Validation

  1. Split the data into $k$ folds (typically $k = 5$ or $10$)
  2. Train on $k-1$ folds, evaluate on the held-out fold
  3. Repeat $k$ times
  4. Report the mean and standard deviation

$$\text{CV Score} = \frac{1}{k}\sum_{i=1}^{k}\text{Score}_i$$
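In sklearn this is a single call; reporting the spread alongside the mean is the whole point of step 4:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="roc_auc")
print(f"{scores.mean():.3f} ± {scores.std():.3f}")  # mean and variability across folds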

Special Variants

| Variant | When to Use |
|---|---|
| Stratified K-Fold | Imbalanced classification: keeps each fold's class ratio consistent |
| Time Series Split | Temporal data: train on the past, test on the future (no future information leaks) |
| Group K-Fold | Grouped data: all records from the same patient/user stay in one fold |
| Leave-One-Out | Very small datasets: $k = n$, maximum training data per fold |
| Repeated K-Fold | More stable estimates: repeat K-Fold with different random splits |

The Fatal CV Mistake

Feature selection, hyperparameter tuning, and any other data-dependent preprocessing must happen inside each fold, never before the split. Otherwise you get data leakage. The fix: bundle preprocessing and model together with sklearn.pipeline.Pipeline, as sketched below.
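A sketch of the correct pattern: every data-dependent step lives inside the Pipeline, so each CV fold re-fits the preprocessing on its own training portion only.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and feature selector are re-fit inside each training fold:
# no statistics from the held-out fold leak into preprocessing
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")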

Data Leakage

Common Sources

| Type | Description | Example |
|---|---|---|
| Feature leakage | A feature contains information unavailable at prediction time | Predicting "needs treatment" with features that include the treatment outcome |
| Train-test contamination | Test data influences training | Normalizing before the split (mean/std computed using test data) |
| Temporal leakage | Using future data to predict the past | Using tomorrow's stock price as a feature to predict today's trading volume |
| Target leakage | A feature is derived from the target | Predicting churn with a feature that encodes the last active date |

Detection Signals

  • Suspiciously high metrics: if the model looks too good, there is almost certainly leakage
  • An unexpected top feature: when a seemingly irrelevant feature dominates importance, investigate whether it has an illegitimate link to the target
  • Train ≈ test performance: train is normally better than test; if they match, the leaked information may be available in both

Metric Selection Decision Framework

Decision flow for the interview question "which metric would you use?":

| Step | Question | Action |
|---|---|---|
| 1 | Problem type? | Classification → precision/recall/AUC. Regression → RMSE/MAE. Ranking → NDCG/MAP. |
| 2 | Classes balanced? | Balanced → accuracy is fine. Imbalanced → PR-AUC, F1. |
| 3 | Need probabilities? | Yes → check calibration (Brier score). No → AUC for ranking. |
| 4 | Which costs more, FP or FN? | FN costly → optimize recall. FP costly → optimize precision. |
| 5 | Offline vs online? | A good offline metric does not guarantee online gains; validate with an A/B test. |
| 6 | Baseline? | Always compare against a baseline (majority class, mean prediction, random). |

Real-World Use Cases

Case 1: Credit Card Fraud Detection

| Decision | Choice | Reasoning |
|---|---|---|
| Primary metric | PR-AUC | 99.9% negatives → both accuracy and ROC-AUC get inflated by TNs |
| Threshold tuning | Optimize F2 | An FN (missed fraud = direct monetary loss) costs more than an FP (blocked legitimate transaction = support cost) |
| Calibration | Required | P(fraud) decides whether to block the transaction or merely flag it for review |
| Online metric | Fraud loss rate + false block rate | Good offline PR-AUC does not guarantee lower fraud losses in production |

Case 2: House Price Prediction

| Decision | Choice | Reasoning |
|---|---|---|
| Primary metric | RMSE or MAE | Depends on the cost of large errors. For real-estate agents → MAE is more intuitive. For bank risk control → RMSE (big misses are more dangerous) |
| Why not $R^2$? | $R^2$ is not comparable across datasets | $R^2$ depends on the variance of the target, which differs from region to region |
| Log transform | Use RMSLE | House prices are roughly log-normal → RMSE on the log scale is more meaningful |
| Baseline | Mean/median of the training set | Any model must beat this baseline to add value |

Case 3: Recommender System

| Decision | Choice | Reasoning |
|---|---|---|
| Offline metric | NDCG@10, Hit Rate@10 | Users only see the top 10 recommendations; the overall ranking barely matters |
| Online metric | CTR, conversion rate, revenue per session | Good offline NDCG does not mean users actually bought anything |
| Diversity | Intra-list diversity | Recommending only similar items hurts the experience even when CTR is high |
| Cold start | Coverage rate for new items | Can the system surface newly listed products? |

Decision Boundary Explorer

See how different classifiers separate the same data with different decision boundaries (interactive demo; requires scikit-learn).

Hands-on: Metrics in Python

Classification Metrics on Imbalanced Data

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    classification_report,
)

# Imbalanced: 95% negative, 5% positive
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

accuracy_score(y_test, y_pred)            # ~0.96 ← misleading!
precision_score(y_test, y_pred)           # ~0.67
recall_score(y_test, y_pred)              # ~0.40
f1_score(y_test, y_pred)                  # ~0.50
roc_auc_score(y_test, y_prob)             # ~0.92
average_precision_score(y_test, y_prob)   # ~0.45 ← most informative here
# Baseline (always predict 0): accuracy ≈ 0.95 → model barely beats it

Ranking Metrics

from sklearn.metrics import ndcg_score
import numpy as np

# True relevance scores and model predictions
y_true = np.array([[3, 2, 3, 0, 1, 2]])  # relevance labels
y_score = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]])  # model scores

ndcg_at_3 = ndcg_score(y_true, y_score, k=3)
ndcg_at_5 = ndcg_score(y_true, y_score, k=5)
# NDCG@3 close to 1.0 if top-3 items are the most relevant

Interview Signals

What interviewers listen for:

  • You define the business problem and metric before talking about models
  • You know why baselines matter (majority class, mean prediction)
  • You choose metrics and thresholds based on FP/FN costs
  • You can explain ROC-AUC vs PR-AUC, and when calibration matters
  • You know offline metrics ≠ online metrics, and validate with A/B tests

Practice

Flashcards


Why is PR-AUC more meaningful than ROC-AUC on imbalanced data?

ROC-AUC's FPR denominator includes the large number of TNs, so FPR stays low even with many FPs and the AUC gets inflated. PR-AUC looks only at precision and recall on the positive class, so a large TN count cannot inflate it. Rule of thumb: rare positive class → PR-AUC.


Quiz


Credit risk prediction where 99.9% of borrowers do not default: which metric is more meaningful than accuracy?
