ML Pipelines & Systems

Interview Context

Recent DS interviews often expect you not just to build models, but to discuss deployment, monitoring, experiment iteration, and cross-functional collaboration. System design questions test whether you can take ML from notebook to production — the model is only 5%; the other 95% is data pipelines, serving, and monitoring.

What You Should Understand

  • Describe every component of an end-to-end ML pipeline
  • Know the differences between data drift, concept drift, and train-serving skew, and how to detect each
  • Design the offline/online architecture of a feature store
  • Know the tradeoffs of batch vs online serving
  • Answer ML system design interview questions with a systematic framework

End-to-End ML Pipeline

A production ML system has far more components than the model:

Data → Feature → Training → Serving → Monitoring → Feedback (looping back to Data)

The model is usually the component least likely to break — most production issues come from data quality, feature pipelines, and serving infrastructure.

The Hidden Technical Debt in ML

Google's famous paper estimates that ML code makes up only about 5% of a real-world ML system. The other 95% is data collection, verification, feature extraction, serving, monitoring, and configuration. This is why system design matters as much as modeling.

Data Pipeline

Data Collection and Validation

Before any modeling, ensure data reliability:

| Check | What It Catches | Example |
|---|---|---|
| Schema validation | Breaking changes upstream | Column type changes from int to string |
| Null rate monitoring | Data collection bugs | Sensor starts sending nulls at midnight |
| Distribution shift | Upstream data changes | User demographics shift after marketing campaign |
| Volume anomaly | Pipeline failures | ETL job fails silently, table has 0 new rows |
| Duplicate detection | Replay bugs | Same events ingested twice |
| Freshness check | Stale data | Data was last updated 3 days ago instead of hourly |

Fail Loud, Not Silent

The worst data quality issue is the one nobody notices. Build pipelines that fail loudly — block downstream jobs when quality checks fail. A delayed model is far better than a model trained on corrupted data.
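The checks above can be sketched as a single gate function that blocks the pipeline on failure. A minimal sketch; the schema, thresholds, and column names are illustrative:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, expected_schema: dict,
                       max_null_rate: float = 0.05, min_rows: int = 1) -> list:
    """Return a list of failure messages; an empty list means the data passed."""
    failures = []
    # Schema validation: catch upstream type changes
    for col, dtype in expected_schema.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Null rate monitoring: catch data collection bugs
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > max_null_rate:
            failures.append(f"{col}: null rate {null_rate:.1%} > {max_null_rate:.1%}")
    # Volume anomaly: silent ETL failures often show up as empty tables
    if len(df) < min_rows:
        failures.append(f"row count {len(df)} < minimum {min_rows}")
    return failures

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.5, None, 12.0]})
schema = {"user_id": "int64", "amount": "float64"}
failures = run_quality_checks(df, schema, max_null_rate=0.5)
# Fail loudly: raise (blocking downstream jobs) instead of logging and moving on
if failures:
    raise ValueError("data quality checks failed: " + "; ".join(failures))
```

Raising here is the point: a blocked DAG gets a human's attention, while a warning in a log file does not.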

Data Versioning

| Tool | What It Versions | Use Case |
|---|---|---|
| DVC | Data files, models | Reproducible experiments |
| Delta Lake / Iceberg | Tables (time travel) | Query data as of any timestamp |
| Feature Store | Feature definitions + values | Consistent train/serve features |
| Git | Code only | Already standard |

Feature Engineering in Production

Feature Store

A centralized system for managing ML features with two key stores:

| Store | Latency | Computation | Use Case |
|---|---|---|---|
| Offline | Minutes-hours | Batch (Spark, SQL) | Training data, batch inference |
| Online | Milliseconds | Pre-computed, key-value lookup | Real-time inference |

Both stores must compute exactly the same features — any discrepancy is train-serving skew.
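The online half can be sketched in a few lines, with a plain dict standing in for a real low-latency key-value store (Redis, DynamoDB); the entity and feature names are made up for illustration:

```python
import time

online_store = {}  # stand-in for a real low-latency key-value store

def materialize(entity_id: str, features: dict) -> None:
    """Batch job: push offline-computed feature values into the online store."""
    online_store[entity_id] = {**features, "_updated_at": time.time()}

def get_online_features(entity_id: str) -> dict:
    """Serving path: millisecond key-value lookup, no computation."""
    return online_store[entity_id]

# A nightly batch job writes; real-time inference reads
materialize("user_42", {"avg_spend_7d": 42.0, "txn_count_30d": 17})
feats = get_online_features("user_42")
```

The design point: the serving path only looks values up — all expensive computation happened in the batch job that materialized them.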

Feature Types by Latency

| Type | Update Frequency | Examples | Storage |
|---|---|---|---|
| Static | Rarely changes | User country, account type | Database |
| Batch | Hourly/daily | 7-day average spend, rolling conversion rate | Offline store |
| Near-real-time | Minutes | Session click count, cart total | Streaming + online store |
| Real-time | Milliseconds | Current page URL, query text | Request context |

Hybrid pattern (the most common): pre-compute expensive features in batch, then combine them with real-time context features at serve time.

Train-Serving Skew: The #1 Production ML Bug

Training and serving compute features differently. Common causes: different libraries (pandas vs Spark), different data sources (warehouse vs event stream), different preprocessing (training uses cleaned data, serving receives raw data). The fix: the same code path for train and serve — the core value of a feature store.
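One way to enforce a single code path, sketched with hypothetical fraud features (the field and column names are made up for illustration):

```python
import pandas as pd

def compute_features(txn: dict, user_history: pd.DataFrame) -> dict:
    """Single feature function imported by BOTH the training job and the
    serving path — any fix here applies to both, eliminating skew."""
    recent = user_history.tail(30)
    return {
        "amount": txn["amount"],
        "amount_vs_avg": txn["amount"] / max(recent["amount"].mean(), 1e-9),
        "txn_count_recent": len(recent),
    }

history = pd.DataFrame({"amount": [10.0, 20.0, 30.0]})
train_row = compute_features({"amount": 40.0}, history)  # offline, over logged data
serve_row = compute_features({"amount": 40.0}, history)  # online, on the live request
assert train_row == serve_row  # identical by construction
```

Skew creeps in when this function is reimplemented twice (once in Spark for training, once in the service); importing one definition in both places removes that failure mode.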

Model Training Pipeline

Training Best Practices

| Practice | Why | How |
|---|---|---|
| Reproducibility | Recreate any past experiment | Fix random seeds, version data + code, log all hyperparameters |
| Baseline first | You can't judge a model without a baseline to compare against | Logistic regression, mean prediction, most-recent-value |
| Proper validation | Avoid data leakage | Time-based splits for temporal data, stratified for imbalanced |
| Automated pipeline | Manual human steps are error-prone | Airflow, Kubeflow, or similar orchestrator |

Model Registry

Track model lifecycle:

Development → Staging → Production → Archived

Registry stores: model artifacts, metadata (training data version, hyperparameters, metrics), and lineage (which data/features/code produced this model).

This enables reproducibility (recreate any past model) and rollback (revert to the previous production model in minutes).

Model Serving

Serving Patterns

| Pattern | Latency | Use Case | Architecture |
|---|---|---|---|
| Batch | Hours | Email recommendations, daily reports | Scheduled job → write predictions to table |
| Online | Milliseconds | Search ranking, fraud detection | Model server behind API |
| Streaming | Seconds | Real-time anomaly detection | Consume events → predict → produce |
| Edge | Milliseconds | Mobile apps, IoT | Model on device |

Online Serving Considerations

| Concern | Impact | Solution |
|---|---|---|
| Latency budget | Product team says 50ms → model must respond within that | Simpler model, distillation, quantization |
| Throughput | QPS requirements | Batching requests, horizontal scaling |
| Model size | Large model → slow load, high memory | Distillation, pruning, quantization |
| Feature fetching | Often the real bottleneck (not model inference) | Pre-compute, cache, reduce feature count |

Model Compression

| Technique | How | Tradeoff |
|---|---|---|
| Distillation | Train small "student" model to mimic large "teacher" | Smaller + faster, slight accuracy drop |
| Quantization | FP32 → INT8 weights | 2-4x smaller, minimal accuracy drop |
| Pruning | Remove near-zero weights | Sparse model, hardware-dependent speedup |
| ONNX Runtime | Optimized inference engine | Faster serving, framework-independent |

Interview Framework for Serving

When asked about serving strategy, consider four things: (1) latency requirements, (2) how often predictions change, (3) how many predictions per day, (4) cost constraints. If predictions don't need to be fresh (e.g., weekly recommendations), batch is simpler and cheaper.
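The first two questions can be encoded as a rough decision heuristic; the thresholds and return labels below are illustrative, not a standard:

```python
def choose_serving_pattern(needs_fresh_predictions: bool,
                           latency_budget_ms: float) -> str:
    """Rough first cut at the batch-vs-online decision."""
    if not needs_fresh_predictions:
        return "batch"       # precompute on a schedule, serve from a table
    if latency_budget_ms <= 1000:
        return "online"      # model server behind an API
    return "streaming"       # consume events, predict, produce results

# Weekly recommendations don't need fresh predictions -> batch wins on cost
pattern = choose_serving_pattern(needs_fresh_predictions=False,
                                 latency_budget_ms=200)
```

Volume and cost (questions 3 and 4) then refine the choice within each pattern, e.g. whether online serving needs request batching or horizontal scaling.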

Monitoring and Drift Detection

Three Levels of Monitoring

| Level | What to Monitor | Alert Threshold |
|---|---|---|
| Infrastructure | Latency (p50/p95/p99), error rate, CPU/memory | SLA-based |
| Model performance | Online metrics (CTR, conversion, revenue) | X% drop from baseline |
| Data/prediction | Feature distributions, prediction distribution, coverage | PSI > 0.25, null rate > X% |

Data Drift vs Concept Drift vs Train-Serving Skew

| Type | What Changes | Example | Detection |
|---|---|---|---|
| Data drift | P(X) changes, P(Y \| X) stays | More mobile users after redesign | Feature distribution monitoring (PSI, KS test) |
| Concept drift | P(Y \| X) changes | User preferences change during pandemic | Model performance monitoring (needs labels) |
| Train-serving skew | Feature computation differs (bug) | Training uses pandas, serving uses Spark | Compare training vs serving feature statistics |

PSI (Population Stability Index)

\text{PSI} = \sum_{i=1}^{k} (p_i - q_i) \cdot \ln\frac{p_i}{q_i}

where p_i and q_i are the proportions of the production (actual) and training (expected) data falling in bin i, over k quantile bins.

| PSI | Interpretation |
|---|---|
| < 0.1 | No significant shift |
| 0.1 - 0.25 | Moderate — investigate |
| > 0.25 | Significant — likely need retraining |

Drift Detection in Practice

Most teams start with simple threshold alerts: prediction mean shift > X%, key feature null rate > Y%. Sophisticated drift detection (a statistical test on every feature) generates too many false alarms. Start simple; add complexity only when needed.

Retraining Strategy

When to Retrain

| Trigger | When | Example |
|---|---|---|
| Scheduled | Fixed cadence | Fraud model retrained daily |
| Performance-based | Online metrics drop | CTR model accuracy drops 5% |
| Drift-based | Significant feature drift detected | PSI > 0.25 |
| Event-based | Known external change | New product launch, holiday season |

Safe Deployment

| Step | What | Why |
|---|---|---|
| 1. Automated validation | New model must pass quality gates | Performance ≥ baseline, no regression on slices |
| 2. Shadow deployment | Run new model alongside current | Compare outputs on real traffic without affecting users |
| 3. Canary / gradual rollout | 1% → 5% → 25% → 100% | Detect issues early, limit blast radius |
| 4. Easy rollback | One-click revert | Undo in minutes if issues arise |

Offline vs Online Metric Gap

Why They Diverge

| Cause | Example |
|---|---|
| Proxy metric mismatch | Offline AUC up but online revenue unchanged — AUC and revenue aren't necessarily correlated |
| Data distribution shift | Offline eval set doesn't represent current production traffic |
| Feedback loops | Model predictions change user behavior → invalidate offline eval |
| Latency issues | Model too slow → timeouts → fallback to simpler model |
| Position bias | Users click higher-ranked items regardless of relevance |

Bridging the Gap

Always validate offline improvements with an A/B test. Build historical analyses showing the correlation between your offline metric and online business metrics. If they consistently diverge → your offline metric needs to change.

ML System Design Interview Framework

A six-step framework for answering ML system design interview questions:

Step 1: Clarify Requirements

| Question | Why |
|---|---|
| What is the product? Who are the users? | Define scope |
| Latency and throughput requirements? | Constrains model complexity + serving pattern |
| How much data? How often does it change? | Training strategy |
| What are the success metrics? | Offline + online |

Step 2: Define the ML Problem

| Question | Options |
|---|---|
| Framing | Classification, regression, ranking, recommendation? |
| Label | What is it? How is it collected? How delayed? |
| Baseline | Rule-based, most-popular, random, existing model? |

Step 3: Data and Features

  • What data sources are available?
  • Feature engineering strategy (static, batch, real-time)
  • How to handle missing data, outliers, cold start?

Step 4: Model Selection and Training

  • Start with simple baseline → iterate to complex
  • Training: data split, validation strategy, hyperparameter tuning
  • Offline evaluation: which metrics, on what data?

Step 5: Serving and Deployment

  • Batch vs online inference?
  • Latency budget and model complexity tradeoff
  • A/B testing plan for online validation

Step 6: Monitoring and Iteration

  • What metrics to monitor? At what levels?
  • How to detect and handle drift?
  • Retraining cadence and rollback plan

Real-World System Design Examples

Example 1: Fraud Detection System

| Component | Design Decision |
|---|---|
| Problem | Binary classification: is this transaction fraud? |
| Latency | < 100ms (must decide before the transaction completes) |
| Features | Real-time (amount, merchant) + batch (user history, device fingerprint) |
| Model | LightGBM (fast inference, handles mixed features) |
| Serving | Online — model server behind payment API |
| Monitoring | Fraud loss rate, false positive rate, feature drift |
| Retraining | Daily (fraud patterns change rapidly) |
| Challenge | Delayed labels (chargebacks take 30-90 days) → use proxy metrics for early monitoring |

Example 2: Recommendation System

| Component | Design Decision |
|---|---|
| Problem | Ranking: which items to show to this user? |
| Latency | < 200ms for full pipeline |
| Architecture | Candidate generation (1M→1K, two-tower + ANN) → Ranking (1K→100, GBM with rich features) → Re-ranking (diversity, business rules) |
| Features | User embeddings (batch), item embeddings (batch), real-time context |
| Serving | Online with batch-precomputed embeddings |
| Monitoring | CTR, conversion, revenue per session, diversity, coverage |
| Challenge | Cold start (new items/users), feedback loop (popular items get more data) |

Example 3: Search Ranking

| Component | Design Decision |
|---|---|
| Problem | Learning to rank: relevance of documents to a query |
| Latency | < 50ms |
| Features | Query features, document features, query-document match features (BM25, semantic similarity) |
| Model | LambdaMART or cross-encoder → distilled to a faster model for serving |
| Offline metric | NDCG@10 |
| Online metric | Successful session rate, queries per session (should not increase) |
| Challenge | Position bias in logged data → need inverse propensity weighting or unbiased evaluation |

Hands-on: ML Pipeline in Python

sklearn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Pipeline: same transforms at training AND serving time
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
    ("model",   RandomForestClassifier(n_estimators=100)),
])

# CV: preprocessing runs inside each fold (no leakage); X, y assumed defined
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

# Deploy same pipeline object → no train-serving skew
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Drift Detection with PSI

import numpy as np

def compute_psi(expected, actual, bins=10):
    """Population Stability Index for drift detection."""
    breakpoints = np.quantile(expected, np.linspace(0, 1, bins + 1))
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf

    exp_pct = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual, bins=breakpoints)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-4, None)
    act_pct = np.clip(act_pct, 1e-4, None)

    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Example: monitor feature drift in production
# (important_features, train_data, production_data, alert() defined elsewhere)
for feature in important_features:
    psi = compute_psi(train_data[feature], production_data[feature])
    if psi > 0.25:
        alert(f"Significant drift in {feature}: PSI={psi:.3f}")

Model Experiment Tracking

import mlflow

mlflow.set_experiment("fraud_detection")

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1})

    # Train model
    model = train_model(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({"auc": 0.95, "precision": 0.82, "recall": 0.78})

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Later: load model for serving
    # loaded_model = mlflow.sklearn.load_model("runs:/<run_id>/model")

Interview Signals

What interviewers listen for:

  • You have a product and systems perspective, not just model-tuning skills
  • You know that going to production is the real beginning — the best notebook score is not the finish line
  • You can translate complex techniques into decision language stakeholders understand
  • You proactively bring up monitoring, rollback, and baseline comparison
  • You can distinguish data drift, concept drift, and train-serving skew

Practice

Flashcards


What is train-serving skew, and how do you fix it?

Training and serving compute features differently → online performance drops. Most common causes: different code paths (pandas vs Spark), different data sources. Fix: a feature store that guarantees the same code path for train and serve.

