ML Pipelines & Systems
Interview Context
Recent DS interviews expect you not only to build models but also to discuss deployment, monitoring, experiment iteration, and cross-functional collaboration. System design questions test whether you can take ML from notebook to production: the model is only 5% of the work; the other 95% is data pipelines, serving, and monitoring.
What You Should Understand
- Describe every component of an end-to-end ML pipeline
- Know the differences between data drift, concept drift, and train-serving skew, and how to detect each
- Design the offline/online architecture of a feature store
- Know the tradeoffs between batch and online serving
- Answer ML system design interview questions with a systematic framework
End-to-End ML Pipeline
A production ML system has far more components than the model.
The model is usually the component least likely to break; most production issues come from data quality, feature pipelines, and serving infrastructure.
The Hidden Technical Debt in ML
Google's well-known paper "Hidden Technical Debt in Machine Learning Systems" estimated that ML code makes up only about 5% of a real-world ML system. The other 95% is data collection, verification, feature extraction, serving, monitoring, and configuration. This is why system design matters as much as modeling.
Data Pipeline
Data Collection and Validation
Before any modeling, ensure data reliability:
| Check | What It Catches | Example |
|---|---|---|
| Schema validation | Breaking changes upstream | Column type changes from int to string |
| Null rate monitoring | Data collection bugs | Sensor starts sending nulls at midnight |
| Distribution shift | Upstream data changes | User demographics shift after marketing campaign |
| Volume anomaly | Pipeline failures | ETL job fails silently, table has 0 new rows |
| Duplicate detection | Replay bugs | Same events ingested twice |
| Freshness check | Stale data | Data was last updated 3 days ago instead of hourly |
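The checks in the table can be sketched as one small validation function run before any training or serving job consumes a batch. This is a minimal pandas sketch; `validate_batch` and its thresholds are illustrative names, not any library's API:

```python
import pandas as pd

def validate_batch(df, expected_schema, max_null_rate=0.05, min_rows=1):
    """Run basic quality checks; return a list of failure messages (empty = pass)."""
    failures = []
    # Schema validation: catch upstream type changes (e.g. int -> string)
    for col, dtype in expected_schema.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Null-rate monitoring: catch silent data-collection bugs
    for col in df.columns:
        rate = df[col].isna().mean()
        if rate > max_null_rate:
            failures.append(f"{col}: null rate {rate:.0%}")
    # Volume anomaly: catch ETL jobs that fail silently
    if len(df) < min_rows:
        failures.append(f"only {len(df)} rows")
    # Duplicate detection: catch replay bugs
    if df.duplicated().any():
        failures.append(f"{df.duplicated().sum()} duplicate rows")
    return failures

batch = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.5, None, 7.2]})
issues = validate_batch(batch, {"user_id": "int64", "amount": "float64"})
```

A non-empty `issues` list is what should block downstream jobs (the "fail loud" principle below).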
Fail Loud, Not Silent
The worst data quality issues are the ones nobody notices. Build pipelines that fail loudly and block downstream jobs when quality checks fail. A delayed model is far better than a model trained on corrupted data.
Data Versioning
| Tool | What It Versions | Use Case |
|---|---|---|
| DVC | Data files, models | Reproducible experiments |
| Delta Lake / Iceberg | Tables (time travel) | Query data as of any timestamp |
| Feature Store | Feature definitions + values | Consistent train/serve features |
| Git | Code only | Already standard |
Feature Engineering in Production
Feature Store
A centralized system for managing ML features with two key stores:
| Store | Latency | Computation | Use Case |
|---|---|---|---|
| Offline | Minutes-hours | Batch (Spark, SQL) | Training data, batch inference |
| Online | Milliseconds | Pre-computed, key-value lookup | Real-time inference |
Both stores must compute exactly the same features; any discrepancy is train-serving skew.
Feature Types by Latency
| Type | Compute Time | Examples | Storage |
|---|---|---|---|
| Static | Rarely changes | User country, account type | Database |
| Batch | Hourly/daily | 7-day average spend, rolling conversion rate | Offline store |
| Near-real-time | Minutes | Session click count, cart total | Streaming + online store |
| Real-time | Milliseconds | Current page URL, query text | Request context |
Hybrid pattern (the most common): precompute expensive features in batch, then combine them with real-time context features at serving time.
Train-Serving Skew: The #1 Production ML Bug
Feature computation differs between training and serving. Common causes: different libraries (pandas vs Spark), different data sources (warehouse vs event stream), different preprocessing (training uses cleaned data while serving receives raw data). The fix: the same code path for training and serving, which is the core value of a feature store.
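A minimal sketch of the "same code path" idea: one feature function (here the illustrative `compute_features`) builds the training set offline and featurizes a single live request online, so the two paths cannot diverge:

```python
import pandas as pd

def compute_features(txn: dict, user_stats: dict) -> dict:
    """ONE feature definition shared by the training and serving paths."""
    avg = user_stats.get("avg_amount_7d", 0.0)
    return {
        "amount": txn["amount"],
        "amount_vs_avg": txn["amount"] / avg if avg > 0 else 0.0,
        "is_night": 1 if txn["hour"] < 6 else 0,
    }

# Training path: apply the same function over historical rows
history = [
    {"amount": 120.0, "hour": 2, "avg_amount_7d": 40.0},
    {"amount": 35.0, "hour": 14, "avg_amount_7d": 40.0},
]
train_features = pd.DataFrame(
    [compute_features(row, {"avg_amount_7d": row["avg_amount_7d"]})
     for row in history]
)

# Serving path: the identical function on a single live request
live = compute_features({"amount": 120.0, "hour": 2}, {"avg_amount_7d": 40.0})
```

The names and fields are made up; the point is that both paths call the same function rather than re-implementing logic in pandas and Spark separately.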
Model Training Pipeline
Training Best Practices
| Practice | Why | How |
|---|---|---|
| Reproducibility | Recreate any past experiment | Fix random seeds, version data + code, log all hyperparameters |
| Baseline first | You can't tell whether a model is good without a baseline to compare against | Logistic regression, mean prediction, most-recent-value |
| Proper validation | 避免 data leakage | Time-based splits for temporal, stratified for imbalanced |
| Automated pipeline | Human manual steps = error-prone | Airflow, Kubeflow, or similar orchestrator |
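The "proper validation" row can be demonstrated with scikit-learn's `TimeSeriesSplit`, which guarantees that validation indices always come after training indices, so no future data leaks into the past:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered samples (think: 12 months of data)
X = np.arange(12).reshape(-1, 1)

# Each fold trains on the past and validates strictly on the future
folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in folds:
    # no future rows ever leak into the training side
    assert train_idx.max() < val_idx.min()
    print(f"train={train_idx.tolist()} val={val_idx.tolist()}")
```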
Model Registry
A model registry tracks the model lifecycle.
The registry stores: model artifacts, metadata (training data version, hyperparameters, metrics), and lineage (which data/features/code produced this model).
It enables reproducibility (recreate any past model) and rollback (revert to the previous production model in minutes).
Model Serving
Serving Patterns
| Pattern | Latency | Use Case | Architecture |
|---|---|---|---|
| Batch | Hours | Email recommendations, daily reports | Scheduled job → write predictions to table |
| Online | Milliseconds | Search ranking, fraud detection | Model server behind API |
| Streaming | Seconds | Real-time anomaly detection | Consume events → predict → produce |
| Edge | Milliseconds | Mobile apps, IoT | Model on device |
Online Serving Considerations
| Concern | Impact | Solution |
|---|---|---|
| Latency budget | Product team sets a 50ms budget → model must respond within it | Simpler model, distillation, quantization |
| Throughput | QPS requirements | Batching requests, horizontal scaling |
| Model size | Large model → slow load, high memory | Distillation, pruning, quantization |
| Feature fetching | Often the real bottleneck (not model inference) | Pre-compute, cache, reduce feature count |
Model Compression
| Technique | How | Tradeoff |
|---|---|---|
| Distillation | Train small "student" model to mimic large "teacher" | Smaller + faster, slight accuracy drop |
| Quantization | FP32 → INT8 weights | 2-4x smaller, minimal accuracy drop |
| Pruning | Remove near-zero weights | Sparse model, hardware-dependent speedup |
| ONNX Runtime | Optimized inference engine | Faster serving, framework-independent |
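The quantization row's size/accuracy tradeoff can be shown with a toy symmetric INT8 scheme. Real engines (e.g. ONNX Runtime, TensorRT) use per-channel scales and calibration data, so treat this as a sketch of the idea only:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric linear quantization of FP32 weights to INT8."""
    scale = np.abs(w).max() / 127.0  # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)

# 4x smaller: 1 byte per weight instead of 4
assert q.nbytes * 4 == w.nbytes
# Rounding error is bounded by half a quantization step
err = np.abs(dequantize(q, scale) - w).max()
assert err <= scale / 2 + 1e-6
```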
Interview Framework for Serving
When asked about serving strategy, consider four things: (1) latency requirements, (2) how often predictions change, (3) how many predictions per day, (4) cost constraints. If predictions don't need to be fresh (e.g., weekly recommendations), batch is simpler and cheaper.
Monitoring and Drift Detection
Three Levels of Monitoring
| Level | What to Monitor | Alert Threshold |
|---|---|---|
| Infrastructure | Latency (p50/p95/p99), error rate, CPU/memory | SLA-based |
| Model performance | Online metrics (CTR, conversion, revenue) | X% drop from baseline |
| Data/prediction | Feature distributions, prediction distribution, coverage | PSI > 0.25, null rate > X% |
Data Drift vs Concept Drift vs Train-Serving Skew
| Type | What Changes | Example | Detection |
|---|---|---|---|
| Data drift | P(X) changes, P(Y\|X) stays | More mobile users after redesign | Feature distribution monitoring (PSI, KS test) |
| Concept drift | P(Y\|X) changes | User preferences change during pandemic | Model performance monitoring (needs labels) |
| Train-serving skew | Feature computation differs (bug) | Training uses pandas, serving uses Spark | Compare training vs serving feature statistics |
PSI (Population Stability Index)
| PSI | Interpretation |
|---|---|
| < 0.1 | No significant shift |
| 0.1 - 0.25 | Moderate — investigate |
| > 0.25 | Significant — likely need retraining |
Drift Detection in Practice
Most teams start with simple threshold alerts: prediction mean shifts by more than X%, or a key feature's null rate exceeds Y%. Sophisticated drift detection (a statistical test on every feature) generates too many false alarms. Start simple; add complexity only when needed.
Retraining Strategy
When to Retrain
| Trigger | When | Example |
|---|---|---|
| Scheduled | Fixed cadence | Fraud model retrained daily |
| Performance-based | Online metrics drop | CTR model accuracy drops 5% |
| Drift-based | Significant feature drift detected | PSI > 0.25 |
| Event-based | Known external change | New product launch, holiday season |
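The first three triggers in the table can be combined into a simple policy check. `should_retrain` and its thresholds are illustrative, not a standard API:

```python
def should_retrain(psi_by_feature, metric_drop_pct, days_since_last_train,
                   psi_threshold=0.25, metric_threshold=5.0, max_age_days=7):
    """Return the list of retraining triggers that fired (empty = no retrain)."""
    triggers = []
    if days_since_last_train >= max_age_days:
        triggers.append("scheduled")      # fixed cadence
    if metric_drop_pct >= metric_threshold:
        triggers.append("performance")    # online metric dropped
    drifted = [f for f, psi in psi_by_feature.items() if psi > psi_threshold]
    if drifted:
        triggers.append("drift:" + ",".join(drifted))  # PSI-based trigger
    return triggers

print(should_retrain({"amount": 0.31, "hour": 0.05},
                     metric_drop_pct=1.2, days_since_last_train=3))
# -> ['drift:amount']
```

Event-based triggers (product launches, holidays) usually stay manual, which is why they are not in the function.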
Safe Deployment
| Step | What | Why |
|---|---|---|
| 1. Automated validation | New model must pass quality gates | Performance ≥ baseline, no regression on slices |
| 2. Shadow deployment | Run new model alongside current | Compare outputs on real traffic without affecting users |
| 3. Canary / gradual rollout | 1% → 5% → 25% → 100% | Detect issues early, limit blast radius |
| 4. Easy rollback | One-click revert | Undo in minutes if issues arise |
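Canary rollout needs deterministic user bucketing so the same user always sees the same model. A hash-based sketch (the `assign_model` helper is illustrative):

```python
import hashlib

def assign_model(user_id: str, canary_pct: float) -> str:
    """Deterministically route a stable slice of users to the canary model."""
    # Hash the user id into a uniform bucket in [0, 100);
    # the same user always lands in the same bucket.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = (h % 10000) / 100.0
    return "canary" if bucket < canary_pct else "production"

# Gradual rollout: raise canary_pct through 1 -> 5 -> 25 -> 100.
# Buckets are nested, so users in the 1% canary stay in it at 5%.
for pct in [1, 5, 25, 100]:
    share = sum(assign_model(f"user_{i}", pct) == "canary"
                for i in range(10_000)) / 10_000
    print(f"canary_pct={pct}%: observed share={share:.1%}")
```

Hashing (rather than random sampling per request) is what makes rollback clean: dropping `canary_pct` back to 0 instantly returns everyone to the production model.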
Offline vs Online Metric Gap
Why They Diverge
| Cause | Example |
|---|---|
| Proxy metric mismatch | Offline AUC ↑ but online revenue unchanged; AUC and revenue are not necessarily correlated |
| Data distribution shift | The offline eval set does not represent current production traffic |
| Feedback loops | Model predictions change user behavior → invalidate offline eval |
| Latency issues | Model too slow → timeouts → fallback to simpler model |
| Position bias | Users click higher-ranked items regardless of relevance |
Bridging the Gap
Always validate offline improvements with an A/B test. Build historical analyses showing the correlation between your offline metric and online business metrics. If they consistently diverge, your offline metric needs to change.
ML System Design Interview Framework
A six-step framework for answering ML system design questions in interviews:
Step 1: Clarify Requirements
| Question | Why |
|---|---|
| What is the product? Who are the users? | Define scope |
| Latency and throughput requirements? | Constrains model complexity + serving pattern |
| How much data? How often does it change? | Training strategy |
| What are the success metrics? | Offline + online |
Step 2: Define the ML Problem
| Question | Options |
|---|---|
| Framing | Classification, regression, ranking, recommendation? |
| Label | What is it? How is it collected? How delayed? |
| Baseline | Rule-based, most-popular, random, existing model? |
Step 3: Data and Features
- What data sources are available?
- Feature engineering strategy (static, batch, real-time)
- How to handle missing data, outliers, cold start?
Step 4: Model Selection and Training
- Start with simple baseline → iterate to complex
- Training: data split, validation strategy, hyperparameter tuning
- Offline evaluation: which metrics, on what data?
Step 5: Serving and Deployment
- Batch vs online inference?
- Latency budget and model complexity tradeoff
- A/B testing plan for online validation
Step 6: Monitoring and Iteration
- What metrics to monitor? At what levels?
- How to detect and handle drift?
- Retraining cadence and rollback plan
Real-World System Design Examples
Example 1: Fraud Detection System
| Component | Design Decision |
|---|---|
| Problem | Binary classification: is this transaction fraud? |
| Latency | < 100ms (must decide before the transaction completes) |
| Features | Real-time (amount, merchant) + batch (user history, device fingerprint) |
| Model | LightGBM (fast inference, handles mixed features) |
| Serving | Online — model server behind payment API |
| Monitoring | Fraud loss rate, false positive rate, feature drift |
| Retraining | Daily (fraud patterns change rapidly) |
| Challenge | Delayed labels (chargebacks arrive 30-90 days later) → use proxy metrics for early monitoring |
Example 2: Recommendation System
| Component | Design Decision |
|---|---|
| Problem | Ranking: which items to show to this user? |
| Latency | < 200ms for full pipeline |
| Architecture | Candidate generation (1M→1K, two-tower + ANN) → Ranking (1K→100, GBM with rich features) → Re-ranking (diversity, business rules) |
| Features | User embeddings (batch), item embeddings (batch), real-time context |
| Serving | Online with batch-precomputed embeddings |
| Monitoring | CTR, conversion, revenue per session, diversity, coverage |
| Challenge | Cold start (new items/users), feedback loop (popular items get more data) |
Example 3: Search Ranking
| Component | Design Decision |
|---|---|
| Problem | Learning to rank: relevance of documents to query |
| Latency | < 50ms |
| Features | Query features, document features, query-document match features (BM25, semantic similarity) |
| Model | LambdaMART or cross-encoder → distilled to faster model for serving |
| Offline metric | NDCG@10 |
| Online metric | Successful session rate, queries per session(should not increase) |
| Challenge | Position bias in logged data → need inverse propensity weighting or unbiased evaluation |
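Inverse propensity weighting can be illustrated with a toy example: each logged impression is reweighted by the inverse of the (assumed) probability that its position was examined, so clicks at rarely-examined positions count for more. The propensity values below are made up for illustration:

```python
# Toy illustration of inverse propensity weighting (IPW) for position bias.
# In practice, examination propensities are estimated, e.g. from
# result-swap experiments.
propensity = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.25, 5: 0.15}

# Logged impressions: (position shown, clicked?)
log = [(1, 1), (1, 0), (2, 1), (3, 0), (5, 1)]

# Naive CTR over-credits top positions
naive_ctr = sum(c for _, c in log) / len(log)

# IPW: weight each impression by 1/propensity of its position
ipw_clicks = sum(c / propensity[p] for p, c in log)
ipw_exposure = sum(1 / propensity[p] for p, _ in log)
ipw_ctr = ipw_clicks / ipw_exposure  # self-normalized estimate

print(f"naive={naive_ctr:.2f}  ipw={ipw_ctr:.2f}")
```

Here the click at position 5 (examined only ~15% of the time) gets the largest weight, which is exactly the correction position bias requires.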
Hands-on: ML Pipeline in Python
sklearn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Pipeline: same transforms at training AND serving time
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100)),
])

# CV: preprocessing is fit inside each fold (no leakage!)
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

# Deploy the same pipeline object → no train-serving skew
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
Drift Detection with PSI
```python
import numpy as np

def compute_psi(expected, actual, bins=10):
    """Population Stability Index for drift detection."""
    # Bin edges from the quantiles of the expected (training) distribution
    breakpoints = np.quantile(expected, np.linspace(0, 1, bins + 1))
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual, bins=breakpoints)[0] / len(actual)
    # Clip to avoid log(0) on empty bins
    exp_pct = np.clip(exp_pct, 1e-4, None)
    act_pct = np.clip(act_pct, 1e-4, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Example: monitor feature drift in production.
# important_features, train_data, production_data, and alert() are
# placeholders for your own feature list, datasets, and paging hook.
for feature in important_features:
    psi = compute_psi(train_data[feature], production_data[feature])
    if psi > 0.25:
        alert(f"Significant drift in {feature}: PSI={psi:.3f}")
```
Model Experiment Tracking
```python
import mlflow

mlflow.set_experiment("fraud_detection")

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1})
    # Train model (train_model is your own training routine)
    model = train_model(X_train, y_train)
    # Log metrics
    mlflow.log_metrics({"auc": 0.95, "precision": 0.82, "recall": 0.78})
    # Log model
    mlflow.sklearn.log_model(model, "model")

# Later: load the model for serving
# loaded_model = mlflow.sklearn.load_model("runs:/<run_id>/model")
```
Interview Signals
What interviewers listen for:
- You have a product and systems perspective, not just model-tuning skills
- You know launch is the real beginning, not the end once the notebook score looks best
- You can translate complex technical work into decision language stakeholders understand
- You proactively mention monitoring, rollback, and baseline comparisons
- You can distinguish data drift, concept drift, and train-serving skew
Practice
Flashcards
What is train-serving skew, and how do you fix it?
Feature computation differs between training and serving, so online performance drops. Most common causes: different code paths (pandas vs Spark) and different data sources. Fix: a feature store that guarantees the same code path for training and serving.
Quiz
Which of the following best matches train-serving skew?