ML Pipelines & Systems

Interview Context

Recent DS interviews often expect you not just to build models, but to discuss deployment, monitoring, experiment iteration, and cross-functional collaboration. System design questions test whether you can take ML from notebook to production — the model is only 5%; the other 95% is data pipelines, serving, and monitoring.

What You Should Understand

  • Describe every component of an end-to-end ML pipeline
  • Know the differences between data drift, concept drift, and train-serving skew, and how to detect each
  • Design the offline/online architecture of a feature store
  • Know the tradeoffs of batch vs online serving
  • Answer ML system design interview questions with a systematic framework

End-to-End ML Pipeline

A production ML system has far more components than the model:

Data → Feature → Training → Serving → Monitoring → Feedback (looping back to Data)

The model is usually the component least likely to break — most production issues come from data quality, feature pipelines, and serving infrastructure.

The Hidden Technical Debt in ML

Google's famous paper estimates that ML code makes up only about 5% of a real-world ML system. The other 95% is data collection, verification, feature extraction, serving, monitoring, and configuration. This is why system design matters as much as modeling.

Data Pipeline

Data Collection and Validation

Before any modeling, ensure data reliability:

| Check | What It Catches | Example |
|---|---|---|
| Schema validation | Breaking changes upstream | Column type changes from int to string |
| Null rate monitoring | Data collection bugs | Sensor starts sending nulls at midnight |
| Distribution shift | Upstream data changes | User demographics shift after marketing campaign |
| Volume anomaly | Pipeline failures | ETL job fails silently, table has 0 new rows |
| Duplicate detection | Replay bugs | Same events ingested twice |
| Freshness check | Stale data | Data was last updated 3 days ago instead of hourly |

Fail Loud, Not Silent

The worst data quality issue is the one nobody notices. Build pipelines that fail loudly — block downstream jobs when quality checks fail. A delayed model is far better than a model trained on corrupted data.
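The checks above can be sketched as a single gate function that blocks the pipeline on failure. A minimal sketch; the schema, thresholds, and column names are illustrative:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, expected_schema: dict,
                       max_null_rate: float = 0.05, min_rows: int = 1) -> list:
    """Return a list of failure messages; an empty list means the data passed."""
    failures = []
    # Schema validation: catch upstream type changes
    for col, dtype in expected_schema.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Null rate monitoring: catch data collection bugs
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > max_null_rate:
            failures.append(f"{col}: null rate {null_rate:.1%} > {max_null_rate:.1%}")
    # Volume anomaly: silent ETL failures often show up as empty tables
    if len(df) < min_rows:
        failures.append(f"row count {len(df)} < minimum {min_rows}")
    return failures

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.5, None, 12.0]})
schema = {"user_id": "int64", "amount": "float64"}
failures = run_quality_checks(df, schema, max_null_rate=0.5)
# Fail loudly: raise (blocking downstream jobs) instead of logging and moving on
if failures:
    raise ValueError("data quality checks failed: " + "; ".join(failures))
```

Raising here is the point: a blocked DAG gets a human's attention, while a warning in a log file does not.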

Data Versioning

| Tool | What It Versions | Use Case |
|---|---|---|
| DVC | Data files, models | Reproducible experiments |
| Delta Lake / Iceberg | Tables (time travel) | Query data as of any timestamp |
| Feature Store | Feature definitions + values | Consistent train/serve features |
| Git | Code only | Already standard |

Feature Engineering in Production

Feature Store

A centralized system for managing ML features with two key stores:

| Store | Latency | Computation | Use Case |
|---|---|---|---|
| Offline | Minutes-hours | Batch (Spark, SQL) | Training data, batch inference |
| Online | Milliseconds | Pre-computed, key-value lookup | Real-time inference |

Both stores must compute exactly the same features — any discrepancy is train-serving skew.
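The online half can be sketched in a few lines, with a plain dict standing in for a real low-latency key-value store (Redis, DynamoDB); the entity and feature names are made up for illustration:

```python
import time

online_store = {}  # stand-in for a real low-latency key-value store

def materialize(entity_id: str, features: dict) -> None:
    """Batch job: push offline-computed feature values into the online store."""
    online_store[entity_id] = {**features, "_updated_at": time.time()}

def get_online_features(entity_id: str) -> dict:
    """Serving path: millisecond key-value lookup, no computation."""
    return online_store[entity_id]

# A nightly batch job writes; real-time inference reads
materialize("user_42", {"avg_spend_7d": 42.0, "txn_count_30d": 17})
feats = get_online_features("user_42")
```

The design point: the serving path only looks values up — all expensive computation happened in the batch job that materialized them.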

Feature Types by Latency

| Type | Update Frequency | Examples | Storage |
|---|---|---|---|
| Static | Rarely changes | User country, account type | Database |
| Batch | Hourly/daily | 7-day average spend, rolling conversion rate | Offline store |
| Near-real-time | Minutes | Session click count, cart total | Streaming + online store |
| Real-time | Milliseconds | Current page URL, query text | Request context |

Hybrid pattern (the most common): pre-compute expensive features in batch, then combine them with real-time context features at serve time.

Train-Serving Skew: The #1 Production ML Bug

Training and serving compute features differently. Common causes: different libraries (pandas vs Spark), different data sources (warehouse vs event stream), different preprocessing (training uses cleaned data, serving receives raw data). The fix: the same code path for train and serve — the core value of a feature store.
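One way to enforce a single code path, sketched with hypothetical fraud features (the field and column names are made up for illustration):

```python
import pandas as pd

def compute_features(txn: dict, user_history: pd.DataFrame) -> dict:
    """Single feature function imported by BOTH the training job and the
    serving path — any fix here applies to both, eliminating skew."""
    recent = user_history.tail(30)
    return {
        "amount": txn["amount"],
        "amount_vs_avg": txn["amount"] / max(recent["amount"].mean(), 1e-9),
        "txn_count_recent": len(recent),
    }

history = pd.DataFrame({"amount": [10.0, 20.0, 30.0]})
train_row = compute_features({"amount": 40.0}, history)  # offline, over logged data
serve_row = compute_features({"amount": 40.0}, history)  # online, on the live request
assert train_row == serve_row  # identical by construction
```

Skew creeps in when this function is reimplemented twice (once in Spark for training, once in the service); importing one definition in both places removes that failure mode.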

Model Training Pipeline

Training Best Practices

| Practice | Why | How |
|---|---|---|
| Reproducibility | Recreate any past experiment | Fix random seeds, version data + code, log all hyperparameters |
| Baseline first | You can't judge a model without a baseline to compare against | Logistic regression, mean prediction, most-recent-value |
| Proper validation | Avoid data leakage | Time-based splits for temporal data, stratified for imbalanced |
| Automated pipeline | Manual human steps are error-prone | Airflow, Kubeflow, or similar orchestrator |

Model Registry

Track model lifecycle:

Development → Staging → Production → Archived

Registry stores: model artifacts, metadata (training data version, hyperparameters, metrics), and lineage (which data/features/code produced this model).

This enables reproducibility (recreate any past model) and rollback (revert to the previous production model in minutes).

Model Serving

Serving Patterns

| Pattern | Latency | Use Case | Architecture |
|---|---|---|---|
| Batch | Hours | Email recommendations, daily reports | Scheduled job → write predictions to table |
| Online | Milliseconds | Search ranking, fraud detection | Model server behind API |
| Streaming | Seconds | Real-time anomaly detection | Consume events → predict → produce |
| Edge | Milliseconds | Mobile apps, IoT | Model on device |

Online Serving Considerations

| Concern | Impact | Solution |
|---|---|---|
| Latency budget | Product team says 50ms → model must respond within that | Simpler model, distillation, quantization |
| Throughput | QPS requirements | Batching requests, horizontal scaling |
| Model size | Large model → slow load, high memory | Distillation, pruning, quantization |
| Feature fetching | Often the real bottleneck (not model inference) | Pre-compute, cache, reduce feature count |

Model Compression

| Technique | How | Tradeoff |
|---|---|---|
| Distillation | Train small "student" model to mimic large "teacher" | Smaller + faster, slight accuracy drop |
| Quantization | FP32 → INT8 weights | 2-4x smaller, minimal accuracy drop |
| Pruning | Remove near-zero weights | Sparse model, hardware-dependent speedup |
| ONNX Runtime | Optimized inference engine | Faster serving, framework-independent |

Interview Framework for Serving

When asked about serving strategy, consider four things: (1) latency requirements, (2) how often predictions change, (3) how many predictions per day, (4) cost constraints. If predictions don't need to be fresh (e.g., weekly recommendations), batch is simpler and cheaper.
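The first two questions can be encoded as a rough decision heuristic; the thresholds and return labels below are illustrative, not a standard:

```python
def choose_serving_pattern(needs_fresh_predictions: bool,
                           latency_budget_ms: float) -> str:
    """Rough first cut at the batch-vs-online decision."""
    if not needs_fresh_predictions:
        return "batch"       # precompute on a schedule, serve from a table
    if latency_budget_ms <= 1000:
        return "online"      # model server behind an API
    return "streaming"       # consume events, predict, produce results

# Weekly recommendations don't need fresh predictions -> batch wins on cost
pattern = choose_serving_pattern(needs_fresh_predictions=False,
                                 latency_budget_ms=200)
```

Volume and cost (questions 3 and 4) then refine the choice within each pattern, e.g. whether online serving needs request batching or horizontal scaling.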

Monitoring and Drift Detection

Three Levels of Monitoring

| Level | What to Monitor | Alert Threshold |
|---|---|---|
| Infrastructure | Latency (p50/p95/p99), error rate, CPU/memory | SLA-based |
| Model performance | Online metrics (CTR, conversion, revenue) | X% drop from baseline |
| Data/prediction | Feature distributions, prediction distribution, coverage | PSI > 0.25, null rate > X% |

Data Drift vs Concept Drift vs Train-Serving Skew

| Type | What Changes | Example | Detection |
|---|---|---|---|
| Data drift | P(X) changes, P(Y \| X) stays | More mobile users after redesign | Feature distribution monitoring (PSI, KS test) |
| Concept drift | P(Y \| X) changes | User preferences change during pandemic | Model performance monitoring (needs labels) |
| Train-serving skew | Feature computation differs (bug) | Training uses pandas, serving uses Spark | Compare training vs serving feature statistics |

PSI (Population Stability Index)

\text{PSI} = \sum_{i=1}^{k} (p_i - q_i) \cdot \ln\frac{p_i}{q_i}

where p_i and q_i are the proportions of the production (actual) and training (expected) data falling in bin i, over k quantile bins.

| PSI | Interpretation |
|---|---|
| < 0.1 | No significant shift |
| 0.1 - 0.25 | Moderate — investigate |
| > 0.25 | Significant — likely need retraining |

Drift Detection in Practice

Most teams start with simple threshold alerts: prediction mean shift > X%, key feature null rate > Y%. Sophisticated drift detection (a statistical test on every feature) generates too many false alarms. Start simple; add complexity only when needed.

Retraining Strategy

When to Retrain

| Trigger | When | Example |
|---|---|---|
| Scheduled | Fixed cadence | Fraud model retrained daily |
| Performance-based | Online metrics drop | CTR model accuracy drops 5% |
| Drift-based | Significant feature drift detected | PSI > 0.25 |
| Event-based | Known external change | New product launch, holiday season |

Safe Deployment

| Step | What | Why |
|---|---|---|
| 1. Automated validation | New model must pass quality gates | Performance ≥ baseline, no regression on slices |
| 2. Shadow deployment | Run new model alongside current | Compare outputs on real traffic without affecting users |
| 3. Canary / gradual rollout | 1% → 5% → 25% → 100% | Detect issues early, limit blast radius |
| 4. Easy rollback | One-click revert | Undo in minutes if issues arise |

Offline vs Online Metric Gap

Why They Diverge

| Cause | Example |
|---|---|
| Proxy metric mismatch | Offline AUC up but online revenue unchanged — AUC and revenue aren't necessarily correlated |
| Data distribution shift | Offline eval set doesn't represent current production traffic |
| Feedback loops | Model predictions change user behavior → invalidate offline eval |
| Latency issues | Model too slow → timeouts → fallback to simpler model |
| Position bias | Users click higher-ranked items regardless of relevance |

Bridging the Gap

Always validate offline improvements with an A/B test. Build historical analyses showing the correlation between your offline metric and online business metrics. If they consistently diverge → your offline metric needs to change.

ML System Design Interview Framework

A six-step framework for answering ML system design interview questions:

Step 1: Clarify Requirements

| Question | Why |
|---|---|
| What is the product? Who are the users? | Define scope |
| Latency and throughput requirements? | Constrains model complexity + serving pattern |
| How much data? How often does it change? | Training strategy |
| What are the success metrics? | Offline + online |

Step 2: Define the ML Problem

| Question | Options |
|---|---|
| Framing | Classification, regression, ranking, recommendation? |
| Label | What is it? How is it collected? How delayed? |
| Baseline | Rule-based, most-popular, random, existing model? |

Step 3: Data and Features

  • What data sources are available?
  • Feature engineering strategy (static, batch, real-time)
  • How to handle missing data, outliers, cold start?

Step 4: Model Selection and Training

  • Start with simple baseline → iterate to complex
  • Training: data split, validation strategy, hyperparameter tuning
  • Offline evaluation: which metrics, on what data?

Step 5: Serving and Deployment

  • Batch vs online inference?
  • Latency budget and model complexity tradeoff
  • A/B testing plan for online validation

Step 6: Monitoring and Iteration

  • What metrics to monitor? At what levels?
  • How to detect and handle drift?
  • Retraining cadence and rollback plan

Real-World System Design Examples

Example 1: Fraud Detection System

| Component | Design Decision |
|---|---|
| Problem | Binary classification: is this transaction fraud? |
| Latency | < 100ms (must decide before the transaction completes) |
| Features | Real-time (amount, merchant) + batch (user history, device fingerprint) |
| Model | LightGBM (fast inference, handles mixed features) |
| Serving | Online — model server behind payment API |
| Monitoring | Fraud loss rate, false positive rate, feature drift |
| Retraining | Daily (fraud patterns change rapidly) |
| Challenge | Delayed labels (chargebacks take 30-90 days) → use proxy metrics for early monitoring |

Example 2: Recommendation System

| Component | Design Decision |
|---|---|
| Problem | Ranking: which items to show to this user? |
| Latency | < 200ms for full pipeline |
| Architecture | Candidate generation (1M→1K, two-tower + ANN) → Ranking (1K→100, GBM with rich features) → Re-ranking (diversity, business rules) |
| Features | User embeddings (batch), item embeddings (batch), real-time context |
| Serving | Online with batch-precomputed embeddings |
| Monitoring | CTR, conversion, revenue per session, diversity, coverage |
| Challenge | Cold start (new items/users), feedback loop (popular items get more data) |

Example 3: Search Ranking

| Component | Design Decision |
|---|---|
| Problem | Learning to rank: relevance of documents to a query |
| Latency | < 50ms |
| Features | Query features, document features, query-document match features (BM25, semantic similarity) |
| Model | LambdaMART or cross-encoder → distilled to a faster model for serving |
| Offline metric | NDCG@10 |
| Online metric | Successful session rate, queries per session (should not increase) |
| Challenge | Position bias in logged data → need inverse propensity weighting or unbiased evaluation |

Hands-on: ML Pipeline in Python

sklearn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Pipeline: same transforms at training AND serving time
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
    ("model",   RandomForestClassifier(n_estimators=100)),
])

# CV: preprocessing runs inside each fold (no leakage); X, y assumed defined
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

# Deploy same pipeline object → no train-serving skew
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Drift Detection with PSI

import numpy as np

def compute_psi(expected, actual, bins=10):
    """Population Stability Index for drift detection."""
    breakpoints = np.quantile(expected, np.linspace(0, 1, bins + 1))
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf

    exp_pct = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual, bins=breakpoints)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-4, None)
    act_pct = np.clip(act_pct, 1e-4, None)

    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Example: monitor feature drift in production
# (important_features, train_data, production_data, alert() defined elsewhere)
for feature in important_features:
    psi = compute_psi(train_data[feature], production_data[feature])
    if psi > 0.25:
        alert(f"Significant drift in {feature}: PSI={psi:.3f}")

Model Experiment Tracking

import mlflow

mlflow.set_experiment("fraud_detection")

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1})

    # Train model
    model = train_model(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({"auc": 0.95, "precision": 0.82, "recall": 0.78})

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Later: load model for serving
    # loaded_model = mlflow.sklearn.load_model("runs:/<run_id>/model")

Interview Signals

What interviewers listen for:

  • You have a product and systems perspective, not just model-tuning skills
  • You know that going to production is the real beginning — the best notebook score is not the finish line
  • You can translate complex techniques into decision language stakeholders understand
  • You proactively bring up monitoring, rollback, and baseline comparison
  • You can distinguish data drift, concept drift, and train-serving skew

Practice

Flashcards


What is train-serving skew, and how do you fix it?

Training and serving compute features differently → online performance drops. Most common causes: different code paths (pandas vs Spark), different data sources. Fix: a feature store that guarantees the same code path for train and serve.

