Recommendation Systems
Interview Context
Recommendation systems are a high-frequency system design topic in FAANG interviews. Interviewers ask not only about algorithms but also how you handle cold start, how you evaluate, and how you get from offline metrics to online impact. You need to think from a product angle, not just a technical one.
What You Should Understand
- Compare when collaborative filtering, content-based, and hybrid approaches apply
- Understand the math of matrix factorization and how implicit feedback differs
- Know concrete solutions to the cold start problem
- Design an offline + online evaluation strategy
- Understand production issues such as feedback loops, position bias, and popularity bias
Overview
The Three Paradigms
| Paradigm | Input | Idea | Example |
|---|---|---|---|
| Collaborative Filtering (CF) | User-item interactions | People similar to you liked it, so you may like it too | "Users who bought X also bought Y" |
| Content-Based | Item/user features | Recommend items whose features resemble what you liked before | "Because you watched sci-fi movies..." |
| Hybrid | Both | Combine the strengths of both | Netflix, Spotify, Amazon |
Collaborative Filtering
User-Based CF
Find users similar to you, recommend items they liked:
- Compute similarity between users (based on their rating patterns)
- Find the top-$k$ most similar users to the target user
- Recommend items those similar users rated highly but the target user hasn't seen
Item-Based CF
Find items similar to items the user already liked:
- Compute similarity between items (based on which users rated them similarly)
- For each item the user liked, find similar items
- Recommend the most similar items the user hasn't seen
User-Based vs Item-Based
| Aspect | User-Based | Item-Based |
|---|---|---|
| Similarity matrix | Users × users ($m \times m$) | Items × items ($n \times n$) |
| Stability | Users change preferences → unstable | Item similarities are more stable |
| Cold start | New user problem (no history) | New item problem (no ratings) |
| Scalability | Usually users >> items → larger matrix | Fewer items → smaller, more cacheable matrix |
| Industry preference | Less common (early Amazon) | More common (Amazon's "item-to-item CF") |
Item-based CF is more common in practice: item similarities are more stable (a movie's characteristics don't change, but a user's preferences do), the computation is cheaper (there are usually fewer items than users), and the similarity matrix can be pre-computed.
Similarity Metrics
Cosine Similarity: $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$ — normalized dot product, range [-1, 1]. Insensitive to rating scale.
Pearson Correlation: Mean-centered cosine — subtract each user's mean rating to remove rating bias (some people rate everything high, others are strict).
Jaccard Similarity: $|A \cap B| \, / \, |A \cup B|$ (intersection over union) — only looks at whether an interaction happened, not the rating value. Suited to binary implicit feedback.
Cosine vs Pearson
Cosine does not center the data: if user A rates everything 4-5 and user B rates everything 1-2, raw cosine is dominated by the shared positive offset and reports high similarity even when their preferences disagree. Pearson subtracts each user's mean first, capturing the pattern rather than the absolute level. Rule of thumb: Pearson for explicit ratings, cosine for implicit feedback.
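The three metrics can be compared on a tiny example; a minimal sketch (the data is illustrative):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pearson(u, v):
    # Mean-centered cosine: subtract each user's mean rating first
    return cosine(u - u.mean(), v - v.mean())

def jaccard(u, v):
    # Binary view: only whether an interaction exists, not its value
    a, b = u > 0, v > 0
    return (a & b).sum() / (a | b).sum()

# User A rates everything high, user B everything low, but their rating
# *patterns* agree (both prefer item 0 over items 1 and 2)
a = np.array([5.0, 4.0, 4.0])
b = np.array([2.0, 1.0, 1.0])

print(cosine(a, b))   # high, dominated by the shared positive offset
print(pearson(a, b))  # 1.0: identical pattern after mean-centering
print(jaccard(a, b))  # 1.0: both interacted with all three items
```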
Content-Based Filtering
Recommend items similar to what the user has liked, based on item features.
Item Features
| Domain | Features |
|---|---|
| Movies | Genre, director, actors, keywords, release year |
| E-commerce | Category, brand, price range, description (TF-IDF) |
| News | Topic, entities, source, recency |
| Music | Genre, tempo, energy, artist, audio features |
User Profile
Aggregate the features of the items the user has interacted with — either a simple average, or a weighted average (items with higher ratings get larger weights).
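Building a user profile as a rating-weighted average of item feature vectors can be sketched as follows (the feature matrix and ratings are illustrative):

```python
import numpy as np

# Hypothetical binary item features (rows = items, cols = genres)
item_features = np.array([
    [1, 0, 1],   # item 0: action + sci-fi
    [1, 0, 0],   # item 1: action
    [0, 1, 0],   # item 2: romance
], dtype=float)

# The user's ratings of those items weight the feature aggregation
ratings = np.array([5.0, 4.0, 1.0])
user_profile = ratings @ item_features / ratings.sum()

# Score an unseen item by cosine similarity with the profile
new_item = np.array([1.0, 0.0, 1.0])  # another action/sci-fi title
score = (user_profile @ new_item) / (
    np.linalg.norm(user_profile) * np.linalg.norm(new_item))

print(user_profile)  # dominated by the highly rated genres
print(score)
```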
Content-Based vs CF
| Aspect | Content-Based | Collaborative Filtering |
|---|---|---|
| Needs user history | Yes (past items) | Yes (ratings/interactions) |
| Needs item features | Yes | No |
| Cold start (new item) | Handles well (can recommend from features alone) | Can't recommend (no ratings yet) |
| Cold start (new user) | Can't recommend (no history) | Can't recommend (no ratings) |
| Serendipity | Low (only recommends similar items → filter bubble) | High (similar users may have diverse tastes) |
| Domain knowledge | Required (need good features) | Not required |
Filter Bubble Problem
A content-based system only recommends items similar to what the user already liked, trapping the user in an echo chamber. Remedies: add a diversity penalty, exploration (random recommendations), or hybridize with CF (which introduces serendipity).
Matrix Factorization
The Idea
Decompose the sparse user-item rating matrix $R \in \mathbb{R}^{m \times n}$ into two low-rank matrices:
$R \approx P Q^\top$
where $P \in \mathbb{R}^{m \times k}$ = user latent factors, $Q \in \mathbb{R}^{n \times k}$ = item latent factors, $k \ll \min(m, n)$.
Predicted rating: $\hat{r}_{ui} = p_u^\top q_i$
Each latent factor represents a hidden dimension (e.g. factor 1 = "amount of action", factor 2 = "depth") — though factors are usually not directly interpretable.
SVD-based Methods
Vanilla SVD: $R = U \Sigma V^\top$
Problem: $R$ has a huge number of missing values → standard SVD cannot be applied directly to an incomplete matrix.
Regularized SVD (Funk SVD): Only minimize loss on observed ratings:
$\min_{P, Q} \sum_{(u,i)\,\text{observed}} (r_{ui} - p_u^\top q_i)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right)$
Optimize with SGD or ALS (Alternating Least Squares).
Adding Biases
Users and items have inherent biases (some users rate everything high; some items are simply good):
$\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i$
- $\mu$: global average rating
- $b_u$: user bias (relative to the average)
- $b_i$: item bias
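The biased prediction above can be fit with plain SGD over the observed ratings only; a minimal NumPy sketch (toy data and hyperparameters are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2

# Observed (user, item, rating) triples; every other cell is missing
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 3, 2.0), (3, 4, 1.0)]

mu = np.mean([r for _, _, r in ratings])  # global average rating
b_u = np.zeros(n_users)                   # user biases
b_i = np.zeros(n_items)                   # item biases
P = rng.normal(0, 0.1, (n_users, k))      # user latent factors
Q = rng.normal(0, 0.1, (n_items, k))      # item latent factors

lr, reg = 0.05, 0.02
for _ in range(200):
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        # Both factor updates use the same pre-update values
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

mse = np.mean([(r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])) ** 2
               for u, i, r in ratings])
print(mse)  # small: the toy model memorizes the 5 observed ratings
```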
ALS (Alternating Least Squares)
ALS algorithm:
- Fix $Q$, solve for $P$ (closed-form linear regression per user)
- Fix $P$, solve for $Q$ (closed-form per item)
- Repeat until convergence
ALS's advantage: every step is closed-form, so it parallelizes extremely well (Spark MLlib uses ALS). SGD is sequential and harder to parallelize.
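The two alternating closed-form steps can be sketched for the simplified fully-observed case (real ALS solves each least-squares problem over observed entries only; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, reg = 6, 5, 2, 0.01

# Dense toy rating matrix generated from hidden rank-2 factors
R = rng.random((n_users, k)) @ rng.random((k, n_items))

P = rng.normal(size=(n_users, k))   # user factors
Q = rng.normal(size=(n_items, k))   # item factors
I = np.eye(k)

for _ in range(20):
    # Fix Q, solve the regularized least squares for all users at once
    P = R @ Q @ np.linalg.inv(Q.T @ Q + reg * I)
    # Fix P, solve for all items
    Q = R.T @ P @ np.linalg.inv(P.T @ P + reg * I)

rel_err = np.linalg.norm(R - P @ Q.T) / np.linalg.norm(R)
print(rel_err)  # should be small: R is exactly rank 2
```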
Implicit Feedback
Most recommendation settings have no explicit ratings — only implicit signals: clicks, views, purchases, time spent.
Key differences from explicit feedback:
| Aspect | Explicit (ratings) | Implicit (clicks/views) |
|---|---|---|
| Signal | Clear preference (1-5 stars) | Ambiguous (a click ≠ a like) |
| Missing data | Unknown preference | Could mean "not interested" or "never seen" |
| Negative signal | Low rating | Absence of interaction (but the user may simply not have seen it) |
| Volume | Sparse | More abundant |
Weighted ALS for implicit feedback (Hu et al., 2008):
- $p_{ui} = 1$ if the user interacted with the item, 0 otherwise (binary preference)
- $c_{ui} = 1 + \alpha r_{ui}$ (confidence — more interactions $r_{ui}$ → higher confidence)
The objective minimizes $\sum_{u,i} c_{ui} (p_{ui} - x_u^\top y_i)^2$ plus regularization, summed over all user-item pairs, not just observed ones.
```python
from implicit.als import AlternatingLeastSquares

# user_item_matrix: CSR sparse matrix of user-item interactions (implicit)
model = AlternatingLeastSquares(factors=50, regularization=0.1, iterations=20)
model.fit(user_item_matrix)

# Recommend for user 42
ids, scores = model.recommend(42, user_item_matrix[42], N=10)
```
Deep Learning for Recommendations
Two-Tower Model
Modern large-scale recommendation systems commonly use a two-tower architecture:
User tower: Encode user features → user embedding
Item tower: Encode item features → item embedding
Because the two towers only interact through a dot product, item embeddings can be pre-computed and indexed → ANN (Approximate Nearest Neighbor) search gives millisecond-level retrieval.
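A sketch of the serving side: item embeddings are pre-computed, and a brute-force top-K dot product stands in for a real ANN index such as FAISS or ScaNN (the embeddings here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 64

# Item embeddings from the item tower, pre-computed and indexed offline
item_embs = rng.normal(size=(n_items, dim)).astype(np.float32)

def retrieve(user_emb, item_embs, k=100):
    # Brute-force dot product standing in for an ANN index (FAISS, ScaNN)
    scores = item_embs @ user_emb
    top = np.argpartition(-scores, k)[:k]  # O(n) unordered top-k
    return top[np.argsort(-scores[top])]   # sort only the k winners

user_emb = rng.normal(size=dim).astype(np.float32)
candidates = retrieve(user_emb, item_embs)
print(candidates[:5])  # best-scoring item ids first
```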
Multi-Stage Architecture
Production recommendation systems are usually a multi-stage pipeline:
| Stage | Purpose | Candidates | Model Complexity |
|---|---|---|---|
| Candidate generation | Quickly filter millions of items down to ~1000 candidates | 1M → 1K | Simple (two-tower, ANN) |
| Ranking | Precisely rank the ~1000 candidates | 1K → 100 | Complex (deep model, many features) |
| Re-ranking | Business rules, diversity, freshness | 100 → 10 | Rules + optimization |
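The three stages can be wired together as a skeleton; every model here is a random stand-in, and the mod-50 "category" rule is a purely hypothetical diversity constraint:

```python
import numpy as np

rng = np.random.default_rng(0)

def candidate_generation(user_id, n_items=1_000_000, k=1000):
    # Stage 1: cheap, recall-oriented (stand-in for two-tower + ANN)
    return rng.choice(n_items, size=k, replace=False)

def ranking(user_id, candidates, k=100):
    # Stage 2: expensive, precision-oriented model over ~1K candidates
    scores = rng.random(len(candidates))  # stand-in for a deep model
    return candidates[np.argsort(-scores)[:k]]

def rerank(user_id, ranked, k=10):
    # Stage 3: business rules; toy diversity rule allowing at most one
    # item per hypothetical "category" (item_id % 50)
    seen, out = set(), []
    for item in ranked:
        cat = item % 50
        if cat not in seen:
            seen.add(cat)
            out.append(int(item))
        if len(out) == k:
            break
    return out

feed = rerank(0, ranking(0, candidate_generation(0)))
print(feed)
```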
RecSys Design in Interviews
When the interviewer asks you to "design a recommendation system," don't just talk about an algorithm. Describe the multi-stage pipeline: candidate generation (fast, recall-oriented) → ranking (accurate, precision-oriented) → re-ranking (diversity, business rules). Then discuss which model to use at each stage.
Cold Start Problem
Types
| Type | Problem | Solutions |
|---|---|---|
| New user | No interaction history → CF can't work | Content-based (user demographics), popularity-based, onboarding quiz |
| New item | No ratings → CF can't recommend it | Content-based (item features), editorial curation, explore-exploit |
| New system | Both user and item data are sparse | Start with content-based + popular items → gradually transition to CF |
Solutions in Practice
For new users:
- Show popular / trending items (non-personalized → safe baseline)
- Onboarding: ask for preferences or import from other platforms
- Content-based on demographic features (age, location → similar users' preferences)
- Contextual bandits: explore user preferences while serving recommendations
For new items:
- Inject new items into a small % of recommendations (exploration)
- Use item features + content-based similarity to existing items
- Editorial / curated placement
- Time-based boost (new items get a temporary ranking lift)
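The exploration-injection idea for new items can be sketched as follows (the function name and explore rate are illustrative):

```python
import random

random.seed(0)

def inject_new_items(ranked, new_items, explore_rate=0.1):
    # Replace a small fraction of slots with not-yet-exposed new items
    out = list(ranked)
    n_slots = max(1, int(len(out) * explore_rate))
    slots = random.sample(range(len(out)), n_slots)
    for slot, item in zip(slots, new_items):
        out[slot] = item
    return out

feed = inject_new_items(ranked=list(range(10)), new_items=["new_a", "new_b"])
print(feed)  # one of the ten slots now holds "new_a"
```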
Evaluation
Offline Metrics
| Metric | What It Measures | Note |
|---|---|---|
| RMSE | Rating prediction accuracy | Only for explicit feedback |
| Precision@K | Relevant items in top-K / K | How many of the top-K recommendations are relevant |
| Recall@K | Relevant items in top-K / total relevant | How many of all relevant items get recommended |
| NDCG@K | Ranking quality with position discount | Accounts for each item's relevance level and position |
| Hit Rate@K | % of users with ≥1 relevant in top-K | The simplest recommendation success metric |
| MAP | Average precision across users | Averages the precision at each relevant item |
| Coverage | % of items ever recommended | Low = only popular items get recommended |
| Diversity | Intra-list dissimilarity | High = recommended items differ from each other |
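Precision@K, Recall@K, and NDCG@K (binary relevance) can be computed directly:

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    # Fraction of the top-K recommendations that are relevant
    return len(set(recommended[:k]) & relevant) / k

def recall_at_k(recommended, relevant, k):
    # Fraction of all relevant items that appear in the top-K
    return len(set(recommended[:k]) & relevant) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    # Binary relevance with the standard log2 position discount
    dcg = sum(1 / np.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

recommended = [3, 1, 7, 5, 9]   # model's top-5, best first
relevant = {1, 5, 6}            # ground-truth relevant items

print(precision_at_k(recommended, relevant, 5))  # 2 hits in 5 slots -> 0.4
print(recall_at_k(recommended, relevant, 5))     # 2 of 3 relevant found
print(ndcg_at_k(recommended, relevant, 5))
```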
Offline vs Online Gap
Good offline metrics do not guarantee good online metrics:
| Reason | Example |
|---|---|
| Selection bias | Offline evaluation only covers items that were previously recommended (missing not at random) |
| Feedback loop | Items the model recommends get more clicks → self-reinforcing |
| Position bias | Users click higher-ranked items regardless of relevance |
| Context | Offline data cannot capture the user's real-time intent |
Online metrics (A/B test): CTR, conversion rate, revenue per session, time spent, user retention.
Offline NDCG ↑ ≠ Online CTR ↑
Always verify with an A/B test that an offline improvement translates online. A common failure case: offline NDCG improves 5% but online CTR is flat, because the offline evaluation data has selection bias, or the model only improved tail items (in positions users rarely see).
Production Challenges
Popularity Bias
The model tends to recommend already-popular items (popular items have more interactions in the training data) → the popular get more popular → long-tail items never get a chance.
Solutions: Inverse propensity scoring, exploration budget, diversity penalty in re-ranking.
Position Bias
Users click items partly because they're ranked higher, not because they're more relevant.
Higher-positioned items in the training data get more clicks → the model learns "high position = good" → self-reinforcing.
Solutions: Inverse propensity weighting on position, random shuffle experiments to collect unbiased data.
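A sketch of inverse propensity weighting: each click is reweighted by the estimated probability that its position was examined at all (the propensity values here are made up; in practice they come from randomization experiments):

```python
import numpy as np

# Hypothetical examination propensities per position (estimated e.g. from
# randomized-ranking experiments): top slots are seen far more often
propensity = np.array([0.9, 0.5, 0.3, 0.2, 0.1])

# Observed impressions of one item shown at different ranks
positions = np.array([0, 0, 1, 4])  # where the item was shown
clicks = np.array([1, 0, 1, 1])     # whether it was clicked

# Naive CTR ignores that position 4 is rarely even examined
naive_ctr = clicks.mean()

# IPW estimate: each click counts as 1 / P(examined at that position)
ipw_estimate = (clicks / propensity[positions]).sum() / len(clicks)
print(naive_ctr, ipw_estimate)
```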
Feedback Loop
Model predictions → affect what users see → affect user behavior → affect training data → affect model → ...
This loop makes the model's mistakes self-reinforcing.
Solutions: regular exploration (recommend a small fraction of random items), counterfactual evaluation, periodic retraining on unbiased data.
Real-World Use Cases
Case 1: E-commerce Recommendations
| Component | Choice | Reasoning |
|---|---|---|
| Candidate gen | Item-Based CF + popularity | Quickly select ~1000 from millions of products |
| Ranking | GBM with user × item features | Purchase probability + revenue optimization |
| Features | Browsing history, purchase history, price sensitivity, category affinity | |
| Cold start (new user) | Popular items + category-based | Collect preferences during onboarding |
| Cold start (new item) | Content-based (category, brand) + explore | Boost + exploration budget at launch |
| Online metric | Revenue per session, conversion rate | Not just CTR: clicks without purchases have no business value |
Case 2: Video/Music Streaming
| Component | Choice | Reasoning |
|---|---|---|
| Candidate gen | Two-tower (user embed + item embed) | ANN retrieval over millions of pieces of content |
| Ranking | Deep model with sequence features | Watch history → predict the next item |
| Features | Watch time, completion rate, genre, audio features, time of day | |
| Cold start | Content-based (metadata) + editorial | New content relies on metadata and editorial picks |
| Key challenge | Engagement vs satisfaction: finishing a video ≠ liking it (binge-watch guilt) | |
| Online metric | Watch time, retention, survey satisfaction | |
Case 3: Social Media Feed
| Component | Choice | Reasoning |
|---|---|---|
| Candidate gen | Friends' activities + interest graph | Social signal + topic affinity |
| Ranking | Multi-objective: engagement + quality + diversity | |
| Features | Post features, author features, social graph, user engagement history | |
| Key challenge | Filter bubble: users only see content that matches their existing views | |
| Guardrails | Content diversity, quality score, misinformation filter | |
| Online metric | DAU, time spent, meaningful social interactions | |
Hands-on: Recommendation in Python
Item-Based CF with Cosine Similarity
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# user_item_matrix: rows = users, cols = items, values = ratings
item_sim = cosine_similarity(user_item_matrix.T)  # items x items

def recommend(user_id, user_item_matrix, item_sim, top_k=10):
    user_ratings = user_item_matrix[user_id]
    scores = item_sim.dot(user_ratings)  # weighted sum of similarities
    scores[user_ratings > 0] = 0         # zero out already-rated items
    return np.argsort(scores)[::-1][:top_k]
```
Matrix Factorization with ALS
```python
from implicit.als import AlternatingLeastSquares
from scipy.sparse import csr_matrix

# Implicit feedback: user_item_matrix (sparse, values = interaction counts)
sparse_matrix = csr_matrix(user_item_matrix)
model = AlternatingLeastSquares(
    factors=64,
    regularization=0.05,
    iterations=30,
)
model.fit(sparse_matrix)

# Recommend for user 0
recommended_ids, scores = model.recommend(0, sparse_matrix[0], N=10)

# Similar items to item 42
similar_ids, scores = model.similar_items(42, N=5)
```
Two-Tower with PyTorch
```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, n_users, n_items, embed_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(n_users, embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.item_tower = nn.Sequential(
            nn.Embedding(n_items, embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, user_ids, item_ids):
        user_emb = self.user_tower(user_ids)  # [B, embed_dim]
        item_emb = self.item_tower(item_ids)  # [B, embed_dim]
        return (user_emb * item_emb).sum(dim=1)  # dot product -> score
```
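As a follow-up, a two-tower model can be trained with sampled negatives and a binary cross-entropy loss; the sketch below uses a stripped-down embedding-only tower and random toy interactions (names, sizes, and the negative-sampling scheme are illustrative):

```python
import torch
import torch.nn as nn

# Stripped-down two-tower (embedding-only) to show the training loop
class TinyTwoTower(nn.Module):
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, users, items):
        # Dot product of user and item embeddings -> logit
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=1)

torch.manual_seed(0)
model = TinyTwoTower(n_users=100, n_items=500)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

users = torch.randint(0, 100, (32,))
pos_items = torch.randint(0, 500, (32,))  # observed interactions -> label 1
neg_items = torch.randint(0, 500, (32,))  # random sampled negatives -> label 0

losses = []
for _ in range(50):
    scores = torch.cat([model(users, pos_items), model(users, neg_items)])
    labels = torch.cat([torch.ones(32), torch.zeros(32)])
    loss = loss_fn(scores, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # loss should drop as the toy batch is memorized
```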
Interview Signals
What interviewers listen for:
- You can describe the multi-stage pipeline (candidate gen → ranking → re-ranking)
- You know the tradeoffs between CF and content-based, and each one's cold start problem
- You can explain the difference between implicit and explicit feedback and how to handle each
- You know that offline NDCG ≠ online CTR, plus production issues such as feedback loops and position bias
- You can design an evaluation strategy grounded in the business context
Practice
Flashcards
Flashcards (1/10)
Which is more common in production, User-Based or Item-Based CF? Why?
Item-Based CF. Reasons: (1) item similarity is more stable than user similarity (item characteristics don't change; user preferences do), (2) there are usually fewer items than users, so the similarity matrix is smaller and more cacheable, (3) it can be pre-computed. Amazon's 'item-to-item CF' is the classic example.
Quiz
What is the biggest advantage of content-based filtering?