Clustering
Interview Context
Clustering is a core topic in unsupervised learning. In interviews it usually appears in two forms: (1) "How would you segment customers?" and (2) "What are the limitations of K-Means, and when would you use other methods?" You should be able to compare the assumptions, trade-offs, and use cases of the different methods.
What You Should Understand
- Know the K-Means algorithm steps, k-means++ initialization, and methods for choosing K
- Be able to compare the assumptions and use cases of K-Means, DBSCAN, Hierarchical, and GMM
- Understand how to evaluate clustering when no labels are available
- Know the production pitfalls of clustering (feature scaling, high dimensionality, interpretation)
K-Means
Algorithm
K-Means minimizes the within-cluster sum of squares (WCSS):

$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$

where $\mu_k$ is the centroid of cluster $C_k$.
Steps (Lloyd's algorithm):
- Initialize K centroids (randomly or with k-means++)
- Assign each point to the nearest centroid
- Update each centroid to the mean of its assigned points
- Repeat steps 2-3 until convergence (centroids stop moving)
Convergence is guaranteed (WCSS never increases between iterations), but the algorithm may converge to a local minimum, which is why initialization matters.
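The steps above can be sketched in a few lines of NumPy. This is a toy implementation with plain random initialization (the function name and test data are illustrative; in practice you would use sklearn's `KMeans`):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign, update, repeat until centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Random initialization (k-means++ would be better; see below)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid = mean of assigned points (keep old one if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this recovers the blob structure; the point is to see the assign/update loop, not to compete with the library version.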
Geometric Interpretation (Linear Algebra Perspective)
K-Means' assign step is essentially a Voronoi tessellation: it partitions feature space into K Voronoi cells, where every point in a cell is closest to that cell's centroid.
Update step: centroid = mean of assigned points, which minimizes the squared distances within that cluster. This is equivalent to compressing the data's representation from a continuous space down to K discrete centroids, a form of vector quantization.
With squared Euclidean distance, the optimal center of each cluster is the mean (because the mean minimizes $\sum_i \lVert x_i - c \rVert^2$). With L1 distance, the optimal center is the median. With other distance functions, you get K-Medoids.
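The mean-vs-median claim is easy to check empirically by scanning candidate centers over a grid (toy numbers, purely illustrative):

```python
import numpy as np

# The mean minimizes the sum of squared distances;
# the median minimizes the sum of absolute (L1) distances.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
grid = np.linspace(0, 12, 1201)                      # candidate centers
sse = np.array([((x - c) ** 2).sum() for c in grid])
l1 = np.array([np.abs(x - c).sum() for c in grid])
print(grid[sse.argmin()], x.mean())       # both ~4.0, the mean
print(grid[l1.argmin()], np.median(x))    # both ~3.0, the median
```

Note how the outlier at 10 pulls the squared-error minimizer toward it but barely moves the L1 minimizer, which is exactly why K-Medoids is more robust to outliers than K-Means.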
K-Means++ Initialization
The problem with random initialization: if the initial centroids all land in the same cluster, the result is terrible.
K-Means++ solves this:
- Pick the first centroid uniformly at random
- For each remaining centroid, pick a point with probability proportional to $D(x)^2$, its squared distance to the nearest existing centroid
- Repeat until all K centroids are chosen
Intuition: spread the initial centroids across different regions. sklearn's KMeans uses k-means++ by default.
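The D² sampling step can be sketched directly (a minimal sketch; `kmeanspp_init` is an illustrative name, and sklearn does this for you internally):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """D^2 sampling: each new centroid is drawn with probability
    proportional to the squared distance to the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first centroid: uniform
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

With two far-apart groups of points, the second centroid is essentially guaranteed to land in the group the first one missed, which is the whole point of the scheme.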
Choosing K
Elbow method: plot WCSS vs K and look for the "elbow" where the rate of decrease slows down.
Problem: the elbow is often ambiguous, so the choice is subjective.
Silhouette score: For each point, compute:

$s = \frac{b - a}{\max(a, b)}$

where $a$ = average distance to same-cluster points, $b$ = average distance to points in the nearest other cluster.
- $s \approx 1$: well-clustered
- $s \approx 0$: on the boundary between clusters
- $s < 0$: probably in the wrong cluster
Average the silhouette score across all points to get overall clustering quality; pick the K that maximizes the average silhouette.
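As a quick sanity check of the definition (toy data assumed), two tight, well-separated clusters should give an average silhouette close to 1, and `silhouette_samples` exposes the per-point values for spotting boundary points:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight, well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

print(silhouette_score(X, labels))     # close to 1: a is tiny, b is large
s = silhouette_samples(X, labels)      # per-point values; s ~ 0 flags boundary points
```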
K-Means Assumptions and Limitations
| Assumption | What It Means | When It Fails |
|---|---|---|
| Spherical clusters | Clusters are roughly round | Elongated, irregular shapes |
| Equal-size clusters | All clusters have a similar number of points | One cluster much larger than the others |
| Equal-variance | All clusters have similar spread | Dense + sparse clusters together |
| Fixed K | Must specify number of clusters | True K is unknown |
| Euclidean distance | Uses L2 distance | Categorical features, high-dim data |
K-Means Is Sensitive to Scale
K-Means uses Euclidean distance, so a feature ranging 0-1000 dominates the clustering while a feature ranging 0-1 is ignored. Always standardize before K-Means. This is the same issue as with SVM, for the same reason: both are distance-based methods.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
X_scaled = StandardScaler().fit_transform(X)
# Try different K values
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    print(f"K={k}: Silhouette={sil:.3f}, Inertia={km.inertia_:.0f}")
DBSCAN
Algorithm
Density-Based Spatial Clustering of Applications with Noise. It does not require specifying K, finds clusters of arbitrary shape automatically, and labels noise points explicitly.
Two parameters:
- eps ($\varepsilon$): radius of the neighborhood
- min_samples: minimum number of points within eps to form a "core point"
Point types:
- Core point: has ≥ min_samples points within eps
- Border point: within eps of a core point, but doesn't have enough neighbors itself
- Noise point: neither core nor border → labeled as -1
Algorithm:
- Find all core points
- Connect core points that are within eps of each other → form clusters
- Assign border points to the nearest core point's cluster
- Everything else is noise
DBSCAN vs K-Means
| Aspect | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical only | Arbitrary shapes |
| Requires K | Yes | No (determined automatically) |
| Noise handling | Every point assigned to a cluster | Explicit noise label (-1) |
| Cluster sizes | Tends toward equal sizes | Handles very different sizes |
| Sensitivity | To initialization | To eps and min_samples |
| Scalability | O(nKd) per iteration, very fast | O(n log n) with a spatial index, O(n²) worst case |
Choosing eps and min_samples
min_samples: a common heuristic is 2 × the number of feature dimensions. Larger values are more conservative (fewer clusters, more noise).
eps: use a k-distance plot: compute each point's distance to its k-th nearest neighbor (k = min_samples) and sort the distances. Look for the "elbow"; that distance is a good eps.
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Find good eps using k-distance plot
k = 5 # min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])[::-1]
# Plot k_distances — look for the "elbow"
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
Hierarchical Clustering
Agglomerative (Bottom-Up)
Start with each point as its own cluster, then merge the two closest clusters repeatedly:
- Start: n clusters (one per point)
- Merge the two closest clusters
- Repeat until one cluster remains
- Cut the dendrogram at desired level to get clusters
Linkage Criteria
How to define "distance between two clusters"?
| Linkage | Distance Definition | Behavior |
|---|---|---|
| Single | Min distance between any two points | Finds elongated chains, sensitive to noise |
| Complete | Max distance between any two points | Compact clusters, biased toward spherical |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward's | Increase in WCSS from merging | Most similar to K-Means, produces compact clusters |
Ward's linkage is the most common choice: it behaves much like K-Means but you don't have to fix K up front (you can inspect the dendrogram first).
Dendrogram
The dendrogram visualizes the entire merging process; the y-axis is the distance at which each merge happens. Look for the longest vertical gap (the largest distance jump), cut there, and you get a natural number of clusters.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Compute linkage matrix for dendrogram
Z = linkage(X_scaled, method="ward")
# Plot dendrogram to decide K
# dendrogram(Z)
# Fit with chosen K
hc = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = hc.fit_predict(X_scaled)
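The "longest vertical gap" heuristic can also be computed directly from the linkage matrix instead of eyeballing the plot: column 2 of `Z` holds the merge distances, and the number of clusters just before the largest jump is a natural choice (a sketch on assumed toy data with three well-separated blobs):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0, 5, 10)])
Z = linkage(X, method="ward")

merge_d = Z[:, 2]                        # merge distances, one per merge, increasing
i = int(np.argmax(np.diff(merge_d)))     # largest jump: between merge i and merge i+1
k = len(X) - (i + 1)                     # cutting in that gap leaves this many clusters
labels = fcluster(Z, t=k, criterion="maxclust")
print(k)  # 3 for this toy data
```

This mirrors what you do visually on the dendrogram, and generalizes to data where plotting is impractical.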
Hierarchical vs K-Means
Advantages of hierarchical clustering: no need to pre-specify K (decide from the dendrogram), and you can see the cluster structure at every granularity. Drawbacks: O(n²) memory and at least O(n²) time (unsuitable for large datasets), and no incremental updates (new data requires a full re-run).
Gaussian Mixture Models (GMM)
Soft Clustering
K-Means assigns each point to exactly one cluster (hard assignment). GMM assigns probabilities — each point belongs to each cluster with some probability.
GMM assumes data is generated from a mixture of Gaussian distributions:

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

where $\pi_k$ is the mixing weight ($\sum_k \pi_k = 1$), and $\mu_k$ and $\Sigma_k$ are the mean and covariance of each component.
EM Algorithm
Parameters are estimated via Expectation-Maximization:
E-step: Compute the probability (responsibility) that each point belongs to each cluster:

$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$

M-step: Update parameters using the responsibilities as weights:

$\pi_k = \frac{1}{n} \sum_i \gamma_{ik}, \quad \mu_k = \frac{\sum_i \gamma_{ik} \, x_i}{\sum_i \gamma_{ik}}, \quad \Sigma_k = \frac{\sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_i \gamma_{ik}}$

Iterate E and M steps until convergence. Relation to K-Means: K-Means is the special case of GMM in which every $\Sigma_k = \sigma^2 I$ (spherical, equal variance), $\sigma \to 0$, and assignments are hard.
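One way to see the K-Means/GMM connection empirically (a toy check, assuming well-separated data): a GMM constrained to spherical covariances recovers the same partition as K-Means, which ARI confirms up to label permutation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="spherical",
                      random_state=0).fit(X)

# Identical partitions (up to label permutation) give ARI = 1
print(adjusted_rand_score(km.labels_, gmm.predict(X)))
```

On overlapping or elongated clusters the two would diverge, which is exactly where the full-covariance GMM earns its extra cost.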
GMM vs K-Means
| Aspect | K-Means | GMM |
|---|---|---|
| Assignment | Hard (one cluster) | Soft (probability per cluster) |
| Cluster shape | Spherical | Ellipsoidal (covariance matrix) |
| Output | Cluster labels | $\gamma_{ik}$, a probability for each cluster per point |
| Model selection | Elbow, silhouette | BIC/AIC, which penalize complexity |
| Handles overlap | No | Yes: overlapping clusters share probability mass |
| Speed | Fast | Slower (estimates full covariance matrices) |
Choosing K with BIC
$\mathrm{BIC} = p \ln n - 2 \ln L$

where $L$ is the maximized likelihood, $p$ is the number of parameters, and $n$ is the number of data points. Lower BIC = better model. BIC automatically penalizes using too many clusters.
from sklearn.mixture import GaussianMixture
import numpy as np
bic_scores = []
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, random_state=42)
    gmm.fit(X_scaled)
    bic_scores.append((k, gmm.bic(X_scaled)))
    print(f"K={k}: BIC={gmm.bic(X_scaled):.0f}")

best_k = min(bic_scores, key=lambda x: x[1])[0]
best_gmm = GaussianMixture(n_components=best_k, random_state=42).fit(X_scaled)
probs = best_gmm.predict_proba(X_scaled)  # soft assignments from the best model
Evaluation Metrics
Without Ground Truth Labels
| Metric | Formula / Meaning | Range | Higher is Better? |
|---|---|---|---|
| Silhouette | $(b - a) / \max(a, b)$, averaged over points | [-1, 1] | Yes |
| Davies-Bouldin | Avg ratio of within-cluster to between-cluster distance | [0, ∞) | No (lower is better) |
| Calinski-Harabasz | Ratio of between-cluster to within-cluster variance | [0, ∞) | Yes |
| WCSS / Inertia | Sum of squared distances to centroids | [0, ∞) | No (lower is better) |
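All of these internal metrics are one call each in sklearn. A quick comparison on assumed synthetic blobs (remember Davies-Bouldin is lower-is-better, unlike the other two):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic, well-separated blobs for illustration
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")    # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")  # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.0f}")  # higher is better
```

Reporting two or three of these together is more convincing than any single number, since each has its own bias (e.g., all of them still favor convex clusters).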
With Ground Truth Labels
If true labels exist (e.g., to verify the clustering captures known structure):
| Metric | What It Measures | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | Agreement between predicted and true labels, adjusted for chance | [-1, 1] (1 = perfect, ≈0 = chance) |
| Normalized Mutual Information (NMI) | Mutual information normalized by entropy | [0, 1] (1 = perfect) |
| Homogeneity | Each cluster contains only members of a single class | [0, 1] |
| Completeness | All members of a class are assigned to the same cluster | [0, 1] |
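The "adjusted for chance" part matters: ARI and NMI ignore label names entirely, and ARI can go negative for a partition worse than random. A tiny hand-built example (illustrative label vectors only):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_perm = [1, 1, 1, 0, 0, 0]   # same grouping, label names swapped
pred_bad = [0, 1, 0, 1, 0, 1]    # alternating: ignores the true structure

print(adjusted_rand_score(true_labels, pred_perm))          # 1.0: names don't matter
print(normalized_mutual_info_score(true_labels, pred_perm)) # 1.0
print(adjusted_rand_score(true_labels, pred_bad))           # negative here
```

This is why raw accuracy is meaningless for clustering: cluster IDs are arbitrary, so you must use permutation-invariant metrics like these.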
Clustering Evaluation 陷阱
Do not cluster a dataset and then run ANOVA on the same data to "verify" that the clusters differ: that is circular reasoning, because K-Means' objective already maximizes between-cluster differences. Correct approaches: (1) validate on a holdout set, or (2) check whether a downstream task improves because of the clustering.
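The holdout idea looks like this in sketch form (assumed synthetic data; the key move is `predict` on held-out points rather than refitting):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Hypothetical data; the point is the train/holdout split, not the dataset
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.7, random_state=0)
X_train, X_hold = train_test_split(X, test_size=0.3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
hold_labels = km.predict(X_hold)   # assign holdout points, don't refit

# Structure that survives on unseen data is evidence the clusters are real
print(silhouette_score(X_hold, hold_labels))
```

If the holdout silhouette collapses relative to the training silhouette, the "clusters" were likely an artifact of the objective function rather than real structure.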
Method Selection Guide
| Scenario | Best Method | Why |
|---|---|---|
| Spherical clusters, known K | K-Means | Fast, simple, works well for round clusters |
| Unknown K, arbitrary shapes | DBSCAN | No K needed, finds irregular shapes, handles noise |
| Want cluster hierarchy / different granularities | Hierarchical | Dendrogram shows structure at all levels |
| Overlapping clusters, need probabilities | GMM | Soft assignment, ellipsoidal shapes, BIC for model selection |
| Very large dataset | K-Means or Mini-Batch K-Means | O(nKd) per iteration, parallelizable |
| High-dimensional data | Spectral clustering, or reduce dimensionality first | Euclidean distance degrades in high dimensions |
| Clusters of very different densities | HDBSCAN | Extends DBSCAN to varying densities |
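For the large-dataset row, `MiniBatchKMeans` is a drop-in replacement that updates centroids from small random batches instead of full passes (a sketch on assumed synthetic data):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a large dataset
X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)

# Each update uses a random batch of 1024 points, so a pass is much cheaper
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10,
                      random_state=0).fit(X)
print(mbk.inertia_)
```

The trade-off is a slightly worse (noisier) solution than full K-Means in exchange for a large speedup, which is usually the right trade at millions of rows.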
Real-World Use Cases
Case 1: Customer Segmentation
Scenario: an e-commerce platform wants to segment customers for targeted marketing campaigns.
Features: RFM (Recency, Frequency, Monetary): time since last purchase, purchase frequency, and amount spent.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# RFM features
rfm = df[["recency", "frequency", "monetary"]]
rfm_scaled = StandardScaler().fit_transform(rfm)
km = KMeans(n_clusters=4, init="k-means++", random_state=42)
df["segment"] = km.fit_predict(rfm_scaled)
# Interpret clusters
for seg in range(4):
    subset = df[df["segment"] == seg]
    print(f"Segment {seg}: n={len(subset)}, "
          f"avg_recency={subset['recency'].mean():.0f}, "
          f"avg_frequency={subset['frequency'].mean():.1f}, "
          f"avg_monetary={subset['monetary'].mean():.0f}")
Interview follow-ups:
- "How do you decide the number of segments?" Silhouette score plus business interpretability (can 4 segments map to distinct marketing strategies?).
- "The three RFM features have different scales. What do you do?" StandardScaler. Without scaling, monetary (possibly 0-10000) dominates the clustering.
- "How do you validate the clustering?" Not with ANOVA on the training data. Check on a holdout set whether response rates really differ across segments, or A/B test segment-specific marketing strategies.
Case 2: Credit Card Fraud Detection (Anomaly Detection via Clustering)
Scenario: no labeled fraud data; use an unsupervised method to find anomalous transactions.
Approach: use DBSCAN (or Isolation Forest) and treat noise points that belong to no cluster as potential fraud.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)
# Noise points (-1) are potential anomalies
anomalies = X[labels == -1]
print(f"Detected {len(anomalies)} potential anomalies out of {len(X)}")
Interview follow-ups:
- "DBSCAN labels minority groups (e.g., high-value transactions) as noise, but they aren't necessarily fraud?" Correct. Anomalies need domain-expert review. DBSCAN is a screening tool, not a final classifier.
- "When do you use clustering-based anomaly detection vs. a supervised model?" Unsupervised when labels are absent or extremely scarce. With enough labels, a supervised model (e.g., GBM) is usually better.
Case 3: House Price Prediction via Geographic Clustering
Scenario: house prices depend heavily on location, but using raw latitude/longitude as features works poorly (the relationship is not linear).
Approach: first cluster houses by geographic location with K-Means, then add the cluster label as a categorical feature to the regression model.
from sklearn.cluster import KMeans
# Cluster by location
geo_features = df[["latitude", "longitude"]]
km = KMeans(n_clusters=20, random_state=42)
df["geo_cluster"] = km.fit_predict(geo_features)
# Use cluster as categorical feature in regression
# One-hot encode geo_cluster → captures neighborhood effects
Interview follow-up: "Why not use lat/lon directly?" The relationship between price and location is non-linear (it's not "the further north, the pricier"). K-Means groups nearby houses together, so each cluster represents a "neighborhood"; after one-hot encoding, the model can learn neighborhood effects.
Hands-on: Clustering in Python
Comparing Methods
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
methods = {
"K-Means (K=4)": KMeans(n_clusters=4, random_state=42),
"DBSCAN": DBSCAN(eps=0.5, min_samples=5),
"Hierarchical (Ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}
for name, model in methods.items():
labels = model.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
if n_clusters > 1:
sil = silhouette_score(X_scaled, labels)
print(f"{name}: {n_clusters} clusters, Silhouette={sil:.3f}")
# GMM (soft clustering)
gmm = GaussianMixture(n_components=4, random_state=42).fit(X_scaled)
gmm_labels = gmm.predict(X_scaled)
gmm_probs = gmm.predict_proba(X_scaled) # probability per cluster
print(f"GMM: BIC={gmm.bic(X_scaled):.0f}, Silhouette={silhouette_score(X_scaled, gmm_labels):.3f}")
Interview Signals
What interviewers listen for:
- You can compare the use cases of K-Means, DBSCAN, Hierarchical, and GMM
- You know K-Means' assumptions (spherical, equal size/variance) and when it fails
- You can explain how to choose K (elbow, silhouette, BIC)
- You know the circular-reasoning trap in clustering evaluation
- You can apply clustering to downstream tasks (segmentation, anomaly detection, feature engineering)
Practice
Flashcards
What are K-Means' three main assumptions, and when are they violated?
(1) Spherical clusters: fails on non-round/irregular shapes. (2) Equal-size clusters: with one large and one small cluster, the small one gets absorbed. (3) Equal variance: clusters with very different densities separate poorly. Use DBSCAN (arbitrary shapes) or GMM (ellipsoidal) to handle these cases.
Quiz
Is every K-Means iteration guaranteed to decrease (or leave unchanged) the WCSS?