Clustering

Interview Context

Clustering is a core topic in unsupervised learning. In interviews it usually appears in two forms: (1) "How would you segment customers?" and (2) "What are the limitations of K-Means? When would you use another method?" You should be able to compare the assumptions, trade-offs, and applicable scenarios of the different methods.

What You Should Understand

  • Know the K-Means algorithm steps, k-means++ initialization, and methods for choosing K
  • Be able to compare the assumptions and applicable scenarios of K-Means, DBSCAN, Hierarchical, and GMM
  • Understand how clustering evaluation works when there are no labels
  • Know the production pitfalls of clustering (feature scaling, high dimensionality, interpretation)

K-Means

Algorithm

K-Means minimizes the within-cluster sum of squares (WCSS):

\min \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2

where μ_k is the centroid of cluster k.

Steps (Lloyd's algorithm):

  1. Initialize K centroids (randomly or with k-means++)
  2. Assign each point to the nearest centroid
  3. Update each centroid to the mean of its assigned points
  4. Repeat steps 2-3 until convergence (centroids stop moving)

Convergence is guaranteed (WCSS decreases every iteration), but it may converge to a local minimum, which is why initialization matters.
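The four steps above can be sketched directly in NumPy (a minimal illustration with random initialization; in practice use sklearn's KMeans):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centroid, then re-center."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random init
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only lower (or keep) the WCSS, so the loop terminates; the empty-cluster guard keeps the sketch from crashing on a bad initialization.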

Geometric Interpretation (Linear Algebra Perspective)

The assignment step of K-Means is essentially a Voronoi tessellation: it partitions the feature space into K Voronoi cells, and the points in each cell are closer to that cell's centroid than to any other.

Update step: each centroid becomes the mean of its assigned points, which minimizes the squared distances within that cluster. This is equivalent to compressing the data representation from a continuous space down to K discrete centroids, a form of vector quantization.

With squared Euclidean distance, the optimal center of each cluster is the mean (the mean minimizes \sum_i \|x_i - c\|^2). With L1 distance, the optimal center is the median. With other distances, use K-Medoids.
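A quick numeric check of this fact, using a grid search over candidate centers on toy values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 20.0])  # skewed sample

# Grid of candidate centers
candidates = np.linspace(0, 20, 2001)

# Squared-error loss is minimized by the mean
sq_loss = np.array([np.sum((x - c) ** 2) for c in candidates])
best_sq = candidates[sq_loss.argmin()]

# Absolute-error (L1) loss is minimized by the median
l1_loss = np.array([np.sum(np.abs(x - c)) for c in candidates])
best_l1 = candidates[l1_loss.argmin()]

print(best_sq, x.mean())      # both close to 7.2
print(best_l1, np.median(x))  # both close to 3.0
```

The outlier at 20 drags the squared-loss minimizer (the mean) far to the right, while the L1 minimizer (the median) stays put, which is why K-Medoids-style centers are more robust.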

K-Means++ Initialization

The problem with random initialization: if the initial centroids all land inside the same cluster, the result can be terrible.

K-Means++ solves this:

  1. Pick the first centroid randomly
  2. For each remaining centroid, sample a data point with probability proportional to D(x)², where D(x) is the distance to the nearest already-chosen centroid
  3. Repeat until K centroids are chosen

Intuition: spread the initial centroids across different regions. sklearn's KMeans uses k-means++ by default.
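The seeding procedure is short enough to sketch in NumPy (illustrative only; sklearn's implementation adds refinements such as sampling several candidates per step):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each new centroid is sampled with prob ∝ D(x)^2."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # step 1: first centroid uniformly at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()              # step 2: sample proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Points near an existing centroid have small D(x)², so they are rarely chosen; far-away regions almost always receive the next seed.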

Choosing K

Elbow method: Plot WCSS vs K and look for the "elbow" where the rate of decrease slows down.

Problem: the elbow is often ambiguous, making the choice subjective.

Silhouette score: For each point, compute:

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}

where a_i = average distance to points in the same cluster, and b_i = average distance to points in the nearest other cluster.

  • s_i ≈ 1: well-clustered
  • s_i ≈ 0: on the boundary
  • s_i < 0: probably in the wrong cluster

Averaging the silhouette score across all points gives an overall clustering quality measure; choose the K that maximizes the average silhouette.
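In code, silhouette_samples gives the per-point s_i and silhouette_score their average, shown here on hypothetical two-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
per_point = silhouette_samples(X, labels)  # s_i for every point
overall = silhouette_score(X, labels)      # mean of the per-point values
print(f"mean s_i = {overall:.3f}, min s_i = {per_point.min():.3f}")
```

The per-point values are useful beyond model selection: points with low or negative s_i flag boundary cases worth inspecting.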

K-Means Assumptions and Limitations

| Assumption | What It Means | When It Fails |
|---|---|---|
| Spherical clusters | Clusters are roughly round | Elongated, irregular shapes |
| Equal-size clusters | All clusters have similar n | One cluster much larger than others |
| Equal variance | All clusters have similar spread | Dense and sparse clusters together |
| Fixed K | Must specify number of clusters | True K is unknown |
| Euclidean distance | Uses L2 distance | Categorical features, high-dimensional data |

K-Means Is Sensitive to Scale

K-Means uses Euclidean distance, so a feature ranging 0-1000 dominates the clustering while one ranging 0-1 is ignored. Always standardize before K-Means. This is the same issue as with SVM, for the same reason (both are distance-based methods).

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(X)

# Try different K values
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    print(f"K={k}: Silhouette={sil:.3f}, Inertia={km.inertia_:.0f}")

DBSCAN

Algorithm

Density-Based Spatial Clustering of Applications with Noise. It does not require specifying K, automatically finds clusters of arbitrary shape, and labels noise points.

Two parameters:

  • eps (ε): Radius of the neighborhood
  • min_samples: Minimum number of points within ε to form a "core point"

Point types:

  • Core point: Has ≥ min_samples points within ε
  • Border point: Within ε of a core point, but doesn't have enough neighbors itself
  • Noise point: Neither core nor border → labeled as -1

Algorithm:

  1. Find all core points
  2. Connect core points that are within ε of each other to form clusters
  3. Assign border points to the nearest core point's cluster
  4. Everything else is noise

DBSCAN vs K-Means

| Aspect | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical only | Arbitrary shapes |
| Requires K | Yes | No (determined automatically) |
| Noise handling | Every point assigned to a cluster | Explicit noise label (-1) |
| Cluster sizes | Tends toward equal sizes | Handles very different sizes |
| Sensitivity | To initialization | To ε and min_samples |
| Scalability | O(n·K·I), very fast | O(n log n) with a spatial index, O(n²) worst case |

Choosing eps and min_samples

min_samples: usually set to ≥ dim + 1. Larger values are more conservative (fewer clusters, more noise).

eps: use a k-distance plot. Compute each point's distance to its k-th nearest neighbor (k = min_samples), sort descending, and look for the "elbow"; that distance is a good ε.

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Find good eps using k-distance plot
k = 5  # min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])[::-1]
# Plot k_distances — look for the "elbow"

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

Hierarchical Clustering

Agglomerative (Bottom-Up)

Start with each point as its own cluster, then merge the two closest clusters repeatedly:

  1. Start: nn clusters (one per point)
  2. Merge the two closest clusters
  3. Repeat until one cluster remains
  4. Cut the dendrogram at desired level to get KK clusters

Linkage Criteria

How to define "distance between two clusters"?

| Linkage | Distance Definition | Behavior |
|---|---|---|
| Single | Min distance between any two points | Finds elongated chains, sensitive to noise |
| Complete | Max distance between any two points | Compact clusters, biased toward spherical |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward's | Increase in WCSS from merging | Most similar to K-Means, produces compact clusters |

Ward's linkage is the most commonly used: it behaves much like K-Means, but K does not have to be fixed in advance (inspect the dendrogram first, then decide).

Dendrogram

The dendrogram visualizes the entire merging process; the y-axis is the distance at which each merge happens. Find the longest vertical gap (the largest jump in merge distance) and cut there to get a natural number of clusters.

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute linkage matrix for dendrogram
Z = linkage(X_scaled, method="ward")
# Plot dendrogram to decide K
# dendrogram(Z)

# Fit with chosen K
hc = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = hc.fit_predict(X_scaled)
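The "longest vertical gap" heuristic from the dendrogram section can be automated from the merge distances in the linkage matrix. A sketch on toy three-blob data (the gap-based cut is a heuristic of this write-up, not an sklearn feature):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")
heights = Z[:, 2]            # merge distances, in ascending order
gaps = np.diff(heights)
g = gaps.argmax()
# Cut between the two merges that straddle the largest gap
cut = (heights[g] + heights[g + 1]) / 2
labels = fcluster(Z, t=cut, criterion="distance")
print(f"K from largest gap: {len(np.unique(labels))}")
```

fcluster with criterion="distance" keeps every merge below the threshold, so cutting inside the biggest jump recovers the cluster count a human would read off the dendrogram.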

Hierarchical vs K-Means

Advantages of hierarchical clustering: no need to preset K (decide from the dendrogram), and you can inspect the cluster structure at every granularity. Disadvantages: O(n²) memory and O(n³) time (unsuitable for large datasets), and no incremental updates (new data requires a full rerun).

Gaussian Mixture Models (GMM)

Soft Clustering

K-Means assigns each point to exactly one cluster (hard assignment). GMM assigns probabilities — each point belongs to each cluster with some probability.

GMM assumes data is generated from a mixture of KK Gaussian distributions:

P(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where π_k is the mixing weight (∑_k π_k = 1), and μ_k and Σ_k are the mean and covariance of component k.

EM Algorithm

Parameters are estimated via Expectation-Maximization:

E-step: Compute the probability (responsibility) that each point belongs to each cluster:

r_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

M-step: Update parameters using the responsibilities as weights:

\pi_k = \frac{\sum_i r_{ik}}{n}, \quad \boldsymbol{\mu}_k = \frac{\sum_i r_{ik} \mathbf{x}_i}{\sum_i r_{ik}}, \quad \boldsymbol{\Sigma}_k = \frac{\sum_i r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_i r_{ik}}

Iterate the E and M steps until convergence. Relationship to K-Means: K-Means is a special case of GMM, recovered when all Σ_k = σ²I (spherical, equal variance) and assignments are made hard.
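The two steps can be written as a bare-bones EM loop in NumPy/SciPy (no log-space stabilization and only minimal covariance regularization, so treat it as an illustration; use sklearn's GaussianMixture in practice):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, seed=0):
    """Minimal EM for a full-covariance GMM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1 / k)                      # mixing weights
    mu = X[rng.choice(n, k, replace=False)]     # means: random data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] ∝ pi_j * N(x_i | mu_j, sigma_j)
        dens = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(k)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of pi, mu, sigma
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return pi, mu, sigma, r
```

Note how each update mirrors the equations above, with r_{ik} acting as soft counts; replacing r with a hard 0/1 assignment and fixing sigma to σ²I collapses the loop back into Lloyd's algorithm.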

GMM vs K-Means

| Aspect | K-Means | GMM |
|---|---|---|
| Assignment | Hard (one cluster) | Soft (probability per cluster) |
| Cluster shape | Spherical | Ellipsoidal (covariance matrix) |
| Output | Cluster labels | P(cluster ∣ x) for each cluster |
| Model selection | Elbow, silhouette | BIC/AIC, which penalize complexity |
| Handles overlap | No | Yes: overlapping clusters get shared probabilities |
| Speed | Fast | Slower (computes full covariances) |

Choosing K with BIC

\text{BIC} = -2 \ln L + p \ln n

where L is the likelihood, p is the number of parameters, and n is the number of data points. Lower BIC = better model; BIC automatically penalizes using too many clusters.

from sklearn.mixture import GaussianMixture
import numpy as np

bic_scores = []
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, random_state=42)
    gmm.fit(X_scaled)
    bic_scores.append((k, gmm.bic(X_scaled)))
    probs = gmm.predict_proba(X_scaled)  # soft assignment
    print(f"K={k}: BIC={gmm.bic(X_scaled):.0f}")

best_k = min(bic_scores, key=lambda x: x[1])[0]

Evaluation Metrics

Without Ground Truth Labels

| Metric | Formula / Meaning | Range | Higher Is Better? |
|---|---|---|---|
| Silhouette | (b − a) / max(a, b) | [−1, 1] | Yes |
| Davies-Bouldin | Avg ratio of within-cluster to between-cluster distance | [0, ∞) | No (lower is better) |
| Calinski-Harabasz | Ratio of between-cluster to within-cluster variance | [0, ∞) | Yes |
| WCSS / Inertia | Sum of squared distances to centroids | [0, ∞) | No (lower is better) |

With Ground Truth Labels

If true labels exist (e.g., to verify that the clustering captures known structure):

| Metric | What It Measures | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | Agreement between predicted and true labels, adjusted for chance | [−1, 1] (1 = perfect) |
| Normalized Mutual Information (NMI) | Mutual information normalized by entropy | [0, 1] (1 = perfect) |
| Homogeneity | Each cluster contains only members of a single class | [0, 1] |
| Completeness | All members of a class are assigned to the same cluster | [0, 1] |
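A small example of these metrics on hand-made labels. Note they are invariant to permuting cluster IDs; here the predictions use different IDs than the truth and additionally split one point into its own cluster:

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             homogeneity_score, completeness_score)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 1, 0, 0, 2]  # same grouping, renamed IDs, one point split off

print(f"ARI:  {adjusted_rand_score(true_labels, pred_labels):.3f}")
print(f"NMI:  {normalized_mutual_info_score(true_labels, pred_labels):.3f}")
print(f"Hom:  {homogeneity_score(true_labels, pred_labels):.3f}")
print(f"Comp: {completeness_score(true_labels, pred_labels):.3f}")
```

Homogeneity stays perfect (every predicted cluster is pure) while completeness drops (class 1 is spread over two clusters), which is exactly the asymmetry the table describes.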

Clustering Evaluation Pitfall

Do not cluster a dataset and then run ANOVA on the same data to "verify" that the clusters differ: that is circular reasoning (the K-Means objective itself maximizes between-cluster differences). Correct approaches: (1) validate on a holdout set, or (2) test whether a downstream task improves because of the clustering.

Method Selection Guide

| Scenario | Best Method | Why |
|---|---|---|
| Spherical clusters, known K | K-Means | Fast, simple, works well for round clusters |
| Unknown K, arbitrary shapes | DBSCAN | No K needed, finds irregular shapes, handles noise |
| Want cluster hierarchy / different granularities | Hierarchical | Dendrogram shows structure at all levels |
| Overlapping clusters, need probabilities | GMM | Soft assignment, ellipsoidal shapes, BIC for model selection |
| Very large dataset (n > 100K) | K-Means or Mini-Batch K-Means | O(n) per iteration, parallelizable |
| High-dimensional data | Spectral clustering, or reduce dimensionality first | Euclidean distance degrades in high dimensions |
| Clusters of very different densities | HDBSCAN | Extends DBSCAN to varying densities |

Real-World Use Cases

Case 1: Customer Segmentation

Scenario: an e-commerce platform wants to segment its customers so it can run targeted marketing campaigns.

Features: RFM (Recency, Frequency, Monetary): time since last purchase, purchase frequency, and total spend.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# RFM features
rfm = df[["recency", "frequency", "monetary"]]
rfm_scaled = StandardScaler().fit_transform(rfm)

km = KMeans(n_clusters=4, init="k-means++", random_state=42)
df["segment"] = km.fit_predict(rfm_scaled)

# Interpret clusters
for seg in range(4):
    subset = df[df["segment"] == seg]
    print(f"Segment {seg}: n={len(subset)}, "
          f"avg_recency={subset['recency'].mean():.0f}, "
          f"avg_frequency={subset['frequency'].mean():.1f}, "
          f"avg_monetary={subset['monetary'].mean():.0f}")

Interview follow-ups:

  • "How do you decide the number of segments?" Silhouette score plus business interpretability (can 4 segments each map to a distinct marketing strategy?).
  • "The three RFM features have different scales; what do you do?" StandardScaler. Without scaling, monetary (possibly 0-10000) dominates the clustering.
  • "How do you validate the clustering?" Not with ANOVA on the training data. Check on a holdout set whether response rates actually differ across segments, or A/B test marketing strategies per segment.

Case 2: Credit Card Fraud Detection (Anomaly Detection via Clustering)

Scenario: there is no labeled fraud data; we want an unsupervised method to surface anomalous transactions.

Approach: use DBSCAN or Isolation Forest to find noise points that belong to no cluster; these may be fraud.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)

# Noise points (-1) are potential anomalies
anomalies = X[labels == -1]
print(f"Detected {len(anomalies)} potential anomalies out of {len(X)}")

Interview follow-ups:

  • "DBSCAN labels small minority groups (e.g., high-value transactions) as noise, but they aren't necessarily fraud?" Correct. A domain expert needs to review the anomalies; DBSCAN is a screening tool, not a final classifier.
  • "When would you use clustering-based anomaly detection vs a supervised model?" Use unsupervised methods when labels are absent or extremely scarce. With enough labels, a supervised model (e.g., GBM) is usually better.

Case 3: House Price Prediction (Geographic Clustering)

Scenario: house prices depend heavily on location, but using raw latitude/longitude as features works poorly (the relationship is not linear).

Approach: first use K-Means to group houses by geographic location, then add the cluster label as a categorical feature to the regression model.

from sklearn.cluster import KMeans

# Cluster by location
geo_features = df[["latitude", "longitude"]]
km = KMeans(n_clusters=20, random_state=42)
df["geo_cluster"] = km.fit_predict(geo_features)

# Use cluster as categorical feature in regression
# One-hot encode geo_cluster → captures neighborhood effects
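The one-hot step in the comment above can be completed as follows (a sketch on hypothetical toy coordinates; the dataframe and column names stand in for the real listings data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the real listings dataframe
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "latitude": rng.uniform(37.0, 38.0, 200),
    "longitude": rng.uniform(-122.5, -121.5, 200),
})

km = KMeans(n_clusters=5, n_init=10, random_state=42)
df["geo_cluster"] = km.fit_predict(df[["latitude", "longitude"]])

# One-hot encode the cluster label so a linear model can learn a
# separate "neighborhood" offset per cluster
geo_dummies = pd.get_dummies(df["geo_cluster"], prefix="geo")
X_model = pd.concat([df.drop(columns="geo_cluster"), geo_dummies], axis=1)
print(X_model.columns.tolist())
```

In a real pipeline, sklearn's OneHotEncoder(handle_unknown="ignore") inside a Pipeline is safer than pd.get_dummies, since data scored later may not contain every cluster ID and column alignment then breaks silently.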

Interview follow-up: "Why not use lat/lon directly?" Price is a non-linear function of location (it is not simply "the further north, the more expensive"). K-Means groups nearby houses together, so each cluster represents a "neighborhood"; after one-hot encoding, the model can learn a neighborhood effect.

Hands-on: Clustering in Python

Comparing Methods

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

methods = {
    "K-Means (K=4)": KMeans(n_clusters=4, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "Hierarchical (Ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

for name, model in methods.items():
    labels = model.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters > 1:
        sil = silhouette_score(X_scaled, labels)
        print(f"{name}: {n_clusters} clusters, Silhouette={sil:.3f}")

# GMM (soft clustering)
gmm = GaussianMixture(n_components=4, random_state=42).fit(X_scaled)
gmm_labels = gmm.predict(X_scaled)
gmm_probs = gmm.predict_proba(X_scaled)  # probability per cluster
print(f"GMM: BIC={gmm.bic(X_scaled):.0f}, Silhouette={silhouette_score(X_scaled, gmm_labels):.3f}")

Interview Signals

What interviewers listen for:

  • You can compare the applicable scenarios of K-Means, DBSCAN, Hierarchical, and GMM
  • You know K-Means' assumptions (spherical, equal size/variance) and when it fails
  • You can explain how to choose K (elbow, silhouette, BIC)
  • You know the circular-reasoning pitfall in clustering evaluation
  • You can apply clustering to downstream tasks (segmentation, anomaly detection, feature engineering)

Practice

Flashcards


What are the three main assumptions of K-Means, and when are they violated?

(1) Spherical clusters: fails on non-round/irregular shapes. (2) Equal-size clusters: when one cluster is much larger, the small one gets absorbed. (3) Equal variance: clusters with very different densities separate poorly. Use DBSCAN (arbitrary shapes) or GMM (ellipsoidal) to handle these cases.


Quiz

Question 1/10

Does every K-Means iteration necessarily decrease (or leave unchanged) the WCSS?

