Clustering

Interview Context

Clustering is a core topic in unsupervised learning. In interviews it usually appears in two forms: (1) "How would you segment customers?" and (2) "What are the limitations of K-Means? When would you use another method?" You should be able to compare the assumptions, trade-offs, and applicable scenarios of the different methods.

What You Should Understand

  • Know the K-Means algorithm steps, k-means++ initialization, and methods for choosing K
  • Be able to compare the assumptions and applicable scenarios of K-Means, DBSCAN, Hierarchical, and GMM
  • Understand how clustering evaluation works when there are no labels
  • Know the production pitfalls of clustering (feature scaling, high dimensionality, interpretation)

K-Means

Algorithm

K-Means minimizes the within-cluster sum of squares (WCSS):

\min \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2

where μ_k is the centroid of cluster k.

Steps (Lloyd's algorithm):

  1. Initialize K centroids (randomly or with k-means++)
  2. Assign each point to the nearest centroid
  3. Update each centroid to the mean of its assigned points
  4. Repeat steps 2-3 until convergence (centroids stop moving)

Convergence is guaranteed (WCSS decreases every iteration), but it may converge to a local minimum, which is why initialization matters.
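The four steps above can be sketched directly in NumPy (a minimal illustration with random initialization; in practice use sklearn's KMeans):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centroid, then re-center."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random init
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only lower (or keep) the WCSS, so the loop terminates; the empty-cluster guard keeps the sketch from crashing on a bad initialization.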

Geometric Interpretation (Linear Algebra Perspective)

The assignment step of K-Means is essentially a Voronoi tessellation: it partitions the feature space into K Voronoi cells, and the points in each cell are closer to that cell's centroid than to any other.

Update step: each centroid becomes the mean of its assigned points, which minimizes the squared distances within that cluster. This is equivalent to compressing the data representation from a continuous space down to K discrete centroids, a form of vector quantization.

With squared Euclidean distance, the optimal center of each cluster is the mean (the mean minimizes \sum_i \|x_i - c\|^2). With L1 distance, the optimal center is the median. With other distances, use K-Medoids.
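A quick numeric check of this fact, using a grid search over candidate centers on toy values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 20.0])  # skewed sample

# Grid of candidate centers
candidates = np.linspace(0, 20, 2001)

# Squared-error loss is minimized by the mean
sq_loss = np.array([np.sum((x - c) ** 2) for c in candidates])
best_sq = candidates[sq_loss.argmin()]

# Absolute-error (L1) loss is minimized by the median
l1_loss = np.array([np.sum(np.abs(x - c)) for c in candidates])
best_l1 = candidates[l1_loss.argmin()]

print(best_sq, x.mean())      # both close to 7.2
print(best_l1, np.median(x))  # both close to 3.0
```

The outlier at 20 drags the squared-loss minimizer (the mean) far to the right, while the L1 minimizer (the median) stays put, which is why K-Medoids-style centers are more robust.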

K-Means++ Initialization

The problem with random initialization: if the initial centroids all land inside the same cluster, the result can be terrible.

K-Means++ solves this:

  1. Pick the first centroid randomly
  2. For each remaining centroid, sample a data point with probability proportional to D(x)², where D(x) is the distance to the nearest already-chosen centroid
  3. Repeat until K centroids are chosen

Intuition: spread the initial centroids across different regions. sklearn's KMeans uses k-means++ by default.
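The seeding procedure is short enough to sketch in NumPy (illustrative only; sklearn's implementation adds refinements such as sampling several candidates per step):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each new centroid is sampled with prob ∝ D(x)^2."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # step 1: first centroid uniformly at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()              # step 2: sample proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Points near an existing centroid have small D(x)², so they are rarely chosen; far-away regions almost always receive the next seed.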

Choosing K

Elbow method: Plot WCSS vs K and look for the "elbow" where the rate of decrease slows down.

Problem: the elbow is often ambiguous, making the choice subjective.

Silhouette score: For each point, compute:

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}

where a_i = average distance to points in the same cluster, and b_i = average distance to points in the nearest other cluster.

  • s_i ≈ 1: well-clustered
  • s_i ≈ 0: on the boundary
  • s_i < 0: probably in the wrong cluster

Averaging the silhouette score across all points gives an overall clustering quality measure; choose the K that maximizes the average silhouette.
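In code, silhouette_samples gives the per-point s_i and silhouette_score their average, shown here on hypothetical two-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
per_point = silhouette_samples(X, labels)  # s_i for every point
overall = silhouette_score(X, labels)      # mean of the per-point values
print(f"mean s_i = {overall:.3f}, min s_i = {per_point.min():.3f}")
```

The per-point values are useful beyond model selection: points with low or negative s_i flag boundary cases worth inspecting.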

K-Means Assumptions and Limitations

| Assumption | What It Means | When It Fails |
|---|---|---|
| Spherical clusters | Clusters are roughly round | Elongated, irregular shapes |
| Equal-size clusters | All clusters have similar n | One cluster much larger than others |
| Equal variance | All clusters have similar spread | Dense and sparse clusters together |
| Fixed K | Must specify number of clusters | True K is unknown |
| Euclidean distance | Uses L2 distance | Categorical features, high-dimensional data |

K-Means Is Sensitive to Scale

K-Means uses Euclidean distance, so a feature ranging 0-1000 dominates the clustering while one ranging 0-1 is ignored. Always standardize before K-Means. This is the same issue as with SVM, for the same reason (both are distance-based methods).

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(X)

# Try different K values
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    print(f"K={k}: Silhouette={sil:.3f}, Inertia={km.inertia_:.0f}")

DBSCAN

Algorithm

Density-Based Spatial Clustering of Applications with Noise. It does not require specifying K, automatically finds clusters of arbitrary shape, and labels noise points.

Two parameters:

  • eps (ε): Radius of the neighborhood
  • min_samples: Minimum number of points within ε to form a "core point"

Point types:

  • Core point: Has ≥ min_samples points within ε
  • Border point: Within ε of a core point, but doesn't have enough neighbors itself
  • Noise point: Neither core nor border → labeled as -1

Algorithm:

  1. Find all core points
  2. Connect core points that are within ε of each other to form clusters
  3. Assign border points to the nearest core point's cluster
  4. Everything else is noise

DBSCAN vs K-Means

| Aspect | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical only | Arbitrary shapes |
| Requires K | Yes | No (determined automatically) |
| Noise handling | Every point assigned to a cluster | Explicit noise label (-1) |
| Cluster sizes | Tends toward equal sizes | Handles very different sizes |
| Sensitivity | To initialization | To ε and min_samples |
| Scalability | O(n·K·I), very fast | O(n log n) with a spatial index, O(n²) worst case |

Choosing eps and min_samples

min_samples: usually set to ≥ dim + 1. Larger values are more conservative (fewer clusters, more noise).

eps: use a k-distance plot. Compute each point's distance to its k-th nearest neighbor (k = min_samples), sort descending, and look for the "elbow"; that distance is a good ε.

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Find good eps using k-distance plot
k = 5  # min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])[::-1]
# Plot k_distances — look for the "elbow"

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

Hierarchical Clustering

Agglomerative (Bottom-Up)

Start with each point as its own cluster, then merge the two closest clusters repeatedly:

  1. Start: nn clusters (one per point)
  2. Merge the two closest clusters
  3. Repeat until one cluster remains
  4. Cut the dendrogram at desired level to get KK clusters

Linkage Criteria

How to define "distance between two clusters"?

| Linkage | Distance Definition | Behavior |
|---|---|---|
| Single | Min distance between any two points | Finds elongated chains, sensitive to noise |
| Complete | Max distance between any two points | Compact clusters, biased toward spherical |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward's | Increase in WCSS from merging | Most similar to K-Means, produces compact clusters |

Ward's linkage is the most commonly used: it behaves much like K-Means, but K does not have to be fixed in advance (inspect the dendrogram first, then decide).

Dendrogram

The dendrogram visualizes the entire merging process; the y-axis is the distance at which each merge happens. Find the longest vertical gap (the largest jump in merge distance) and cut there to get a natural number of clusters.

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute linkage matrix for dendrogram
Z = linkage(X_scaled, method="ward")
# Plot dendrogram to decide K
# dendrogram(Z)

# Fit with chosen K
hc = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = hc.fit_predict(X_scaled)
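The "longest vertical gap" heuristic from the dendrogram section can be automated from the merge distances in the linkage matrix. A sketch on toy three-blob data (the gap-based cut is a heuristic of this write-up, not an sklearn feature):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")
heights = Z[:, 2]            # merge distances, in ascending order
gaps = np.diff(heights)
g = gaps.argmax()
# Cut between the two merges that straddle the largest gap
cut = (heights[g] + heights[g + 1]) / 2
labels = fcluster(Z, t=cut, criterion="distance")
print(f"K from largest gap: {len(np.unique(labels))}")
```

fcluster with criterion="distance" keeps every merge below the threshold, so cutting inside the biggest jump recovers the cluster count a human would read off the dendrogram.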

Hierarchical vs K-Means

Advantages of hierarchical clustering: no need to preset K (decide from the dendrogram), and you can inspect the cluster structure at every granularity. Disadvantages: O(n²) memory and O(n³) time (unsuitable for large datasets), and no incremental updates (new data requires a full rerun).

Gaussian Mixture Models (GMM)

Soft Clustering

K-Means assigns each point to exactly one cluster (hard assignment). GMM assigns probabilities — each point belongs to each cluster with some probability.

GMM assumes data is generated from a mixture of KK Gaussian distributions:

P(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where π_k is the mixing weight (∑_k π_k = 1), and μ_k and Σ_k are the mean and covariance of component k.

EM Algorithm

Parameters are estimated via Expectation-Maximization:

E-step: Compute the probability (responsibility) that each point belongs to each cluster:

r_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

M-step: Update parameters using the responsibilities as weights:

\pi_k = \frac{\sum_i r_{ik}}{n}, \quad \boldsymbol{\mu}_k = \frac{\sum_i r_{ik} \mathbf{x}_i}{\sum_i r_{ik}}, \quad \boldsymbol{\Sigma}_k = \frac{\sum_i r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_i r_{ik}}

Iterate the E and M steps until convergence. Relationship to K-Means: K-Means is a special case of GMM, recovered when all Σ_k = σ²I (spherical, equal variance) and assignments are made hard.
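The two steps can be written as a bare-bones EM loop in NumPy/SciPy (no log-space stabilization and only minimal covariance regularization, so treat it as an illustration; use sklearn's GaussianMixture in practice):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, seed=0):
    """Minimal EM for a full-covariance GMM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1 / k)                      # mixing weights
    mu = X[rng.choice(n, k, replace=False)]     # means: random data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] ∝ pi_j * N(x_i | mu_j, sigma_j)
        dens = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(k)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of pi, mu, sigma
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return pi, mu, sigma, r
```

Note how each update mirrors the equations above, with r_{ik} acting as soft counts; replacing r with a hard 0/1 assignment and fixing sigma to σ²I collapses the loop back into Lloyd's algorithm.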

GMM vs K-Means

| Aspect | K-Means | GMM |
|---|---|---|
| Assignment | Hard (one cluster) | Soft (probability per cluster) |
| Cluster shape | Spherical | Ellipsoidal (covariance matrix) |
| Output | Cluster labels | P(cluster ∣ x) for each cluster |
| Model selection | Elbow, silhouette | BIC/AIC, which penalize complexity |
| Handles overlap | No | Yes: overlapping clusters get shared probabilities |
| Speed | Fast | Slower (computes full covariances) |

Choosing K with BIC

\text{BIC} = -2 \ln L + p \ln n

where L is the likelihood, p is the number of parameters, and n is the number of data points. Lower BIC = better model; BIC automatically penalizes using too many clusters.

from sklearn.mixture import GaussianMixture
import numpy as np

bic_scores = []
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, random_state=42)
    gmm.fit(X_scaled)
    bic_scores.append((k, gmm.bic(X_scaled)))
    probs = gmm.predict_proba(X_scaled)  # soft assignment
    print(f"K={k}: BIC={gmm.bic(X_scaled):.0f}")

best_k = min(bic_scores, key=lambda x: x[1])[0]

Evaluation Metrics

Without Ground Truth Labels

| Metric | Formula / Meaning | Range | Higher Is Better? |
|---|---|---|---|
| Silhouette | (b − a) / max(a, b) | [−1, 1] | Yes |
| Davies-Bouldin | Avg ratio of within-cluster to between-cluster distance | [0, ∞) | No (lower is better) |
| Calinski-Harabasz | Ratio of between-cluster to within-cluster variance | [0, ∞) | Yes |
| WCSS / Inertia | Sum of squared distances to centroids | [0, ∞) | No (lower is better) |

With Ground Truth Labels

If true labels exist (e.g., to verify that the clustering captures known structure):

| Metric | What It Measures | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | Agreement between predicted and true labels, adjusted for chance | [−1, 1] (1 = perfect) |
| Normalized Mutual Information (NMI) | Mutual information normalized by entropy | [0, 1] (1 = perfect) |
| Homogeneity | Each cluster contains only members of a single class | [0, 1] |
| Completeness | All members of a class are assigned to the same cluster | [0, 1] |
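A small example of these metrics on hand-made labels. Note they are invariant to permuting cluster IDs; here the predictions use different IDs than the truth and additionally split one point into its own cluster:

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             homogeneity_score, completeness_score)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 1, 0, 0, 2]  # same grouping, renamed IDs, one point split off

print(f"ARI:  {adjusted_rand_score(true_labels, pred_labels):.3f}")
print(f"NMI:  {normalized_mutual_info_score(true_labels, pred_labels):.3f}")
print(f"Hom:  {homogeneity_score(true_labels, pred_labels):.3f}")
print(f"Comp: {completeness_score(true_labels, pred_labels):.3f}")
```

Homogeneity stays perfect (every predicted cluster is pure) while completeness drops (class 1 is spread over two clusters), which is exactly the asymmetry the table describes.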

Clustering Evaluation Pitfall

Do not cluster a dataset and then run ANOVA on the same data to "verify" that the clusters differ: that is circular reasoning (the K-Means objective itself maximizes between-cluster differences). Correct approaches: (1) validate on a holdout set, or (2) test whether a downstream task improves because of the clustering.

Method Selection Guide

| Scenario | Best Method | Why |
|---|---|---|
| Spherical clusters, known K | K-Means | Fast, simple, works well for round clusters |
| Unknown K, arbitrary shapes | DBSCAN | No K needed, finds irregular shapes, handles noise |
| Want cluster hierarchy / different granularities | Hierarchical | Dendrogram shows structure at all levels |
| Overlapping clusters, need probabilities | GMM | Soft assignment, ellipsoidal shapes, BIC for model selection |
| Very large dataset (n > 100K) | K-Means or Mini-Batch K-Means | O(n) per iteration, parallelizable |
| High-dimensional data | Spectral clustering, or reduce dimensionality first | Euclidean distance degrades in high dimensions |
| Clusters of very different densities | HDBSCAN | Extends DBSCAN to varying densities |

Real-World Use Cases

Case 1: Customer Segmentation

Scenario: an e-commerce platform wants to segment its customers so it can run targeted marketing campaigns.

Features: RFM (Recency, Frequency, Monetary): time since last purchase, purchase frequency, and total spend.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# RFM features
rfm = df[["recency", "frequency", "monetary"]]
rfm_scaled = StandardScaler().fit_transform(rfm)

km = KMeans(n_clusters=4, init="k-means++", random_state=42)
df["segment"] = km.fit_predict(rfm_scaled)

# Interpret clusters
for seg in range(4):
    subset = df[df["segment"] == seg]
    print(f"Segment {seg}: n={len(subset)}, "
          f"avg_recency={subset['recency'].mean():.0f}, "
          f"avg_frequency={subset['frequency'].mean():.1f}, "
          f"avg_monetary={subset['monetary'].mean():.0f}")

Interview follow-ups:

  • "How do you decide the number of segments?" Silhouette score plus business interpretability (can 4 segments each map to a distinct marketing strategy?).
  • "The three RFM features have different scales; what do you do?" StandardScaler. Without scaling, monetary (possibly 0-10000) dominates the clustering.
  • "How do you validate the clustering?" Not with ANOVA on the training data. Check on a holdout set whether response rates actually differ across segments, or A/B test marketing strategies per segment.

Case 2: Credit Card Fraud Detection (Anomaly Detection via Clustering)

Scenario: there is no labeled fraud data; we want an unsupervised method to surface anomalous transactions.

Approach: use DBSCAN or Isolation Forest to find noise points that belong to no cluster; these may be fraud.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)

# Noise points (-1) are potential anomalies
anomalies = X[labels == -1]
print(f"Detected {len(anomalies)} potential anomalies out of {len(X)}")

Interview follow-ups:

  • "DBSCAN labels small minority groups (e.g., high-value transactions) as noise, but they aren't necessarily fraud?" Correct. A domain expert needs to review the anomalies; DBSCAN is a screening tool, not a final classifier.
  • "When would you use clustering-based anomaly detection vs a supervised model?" Use unsupervised methods when labels are absent or extremely scarce. With enough labels, a supervised model (e.g., GBM) is usually better.

Case 3: House Price Prediction (Geographic Clustering)

Scenario: house prices depend heavily on location, but using raw latitude/longitude as features works poorly (the relationship is not linear).

Approach: first use K-Means to group houses by geographic location, then add the cluster label as a categorical feature to the regression model.

from sklearn.cluster import KMeans

# Cluster by location
geo_features = df[["latitude", "longitude"]]
km = KMeans(n_clusters=20, random_state=42)
df["geo_cluster"] = km.fit_predict(geo_features)

# Use cluster as categorical feature in regression
# One-hot encode geo_cluster → captures neighborhood effects
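The one-hot step in the comment above can be completed as follows (a sketch on hypothetical toy coordinates; the dataframe and column names stand in for the real listings data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the real listings dataframe
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "latitude": rng.uniform(37.0, 38.0, 200),
    "longitude": rng.uniform(-122.5, -121.5, 200),
})

km = KMeans(n_clusters=5, n_init=10, random_state=42)
df["geo_cluster"] = km.fit_predict(df[["latitude", "longitude"]])

# One-hot encode the cluster label so a linear model can learn a
# separate "neighborhood" offset per cluster
geo_dummies = pd.get_dummies(df["geo_cluster"], prefix="geo")
X_model = pd.concat([df.drop(columns="geo_cluster"), geo_dummies], axis=1)
print(X_model.columns.tolist())
```

In a real pipeline, sklearn's OneHotEncoder(handle_unknown="ignore") inside a Pipeline is safer than pd.get_dummies, since data scored later may not contain every cluster ID and column alignment then breaks silently.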

Interview follow-up: "Why not use lat/lon directly?" Price is a non-linear function of location (it is not simply "the further north, the more expensive"). K-Means groups nearby houses together, so each cluster represents a "neighborhood"; after one-hot encoding, the model can learn a neighborhood effect.

Hands-on: Clustering in Python

Comparing Methods

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

methods = {
    "K-Means (K=4)": KMeans(n_clusters=4, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "Hierarchical (Ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

for name, model in methods.items():
    labels = model.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters > 1:
        sil = silhouette_score(X_scaled, labels)
        print(f"{name}: {n_clusters} clusters, Silhouette={sil:.3f}")

# GMM (soft clustering)
gmm = GaussianMixture(n_components=4, random_state=42).fit(X_scaled)
gmm_labels = gmm.predict(X_scaled)
gmm_probs = gmm.predict_proba(X_scaled)  # probability per cluster
print(f"GMM: BIC={gmm.bic(X_scaled):.0f}, Silhouette={silhouette_score(X_scaled, gmm_labels):.3f}")

Interview Signals

What interviewers listen for:

  • You can compare the applicable scenarios of K-Means, DBSCAN, Hierarchical, and GMM
  • You know K-Means' assumptions (spherical, equal size/variance) and when it fails
  • You can explain how to choose K (elbow, silhouette, BIC)
  • You know the circular-reasoning pitfall in clustering evaluation
  • You can apply clustering to downstream tasks (segmentation, anomaly detection, feature engineering)

Practice

Flashcards


What are the three main assumptions of K-Means, and when are they violated?

(1) Spherical clusters: fails on non-round/irregular shapes. (2) Equal-size clusters: when one cluster is much larger, the small one gets absorbed. (3) Equal variance: clusters with very different densities separate poorly. Use DBSCAN (arbitrary shapes) or GMM (ellipsoidal) to handle these cases.


Quiz

Question 1/10

Does every K-Means iteration necessarily decrease (or leave unchanged) the WCSS?

