Trees & Ensembles
Why This Matters
Gradient-boosted trees (XGBoost, LightGBM) are the kings of tabular data, dominating both industry applications and Kaggle. Interviewers expect you to explain how they work, when they fail, and how they differ from random forests.
Decision Trees
A decision tree recursively partitions the feature space by choosing the split that maximizes some purity criterion.
Splitting Criteria
Gini Impurity (used in CART):

$G = 1 - \sum_k p_k^2$

where $p_k$ is the proportion of class $k$ in the node. $G = 0$ means a pure node (only one class).

Entropy / Information Gain (used in ID3, C4.5):

$H = -\sum_k p_k \log_2 p_k$

Information gain is the parent's entropy minus the weighted average entropy of the children.
Variance reduction (regression trees): split that minimizes the weighted sum of variances in child nodes.
Gini vs Entropy
In practice the two produce nearly identical trees. Gini is slightly faster to compute (no logarithm). Interviewers care more that you understand the concept of purity-based splitting than which criterion you pick.
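Both criteria are easy to compute directly. A minimal sketch in plain Python (the `gini` and `entropy` helpers are illustrative, not from any library):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: sum of -p_k * log2(p_k) over class proportions p_k."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["a"] * 10               # one class only
mixed = ["a"] * 5 + ["b"] * 5   # 50/50 split

print(gini(pure), entropy(pure))    # 0.0 0.0  (pure node)
print(gini(mixed), entropy(mixed))  # 0.5 1.0  (maximally impure for 2 classes)
```

Both measures hit zero on a pure node and their maximum on an even split, which is why they rank candidate splits so similarly.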
Tree Construction
At each node, the algorithm:
- Considers all features and all possible split points
- Selects the split maximizing the chosen criterion
- Creates two child nodes
- Repeats until a stopping condition is met
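For a single numeric feature, the exhaustive split search can be sketched in a few lines (illustrative helper names; weighted Gini as the criterion):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Try every midpoint between sorted values; return (threshold, weighted_gini)."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal feature values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs[:i]]
        right = [y for x, y in pairs[i:]]
        n = len(pairs)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (thr, score)
    return best

# Perfectly separable toy data: class 0 below 2.5, class 1 above
print(best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # (2.5, 0.0)
```

A real implementation repeats this over every feature and picks the feature/threshold pair with the lowest weighted impurity, which is exactly where the greediness comes from.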
Advantages of Decision Trees
Before discussing ensembles, understand the strengths and weaknesses of a single tree:
| Advantage | Detail |
|---|---|
| Interpretable | You can read the decision rules directly; easy to explain to stakeholders |
| No feature scaling | No normalization/standardization needed (splits are threshold-based) |
| Handles mixed types | Handles numerical + categorical features naturally |
| Handles missing values | XGBoost/LightGBM support them natively (learn a default direction) |
| Captures interactions | Learns feature interactions automatically (nested splits) |
| Non-parametric | No linearity assumption required |
| Disadvantage | Detail |
|---|---|
| High variance | An unpruned tree overfits severely; slightly different data produces a completely different tree |
| Greedy algorithm | Each split is only locally optimal; the global optimum is not guaranteed |
| Biased toward high-cardinality | Features with many unique values offer more split opportunities |
| Axis-aligned splits | Only horizontal/vertical splits; weak at diagonal boundaries |
Controlling Overfitting
An unpruned tree will perfectly memorize training data. Control via:
| Hyperparameter | What It Does | Typical Range |
|---|---|---|
| max_depth | Limit tree height | 3-10 (boosting), None (RF) |
| min_samples_leaf | Minimum samples in terminal node | 1-20 |
| min_samples_split | Minimum samples to consider a split | 2-50 |
| max_features | Features considered per split | √p (clf), p/3 (reg) |
| ccp_alpha | Cost-complexity pruning parameter | Cross-validate |
Ensemble Methods
The key insight: combining many weak learners produces a strong learner. The two main strategies are bagging and boosting.
Bagging (Bootstrap Aggregating)
- Create bootstrap samples (sample with replacement)
- Train one model on each bootstrap sample
- Aggregate: average (regression) or majority vote (classification)
Bagging reduces variance without increasing bias. It works best with high-variance, low-bias base learners (like deep trees).
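The three steps above can be sketched generically. Here the "model" is just a stored class-majority, purely to keep the mechanics visible (any `fit`/`predict` pair would slot in):

```python
import random
from collections import Counter

random.seed(0)

def bootstrap(data):
    """Sample len(data) items with replacement."""
    return [random.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate classifier outputs: most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

# Toy "ensemble": each "model" predicts the majority class of its own bootstrap sample
data = ["spam"] * 60 + ["ham"] * 40
models = [majority_vote(bootstrap(data)) for _ in range(25)]
print(majority_vote(models))  # the ensemble's aggregated vote
```

With real trees, each one would be trained on its bootstrap sample and the final prediction would average (regression) or vote (classification) exactly as above.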
Random Forest
Random Forest = Bagging + random feature subsampling at each split.
At each split, only a random subset of $m$ features is considered:
- Classification: $m \approx \sqrt{p}$
- Regression: $m \approx p/3$
Why feature subsampling? If every tree made its first split on the same dominant feature, the trees would be highly correlated, and averaging correlated predictions reduces variance only a little. Feature subsampling decorrelates the trees, making the ensemble more effective.
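The standard identity behind this: for $n$ predictors with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/n$, which floors at $\rho\sigma^2$ no matter how many trees you add. A quick Monte Carlo check (stdlib only; equicorrelated Gaussians built from a shared component):

```python
import random
import statistics

random.seed(42)

def avg_of_correlated(n, rho, trials=8000):
    """Sample variance of the average of n standard normals with pairwise correlation rho."""
    means = []
    for _ in range(trials):
        shared = random.gauss(0, 1)
        vals = [rho**0.5 * shared + (1 - rho)**0.5 * random.gauss(0, 1)
                for _ in range(n)]
        means.append(sum(vals) / n)
    return statistics.pvariance(means)

rho = 0.6
for n in (1, 10, 100):
    print(n, round(avg_of_correlated(n, rho), 3))
# Theoretical values rho + (1 - rho)/n: 1.0, 0.64, 0.604 — simulation lands nearby
```

Going from 10 to 100 trees barely helps once $\rho$ dominates, which is exactly why Random Forest spends effort decorrelating the trees rather than just adding more of them.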
Out-of-Bag (OOB) Error
Each bootstrap sample leaves out ~37% of observations (since $(1 - 1/n)^n \to e^{-1} \approx 0.368$).
These out-of-bag samples provide free validation:
- Each tree predicts on the OOB samples it never saw
- Aggregating OOB predictions across all trees gives an OOB error ≈ cross-validation error
- No separate holdout set needed
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.4f}") # free validation!
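A quick sanity check of the ~37% figure with a single simulated bootstrap (stdlib only; the exact limit is $e^{-1} \approx 0.368$):

```python
import math
import random

random.seed(0)
n = 10000
drawn = {random.randrange(n) for _ in range(n)}  # one bootstrap sample of size n
oob_fraction = 1 - len(drawn) / n                 # observations never drawn
print(round(oob_fraction, 3), round(math.exp(-1), 3))  # both close to 0.368
```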
Boosting
Boosting builds models sequentially; each new model focuses on the errors of the ensemble so far.
Gradient Boosting fits each new tree to the negative gradient (pseudo-residuals) of the loss:

$r_i = -\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}$

For squared error loss $L = \frac{1}{2}(y - F(x))^2$, this gives $r_i = y_i - F(x_i)$: exactly the ordinary residuals.
For log loss (classification), the pseudo-residuals take a different form, but the core idea is the same: each new tree corrects the mistakes of the trees before it.
Boosting reduces bias: each tree corrects remaining errors. The risk is overfitting (especially with noisy labels), so the learning rate and number of trees must be controlled.
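The loop above can be made concrete with a from-scratch sketch: regression stumps fit to residuals under squared error. All helper names here are illustrative:

```python
def fit_stump(xs, ys):
    """Best single-threshold regression stump minimizing squared error."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold between equal values
        thr = (xs[order[k - 1]] + xs[order[k]]) / 2
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lmean) ** 2 for y in left) + sum((y - rmean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x < thr else rmean

def boost(xs, ys, n_trees=50, lr=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    f0 = sum(ys) / len(ys)        # initial prediction: the mean
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of 1/2(y-F)^2
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]   # a step function
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])  # converges toward [1, 1, 1, 5, 5, 5]
```

Each round shrinks the residuals by a factor of roughly $(1 - \text{lr})$ on this clean data, which is the bias-reduction mechanism in miniature.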
Bagging vs Boosting
| Aspect | Bagging (Random Forest) | Boosting (XGBoost/LightGBM) |
|---|---|---|
| Tree construction | Independent, parallel | Sequential, each corrects errors |
| What it reduces | Variance | Bias (primarily) |
| Base learner | Deep trees (high variance) | Shallow trees, depth 3-6 (high bias) |
| Overfitting risk | Low | Higher — needs careful tuning |
| Noise sensitivity | Robust | Can memorize noisy labels |
| Training speed | Parallelizable | Sequential (but optimized) |
| When to use first | You want a robust baseline with minimal tuning | You want the best possible performance and are willing to tune |
Classic Interview Question
"When would you use Random Forest vs Gradient Boosting?" RF is more robust, needs less tuning, and is less prone to overfitting; GBM has a higher ceiling but demands more tuning. With noisy data or limited time, reach for RF; when chasing the best possible performance and able to tune, reach for GBM.
XGBoost and LightGBM
XGBoost Innovations
XGBoost adds several key improvements to vanilla gradient boosting:
Regularized objective:

$\text{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$

where $T$ is the number of leaves and $w$ are the leaf weights. $\gamma$ penalizes tree complexity; $\lambda$ shrinks leaf weights.
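From this objective and the second-order Taylor expansion of the loss, each leaf's optimal weight and the gain of a candidate split have closed forms (standard XGBoost derivation, with $G$ and $H$ the sums of gradients and Hessians over the instances in each leaf or child):

```latex
w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad
\text{Gain} = \frac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is accepted only if the gain is positive, so $\gamma$ acts as a built-in pruning threshold.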
| Innovation | What It Does |
|---|---|
| Second-order approximation | Uses gradient + Hessian for better split finding |
| Regularized objective | L1/L2 on leaf weights prevents overfitting |
| Sparsity-aware splitting | Handles missing values by learning default direction |
| Column subsampling | Like RF, samples features per tree or per split |
| Weighted quantile sketch | Approximate split finding for large datasets |
LightGBM Innovations
LightGBM optimizes for speed and memory on large datasets:
| Innovation | What It Does |
|---|---|
| GOSS (Gradient-based One-Side Sampling) | Keeps large-gradient instances, samples small-gradient ones |
| EFB (Exclusive Feature Bundling) | Bundles mutually exclusive sparse features |
| Leaf-wise growth | Splits the leaf with highest loss reduction (vs level-wise) |
| Histogram-based splitting | Bins continuous features → split finding costs O(bins) instead of O(data) per feature |
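Histogram-based splitting can be sketched in plain Python: pre-bin each feature into a fixed number of buckets, after which candidate thresholds are bin edges only (illustrative helpers, stdlib only):

```python
def make_bins(values, n_bins=4):
    """Equal-frequency bin edges (quantile-style, roughly what LightGBM does)."""
    s = sorted(values)
    step = len(s) / n_bins
    return [s[int(i * step)] for i in range(1, n_bins)]

def to_bin(value, edges):
    """Map a raw value to its bin index: split search now scans bins, not raw values."""
    for i, e in enumerate(edges):
        if value < e:
            return i
    return len(edges)

values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.2]
edges = make_bins(values, n_bins=4)
print(edges, [to_bin(v, edges) for v in values])
```

Once data is binned, gradient/Hessian statistics are accumulated per bin, so evaluating every candidate split at a node costs O(bins) per feature regardless of dataset size.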
Level-wise vs Leaf-wise
| | Level-wise (XGBoost default) | Leaf-wise (LightGBM default) |
|---|---|---|
| Strategy | Grow all nodes at the same depth | Grow the node with the highest gain |
| Tree shape | Balanced | Potentially deep, asymmetric |
| Efficiency | More splits needed | Fewer splits for the same loss reduction |
| Overfitting | More controlled | More prone to overfitting on small datasets |
| Key param | max_depth | num_leaves |
LightGBM on Small Datasets
Leaf-wise growth overfits easily on small datasets (< 10K rows), producing very deep, unbalanced trees. Always constrain num_leaves (typically < 64) and min_child_samples (typically > 20).
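A hedged starting configuration following the rules of thumb above; treat the values as a starting point for tuning, not tuned defaults:

```python
# Conservative LightGBM settings for small datasets (illustrative starting point)
small_data_params = {
    "num_leaves": 31,          # well under the ~64 ceiling suggested above
    "min_child_samples": 25,   # > 20: forbid tiny leaves
    "max_depth": 6,            # extra cap on leaf-wise depth
    "learning_rate": 0.05,
    "reg_lambda": 1.0,         # L2 on leaf weights
}
```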
CatBoost
CatBoost is especially good at handling categorical features:
- Ordered target encoding: uses ordered statistics to avoid target leakage
- Symmetric trees: every node at the same depth uses the same split, which acts as regularization
- No need for one-hot encoding: categorical features are supported natively
When the data has many categorical features (e.g. user/item IDs in recommender systems), CatBoost is often more convenient than XGBoost/LightGBM.
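The ordered-encoding idea can be sketched in plain Python: process rows in order and encode each row's category using only the target mean of *earlier* rows in that category, so no row ever sees its own label (illustrative helper, with a simple prior acting as one pseudo-observation for unseen categories):

```python
def ordered_target_encode(categories, targets, prior=0.5):
    """Encode each row with the running target mean of its category,
    computed from preceding rows only (avoids target leakage)."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior) / (c + 1))  # smoothed running mean
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

cats = ["a", "a", "b", "a", "b"]
ys   = [1,   0,   1,   1,   0]
print(ordered_target_encode(cats, ys))
# The first "a" has no history, so it gets the prior; later rows see earlier labels only
```

CatBoost additionally averages over several random orderings to reduce the variance this introduces for early rows.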
Hyperparameter Tuning Guide
| Parameter | XGBoost | LightGBM | Effect |
|---|---|---|---|
| Learning rate | eta (0.01-0.3) | learning_rate | Smaller is more stable but needs more trees |
| Tree depth | max_depth (3-10) | max_depth (-1 = unlimited) | Controls interaction depth |
| Leaf count | max_leaves | num_leaves (31) | LightGBM's main complexity control |
| Min samples | min_child_weight | min_child_samples (20) | Prevents tiny leaves |
| Row sampling | subsample (0.5-1.0) | bagging_fraction | Reduces variance, like bagging |
| Col sampling | colsample_bytree (0.5-1.0) | feature_fraction | Like RF's feature subsampling |
| L1 reg | alpha | reg_alpha | Feature selection effect |
| L2 reg | lambda | reg_lambda | Shrinks leaf weights |
Tuning Strategy
A good answer when asked about tuning strategy in an interview: "First fix learning_rate=0.1 with a generous n_estimators (using early stopping), then tune tree structure (max_depth / num_leaves), then tune regularization (subsample, colsample_bytree, reg_alpha/lambda), and finally lower the learning_rate while increasing n_estimators."
Feature Importance
Three Methods Compared
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Impurity-based (MDI) | Sum of impurity reduction across all splits | Fast, built-in | Biased toward high-cardinality features |
| Permutation | Shuffle feature, measure performance drop | Model-agnostic, reliable | Slow, affected by correlated features |
| SHAP | Shapley values from game theory | Theoretically grounded, per-prediction | Computationally expensive |
SHAP Values
SHAP (SHapley Additive exPlanations) provides a theoretically grounded attribution based on cooperative game theory:

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$
Key properties:
- Local accuracy: $f(x) = \phi_0 + \sum_i \phi_i$ (the SHAP values plus the baseline add up to the model's prediction)
- Consistency: If a feature's contribution increases, its SHAP value won't decrease
- TreeSHAP: Exact SHAP computation in polynomial time for tree models (far faster than brute force)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Summary plot: global feature importance + direction
shap.summary_plot(shap_values, X_test)
# Force plot: explain a single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
SHAP in Interviews
Being able to state SHAP's three properties (local accuracy, consistency, missingness) and how it differs from impurity importance is a strong interview signal, especially the fact that TreeSHAP computes exact values on tree models with no approximation needed.
Real-World Use Cases
Case 1: Credit Card Fraud Detection
Why trees work well: Fraud patterns often involve complex feature interactions (large amount + cross-border + new card → high risk). Trees capture interactions naturally, with no manual feature engineering.
import lightgbm as lgb
model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    is_unbalance=True,  # handles class imbalance
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50)],  # stop when validation score stalls
)
Interview follow-ups:
- "Why a GBM instead of logistic regression?" Fraud patterns are non-linear and involve interactions, which logistic regression would need engineered by hand.
- "How do you handle class imbalance?" is_unbalance=True (automatic reweighting), scale_pos_weight, or resampling (e.g. SMOTE).
- "How do you explain the model's decisions?" SHAP values can explain why each individual transaction was flagged as fraud.
Case 2: House Price Prediction
Why trees work well: The relationship between price and features is non-linear (increasing floor area from 50 to 60 has a different effect than from 150 to 160), with interactions (good school district + large area is not additive).
Interview follow-ups:
- "Tree model vs linear regression?" If you care about interpretability (coefficient = price per unit of area), use linear. If you care about predictive accuracy, use a GBM.
- "How do you read feature importance?" Impurity importance can over-rank high-cardinality location features (like zip code). Permutation importance or SHAP is more reliable.
Case 3: Uplift Modeling After Customer Segmentation
The marketing team wants to know: if we send discount coupons, which customers will they influence most?
Uplift modeling: instead of predicting "will this customer buy?", predict "how much does the coupon increase their purchase probability?"
# Two-model approach (T-learner)
from sklearn.ensemble import GradientBoostingClassifier
# Model 1: P(purchase | treated)
model_t = GradientBoostingClassifier().fit(X_treated, y_treated)
# Model 2: P(purchase | not treated)
model_c = GradientBoostingClassifier().fit(X_control, y_control)
# Uplift = P(purchase|treated) - P(purchase|control)
uplift = model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]
# Target customers with highest uplift (not highest purchase probability!)
Send coupons only to the customers with the highest uplift to maximize ROI. Targeting the customers "most likely to buy" is wrong: they may well buy without a coupon anyway.
Hands-on: Trees & Ensembles in Python
Decision Tree vs Random Forest vs Gradient Boosting
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10)
models = {
"Decision Tree (unlimited)": DecisionTreeClassifier(),
"Decision Tree (depth=5)": DecisionTreeClassifier(max_depth=5),
"Random Forest (100 trees)": RandomForestClassifier(n_estimators=100),
"Gradient Boosting (100)": GradientBoostingClassifier(
n_estimators=100, max_depth=3, learning_rate=0.1
),
}
for name, model in models.items():
cv = cross_val_score(model, X, y, cv=5, scoring="accuracy")
model.fit(X, y)
print(f"{name}: CV={cv.mean():.3f}, Train={model.score(X, y):.3f}")
# Unpruned tree: Train=1.000, CV~0.85 → overfitting
# Random Forest: reduces variance via bagging + feature subsampling
# Gradient Boosting: highest CV, reduces bias sequentially
Feature Importance: Impurity vs Permutation
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# Impurity-based (biased toward high-cardinality)
imp_impurity = rf.feature_importances_
# Permutation (more reliable)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10)
imp_perm = perm.importances_mean
# Compare: high-cardinality noise features rank high in impurity
# but low in permutation → always prefer permutation or SHAP
LightGBM with Early Stopping
import lightgbm as lgb
dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
params = {
"objective": "binary",
"metric": "auc",
"learning_rate": 0.05,
"num_leaves": 31,
"min_child_samples": 20,
    "subsample": 0.8,
    "subsample_freq": 1,  # row sampling only takes effect with a positive frequency
"colsample_bytree": 0.8,
"reg_alpha": 0.1,
"reg_lambda": 1.0,
"verbose": -1,
}
model = lgb.train(
params, dtrain,
num_boost_round=2000,
valid_sets=[dval],
callbacks=[lgb.early_stopping(50)],
)
# Early stopping finds the optimal number of trees automatically
Interview Signals
What interviewers listen for:
- You can explain the core mechanisms: bagging reduces variance, boosting reduces bias
- You know the strengths, weaknesses, and use cases of RF vs GBM
- You can articulate the key differences among XGBoost / LightGBM / CatBoost
- You understand the tradeoffs among the three feature importance methods (impurity / permutation / SHAP)
- You know how to tune hyperparameters systematically
Practice
Flashcards
Why does Random Forest add feature subsampling on top of bagging?
Without it, all trees split on the same dominant features → trees are correlated → averaging doesn't reduce variance much. Feature subsampling decorrelates trees, making the ensemble more effective. Classification: m≈√p, Regression: m≈p/3.
Quiz
Noisy dataset with many mislabeled examples. Which model is more likely to overfit to the noise?