Trees & Ensembles

Why This Matters

Gradient-boosted trees (XGBoost, LightGBM) are the dominant models for tabular data, both in industry and on Kaggle. Interviewers expect you to explain how they work, when they fail, and how they differ from random forests.

Decision Trees

A decision tree recursively partitions the feature space by choosing the split that maximizes some purity criterion.

Splitting Criteria

Gini Impurity (used in CART):

$$G = 1 - \sum_{k=1}^{K} p_k^2$$

where $p_k$ is the proportion of class $k$ in the node. $G = 0$ means a pure node (only one class).

Entropy / Information Gain (used in ID3, C4.5):

$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$

$$IG = H(\text{parent}) - \sum_{j} \frac{n_j}{n} H(\text{child}_j)$$

Variance reduction (regression trees): split that minimizes the weighted sum of variances in child nodes.
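As a sanity check, the classification criteria above can be sketched in a few lines of NumPy (a minimal illustration, not a library implementation):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_k p_k log2 p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """IG = H(parent) - weighted sum of child entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A pure node scores 0 under both criteria; a 50/50 binary node
# scores 0.5 (Gini) and 1.0 (entropy); a perfect split gains 1 bit.
```

Note how both criteria hit their maximum at a 50/50 class mix and zero at purity, which is why they produce such similar trees in practice.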

Gini vs Entropy

In practice the two produce nearly identical trees. Gini is slightly cheaper to compute (no logarithm). Interviewers care more that you understand purity-based splitting as a concept than which criterion you pick.

Tree Construction

At each node, the algorithm:

  1. Considers all features and all possible split points
  2. Selects the split maximizing the chosen criterion
  3. Creates two child nodes
  4. Repeats until a stopping condition is met
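Steps 1-2 above can be sketched as an exhaustive greedy search (a toy version using Gini; real implementations sort feature values once and scan candidate thresholds incrementally):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every (feature, threshold) pair and keep the split
    with the lowest weighted child impurity."""
    best_feature, best_threshold, best_score = None, None, np.inf
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold, best_score

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
j, t, score = best_split(X, y)  # x <= 2.0 separates the classes perfectly
```

The nested loop is why each split is only locally optimal: the search never looks more than one level ahead.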

Advantages of Decision Trees

Before moving to ensembles, understand the strengths and weaknesses of a single tree:

| Advantage | Detail |
| --- | --- |
| Interpretable | The decision rules are directly visible and easy to explain to stakeholders |
| No feature scaling | No normalization/standardization needed (splits are threshold-based) |
| Handles mixed types | Handles numerical + categorical features naturally |
| Handles missing values | XGBoost/LightGBM support them natively (learn a default direction) |
| Captures interactions | Learns feature interactions automatically (nested splits) |
| Non-parametric | No assumption of a linear relationship |
| Disadvantage | Detail |
| --- | --- |
| High variance | An unpruned tree overfits badly; slightly different data yields a completely different tree |
| Greedy algorithm | Each split is only a local optimum; no guarantee of a global optimum |
| Biased toward high-cardinality | Features with many unique values get more split opportunities |
| Axis-aligned splits | Only horizontal/vertical splits; poor at diagonal boundaries |

Controlling Overfitting

An unpruned tree will perfectly memorize training data. Control via:

| Hyperparameter | What It Does | Typical Range |
| --- | --- | --- |
| max_depth | Limit tree height | 3-10 (boosting), None (RF) |
| min_samples_leaf | Minimum samples in terminal node | 1-20 |
| min_samples_split | Minimum samples to consider a split | 2-50 |
| max_features | Features considered per split | $\sqrt{p}$ (clf), $p/3$ (reg) |
| ccp_alpha | Cost-complexity pruning parameter | Cross-validate |

Ensemble Methods

The key insight: combining many weak learners produces a strong learner. The two main strategies are bagging and boosting.

Bagging (Bootstrap Aggregating)

  1. Create $B$ bootstrap samples (sample with replacement)
  2. Train one model on each bootstrap sample
  3. Aggregate: average (regression) or majority vote (classification)

Bagging reduces variance without increasing bias. It works best with high-variance, low-bias base learners (like deep trees).
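The three steps can be sketched directly with scikit-learn trees (a from-scratch illustration; in practice use `BaggingClassifier` or `RandomForestClassifier`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Steps 1-2: train one deep (high-variance) tree per bootstrap sample
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: aggregate by majority vote
votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote
```

Each tree overfits its own bootstrap sample, but their disagreements average out, which is exactly the variance reduction bagging promises.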

Random Forest

Random Forest = Bagging + random feature subsampling at each split.

At each split, only a random subset of $m$ features is considered:

  • Classification: $m \approx \sqrt{p}$
  • Regression: $m \approx p/3$

Why feature subsampling? If every tree made its first split on the same dominant feature, the trees would be highly correlated, and averaging correlated predictions reduces variance only slightly. Feature subsampling decorrelates the trees, making the ensemble more effective.
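A standard result makes this concrete: for $B$ predictors each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/B$. With high $\rho$ the first term dominates no matter how many trees you add. A quick numeric check:

```python
def var_of_average(sigma2, rho, B):
    """Variance of the mean of B equicorrelated predictors."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# Highly correlated trees: variance barely drops even with 500 trees
high = var_of_average(1.0, rho=0.9, B=500)   # ~0.9002
# Decorrelated trees: variance falls almost like 1/B
low = var_of_average(1.0, rho=0.1, B=500)    # ~0.1018
```

Feature subsampling lowers $\rho$, which is why it helps even though each individual tree becomes slightly worse.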

Out-of-Bag (OOB) Error

Each bootstrap sample leaves out ~37% of observations ($1/e \approx 0.368$).
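The 37% figure comes from the probability that a given observation is never drawn in $n$ draws with replacement:

```latex
P(\text{observation not in bootstrap sample}) = \left(1 - \frac{1}{n}\right)^{n} \;\xrightarrow{\;n \to \infty\;}\; e^{-1} \approx 0.368
```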

These out-of-bag samples provide free validation:

  • Each tree predicts on the OOB samples it never saw during training
  • Aggregating OOB predictions across all trees gives OOB error ≈ cross-validation error
  • No separate holdout set is needed
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.4f}")  # free validation!

Boosting

Boosting builds models sequentially; each new model focuses on the errors of the ensemble so far.

Gradient Boosting fits each new tree to the negative gradient (pseudo-residuals) of the loss:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x})$$

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$

For squared error loss: $r_{im} = y_i - F_{m-1}(x_i)$ (just the ordinary residuals).

For log loss (classification): the pseudo-residuals take a different form, but the core idea is the same; each new tree corrects the mistakes of the trees before it.

Boosting reduces bias, since each tree corrects the remaining errors. The risk is overfitting (especially with noisy labels), so the learning rate and the number of trees must be controlled.
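The update rule above fits in a few lines for squared error (a minimal from-scratch sketch; real libraries add regularization, subsampling, and second-order information):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, M = 0.1, 100
F = np.full_like(y, y.mean())  # F_0: constant baseline prediction
trees = []
for m in range(M):
    residuals = y - F                           # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(h)
    F = F + eta * h.predict(X)                  # F_m = F_{m-1} + eta * h_m

mse = np.mean((y - F) ** 2)  # far below the baseline variance of y
```

Each shallow tree is a weak learner on its own; the sequential residual-fitting is what drives the bias down.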

Bagging vs Boosting

| Aspect | Bagging (Random Forest) | Boosting (XGBoost/LightGBM) |
| --- | --- | --- |
| Tree construction | Independent, parallel | Sequential, each corrects errors |
| What it reduces | Variance | Bias (primarily) |
| Base learner | Deep trees (high variance) | Shallow trees, depth 3-6 (high bias) |
| Overfitting risk | Low | Higher; needs careful tuning |
| Noise sensitivity | Robust | Can memorize noisy labels |
| Training speed | Parallelizable | Sequential (but optimized) |
| When to use first | Robust baseline with minimal tuning | Best possible performance, willing to tune |

Classic Interview Question

"When do you use Random Forest vs Gradient Boosting?" RF is more robust, needs less tuning, and is harder to overfit. GBM has a higher ceiling but needs more tuning. Noisy data or limited time → RF. Chasing best performance and able to tune → GBM.

XGBoost and LightGBM

XGBoost Innovations

XGBoost adds several key improvements to vanilla gradient boosting:

Regularized objective:

$$\text{Obj} = \sum_{i} L(y_i, \hat{y}_i) + \sum_{t} \left[\gamma T_t + \frac{1}{2}\lambda \sum_{j=1}^{T_t} w_j^2 \right]$$

where $T_t$ is the number of leaves and $w_j$ are the leaf weights. $\gamma$ penalizes tree complexity; $\lambda$ shrinks leaf weights.
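Minimizing this objective under the second-order approximation gives closed forms for the optimal leaf weight and the split gain, as derived in the XGBoost paper (here $G_j, H_j$ are the sums of first and second derivatives of the loss over the instances in leaf $j$):

```latex
w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
```

The $\gamma$ term acts as a minimum gain threshold: a split is only made if its gain exceeds the complexity cost of the extra leaf.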

| Innovation | What It Does |
| --- | --- |
| Second-order approximation | Uses gradient + Hessian for better split finding |
| Regularized objective | L1/L2 on leaf weights prevents overfitting |
| Sparsity-aware splitting | Handles missing values by learning a default direction |
| Column subsampling | Like RF, samples features per tree or per split |
| Weighted quantile sketch | Approximate split finding for large datasets |

LightGBM Innovations

LightGBM optimizes for speed and memory on large datasets:

| Innovation | What It Does |
| --- | --- |
| GOSS (Gradient-based One-Side Sampling) | Keeps large-gradient instances, samples small-gradient ones |
| EFB (Exclusive Feature Bundling) | Bundles mutually exclusive sparse features |
| Leaf-wise growth | Splits the leaf with the highest loss reduction (vs level-wise) |
| Histogram-based splitting | Bins continuous features so split finding costs O(bins × features) instead of O(data × features) |

Level-wise vs Leaf-wise

| | Level-wise (XGBoost default) | Leaf-wise (LightGBM default) |
| --- | --- | --- |
| Strategy | Grow all nodes at the same depth | Grow the node with the highest gain |
| Tree shape | Balanced | Potentially deep, asymmetric |
| Efficiency | More splits needed | Fewer splits for the same loss reduction |
| Overfitting | More controlled | Easier to overfit on small datasets |
| Key param | max_depth | num_leaves |

LightGBM on Small Datasets

Leaf-wise growth overfits easily on small datasets (< 10K rows), producing very deep, unbalanced trees. Always cap num_leaves (typically < 64) and min_child_samples (typically > 20).
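As a starting point, a parameter sketch that respects those caps (the values are rules of thumb for small data, not universal defaults):

```python
# Conservative LightGBM settings for a small (< 10K rows) dataset
small_data_params = {
    "objective": "binary",
    "num_leaves": 31,          # well under 64
    "min_child_samples": 25,   # > 20 blocks tiny, noise-fitting leaves
    "max_depth": 6,            # extra cap on leaf-wise depth
    "learning_rate": 0.05,
    "reg_lambda": 1.0,
}
```

Capping both num_leaves and max_depth together is deliberate: num_leaves limits total complexity while max_depth prevents the deep, lopsided trees leaf-wise growth can produce.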

CatBoost

CatBoost is particularly good at handling categorical features:

  • Ordered target encoding: uses ordered statistics to avoid target leakage
  • Symmetric trees: all nodes at the same depth use the same split → more regularized
  • No need for one-hot encoding: supports categorical features natively

When the data has many categorical features (e.g., user/item IDs in recommender systems), CatBoost is often more convenient than XGBoost/LightGBM.

Hyperparameter Tuning Guide

| Parameter | XGBoost | LightGBM | Effect |
| --- | --- | --- | --- |
| Learning rate | eta (0.01-0.3) | learning_rate | Smaller is more stable but needs more trees |
| Tree depth | max_depth (3-10) | max_depth (-1 = unlimited) | Controls interaction depth |
| Leaf count | max_leaves | num_leaves (31) | LightGBM's main complexity control |
| Min samples | min_child_weight | min_child_samples (20) | Prevents tiny leaves |
| Row sampling | subsample (0.5-1.0) | bagging_fraction | Reduces variance, like bagging |
| Col sampling | colsample_bytree (0.5-1.0) | feature_fraction | Like RF's feature subsampling |
| L1 reg | alpha | reg_alpha | Feature selection effect |
| L2 reg | lambda | reg_lambda | Shrinks leaf weights |

Tuning Strategy

A good answer when asked about tuning strategy in an interview: "First fix learning_rate=0.1 with a large n_estimators (using early stopping), then tune tree structure (max_depth / num_leaves), then tune regularization (subsample, colsample_bytree, reg_alpha/lambda), and finally lower the learning_rate while increasing n_estimators."

Feature Importance

Three Methods Compared

| Method | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Impurity-based (MDI) | Sum of impurity reduction across all splits | Fast, built-in | Biased toward high-cardinality features |
| Permutation | Shuffle feature, measure performance drop | Model-agnostic, reliable | Slow, affected by correlated features |
| SHAP | Shapley values from game theory | Theoretically grounded, per-prediction | Computationally expensive |

SHAP Values

SHAP (SHapley Additive exPlanations) provides a theoretically grounded attribution based on cooperative game theory:

$$\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[f(S \cup \{j\}) - f(S)\right]$$

Key properties:

  • Local accuracy: $\sum_j \phi_j = f(x) - E[f(x)]$ (the SHAP values sum to the model prediction minus the baseline)
  • Consistency: If a feature's contribution increases, its SHAP value won't decrease
  • TreeSHAP: Exact SHAP computation in polynomial time for tree models (far faster than brute force)
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: global feature importance + direction
shap.summary_plot(shap_values, X_test)

# Force plot: explain a single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

SHAP in Interviews

Being able to name SHAP's three properties (local accuracy, consistency, missingness) and explain how it differs from impurity importance is a strong interview signal. In particular, TreeSHAP computes exact values on tree models; no approximation is needed.

Real-World Use Cases

Case 1: Credit Card Fraud Detection

Why trees work well: Fraud patterns often involve complex feature interactions (high amount + cross-border + new card → high risk). Trees capture interactions naturally, with no manual engineering needed.

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    is_unbalance=True,  # handles class imbalance
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50)],  # early stopping via callback (LightGBM >= 4)
)

Interview follow-ups:

  • "Why a GBM instead of logistic regression?" Fraud patterns are non-linear and involve interactions; logistic regression would need those interactions engineered by hand
  • "How do you handle class imbalance?" is_unbalance=True (automatic reweighting), scale_pos_weight, or sampling (SMOTE)
  • "How do you explain the model's decisions?" SHAP values can explain why each transaction was flagged as fraud

Case 2: House Price Prediction

Why trees work well: The relationship between price and features is non-linear (increasing floor area from 50 to 60 has a different effect than from 150 to 160), with interactions (good school district + large floor area is not additive).

Interview follow-ups:

  • "Tree model vs linear regression?" If you care about interpretability (coefficient = price per unit of floor area), use linear. If you care about predictive accuracy, use a GBM.
  • "How do you read feature importance?" Impurity importance may over-rank high-cardinality location features (like zip code). Permutation importance or SHAP is more reliable.

Case 3: Uplift Modeling After Customer Segmentation

The marketing team wants to know: which customers does a discount coupon actually work on?

Uplift modeling: instead of predicting "will this customer buy", predict "how much does the coupon increase the purchase probability".

# Two-model approach (T-learner)
from sklearn.ensemble import GradientBoostingClassifier

# Model 1: P(purchase | treated)
model_t = GradientBoostingClassifier().fit(X_treated, y_treated)

# Model 2: P(purchase | not treated)
model_c = GradientBoostingClassifier().fit(X_control, y_control)

# Uplift = P(purchase|treated) - P(purchase|control)
uplift = model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]
# Target customers with highest uplift (not highest purchase probability!)

Send coupons only to the customers with the highest uplift → maximize ROI. Targeting the customers "most likely to buy" is the wrong move; they might buy anyway without a coupon.

Hands-on: Trees & Ensembles in Python

Decision Tree vs Random Forest vs Gradient Boosting

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10)

models = {
    "Decision Tree (unlimited)": DecisionTreeClassifier(),
    "Decision Tree (depth=5)":   DecisionTreeClassifier(max_depth=5),
    "Random Forest (100 trees)": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting (100)":   GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1
    ),
}
for name, model in models.items():
    cv = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    model.fit(X, y)
    print(f"{name}: CV={cv.mean():.3f}, Train={model.score(X, y):.3f}")

# Unpruned tree: Train=1.000, CV~0.85 → overfitting
# Random Forest: reduces variance via bagging + feature subsampling
# Gradient Boosting: highest CV, reduces bias sequentially

Feature Importance: Impurity vs Permutation

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# Impurity-based (biased toward high-cardinality)
imp_impurity = rf.feature_importances_

# Permutation (more reliable)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10)
imp_perm = perm.importances_mean

# Compare: high-cardinality noise features rank high in impurity
# but low in permutation → always prefer permutation or SHAP

LightGBM with Early Stopping

import lightgbm as lgb

dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)

params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "min_child_samples": 20,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
    "verbose": -1,
}

model = lgb.train(
    params, dtrain,
    num_boost_round=2000,
    valid_sets=[dval],
    callbacks=[lgb.early_stopping(50)],
)
# Early stopping finds the optimal number of trees automatically

Interview Signals

What interviewers listen for:

  • You can explain the core mechanisms: bagging reduces variance, boosting reduces bias
  • You know the pros, cons, and appropriate use cases of RF vs GBM
  • You can state the key differences between XGBoost / LightGBM / CatBoost
  • You understand the tradeoffs of the three feature importance methods (impurity / permutation / SHAP)
  • You know how to tune hyperparameters systematically

Practice

Flashcards


Why does Random Forest add feature subsampling on top of bagging?

Without it, all trees split on the same dominant features → trees are correlated → averaging doesn't reduce variance much. Feature subsampling decorrelates trees, making the ensemble more effective. Classification: m≈√p, Regression: m≈p/3.


Quiz


Noisy dataset with many mislabeled examples. Which model is more likely to overfit to the noise?
