Trees & Ensembles
Why This Matters
Gradient-boosted trees (XGBoost, LightGBM) are the kings of tabular data, dominating both industry applications and Kaggle. Interviewers expect you to explain how they work, when they fail, and how they differ from random forests.
Decision Trees
A decision tree recursively partitions the feature space by choosing the split that maximizes some purity criterion.
Splitting Criteria
Gini Impurity (used in CART):

$G = 1 - \sum_k p_k^2$

where $p_k$ is the proportion of class $k$ in the node. $G = 0$ means a pure node (only one class).

Entropy / Information Gain (used in ID3, C4.5):

$H = -\sum_k p_k \log_2 p_k$

Information gain is the parent's entropy minus the weighted average entropy of the children.
Variance reduction (regression trees): split that minimizes the weighted sum of variances in child nodes.
Gini vs Entropy
In practice the two produce nearly identical trees. Gini is slightly faster to compute (no logarithm). Interviewers care more that you understand the concept of purity-based splitting than which criterion you pick.
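Both criteria are easy to compute directly. A minimal sketch in plain Python (the `gini` and `entropy` helpers are illustrative, not from any library):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: sum of -p_k * log2(p_k) over class proportions p_k."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["a"] * 10               # one class only
mixed = ["a"] * 5 + ["b"] * 5   # 50/50 split

print(gini(pure), entropy(pure))    # 0.0 0.0  (pure node)
print(gini(mixed), entropy(mixed))  # 0.5 1.0  (maximally impure for 2 classes)
```

Both measures hit zero on a pure node and their maximum on an even split, which is why they rank candidate splits so similarly.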
Tree Construction
At each node, the algorithm:
- Considers all features and all possible split points
- Selects the split maximizing the chosen criterion
- Creates two child nodes
- Repeats until a stopping condition is met
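For a single numeric feature, the exhaustive split search can be sketched in a few lines (illustrative helper names; weighted Gini as the criterion):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Try every midpoint between sorted values; return (threshold, weighted_gini)."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal feature values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs[:i]]
        right = [y for x, y in pairs[i:]]
        n = len(pairs)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (thr, score)
    return best

# Perfectly separable toy data: class 0 below 2.5, class 1 above
print(best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # (2.5, 0.0)
```

A real implementation repeats this over every feature and picks the feature/threshold pair with the lowest weighted impurity, which is exactly where the greediness comes from.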
Advantages of Decision Trees
Before discussing ensembles, understand the strengths and weaknesses of a single tree:
| Advantage | Detail |
|---|---|
| Interpretable | You can read the decision rules directly; easy to explain to stakeholders |
| No feature scaling | No normalization/standardization needed (splits are threshold-based) |
| Handles mixed types | Handles numerical + categorical features naturally |
| Handles missing values | XGBoost/LightGBM support them natively (learn a default direction) |
| Captures interactions | Learns feature interactions automatically (nested splits) |
| Non-parametric | No linearity assumption required |
| Disadvantage | Detail |
|---|---|
| High variance | An unpruned tree overfits severely; slightly different data produces a completely different tree |
| Greedy algorithm | Each split is only locally optimal; the global optimum is not guaranteed |
| Biased toward high-cardinality | Features with many unique values offer more split opportunities |
| Axis-aligned splits | Only horizontal/vertical splits; weak at diagonal boundaries |
Controlling Overfitting
An unpruned tree will perfectly memorize training data. Control via:
| Hyperparameter | What It Does | Typical Range |
|---|---|---|
| max_depth | Limit tree height | 3-10 (boosting), None (RF) |
| min_samples_leaf | Minimum samples in terminal node | 1-20 |
| min_samples_split | Minimum samples to consider a split | 2-50 |
| max_features | Features considered per split | √p (clf), p/3 (reg) |
| ccp_alpha | Cost-complexity pruning parameter | Cross-validate |
Ensemble Methods
The key insight: combining many weak learners produces a strong learner. The two main strategies are bagging and boosting.
Bagging (Bootstrap Aggregating)
- Create bootstrap samples (sample with replacement)
- Train one model on each bootstrap sample
- Aggregate: average (regression) or majority vote (classification)
Bagging reduces variance without increasing bias. It works best with high-variance, low-bias base learners (like deep trees).
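The three steps above can be sketched generically. Here the "model" is just a stored class-majority, purely to keep the mechanics visible (any `fit`/`predict` pair would slot in):

```python
import random
from collections import Counter

random.seed(0)

def bootstrap(data):
    """Sample len(data) items with replacement."""
    return [random.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate classifier outputs: most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

# Toy "ensemble": each "model" predicts the majority class of its own bootstrap sample
data = ["spam"] * 60 + ["ham"] * 40
models = [majority_vote(bootstrap(data)) for _ in range(25)]
print(majority_vote(models))  # the ensemble's aggregated vote
```

With real trees, each one would be trained on its bootstrap sample and the final prediction would average (regression) or vote (classification) exactly as above.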
Random Forest
Random Forest = Bagging + random feature subsampling at each split.
At each split, only a random subset of $m$ features is considered:
- Classification: $m \approx \sqrt{p}$
- Regression: $m \approx p/3$
Why feature subsampling? If every tree made its first split on the same dominant feature, the trees would be highly correlated, and averaging correlated predictions reduces variance only a little. Feature subsampling decorrelates the trees, making the ensemble more effective.
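The standard identity behind this: for $n$ predictors with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/n$, which floors at $\rho\sigma^2$ no matter how many trees you add. A quick Monte Carlo check (stdlib only; equicorrelated Gaussians built from a shared component):

```python
import random
import statistics

random.seed(42)

def avg_of_correlated(n, rho, trials=8000):
    """Sample variance of the average of n standard normals with pairwise correlation rho."""
    means = []
    for _ in range(trials):
        shared = random.gauss(0, 1)
        vals = [rho**0.5 * shared + (1 - rho)**0.5 * random.gauss(0, 1)
                for _ in range(n)]
        means.append(sum(vals) / n)
    return statistics.pvariance(means)

rho = 0.6
for n in (1, 10, 100):
    print(n, round(avg_of_correlated(n, rho), 3))
# Theoretical values rho + (1 - rho)/n: 1.0, 0.64, 0.604 — simulation lands nearby
```

Going from 10 to 100 trees barely helps once $\rho$ dominates, which is exactly why Random Forest spends effort decorrelating the trees rather than just adding more of them.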
Out-of-Bag (OOB) Error
Each bootstrap sample leaves out ~37% of observations (since $(1 - 1/n)^n \to e^{-1} \approx 0.368$).
These out-of-bag samples provide free validation:
- Each tree predicts on the OOB samples it never saw
- Aggregating OOB predictions across all trees gives an OOB error ≈ cross-validation error
- No separate holdout set needed
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.4f}") # free validation!
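A quick sanity check of the ~37% figure with a single simulated bootstrap (stdlib only; the exact limit is $e^{-1} \approx 0.368$):

```python
import math
import random

random.seed(0)
n = 10000
drawn = {random.randrange(n) for _ in range(n)}  # one bootstrap sample of size n
oob_fraction = 1 - len(drawn) / n                 # observations never drawn
print(round(oob_fraction, 3), round(math.exp(-1), 3))  # both close to 0.368
```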
Boosting
Boosting builds models sequentially; each new model focuses on the errors of the ensemble so far.
Gradient Boosting fits each new tree to the negative gradient (pseudo-residuals) of the loss:

$r_i = -\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}$

For squared error loss $L = \frac{1}{2}(y - F(x))^2$, this gives $r_i = y_i - F(x_i)$: exactly the ordinary residuals.
For log loss (classification), the pseudo-residuals take a different form, but the core idea is the same: each new tree corrects the mistakes of the trees before it.
Boosting reduces bias: each tree corrects remaining errors. The risk is overfitting (especially with noisy labels), so the learning rate and number of trees must be controlled.
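The loop above can be made concrete with a from-scratch sketch: regression stumps fit to residuals under squared error. All helper names here are illustrative:

```python
def fit_stump(xs, ys):
    """Best single-threshold regression stump minimizing squared error."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold between equal values
        thr = (xs[order[k - 1]] + xs[order[k]]) / 2
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lmean) ** 2 for y in left) + sum((y - rmean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x < thr else rmean

def boost(xs, ys, n_trees=50, lr=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    f0 = sum(ys) / len(ys)        # initial prediction: the mean
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of 1/2(y-F)^2
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]   # a step function
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])  # converges toward [1, 1, 1, 5, 5, 5]
```

Each round shrinks the residuals by a factor of roughly $(1 - \text{lr})$ on this clean data, which is the bias-reduction mechanism in miniature.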
Bagging vs Boosting
| Aspect | Bagging (Random Forest) | Boosting (XGBoost/LightGBM) |
|---|---|---|
| Tree construction | Independent, parallel | Sequential, each corrects errors |
| What it reduces | Variance | Bias (primarily) |
| Base learner | Deep trees (high variance) | Shallow trees, depth 3-6 (high bias) |
| Overfitting risk | Low | Higher — needs careful tuning |
| Noise sensitivity | Robust | Can memorize noisy labels |
| Training speed | Parallelizable | Sequential (but optimized) |
| When to use first | You want a robust baseline with minimal tuning | You want the best possible performance and are willing to tune |
Classic Interview Question
"When would you use Random Forest vs Gradient Boosting?" RF is more robust, needs less tuning, and is less prone to overfitting; GBM has a higher ceiling but demands more tuning. With noisy data or limited time, reach for RF; when chasing the best possible performance and able to tune, reach for GBM.
XGBoost and LightGBM
XGBoost Innovations
XGBoost adds several key improvements to vanilla gradient boosting:
Regularized objective:

$\text{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$

where $T$ is the number of leaves and $w$ are the leaf weights. $\gamma$ penalizes tree complexity; $\lambda$ shrinks leaf weights.
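From this objective and the second-order Taylor expansion of the loss, each leaf's optimal weight and the gain of a candidate split have closed forms (standard XGBoost derivation, with $G$ and $H$ the sums of gradients and Hessians over the instances in each leaf or child):

```latex
w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad
\text{Gain} = \frac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is accepted only if the gain is positive, so $\gamma$ acts as a built-in pruning threshold.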
| Innovation | What It Does |
|---|---|
| Second-order approximation | Uses gradient + Hessian for better split finding |
| Regularized objective | L1/L2 on leaf weights prevents overfitting |
| Sparsity-aware splitting | Handles missing values by learning default direction |
| Column subsampling | Like RF, samples features per tree or per split |
| Weighted quantile sketch | Approximate split finding for large datasets |
LightGBM Innovations
LightGBM optimizes for speed and memory on large datasets:
| Innovation | What It Does |
|---|---|
| GOSS (Gradient-based One-Side Sampling) | Keeps large-gradient instances, samples small-gradient ones |
| EFB (Exclusive Feature Bundling) | Bundles mutually exclusive sparse features |
| Leaf-wise growth | Splits the leaf with highest loss reduction (vs level-wise) |
| Histogram-based splitting | Bins continuous features → split finding costs O(bins) instead of O(data) per feature |
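Histogram-based splitting can be sketched in plain Python: pre-bin each feature into a fixed number of buckets, after which candidate thresholds are bin edges only (illustrative helpers, stdlib only):

```python
def make_bins(values, n_bins=4):
    """Equal-frequency bin edges (quantile-style, roughly what LightGBM does)."""
    s = sorted(values)
    step = len(s) / n_bins
    return [s[int(i * step)] for i in range(1, n_bins)]

def to_bin(value, edges):
    """Map a raw value to its bin index: split search now scans bins, not raw values."""
    for i, e in enumerate(edges):
        if value < e:
            return i
    return len(edges)

values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.2]
edges = make_bins(values, n_bins=4)
print(edges, [to_bin(v, edges) for v in values])
```

Once data is binned, gradient/Hessian statistics are accumulated per bin, so evaluating every candidate split at a node costs O(bins) per feature regardless of dataset size.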
Level-wise vs Leaf-wise
| | Level-wise (XGBoost default) | Leaf-wise (LightGBM default) |
|---|---|---|
| Strategy | Grow all nodes at the same depth | Grow the node with the highest gain |
| Tree shape | Balanced | Potentially deep, asymmetric |
| Efficiency | More splits needed | Fewer splits for the same loss reduction |
| Overfitting | More controlled | More prone to overfitting on small datasets |
| Key param | max_depth | num_leaves |
LightGBM on Small Datasets
Leaf-wise growth overfits easily on small datasets (< 10K rows), producing very deep, unbalanced trees. Always constrain num_leaves (typically < 64) and min_child_samples (typically > 20).
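A hedged starting configuration following the rules of thumb above; treat the values as a starting point for tuning, not tuned defaults:

```python
# Conservative LightGBM settings for small datasets (illustrative starting point)
small_data_params = {
    "num_leaves": 31,          # well under the ~64 ceiling suggested above
    "min_child_samples": 25,   # > 20: forbid tiny leaves
    "max_depth": 6,            # extra cap on leaf-wise depth
    "learning_rate": 0.05,
    "reg_lambda": 1.0,         # L2 on leaf weights
}
```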
CatBoost
CatBoost is especially good at handling categorical features:
- Ordered target encoding: uses ordered statistics to avoid target leakage
- Symmetric trees: every node at the same depth uses the same split, which acts as regularization
- No need for one-hot encoding: categorical features are supported natively
When the data has many categorical features (e.g. user/item IDs in recommender systems), CatBoost is often more convenient than XGBoost/LightGBM.
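The ordered-encoding idea can be sketched in plain Python: process rows in order and encode each row's category using only the target mean of *earlier* rows in that category, so no row ever sees its own label (illustrative helper, with a simple prior acting as one pseudo-observation for unseen categories):

```python
def ordered_target_encode(categories, targets, prior=0.5):
    """Encode each row with the running target mean of its category,
    computed from preceding rows only (avoids target leakage)."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior) / (c + 1))  # smoothed running mean
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

cats = ["a", "a", "b", "a", "b"]
ys   = [1,   0,   1,   1,   0]
print(ordered_target_encode(cats, ys))
# The first "a" has no history, so it gets the prior; later rows see earlier labels only
```

CatBoost additionally averages over several random orderings to reduce the variance this introduces for early rows.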
Hyperparameter Tuning Guide
| Parameter | XGBoost | LightGBM | Effect |
|---|---|---|---|
| Learning rate | eta (0.01-0.3) | learning_rate | Smaller is more stable but needs more trees |
| Tree depth | max_depth (3-10) | max_depth (-1 = unlimited) | Controls interaction depth |
| Leaf count | max_leaves | num_leaves (31) | LightGBM's main complexity control |
| Min samples | min_child_weight | min_child_samples (20) | Prevents tiny leaves |
| Row sampling | subsample (0.5-1.0) | bagging_fraction | Reduces variance, like bagging |
| Col sampling | colsample_bytree (0.5-1.0) | feature_fraction | Like RF's feature subsampling |
| L1 reg | alpha | reg_alpha | Feature selection effect |
| L2 reg | lambda | reg_lambda | Shrinks leaf weights |
Tuning Strategy
A good answer when asked about tuning strategy in an interview: "First fix learning_rate=0.1 with a generous n_estimators (using early stopping), then tune tree structure (max_depth / num_leaves), then tune regularization (subsample, colsample_bytree, reg_alpha/lambda), and finally lower the learning_rate while increasing n_estimators."
Feature Importance
Three Methods Compared
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Impurity-based (MDI) | Sum of impurity reduction across all splits | Fast, built-in | Biased toward high-cardinality features |
| Permutation | Shuffle feature, measure performance drop | Model-agnostic, reliable | Slow, affected by correlated features |
| SHAP | Shapley values from game theory | Theoretically grounded, per-prediction | Computationally expensive |
SHAP Values
SHAP (SHapley Additive exPlanations) provides a theoretically grounded attribution based on cooperative game theory:

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$
Key properties:
- Local accuracy: $f(x) = \phi_0 + \sum_i \phi_i$ (the SHAP values plus the baseline add up to the model's prediction)
- Consistency: If a feature's contribution increases, its SHAP value won't decrease
- TreeSHAP: Exact SHAP computation in polynomial time for tree models (far faster than brute force)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Summary plot: global feature importance + direction
shap.summary_plot(shap_values, X_test)
# Force plot: explain a single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
SHAP in Interviews
Being able to state SHAP's three properties (local accuracy, consistency, missingness) and how it differs from impurity importance is a strong interview signal, especially the fact that TreeSHAP computes exact values on tree models with no approximation needed.
Real-World Use Cases
Case 1: Credit Card Fraud Detection
Why trees work well: Fraud patterns often involve complex feature interactions (large amount + cross-border + new card → high risk). Trees capture interactions naturally, with no manual feature engineering.
import lightgbm as lgb
model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    is_unbalance=True,  # handles class imbalance
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50)],  # stop when validation score stalls
)
Interview follow-ups:
- "Why a GBM instead of logistic regression?" Fraud patterns are non-linear and involve interactions, which logistic regression would need engineered by hand.
- "How do you handle class imbalance?" is_unbalance=True (automatic reweighting), scale_pos_weight, or resampling (e.g. SMOTE).
- "How do you explain the model's decisions?" SHAP values can explain why each individual transaction was flagged as fraud.
Case 2: House Price Prediction
Why trees work well: The relationship between price and features is non-linear (increasing floor area from 50 to 60 has a different effect than from 150 to 160), with interactions (good school district + large area is not additive).
Interview follow-ups:
- "Tree model vs linear regression?" If you care about interpretability (coefficient = price per unit of area), use linear. If you care about predictive accuracy, use a GBM.
- "How do you read feature importance?" Impurity importance can over-rank high-cardinality location features (like zip code). Permutation importance or SHAP is more reliable.
Case 3: Uplift Modeling After Customer Segmentation
The marketing team wants to know: if we send discount coupons, which customers will they influence most?
Uplift modeling: instead of predicting "will this customer buy?", predict "how much does the coupon increase their purchase probability?"
# Two-model approach (T-learner)
from sklearn.ensemble import GradientBoostingClassifier
# Model 1: P(purchase | treated)
model_t = GradientBoostingClassifier().fit(X_treated, y_treated)
# Model 2: P(purchase | not treated)
model_c = GradientBoostingClassifier().fit(X_control, y_control)
# Uplift = P(purchase|treated) - P(purchase|control)
uplift = model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]
# Target customers with highest uplift (not highest purchase probability!)
Send coupons only to the customers with the highest uplift to maximize ROI. Targeting the customers "most likely to buy" is wrong: they may well buy without a coupon anyway.
Hands-on: Trees & Ensembles in Python
Decision Tree vs Random Forest vs Gradient Boosting
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10)
models = {
"Decision Tree (unlimited)": DecisionTreeClassifier(),
"Decision Tree (depth=5)": DecisionTreeClassifier(max_depth=5),
"Random Forest (100 trees)": RandomForestClassifier(n_estimators=100),
"Gradient Boosting (100)": GradientBoostingClassifier(
n_estimators=100, max_depth=3, learning_rate=0.1
),
}
for name, model in models.items():
cv = cross_val_score(model, X, y, cv=5, scoring="accuracy")
model.fit(X, y)
print(f"{name}: CV={cv.mean():.3f}, Train={model.score(X, y):.3f}")
# Unpruned tree: Train=1.000, CV~0.85 → overfitting
# Random Forest: reduces variance via bagging + feature subsampling
# Gradient Boosting: highest CV, reduces bias sequentially
Feature Importance: Impurity vs Permutation
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# Impurity-based (biased toward high-cardinality)
imp_impurity = rf.feature_importances_
# Permutation (more reliable)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10)
imp_perm = perm.importances_mean
# Compare: high-cardinality noise features rank high in impurity
# but low in permutation → always prefer permutation or SHAP
LightGBM with Early Stopping
import lightgbm as lgb
dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
params = {
"objective": "binary",
"metric": "auc",
"learning_rate": 0.05,
"num_leaves": 31,
"min_child_samples": 20,
    "subsample": 0.8,
    "subsample_freq": 1,  # row sampling only takes effect with a positive frequency
"colsample_bytree": 0.8,
"reg_alpha": 0.1,
"reg_lambda": 1.0,
"verbose": -1,
}
model = lgb.train(
params, dtrain,
num_boost_round=2000,
valid_sets=[dval],
callbacks=[lgb.early_stopping(50)],
)
# Early stopping finds the optimal number of trees automatically
Interview Signals
What interviewers listen for:
- You can explain the core mechanisms: bagging reduces variance, boosting reduces bias
- You know the strengths, weaknesses, and use cases of RF vs GBM
- You can articulate the key differences among XGBoost / LightGBM / CatBoost
- You understand the tradeoffs among the three feature importance methods (impurity / permutation / SHAP)
- You know how to tune hyperparameters systematically
Practice
Flashcards
Why does Random Forest add feature subsampling on top of bagging?
Without it, all trees split on the same dominant features → trees are correlated → averaging doesn't reduce variance much. Feature subsampling decorrelates trees, making the ensemble more effective. Classification: m≈√p, Regression: m≈p/3.
Quiz
Noisy dataset with many mislabeled examples. Which model is more likely to overfit to the noise?