Feature Engineering
Interview Context
Feature engineering is a core skill for a data scientist; model quality often depends more on the features than on the algorithm. Common interview questions: "How do you handle categorical features?", "How do you handle missing data?", "How do you do feature selection?" You should be able to combine domain knowledge to build features that make business sense.
What You Should Understand
- Know which encoding methods fit which scenarios, and their pitfalls
- Understand how scaling affects different models (which need it, which don't)
- Recognize the missing-data mechanisms (MCAR/MAR/MNAR) and choose the right imputation strategy
- Master the three families of feature selection methods and their tradeoffs
- Design time-based, aggregation, and interaction features from domain knowledge
Categorical Encoding
One-Hot Encoding
Each category becomes a binary column:
# city: [NYC, SF, LA] → city_NYC, city_SF, city_LA
pd.get_dummies(df, columns=["city"], drop_first=True)
# drop_first=True: avoid dummy variable trap (perfect multicollinearity)
| Pros | Cons |
|---|---|
| No ordinal assumption | High-cardinality → dimension explosion |
| Works with all models | Sparse matrix (most values are 0) |
| Easy to interpret | 10K categories = 10K new columns |
Dummy Variable Trap
K categories need only K-1 dummies. Using all K creates perfect multicollinearity (any one column can be derived from the others). Linear models need drop_first=True; tree-based models are unaffected.
Label Encoding
Map each category to an integer: NYC→0, SF→1, LA→2.
| Pros | Cons |
|---|---|
| Memory efficient (1 column) | Introduces a fake ordinal relationship (the model thinks LA > SF > NYC) |
| Required by some models (LightGBM categorical) | Inappropriate for linear/distance-based models |
Use label encoding only with tree-based models: trees make threshold splits (e.g. x ≤ 1.5) that depend only on ordering, so they are not misled by the arbitrary numbers. Linear models will be misled by the fake ordinal relationship.
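A minimal label-encoding sketch with pandas (the `city` column is illustrative); `pd.factorize` assigns integer codes in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "SF", "LA", "SF", "NYC"]})

# Map each category to an integer code; the numbers carry no real order
codes, uniques = pd.factorize(df["city"])
df["city_label"] = codes

print(df["city_label"].tolist())  # [0, 1, 2, 1, 0]
print(list(uniques))              # ['NYC', 'SF', 'LA']
```

A linear model would treat these codes as magnitudes (LA "twice" SF), which is exactly the fake ordinality problem above.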
Target Encoding (Mean Encoding)
Replace category with the mean of the target for that category:
# city → average house price in that city
city_mean = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_mean)
| Pros | Cons |
|---|---|
| Single column (no dimension explosion) | Target leakage: the encoding contains target info |
| Captures the target relationship | Rare categories → noisy estimates |
| Works for high cardinality | Overfitting (the training-set mean differs from the test set's) |
Mitigating leakage:
- Leave-one-out: Calculate mean excluding the current row
- K-Fold: Calculate the mean using only out-of-fold data (the same idea as CV)
- Smoothing: Blend the category mean with the global mean (Bayesian shrinkage)
Smoothed encoding: encoding = (n × category_mean + m × global_mean) / (n + m), where m is the smoothing parameter and n is the category's sample size. When n is small, the encoding moves toward the global mean (avoiding noisy estimates). CatBoost's ordered target encoding extends this idea.
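A sketch combining two of the mitigations above, K-fold out-of-fold means plus smoothing (the function name, `m=10.0`, and the toy data are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col, target, n_splits=5, m=10.0, seed=42):
    """Encode `col` with out-of-fold smoothed target means to limit leakage."""
    global_mean = df[target].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Bayesian shrinkage: small categories are pulled toward the global mean
        smooth = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
        vals = df.iloc[val_idx][col].map(smooth).fillna(global_mean)
        encoded.iloc[val_idx] = vals.values  # unseen categories fall back to global mean
    return encoded

df = pd.DataFrame({
    "city": ["NYC", "SF", "LA", "NYC", "SF", "LA", "NYC", "SF", "LA", "NYC"],
    "price": [10, 20, 15, 12, 22, 14, 11, 21, 16, 13],
})
df["city_encoded"] = kfold_target_encode(df, "city", "price")
```

Each row's encoding never sees its own fold's target values, which is the point of the K-fold scheme.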
Hash Encoding (Feature Hashing)
Map categories to a fixed-size vector via hash function:
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=100, input_type="string")
X_hashed = hasher.transform(df["city"].apply(lambda c: [c]))  # each sample must be an iterable of strings
# 10K unique cities → fixed 100 columns (with hash collisions)
Suited to ultra-high cardinality (user/item IDs at the millions level). Drawbacks: hash collisions conflate different categories, and the mapping is irreversible (you cannot recover the original category from the hash).
Encoding Selection Guide
| Method | Cardinality | Model Type | Leakage Risk |
|---|---|---|---|
| One-Hot | Low (< 20) | All models | None |
| Label | Any | Tree-based only | None |
| Target | Medium-High | All (with care) | High — must use K-fold or LOO |
| Hash | Very high (> 10K) | All models | None |
| Ordinal | Naturally ordered | All models | None (if truly ordinal) |
Feature Scaling
Methods
| Method | Formula | When to Use |
|---|---|---|
| StandardScaler | (x − mean) / std → mean=0, std=1 | SVM, KNN, PCA, Neural Networks, Logistic Regression |
| MinMaxScaler | (x − min) / (max − min) → [0, 1] | Neural Networks (when a bounded input is needed), image pixels |
| RobustScaler | (x − median) / IQR | When there are outliers (median/IQR are unaffected by extreme values) |
| Log transform | log(1 + x) | Right-skewed data (income, prices, counts) |
| No scaling | — | Tree-based models (RF, GBM, XGBoost) |
Which Models Need Scaling?
| Needs Scaling | Doesn't Need Scaling |
|---|---|
| SVM (distance-based margin) | Decision Trees |
| KNN (distance-based) | Random Forest |
| PCA (variance-based) | Gradient Boosting (XGBoost, LightGBM) |
| Neural Networks (gradient-based) | Naive Bayes |
| Linear/Logistic Regression (gradient + regularization) | Rule-based models |
Classic Interview Question
"Why don't tree-based models need scaling?" Trees use threshold splits (e.g. x < t), which look only at the ordering of feature values, not their magnitude. Scaling doesn't change the ordering, so it doesn't affect the splits. Distance-based and gradient-based methods are affected by scale, because magnitude enters directly into distance computations or gradient updates.
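This invariance can be checked directly: a decision tree trained on raw features and one trained on standardized features make identical predictions (synthetic data for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 0.01]  # wildly different feature scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Standardization is monotonic per feature → same split partitions → same predictions
same = (tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all()
print(same)  # True
```

Running the same comparison with KNN or SVM would show the predictions diverge, since scale enters the distance computation there.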
Missing Data
Missing Mechanisms
| Mechanism | Definition | Example | Handling |
|---|---|---|---|
| MCAR | Missing completely at random — unrelated to any variable | Sensor randomly fails | Any imputation OK; listwise deletion OK |
| MAR | Missing depends on observed variables | Highly educated people report income more often | Model-based imputation using other features |
| MNAR | Missing depends on the missing value itself | High earners don't report income | Hardest to handle; needs domain knowledge or sensitivity analysis |
Imputation Strategies
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Drop rows | Remove rows with missing | Simple | Loses data, biased if not MCAR |
| Drop columns | Remove feature if > X% missing | Simple | Loses potentially useful feature |
| Mean/Median | Replace with column mean or median | Simple, fast | Ignores relationships, reduces variance |
| Mode | For categorical features | Simple | Ignores relationships |
| KNN Imputer | Use K nearest neighbors' values | Captures local patterns | Slow, sensitive to K and scale |
| Iterative (MICE) | Multiple imputation by chained equations | Most principled, captures uncertainty | Complex, slow |
| Model-based | Train a model to predict missing values | Captures complex patterns | Risk of leakage if not done carefully |
| Indicator variable | Add binary flag is_missing + simple impute | Preserves missingness signal | Doubles feature count |
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Simple: mean for numeric, most_frequent for categorical
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
# KNN: uses nearest neighbors
knn_imputer = KNNImputer(n_neighbors=5)
# MICE: iterative model-based
mice_imputer = IterativeImputer(max_iter=10, random_state=42)
Imputation Must Be Inside CV
Imputation statistics (mean, median) must be computed on the training fold only. Using the whole dataset's mean leaks test-fold information into training: data leakage. Use sklearn.pipeline.Pipeline to get this right.
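A minimal sketch of leakage-safe imputation: because the imputer lives inside the Pipeline, cross_val_score refits it on each training fold (synthetic data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # median computed per training fold
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)  # no test-fold statistics leak into training
```

Calling `SimpleImputer().fit_transform(X)` on the full dataset before splitting is the anti-pattern this avoids.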
Missing as a Feature
Sometimes missingness itself is a useful signal:
- "renovation year" missing → never renovated → the house is probably older
- "income" missing → possibly a high earner unwilling to disclose
- XGBoost/LightGBM support missing values natively; they learn a default split direction automatically
import numpy as np
# Add missing indicator before imputing
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
# Model can learn from both the imputed value AND the missingness pattern
Feature Selection
Three Paradigms
| Paradigm | How It Works | Examples | Speed | Quality |
|---|---|---|---|---|
| Filter | Score each feature independently by statistical test | Correlation, mutual information, chi-squared, ANOVA F | Fast | May miss interactions |
| Wrapper | Train model with different feature subsets, pick best | Forward/backward selection, recursive feature elimination | Slow | Captures interactions |
| Embedded | Model training automatically selects features | Lasso (L1), tree-based importance, ElasticNet | Medium | Built into training |
Filter Methods
from sklearn.feature_selection import mutual_info_classif, f_classif
# Mutual Information (non-linear relationships)
mi_scores = mutual_info_classif(X, y)
# ANOVA F-test (linear relationships, for classification)
f_scores, p_values = f_classif(X, y)
# Correlation with target (for regression)
correlations = X.corrwith(y).abs().sort_values(ascending=False)
The limitation of filter methods: they look at one feature's relationship with the target at a time and ignore feature interactions. Features A and B can each be unrelated to the target while A×B is highly useful (e.g. an XOR pattern).
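The XOR case makes this concrete: each raw feature has (near-)zero mutual information with the target, while an explicit interaction feature scores clearly above zero (synthetic data; the `a * b` interaction is illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
y = a ^ b  # target is the XOR of the two features

# Columns: a, b, and the interaction a*b (logical AND for binary features)
X = np.column_stack([a, b, a * b])
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
# mi[0], mi[1] ≈ 0: individually uninformative; mi[2] clearly > 0
```

A per-feature filter would discard a and b here, even though together they determine y exactly.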
Wrapper Methods
Recursive Feature Elimination (RFE):
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(
estimator=RandomForestClassifier(n_estimators=100),
n_features_to_select=20,
step=5, # remove 5 features per iteration
)
rfe.fit(X, y)
selected = X.columns[rfe.support_]
Each iteration removes the least important features, retrains, and removes again, until the target count remains. Drawback: very slow (a full retrain at every step).
Embedded Methods
from sklearn.linear_model import LassoCV
# Lasso automatically zeros out unimportant features
lasso = LassoCV(cv=5).fit(X, y)
selected = X.columns[lasso.coef_ != 0]
print(f"Selected {len(selected)} / {X.shape[1]} features")
# Tree-based importance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100).fit(X, y)
important = X.columns[rf.feature_importances_ > 0.01]
Feature Selection Advice for Interviews
A good answer framework: "I'd start with a correlation matrix and mutual information as filters (quickly dropping obviously useless features), then use Lasso or tree importance for embedded selection. If performance doesn't clearly improve, the features may already be good enough; no need to over-engineer."
Feature Transformation
Numeric Transformations
| Transform | When to Use | Effect |
|---|---|---|
| Log (log(1 + x)) | Right-skewed positive data | Reduces skewness, stabilizes variance |
| Square root (√x) | Count data, moderate skew | Milder than log |
| Box-Cox | General power transform (auto-selects the best λ) | Makes data more Gaussian |
| Yeo-Johnson | Like Box-Cox but supports negatives | More general |
| Binning | Continuous → categorical | Captures non-linear effects, loses info |
| Polynomial | Add x², x³, … terms | Captures non-linear relationships |
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson: auto power transform (works with negatives)
pt = PowerTransformer(method="yeo-johnson")
X_transformed = pt.fit_transform(X)
# Makes features more Gaussian-like
Feature Interaction
Explicitly create interaction features when domain knowledge suggests it:
# Domain knowledge: area × floor_count = total living space
df["total_space"] = df["area"] * df["floor_count"]
# Polynomial interactions (auto-generate)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
# Warning: p features → p + p*(p-1)/2 columns → can explode
Tree-based models learn interactions automatically (nested splits are implicit interactions); linear models need them added explicitly.
Domain-Specific Features
Time-Based Features
# From a datetime column
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["month"] = df["timestamp"].dt.month
df["is_holiday"] = df["date"].isin(holiday_dates).astype(int)
# Cyclical encoding for periodic features
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
# hour 23 and hour 0 are "close" — sin/cos captures this, label encoding doesn't
Cyclical Encoding
Hour 23 and hour 0 are only 1 hour apart, but label encoding treats them as 23 apart. Sin/cos encoding joins the ends of a periodic feature. The same applies to day_of_week, month, etc.
Aggregation Features
# User-level aggregations from transaction data
today = pd.Timestamp.today().normalize()  # reference date for the recency feature
user_agg = transactions.groupby("user_id").agg(
total_spend=("amount", "sum"),
avg_spend=("amount", "mean"),
max_spend=("amount", "max"),
transaction_count=("amount", "count"),
unique_merchants=("merchant_id", "nunique"),
days_since_last=("date", lambda x: (today - x.max()).days),
)
Lag Features (Time Series)
# Previous values as features
df["sales_lag_1"] = df.groupby("store_id")["sales"].shift(1)
df["sales_lag_7"] = df.groupby("store_id")["sales"].shift(7)
df["sales_rolling_7d_mean"] = (
df.groupby("store_id")["sales"]
.transform(lambda x: x.shift(1).rolling(7).mean())
)
# shift(1) → avoid leaking current value
Lag Feature Leakage
Rolling lag features computed without shift() use the current (and future) values: temporal leakage. Always shift(1) so that only past data is used. Cross-validation must also use TimeSeriesSplit (never random splits).
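A sketch of the temporal CV requirement with TimeSeriesSplit: every validation fold comes strictly after its training fold (a toy time-ordered series for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 observations in time order
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices: no future leakage
    assert train_idx.max() < val_idx.min()
    print(f"train up to t={train_idx.max()}, validate t={val_idx.min()}..{val_idx.max()}")
```

A random KFold split would scatter future rows into the training fold, inflating the score the same way an unshifted rolling feature does.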
Real-World Use Cases
Case 1: Credit Card Fraud Detection
| Feature Type | Examples |
|---|---|
| Transaction features | Amount, merchant category, is_international |
| Aggregation | avg_spend_last_7d, n_transactions_last_1h, max_amount_last_30d |
| Velocity | transactions_per_hour (a sudden burst of transactions → suspicious) |
| Distance | distance_from_home, distance_from_last_transaction |
| Time | hour_of_day, is_weekend, days_since_card_issued |
| Ratio | amount / avg_spend_last_30d (abnormally large transactions → high ratio) |
Interview follow-up: "Which features matter most?" Usually velocity features (transactions per hour) and ratio features (amount / historical average) carry the most signal. Domain knowledge (working with fraud analysts) is far more effective than blind feature generation.
Case 2: House Price Prediction
| Feature Type | Examples |
|---|---|
| Numeric | Area, rooms, bathrooms, floor, age |
| Categorical | Neighborhood (target encode or cluster), building_type |
| Derived | price_per_sqft (if you have similar properties), age = current_year - built_year |
| Geographic | distance_to_subway, distance_to_school, walk_score |
| Interaction | area × rooms (total living space), neighborhood × building_type |
| Transform | log(price) as target (right-skewed), log(area) |
Case 3: Customer Segmentation / Churn Prediction
| Feature Type | Examples |
|---|---|
| RFM | Recency, frequency, monetary |
| Engagement | sessions_last_30d, avg_session_duration, feature_X_usage_count |
| Behavioral | support_tickets_count, payment_failures, plan_downgrades |
| Lifecycle | account_age_days, days_since_last_login, trial_vs_paid |
| Trend | activity_this_month / activity_last_month(declining ratio → churn signal) |
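A small pandas sketch of the trend-ratio feature from the table above (column names and the +1 zero-guard are illustrative assumptions):

```python
import pandas as pd

activity = pd.DataFrame({
    "user_id": [1, 2, 3],
    "activity_this_month": [5, 40, 0],
    "activity_last_month": [50, 42, 10],
})
# Ratio well below 1 means declining engagement; +1 avoids division by zero
activity["trend_ratio"] = (
    (activity["activity_this_month"] + 1) / (activity["activity_last_month"] + 1)
)
# User 1 dropped from 50 to 5 sessions → low ratio → churn risk
```

Users 1 and 3 get ratios near 0 (sharp decline), user 2 near 1 (stable), which is exactly the ordering a churn model can exploit.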
Feature Engineering Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
numeric_features = ["age", "income", "transaction_count"]
categorical_features = ["city", "product_type"]
preprocessor = ColumnTransformer([
("num", Pipeline([
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler()),
]), numeric_features),
("cat", Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")),
("encoder", OneHotEncoder(handle_unknown="ignore")),
]), categorical_features),
])
model = Pipeline([
("preprocessor", preprocessor),
("classifier", GradientBoostingClassifier()),
])
# Entire pipeline: impute → scale/encode → train
# CV-safe: preprocessing happens inside each fold
Interview Signals
What interviewers listen for:
- You choose the right encoding and scaling for the model type
- You know target encoding's leakage risk and how to mitigate it
- You can identify the missing-data mechanism and choose the right imputation
- You combine domain knowledge to build meaningful features (not just blind polynomials)
- You know the feature engineering pipeline must run inside CV (to avoid leakage)
Practice
Flashcards
What is the dummy variable trap in one-hot encoding?
Using K dummies for K categories creates perfect multicollinearity (any one column = 1 - sum of the others). Linear models need drop_first=True (use only K-1 dummies). Tree-based models are unaffected.
Quiz
You have a 'city' feature with 5000 unique values. What is the biggest problem with one-hot encoding it?