Feature Engineering

Interview Context

Feature engineering is a core data scientist skill: model quality often depends more on the features than on the algorithm. Common interview questions include "How do you handle categorical features?", "How do you handle missing data?", and "How do you do feature selection?". You should be able to combine domain knowledge to design features that make business sense.

What You Should Understand

  • Know when each encoding method applies and what its pitfalls are
  • Understand how scaling affects different models (which need it, which don't)
  • Recognize missing-data mechanisms (MCAR/MAR/MNAR) and choose the right imputation strategy
  • Master the three families of feature selection methods and their tradeoffs
  • Design time-based, aggregation, and interaction features from domain knowledge

Categorical Encoding

One-Hot Encoding

Each category becomes a binary column:

# city: [NYC, SF, LA] → city_NYC, city_SF (drop_first drops city_LA)
pd.get_dummies(df, columns=["city"], drop_first=True)
# drop_first=True: avoids the dummy variable trap (perfect multicollinearity)
| Pros | Cons |
|---|---|
| No ordinal assumption | High cardinality → dimension explosion |
| Works with all models | Sparse matrix (most values are 0) |
| Easy to interpret | 10K categories = 10K new columns |

Dummy Variable Trap

$K$ categories need only $K-1$ dummies. Using all $K$ creates perfect multicollinearity (any one column can be derived from the others). Linear models need drop_first=True; tree-based models are unaffected.
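A minimal sketch of the trap (the `city` column here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "SF", "LA", "NYC"]})

full = pd.get_dummies(df, columns=["city"])                      # K = 3 dummies
reduced = pd.get_dummies(df, columns=["city"], drop_first=True)  # K - 1 = 2 dummies

# With all K dummies every row sums to 1, so any column equals
# 1 minus the sum of the others: perfect multicollinearity.
assert (full.sum(axis=1) == 1).all()
print(full.shape[1], reduced.shape[1])
```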

Label Encoding

Map each category to an integer: NYC→0, SF→1, LA→2.

| Pros | Cons |
|---|---|
| Memory efficient (1 column) | Introduces a fake ordinal relationship (the model thinks LA > SF > NYC) |
| Required by some models (LightGBM categorical) | Inappropriate for linear/distance-based models |

Use only with tree-based models: trees make threshold splits ($x \leq 1$) and care only about the ordering of values, not the ordinal meaning of the numbers. Linear models will be misled by the fake ordinal relationship.
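For reference, a sketch with scikit-learn's `LabelEncoder` (the city values are illustrative). Note the integer order is just alphabetical, which trees ignore but linear models would take literally:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["NYC", "SF", "LA", "SF", "NYC"]
le = LabelEncoder()
codes = le.fit_transform(cities)

# Classes are sorted alphabetically: LA -> 0, NYC -> 1, SF -> 2
print(list(le.classes_), codes.tolist())
```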

Target Encoding (Mean Encoding)

Replace category with the mean of the target for that category:

$\text{encode}(\text{city}) = E[y \mid \text{city}]$
# city → average house price in that city
city_mean = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_mean)
| Pros | Cons |
|---|---|
| Single column (no dimension explosion) | Target leakage: the encoding contains target info |
| Captures the target relationship | Rare categories → noisy estimates |
| Works for high cardinality | Overfitting (the training-set mean differs from the test set's) |

Mitigating leakage:

  • Leave-one-out: calculate the mean excluding the current row
  • K-Fold: calculate the mean using only out-of-fold data (same idea as cross-validation)
  • Smoothing: blend the category mean with the global mean (Bayesian shrinkage)
$\text{encoded} = \dfrac{n_c \cdot \bar{y}_c + m \cdot \bar{y}_{\text{global}}}{n_c + m}$

$m$ is the smoothing parameter. When a category's sample size $n_c$ is small, the encoding moves closer to the global mean (avoiding a noisy estimate). CatBoost's ordered target encoding is an extension of this idea.
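A minimal sketch of the smoothing formula (the column names and the helper are illustrative, not a standard API):

```python
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, m=10.0):
    """Blend each category mean with the global mean, weighted by sample size."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[cat_col].map(smoothed)

# A common category keeps roughly its own mean;
# a rare category shrinks toward the global mean
df = pd.DataFrame({"city": ["A"] * 100 + ["B"] * 2,
                   "price": [100.0] * 100 + [500.0] * 2})
enc = smoothed_target_encode(df, "city", "price")
print(enc.iloc[0], enc.iloc[-1])
```

Note this sketch still uses the full training set; combine it with K-fold or leave-one-out computation to control leakage.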

Hash Encoding (Feature Hashing)

Map categories to a fixed-size vector via hash function:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=100, input_type="string")
# with input_type="string", each sample must be an iterable of strings,
# so wrap each city in a single-element list
X_hashed = hasher.transform([[c] for c in df["city"]])
# 10K unique cities → fixed 100 columns (with hash collisions)

Suited to ultra-high cardinality (user/item IDs in the millions). Downsides: hash collisions conflate different categories, and the mapping is irreversible (you cannot recover the original category from the hash).

Encoding Selection Guide

| Method | Cardinality | Model Type | Leakage Risk |
|---|---|---|---|
| One-Hot | Low (< 20) | All models | None |
| Label | Any | Tree-based only | None |
| Target | Medium-high | All (with care) | High: must use K-fold or LOO |
| Hash | Very high (> 10K) | All models | None |
| Ordinal | Naturally ordered | All models | None (if truly ordinal) |

Feature Scaling

Methods

| Method | Formula | When to Use |
|---|---|---|
| StandardScaler | $\frac{x - \mu}{\sigma}$ → mean 0, std 1 | SVM, KNN, PCA, neural networks, logistic regression |
| MinMaxScaler | $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$ → [0, 1] | Neural networks (when bounded input is needed), image pixels |
| RobustScaler | $\frac{x - \text{median}}{\text{IQR}}$ | Data with outliers (median/IQR are robust to extreme values) |
| Log transform | $\log(1 + x)$ | Right-skewed data (income, prices, counts) |
| No scaling | | Tree-based models (RF, GBM, XGBoost) |
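To make the log-transform row concrete, a small check on synthetic right-skewed data (the skewness helper and seed are just for illustration):

```python
import numpy as np

def skewness(x):
    # Sample skewness: third standardized moment
    x = np.asarray(x, dtype=float)
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1.0, size=10_000)  # heavily right-skewed

print(round(skewness(income), 2))            # strongly positive
print(round(skewness(np.log1p(income)), 2))  # close to 0 after log(1 + x)
```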

Which Models Need Scaling?

| Needs Scaling | Doesn't Need Scaling |
|---|---|
| SVM (distance-based margin) | Decision Trees |
| KNN (distance-based) | Random Forest |
| PCA (variance-based) | Gradient Boosting (XGBoost, LightGBM) |
| Neural Networks (gradient-based) | Naive Bayes |
| Linear/Logistic Regression (gradient + regularization) | Rule-based models |

Classic Interview Question

"Why don't tree-based models need scaling?" Trees use threshold splits ($x \leq 5.3$) and look only at the ordering of feature values, not their magnitude. Scaling doesn't change the ordering, so it doesn't affect the splits. Distance-based and gradient-based methods are affected by scale, because magnitude enters directly into distance computations or gradient updates.
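This can be checked directly: fitting the same tree on raw and standardized copies of synthetic data yields identical predictions (a sketch, not a proof):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])  # wildly different scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Scaling preserves each feature's ordering, so the learned splits are equivalent
assert (tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all()
```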

Missing Data

Missing Mechanisms

| Mechanism | Definition | Example | Handling |
|---|---|---|---|
| MCAR | Missing completely at random; unrelated to any variable | A sensor fails randomly | Any imputation OK; listwise deletion OK |
| MAR | Missingness depends on observed variables | Highly educated respondents report income more often | Model-based imputation using other features |
| MNAR | Missingness depends on the missing value itself | High earners don't report income | Hardest to handle; needs domain knowledge or sensitivity analysis |

Imputation Strategies

| Strategy | How | Pros | Cons |
|---|---|---|---|
| Drop rows | Remove rows with missing values | Simple | Loses data; biased if not MCAR |
| Drop columns | Remove feature if > X% missing | Simple | Loses a potentially useful feature |
| Mean/Median | Replace with column mean or median | Simple, fast | Ignores relationships, reduces variance |
| Mode | For categorical features | Simple | Ignores relationships |
| KNN Imputer | Use K nearest neighbors' values | Captures local patterns | Slow; sensitive to K and scale |
| Iterative (MICE) | Multiple imputation by chained equations | Most principled; captures uncertainty | Complex, slow |
| Model-based | Train a model to predict missing values | Captures complex patterns | Risk of leakage if not done carefully |
| Indicator variable | Add binary flag is_missing + simple impute | Preserves the missingness signal | Doubles feature count |
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Simple: mean for numeric, most_frequent for categorical
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# KNN: uses nearest neighbors
knn_imputer = KNNImputer(n_neighbors=5)

# MICE: iterative model-based
mice_imputer = IterativeImputer(max_iter=10, random_state=42)

Imputation Must Be Inside CV

Imputation statistics (mean, median) must be computed from the training fold only. Using the whole dataset's mean leaks test-fold information into training (data leakage). Use sklearn.pipeline.Pipeline to get this right.
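A minimal illustration of the rule: fit the imputer on the training split only, then apply those statistics to the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                      # median comes from the training rows only
X_test_filled = imputer.transform(X_test)

# The test NaN is filled with the training median (3.0),
# not a statistic that has seen the test data
assert imputer.statistics_[0] == 3.0
assert X_test_filled[0, 0] == 3.0
```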

Missing as a Feature

Sometimes missingness itself is a useful signal:

  • "Renovation year" missing → never renovated → the house is probably older
  • "Income" missing → possibly high earners who don't want to disclose it
  • XGBoost/LightGBM support missing values natively, automatically learning a default split direction
import numpy as np

# Add missing indicator before imputing
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
# Model can learn from both the imputed value AND the missingness pattern

Feature Selection

Three Paradigms

| Paradigm | How It Works | Examples | Speed | Quality |
|---|---|---|---|---|
| Filter | Score each feature independently by statistical test | Correlation, mutual information, chi-squared, ANOVA F | Fast | May miss interactions |
| Wrapper | Train model with different feature subsets, pick best | Forward/backward selection, recursive feature elimination | Slow | Captures interactions |
| Embedded | Model training automatically selects features | Lasso (L1), tree-based importance, ElasticNet | Medium | Built into training |

Filter Methods

from sklearn.feature_selection import mutual_info_classif, f_classif

# Mutual Information (non-linear relationships)
mi_scores = mutual_info_classif(X, y)

# ANOVA F-test (linear relationships, for classification)
f_scores, p_values = f_classif(X, y)

# Correlation with target (for regression)
correlations = X.corrwith(y).abs().sort_values(ascending=False)

The limitation of filter methods: they evaluate one feature's relationship with the target at a time, ignoring feature interactions. Features A and B can each be unrelated to the target while A×B is highly informative (e.g., an XOR pattern).
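The XOR case can be verified with `sklearn.metrics.mutual_info_score`, which computes exact mutual information for discrete variables:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Balanced XOR data: y = x1 XOR x2
x1 = np.array([0, 0, 1, 1] * 100)
x2 = np.array([0, 1, 0, 1] * 100)
y = x1 ^ x2

# Each feature alone carries zero information about y
assert mutual_info_score(x1, y) < 1e-9
assert mutual_info_score(x2, y) < 1e-9

# The pair (x1, x2) determines y completely: MI = H(y) = ln 2 nats
assert mutual_info_score(x1 * 2 + x2, y) > 0.69
```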

Wrapper Methods

Recursive Feature Elimination (RFE):

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100),
    n_features_to_select=20,
    step=5,  # remove 5 features per iteration
)
rfe.fit(X, y)
selected = X.columns[rfe.support_]

Repeatedly remove the least important features, retrain, and remove again until the target number remains. Downside: very slow (a full retrain at every step).

Embedded Methods

from sklearn.linear_model import LassoCV

# Lasso automatically zeros out unimportant features
lasso = LassoCV(cv=5).fit(X, y)
selected = X.columns[lasso.coef_ != 0]
print(f"Selected {len(selected)} / {X.shape[1]} features")

# Tree-based importance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100).fit(X, y)
important = X.columns[rf.feature_importances_ > 0.01]

Feature Selection Advice for Interviews

A good answer framework: "I'd first filter with a correlation matrix and mutual information (quickly dropping clearly useless features), then do embedded selection with Lasso or tree importance. If performance doesn't clearly improve, the features are probably already good enough; there's no need to over-engineer."

Feature Transformation

Numeric Transformations

| Transform | When to Use | Effect |
|---|---|---|
| Log ($\log(1+x)$) | Right-skewed positive data | Reduces skewness, stabilizes variance |
| Square root ($\sqrt{x}$) | Count data, moderate skew | Milder than log |
| Box-Cox | General power transform (auto-selects best λ) | Makes data more Gaussian |
| Yeo-Johnson | Like Box-Cox but supports negatives | More general |
| Binning | Continuous → categorical | Captures non-linear effects, loses info |
| Polynomial | Add $x^2, x^3, x_1 \cdot x_2$ | Captures non-linear relationships |
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson: auto power transform (works with negatives)
pt = PowerTransformer(method="yeo-johnson")
X_transformed = pt.fit_transform(X)
# Makes features more Gaussian-like

Feature Interaction

Explicitly create interaction features when domain knowledge suggests it:

# Domain knowledge: area × floor_count = total living space
df["total_space"] = df["area"] * df["floor_count"]

# Polynomial interactions (auto-generate)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
# Warning: p features → p + p*(p-1)/2 columns → can explode

Tree-based models learn interactions automatically (nested splits are implicit interactions), but linear models need them added explicitly.

Domain-Specific Features

Time-Based Features

# From a datetime column
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["month"] = df["timestamp"].dt.month
df["is_holiday"] = df["date"].isin(holiday_dates).astype(int)

import numpy as np

# Cyclical encoding for periodic features
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
# hour 23 and hour 0 are "close" — sin/cos captures this, label encoding doesn't

Cyclical Encoding

Hour 23 and hour 0 are only 1 hour apart, but label encoding treats them as 23 apart. Sin/cos encoding joins the ends of a periodic feature into a circle. The same applies to day_of_week, month, etc.
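A sketch of the distance argument (the helper names are illustrative):

```python
import numpy as np

def encode_hour(h):
    # Map hour of day onto the unit circle
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

def dist(a, b):
    return float(np.linalg.norm(encode_hour(a) - encode_hour(b)))

# Hours 23 and 0 are adjacent on the circle; hours 0 and 12 are opposite
assert dist(23, 0) < dist(0, 12)
assert abs(dist(23, 0) - dist(0, 1)) < 1e-9  # both pairs are 1 hour apart
```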

Aggregation Features

# User-level aggregations from transaction data
user_agg = transactions.groupby("user_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    max_spend=("amount", "max"),
    transaction_count=("amount", "count"),
    unique_merchants=("merchant_id", "nunique"),
    days_since_last=("date", lambda x: (today - x.max()).days),
)

Lag Features (Time Series)

# Previous values as features
df["sales_lag_1"] = df.groupby("store_id")["sales"].shift(1)
df["sales_lag_7"] = df.groupby("store_id")["sales"].shift(7)
df["sales_rolling_7d_mean"] = (
    df.groupby("store_id")["sales"]
    .transform(lambda x: x.shift(1).rolling(7).mean())
)
# shift(1) → avoid leaking current value

Lag Feature Leakage

If lag/rolling features are computed without shift(), the window includes the current value (and, relative to prediction time, future values): temporal leakage. Always use shift(1) so only past data is used. Cross-validation must also use TimeSeriesSplit (never a random split).
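The difference is easy to see on a toy series: without `shift(1)` the window at time t includes the value at t itself.

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

leaky = sales.rolling(3).mean()          # window at t covers t-2 .. t (includes t)
safe = sales.shift(1).rolling(3).mean()  # window at t covers t-3 .. t-1 only

# At index 3: leaky = mean(20, 30, 40); safe = mean(10, 20, 30)
assert abs(leaky.iloc[3] - 30.0) < 1e-9
assert abs(safe.iloc[3] - 20.0) < 1e-9
```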

Real-World Use Cases

Case 1: Credit Card Fraud Detection

| Feature Type | Examples |
|---|---|
| Transaction features | Amount, merchant category, is_international |
| Aggregation | avg_spend_last_7d, n_transactions_last_1h, max_amount_last_30d |
| Velocity | transactions_per_hour (a sudden burst of transactions is suspicious) |
| Distance | distance_from_home, distance_from_last_transaction |
| Time | hour_of_day, is_weekend, days_since_card_issued |
| Ratio | amount / avg_spend_last_30d (abnormally large transactions have a high ratio) |

Interview follow-up: "Which features matter most?" Usually velocity features (transactions per hour) and ratio features (amount / historical average) carry the most signal. Domain knowledge (working with fraud analysts) beats blind feature generation by a wide margin.

Case 2: House Price Prediction

| Feature Type | Examples |
|---|---|
| Numeric | Area, rooms, bathrooms, floor, age |
| Categorical | Neighborhood (target encode or cluster), building_type |
| Derived | price_per_sqft (if you have similar properties), age = current_year - built_year |
| Geographic | distance_to_subway, distance_to_school, walk_score |
| Interaction | area × rooms (total living space), neighborhood × building_type |
| Transform | log(price) as target (right-skewed), log(area) |

Case 3: Customer Segmentation / Churn Prediction

| Feature Type | Examples |
|---|---|
| RFM | Recency, frequency, monetary |
| Engagement | sessions_last_30d, avg_session_duration, feature_X_usage_count |
| Behavioral | support_tickets_count, payment_failures, plan_downgrades |
| Lifecycle | account_age_days, days_since_last_login, trial_vs_paid |
| Trend | activity_this_month / activity_last_month (a declining ratio signals churn) |

Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

numeric_features = ["age", "income", "transaction_count"]
categorical_features = ["city", "product_type"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier()),
])
# Entire pipeline: impute → scale/encode → train
# CV-safe: preprocessing happens inside each fold

Interview Signals

What interviewers listen for:

  • You choose the right encoding and scaling for the model type
  • You know target encoding's leakage risk and its mitigations
  • You recognize the missing-data mechanism and pick the right imputation
  • You combine domain knowledge to build meaningful features (not just blind polynomials)
  • You know the feature engineering pipeline must run inside CV (to avoid leakage)

Practice

Flashcards


What is the dummy variable trap in one-hot encoding?

Using K dummies for K categories causes perfect multicollinearity (any one column = 1 - sum of the others). Linear models need drop_first=True (keep only K-1). Tree-based models are unaffected.


Quiz


You have a 'city' feature with 5000 unique values. What is the biggest problem with using one-hot encoding?
