Ch 10 — Feature Engineering & The ML Pipeline

The art that separates good models from great ones — and when to reach for deep learning
Toolkit: Scaling → Encoding → Imputation → Create → Select → Pipeline → ColumnTransformer → Deep Learning?
Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler
Putting all features on the same playing field
Why Scaling Matters
A dataset with “age” (0–100) and “income” (0–1,000,000) will be dominated by income in any distance-based algorithm (KNN, SVM, K-Means). Scaling puts all features on equal footing.

StandardScaler: z = (x − μ) / σ. Transforms to mean=0, std=1. Best for normally distributed features. Used by logistic regression, SVMs, PCA, neural networks.

MinMaxScaler: x’ = (x − min) / (max − min). Scales to [0, 1]. Preserves zero entries in sparse data. Sensitive to outliers.

RobustScaler: Uses median and IQR instead of mean and std. Robust to outliers. x’ = (x − median) / IQR.

Tree-based models (Random Forest, XGBoost) don’t need scaling — they split on thresholds, not distances. But scaling never hurts and sometimes helps.
Scaling in Code
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler
)

# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)  # same params!

# MinMaxScaler: [0, 1]
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)

# RobustScaler: median/IQR (outlier-resistant)
rs = RobustScaler()
X_train_rs = rs.fit_transform(X_train)

# CRITICAL: fit on train, transform both.
# Never fit on test data — that's data leakage.

Which algorithms need scaling?
✓ LogisticRegression, SVM, KNN, K-Means, PCA
✗ DecisionTree, RandomForest, GradientBoosting
✓ Neural networks (always scale!)
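The contrast between StandardScaler and RobustScaler is easiest to see on toy numbers with a single outlier (values invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature, four inliers and one outlier at 500
X = np.array([[1.0], [2.0], [3.0], [4.0], [500.0]])

X_std = StandardScaler().fit_transform(X)
X_rob = RobustScaler().fit_transform(X)

# StandardScaler: the outlier inflates the std, squashing the inliers together
print(X_std.ravel().round(3))
# RobustScaler: median and IQR ignore the outlier, so inliers stay spread out
print(X_rob.ravel().round(3))  # [-1.  -0.5  0.   0.5  248.5]
```

Here the median is 3 and the IQR is 2, so RobustScaler maps the inliers to evenly spaced values while StandardScaler leaves them nearly indistinguishable.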
Key insight: Scaling is like converting currencies before comparing prices. A laptop costs 1,200 USD or 160,000 JPY — without conversion, the yen price “dominates.” StandardScaler converts all features to the same “currency” (standard deviations from the mean) so no feature unfairly dominates.
Encoding Categorical Features
Turning “red, blue, green” into numbers the model understands
Three Encoding Strategies
Ordinal Encoding: Map categories to integers. “small”=0, “medium”=1, “large”=2. Only use when there’s a natural order. If you encode “red”=0, “blue”=1, “green”=2, the model thinks green > blue > red, which is meaningless.

One-Hot Encoding: Create a binary column for each category. “color_red”=1/0, “color_blue”=1/0, “color_green”=1/0. No false ordering. But with 1,000 categories, you get 1,000 new columns (high dimensionality).

Target Encoding: Replace each category with the mean of the target variable for that category. “red” → 0.73 (mean price for red items). Powerful but risks data leakage — must be computed on training data only, ideally with cross-validation.
Encoding in Code
from sklearn.preprocessing import (
    OrdinalEncoder, OneHotEncoder
)

# Ordinal: for ordered categories
oe = OrdinalEncoder(
    categories=[['small', 'medium', 'large']]
)

# One-Hot: for unordered categories
ohe = OneHotEncoder(
    sparse_output=False,
    drop='first',            # avoid multicollinearity
    handle_unknown='ignore'
)

# drop='first' removes one column to avoid
# perfect multicollinearity (the dropped column
# is implied by all others being 0).

Decision guide:
• 2–5 categories, unordered → OneHotEncoder
• Ordered categories → OrdinalEncoder
• 100+ categories → TargetEncoder
• Tree-based models → OrdinalEncoder (trees handle arbitrary splits)
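Target encoding is named in the decision guide but not shown in code. A minimal leak-aware sketch in plain pandas (toy data, column names invented for illustration):

```python
import pandas as pd

# Toy training data: a hypothetical 'color' feature and binary target
train = pd.DataFrame({
    'color':  ['red', 'red', 'blue', 'blue', 'green'],
    'target': [1,     0,     1,      1,      0],
})

# Fit on TRAIN ONLY: mean target per category
means = train.groupby('color')['target'].mean()
global_mean = train['target'].mean()

# Apply to new data; unseen categories fall back to the global mean
new = pd.DataFrame({'color': ['red', 'green', 'purple']})
new['color_te'] = new['color'].map(means).fillna(global_mean)
print(new)  # red → 0.5, green → 0.0, purple → 0.6 (global mean)
```

scikit-learn 1.3+ also ships sklearn.preprocessing.TargetEncoder, which adds cross-fitting during fit_transform to further reduce leakage.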
Key insight: Encoding is like translating a foreign language for the model. One-hot encoding gives each word its own flashcard (precise but bulky). Ordinal encoding assigns numbers (compact but implies order). Target encoding summarizes each word by its average effect (powerful but needs care to avoid cheating).
Handling Missing Values
Imputation strategies — because real data is messy
Imputation Strategies
Simple imputation:
Mean/Median: Replace NaN with the column’s mean (for normal data) or median (for skewed data). Fast, but ignores relationships between features.
Most frequent: For categorical features, replace with the mode.
Constant: Replace with a fixed value (e.g., 0 or “missing”).

Advanced imputation:
KNN Imputer: Fill missing values using the K nearest neighbors’ values. Captures feature relationships but slow for large data.
Iterative Imputer: Models each feature with missing values as a function of other features. Essentially multiple regression imputation. Most accurate but slowest.

Add a missing indicator: Create a binary column “feature_was_missing” alongside the imputed value. Sometimes the fact that data is missing is informative (e.g., income not reported may correlate with low income).
Imputation in Code
from sklearn.impute import SimpleImputer, KNNImputer

# Simple: median for numeric
imp_median = SimpleImputer(strategy='median')

# Simple: most frequent for categorical
imp_mode = SimpleImputer(strategy='most_frequent')

# KNN: uses 5 nearest neighbors
imp_knn = KNNImputer(n_neighbors=5)

# Add missing indicator
imp_with_flag = SimpleImputer(
    strategy='median',
    add_indicator=True  # adds binary columns
)

Decision guide:
• Quick baseline → SimpleImputer(median)
• Few missing values → KNNImputer
• Missingness matters → add_indicator=True
• Tree models → can handle NaN natively (HistGradientBoosting)
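IterativeImputer is described above but absent from the code; it is still marked experimental in scikit-learn, so it needs an explicit enabling import. A minimal sketch on toy data (values invented so the pattern is obvious):

```python
import numpy as np
# IterativeImputer is experimental: this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Second column is roughly 2x the first; one value is missing
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])

# Each feature with NaNs is regressed on the others (default: BayesianRidge)
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled.round(2))  # fills near 8 (the 2x pattern), not the column mean 4
```

A SimpleImputer would fill the gap with the column mean (4.0); the iterative approach exploits the relationship between columns to get much closer to the true value.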
Key insight: Imputation is like filling in a crossword puzzle. Mean imputation fills every blank with the most common letter (fast but often wrong). KNN imputation looks at the surrounding clues to make an educated guess (slower but smarter). Sometimes the blank itself is a clue — that’s why adding a missing indicator can boost accuracy.
Feature Creation: The Highest-ROI Step
Domain knowledge + creativity = features the model can’t discover alone
Feature Engineering Techniques
Interaction features: Multiply features together. “bedrooms × sqft” captures “spaciousness per room” — a concept neither feature expresses alone.

Polynomial features: Add x², x³ to capture nonlinear relationships. Linear regression with polynomial features becomes polynomial regression.

Binning: Convert continuous to categorical. Age → “child/teen/adult/senior.” Useful when the relationship is step-wise, not smooth.

Log/sqrt transforms: Compress right-skewed distributions (income, population). log(income) is often more predictive than raw income because the difference between $50K and $100K matters more than the difference between $950K and $1M.

Date/time features: Extract day_of_week, month, hour, is_weekend, days_since_event from timestamps. A model can’t learn “weekends have higher sales” from a raw timestamp.
Feature Creation Examples
import numpy as np
import pandas as pd

# Interaction features
df['rooms_per_sqft'] = df['bedrooms'] / df['sqft']
df['price_per_sqft'] = df['price'] / df['sqft']

# Log transform (skewed features)
df['log_income'] = np.log1p(df['income'])

# Date features
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])
df['month'] = df['date'].dt.month

# Binning
df['age_group'] = pd.cut(df['age'],
    bins=[0, 18, 35, 55, 100],
    labels=['young', 'adult', 'middle', 'senior'])

# Good features often boost accuracy more
# than switching from LogReg to XGBoost.
Key insight: Feature engineering is where domain expertise meets ML. A data scientist who knows that “price per square foot” matters in real estate will build a better model than one who throws raw features at XGBoost. The best feature engineers ask: “What would a human expert look at?” and encode that knowledge as features.
Feature Selection: RFE, Lasso, and Importance
Fewer features = faster training, less overfitting, better interpretability
Three Approaches
1. Filter methods: Score each feature independently (correlation with target, mutual information, chi-squared). Fast but ignores feature interactions. Good for initial screening.

2. Wrapper methods (RFE): Recursive Feature Elimination trains a model, removes the least important feature, retrains, repeats. Considers feature interactions but slow (trains many models).

3. Embedded methods: Feature selection built into the model. Lasso zeros out unimportant features (L1 regularization). Tree importance ranks features by Gini reduction. Fast and considers interactions.

In practice, start with Lasso or tree importance for a quick ranking, then use RFE for fine-tuning if needed.
Feature Selection in Code
from sklearn.feature_selection import (
    RFE, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

# RFE: keep top 10 features
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50),
    n_features_to_select=10
)
rfe.fit(X_train, y_train)
X_selected = rfe.transform(X_train)
print(f"Selected: {rfe.support_}")

# Lasso: automatic feature selection
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
print(f"Features kept: {selector.get_support().sum()}")

# 100 features → 20 selected → same accuracy
# but 5x faster training and more interpretable.
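The filter methods from approach 1 are one-liners in scikit-learn; a sketch scoring features by mutual information on synthetic data (all names and numbers invented):

```python
from functools import partial
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=(n, 2))  # two features that drive the target
noise = rng.normal(size=(n, 3))   # three pure-noise features
X = np.hstack([signal, noise])
y = (signal[:, 0] + signal[:, 1] > 0).astype(int)

# Score each feature independently by mutual information, keep the top 2
mi = partial(mutual_info_classif, random_state=0)  # seed the MI estimator
selector = SelectKBest(mi, k=2)
X_top = selector.fit_transform(X, y)

print(selector.get_support())  # the two signal columns should be selected
```

Because each feature is scored on its own, this is fast even with thousands of columns, but it would miss a feature that only matters in combination with another.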
Key insight: Feature selection is like packing for a trip. You could bring everything (overfit), but a lighter suitcase (fewer features) is easier to carry (faster training), less likely to include things you don’t need (less noise), and you can actually find what you’re looking for (interpretability).
scikit-learn Pipeline: No Data Leakage
Chain preprocessing and model into one leak-proof object
Why Pipelines?
Without a pipeline, you might accidentally:
• Fit the scaler on the full dataset (leaks test statistics into training)
• Forget to apply the same transforms to test data
• Apply transforms in the wrong order

A Pipeline chains steps sequentially. When you call pipeline.fit(X_train, y_train), each step’s fit_transform() is called in order, feeding output to the next step. The final step calls fit().

When you call pipeline.predict(X_test), each step calls transform() (not fit_transform!) and the final step calls predict(). This guarantees no data leakage.

Pipelines also work with GridSearchCV — you can tune hyperparameters of any step using the stepname__param syntax.
Pipeline in Code
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Simple pipeline
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),
    SVC(kernel='rbf')
)

# fit: scaler.fit_transform → PCA.fit_transform → SVC.fit
pipe.fit(X_train, y_train)

# predict: scaler.transform → PCA.transform → SVC.predict
y_pred = pipe.predict(X_test)
score = pipe.score(X_test, y_test)

# GridSearch with pipeline
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, {
    'svc__C': [1, 10],
    'pca__n_components': [30, 50]
}, cv=5)
grid.fit(X_train, y_train)
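The leak-proof behavior extends to cross-validation: cross_val_score refits every pipeline step inside each training fold. A runnable sketch on scikit-learn's bundled breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# In each fold, the scaler is fit on that fold's training split only,
# then applied to its validation split: no test statistics leak in.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

Scaling X once up front and cross-validating the bare SVC would quietly leak validation-fold statistics into every training fold; wrapping both steps in the pipeline makes that mistake impossible.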
Key insight: A Pipeline is like an assembly line in a factory. Raw materials (data) enter one end, pass through stations (scaler, PCA, model) in order, and a finished product (prediction) comes out the other end. The assembly line guarantees every piece goes through every station in the right order — no shortcuts, no skipped steps, no contamination.
ColumnTransformer: Different Transforms for Different Features
Scale numbers, encode categories, impute missing — all in one step
The Problem
Real datasets have mixed types: numeric features need scaling, categorical features need encoding, and some features need imputation. You can’t apply StandardScaler to “color” or OneHotEncoder to “age.”

ColumnTransformer applies different transformations to different columns, then concatenates the results. It’s the key to building production-grade ML pipelines.

Combined with Pipeline, you get a single object that handles the entire workflow from raw data to prediction — including mixed types, missing values, and feature engineering.
ColumnTransformer Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import HistGradientBoostingClassifier

num_features = ['age', 'income', 'score']
cat_features = ['color', 'city', 'gender']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), num_features),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_features),
])

full_pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', HistGradientBoostingClassifier()),
])

full_pipe.fit(X_train, y_train)
print(f"Test: {full_pipe.score(X_test, y_test):.3f}")
Key insight: ColumnTransformer is like a hospital triage system. Numeric patients go to the scaling ward, categorical patients go to the encoding ward, and missing-value patients get imputation treatment. After treatment, everyone is combined into a single healthy dataset ready for the model. One pipeline handles everything.
When to Bridge to Deep Learning
Classic ML vs deep learning — the honest comparison
Classic ML Wins When…
Tabular data: Gradient boosted trees (XGBoost, LightGBM) consistently beat neural networks on structured/tabular data. Kaggle competitions confirm this year after year.

Small data (<10K samples): Deep learning needs massive data to avoid overfitting. Classic ML works well with hundreds or thousands of samples.

Interpretability required: Logistic regression coefficients, tree feature importance, and SHAP values are straightforward. Neural networks are black boxes.

Fast iteration: Train a Random Forest in seconds. Train a neural network in hours/days. For rapid prototyping, classic ML wins.

Limited compute: Classic ML runs on a laptop CPU. Deep learning often needs GPUs.
Deep Learning Wins When…
Images: CNNs dominate image classification, object detection, segmentation. No amount of feature engineering matches learned convolutional filters.

Text/NLP: Transformers (BERT, GPT) understand language context that TF-IDF + Naive Bayes cannot. For complex NLP tasks, deep learning is essential.

Audio/Video: Speech recognition, music generation, video understanding — all deep learning territory.

Massive data (100K+ samples): Neural networks keep improving with more data, while classic ML tends to plateau.

End-to-end learning: Deep learning learns features automatically. No manual feature engineering needed for images, text, or audio.
Key insight: Classic ML and deep learning are complementary tools, not competitors. For tabular data with <100K rows, gradient boosted trees are king. For images, text, and audio, deep learning is unbeatable. The best practitioners know both and pick the right tool for each job. This course gave you the classic ML toolkit — now you’re ready to add deep learning when the problem demands it.