Ch 10 — Feature Engineering & The ML Pipeline

The art that separates good models from great ones — and when to reach for deep learning
Toolkit: Scaling → Encoding → Imputation → Create → Select → Pipeline → ColumnTransformer → Deep Learning?
Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler
Putting all features on the same playing field
Why Scaling Matters
A dataset with “age” (0–100) and “income” (0–1,000,000) will be dominated by income in any distance-based algorithm (KNN, SVM, K-Means). Scaling puts all features on equal footing.

StandardScaler: z = (x − μ) / σ. Transforms to mean=0, std=1. Best for normally distributed features. Used by logistic regression, SVMs, PCA, neural networks.

MinMaxScaler: x’ = (x − min) / (max − min). Scales to [0, 1]. Preserves zero entries in sparse data. Sensitive to outliers.

RobustScaler: Uses median and IQR instead of mean and std. Robust to outliers. x’ = (x − median) / IQR.

Tree-based models (Random Forest, XGBoost) don’t need scaling — they split on thresholds, not distances. But scaling never hurts and sometimes helps.
Scaling in Code
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler
)

# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)  # same params!

# MinMaxScaler: [0, 1]
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)

# RobustScaler: median/IQR (outlier-resistant)
rs = RobustScaler()
X_train_rs = rs.fit_transform(X_train)

# CRITICAL: fit on train, transform both.
# Never fit on test data — that's data leakage.

Which algorithms need scaling?
✓ LogisticRegression, SVM, KNN, K-Means, PCA
✗ DecisionTree, RandomForest, GradientBoosting
✓ Neural networks (always scale!)
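The contrast between StandardScaler and RobustScaler is easiest to see on toy numbers with a single outlier (values invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature, four inliers and one outlier at 500
X = np.array([[1.0], [2.0], [3.0], [4.0], [500.0]])

X_std = StandardScaler().fit_transform(X)
X_rob = RobustScaler().fit_transform(X)

# StandardScaler: the outlier inflates the std, squashing the inliers together
print(X_std.ravel().round(3))
# RobustScaler: median and IQR ignore the outlier, so inliers stay spread out
print(X_rob.ravel().round(3))  # [-1.  -0.5  0.   0.5  248.5]
```

Here the median is 3 and the IQR is 2, so RobustScaler maps the inliers to evenly spaced values while StandardScaler leaves them nearly indistinguishable.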
Key insight: Scaling is like converting currencies before comparing prices. A laptop costs 1,200 USD or 160,000 JPY — without conversion, the yen price “dominates.” StandardScaler converts all features to the same “currency” (standard deviations from the mean) so no feature unfairly dominates.
Encoding Categorical Features
Turning “red, blue, green” into numbers the model understands
Three Encoding Strategies
Ordinal Encoding: Map categories to integers. “small”=0, “medium”=1, “large”=2. Only use when there’s a natural order. If you encode “red”=0, “blue”=1, “green”=2, the model thinks green > blue > red, which is meaningless.

One-Hot Encoding: Create a binary column for each category. “color_red”=1/0, “color_blue”=1/0, “color_green”=1/0. No false ordering. But with 1,000 categories, you get 1,000 new columns (high dimensionality).

Target Encoding: Replace each category with the mean of the target variable for that category. “red” → 0.73 (mean price for red items). Powerful but risks data leakage — must be computed on training data only, ideally with cross-validation.
Encoding in Code
from sklearn.preprocessing import (
    OrdinalEncoder, OneHotEncoder
)

# Ordinal: for ordered categories
oe = OrdinalEncoder(
    categories=[['small', 'medium', 'large']]
)

# One-Hot: for unordered categories
ohe = OneHotEncoder(
    sparse_output=False,
    drop='first',            # avoid multicollinearity
    handle_unknown='ignore'
)

# drop='first' removes one column to avoid
# perfect multicollinearity (the dropped column
# is implied by all others being 0).

Decision guide:
• 2–5 categories, unordered → OneHotEncoder
• Ordered categories → OrdinalEncoder
• 100+ categories → TargetEncoder
• Tree-based models → OrdinalEncoder (trees handle arbitrary splits)
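Target encoding is named in the decision guide but not shown in code. A minimal leak-aware sketch in plain pandas (toy data, column names invented for illustration):

```python
import pandas as pd

# Toy training data: a hypothetical 'color' feature and binary target
train = pd.DataFrame({
    'color':  ['red', 'red', 'blue', 'blue', 'green'],
    'target': [1,     0,     1,      1,      0],
})

# Fit on TRAIN ONLY: mean target per category
means = train.groupby('color')['target'].mean()
global_mean = train['target'].mean()

# Apply to new data; unseen categories fall back to the global mean
new = pd.DataFrame({'color': ['red', 'green', 'purple']})
new['color_te'] = new['color'].map(means).fillna(global_mean)
print(new)  # red → 0.5, green → 0.0, purple → 0.6 (global mean)
```

scikit-learn 1.3+ also ships sklearn.preprocessing.TargetEncoder, which adds cross-fitting during fit_transform to further reduce leakage.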
Key insight: Encoding is like translating a foreign language for the model. One-hot encoding gives each word its own flashcard (precise but bulky). Ordinal encoding assigns numbers (compact but implies order). Target encoding summarizes each word by its average effect (powerful but needs care to avoid cheating).
Handling Missing Values
Imputation strategies — because real data is messy
Imputation Strategies
Simple imputation:
Mean/Median: Replace NaN with the column’s mean (for normal data) or median (for skewed data). Fast, but ignores relationships between features.
Most frequent: For categorical features, replace with the mode.
Constant: Replace with a fixed value (e.g., 0 or “missing”).

Advanced imputation:
KNN Imputer: Fill missing values using the K nearest neighbors’ values. Captures feature relationships but slow for large data.
Iterative Imputer: Models each feature with missing values as a function of other features. Essentially multiple regression imputation. Most accurate but slowest.

Add a missing indicator: Create a binary column “feature_was_missing” alongside the imputed value. Sometimes the fact that data is missing is informative (e.g., income not reported may correlate with low income).
Imputation in Code
from sklearn.impute import SimpleImputer, KNNImputer

# Simple: median for numeric
imp_median = SimpleImputer(strategy='median')

# Simple: most frequent for categorical
imp_mode = SimpleImputer(strategy='most_frequent')

# KNN: uses 5 nearest neighbors
imp_knn = KNNImputer(n_neighbors=5)

# Add missing indicator
imp_with_flag = SimpleImputer(
    strategy='median',
    add_indicator=True  # adds binary columns
)

Decision guide:
• Quick baseline → SimpleImputer(median)
• Few missing values → KNNImputer
• Missingness matters → add_indicator=True
• Tree models → can handle NaN natively (HistGradientBoosting)
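IterativeImputer is described above but absent from the code; it is still marked experimental in scikit-learn, so it needs an explicit enabling import. A minimal sketch on toy data (values invented so the pattern is obvious):

```python
import numpy as np
# IterativeImputer is experimental: this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Second column is roughly 2x the first; one value is missing
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])

# Each feature with NaNs is regressed on the others (default: BayesianRidge)
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled.round(2))  # fills near 8 (the 2x pattern), not the column mean 4
```

A SimpleImputer would fill the gap with the column mean (4.0); the iterative approach exploits the relationship between columns to get much closer to the true value.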
Key insight: Imputation is like filling in a crossword puzzle. Mean imputation fills every blank with the most common letter (fast but often wrong). KNN imputation looks at the surrounding clues to make an educated guess (slower but smarter). Sometimes the blank itself is a clue — that’s why adding a missing indicator can boost accuracy.
Feature Creation: The Highest-ROI Step
Domain knowledge + creativity = features the model can’t discover alone
Feature Engineering Techniques
Interaction features: Multiply features together. “bedrooms × sqft” captures “spaciousness per room” — a concept neither feature expresses alone.

Polynomial features: Add x², x³ to capture nonlinear relationships. Linear regression with polynomial features becomes polynomial regression.

Binning: Convert continuous to categorical. Age → “child/teen/adult/senior.” Useful when the relationship is step-wise, not smooth.

Log/sqrt transforms: Compress right-skewed distributions (income, population). log(income) is often more predictive than raw income because the difference between $50K and $100K matters more than the difference between $950K and $1M.

Date/time features: Extract day_of_week, month, hour, is_weekend, days_since_event from timestamps. A model can’t learn “weekends have higher sales” from a raw timestamp.
Feature Creation Examples
import numpy as np
import pandas as pd

# Interaction features
df['rooms_per_sqft'] = df['bedrooms'] / df['sqft']
df['price_per_sqft'] = df['price'] / df['sqft']

# Log transform (skewed features)
df['log_income'] = np.log1p(df['income'])

# Date features
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])
df['month'] = df['date'].dt.month

# Binning
df['age_group'] = pd.cut(df['age'],
    bins=[0, 18, 35, 55, 100],
    labels=['young', 'adult', 'middle', 'senior'])

# Good features often boost accuracy more
# than switching from LogReg to XGBoost.
Key insight: Feature engineering is where domain expertise meets ML. A data scientist who knows that “price per square foot” matters in real estate will build a better model than one who throws raw features at XGBoost. The best feature engineers ask: “What would a human expert look at?” and encode that knowledge as features.
Feature Selection: RFE, Lasso, and Importance
Fewer features = faster training, less overfitting, better interpretability
Three Approaches
1. Filter methods: Score each feature independently (correlation with target, mutual information, chi-squared). Fast but ignores feature interactions. Good for initial screening.

2. Wrapper methods (RFE): Recursive Feature Elimination trains a model, removes the least important feature, retrains, repeats. Considers feature interactions but slow (trains many models).

3. Embedded methods: Feature selection built into the model. Lasso zeros out unimportant features (L1 regularization). Tree importance ranks features by Gini reduction. Fast and considers interactions.

In practice, start with Lasso or tree importance for a quick ranking, then use RFE for fine-tuning if needed.
Feature Selection in Code
from sklearn.feature_selection import (
    RFE, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

# RFE: keep top 10 features
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50),
    n_features_to_select=10
)
rfe.fit(X_train, y_train)
X_selected = rfe.transform(X_train)
print(f"Selected: {rfe.support_}")

# Lasso: automatic feature selection
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
print(f"Features kept: {selector.get_support().sum()}")

# 100 features → 20 selected → same accuracy
# but 5x faster training and more interpretable.
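The filter methods from approach 1 are one-liners in scikit-learn; a sketch scoring features by mutual information on synthetic data (all names and numbers invented):

```python
from functools import partial
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=(n, 2))  # two features that drive the target
noise = rng.normal(size=(n, 3))   # three pure-noise features
X = np.hstack([signal, noise])
y = (signal[:, 0] + signal[:, 1] > 0).astype(int)

# Score each feature independently by mutual information, keep the top 2
mi = partial(mutual_info_classif, random_state=0)  # seed the MI estimator
selector = SelectKBest(mi, k=2)
X_top = selector.fit_transform(X, y)

print(selector.get_support())  # the two signal columns should be selected
```

Because each feature is scored on its own, this is fast even with thousands of columns, but it would miss a feature that only matters in combination with another.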
Key insight: Feature selection is like packing for a trip. You could bring everything (overfit), but a lighter suitcase (fewer features) is easier to carry (faster training), less likely to include things you don’t need (less noise), and you can actually find what you’re looking for (interpretability).
scikit-learn Pipeline: No Data Leakage
Chain preprocessing and model into one leak-proof object
Why Pipelines?
Without a pipeline, you might accidentally:
• Fit the scaler on the full dataset (leaks test statistics into training)
• Forget to apply the same transforms to test data
• Apply transforms in the wrong order

A Pipeline chains steps sequentially. When you call pipeline.fit(X_train, y_train), each step’s fit_transform() is called in order, feeding output to the next step. The final step calls fit().

When you call pipeline.predict(X_test), each step calls transform() (not fit_transform!) and the final step calls predict(). This guarantees no data leakage.

Pipelines also work with GridSearchCV — you can tune hyperparameters of any step using the stepname__param syntax.
Pipeline in Code
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Simple pipeline
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),
    SVC(kernel='rbf')
)

# fit: scaler.fit_transform → PCA.fit_transform → SVC.fit
pipe.fit(X_train, y_train)

# predict: scaler.transform → PCA.transform → SVC.predict
y_pred = pipe.predict(X_test)
score = pipe.score(X_test, y_test)

# GridSearch with pipeline
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, {
    'svc__C': [1, 10],
    'pca__n_components': [30, 50]
}, cv=5)
grid.fit(X_train, y_train)
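The leak-proof behavior extends to cross-validation: cross_val_score refits every pipeline step inside each training fold. A runnable sketch on scikit-learn's bundled breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# In each fold, the scaler is fit on that fold's training split only,
# then applied to its validation split: no test statistics leak in.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

Scaling X once up front and cross-validating the bare SVC would quietly leak validation-fold statistics into every training fold; wrapping both steps in the pipeline makes that mistake impossible.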
Key insight: A Pipeline is like an assembly line in a factory. Raw materials (data) enter one end, pass through stations (scaler, PCA, model) in order, and a finished product (prediction) comes out the other end. The assembly line guarantees every piece goes through every station in the right order — no shortcuts, no skipped steps, no contamination.
ColumnTransformer: Different Transforms for Different Features
Scale numbers, encode categories, impute missing — all in one step
The Problem
Real datasets have mixed types: numeric features need scaling, categorical features need encoding, and some features need imputation. You can’t apply StandardScaler to “color” or OneHotEncoder to “age.”

ColumnTransformer applies different transformations to different columns, then concatenates the results. It’s the key to building production-grade ML pipelines.

Combined with Pipeline, you get a single object that handles the entire workflow from raw data to prediction — including mixed types, missing values, and feature engineering.
ColumnTransformer Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import HistGradientBoostingClassifier

num_features = ['age', 'income', 'score']
cat_features = ['color', 'city', 'gender']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), num_features),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_features),
])

full_pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', HistGradientBoostingClassifier()),
])

full_pipe.fit(X_train, y_train)
print(f"Test: {full_pipe.score(X_test, y_test):.3f}")
Key insight: ColumnTransformer is like a hospital triage system. Numeric patients go to the scaling ward, categorical patients go to the encoding ward, and missing-value patients get imputation treatment. After treatment, everyone is combined into a single healthy dataset ready for the model. One pipeline handles everything.
When to Bridge to Deep Learning
Classic ML vs deep learning — the honest comparison
Classic ML Wins When…
Tabular data: Gradient boosted trees (XGBoost, LightGBM) consistently beat neural networks on structured/tabular data. Kaggle competitions confirm this year after year.

Small data (<10K samples): Deep learning needs massive data to avoid overfitting. Classic ML works well with hundreds or thousands of samples.

Interpretability required: Logistic regression coefficients, tree feature importance, and SHAP values are straightforward. Neural networks are black boxes.

Fast iteration: Train a Random Forest in seconds. Train a neural network in hours/days. For rapid prototyping, classic ML wins.

Limited compute: Classic ML runs on a laptop CPU. Deep learning often needs GPUs.
Deep Learning Wins When…
Images: CNNs dominate image classification, object detection, segmentation. No amount of feature engineering matches learned convolutional filters.

Text/NLP: Transformers (BERT, GPT) understand language context that TF-IDF + Naive Bayes cannot. For complex NLP tasks, deep learning is essential.

Audio/Video: Speech recognition, music generation, video understanding — all deep learning territory.

Massive data (100K+ samples): Neural networks keep improving with more data, while classic ML tends to plateau.

End-to-end learning: Deep learning learns features automatically. No manual feature engineering needed for images, text, or audio.
Key insight: Classic ML and deep learning are complementary tools, not competitors. For tabular data with <100K rows, gradient boosted trees are king. For images, text, and audio, deep learning is unbeatable. The best practitioners know both and pick the right tool for each job. This course gave you the classic ML toolkit — now you’re ready to add deep learning when the problem demands it.