Ch 4 — Data & Feature Engineering

Garbage in, garbage out — how to collect, clean, transform, and split data for ML
High Level

Collect → Clean → Transform → Engineer → Split → Validate
Why Data Is Everything
Garbage in, garbage out — the most important rule in ML
The Reality
Data scientists spend 60–80% of their time on data preparation — not building models. The best algorithm in the world cannot overcome bad data. A simple model on clean, well-engineered data will outperform a complex model on messy data almost every time.
# The ML data pipeline
Raw Data → Collect → Clean (fix errors, handle missing) → Transform (scale, encode, normalize) → Engineer (create new features) → Split (train / val / test) → Validate (check for leakage) → Ready for training
Data Types
Structured: Tables with rows and columns (spreadsheets, databases). Each column is a feature. Most traditional ML uses structured data.

Unstructured: Images, text, audio, video. No predefined schema. Requires specialized preprocessing (tokenization for text, pixel normalization for images).

Semi-structured: JSON, XML, logs. Has some organization but no rigid table format.
Andrew Ng’s data-centric AI: Instead of iterating on models, iterate on data quality. Cleaning labels, removing noisy examples, and improving data consistency often yields bigger gains than switching algorithms. This shift — from model-centric to data-centric AI — is reshaping the field.
Data Collection & Quality
Where data comes from and what makes it good or bad
Data Sources
Databases & APIs: Company databases, REST APIs, web scraping
Public datasets: ImageNet (14M images), Common Crawl (petabytes of web text), Kaggle competitions
Sensors & IoT: Camera feeds, temperature sensors, GPS logs
User-generated: Reviews, clicks, purchases, social media
Synthetic: Generated data for training (GANs, simulation)
Common Quality Issues
Missing values: NaN, null, empty cells
Duplicates: Same record entered twice
Inconsistency: "USA", "US", "United States"
Outliers: Age = 999, salary = -$50K
Label noise: Mislabeled training examples
Selection bias: Data not representative of reality
Stale data: Trained on 2020 data, deployed in 2026
Handling Missing Values
# Strategy depends on context
Drop rows — When: few missing, large dataset. Risk: lose information.
Impute with mean/median — When: numerical, random missingness. Risk: reduces variance.
Impute with mode — When: categorical features. Risk: overrepresents the common value.
Predictive imputation — When: complex patterns in missingness. Method: train a model to predict the missing values.
Add "is_missing" indicator — When: missingness itself is informative. Example: missing income → unemployed?
Missing data is rarely random. A patient with no blood test result may be too sick to test. A customer with no purchase history may have churned. The pattern of missingness often carries signal — don’t just delete it blindly.
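The median-plus-indicator combination can be sketched in a few lines. The records below are made up for illustration; in practice this is `pandas.DataFrame.fillna` plus an indicator column.

```python
import statistics

# Toy records (assumed data): "income" is sometimes missing.
rows = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},   # missing income -> maybe unemployed?
    {"age": 41, "income": 67000},
    {"age": 38, "income": None},
    {"age": 25, "income": 48000},
]

# Median computed from observed values only
observed = [r["income"] for r in rows if r["income"] is not None]
median_income = statistics.median(observed)

for r in rows:
    # Keep the signal: record THAT the value was missing...
    r["income_missing"] = 1 if r["income"] is None else 0
    # ...then impute so downstream models get a complete column.
    if r["income"] is None:
        r["income"] = median_income
```

The indicator column lets the model learn from the missingness pattern even after imputation fills the gap.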
Data Cleaning
Fixing errors, removing noise, standardizing formats
Cleaning Checklist
1. Remove duplicates: Exact and near-duplicates (fuzzy matching)
2. Fix data types: Dates stored as strings, numbers as text
3. Standardize formats: “Jan 5, 2024” vs “2024-01-05” vs “1/5/24”
4. Handle outliers: Cap, remove, or transform extreme values
5. Correct labels: Audit a sample of labels for accuracy
6. Resolve inconsistencies: Merge “NYC” / “New York City” / “new york”
Outlier strategies: Not all outliers are errors. A $10M transaction might be fraud (remove it) or a legitimate whale customer (keep it). Domain knowledge determines the right approach. Common detection techniques are the z-score rule (|z| > 3) and the IQR method.
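Both detection rules fit in a few lines of standard-library Python. The transaction amounts are made up; the 3-sigma and 1.5×IQR thresholds are conventions, not laws.

```python
import statistics

amounts = [95, 98, 100, 102, 105, 107, 110, 112, 115, 118, 120, 9500]

# Z-score rule: flag values more than 3 standard deviations from the mean
mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)
z_outliers = [x for x in amounts if abs(x - mean) / std > 3]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in amounts if x < low or x > high]
```

Note that the IQR rule is more robust: a single extreme value inflates the mean and std that the z-score depends on, but barely moves the quartiles.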
Text Cleaning Pipeline
# NLP text preprocessing
Raw: "  The Cat SAT on 3 mats!! @home  "
1. Lowercase → "the cat sat on 3 mats!! @home"
2. Strip spaces → "the cat sat on 3 mats!! @home"
3. Remove punct → "the cat sat on 3 mats home"
4. Tokenize → ["the", "cat", "sat", "on", "3", "mats", "home"]
5. Remove stops → ["cat", "sat", "3", "mats", "home"]
6. Lemmatize → ["cat", "sit", "3", "mat", "home"]
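The steps above can be sketched with only the standard library. Real pipelines use NLTK or spaCy; the stopword set and "lemma" lookup here are toy stand-ins.

```python
import re

STOPWORDS = {"the", "on", "a", "an", "of"}      # toy stop list
LEMMAS = {"sat": "sit", "mats": "mat"}          # stand-in for a real lemmatizer

def clean(text: str) -> list[str]:
    text = text.lower().strip()                 # 1-2. lowercase, strip spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # 3. remove punctuation
    tokens = text.split()                       # 4. tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # 5. remove stopwords
    return [LEMMAS.get(t, t) for t in tokens]   # 6. lemmatize

tokens = clean("  The Cat SAT on 3 mats!! @home  ")
# -> ["cat", "sit", "3", "mat", "home"]
```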
Image Cleaning
Resize to uniform dimensions (224×224 typical)
Normalize pixel values to [0,1] or [-1,1]
Remove corrupted, blurry, or mislabeled images
Balance class representation across categories
Feature Transformation
Scaling, encoding, and normalizing features for ML algorithms
Why Transform?
ML algorithms are sensitive to feature scales. If “age” ranges 0–100 and “salary” ranges 30K–500K, the model will be dominated by salary. Scaling puts all features on comparable ranges so each contributes fairly.
# Numerical scaling methods
Min-Max Scaling (normalization): x_scaled = (x - min) / (max - min). Range: [0, 1]. Use: neural networks, image pixels.
Standardization (Z-score): x_scaled = (x - mean) / std. Centered at 0, std = 1. Use: SVM, logistic regression, PCA.
Log Transform: x_scaled = log(x + 1). Use: right-skewed data (income, prices).
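A minimal sketch of the three transforms in plain Python (production code would use sklearn's `MinMaxScaler` and `StandardScaler`); the salary list is made up.

```python
import math

def min_max(xs):
    """Rescale to [0, 1]: min maps to 0, max maps to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center at 0 with unit standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def log1p(xs):
    """Compress right-skewed values like income or prices."""
    return [math.log(x + 1) for x in xs]

salaries = [30_000, 60_000, 90_000, 500_000]
scaled = min_max(salaries)   # 30K -> 0.0, 500K -> 1.0
standardized = z_score(salaries)
```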
Encoding Categorical Features
# One-Hot Encoding
Color: [red, blue, green]
red → [1, 0, 0]; blue → [0, 1, 0]; green → [0, 0, 1]

# Label Encoding (ordinal)
Size: [S, M, L, XL]
S → 0, M → 1, L → 2, XL → 3

# Target Encoding (advanced)
City: replace with mean target value
NYC → 0.73 (mean price in NYC); LA → 0.61 (mean price in LA)
One-hot explosion: A feature with 10,000 unique values (like zip codes) creates 10,000 new columns. Use target encoding, embedding layers, or hash encoding for high-cardinality categoricals. Tree-based models handle categoricals natively.
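Toy versions of one-hot and target encoding, to make the mechanics concrete; sklearn's `OneHotEncoder` and category-encoder libraries are the production tools, and the city/sale data below is invented.

```python
def one_hot(value, categories):
    """One column per category; 1 where the value matches."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "blue", "green"]
blue_vec = one_hot("blue", colors)   # [0, 1, 0]

def target_encode(values, targets):
    """Replace each category with the mean target value seen for it."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    return {v: sums[v] / counts[v] for v in sums}

cities = ["NYC", "NYC", "LA", "LA"]
sold = [1, 0, 1, 1]                  # binary target
encoding = target_encode(cities, sold)   # {"NYC": 0.5, "LA": 1.0}
```

Target encoding must be fit on the training split only, for the same leakage reasons covered later in this chapter.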
Feature Engineering
Creating new features that capture domain knowledge
The Art of Feature Engineering
Feature engineering is creating new input variables from raw data that make patterns easier for models to learn. It’s where domain expertise meets ML — a skilled engineer who understands the problem can create features that dramatically improve model performance.
# Feature engineering examples
From dates: purchase_date → day_of_week, is_weekend, month, quarter, days_since_last
From text: email_body → word_count, has_urgency_words, num_links, caps_ratio
From location: lat, lon → distance_to_city_center, neighborhood_avg_income
Interactions: price, sqft → price_per_sqft; clicks, views → click_through_rate
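The date-derived features above take only a few lines with the standard library:

```python
from datetime import date

def date_features(d: date) -> dict:
    """Expand a single date into model-ready features."""
    return {
        "day_of_week": d.weekday(),          # Monday = 0 ... Sunday = 6
        "is_weekend": int(d.weekday() >= 5),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
    }

feats = date_features(date(2024, 1, 6))      # a Saturday
```

A raw timestamp is nearly useless to most models, but "is it a weekend?" is exactly the kind of pattern a retail-sales model needs.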
Feature Selection
Not all features help. Irrelevant or redundant features add noise and slow training. Feature selection removes the least useful features.
# Feature selection methods
Filter (fast, model-independent): correlation, mutual information, chi-squared. Remove features with low relevance to the target.
Wrapper (accurate, expensive): forward selection (add the best feature one at a time); backward elimination (remove the worst one at a time).
Embedded (built into model training): L1 regularization drives weights to zero; tree feature importance (Gini, information gain). Selects automatically during training.
Deep learning reduces manual feature engineering. CNNs learn image features automatically. Transformers learn text representations. But for tabular data, manual feature engineering still outperforms deep learning in most cases — domain knowledge remains irreplaceable.
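A filter method can be sketched as "rank features by absolute correlation with the target, keep the top k". The feature names and values below are invented; sklearn's `SelectKBest` does this at scale.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {
    "sqft":  [800, 1200, 1500, 2000],   # strongly related to price
    "noise": [3, 1, 4, 1],              # unrelated to price
}
price = [100, 150, 190, 250]

# Filter: rank features by |correlation| with the target
ranked = sorted(features,
                key=lambda f: abs(pearson(features[f], price)),
                reverse=True)
```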
Train / Validation / Test Split
The most important rule: never evaluate on data you trained on
The Three Splits
Training set (70–80%): The model learns from this data. Weights are updated based on training examples.

Validation set (10–15%): Used to tune hyperparameters and select the best model. The model never trains on this data, but you use it to make decisions.

Test set (10–15%): The final, untouched evaluation. Used once at the very end to report performance. Simulates real-world deployment.
# Typical split ratios
Small dataset (< 10K samples): Train 60% / Val 20% / Test 20%, plus K-fold cross-validation
Medium dataset (10K – 1M): Train 80% / Val 10% / Test 10%
Large dataset (> 1M): Train 98% / Val 1% / Test 1% (1% of 1M = 10K, still plenty)
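For non-temporal data, a shuffled three-way split (the 80/10/10 ratio from above) can be sketched as follows; sklearn's `train_test_split` is the usual tool.

```python
import random

def split(data, train=0.8, val=0.1, seed=0):
    """Shuffle, then cut into train / val / test by ratio."""
    data = data[:]                      # copy: don't mutate the caller's list
    random.Random(seed).shuffle(data)   # fixed seed -> reproducible split
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])     # remainder becomes the test set

train_set, val_set, test_set = split(list(range(100)))
# sizes: 80 / 10 / 10
```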
Time-series warning: Never randomly shuffle time-series data. Use chronological splits — train on past, validate on recent, test on most recent. Random splitting creates data leakage because the model sees “future” data during training.
Data Leakage & Class Imbalance
Two silent killers of ML projects
Data Leakage
Leakage occurs when information from outside the training set “leaks” into the model during training. The model appears to perform brilliantly in testing but fails completely in production.
Leakage: Scale the entire dataset, then split into train/test. Test-set statistics leak into training via the scaler's mean and std, so the model sees "future" information.

Correct: Split first, then fit the scaler on the training set only. Apply the same fitted scaler to the validation and test sets. No information leaks.
Common leakage sources: Preprocessing before splitting, using future data in time-series, including the target variable (or a proxy) as a feature, duplicate records across splits.
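The split-then-fit discipline looks like this in code. The tiny `Standardizer` below is a stand-in for sklearn's `StandardScaler` (same `fit` / `transform` pattern); the numbers are made up.

```python
import math

class Standardizer:
    """Minimal z-score scaler with the fit/transform pattern."""

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        self.std = math.sqrt(sum((x - self.mean) ** 2 for x in xs) / len(xs))
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

train, test = [10.0, 12.0, 14.0], [13.0, 15.0]

scaler = Standardizer().fit(train)     # statistics come from train ONLY
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuse train mean/std: no leakage
```

Fitting the scaler on `train + test` instead would bake test-set statistics into every training example, which is exactly the leakage described above.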
Class Imbalance
When one class vastly outnumbers another (99% legitimate transactions, 1% fraud), the model can achieve 99% accuracy by predicting “legitimate” for everything — while catching zero fraud.
# Handling class imbalance
Oversampling minority: SMOTE creates synthetic minority examples by interpolating between neighbors.
Undersampling majority: randomly remove majority-class examples. Risk: lose useful information.
Class weights: tell the model that misclassifying fraud costs 100x more than misclassifying legit.
Better metrics: use F1, precision, recall, or AUC-ROC instead of raw accuracy.
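Class weights are the cheapest fix to implement. The inverse-frequency formula below is the same idea behind sklearn's `class_weight="balanced"`; the 99/1 label split mirrors the fraud example above.

```python
def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n, classes = len(labels), set(labels)
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

labels = ["legit"] * 99 + ["fraud"] * 1
w = balanced_weights(labels)
# rare "fraud" examples get ~99x the weight of "legit" ones
```

These weights are passed into the loss function, so each fraud mistake moves the model roughly 99 times as much as a legit mistake.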
Data Augmentation & Modern Pipelines
Creating more data and scaling to production
Data Augmentation
When you can’t collect more data, create variations of existing data. This increases effective dataset size and improves generalization.
# Image augmentation
Geometric: rotate, flip, crop, zoom, shear
Color: brightness, contrast, saturation
Noise: Gaussian noise, blur, cutout
Advanced: MixUp, CutMix, random erasing

# Text augmentation
Synonym: replace words with synonyms
Back-translation: translate to French, then back to English
LLM-generated: use GPT to paraphrase examples

# Tabular augmentation
SMOTE: synthetic minority oversampling
Noise: add small random perturbations
Key Takeaways
1. Data quality matters more than model complexity

2. Always split before preprocessing (prevent leakage)

3. Scale numerical features; encode categorical features

4. Feature engineering encodes domain knowledge

5. Handle class imbalance with SMOTE, class weights, or better metrics

6. Augmentation creates more training data from what you have

7. Deep learning automates feature extraction for images and text, but tabular data still benefits from manual engineering
Coming up: Ch 5 introduces the perceptron — the simplest neural network — and shows how neurons transform input features into predictions. Ch 6 covers how training actually works (backpropagation, gradient descent in practice).