Ch 4 — Data & Feature Engineering

Garbage in, garbage out — how to collect, clean, transform, and split data for ML
High Level

Collect → Clean → Transform → Engineer → Split → Validate
Why Data Is Everything
Garbage in, garbage out — the most important rule in ML
The Reality
Data scientists spend 60–80% of their time on data preparation — not building models. The best algorithm in the world cannot overcome bad data. A simple model on clean, well-engineered data will outperform a complex model on messy data almost every time.
# The ML data pipeline
Raw Data → Collect → Clean (fix errors, handle missing) → Transform (scale, encode, normalize) → Engineer (create new features) → Split (train / val / test) → Validate (check for leakage) → Ready for training
Data Types
Structured: Tables with rows and columns (spreadsheets, databases). Each column is a feature. Most traditional ML uses structured data.

Unstructured: Images, text, audio, video. No predefined schema. Requires specialized preprocessing (tokenization for text, pixel normalization for images).

Semi-structured: JSON, XML, logs. Has some organization but no rigid table format.
Andrew Ng’s data-centric AI: Instead of iterating on models, iterate on data quality. Cleaning labels, removing noisy examples, and improving data consistency often yields bigger gains than switching algorithms. This shift — from model-centric to data-centric AI — is reshaping the field.
Data Collection & Quality
Where data comes from and what makes it good or bad
Data Sources
Databases & APIs: Company databases, REST APIs, web scraping
Public datasets: ImageNet (14M images), Common Crawl (petabytes of web text), Kaggle competitions
Sensors & IoT: Camera feeds, temperature sensors, GPS logs
User-generated: Reviews, clicks, purchases, social media
Synthetic: Generated data for training (GANs, simulation)
Common Quality Issues
Missing values: NaN, null, empty cells
Duplicates: Same record entered twice
Inconsistency: "USA", "US", "United States"
Outliers: Age = 999, salary = -$50K
Label noise: Mislabeled training examples
Selection bias: Data not representative of reality
Stale data: Trained on 2020 data, deployed in 2026
Handling Missing Values
# Strategy depends on context
Drop rows — When: few missing, large dataset. Risk: lose information.
Impute with mean/median — When: numerical, random missingness. Risk: reduces variance.
Impute with mode — When: categorical features. Risk: overrepresents the common value.
Predictive imputation — When: complex patterns in missingness. Method: train a model to predict the missing values.
Add "is_missing" indicator — When: missingness itself is informative. Example: missing income → unemployed?
Missing data is rarely random. A patient with no blood test result may be too sick to test. A customer with no purchase history may have churned. The pattern of missingness often carries signal — don’t just delete it blindly.
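The median-plus-indicator combination can be sketched in a few lines. The records below are made up for illustration; in practice this is `pandas.DataFrame.fillna` plus an indicator column.

```python
import statistics

# Toy records (assumed data): "income" is sometimes missing.
rows = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},   # missing income -> maybe unemployed?
    {"age": 41, "income": 67000},
    {"age": 38, "income": None},
    {"age": 25, "income": 48000},
]

# Median computed from observed values only
observed = [r["income"] for r in rows if r["income"] is not None]
median_income = statistics.median(observed)

for r in rows:
    # Keep the signal: record THAT the value was missing...
    r["income_missing"] = 1 if r["income"] is None else 0
    # ...then impute so downstream models get a complete column.
    if r["income"] is None:
        r["income"] = median_income
```

The indicator column lets the model learn from the missingness pattern even after imputation fills the gap.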
Data Cleaning
Fixing errors, removing noise, standardizing formats
Cleaning Checklist
1. Remove duplicates: Exact and near-duplicates (fuzzy matching)
2. Fix data types: Dates stored as strings, numbers as text
3. Standardize formats: “Jan 5, 2024” vs “2024-01-05” vs “1/5/24”
4. Handle outliers: Cap, remove, or transform extreme values
5. Correct labels: Audit a sample of labels for accuracy
6. Resolve inconsistencies: Merge “NYC” / “New York City” / “new york”
Outlier strategies: Not all outliers are errors. A $10M transaction might be fraud (remove it) or a legitimate whale customer (keep it). Domain knowledge determines the right approach. Common detection techniques are the z-score rule (|z| > 3) and the IQR method.
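Both detection rules fit in a few lines of standard-library Python. The transaction amounts are made up; the 3-sigma and 1.5×IQR thresholds are conventions, not laws.

```python
import statistics

amounts = [95, 98, 100, 102, 105, 107, 110, 112, 115, 118, 120, 9500]

# Z-score rule: flag values more than 3 standard deviations from the mean
mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)
z_outliers = [x for x in amounts if abs(x - mean) / std > 3]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in amounts if x < low or x > high]
```

Note that the IQR rule is more robust: a single extreme value inflates the mean and std that the z-score depends on, but barely moves the quartiles.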
Text Cleaning Pipeline
# NLP text preprocessing
Raw: "  The Cat SAT on 3 mats!! @home  "
1. Lowercase → "the cat sat on 3 mats!! @home"
2. Strip spaces → "the cat sat on 3 mats!! @home"
3. Remove punct → "the cat sat on 3 mats home"
4. Tokenize → ["the", "cat", "sat", "on", "3", "mats", "home"]
5. Remove stops → ["cat", "sat", "3", "mats", "home"]
6. Lemmatize → ["cat", "sit", "3", "mat", "home"]
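The steps above can be sketched with only the standard library. Real pipelines use NLTK or spaCy; the stopword set and "lemma" lookup here are toy stand-ins.

```python
import re

STOPWORDS = {"the", "on", "a", "an", "of"}      # toy stop list
LEMMAS = {"sat": "sit", "mats": "mat"}          # stand-in for a real lemmatizer

def clean(text: str) -> list[str]:
    text = text.lower().strip()                 # 1-2. lowercase, strip spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # 3. remove punctuation
    tokens = text.split()                       # 4. tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # 5. remove stopwords
    return [LEMMAS.get(t, t) for t in tokens]   # 6. lemmatize

tokens = clean("  The Cat SAT on 3 mats!! @home  ")
# -> ["cat", "sit", "3", "mat", "home"]
```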
Image Cleaning
Resize to uniform dimensions (224×224 typical)
Normalize pixel values to [0,1] or [-1,1]
Remove corrupted, blurry, or mislabeled images
Balance class representation across categories
Feature Transformation
Scaling, encoding, and normalizing features for ML algorithms
Why Transform?
ML algorithms are sensitive to feature scales. If “age” ranges 0–100 and “salary” ranges 30K–500K, the model will be dominated by salary. Scaling puts all features on comparable ranges so each contributes fairly.
# Numerical scaling methods
Min-Max Scaling (normalization): x_scaled = (x - min) / (max - min). Range: [0, 1]. Use: neural networks, image pixels.
Standardization (Z-score): x_scaled = (x - mean) / std. Centered at 0, std = 1. Use: SVM, logistic regression, PCA.
Log Transform: x_scaled = log(x + 1). Use: right-skewed data (income, prices).
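A minimal sketch of the three transforms in plain Python (production code would use sklearn's `MinMaxScaler` and `StandardScaler`); the salary list is made up.

```python
import math

def min_max(xs):
    """Rescale to [0, 1]: min maps to 0, max maps to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center at 0 with unit standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def log1p(xs):
    """Compress right-skewed values like income or prices."""
    return [math.log(x + 1) for x in xs]

salaries = [30_000, 60_000, 90_000, 500_000]
scaled = min_max(salaries)   # 30K -> 0.0, 500K -> 1.0
standardized = z_score(salaries)
```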
Encoding Categorical Features
# One-Hot Encoding
Color: [red, blue, green]
red → [1, 0, 0]; blue → [0, 1, 0]; green → [0, 0, 1]

# Label Encoding (ordinal)
Size: [S, M, L, XL]
S → 0, M → 1, L → 2, XL → 3

# Target Encoding (advanced)
City: replace with mean target value
NYC → 0.73 (mean price in NYC); LA → 0.61 (mean price in LA)
One-hot explosion: A feature with 10,000 unique values (like zip codes) creates 10,000 new columns. Use target encoding, embedding layers, or hash encoding for high-cardinality categoricals. Tree-based models handle categoricals natively.
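Toy versions of one-hot and target encoding, to make the mechanics concrete; sklearn's `OneHotEncoder` and category-encoder libraries are the production tools, and the city/sale data below is invented.

```python
def one_hot(value, categories):
    """One column per category; 1 where the value matches."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "blue", "green"]
blue_vec = one_hot("blue", colors)   # [0, 1, 0]

def target_encode(values, targets):
    """Replace each category with the mean target value seen for it."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    return {v: sums[v] / counts[v] for v in sums}

cities = ["NYC", "NYC", "LA", "LA"]
sold = [1, 0, 1, 1]                  # binary target
encoding = target_encode(cities, sold)   # {"NYC": 0.5, "LA": 1.0}
```

Target encoding must be fit on the training split only, for the same leakage reasons covered later in this chapter.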
Feature Engineering
Creating new features that capture domain knowledge
The Art of Feature Engineering
Feature engineering is creating new input variables from raw data that make patterns easier for models to learn. It’s where domain expertise meets ML — a skilled engineer who understands the problem can create features that dramatically improve model performance.
# Feature engineering examples
From dates: purchase_date → day_of_week, is_weekend, month, quarter, days_since_last
From text: email_body → word_count, has_urgency_words, num_links, caps_ratio
From location: lat, lon → distance_to_city_center, neighborhood_avg_income
Interactions: price, sqft → price_per_sqft; clicks, views → click_through_rate
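The date-derived features above take only a few lines with the standard library:

```python
from datetime import date

def date_features(d: date) -> dict:
    """Expand a single date into model-ready features."""
    return {
        "day_of_week": d.weekday(),          # Monday = 0 ... Sunday = 6
        "is_weekend": int(d.weekday() >= 5),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
    }

feats = date_features(date(2024, 1, 6))      # a Saturday
```

A raw timestamp is nearly useless to most models, but "is it a weekend?" is exactly the kind of pattern a retail-sales model needs.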
Feature Selection
Not all features help. Irrelevant or redundant features add noise and slow training. Feature selection removes the least useful features.
# Feature selection methods
Filter (fast, model-independent): correlation, mutual information, chi-squared. Remove features with low relevance to the target.
Wrapper (accurate, expensive): forward selection (add the best feature one at a time); backward elimination (remove the worst one at a time).
Embedded (built into model training): L1 regularization drives weights to zero; tree feature importance (Gini, information gain). Selects automatically during training.
Deep learning reduces manual feature engineering. CNNs learn image features automatically. Transformers learn text representations. But for tabular data, manual feature engineering still outperforms deep learning in most cases — domain knowledge remains irreplaceable.
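A filter method can be sketched as "rank features by absolute correlation with the target, keep the top k". The feature names and values below are invented; sklearn's `SelectKBest` does this at scale.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {
    "sqft":  [800, 1200, 1500, 2000],   # strongly related to price
    "noise": [3, 1, 4, 1],              # unrelated to price
}
price = [100, 150, 190, 250]

# Filter: rank features by |correlation| with the target
ranked = sorted(features,
                key=lambda f: abs(pearson(features[f], price)),
                reverse=True)
```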
Train / Validation / Test Split
The most important rule: never evaluate on data you trained on
The Three Splits
Training set (70–80%): The model learns from this data. Weights are updated based on training examples.

Validation set (10–15%): Used to tune hyperparameters and select the best model. The model never trains on this data, but you use it to make decisions.

Test set (10–15%): The final, untouched evaluation. Used once at the very end to report performance. Simulates real-world deployment.
# Typical split ratios
Small dataset (< 10K samples): Train 60% / Val 20% / Test 20%, plus K-fold cross-validation
Medium dataset (10K – 1M): Train 80% / Val 10% / Test 10%
Large dataset (> 1M): Train 98% / Val 1% / Test 1% (1% of 1M = 10K, still plenty)
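For non-temporal data, a shuffled three-way split (the 80/10/10 ratio from above) can be sketched as follows; sklearn's `train_test_split` is the usual tool.

```python
import random

def split(data, train=0.8, val=0.1, seed=0):
    """Shuffle, then cut into train / val / test by ratio."""
    data = data[:]                      # copy: don't mutate the caller's list
    random.Random(seed).shuffle(data)   # fixed seed -> reproducible split
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])     # remainder becomes the test set

train_set, val_set, test_set = split(list(range(100)))
# sizes: 80 / 10 / 10
```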
Time-series warning: Never randomly shuffle time-series data. Use chronological splits — train on past, validate on recent, test on most recent. Random splitting creates data leakage because the model sees “future” data during training.
Data Leakage & Class Imbalance
Two silent killers of ML projects
Data Leakage
Leakage occurs when information from outside the training set “leaks” into the model during training. The model appears to perform brilliantly in testing but fails completely in production.
Leakage: Scale the entire dataset, then split into train/test. Test-set statistics leak into training via the scaler's mean and std, so the model sees "future" information.

Correct: Split first, then fit the scaler on the training set only. Apply the same fitted scaler to the validation and test sets. No information leaks.
Common leakage sources: Preprocessing before splitting, using future data in time-series, including the target variable (or a proxy) as a feature, duplicate records across splits.
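The split-then-fit discipline looks like this in code. The tiny `Standardizer` below is a stand-in for sklearn's `StandardScaler` (same `fit` / `transform` pattern); the numbers are made up.

```python
import math

class Standardizer:
    """Minimal z-score scaler with the fit/transform pattern."""

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        self.std = math.sqrt(sum((x - self.mean) ** 2 for x in xs) / len(xs))
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

train, test = [10.0, 12.0, 14.0], [13.0, 15.0]

scaler = Standardizer().fit(train)     # statistics come from train ONLY
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuse train mean/std: no leakage
```

Fitting the scaler on `train + test` instead would bake test-set statistics into every training example, which is exactly the leakage described above.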
Class Imbalance
When one class vastly outnumbers another (99% legitimate transactions, 1% fraud), the model can achieve 99% accuracy by predicting “legitimate” for everything — while catching zero fraud.
# Handling class imbalance
Oversampling minority: SMOTE creates synthetic minority examples by interpolating between neighbors.
Undersampling majority: randomly remove majority-class examples. Risk: lose useful information.
Class weights: tell the model that misclassifying fraud costs 100x more than misclassifying legit.
Better metrics: use F1, precision, recall, or AUC-ROC instead of raw accuracy.
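Class weights are the cheapest fix to implement. The inverse-frequency formula below is the same idea behind sklearn's `class_weight="balanced"`; the 99/1 label split mirrors the fraud example above.

```python
def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n, classes = len(labels), set(labels)
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

labels = ["legit"] * 99 + ["fraud"] * 1
w = balanced_weights(labels)
# rare "fraud" examples get ~99x the weight of "legit" ones
```

These weights are passed into the loss function, so each fraud mistake moves the model roughly 99 times as much as a legit mistake.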
Data Augmentation & Modern Pipelines
Creating more data and scaling to production
Data Augmentation
When you can’t collect more data, create variations of existing data. This increases effective dataset size and improves generalization.
# Image augmentation
Geometric: rotate, flip, crop, zoom, shear
Color: brightness, contrast, saturation
Noise: Gaussian noise, blur, cutout
Advanced: MixUp, CutMix, random erasing

# Text augmentation
Synonym: replace words with synonyms
Back-translation: translate to French, then back to English
LLM-generated: use GPT to paraphrase examples

# Tabular augmentation
SMOTE: synthetic minority oversampling
Noise: add small random perturbations
Key Takeaways
1. Data quality matters more than model complexity

2. Always split before preprocessing (prevent leakage)

3. Scale numerical features; encode categorical features

4. Feature engineering encodes domain knowledge

5. Handle class imbalance with SMOTE, class weights, or better metrics

6. Augmentation creates more training data from what you have

7. Deep learning automates feature extraction for images and text, but tabular data still benefits from manual engineering
Coming up: Ch 5 introduces the perceptron — the simplest neural network — and shows how neurons transform input features into predictions. Ch 6 covers how training actually works (backpropagation, gradient descent in practice).