Ch 4: Data & Features — Under the Hood

Scaling math, encoding internals, SMOTE algorithm, leakage detection, and augmentation pipelines

A. Scaling & Normalization Math
Min-max · Z-score · robust scaling · when to use each

Min-Max: scale to [0, 1]; preserves the shape of the distribution

Z-Score: mean = 0, std = 1 (standardization)

Robust: median & IQR; outlier-resistant
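The three scalers above can be written out directly. The following is a minimal sketch using only the Python standard library; the function names and the sample values are illustrative assumptions, and the quartiles use the "inclusive" interpolation method, which is one of several common conventions.

```python
import statistics


def min_max_scale(xs):
    """Min-max: (x - min) / (max - min), mapping values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in xs]  # constant feature: nothing to scale
    return [(x - lo) / span for x in xs]


def z_score_scale(xs):
    """Z-score: (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    if sigma == 0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


def robust_scale(xs):
    """Robust: (x - median) / IQR, where IQR = Q3 - Q1."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in xs]
    return [(x - med) / iqr for x in xs]


# Hypothetical feature with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled_min_max = min_max_scale(values)
scaled_z = z_score_scale(values)
scaled_robust = robust_scale(values)
```

With this sample, the outlier squeezes the min-max output of the other four points into roughly the bottom 3% of the range, while the robust version keeps them spread between -1 and 0.5 because the median and IQR ignore the extreme value.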

From numerical scaling to categorical encoding

B. Encoding Internals
One-hot · ordinal · target encoding · embeddings

One-Hot: one binary column per category

Target Enc.: replace each category with its mean target value

Embeddings: learned dense representations
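One-hot and target encoding are simple enough to sketch by hand. The example below is an illustration under stated assumptions: the helper names and the toy `colors` / `clicked` data are hypothetical, and the target encoder uses plain per-category means with no smoothing. In practice the means would be computed on the training split only, since they are derived from the target.

```python
from collections import defaultdict


def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows


def target_encode(categories, targets):
    """Replace each category with the mean target observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means


# Hypothetical toy data: a categorical feature and a binary target.
colors = ["red", "blue", "red", "green"]
clicked = [1, 0, 0, 1]

columns, encoded_rows = one_hot(colors)
encoded_target, category_means = target_encode(colors, clicked)
```

One-hot grows one column per distinct category, which is why high-cardinality features often move to target encoding or learned embeddings instead.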

Feature engineering: creating signal from raw data

C. Feature Engineering Internals
Interaction features · binning · feature selection math

Interactions: cross features, polynomial terms

Selection: mutual information, importance scores
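As a rough illustration of both cards, the sketch below builds pairwise cross features and scores a feature against a label with mutual information, I(X; Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))). The function names and toy sequences are assumptions for illustration, and the estimator only handles discrete values; a continuous feature would need to be binned first or scored with a different estimator.

```python
import math
from collections import Counter
from itertools import combinations


def add_interactions(row):
    """Append every pairwise product x_i * x_j (i < j) to a feature row."""
    return list(row) + [a * b for a, b in combinations(row, 2)]


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


crossed = add_interactions([2, 3, 5])

# Hypothetical features scored against a binary label.
label = [0, 0, 1, 1]
informative = [0, 0, 1, 1]    # identical to the label
uninformative = [0, 1, 0, 1]  # statistically independent of the label
mi_informative = mutual_information(informative, label)
mi_uninformative = mutual_information(uninformative, label)
```

A feature that determines a balanced binary label carries 1 bit of information about it, and an independent feature carries 0 bits, which is what makes the score usable for ranking candidates.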

Data leakage: the silent model killer

D. Leakage Detection & Splitting
Preprocessing order · temporal splits · stratification

Leakage Types: target leakage, train-test contamination

Split Strategies: stratified, temporal, group-aware splits
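The sketch below illustrates preprocessing order and the three split strategies. It is a simplified example, not a reference implementation: all function names, the `(timestamp, features)` row layout, and the toy data are assumptions. The key point for contamination is that scaling statistics are fitted on the training split and only applied to held-out data.

```python
import random
import statistics
from collections import defaultdict


def fit_standardizer(train_values):
    """Learn mean and std from the training split only (assumes std > 0)."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_standardizer(values, mean, std):
    """Apply already-fitted statistics; never refit on validation or test data."""
    return [(v - mean) / std for v in values]


def temporal_split(rows, cutoff):
    """Rows are (timestamp, features) pairs; rows before the cutoff train."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Sample the same fraction from every class so class ratios are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:k])
        train_idx.extend(idxs[k:])
    return sorted(train_idx), sorted(test_idx)


def group_split(groups, test_groups):
    """Keep every row of a group (e.g. one user) on the same side of the split."""
    held_out = set(test_groups)
    train_idx = [i for i, g in enumerate(groups) if g not in held_out]
    test_idx = [i for i, g in enumerate(groups) if g in held_out]
    return train_idx, test_idx


# Hypothetical data for each strategy.
mean, std = fit_standardizer([1.0, 2.0, 3.0])
scaled_test = apply_standardizer([10.0], mean, std)

events = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
past, future = temporal_split(events, cutoff=3)

labels = [0] * 8 + [1] * 4
strat_train, strat_test = stratified_split(labels, test_fraction=0.25)

users = ["u1", "u1", "u2", "u3", "u3"]
group_train, group_test = group_split(users, test_groups=["u3"])
```

A random shuffle would be wrong for the `events` and `users` examples: it would let the model train on rows from the future or on rows from a user it is later evaluated on.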

Class imbalance & data augmentation techniques

E. Imbalance & Augmentation
SMOTE algorithm · class weights · image/text augmentation

SMOTE: synthetic minority oversampling
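The core SMOTE step is interpolation: pick a minority sample, pick one of its k nearest minority neighbors, and place a synthetic point at a random position on the segment between them. The sketch below is a simplified illustration, not the reference algorithm: it chooses base points at random rather than generating a fixed number per sample, uses brute-force Euclidean neighbor search, and all names and data are hypothetical. The `class_weights` helper shows the common "balanced" heuristic, n_samples / (n_classes * class_count), as an alternative that reweights the loss instead of adding rows.

```python
import math
import random
from collections import Counter


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority points by interpolating toward neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic


def class_weights(labels):
    """Balanced weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * cnt) for c, cnt in counts.items()}


# Hypothetical minority class: four points on a unit square.
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority_points, n_synthetic=6, k=2)

weights = class_weights([0] * 8 + [1] * 2)
```

Because every synthetic point lies between two real minority samples, oversampling of this kind should be applied to the training split only, after splitting, so that no interpolated copy of a test point ends up in training.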

Augmentation: image, text, and tabular data expansion
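One label-preserving transform per modality gives a feel for what augmentation does. This is a toy sketch under stated assumptions: images are nested lists of pixel values, text is whitespace-tokenized, the function names and parameters are hypothetical, and real pipelines typically combine many more transforms.

```python
import random


def horizontal_flip(image):
    """Image: mirror each row of a 2D pixel grid left to right."""
    return [row[::-1] for row in image]


def random_word_dropout(text, p=0.1, seed=0):
    """Text: drop each word with probability p, keeping the original order."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else text  # never return an empty string


def jitter(row, scale=0.01, seed=0):
    """Tabular: add small zero-mean Gaussian noise to each numeric feature."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, scale) for x in row]


flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
sentence = "the quick brown fox jumps over the lazy dog"
shortened = random_word_dropout(sentence, p=0.3, seed=1)
noisy = jitter([0.5, 1.5, 2.5], scale=0.01, seed=1)
```

Each transform is only valid if it leaves the label unchanged; a horizontal flip is safe for most object photos, for example, but not for images of text or digits.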