Ch 4 — Data & Features — Under the Hood
Scaling math, encoding internals, SMOTE algorithm, leakage detection, and augmentation pipelines
A. Scaling & Normalization Math
Min-max · Z-score · robust scaling · when to use each
Min-Max: Scale to [0, 1], preserve distribution
Z-Score: Mean=0, std=1 (standardization)
Robust: Median & IQR, outlier-resistant
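The three scalers above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the choice of population standard deviation, the "inclusive" quartile method, and the handling of constant columns are all assumptions made for the example.

```python
import statistics

def min_max_scale(xs):
    """Min-max: (x - min) / (max - min), mapping values onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in xs]  # constant column: assumed convention
    return [(x - lo) / span for x in xs]

def z_score_scale(xs):
    """Z-score: (x - mean) / std, giving mean 0 and std 1."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population std (assumption)
    if std == 0:
        return [0.0 for _ in xs]
    return [(x - mean) / std for x in xs]

def robust_scale(xs):
    """Robust: (x - median) / IQR, so extreme values barely move the scale."""
    median = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in xs]
    return [(x - median) / iqr for x in xs]

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
mm = min_max_scale(values)
zs = z_score_scale(values)
rb = robust_scale(values)
```

With the outlier at 100, min-max squeezes the first four values into a narrow band near 0, while the robust version keeps them spread around the median. That difference is the usual reason to prefer median and IQR when outliers are present.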
From numerical scaling to categorical encoding
B. Encoding Internals
One-hot · ordinal · target encoding · embeddings
One-Hot: Binary columns per category
Target Enc.: Replace with mean target value
Embeddings: Learned dense representations
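One-hot and target encoding are simple enough to write out by hand, which makes their mechanics easier to see. The sketch below is illustrative only: the helper names are hypothetical, smoothing is omitted, and falling back to the global mean for unseen categories is an assumed convention. The target-encoding mapping should be fitted on training rows only, since it reads the labels.

```python
from collections import defaultdict

def one_hot(categories):
    """Return (sorted vocabulary, one 0/1 indicator row per input value)."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        rows.append(row)
    return vocab, rows

def fit_target_encoding(categories, targets):
    """Map each category to the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    mapping = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean

def apply_target_encoding(categories, mapping, fallback):
    """Unseen categories get the fallback value (assumed: global mean)."""
    return [mapping.get(c, fallback) for c in categories]

colors = ["red", "blue", "red", "green"]
labels = [1, 0, 0, 1]

vocab, encoded = one_hot(colors)
mapping, global_mean = fit_target_encoding(colors, labels)
new_rows = apply_target_encoding(["red", "purple"], mapping, global_mean)
```

Learned embeddings are not shown here because they need a trainable model; conceptually they replace the one-hot row with a short dense vector looked up by category index.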
Feature engineering: creating signal from raw data
C. Feature Engineering Internals
Interaction features · binning · feature selection math
Interactions: Cross features, polynomial terms
Selection: Mutual information, importance scores
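As a rough illustration of the two ideas above, the following sketch builds degree-2 cross terms for one row and estimates mutual information between two discrete variables from empirical counts. The function names are hypothetical, and the mutual-information estimate assumes already-discretized (for example, binned) inputs.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def degree2_terms(row):
    """Append every product x_i * x_j (i <= j) to the raw features."""
    return list(row) + [a * b for a, b in combinations_with_replacement(row, 2)]

def mutual_information(xs, ys):
    """I(X;Y) in bits for two discrete sequences, from empirical frequencies."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

expanded = degree2_terms([2, 3])                           # x1, x2, x1^2, x1*x2, x2^2
mi_copy = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])   # feature equals target
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # feature unrelated to target
```

A feature that fully determines a balanced binary target scores 1 bit, and an independent one scores 0, which is what makes the quantity usable as a ranking score for selection.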
Data leakage: the silent model killer
D. Leakage Detection & Splitting
Preprocessing order · temporal splits · stratification
Leakage Types: Target leakage, train-test contamination
Split Strategies: Stratified, temporal, group-aware splits
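A small sketch can make the preprocessing-order point concrete: split first, then fit any statistics on the training side only. The code below shows a temporal split, a group-aware split, and a standardizer fitted on training values. The row layout (`user`, `day`, `amount`) and all function names are made up for illustration; stratified splitting is not shown.

```python
import random
import statistics

def temporal_split(rows, time_key, cutoff):
    """Train on rows strictly before the cutoff, test on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Send whole groups to one side so no group appears in both splits."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

def fit_standardizer(train_values):
    """Compute mean and std from training data only; return a transform."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # avoid division by zero
    return lambda x: (x - mean) / std

# Hypothetical toy data: 4 users, 5 days each.
rows = [{"user": u, "day": d, "amount": float(10 * u + d)}
        for u in range(4) for d in range(5)]

train, test = group_split(rows, "user")
scale = fit_standardizer([r["amount"] for r in train])  # test rows never seen here
train_scaled = [scale(r["amount"]) for r in train]
test_scaled = [scale(r["amount"]) for r in test]

early, late = temporal_split(rows, "day", cutoff=3)
```

Fitting the standardizer on all rows before splitting would let test-set statistics influence the training features, which is the train-test contamination named above.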
Class imbalance & data augmentation techniques
E. Imbalance & Augmentation
SMOTE algorithm · class weights · image/text augmentation
SMOTE: Synthetic minority oversampling
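The core SMOTE step is interpolation: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the line segment between them. The sketch below is a simplified version under stated assumptions: it chooses base points at random instead of iterating over every minority sample, uses Euclidean distance with a brute-force neighbour search, and the function name and parameters are illustrative.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because each new point lies between two existing minority samples, synthetic data stays inside the region the minority class already occupies. As with scaling, oversampling belongs after the train/test split and on training data only.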
Augmentation: Image, text, tabular data expansion
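To give one concrete transform per data type, the sketch below shows a horizontal flip for an image stored as nested lists, random token deletion for text, and Gaussian jitter for a tabular row. These are common label-preserving examples chosen for illustration, not a list taken from the chapter; the function names and default parameters are assumptions.

```python
import random

def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token with probability p, keeping at least one token."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def jitter(row, sigma=0.01, rng=None):
    """Add small zero-mean Gaussian noise to each numeric feature."""
    rng = rng or random.Random(0)
    return [x + rng.gauss(0.0, sigma) for x in row]

image = [[1, 2, 3], [4, 5, 6]]
flipped = horizontal_flip(image)

tokens = "the quick brown fox jumps".split()
shorter = random_deletion(tokens, p=0.3)

noisy = jitter([0.5, 1.5, 2.5])
```

Whether a transform is safe depends on the task: a flip is harmless for most object photos but changes the label for text or digits in an image, so each augmentation should be checked against what the label actually depends on.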