Ch 4 — Data & Features — Under the Hood

Scaling math, encoding internals, SMOTE algorithm, leakage detection, and augmentation pipelines
A. Scaling & Normalization Math
Min-max · Z-score · robust scaling · when to use each

Min-Max: scale to [0, 1]; preserves the shape of the distribution
Z-Score: mean = 0, std = 1 (standardization)
Robust: median and IQR; outlier-resistant
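
As a rough illustration of the three scalers named above, the sketch below implements each one with the Python standard library only. The function names, the choice of population standard deviation, the "inclusive" quartile method, and the sample data are all assumptions made for this example; libraries differ in these details.

```python
import statistics

def min_max_scale(values):
    """Map values linearly onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(x - lo) / span for x in values]

def z_score_scale(values):
    """Standardize to mean 0 and population std 1: (x - mean) / std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

def robust_scale(values):
    """Center on the median and divide by the IQR: (x - median) / (Q3 - Q1)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    return [(x - median) / iqr for x in values]

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical column with one outlier
mm = min_max_scale(data)
zs = z_score_scale(data)
rs = robust_scale(data)
```

On this sample the outlier squeezes the first four min-max values into roughly the bottom 3% of the range, while robust scaling keeps them evenly spaced between -1 and 0.5, which is one way to see why median and IQR are described as outlier-resistant.
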

From numerical scaling to categorical encoding

B. Encoding Internals
One-hot · ordinal · target encoding · embeddings

One-Hot: binary columns, one per category
Target Encoding: replace each category with its mean target value
Embeddings: learned dense representations
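
The first two encoders can be sketched in a few lines of plain Python. This is a minimal version under stated assumptions: the helper names and the toy `colors` / `labels` data are hypothetical, the target encoder uses the raw per-category mean with no smoothing, and embeddings are left out because they are learned during model training.

```python
from collections import defaultdict

def one_hot(categories):
    """Return (sorted vocabulary, rows of 0/1 indicators, one column per category)."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        rows.append(row)
    return vocab, rows

def target_encode(categories, targets):
    """Replace each category with the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return means, [means[c] for c in categories]

colors = ["red", "blue", "red", "green"]  # hypothetical categorical feature
labels = [1, 0, 0, 1]                     # hypothetical binary target
vocab, encoded = one_hot(colors)
means, target_encoded = target_encode(colors, labels)
```

Because target encoding reads the label, the per-category means should be computed from training rows only and then looked up for validation and test rows; computing them on the full dataset is a form of target leakage.
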

Feature engineering: creating signal from raw data

C. Feature Engineering Internals
Interaction features · binning · feature selection math

Interactions: cross features, polynomial terms
Selection: mutual information, importance scores
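
To make the two ideas above concrete, here is a small sketch: one function that expands a row with degree-2 polynomial and cross terms, and one that computes mutual information between two discrete sequences from empirical counts. Both function names and the toy inputs are assumptions for illustration; continuous features would need binning or a different estimator first.

```python
import math
from collections import Counter
from itertools import combinations

def add_interactions(row):
    """Append squares and pairwise products (degree-2 polynomial terms)."""
    squares = [x * x for x in row]
    crosses = [a * b for a, b in combinations(row, 2)]
    return list(row) + squares + crosses

def mutual_information(xs, ys):
    """Mutual information in nats between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

expanded = add_interactions([2.0, 3.0])
informative = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
uninformative = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
```

A feature that determines the target exactly scores the full entropy of the target (here log 2), while a feature that is independent of it scores zero, which is what makes mutual information usable as a ranking signal for selection.
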

Data leakage: the silent model killer

D. Leakage Detection & Splitting
Preprocessing order · temporal splits · stratification

Leakage Types: target leakage, train-test contamination
Split Strategies: stratified, temporal, group-aware splits
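
The split strategies and the preprocessing-order rule can be sketched as index-returning helpers. Everything here is a simplified assumption for illustration: the function names, the default `test_fraction`, the fixed seed, and the toy labels are hypothetical, and a production pipeline would typically rely on a library splitter instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with class proportions kept roughly equal."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def temporal_split(timestamps, cutoff):
    """Train on rows strictly before the cutoff, test on the rest."""
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    test_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, test_idx

def group_split(groups, test_groups):
    """Keep every row of a group on the same side of the split."""
    held_out = set(test_groups)
    train_idx = [i for i, g in enumerate(groups) if g not in held_out]
    test_idx = [i for i, g in enumerate(groups) if g in held_out]
    return train_idx, test_idx

def fit_standardizer(train_values):
    """Compute scaling statistics from training rows only."""
    mean = sum(train_values) / len(train_values)
    var = sum((x - mean) ** 2 for x in train_values) / len(train_values)
    return mean, var ** 0.5

labels = [0] * 8 + [1] * 4  # hypothetical imbalanced labels
train_idx, test_idx = stratified_split(labels)
```

The ordering is the point: split first, call `fit_standardizer` on the training rows, then apply that same mean and standard deviation to the test rows. Fitting on the full dataset before splitting lets test-set statistics contaminate training.
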

Class imbalance & data augmentation techniques

E. Imbalance & Augmentation
SMOTE algorithm · class weights · image/text augmentation

SMOTE: synthetic minority oversampling
Augmentation: image, text, and tabular data expansion
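
The core SMOTE step is interpolation between a minority sample and one of its nearest minority neighbors. The sketch below shows that step alongside a common "balanced" class-weight heuristic and a trivial image flip. It is a simplified illustration, not a reference implementation: the function names, the default `k`, the seed, and the toy data are assumptions, the neighbor search is brute force, and it expects at least two minority points.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        u = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, n_classes = len(labels), len(counts)
    return {y: n / (n_classes * c) for y, c in counts.items()}

def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical points
new_points = smote(minority, n_synthetic=5, k=2)
weights = balanced_class_weights([0] * 8 + [1] * 2)
flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
```

Every synthetic point lies on a segment between two real minority points, so oversampling fills in the minority region without duplicating rows. As with scaling and target encoding, resampling and augmentation should be applied to the training split only.
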