Splitting in scikit-learn
from sklearn.model_selection import train_test_split
# Two-way split (simple)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Three-way split (with validation)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.18, stratify=y_temp,
    random_state=42
)
# Result: ~70% train, ~15% val, ~15% test
# (0.18 of the remaining 85% ≈ 15% of the full data)
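The second test_size is a fraction of what is left after the first split, not of the full data. A minimal sketch of that arithmetic, assuming synthetic stand-in data (the arrays and the names test_frac, val_frac, and val_of_rest are illustrative, not from the original): to get a validation share v of the total after holding out a test share t, pass v / (1 - t), which is 0.15 / 0.85 ≈ 0.176 here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for illustration): 1000 rows, 30% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 300 + [0] * 700)

test_frac, val_frac = 0.15, 0.15
# The second call only sees the remaining 85%, so rescale the validation share.
val_of_rest = val_frac / (1 - test_frac)  # about 0.176

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_frac, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_of_rest, stratify=y_temp, random_state=42
)

# Row counts per split; they always add up to the original number of rows.
sizes = {"train": len(X_train), "val": len(X_val), "test": len(X_test)}
print(sizes)
```

Rounding can shift a split by a row or so, which is why the shares are stated as approximate.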
stratify=y preserves the class ratios of y in each split (up to rounding).
# If 30% of the full data is spam, about 30% of each split is spam.
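One way to check this is to compare the positive-class share before and after splitting. The sketch below assumes made-up labels with 30% spam (the data is synthetic and only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels (assumption for illustration): 1 = spam, 0 = not spam.
y = np.array([1] * 300 + [0] * 700)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The mean of a 0/1 label array is the share of the positive class.
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name}: {labels.mean():.3f} spam")
```

Without stratify=y the shares would still be close on average, but they could drift from split to split, especially for small datasets or rare classes.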