Ch 3 — Dataset Preparation — Under the Hood

HuggingFace Datasets, format conversion, synthetic data generation, deduplication, and packing
A. Loading & Format Conversion: HuggingFace Datasets, JSONL, Alpaca-to-messages
- Load Data: HuggingFace Datasets or JSONL files
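A minimal sketch of the loading step, using a plain JSONL reader so it runs without any dependencies; the HuggingFace equivalent is noted in a comment (file name is illustrative):

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# With HuggingFace Datasets the equivalent is roughly:
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files="train.jsonl", split="train")
```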
- Convert: to the messages format
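The Alpaca-to-messages conversion can be sketched like this (assuming the usual Alpaca schema of `instruction`, optional `input`, and `output`):

```python
def alpaca_to_messages(example):
    """Convert one Alpaca-style record to the chat 'messages' format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Fold the optional context field into the user turn.
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }
```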
- Analyze: length distribution
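A quick way to inspect the length distribution; `str.split` stands in for a real tokenizer here, and the percentile math is a rough sketch:

```python
def length_stats(texts, tokenize=str.split):
    """Summarize token lengths; swap in tokenizer.encode for real counts."""
    lengths = sorted(len(tokenize(t)) for t in texts)

    def pct(p):
        return lengths[min(len(lengths) - 1, int(p / 100 * len(lengths)))]

    return {"min": lengths[0], "p50": pct(50), "p95": pct(95), "max": lengths[-1]}
```

Long tails here tell you where to set truncation or filtering thresholds later.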
- Synthetic data: GPT-4o generation with seed examples, batch API, and quality filtering
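The prompt-building and quality-filtering pieces of the synthetic pipeline can be sketched in pure Python; the prompt wording, thresholds, and refusal prefixes below are all illustrative assumptions, and the actual GPT-4o call is only indicated in a comment:

```python
def build_generation_prompt(seed_examples, topic):
    """Few-shot prompt asking the model for a new instruction/response pair
    (hypothetical wording)."""
    shots = "\n\n".join(
        f"Instruction: {s['instruction']}\nResponse: {s['output']}"
        for s in seed_examples
    )
    return (
        f"Here are example instruction/response pairs:\n\n{shots}\n\n"
        f"Write one new, different pair about {topic}, as JSON with "
        f'keys "instruction" and "output".'
    )

def passes_quality_filter(record, min_len=20):
    """Drop obviously bad generations: missing keys, too-short answers,
    or refusal boilerplate (assumed heuristics)."""
    if not {"instruction", "output"} <= record.keys():
        return False
    if len(record["output"]) < min_len:
        return False
    return not record["output"].lower().startswith(("i'm sorry", "i cannot"))

# The generation itself would go through the OpenAI batch API with
# model="gpt-4o"; each JSON reply is parsed and kept only if
# passes_quality_filter(...) returns True.
```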
B. Cleaning, Filtering & Deduplication: removing noise and duplicates from the dataset
- Clean Text: normalize and strip
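A typical normalize-and-strip pass, sketched with the standard library (the exact rules are a matter of taste):

```python
import re
import unicodedata

def clean_text(text):
    """NFKC-normalize, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Keep newlines/tabs but remove other control characters (category C*).
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```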
- Filter: by length and quality
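A length/quality filter might look like this; the thresholds and the repetition heuristic are assumed, not prescribed by the chapter:

```python
def keep_example(example, min_chars=10, max_chars=8000):
    """Drop responses that are too short, too long, or degenerately
    repetitive (illustrative thresholds)."""
    text = example["messages"][-1]["content"]
    if not (min_chars <= len(text) <= max_chars):
        return False
    words = text.split()
    # Crude repetition check: too few unique words relative to length.
    if len(words) >= 20 and len(set(words)) / len(words) < 0.3:
        return False
    return True
```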
- Dedup: exact hash + MinHash
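Both dedup stages can be sketched in miniature: exact duplicates fall to a content hash, and near-duplicates to a toy MinHash over character shingles (production code would use a library such as datasketch):

```python
import hashlib
import random

def exact_dedup(texts):
    """Drop exact duplicates via MD5 of the normalized text."""
    seen, kept = set(), []
    for t in texts:
        h = hashlib.md5(t.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept

def minhash_signature(text, num_perm=64, shingle=3, seed=0):
    """Toy MinHash: the fraction of matching signature slots between two
    texts estimates their shingle-set Jaccard similarity."""
    shingles = {text[i:i + shingle] for i in range(len(text) - shingle + 1)}
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash(s) ^ m for s in shingles) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```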
- Data mixing: combine task-specific data with general-purpose data to prevent catastrophic forgetting
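The mixing step can be sketched as proportional sampling; the 30% default below is illustrative only, not a recommendation from the chapter:

```python
import random

def mix_datasets(task_data, general_data, general_ratio=0.3, seed=0):
    """Blend task data with general data so that roughly `general_ratio`
    of the result is general-purpose (assumed ratio)."""
    n_general = int(len(task_data) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    sample = rng.sample(general_data, min(n_general, len(general_data)))
    mixed = list(task_data) + sample
    rng.shuffle(mixed)
    return mixed
```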
C. Chat Template & Tokenization: applying model-specific formatting and creating training tensors
- Apply Template: Jinja2 formatting
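To make the templating step concrete, here is a hand-rolled ChatML-style formatter that mimics what a tokenizer's Jinja2 chat template produces; real templates are model-specific and should come from the tokenizer itself:

```python
def apply_chatml_template(messages, add_generation_prompt=False):
    """Minimal ChatML-style rendering of a messages list."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out

# With transformers, the model's own Jinja2 template is applied via:
#   tokenizer.apply_chat_template(messages, tokenize=False,
#                                 add_generation_prompt=True)
```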
- Tokenize: input_ids + labels
- Label Mask: -100 on prompt tokens
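Tokenization and label masking go together: the labels copy the input ids, except the prompt positions are set to -100 so the loss ignores them. A sketch over already-tokenized id lists:

```python
def build_training_example(prompt_ids, response_ids, ignore_index=-100):
    """input_ids = prompt + response; labels mask the prompt with -100 so
    loss is computed only on the assistant's tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}
```

-100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is the conventional mask value.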
D. Sequence Packing & DataLoader: efficient batching for training
- Pack: concatenate sequences
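Packing concatenates tokenized examples into one stream (with EOS between them) and slices it into fixed-length blocks, wasting no positions on padding. A sketch, where the trailing partial block is simply dropped:

```python
def pack_sequences(examples, block_size, eos_id):
    """Concatenate token-id lists, separated by EOS, into fixed blocks."""
    stream = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_id)
    return [
        stream[i:i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]
```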
- Batch: DataLoader
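Batching needs a collate function that pads each batch to its own max length; sketched here in pure Python over lists, where a PyTorch version would return tensors and be passed as `collate_fn` to `DataLoader`:

```python
def collate_batch(batch, pad_id=0, ignore_index=-100):
    """Pad a list of {'input_ids', 'labels'} dicts to the batch max length
    and build the matching attention mask."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, labels, attention_mask = [], [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        input_ids.append(ex["input_ids"] + [pad_id] * pad)
        labels.append(ex["labels"] + [ignore_index] * pad)
        attention_mask.append([1] * len(ex["input_ids"]) + [0] * pad)
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": attention_mask}
```

Padding labels with -100 keeps the padded positions out of the loss, consistent with the label-masking step above.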
E. Open-Source Datasets & HuggingFace Hub: ready-to-use datasets for fine-tuning
- HF Hub: open datasets
- Popular Sets: Alpaca, Dolly, etc.
- Mix: your own data + open datasets