Alpaca Format
The simplest and most common format: each example has an instruction, an optional input, and an output. Introduced by Stanford for the Alpaca project (2023). Best suited to single-turn instruction-following tasks.
// Alpaca format
{
"instruction": "Classify the sentiment",
"input": "This product is amazing!",
"output": "Positive"
}
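Before training, Alpaca records are typically rendered into a single prompt string. A minimal sketch, following the Stanford Alpaca prompt template (the function name is illustrative):

```python
def format_alpaca(example: dict) -> str:
    """Render an Alpaca record into one training prompt string.

    Uses the Stanford Alpaca template, which switches wording
    depending on whether the optional "input" field is present.
    """
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

record = {
    "instruction": "Classify the sentiment",
    "input": "This product is amazing!",
    "output": "Positive",
}
print(format_alpaca(record))
```

Many training libraries apply a template like this automatically, but doing it explicitly makes the resulting prompts easy to inspect.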
ShareGPT / Conversation Format
Multi-turn conversations. Each example is a list of messages with roles. Used by Vicuna, ShareGPT-derived datasets, and many chat fine-tuning pipelines. Best for conversational models.
// ShareGPT format
{
"conversations": [
{"from": "system", "value": "You are a legal expert."},
{"from": "human", "value": "What is force majeure?"},
{"from": "gpt", "value": "Force majeure is..."}
]
}
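Since the two conversation formats differ only in key names and role labels, converting ShareGPT records to the OpenAI messages format is a small transform. A sketch, assuming the standard role mapping (`human` to `user`, `gpt` to `assistant`):

```python
# Map ShareGPT "from" labels to OpenAI-style "role" names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(example: dict) -> dict:
    """Convert one ShareGPT record into the OpenAI messages format."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }

sharegpt = {
    "conversations": [
        {"from": "system", "value": "You are a legal expert."},
        {"from": "human", "value": "What is force majeure?"},
        {"from": "gpt", "value": "Force majeure is..."},
    ]
}
print(sharegpt_to_messages(sharegpt))
```

A strict `ROLE_MAP` lookup (rather than `.get`) is deliberate: an unexpected role label raises a `KeyError` immediately instead of silently producing malformed training data.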
OpenAI Messages Format
The format used by OpenAI's fine-tuning API and increasingly adopted as a standard. Each line in a JSONL file contains a messages array with role and content fields.
// OpenAI / HuggingFace messages format
{
"messages": [
{"role": "system", "content": "You are a legal expert."},
{"role": "user", "content": "What is force majeure?"},
{"role": "assistant", "content": "Force majeure is..."}
]
}
Recommendation: Use the OpenAI messages format for new projects. It works with HuggingFace SFTTrainer, OpenAI fine-tuning API, and most tools. SFTTrainer automatically applies the correct chat template when data is in this format. Store as JSONL (one JSON object per line).
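Writing and reading JSONL needs no special library: one `json.dumps` per line on the way out, one `json.loads` per line on the way back. A minimal sketch (the filename is illustrative):

```python
import json

examples = [
    {"messages": [
        {"role": "user", "content": "What is force majeure?"},
        {"role": "assistant", "content": "Force majeure is..."},
    ]},
]

# Write: one JSON object per line -- no outer array, no trailing commas.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read back line by line.
with open("train.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

The line-per-record layout is what lets training tools stream large datasets without parsing one giant JSON document into memory.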
Format          | Turns       | Structure                | Best for
Alpaca          | Single-turn | instruction/input/output | Simple tasks
ShareGPT        | Multi-turn  | conversations array      | Chat models
OpenAI Messages | Multi-turn  | messages array           | Universal standard
Completion      | Raw text    | prompt + completion      | Legacy / CPT
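Because each format has a distinctive top-level key, a single record is enough to tell them apart. A heuristic sketch (the function name and labels are illustrative, not from any library):

```python
def detect_format(example: dict) -> str:
    """Guess which dataset format a single record uses, by top-level key."""
    if "messages" in example:
        return "openai-messages"
    if "conversations" in example:
        return "sharegpt"
    if "instruction" in example and "output" in example:
        return "alpaca"
    if "prompt" in example and "completion" in example:
        return "completion"
    return "unknown"
```

Running a check like this over the first few lines of a JSONL file catches mixed-format datasets before training starts.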