The Analogy
Instead of always considering exactly K options, top-p says: “Consider the smallest set of options whose cumulative probability reaches p.” If one token has 95% probability, top-p=0.95 may include just that one token. If the distribution is flat, it might include 200 tokens. The cutoff adapts to the model’s confidence, which is why top-p (also called “nucleus sampling”) is the most popular method in production.
Key insight: Most LLM APIs (OpenAI, Anthropic, etc.) expose temperature and top-p together, and a common configuration is T=1.0 with top_p=0.95. This means: use the model’s natural probabilities, but cut off the long tail of unlikely tokens. For deterministic output, set T=0 (greedy decoding, which makes top-p irrelevant). For creative output, T=0.8 with top_p=0.95 is a good starting point.
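One way to see how the two knobs interact: raising T flattens the distribution, so the same p covers more tokens. A minimal sketch, using random stand-in logits rather than real model output (the helper `nucleus_size` is illustrative, not a library function):

```python
import torch

torch.manual_seed(0)

def nucleus_size(logits, p=0.95, T=1.0):
    """How many tokens top-p keeps, for given logits and temperature."""
    probs = torch.softmax(logits / T, dim=-1)
    sorted_probs, _ = probs.sort(descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    # Keep tokens whose preceding cumulative mass is still <= p
    return int(((cumsum - sorted_probs) <= p).sum())

logits = torch.randn(1000)  # stand-in for vocabulary logits
sizes = {T: nucleus_size(logits, T=T) for T in (0.5, 1.0, 1.5)}
print(sizes)  # nucleus grows as T increases
```

Lower T concentrates probability on the top tokens, so fewer of them are needed to cover 95% of the mass; higher T spreads the mass out and the nucleus widens.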
Top-P Implementation
import torch

def top_p_sample(logits, p=0.95, T=1.0):
    scaled = logits / T
    probs = torch.softmax(scaled, dim=-1)
    # Sort by probability (descending)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    # Cumulative sum
    cumsum = sorted_probs.cumsum(dim=-1)
    # Remove tokens whose preceding cumulative mass already exceeds p
    # (this keeps the token that crosses the threshold)
    mask = cumsum - sorted_probs > p
    sorted_probs[mask] = 0
    # Renormalize and sample
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)]
# Adaptive behavior:
# "Capital of France is ___"
# P(Paris)=0.97 → nucleus = {Paris}
# Only 1 token! (very confident)
# "I enjoy eating ___"
# P(pizza)=0.08, P(pasta)=0.07, ...
# Nucleus = {pizza, pasta, sushi, ...}
# ~50 tokens (uncertain, many valid)
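The adaptive behavior sketched above can be checked numerically. A small sketch with made-up distributions (the probabilities are illustrative, not from a real model; `nucleus_size` is a hypothetical helper mirroring the masking logic):

```python
import torch

def nucleus_size(probs, p=0.95):
    """How many tokens top-p keeps for a given probability distribution."""
    sorted_probs, _ = probs.sort(descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    # Same rule as the sampler: keep tokens whose preceding mass is <= p
    return int(((cumsum - sorted_probs) <= p).sum())

# Confident case: one token carries 97% of the mass
confident = torch.tensor([0.97] + [0.03 / 99] * 99)
# Uncertain case: mass spread evenly over 100 tokens
flat = torch.full((100,), 0.01)

print(nucleus_size(confident))  # 1
print(nucleus_size(flat))       # most of the 100 tokens survive
```

The same p=0.95 yields a nucleus of a single token when the model is confident and nearly the whole vocabulary when it is not; top-k with any fixed K cannot do both.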