Sampling & Decoding
A language model produces a probability distribution over the next token. How you pick from that distribution dramatically changes the output. The same model + same prompt + different sampling = wildly different behavior.
Greedy decoding
Always pick the highest-probability token.
next_token = torch.argmax(logits, dim=-1)
Pros:
- Deterministic.
- Highest single-token probability at every step.
Cons:
- Repetitive (loops on common phrases).
- Boring.
- Often produces locally-optimal but globally-bad text.
Used for: extractive tasks where there’s a single “correct” output (some classification, simple QA).
Beam search
Keep the top-k partial sequences at each step; pick the highest cumulative probability at the end.
Pros:
- Better than greedy for output quality.
- Used in machine translation historically.
Cons:
- Still tends to be bland and repetitive.
- Computationally expensive.
- Has been mostly abandoned for open-ended generation in the LLM era.
For chat / generation, beam search is rarely used.
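A minimal sketch of the idea, assuming a hypothetical `model(tokens)` that returns next-token logits of shape `(vocab,)` for a list of token ids (names and interface are illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def beam_search(model, prompt_ids, beam_width=4, max_new_tokens=32, eos_id=None):
    beams = [(list(prompt_ids), 0.0)]                  # (token ids, cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if eos_id is not None and tokens[-1] == eos_id:
                candidates.append((tokens, score))     # finished beams carry over unchanged
                continue
            log_probs = F.log_softmax(model(tokens), dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, tid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [tid], score + lp))
        # keep only the beam_width highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```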
Temperature
Divide the logits by T before softmax:
softmax(logits / T)
- T < 1: sharper distribution, greedier.
- T = 1: model's natural distribution.
- T > 1: flatter, more random.
Common settings:
- 0: greedy (most APIs treat T=0 as argmax).
- 0.0–0.3: factual, deterministic, code-y.
- 0.5–0.7: chat default; balances coherence and variety.
- 0.8–1.0: creative writing, brainstorming.
- >1.0: increasingly random; rarely useful.
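A minimal sketch of how the division plays out in sampling code (function name is illustrative):

```python
import torch
import torch.nn.functional as F

def sample_with_temperature(logits: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    if temperature == 0:                               # treat T=0 as greedy, like most APIs
        return torch.argmax(logits, dim=-1, keepdim=True)
    probs = F.softmax(logits / temperature, dim=-1)    # T < 1 sharpens, T > 1 flattens
    return torch.multinomial(probs, num_samples=1)
```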
Top-k sampling
Restrict to the top-k most likely tokens; sample from those.
top_k_values, top_k_indices = torch.topk(logits, k=40)
probs = F.softmax(top_k_values, dim=-1)
sampled = torch.multinomial(probs, 1)             # index within the top-k set
next_token = top_k_indices.gather(-1, sampled)    # map back to vocabulary ids
Cuts off the long tail of unlikely (often nonsensical) tokens.
Common: k = 40. Works well in practice.
Top-p (nucleus) sampling
Restrict to the smallest set of tokens whose cumulative probability exceeds p.
- For peaked distributions, this is a small set.
- For flat distributions, this is a large set.
Adapts to the local distribution shape — generally preferred over fixed top-k.
Common: p = 0.9 to 0.95.
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
nucleus = cumulative_probs <= p
# shift right so the first token that crosses the threshold stays in the nucleus
nucleus[..., 1:] = nucleus[..., :-1].clone()
nucleus[..., 0] = True
sorted_logits[~nucleus] = float("-inf")           # mask out the tail
probs = torch.softmax(sorted_logits, dim=-1)
sampled = torch.multinomial(probs, 1)             # index within the sorted set
next_token = sorted_indices.gather(-1, sampled)   # map back to vocabulary ids
Combining: temperature + top-p + top-k
Most APIs let you set all three. Common defaults:
- temperature = 0.7
- top_p = 0.95
- top_k = 40 (sometimes off)
Order varies by implementation; one common pipeline is: top-k filter → top-p filter → temperature → softmax → sample.
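A sketch of that pipeline in one function (illustrative names; real implementations differ in ordering and edge cases):

```python
import torch

def sample(logits, temperature=0.7, top_k=40, top_p=0.95):
    # 1. top-k filter: keep only the k highest logits
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # 2. top-p filter: drop the low-probability tail
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    keep = cumulative <= top_p
    keep[..., 1:] = keep[..., :-1].clone()
    keep[..., 0] = True
    sorted_logits = sorted_logits.masked_fill(~keep, float("-inf"))
    # 3. temperature, softmax, sample
    probs = torch.softmax(sorted_logits / temperature, dim=-1)
    sampled = torch.multinomial(probs, 1)
    return sorted_indices.gather(-1, sampled)
```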
Min-p sampling
Newer (2024): keep tokens whose probability is at least min_p × max_prob. Adapts to confidence — for confident distributions, only top tokens make it; for flat ones, more do.
Often a better default than top-p for generation quality. Supported in some frameworks (e.g. llama.cpp).
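A sketch of the filter (helper name is made up; 0.05 is just a typical starting value):

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.05) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    # the cutoff scales with the model's confidence in its top token
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))
```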
Repetition penalty
Penalize tokens that have already been generated:
for token in set(past):
    adjusted_logits[token] /= repetition_penalty
Common: 1.1 to 1.3. Prevents looping. Modern frontier models rarely need this; smaller models often do.
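The snippet above only acts as a penalty for positive logits; the sign-aware variant (as in Hugging Face's RepetitionPenaltyLogitsProcessor) handles negative logits too. A sketch:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, repetition_penalty=1.2):
    scores = logits[generated_ids]
    # divide positive logits, multiply negative ones, so seen tokens always get less likely
    logits[generated_ids] = torch.where(scores > 0, scores / repetition_penalty,
                                        scores * repetition_penalty)
    return logits
```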
Frequency / presence penalties (OpenAI-style)
- Frequency penalty: penalize tokens proportional to how often they’ve appeared.
- Presence penalty: penalize tokens that have appeared at all (binary).
Both push the model toward more varied output. 0.0 to 2.0 typical range.
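A sketch of how the two penalties combine when applied to raw logits (this mirrors the formula OpenAI documents; the helper itself is illustrative):

```python
from collections import Counter

def apply_frequency_presence_penalties(logits, generated_ids,
                                       frequency_penalty=0.0, presence_penalty=0.0):
    for token_id, count in Counter(generated_ids).items():
        # frequency: grows with each repeat; presence: flat hit once a token has appeared
        logits[token_id] -= frequency_penalty * count + presence_penalty
    return logits
```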
Stop sequences
Halt generation when a specific string is produced:
stop_sequences = ["\n\nUser:", "###", "<|end|>"]
Useful for:
- Multi-turn formats (stop at next role marker).
- Few-shot prompting (stop before the next example starts).
- Controlling output length.
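Client-side, handling a stop sequence amounts to scanning the decoded text and truncating at the earliest match; a minimal sketch:

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> tuple[str, bool]:
    cuts = [text.index(s) for s in stop_sequences if s in text]
    if not cuts:
        return text, False          # no stop sequence seen yet, keep generating
    return text[:min(cuts)], True   # cut before the earliest stop and halt
```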
Max tokens
Hard cap on output length. Watch out:
- Hitting the cap mid-sentence produces truncated outputs.
- Cost is bounded by max_tokens; budget accordingly.
- Some APIs include reasoning tokens in the cap; others don’t.
Logit bias
Force or forbid specific tokens by adjusting their logits:
logit_bias = {
    token_id_yes: 100,
    token_id_no: 100,
    # real APIs have no "everything else" wildcard; list the specific token ids you want to suppress at -100
}
Used for forcing yes/no answers, restricting vocabulary, or implementing custom constraints.
OpenAI exposes logit_bias directly. With local models, you can manipulate logits between sample steps.
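With raw logits in hand, the bias is just an additive adjustment before softmax; a sketch:

```python
import torch

def apply_logit_bias(logits: torch.Tensor, bias: dict[int, float]) -> torch.Tensor:
    # large positive values effectively force a token, large negative values forbid it
    for token_id, value in bias.items():
        logits[token_id] += value
    return logits
```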
Speculative decoding
Inference optimization, not a sampling strategy per se. A small “draft” model proposes the next several tokens; the main model verifies them in parallel. When most drafts are accepted, generation is faster while the output matches what the main model would have produced alone.
Used in vLLM, TensorRT-LLM, llama.cpp. Free 2–3× speedup with minimal setup.
Constrained decoding
Restrict generation to a grammar or schema:
- JSON-schema-constrained decoding: only generate tokens that keep the output valid JSON matching a schema.
- Regex-constrained decoding: only tokens consistent with a regex.
- Custom grammars: e.g. SQL grammar.
Tools: Outlines, Guidance, llama.cpp’s grammar support, OpenAI’s strict mode.
This is how “100% valid JSON” guarantees actually work — they’re not “the model is good,” they’re “the decoder physically cannot emit invalid tokens.”
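The core mechanism is a per-step mask: the grammar engine computes which token ids can legally extend the output so far, and every other logit is set to -inf before sampling. A minimal sketch of that masking step (the grammar/automaton machinery itself is what tools like Outlines provide):

```python
import torch

def constrain_step(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0        # only grammar-legal continuations survive
    return logits + mask
```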
Choosing parameters by task
| Task | Temperature | Top-p | Other |
|---|---|---|---|
| Code generation | 0.0–0.3 | 0.95 | repetition_penalty 1.0 |
| Factual Q&A | 0.0–0.3 | 0.95 | |
| Chat | 0.6–0.8 | 0.95 | |
| Creative writing | 0.8–1.0 | 0.95 | |
| Brainstorming | 0.9–1.2 | 1.0 | |
| Self-consistency CoT | 0.7–0.9 | 0.95 | sample N times |
| Beam search | n/a | n/a | mostly avoid |
| Structured output | 0.0 | n/a | constrained decoding |
Reproducibility
Even with temperature=0, you may not get bit-exact reproducibility:
- Hardware non-determinism: floating-point operations on GPU may vary.
- Concurrent requests: server-side batching means your request's numerics can depend on what else shares the batch.
- Sampling RNG: set seed if your API supports it.
For exact reproducibility, run locally with fixed seeds. For production, log inputs/outputs but accept some non-determinism.
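For local runs, fixing seeds looks roughly like this (the determinism flag trades speed for reproducibility, and some ops may not support it):

```python
import torch

torch.manual_seed(1234)                      # fix the sampling RNG
torch.cuda.manual_seed_all(1234)             # and the GPU RNGs, if present
torch.use_deterministic_algorithms(True)     # opt into deterministic kernels where available
```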
Common pitfalls
- Setting temperature=0 doesn’t make the model deterministic across hardware/setups.
- High temperature with structured output is dangerous — JSON validity drops.
- Forgetting stop sequences → models ramble.
- Repetition penalty too high → unnatural avoidance of common words.
- Logit bias too strong → model breaks (e.g. force “yes” but it tries to emit “yes.” with a period).
See also
- Prompt fundamentals
- Structured outputs
- Stage 13 — Cost & latency — speculative decoding economics