Sampling & Decoding

A language model produces a probability distribution over the next token. How you pick from that distribution dramatically changes the output. The same model + same prompt + different sampling = wildly different behavior.

Greedy decoding

Always pick the highest-probability token.

next_token = torch.argmax(logits, dim=-1)

Pros:

  • Deterministic.
  • Highest single-token probability at every step.

Cons:

  • Repetitive (loops on common phrases).
  • Boring.
  • Often produces locally-optimal but globally-bad text.

Used for: extractive tasks where there’s a single “correct” output (some classification, simple QA).

Beam search

Keep the top-k partial sequences at each step; pick the highest cumulative probability at the end.
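
A minimal sketch of the loop (assuming model(tokens) returns a [vocab] tensor of next-token log-probabilities; no length normalization or EOS handling):

import torch

def beam_search(model, prompt_tokens, beam_width=4, steps=20):
    beams = [(0.0, list(prompt_tokens))]           # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            log_probs = model(tokens)              # [vocab] log-probs; assumed interface
            top = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((score + lp.item(), tokens + [tok.item()]))
        # keep only the beam_width best partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]                             # highest cumulative log-prob wins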

Pros:

  • Better than greedy for output quality.
  • Used in machine translation historically.

Cons:

  • Still tends to be bland and repetitive.
  • Computationally expensive.
  • Has been mostly abandoned for open-ended generation in the LLM era.

For chat / generation, beam search is rarely used.

Temperature

Divide the logits by T before softmax:

softmax(logits / T)

  • T < 1: sharper distribution, greedier.
  • T = 1: model’s natural distribution.
  • T > 1: flatter, more random.
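
A minimal sketch in PyTorch (assuming 1-D next-token logits):

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, T=0.7):
    # requires T > 0; APIs special-case T = 0 as argmax
    # T → 0 approaches argmax; T > 1 flattens the distribution
    probs = F.softmax(logits / T, dim=-1)
    return torch.multinomial(probs, num_samples=1)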

Common settings:

  • 0: greedy (most APIs treat T=0 as argmax).
  • 0.0–0.3: factual, deterministic, code-y.
  • 0.5–0.7: chat default; balances coherence and variety.
  • 0.8–1.0: creative writing, brainstorming.
  • >1.0: increasingly random; rarely useful.

Top-k sampling

Restrict to the top-k most likely tokens; sample from those.

top_k_values, top_k_indices = torch.topk(logits, k=40)
probs = F.softmax(top_k_values, dim=-1)
# multinomial indexes into the top-k set; map back to vocabulary ids
sampled = top_k_indices[torch.multinomial(probs, 1)]

Cuts off the long tail of unlikely (often nonsensical) tokens.

Common: k = 40. Works well in practice.

Top-p (nucleus) sampling

Restrict to the smallest set of tokens whose cumulative probability exceeds p.

  • For peaked distributions, this is a small set.
  • For flat distributions, this is a large set.

Adapts to the local distribution shape — generally preferred over fixed top-k.

Common: p = 0.9 to 0.95.

sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
nucleus = cumulative_probs <= p
# shift right so the first token that crosses the threshold is included
nucleus[..., 1:] = nucleus[..., :-1].clone()
nucleus[..., 0] = True
# mask out everything outside the nucleus, then sample
sorted_logits[~nucleus] = float("-inf")
probs = torch.softmax(sorted_logits, dim=-1)
sampled = sorted_indices[torch.multinomial(probs, 1)]

Combining: temperature + top-p + top-k

Most APIs let you set all three. Common defaults:

  • temperature = 0.7
  • top_p = 0.95
  • top_k = 40 (sometimes off)

Typical order (as in Hugging Face transformers): temperature → top-k filter → top-p filter → softmax → sample. The exact order varies by implementation, and it matters: top-p sees a different distribution depending on whether temperature has already been applied.
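
A sketch tying the three together in that order (1-D logits; a simplified version of what sampling libraries do internally):

import torch
import torch.nn.functional as F

def sample(logits, temperature=0.7, top_k=40, top_p=0.95):
    logits = logits / temperature
    if top_k > 0:
        # mask everything below the k-th largest logit
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        keep = cumulative <= top_p
        keep[1:] = keep[:-1].clone()    # keep the first token that crosses p
        keep[0] = True
        sorted_logits[~keep] = float("-inf")
        # scatter the filtered logits back into vocabulary order
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_indices, sorted_logits)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)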

Min-p sampling

Newer (2024): keep tokens whose probability is at least min_p × max_prob. Adapts to confidence — for confident distributions, only top tokens make it; for flat ones, more do.

Often a better default than top-p for generation quality. Supported in some frameworks (e.g. llama.cpp).
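
A sketch of the filter (1-D logits, torch in scope as above; typical min_p values are around 0.05–0.1):

probs = torch.softmax(logits, dim=-1)
threshold = min_p * probs.max()         # scale the cutoff by the top probability
logits = logits.masked_fill(probs < threshold, float("-inf"))
sampled = torch.multinomial(torch.softmax(logits, dim=-1), 1)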

Repetition penalty

Penalize tokens that have already been generated:

for token in set(past_tokens):  # divide positive logits, multiply negative ones
    logits[token] /= repetition_penalty if logits[token] > 0 else 1 / repetition_penalty

Common: 1.1 to 1.3. Prevents looping. Modern frontier models rarely need this; smaller models often do.

Frequency / presence penalties (OpenAI-style)

  • Frequency penalty: penalize tokens proportional to how often they’ve appeared.
  • Presence penalty: penalize tokens that have appeared at all (binary).

Both push the model toward more varied output. 0.0 to 2.0 typical range.
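
A sketch of how the two combine, following the adjustment described in OpenAI's docs (variable names are illustrative):

from collections import Counter

counts = Counter(generated_token_ids)   # tokens generated so far (assumed list)
for token_id, count in counts.items():
    # frequency scales with count; presence is a flat one-time penalty
    logits[token_id] -= count * frequency_penalty + presence_penalty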

Stop sequences

Halt generation when a specific string is produced:

stop_sequences = ["\n\nUser:", "###", "<|end|>"]

Useful for:

  • Multi-turn formats (stop at next role marker).
  • Few-shot prompting (stop before the next example starts).
  • Controlling output length.
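
Hosted APIs apply stops server-side; a client-side sketch of the same idea:

def truncate_at_stop(text, stop_sequences):
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]   # repeated cuts converge on the earliest stop
    return text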

Max tokens

Hard cap on output length. Watch out:

  • Hitting the cap mid-sentence produces truncated outputs.
  • Cost is bounded by max_tokens; budget accordingly.
  • Some APIs include reasoning tokens in the cap; others don’t.

Logit bias

Force or forbid specific tokens by adjusting their logits:

logit_bias = {
    token_id_yes: 100,   # strongly favor "yes"
    token_id_no: 100,    # strongly favor "no"
}
# Note: real APIs take explicit token ids only; there is no "everything
# else" key. Suppress specific tokens with negative values (e.g. -100),
# and pair strong positive biases with max_tokens=1 to force a one-token
# answer.

Used for forcing yes/no answers, restricting vocabulary, or implementing custom constraints.

OpenAI exposes logit_bias directly. With local models, you can apply the same adjustment to the logits yourself at each decoding step.
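
A sketch of that (reusing the logit_bias dict above; torch in scope):

for token_id, bias in logit_bias.items():
    logits[token_id] += bias            # add bias to raw logits before softmax
next_token = torch.argmax(logits)       # or sample as usual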

Speculative decoding

Inference optimization, not a sampling strategy per se. A small “draft” model proposes the next several tokens; the main model verifies them in parallel. When most drafts are accepted, generation is faster while the output matches what the main model would have produced alone.

Used in vLLM, TensorRT-LLM, llama.cpp. Free 2–3× speedup with minimal setup.
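
A heavily simplified sketch of the draft-and-verify loop (greedy acceptance only; real implementations use rejection sampling so the output distribution exactly matches the target model; both models are assumed to return per-position logits):

def speculative_step(target_model, draft_model, tokens, k=4):
    draft = list(tokens)
    for _ in range(k):                        # draft model proposes k tokens cheaply
        draft.append(int(draft_model(draft)[-1].argmax()))
    target_logits = target_model(draft)       # one parallel pass scores all proposals
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        predicted = int(target_logits[i - 1].argmax())
        accepted.append(predicted)            # target's token either way
        if predicted != draft[i]:             # first disagreement: stop accepting
            break
    return accepted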

Constrained decoding

Restrict generation to a grammar or schema:

  • JSON-schema-constrained decoding: only generate tokens that keep the output valid JSON matching a schema.
  • Regex-constrained decoding: only tokens consistent with a regex.
  • Custom grammars: e.g. SQL grammar.

Tools: Outlines, Guidance, llama.cpp’s grammar support, OpenAI’s strict mode.

This is how “100% valid JSON” guarantees actually work — they’re not “the model is good,” they’re “the decoder physically cannot emit invalid tokens.”
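
The mechanism in one decoding step (sketch; allowed_token_ids is assumed to come from a grammar engine such as Outlines):

mask = torch.full_like(logits, float("-inf"))
mask[allowed_token_ids] = 0.0                 # only grammar-legal tokens survive
next_token = torch.multinomial(torch.softmax(logits + mask, dim=-1), 1)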

Choosing parameters by task

Task                   Temperature   Top-p   Other
Code generation        0.0–0.3       0.95    repetition_penalty 1.0
Factual Q&A            0.0–0.3       0.95
Chat                   0.6–0.8       0.95
Creative writing       0.8–1.0       0.95
Brainstorming          0.9–1.2       1.0
Self-consistency CoT   0.7–0.9       0.95    sample N times
Beam search            n/a           n/a     mostly avoid
Structured output      0.0           n/a     constrained decoding

Reproducibility

Even with temperature=0, you may not get bit-exact reproducibility:

  • Hardware non-determinism: GPU floating-point reductions aren’t associative, so results can differ across kernels and hardware.
  • Concurrent requests: server-side batching can change numerics depending on which requests share a batch.
  • Sampling RNG: set seed if your API supports it.

For exact reproducibility, run locally with fixed seeds. For production, log inputs/outputs but accept some non-determinism.

Common pitfalls

  • Setting temperature=0 doesn’t make the model deterministic across hardware/setups.
  • High temperature with structured output is dangerous — JSON validity drops.
  • Forgetting stop sequences → models ramble.
  • Repetition penalty too high → unnatural avoidance of common words.
  • Logit bias too strong → model breaks (e.g. force “yes” but it tries to emit “yes.” with a period).

See also