Sampling & Decoding

A language model produces a probability distribution over the next token. How you pick from that distribution dramatically changes the output. The same model + same prompt + different sampling = wildly different behavior.

Greedy decoding

Always pick the highest-probability token.

next_token = torch.argmax(logits, dim=-1)

Pros:

  • Deterministic.
  • Highest single-token probability at every step.

Cons:

  • Repetitive (loops on common phrases).
  • Boring.
  • Often produces locally-optimal but globally-bad text.

Used for: extractive tasks where there’s a single “correct” output (some classification, simple QA).

Beam search

Keep the top-k partial sequences at each step; pick the highest cumulative probability at the end.
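
A minimal sketch of the loop (assuming model(tokens) returns a [vocab] tensor of next-token log-probabilities; no length normalization or EOS handling):

import torch

def beam_search(model, prompt_tokens, beam_width=4, steps=20):
    beams = [(0.0, list(prompt_tokens))]           # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            log_probs = model(tokens)              # [vocab] log-probs; assumed interface
            top = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((score + lp.item(), tokens + [tok.item()]))
        # keep only the beam_width best partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]                             # highest cumulative log-prob wins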

Pros:

  • Better than greedy for output quality.
  • Used in machine translation historically.

Cons:

  • Still tends to be bland and repetitive.
  • Computationally expensive.
  • Has been mostly abandoned for open-ended generation in the LLM era.

For chat / generation, beam search is rarely used.

Temperature

Divide the logits by T before softmax:

softmax(logits / T)

  • T < 1: sharper distribution, greedier.
  • T = 1: model’s natural distribution.
  • T > 1: flatter, more random.
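
A minimal sketch in PyTorch (assuming 1-D next-token logits):

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, T=0.7):
    # requires T > 0; APIs special-case T = 0 as argmax
    # T → 0 approaches argmax; T > 1 flattens the distribution
    probs = F.softmax(logits / T, dim=-1)
    return torch.multinomial(probs, num_samples=1)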

Common settings:

  • 0: greedy (most APIs treat T=0 as argmax).
  • 0.0–0.3: factual, deterministic, code-y.
  • 0.5–0.7: chat default; balances coherence and variety.
  • 0.8–1.0: creative writing, brainstorming.
  • >1.0: increasingly random; rarely useful.

Top-k sampling

Restrict to the top-k most likely tokens; sample from those.

top_k_values, top_k_indices = torch.topk(logits, k=40)
probs = F.softmax(top_k_values, dim=-1)
# multinomial indexes into the top-k set; map back to vocabulary ids
sampled = top_k_indices[torch.multinomial(probs, 1)]

Cuts off the long tail of unlikely (often nonsensical) tokens.

Common: k = 40. Works well in practice.

Top-p (nucleus) sampling

Restrict to the smallest set of tokens whose cumulative probability exceeds p.

  • For peaked distributions, this is a small set.
  • For flat distributions, this is a large set.

Adapts to the local distribution shape — generally preferred over fixed top-k.

Common: p = 0.9 to 0.95.

sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
nucleus = cumulative_probs <= p
# shift right so the first token that crosses the threshold is included
nucleus[..., 1:] = nucleus[..., :-1].clone()
nucleus[..., 0] = True
# mask out everything outside the nucleus, then sample
sorted_logits[~nucleus] = float("-inf")
probs = torch.softmax(sorted_logits, dim=-1)
sampled = sorted_indices[torch.multinomial(probs, 1)]

Combining: temperature + top-p + top-k

Most APIs let you set all three. Common defaults:

  • temperature = 0.7
  • top_p = 0.95
  • top_k = 40 (sometimes off)

Typical order (as in Hugging Face transformers): temperature → top-k filter → top-p filter → softmax → sample. The exact order varies by implementation, and it matters: top-p sees a different distribution depending on whether temperature has already been applied.
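
A sketch tying the three together in that order (1-D logits; a simplified version of what sampling libraries do internally):

import torch
import torch.nn.functional as F

def sample(logits, temperature=0.7, top_k=40, top_p=0.95):
    logits = logits / temperature
    if top_k > 0:
        # mask everything below the k-th largest logit
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        keep = cumulative <= top_p
        keep[1:] = keep[:-1].clone()    # keep the first token that crosses p
        keep[0] = True
        sorted_logits[~keep] = float("-inf")
        # scatter the filtered logits back into vocabulary order
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_indices, sorted_logits)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)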

Min-p sampling

Newer (2024): keep tokens whose probability is at least min_p × max_prob. Adapts to confidence — for confident distributions, only top tokens make it; for flat ones, more do.

Often a better default than top-p for generation quality. Supported in some frameworks (e.g. llama.cpp).
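
A sketch of the filter (1-D logits, torch in scope as above; typical min_p values are around 0.05–0.1):

probs = torch.softmax(logits, dim=-1)
threshold = min_p * probs.max()         # scale the cutoff by the top probability
logits = logits.masked_fill(probs < threshold, float("-inf"))
sampled = torch.multinomial(torch.softmax(logits, dim=-1), 1)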

Repetition penalty

Penalize tokens that have already been generated:

for token in set(past_tokens):  # divide positive logits, multiply negative ones
    logits[token] /= repetition_penalty if logits[token] > 0 else 1 / repetition_penalty

Common: 1.1 to 1.3. Prevents looping. Modern frontier models rarely need this; smaller models often do.

Frequency / presence penalties (OpenAI-style)

  • Frequency penalty: penalize tokens proportional to how often they’ve appeared.
  • Presence penalty: penalize tokens that have appeared at all (binary).

Both push the model toward more varied output. 0.0 to 2.0 typical range.
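
A sketch of how the two combine, following the adjustment described in OpenAI's docs (variable names are illustrative):

from collections import Counter

counts = Counter(generated_token_ids)   # tokens generated so far (assumed list)
for token_id, count in counts.items():
    # frequency scales with count; presence is a flat one-time penalty
    logits[token_id] -= count * frequency_penalty + presence_penalty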

Stop sequences

Halt generation when a specific string is produced:

stop_sequences = ["\n\nUser:", "###", "<|end|>"]

Useful for:

  • Multi-turn formats (stop at next role marker).
  • Few-shot prompting (stop before the next example starts).
  • Controlling output length.
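
Hosted APIs apply stops server-side; a client-side sketch of the same idea:

def truncate_at_stop(text, stop_sequences):
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]   # repeated cuts converge on the earliest stop
    return text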

Max tokens

Hard cap on output length. Watch out:

  • Hitting the cap mid-sentence produces truncated outputs.
  • Cost is bounded by max_tokens; budget accordingly.
  • Some APIs include reasoning tokens in the cap; others don’t.

Logit bias

Force or forbid specific tokens by adjusting their logits:

logit_bias = {
    token_id_yes: 100,   # strongly favor "yes"
    token_id_no: 100,    # strongly favor "no"
}
# Note: real APIs take explicit token ids only; there is no "everything
# else" key. Suppress specific tokens with negative values (e.g. -100),
# and pair strong positive biases with max_tokens=1 to force a one-token
# answer.

Used for forcing yes/no answers, restricting vocabulary, or implementing custom constraints.

OpenAI exposes logit_bias directly. With local models, you can apply the same adjustment to the logits yourself at each decoding step.
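
A sketch of that (reusing the logit_bias dict above; torch in scope):

for token_id, bias in logit_bias.items():
    logits[token_id] += bias            # add bias to raw logits before softmax
next_token = torch.argmax(logits)       # or sample as usual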

Speculative decoding

Inference optimization, not a sampling strategy per se. A small “draft” model proposes the next several tokens; the main model verifies them in parallel. When most drafts are accepted, generation is faster while the output matches what the main model would have produced alone.

Used in vLLM, TensorRT-LLM, llama.cpp. Free 2–3× speedup with minimal setup.
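
A heavily simplified sketch of the draft-and-verify loop (greedy acceptance only; real implementations use rejection sampling so the output distribution exactly matches the target model; both models are assumed to return per-position logits):

def speculative_step(target_model, draft_model, tokens, k=4):
    draft = list(tokens)
    for _ in range(k):                        # draft model proposes k tokens cheaply
        draft.append(int(draft_model(draft)[-1].argmax()))
    target_logits = target_model(draft)       # one parallel pass scores all proposals
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        predicted = int(target_logits[i - 1].argmax())
        accepted.append(predicted)            # target's token either way
        if predicted != draft[i]:             # first disagreement: stop accepting
            break
    return accepted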

Constrained decoding

Restrict generation to a grammar or schema:

  • JSON-schema-constrained decoding: only generate tokens that keep the output valid JSON matching a schema.
  • Regex-constrained decoding: only tokens consistent with a regex.
  • Custom grammars: e.g. SQL grammar.

Tools: Outlines, Guidance, llama.cpp’s grammar support, OpenAI’s strict mode.

This is how “100% valid JSON” guarantees actually work — they’re not “the model is good,” they’re “the decoder physically cannot emit invalid tokens.”
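
The mechanism in one decoding step (sketch; allowed_token_ids is assumed to come from a grammar engine such as Outlines):

mask = torch.full_like(logits, float("-inf"))
mask[allowed_token_ids] = 0.0                 # only grammar-legal tokens survive
next_token = torch.multinomial(torch.softmax(logits + mask, dim=-1), 1)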

Choosing parameters by task

Task                   Temperature   Top-p   Other
Code generation        0.0–0.3       0.95    repetition_penalty 1.0
Factual Q&A            0.0–0.3       0.95
Chat                   0.6–0.8       0.95
Creative writing       0.8–1.0       0.95
Brainstorming          0.9–1.2       1.0
Self-consistency CoT   0.7–0.9       0.95    sample N times
Beam search            n/a           n/a     mostly avoid
Structured output      0.0           n/a     constrained decoding

Reproducibility

Even with temperature=0, you may not get bit-exact reproducibility:

  • Hardware non-determinism: GPU floating-point reductions aren’t associative, so results can differ across kernels and hardware.
  • Concurrent requests: server-side batching can change numerics depending on which requests share a batch.
  • Sampling RNG: set seed if your API supports it.

For exact reproducibility, run locally with fixed seeds. For production, log inputs/outputs but accept some non-determinism.

Common pitfalls

  • Setting temperature=0 doesn’t make the model deterministic across hardware/setups.
  • High temperature with structured output is dangerous — JSON validity drops.
  • Forgetting stop sequences → models ramble.
  • Repetition penalty too high → unnatural avoidance of common words.
  • Logit bias too strong → model breaks (e.g. force “yes” but it tries to emit “yes.” with a period).

See also