Few-Shot & Chain-of-Thought
The two most important prompting techniques. Both turn a model from “guess based on instructions” to “follow demonstrated patterns.”
Zero-shot
The baseline. Just describe the task and ask:
Classify the sentiment of this review as positive, negative, or neutral.
Review: "I love this product!"
Sentiment:
Modern models handle zero-shot well for common tasks. For unusual tasks, strict schemas, or rare formats, zero-shot often fails in subtle ways.
Few-shot prompting
Show the model a few examples of input → output before the real query:
Classify the sentiment of this review as positive, negative, or neutral.
Review: "Excellent quality, fast shipping."
Sentiment: positive
Review: "It works but the manual is unclear."
Sentiment: neutral
Review: "Broke after a week, do not buy."
Sentiment: negative
Review: "I love this product!"
Sentiment:
The model continues the pattern. Few-shot is the single most effective generic technique for getting the right output format and behavior.
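The sentiment prompt above can be assembled programmatically. A minimal sketch (the `few_shot_prompt` helper and `EXAMPLES` list are ours, not a library API):

```python
# Hypothetical helper: build a few-shot sentiment prompt from
# (review, label) pairs plus the new query.
EXAMPLES = [
    ("Excellent quality, fast shipping.", "positive"),
    ("It works but the manual is unclear.", "neutral"),
    ("Broke after a week, do not buy.", "negative"),
]

def few_shot_prompt(query, examples=EXAMPLES):
    header = ("Classify the sentiment of this review as "
              "positive, negative, or neutral.\n\n")
    shots = "".join(
        f'Review: "{r}"\nSentiment: {label}\n\n' for r, label in examples
    )
    # End with an unfinished "Sentiment:" so the model completes the pattern
    return header + shots + f'Review: "{query}"\nSentiment:'
```

Ending the prompt mid-pattern is the whole trick: the most likely continuation is the label.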
How many examples
- 0: works for simple, common tasks.
- 1–3: huge jump in reliability for format-sensitive tasks.
- 5–10: marginal returns; sometimes useful for rare tasks.
- 20+: usually wasteful — diminishing returns, increased cost, more chances for distractors.
Choosing examples
- Cover the distribution of inputs you expect to see.
- Include the edge cases you want handled: empty inputs, unusual phrasings.
- Avoid giving every example in the same style, or the model will overfit to it.
- Keep examples short; long ones eat your context budget.
- Order matters: the last example, closest to the query, tends to influence the output most.
Dynamic few-shot (RAG-for-prompts)
Instead of fixed examples, retrieve relevant ones from a database:
def build_prompt(query, example_db):
    # Retrieve the 3 stored demonstrations most similar to the query
    similar_examples = example_db.retrieve_top_k(query, k=3)
    return format_few_shot(similar_examples + [query])
This is a form of RAG (Stage 09): the retrieval target is example demonstrations rather than documents.
Chain-of-Thought (CoT)
Wei et al. (2022): for multi-step problems, having the model reason step by step before answering dramatically improves accuracy.
Zero-shot CoT
The famous one-line magic:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has
3 tennis balls. How many tennis balls does he have now?
Let's think step by step.
The “Let’s think step by step” trigger (Kojima et al. 2022) reliably elicits CoT from instruction-tuned models.
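Mechanically, zero-shot CoT is just string concatenation. A tiny sketch (the `with_cot` helper name and Q/A framing are ours):

```python
COT_TRIGGER = "Let's think step by step."

def with_cot(question):
    # Append the Kojima et al. (2022) zero-shot CoT trigger to any question
    return f"Q: {question}\nA: {COT_TRIGGER}"
```

Starting the answer with the trigger, rather than putting it in the question, nudges the model to continue the reasoning directly.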
Few-shot CoT
Show the reasoning explicitly in examples:
Q: Roger has 5 tennis balls. He buys 2 more cans. Each can has 3 balls. How many balls does he have?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11. The answer is 11.
Q: A juggler can juggle 16 balls. Half of them are golf balls, and half of the golf balls are blue. How many blue golf balls?
A: Total balls: 16. Golf balls: 16/2 = 8. Blue golf balls: 8/2 = 4. The answer is 4.
Q: <new question>
A:
The model continues the “show your work” pattern.
Why CoT works
Each generated CoT token expands the model’s effective compute. Instead of computing the answer in one forward pass, the model breaks the problem into steps and uses earlier output as input to later steps.
For arithmetic, multi-hop reasoning, and planning, CoT can turn a task the model gets 30% right into one it gets 80% right.
Limitations of CoT
- Plausible nonsense: a confident-sounding chain can lead to a wrong answer.
- Compute cost: more tokens generated = more time and money.
- Not always better: simple tasks don’t benefit.
- Pretraining signal: models trained without much CoT data don’t follow it as well.
For weaker models, CoT helps a lot. For frontier models, the gap shrinks because they were trained on lots of CoT examples already.
Reasoning models vs CoT
Reasoning models (Stage 07) internalize CoT. They were trained to think for a long time before answering, often with a hidden trace.
For reasoning models, you usually don’t need to ask for CoT — they do it automatically. Asking explicitly can even slow them down or confuse them.
Self-consistency
Sample multiple CoT outputs at high temperature; take the majority vote of the final answers (Wang et al. 2022).
from collections import Counter

def self_consistent_answer(model, cot_prompt, n=10):
    # Sample n independent reasoning chains at high temperature
    answers = []
    for _ in range(n):
        response = model.complete(cot_prompt, temperature=0.8)
        answers.append(extract_answer(response))
    # Majority vote over the final answers
    return Counter(answers).most_common(1)[0][0]
Often boosts accuracy 5–15 percentage points over single-shot CoT. Cost: 10× the API calls.
Variants and extensions
- Plan-and-Solve (Wang et al. 2023): “First devise a plan; then carry out the plan.”
- Step-Back Prompting (Zheng et al. 2023): ask for the high-level concept first, then the specific answer.
- Decomposition prompts: explicitly break the problem into named subproblems.
- Self-ask: the model asks itself follow-up questions before answering.
These are all riffs on “use generated tokens to do more thinking before committing to an answer.”
When CoT hurts
- Simple lookups: “What’s the capital of France?” doesn’t need CoT.
- Format-strict outputs (JSON): reasoning text bleeds into the format. Separate the reasoning from the final output, or use tool calls.
- Latency-critical paths: CoT can multiply your latency several times over.
Practical patterns
“Reason then answer” with separator
Think through the problem in <reasoning> tags. Then output your final answer in <answer> tags.
Question: ...
The model produces:
<reasoning>...</reasoning>
<answer>42</answer>
You parse the answer cleanly.
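The parsing step can be a single regex. A minimal sketch (the `parse_answer` helper is ours, assuming the tag format shown above):

```python
import re

def parse_answer(completion):
    # Pull the final answer out of <answer>...</answer>; the reasoning
    # stays in the completion but is not returned to the caller.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> tag in model output")
    return match.group(1).strip()
```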
CoT for reliability, fast model for triage
Use a small fast model to decide if CoT is needed; if so, route to a stronger model.
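A minimal sketch of that routing, assuming both models expose a `complete(prompt) -> str` method (the method name and the yes/no triage prompt are our assumptions):

```python
def answer(question, fast_model, strong_model):
    # Hypothetical router: a cheap model first decides whether the
    # question needs multi-step reasoning at all.
    verdict = fast_model.complete(
        "Does answering this require multi-step reasoning? "
        f"Reply yes or no.\nQuestion: {question}"
    )
    if verdict.strip().lower().startswith("yes"):
        # Route hard questions to the stronger model with a CoT trigger
        return strong_model.complete(f"{question}\nLet's think step by step.")
    return fast_model.complete(question)
```

The triage call is cheap, so you only pay CoT latency on the questions that need it.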
Verify with a second model
Generate an answer with CoT; have a second model check the chain. Sometimes catches confident wrong answers.
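One way to wire that up, assuming the same hypothetical `complete(prompt) -> str` interface and a CORRECT/INCORRECT verdict format of our choosing:

```python
def checked_answer(solver, checker, question):
    # Sketch: the solver produces a CoT answer, then a second model
    # audits the reasoning chain for errors.
    solution = solver.complete(f"{question}\nLet's think step by step.")
    verdict = checker.complete(
        "Check this reasoning for errors. Reply CORRECT or INCORRECT.\n"
        f"Question: {question}\nSolution: {solution}"
    )
    return solution, verdict.strip().upper().startswith("CORRECT")
```

When the checker flags INCORRECT, you can retry, escalate to a stronger model, or surface the answer with a warning.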