Text-to-Code

Text-to-code is the fastest-growing AI application category, spanning inline autocomplete (Copilot), in-IDE chat (Cursor), and fully agentic coding (Claude Code, Devin, Cline). The patterns are converging on a recognizable stack.

The product spectrum

In order of increasing agency:

  1. Inline completion: Copilot, Tabnine, Codeium. Suggest the next token / line.
  2. Comment-driven generation: write a comment; generate the function.
  3. In-IDE chat: ask questions about your codebase, paste context, get answers.
  4. Multi-file edit: “rename this across the codebase”; “extract this to a service.”
  5. Agentic coding: “build me a feature.” Plans, edits, tests, iterates.

Top tools (early 2026):

  • Claude Code (Anthropic) — agentic CLI/IDE.
  • Cursor, Windsurf — IDE-replacement editors.
  • GitHub Copilot Workspace / Coding Agent.
  • Cline / Roo Code — open agentic IDE plugins.
  • Devin / Replit Agent — autonomous “give me the spec, get a PR.”
  • Aider — terminal-based agentic editor.
  • Continue.dev — open framework for IDE assistants.

What makes coding hard for LLMs

  • Long context: codebases are huge.
  • Strict syntax: typos break everything.
  • Side effects: code does things; mistakes cause damage.
  • Versions: APIs change; outdated training is a hazard.
  • Project conventions: style, structure, architecture.
  • Cross-file dependencies: changing one file affects many.
  • Tooling integration: linters, type checkers, tests.

Architectural patterns

Repository indexing

For “answer questions about my codebase,” index it:

  • Chunk by function/class/file.
  • Embed each chunk with a code-tuned embedder (voyage-code-3, nomic-embed-code, CodeRankEmbed).
  • Hybrid retrieval (semantic + keyword) — code search benefits from BM25.
  • Optionally graph-based: index the call graph, type graph, import graph.

Tools: tree-sitter for parsing, dedicated indexers (Sourcegraph, Bloop, Cursor’s indexer).
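A minimal sketch of the index-and-search step, using a line-window chunker in place of a real tree-sitter split and a crude keyword overlap in place of BM25; `embed()` is a placeholder for whichever code-tuned embedder you use.

```python
import math
from pathlib import Path

def embed(text: str) -> list[float]:
    """Placeholder for a code-tuned embedding model (e.g. voyage-code-3)."""
    raise NotImplementedError

def chunk_file(path: Path, max_lines: int = 60) -> list[dict]:
    """Naive line-window chunking; real indexers split on function/class
    boundaries using a tree-sitter parse instead."""
    lines = path.read_text(errors="ignore").splitlines()
    return [
        {"path": str(path), "start": i, "text": "\n".join(lines[i:i + max_lines])}
        for i in range(0, len(lines), max_lines)
    ]

def build_index(repo_root: str) -> list[dict]:
    chunks = []
    for path in Path(repo_root).rglob("*.py"):
        for chunk in chunk_file(path):
            chunk["vector"] = embed(chunk["text"])
            chunks.append(chunk)
    return chunks

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-9)

def hybrid_search(index: list[dict], query: str, query_vector: list[float], k: int = 10) -> list[dict]:
    """Blend semantic similarity with a crude keyword score (stand-in for BM25)."""
    terms = set(query.lower().split())
    scored = []
    for c in index:
        semantic = cosine(query_vector, c["vector"])
        keyword = len(terms & set(c["text"].lower().split())) / (len(terms) or 1)
        scored.append((0.7 * semantic + 0.3 * keyword, c))
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)[:k]]
```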

Context selection

The agent must decide what to put in context:

  • Files mentioned by user.
  • Files retrieved via semantic search.
  • Files in the same directory.
  • Files imported by the target file.
  • Recent edits.
  • Failing tests / errors.

Smart context selection is what separates Copilot-feeling tools from “really gets my codebase” tools.
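One way to operationalize this is a weighted score over those signals plus greedy packing into a token budget. The sketch below is illustrative; the weights and helpers like `count_tokens` are assumptions, not any particular tool's logic.

```python
def score_candidate_file(path: str,
                         mentioned: set[str],
                         retrieved: dict[str, float],
                         target_dir: str,
                         imported_by_target: set[str],
                         recently_edited: set[str],
                         failing: set[str]) -> float:
    """Combine the signals above into one priority score per file."""
    score = 0.0
    if path in mentioned:           score += 5.0   # explicit user mention wins
    score += 3.0 * retrieved.get(path, 0.0)         # semantic search relevance
    if path.startswith(target_dir): score += 1.0   # same directory
    if path in imported_by_target:  score += 2.0   # direct dependency
    if path in recently_edited:     score += 1.5   # likely still relevant
    if path in failing:             score += 4.0   # errors demand attention
    return score

def select_context(candidates: list[str], budget_tokens: int, count_tokens, **signals) -> list[str]:
    """Greedily pack the highest-scoring files until the token budget runs out."""
    ranked = sorted(candidates, key=lambda p: score_candidate_file(p, **signals), reverse=True)
    chosen, used = [], 0
    for path in ranked:
        cost = count_tokens(path)
        if used + cost <= budget_tokens:
            chosen.append(path)
            used += cost
    return chosen
```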

Tool-driven editing

For multi-file edits, agentic tools use a small set of strong tools:

  • read_file(path, line_range)
  • write_file(path, content)
  • edit_file(path, old_string, new_string)
  • glob(pattern), grep(pattern, path)
  • run_shell(cmd)
  • run_tests(path)

Claude Code’s tool set is roughly this. It’s small and composable.
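A sketch of what that tool layer can look like: generic JSON-schema-style declarations plus a dispatch table. The schema format and the exactly-one-match rule in `edit_file` are illustrative assumptions, not a specific vendor's API.

```python
import subprocess
from pathlib import Path

TOOLS = [
    {"name": "read_file",
     "description": "Read a file, optionally a line range.",
     "parameters": {"path": "string", "start_line": "integer?", "end_line": "integer?"}},
    {"name": "edit_file",
     "description": "Replace an exact string in a file with a new string.",
     "parameters": {"path": "string", "old_string": "string", "new_string": "string"}},
    {"name": "run_shell",
     "description": "Run a shell command and return its output.",
     "parameters": {"cmd": "string"}},
]

def read_file(path: str, start_line: int | None = None, end_line: int | None = None) -> str:
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[(start_line or 1) - 1:end_line])

def edit_file(path: str, old_string: str, new_string: str) -> str:
    text = Path(path).read_text()
    if text.count(old_string) != 1:
        return "error: old_string must match exactly once"   # forces precise, unambiguous edits
    Path(path).write_text(text.replace(old_string, new_string))
    return "ok"

def run_shell(cmd: str) -> str:
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

DISPATCH = {"read_file": read_file, "edit_file": edit_file, "run_shell": run_shell}

def handle_tool_call(name: str, args: dict) -> str:
    """Route a model-issued tool call to the matching local implementation."""
    return DISPATCH[name](**args)
```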

Verify with tools

The hallmark of strong code agents:

  • Edit code → run tests / type-check / linter → see errors → fix → repeat.

This is the verifier-loop pattern from Stage 11. Code is uniquely well-suited to it because compilation and tests are cheap to run.
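A sketch of that loop, assuming pytest (or any command that exits non-zero on failure) as the verifier and a placeholder `ask_model_for_fix` for the LLM call that turns failure output into the next edit.

```python
import subprocess

def run_checks(test_cmd: str = "pytest -x -q") -> tuple[bool, str]:
    """Run the project's checks; return (passed, combined output)."""
    result = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def verifier_loop(apply_edit, ask_model_for_fix, max_rounds: int = 5) -> bool:
    """Edit -> test -> feed failures back -> repeat, with a hard iteration cap."""
    for _ in range(max_rounds):
        passed, output = run_checks()
        if passed:
            return True
        fix = ask_model_for_fix(output)     # placeholder: LLM proposes the next edit
        if fix is None:
            break                           # model gives up; surface to the user
        apply_edit(fix)
    return False
```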

In-IDE patterns

Inline completion

Pattern: every keystroke triggers a quick completion call. Models tuned for this:

  • Tab-Tab models: sub-second time to first token (TTFT), multi-line suggestions.
  • Fill-in-the-middle (FIM): the model fills a gap given prefix and suffix.

Code-specialized models (DeepSeek-Coder, Qwen-Coder, Codestral) have FIM tokens trained in.
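A sketch of assembling a FIM prompt. The sentinel strings below are generic placeholders; each model family defines its own prefix/suffix/middle tokens, so check the model card before hard-coding them.

```python
# Generic sentinel names; real models each define their own FIM tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(file_text: str, cursor_offset: int,
                     max_prefix_chars: int = 4000,
                     max_suffix_chars: int = 2000) -> str:
    """The model sees code before and after the cursor and fills the gap."""
    prefix = file_text[:cursor_offset][-max_prefix_chars:]
    suffix = file_text[cursor_offset:][:max_suffix_chars]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
```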

Chat with context

User selects code, asks a question. The IDE injects relevant context (selection + open files + retrieved chunks).

Apply edits

Cursor’s “apply” pattern: chat suggests an edit; user clicks “apply”; the IDE merges the diff into the file. Cleaner than asking the user to copy-paste.

Diffs as the UX

Show every model edit as a diff. Let the user accept/reject per-hunk. The user stays in control; the agent contributes.
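A standard-library sketch of the propose-then-apply flow: compute a unified diff for display, and only write to disk once the user accepts. Per-hunk accept/reject is omitted for brevity.

```python
import difflib
from pathlib import Path

def propose_edit(path: str, new_content: str) -> str:
    """Return a unified diff between the file on disk and the model's proposal."""
    old = Path(path).read_text().splitlines(keepends=True)
    new = new_content.splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, fromfile=path, tofile=f"{path} (proposed)"))

def apply_if_accepted(path: str, new_content: str, accepted: bool) -> bool:
    """Write the proposal only after the user accepts the diff."""
    if accepted:
        Path(path).write_text(new_content)
    return accepted

# Usage: render the diff in the IDE, then apply on click.
# diff = propose_edit("app.py", suggestion)
# apply_if_accepted("app.py", suggestion, accepted=user_clicked_apply)
```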

Agentic coding patterns

Plan-execute-verify

  1. Understand the task (read relevant files)
  2. Plan the approach (sometimes explicitly, sometimes implicitly)
  3. Make edits
  4. Run tests / type-checker
  5. If failures: debug; go to 3
  6. If success: summarize for user

Claude Code, Cursor’s agent mode, and Devin all roughly follow this.
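A compressed sketch of that outer loop, with `plan`, `edit_step`, and `summarize` as placeholders for model calls and `run_checks` as the test/type-check step sketched earlier.

```python
def plan_execute_verify(task: str, plan, edit_step, run_checks, summarize,
                        max_iterations: int = 8) -> str:
    """Outer agent loop: plan, then alternate edits with verification."""
    steps = plan(task)                        # placeholder: model produces a step list
    for _ in range(max_iterations):
        edit_step(task, steps)                # placeholder: model reads files and edits
        passed, output = run_checks()         # tests / type-checker
        if passed:
            return summarize(task, steps)     # placeholder: model writes the summary
        steps = plan(task + "\n\nFailures:\n" + output)   # re-plan with the errors
    return "gave up: iteration limit reached"
```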

Sub-agents

For big tasks, spawn sub-agents:

  • A “spec” sub-agent reviews the request and produces a plan.
  • A “research” sub-agent reads relevant code.
  • An “implement” sub-agent does the editing.
  • A “review” sub-agent checks the result.

Often the same model with different system prompts and tools.
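A sketch of that pattern: one table of role configurations, each pairing a system prompt with a tool scope, all routed through a placeholder `call_model` function.

```python
SUBAGENTS = {
    "spec":      {"system": "Turn the user request into a concrete implementation plan.",
                  "tools": ["read_file", "grep"]},
    "research":  {"system": "Read the relevant code and report how it works today.",
                  "tools": ["read_file", "grep", "glob"]},
    "implement": {"system": "Make the planned edits. Keep changes minimal.",
                  "tools": ["read_file", "edit_file", "run_shell"]},
    "review":    {"system": "Check the diff for bugs, style issues, missed tests.",
                  "tools": ["read_file", "run_shell"]},
}

def run_subagent(role: str, task: str, call_model) -> str:
    """Same underlying model, different persona and tool scope per role."""
    cfg = SUBAGENTS[role]
    return call_model(system=cfg["system"], user=task, tools=cfg["tools"])

def run_pipeline(request: str, call_model) -> str:
    plan     = run_subagent("spec", request, call_model)
    findings = run_subagent("research", plan, call_model)
    result   = run_subagent("implement", plan + "\n\n" + findings, call_model)
    return run_subagent("review", result, call_model)
```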

Memory across runs

Project-specific memory:

  • CLAUDE.md / .cursorrules / .aider.conf — persistent project context.
  • “Lessons” from prior sessions: things the agent should remember.

By 2026, this pattern is standard.
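A sketch of the memory mechanics, using CLAUDE.md as the example filename: load the file into the system prompt at session start, append lessons as they are learned.

```python
from datetime import date
from pathlib import Path

MEMORY_FILE = Path("CLAUDE.md")   # or .cursorrules, or another project memory file

def load_project_memory() -> str:
    """Prepended to the system prompt at the start of every session."""
    return MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

def remember(lesson: str) -> None:
    """Append a lesson learned this session so future runs don't repeat the mistake."""
    with MEMORY_FILE.open("a") as f:
        f.write(f"\n- ({date.today()}) {lesson}")

# Usage:
# remember("Tests must be run with `make test`, not pytest directly.")
# system_prompt = BASE_PROMPT + "\n\nProject notes:\n" + load_project_memory()
```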

Eval

How do you know your code agent is good?

Public benchmarks

  • HumanEval, MBPP: single-function code generation.
  • BigCodeBench: harder single-function.
  • SWE-bench Verified: real-world GitHub issues from major Python projects.
  • LiveCodeBench: contemporary competitive programming.
  • CodeContests (DeepMind): competitive programming.

SWE-bench is the closest to “real-world coding.” Top frontier agents score 60–80% on the Verified subset by 2026, wildly improved from 2024’s sub-20% results.

Internal evals

  • Replay real issues from your repo; see if the agent solves them.
  • Inject bugs into known-good code; see if the agent fixes them.
  • Measure time to first useful suggestion, time to test pass, and PR acceptance rate.
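A sketch of a replay harness for the first of these, assuming each case records a starting commit, the issue text, and a test command that only passes once the issue is fixed; `run_agent` is a placeholder for launching your agent against the checked-out workspace.

```python
import subprocess
import time
from dataclasses import dataclass

@dataclass
class EvalCase:
    repo_commit: str      # known-bad starting state
    issue_text: str       # what the agent is asked to fix
    test_cmd: str         # passes only if the fix is correct

def run_eval(cases: list[EvalCase], run_agent) -> dict:
    """Replay historical issues and measure the solve rate."""
    results = []
    for case in cases:
        subprocess.run(["git", "checkout", case.repo_commit], check=True)
        start = time.time()
        run_agent(case.issue_text)            # placeholder: launch the coding agent
        passed = subprocess.run(case.test_cmd, shell=True).returncode == 0
        results.append({"passed": passed, "seconds": round(time.time() - start, 1)})
    solved = sum(r["passed"] for r in results)
    return {"solve_rate": solved / len(results), "results": results}
```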

Cost & latency

Agentic coding is expensive:

  • 100k+ tokens of context for a serious codebase task.
  • Many iterations.
  • Reasoning models for planning.

A real “fix this issue” task might cost $0.50–$10 in API calls. Worth it if it saves a developer an hour; expensive at scale for trivial tasks.

Mitigations:

  • Prompt caching (huge for repeated codebase context).
  • Cheap router → strong agent: a cheap model decides if the task needs heavy lifting (sketched after this list).
  • Bounded tool calls per task.
  • Streaming and intermediate results so the user can intervene.
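A sketch of the router idea: a small model triages the request, and only tasks it labels complex reach the expensive agent. All three callables are placeholders.

```python
def route(request: str, call_cheap_model, call_strong_agent, call_cheap_agent) -> str:
    """Ask a small model to triage before spending on the full agent."""
    verdict = call_cheap_model(
        "Classify this coding request as TRIVIAL (one-file, obvious change) "
        "or COMPLEX (multi-file, needs planning). Answer with one word.\n\n" + request
    ).strip().upper()
    if verdict == "TRIVIAL":
        return call_cheap_agent(request)     # small model, tight tool budget
    return call_strong_agent(request)        # frontier model, full loop
```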

Safety

Code agents have write access to your filesystem and shell. Treat them accordingly:

  • Sandbox where possible: containers, VMs, ephemeral workspaces.
  • Confirmation for destructive shell commands.
  • Read/write scope: limit which directories the agent can touch.
  • Don’t run untrusted code outside a sandbox.
  • Auditable logs of every tool call.
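A sketch of two of these guards: a workspace scope check for file tools and a confirmation gate for destructive-looking shell commands. The pattern list is illustrative and deliberately not exhaustive; a real deployment still wants the sandbox.

```python
import re
from pathlib import Path

WORKSPACE = Path("/workspace/project").resolve()   # the only writable root
DESTRUCTIVE = [r"\brm\b", r"\bgit\s+push\b", r"\bdrop\s+table\b", r"\bcurl\b.*\|\s*sh"]

def within_workspace(path: str) -> bool:
    """Reject reads/writes that escape the allowed directory."""
    resolved = (WORKSPACE / path).resolve()
    return resolved.is_relative_to(WORKSPACE)

def needs_confirmation(cmd: str) -> bool:
    return any(re.search(p, cmd, re.IGNORECASE) for p in DESTRUCTIVE)

def guarded_shell(cmd: str, confirm, run) -> str:
    """confirm() asks the user; run() actually executes (inside a sandbox)."""
    if needs_confirmation(cmd) and not confirm(f"Agent wants to run: {cmd!r}. Allow?"):
        return "blocked by user"
    return run(cmd)
```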

Model selection

For coding tasks specifically:

  • Frontier all-rounders: Claude Sonnet 4.6/4.7, GPT-5, Gemini 3 — strong at code.
  • Code-specialized open: Qwen2.5-Coder, DeepSeek-Coder-V3, Codestral.
  • Reasoning for hard bugs: extended thinking / o-series models.
  • Cheap for autocomplete: smaller code-tuned models.

Most production tools use a frontier model for serious tasks and a smaller one for inline suggestions.

Pitfalls

  • Stale model knowledge: outdated APIs in suggestions. Mitigate with docs in context.
  • Confidently wrong code: especially in unfamiliar libraries. Always run tests.
  • Subtle security bugs: SQL injection, XSS, command injection — the code looks fine but isn’t. Use security-aware static analysis.
  • Over-eager refactoring: the agent rewrites things unnecessarily. Constrain it in the system prompt.
  • Forgetting tests: the agent fixes the feature but breaks the tests. Always run them.
  • Loops: the agent edits the same file repeatedly. Detect and bail (see the sketch below).
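A sketch of that loop check: hash each proposed (file, content) pair and bail out once the same edit recurs. The repeat threshold is arbitrary.

```python
import hashlib
from collections import Counter

class EditLoopDetector:
    """Stop the agent when it keeps proposing the same edit to the same file."""
    def __init__(self, max_repeats: int = 3):
        self.seen = Counter()
        self.max_repeats = max_repeats

    def record(self, path: str, new_content: str) -> bool:
        """Returns True if the agent should bail out."""
        key = hashlib.sha256(f"{path}\n{new_content}".encode()).hexdigest()
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats
```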

Patterns from Claude Code (and similar)

The architecture of mature code agents looks like:

  • A small set of well-designed tools (read, write, edit, search, shell, test).
  • A strong system prompt with project context.
  • Reasoning model with extended thinking for hard tasks.
  • Subagents for research / parallel exploration.
  • Persistent memory via project files.
  • Streaming UI showing every step.
  • Granular permissions / confirmations for risky ops.
  • Strict observability (every command logged).

You can build something close to this from raw API calls. Frameworks help with packaging.

See also