Text-to-Code
The fastest-growing AI application category. From autocomplete (Copilot) to in-IDE chat (Cursor) to fully agentic coding (Claude Code, Devin, Cline). The patterns are converging on a recognizable stack.
The product spectrum
Increasing in agency:
- Inline completion: Copilot, Tabnine, Codeium. Suggest the next token / line.
- Comment-driven generation: write a comment; generate the function.
- In-IDE chat: ask questions about your codebase, paste context, get answers.
- Multi-file edit: “rename this across the codebase”; “extract this to a service.”
- Agentic coding: “build me a feature.” Plans, edits, tests, iterates.
Top tools (early 2026):
- Claude Code (Anthropic) — agentic CLI/IDE.
- Cursor, Windsurf — IDE-replacement editors.
- GitHub Copilot Workspace / Coding Agent.
- Cline / Roo Code — open agentic IDE plugins.
- Devin / Replit Agent — autonomous “give me the spec, get a PR.”
- Aider — terminal-based agentic editor.
- Continue.dev — open framework for IDE assistants.
What makes coding hard for LLMs
- Long context: codebases are huge.
- Strict syntax: typos break everything.
- Side effects: code does things; mistakes cause damage.
- Versions: APIs change; outdated training is a hazard.
- Project conventions: style, structure, architecture.
- Cross-file dependencies: changing one file affects many.
- Tooling integration: linters, type checkers, tests.
Architectural patterns
Repository indexing
For “answer questions about my codebase,” index it:
- Chunk by function/class/file.
- Embed each chunk with a code-tuned embedder (voyage-code-3, nomic-embed-code, CodeRankEmbed).
- Hybrid retrieval (semantic + keyword) — code search benefits from BM25.
- Optionally graph-based: index the call graph, type graph, import graph.
Tools: tree-sitter for parsing, dedicated indexers (Sourcegraph, Bloop, Cursor’s indexer).
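A minimal sketch of the chunk-and-score half of this pipeline, using Python's stdlib ast module as a lightweight stand-in for a tree-sitter chunker; the semantic and keyword scores are assumed to come from your embedder and BM25 index, and the blend weight is illustrative.

```python
# Chunk a Python file by top-level function/class (stand-in for tree-sitter),
# and blend semantic + keyword scores for hybrid retrieval.
import ast

def chunk_python_file(path: str) -> list[dict]:
    source = open(path).read()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "name": node.name,
                "lines": (node.lineno, node.end_lineno),
                "text": ast.get_source_segment(source, node),
            })
    return chunks

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.5) -> float:
    # Weighted blend of embedding similarity and BM25; tune alpha per corpus.
    return alpha * semantic + (1 - alpha) * keyword
```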
Context selection
The agent must decide what to put in context:
- Files mentioned by user.
- Files retrieved via semantic search.
- Files in the same directory.
- Files imported by the target file.
- Recent edits.
- Failing tests / errors.
Smart context selection is what separates Copilot-feeling tools from “really gets my codebase” tools.
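One way to make that selection concrete: rank candidate sources by priority and fill a token budget. A sketch, assuming a crude characters-per-token estimate and hand-tuned priorities (swap in your own tokenizer and weights):

```python
# Budget-aware context assembly. Each candidate is {"source": ..., "text": ...};
# the priority ordering is illustrative, not a fixed recipe.
def assemble_context(candidates: list[dict], budget_tokens: int = 32_000) -> list[dict]:
    priority = {"errors": 0, "mentioned": 1, "recent_edits": 2,
                "imports": 3, "retrieved": 4, "same_dir": 5}
    selected, used = [], 0
    for c in sorted(candidates, key=lambda c: priority.get(c["source"], 9)):
        est = len(c["text"]) // 4          # rough: ~4 characters per token
        if used + est <= budget_tokens:
            selected.append(c)
            used += est
    return selected
```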
Tool-driven editing
For multi-file edits, agentic tools use a small set of strong tools:
- read_file(path, line_range)
- write_file(path, content)
- edit_file(path, old_string, new_string)
- glob(pattern), grep(pattern, path)
- run_shell(cmd)
- run_tests(path)
Claude Code’s tool set is roughly this. It’s small and composable.
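A sketch of three of these tools implemented directly — not any vendor's actual implementation. Requiring old_string to match exactly once is a common guard against ambiguous edits.

```python
# Minimal file/shell tools an agent loop can call. Illustrative only.
import pathlib
import subprocess

def read_file(path: str, line_range: tuple[int, int] | None = None) -> str:
    lines = pathlib.Path(path).read_text().splitlines(keepends=True)
    if line_range:
        start, end = line_range
        lines = lines[start - 1:end]       # 1-indexed, inclusive
    return "".join(lines)

def edit_file(path: str, old_string: str, new_string: str) -> None:
    p = pathlib.Path(path)
    text = p.read_text()
    if text.count(old_string) != 1:
        raise ValueError("old_string must match exactly once")
    p.write_text(text.replace(old_string, new_string, 1))

def run_shell(cmd: str, timeout: int = 120) -> tuple[int, str]:
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout + proc.stderr
```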
Verify with tools
The hallmark of strong code agents:
- Edit code → run tests / type-check / linter → see errors → fix → repeat.
This is the verifier-loop pattern from Stage 11. Code is uniquely well-suited to it because compilation and tests are cheap to run.
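In code the loop can be as small as the sketch below, reusing the tools above; ask_model() is a hypothetical call that turns a test failure into an old_string/new_string edit.

```python
# Edit -> test -> fix loop with a bounded number of iterations.
def fix_until_green(test_cmd: str, target_file: str, max_iters: int = 5) -> bool:
    for _ in range(max_iters):
        code, output = run_shell(test_cmd)
        if code == 0:
            return True                         # tests pass
        edit = ask_model(                       # hypothetical: model proposes a fix
            file_text=read_file(target_file),
            test_output=output,
        )
        edit_file(target_file, edit["old_string"], edit["new_string"])
    return False                                # give up and escalate to the user
```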
In-IDE patterns
Inline completion
Pattern: every keystroke triggers a quick completion call. Models tuned for this:
- Tab-Tab models: sub-second TTFT, multi-line.
- Fill-in-the-middle (FIM): the model fills a gap given prefix and suffix.
Code-specialized models (DeepSeek-Coder, Qwen-Coder, Codestral) have FIM tokens trained in.
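The prompt construction itself is mechanical; the exact sentinel strings differ per model family, so the ones below are placeholders — check the model card before use.

```python
# Build a fill-in-the-middle prompt from the text before/after the cursor.
def build_fim_prompt(prefix: str, suffix: str,
                     pre: str = "<fim_prefix>",
                     suf: str = "<fim_suffix>",
                     mid: str = "<fim_middle>") -> str:
    return f"{pre}{prefix}{suf}{suffix}{mid}"   # model generates the middle
```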
Chat with context
User selects code, asks a question. The IDE injects relevant context (selection + open files + retrieved chunks).
Apply edits
Cursor’s “apply” pattern: chat suggests an edit; user clicks “apply”; the IDE merges the diff into the file. Cleaner than asking the user to copy-paste.
Diffs as the UX
Show every model edit as a diff. Let the user accept/reject per-hunk. The user stays in control; the agent contributes.
Agentic coding patterns
Plan-execute-verify
1. Understand the task (read relevant files)
2. Plan the approach (sometimes explicitly, sometimes implicitly)
3. Make edits
4. Run tests / type-checker
5. If failures: debug; go to 3
6. If success: summarize for user
Claude Code, Cursor’s agent mode, Devin all roughly follow this.
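A sketch of the outer loop (distinct from the inner test-fix loop above); plan_task(), execute_step(), and summarize() are hypothetical model calls, and pytest is assumed as the project's test command.

```python
# Plan -> execute -> verify with a bounded number of repair rounds.
def run_coding_task(task: str, max_rounds: int = 3) -> str:
    steps = plan_task(task)                              # 1-2: read code, produce a plan
    for step in steps:
        execute_step(step)                               # 3: make edits via tools
    for _ in range(max_rounds):
        code, output = run_shell("pytest -q")            # 4: run tests (assumed command)
        if code == 0:
            break
        execute_step(f"Fix these failures:\n{output}")   # 5: debug, repeat
    return summarize(task, steps)                        # 6: report back to the user
```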
Sub-agents
For big tasks, spawn sub-agents:
- A “spec” sub-agent reviews the request and produces a plan.
- A “research” sub-agent reads relevant code.
- An “implement” sub-agent does the editing.
- A “review” sub-agent checks the result.
Often the same model with different system prompts and tools.
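A sketch of that idea: one table of system prompts and tool subsets, one hypothetical call_model() wrapper around your provider's chat API.

```python
# Sub-agents = same model, different system prompt and tool subset. Illustrative.
SUBAGENTS = {
    "spec":      {"system": "Turn the request into a concrete plan.",        "tools": []},
    "research":  {"system": "Read code and report what is relevant.",        "tools": ["read_file", "grep", "glob"]},
    "implement": {"system": "Apply the plan with minimal, focused edits.",   "tools": ["read_file", "edit_file", "run_tests"]},
    "review":    {"system": "Check the diff for bugs and style violations.", "tools": ["read_file", "run_tests"]},
}

def run_subagent(role: str, task: str) -> str:
    cfg = SUBAGENTS[role]
    return call_model(system=cfg["system"], user=task, tools=cfg["tools"])  # hypothetical wrapper
```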
Memory across runs
Project-specific memory:
- CLAUDE.md / .cursorrules / .aider.conf — persistent project context.
- “Lessons” from prior sessions: things the agent should remember.
By 2026, this pattern is standard.
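Loading that memory is usually just file concatenation into the system prompt; a sketch, treating the file names above as conventions:

```python
# Prepend any project memory files found at the repo root to the system prompt.
import pathlib

MEMORY_FILES = ["CLAUDE.md", ".cursorrules"]

def load_project_memory(repo_root: str) -> str:
    parts = []
    for name in MEMORY_FILES:
        p = pathlib.Path(repo_root) / name
        if p.exists():
            parts.append(f"# Project memory from {name}\n{p.read_text()}")
    return "\n\n".join(parts)
```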
Eval
How do you know your code agent is good?
Public benchmarks
- HumanEval, MBPP: single-function code generation.
- BigCodeBench: harder single-function.
- SWE-bench Verified: real-world GitHub issues from major Python projects.
- LiveCodeBench: contemporary competitive programming.
- CodeContests (DeepMind): competitive programming.
SWE-bench is the closest to “real-world coding.” Top frontier agents score 60–80% on the Verified subset by 2026 — a dramatic improvement over 2024’s <20%.
Internal evals
- Replay real issues from your repo; see if the agent solves them.
- Inject bugs into known-good code; see if agent fixes.
- Measure time to first useful suggestion, time to test pass, and PR acceptance rate.
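A sketch of the replay-style eval: re-run the agent on historical issues in a fresh checkout and count how many end with passing tests. run_agent() is a hypothetical entry point into your agent.

```python
# Replay harness: clone at the pre-fix commit, let the agent work, run tests.
import subprocess
import tempfile

def replay_issue(repo_url: str, base_commit: str, issue_text: str, test_cmd: str) -> bool:
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)
        run_agent(task=issue_text, cwd=workdir)          # hypothetical agent entry point
        result = subprocess.run(test_cmd, shell=True, cwd=workdir)
        return result.returncode == 0

def pass_rate(issues: list[dict]) -> float:
    return sum(replay_issue(**i) for i in issues) / len(issues)
```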
Cost & latency
Agentic coding is expensive:
- 100k+ tokens of context for a serious codebase task.
- Many iterations.
- Reasoning models for planning.
A real “fix this issue” task might cost $0.50–$10 in API calls. Worth it if it saves a developer an hour; expensive at scale for trivial tasks.
Mitigations:
- Prompt caching (huge for repeated codebase context).
- Cheap router → strong agent: a cheap model decides if the task needs heavy lifting.
- Bounded tool calls per task.
- Streaming and intermediate results so the user can intervene.
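A sketch of the cheap-router mitigation; classify() stands in for a call to a small, inexpensive model, and the route labels are placeholders.

```python
# Route trivial tasks to a small model; everything else to the expensive agent.
def route(task: str) -> str:
    label = classify(  # hypothetical call to a cheap classifier model
        f"Classify this coding task as 'trivial' or 'heavy':\n{task}"
    )
    return "small-model" if label.strip().lower() == "trivial" else "frontier-agent"
```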
Safety
Code agents have write access to your filesystem and shell. Treat that access carefully:
- Sandbox where possible: containers, VMs, ephemeral workspaces.
- Confirmation for destructive shell commands.
- Read/write scope: limit which directories the agent can touch.
- Don’t run untrusted code outside a sandbox.
- Auditable logs of every tool call.
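A sketch of the confirmation gate for shell commands; the allow/deny patterns are illustrative rather than a complete policy, and confirm is whatever prompt your UI exposes.

```python
# Gate shell commands: deny the obviously destructive, auto-allow read-only,
# ask the user about everything else.
import re

ALLOWLIST = [r"^git (status|diff|log)\b", r"^pytest\b", r"^ls\b", r"^cat\b"]
DENYLIST  = [r"\brm\s+-rf\b", r"\bgit\s+push\s+--force\b", r"\bcurl\b.*\|\s*sh\b"]

def gate_shell_command(cmd: str, confirm) -> bool:
    if any(re.search(p, cmd) for p in DENYLIST):
        return False                                        # never run automatically
    if any(re.match(p, cmd) for p in ALLOWLIST):
        return True                                         # read-only, safe
    return confirm(f"Agent wants to run: {cmd!r}. Allow?")  # ask the user
```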
Model selection
For coding tasks specifically:
- Frontier all-rounders: Claude Sonnet 4.6/4.7, GPT-5, Gemini 3 — strong at code.
- Code-specialized open: Qwen2.5-Coder, DeepSeek-Coder-V3, Codestral.
- Reasoning for hard bugs: extended thinking / o-series models.
- Cheap for autocomplete: smaller code-tuned models.
Most production tools use a frontier model for serious tasks and a smaller one for inline suggestions.
Pitfalls
- Stale model knowledge: outdated APIs in suggestions. Mitigate with docs in context.
- Confidently wrong code: especially in unfamiliar libraries. Always run tests.
- Subtle security bugs: SQL injection, XSS, command injection — the code looks fine but isn’t. Use security-aware static analysis.
- Over-eager refactoring: agent rewrites things unnecessarily. Constrain in system prompt.
- Forgetting tests: agent fixes feature, breaks tests. Always run them.
- Loops: agent edits same file repeatedly. Detect and bail.
Patterns from Claude Code (and similar)
The architecture of mature code agents looks like:
- A small set of well-designed tools (read, write, edit, search, shell, test).
- A strong system prompt with project context.
- Reasoning model with extended thinking for hard tasks.
- Subagents for research / parallel exploration.
- Persistent memory via project files.
- Streaming UI showing every step.
- Granular permissions / confirmations for risky ops.
- Strict observability (every command logged).
You can build something close to this from raw API calls. Frameworks help with packaging.
See also
- Stage 11 — Agents
- Stage 09 — RAG — code-RAG variant
- Stage 13 — Production