Case Studies
Real-world AI products and what’s reusable from them. The point isn’t to learn how Cursor specifically works; it’s to see patterns repeated across products and lift them to your domain.
Case study: Claude Code (Anthropic)
A coding agent that runs in the terminal and IDE.
Architecture (publicly disclosed):
- A small, well-designed tool set: Read, Write, Edit, Glob, Grep, Bash, plus task-specific tools.
- Frontier reasoning model with extended thinking.
- Subagents spawned for parallel research / scoped sub-tasks.
- Persistent project memory via CLAUDE.md files.
- Permission gates on every potentially destructive action (see the sketch after this list).
- Streaming UI with incremental visibility.
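A minimal sketch of the permission-gate idea, assuming a plain dict of tool functions; the tool names and confirmation flow are illustrative, not Claude Code's actual implementation:

```python
# Hypothetical permission gate: destructive tools need explicit approval
# before they run; read-only tools pass straight through.
DESTRUCTIVE_TOOLS = {"write_file", "edit_file", "run_bash"}

def call_tool(name: str, args: dict, registry: dict):
    if name in DESTRUCTIVE_TOOLS:
        answer = input(f"Allow {name}({args})? [y/N] ").strip().lower()
        if answer != "y":
            return {"error": f"user denied {name}"}
    return registry[name](**args)
```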
Lessons:
- A small set of strong tools beats many narrow ones.
- Persistent context (CLAUDE.md) is huge for project work.
- Real-time visibility into agent actions builds trust.
- Permission gates make destructive tools safe to expose.
- Subagents enable parallelism without over-complicating one big context.
Transferable: any domain where a single agent operates over a structured workspace.
Case study: Cursor (and Windsurf)
IDE-replacement editors with AI throughout.
Architecture:
- Custom inline completion model (small, fast, fill-in-the-middle).
- Indexed codebase with hybrid retrieval.
- Chat UI with selectable context.
- Apply UX: chat suggests, user clicks “Apply,” IDE merges.
- Agent mode for multi-file changes.
Lessons:
- Two model classes: tiny + fast for autocomplete, frontier + slow for chat (sketched after this list).
- Apply UX (review-and-accept) keeps the human in control.
- Indexing the codebase upfront unlocks rich context retrieval.
- Conventions files (.cursorrules) preserve project context.
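A minimal sketch of the two-model split, with a stubbed llm_call helper; the model names are placeholders, not Cursor's actual stack:

```python
# Hypothetical two-tier routing: a small, fast model handles inline
# completions; a frontier model handles chat and multi-file edits.
def llm_call(model: str, prompt: str, max_tokens: int) -> str:
    return f"[{model} output]"  # stub: wire up a real provider client here

def route(kind: str, prompt: str) -> str:
    if kind == "completion":    # tight latency budget, low cost per call
        return llm_call("small-fim-model", prompt, max_tokens=64)
    return llm_call("frontier-model", prompt, max_tokens=2048)
```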
Transferable: any “AI in a power-user app” — Notion AI, Figma AI, Excel Copilot.
Case study: Perplexity
AI-powered search with citations.
Architecture:
- Real-time web search.
- Aggressive multi-document retrieval.
- LLM synthesizes with strict citation requirements.
- Follow-up suggestions based on conversation history.
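One way to enforce the citation requirement is a post-generation check; a minimal sketch, assuming answers cite retrieved sources as [1], [2], and so on:

```python
import re

# Reject (or regenerate) any answer whose [n] markers don't map to a
# retrieved source. The bracket format is an assumption, not Perplexity's.
def citations_valid(answer: str, sources: list[str]) -> bool:
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= n <= len(sources) for n in cited)
```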
Lessons:
- Citation discipline builds trust.
- Real-time search complements the model’s static knowledge.
- Domain-specific search modes (Academic, Math, Travel) outperform general for those queries.
- Caching repeated queries is critical for cost.
Transferable: any RAG product where source attribution matters (legal, medical, news).
Case study: ChatGPT browsing / search
OpenAI’s web-augmented chat.
Architecture:
- Search tool integrated into the model’s tool-calling.
- Multiple retrieval rounds for complex questions.
- Synthesis with citations.
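A minimal sketch of the tool-driven loop: the model decides whether to search again or answer. llm_step and web_search are hypothetical stand-ins for a tool-calling model client and a search API:

```python
def web_search(query: str) -> list[str]:
    return [f"stub result for: {query}"]  # replace with a real search API

def llm_step(question: str, context: list[str], force_answer: bool = False) -> dict:
    # stub: a real tool-calling model returns either a search action or an answer
    return {"action": "answer", "text": f"answer using {len(context)} sources"}

def answer(question: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_rounds):
        step = llm_step(question, context)
        if step["action"] == "search":
            context.extend(web_search(step["query"]))  # another retrieval round
        else:
            return step["text"]
    return llm_step(question, context, force_answer=True)["text"]
```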
Lessons:
- Tool-driven RAG (model decides when to search) often beats always-RAG.
- Multi-round retrieval enables multi-hop questions.
- Search quality is a separate problem from generation quality.
Transferable: any “agent that gathers info before answering.”
Case study: Notion AI / Coda AI / Confluence AI
Document-grounded AI in productivity tools.
Architecture:
- Per-document and per-workspace RAG.
- Permissions inherited from the underlying doc system (see the filtering sketch after this list).
- In-context generation (rewrite this, summarize that, brainstorm).
- Page-level chat with workspace search.
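A minimal sketch of permission-aware retrieval: filter candidates by the caller's access rights before anything reaches the prompt. The Chunk shape and user_can_read check are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def user_can_read(user_id: str, doc_id: str) -> bool:
    return True  # hypothetical: delegate to the workspace's real ACL system

def filter_by_permission(user_id: str, candidates: list[Chunk]) -> list[Chunk]:
    # drop anything the caller may not read *before* it can leak into a prompt
    return [c for c in candidates if user_can_read(user_id, c.doc_id)]
```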
Lessons:
- Permissions integration is the hardest part — easy to leak data across orgs.
- “Rewrite” / “Summarize” / “Brainstorm” buttons drive adoption faster than chat.
- Workspace-wide search must be careful with sensitive content (HR docs visible to all).
Transferable: any vertical SaaS adding AI features.
Case study: Devin / Replit Agent / Aider
Autonomous coding agents — “give a spec, get a PR.”
Architecture:
- Plan-and-execute loops.
- Sandboxed execution environment.
- Test-driven verification (run tests; iterate on failures).
- Long-running asynchronous tasks (sometimes hours).
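A minimal sketch of the test-driven loop, assuming a pytest suite and a hypothetical propose_fix call into a coding model:

```python
import subprocess

def propose_fix(failure_log: str) -> None:
    pass  # hypothetical: send the failure log to a coding model, apply its edits

def verify_and_fix(max_attempts: int = 5) -> bool:
    for _ in range(max_attempts):
        result = subprocess.run(["pytest", "-x"], capture_output=True, text=True)
        if result.returncode == 0:
            return True                             # tests green, stop
        propose_fix(result.stdout + result.stderr)  # feed failures back, retry
    return False
```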
Lessons:
- Verifier-loop pattern (test → fix → test) is critical.
- Async + checkpoints required for tasks taking >5 minutes.
- Surfacing intermediate progress builds user trust.
- Strict sandboxing — agents can write any code, including bad code.
Transferable: any “give it a goal, let it run” agent.
Case study: HeyGen / Synthesia / D-ID
AI video avatars — generate spokesperson video from text.
Architecture:
- TTS (often ElevenLabs or in-house).
- Lip-sync model (Wav2Lip-style or proprietary).
- Background and visual generation.
- Studio-style composition pipeline.
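A minimal sketch of the staged pipeline (script → audio → lip-sync → compose); every stage function here is a hypothetical placeholder for a real model or service call:

```python
def synthesize_speech(script: str, voice_id: str) -> bytes:
    return b""  # hypothetical TTS stage

def lip_sync(avatar_id: str, audio: bytes) -> bytes:
    return b""  # hypothetical lip-sync stage

def compose(face_video: bytes, audio: bytes, background: str) -> bytes:
    return b""  # hypothetical compositing stage

def generate_video(script: str, voice_id: str, avatar_id: str) -> bytes:
    audio = synthesize_speech(script, voice_id)
    face = lip_sync(avatar_id, audio)
    return compose(face, audio, background="studio")
```

Keeping the stages separate makes each one testable and swappable, which is part of why staged pipelines beat end-to-end generation here.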
Lessons:
- Multi-stage generation (script → audio → lip-sync → video) beats end-to-end.
- Voice cloning + face cloning enables personalization.
- Quality bars are high — uncanny valley is ruthless.
- Ethical guards (consent, watermarking) are non-optional.
Transferable: any multi-modal generation pipeline.
Case study: Suno / Udio
AI music generation.
Architecture:
- Diffusion or autoregressive on audio tokens.
- Conditioning: text + lyrics + style.
- Web UX with one-click generation.
Lessons:
- Constraining the problem (3-minute songs) hides what is still infeasible at larger scale (10-minute orchestral pieces).
- UX matters: one-click generation drove adoption faster than power-user tools.
- Licensing / training data is a major commercial risk.
Transferable: any “creative one-click generation” product.
Case study: Hebbia / Harvey
Vertical AI for finance and legal analysts.
Architecture:
- Multi-document RAG with strict citation requirements.
- Multi-agent orchestration (research → extract → reason → write; sketched after this list).
- Domain-specific evaluations.
- Strict permissions and data handling for confidential clients.
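A minimal sketch of the staged orchestration, keeping an audit trail of each stage's output; run_agent is a hypothetical helper that runs one scoped agent with its own prompt and tools:

```python
def run_agent(role: str, question: str, material) -> str:
    return f"[{role} output]"  # hypothetical: call one scoped agent here

def analyze(question: str, documents: list[str]) -> tuple[str, list[dict]]:
    audit: list[dict] = []
    material = documents
    for role in ["research", "extract", "reason", "write"]:
        material = run_agent(role, question, material)
        audit.append({"stage": role, "output": material})  # audit trail per stage
    return material, audit
```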
Lessons:
- Vertical depth beats horizontal capability for high-stakes analyst work.
- Citation and audit trails are core, not features.
- Multi-agent for complex multi-step tasks is worth the complexity.
- Custom evals on domain data are critical for trust.
Transferable: any high-stakes vertical (medical, regulatory).
Case study: Real-time voice agents (Vapi, Retell, OpenAI Realtime)
Voice-in-voice-out AI assistants.
Architecture:
- Streaming ASR.
- LLM with low-latency optimization.
- Streaming TTS (or end-to-end voice models).
- Interruption handling.
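A minimal sketch of barge-in handling: cancel TTS playback the moment the caller starts speaking again. stream_tts and user_started_speaking are hypothetical placeholders for the real streaming TTS and ASR/VAD hooks:

```python
import asyncio

async def stream_tts(text: str) -> None:
    await asyncio.sleep(len(text) * 0.05)  # stub: stream audio to the caller

async def user_started_speaking() -> bool:
    return False                           # stub: poll the ASR / VAD stream

async def speak_interruptibly(text: str) -> None:
    playback = asyncio.create_task(stream_tts(text))
    while not playback.done():
        if await user_started_speaking():
            playback.cancel()              # stop talking immediately
            break
        await asyncio.sleep(0.02)          # check roughly every 20ms
```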
Lessons:
- End-to-end latency p95 < 700ms is the bar for “natural feeling.”
- Native voice models (GPT-4o realtime) reduce latency vs ASR+LLM+TTS pipelines.
- Tool use mid-conversation is hard but increasingly available.
- Common use cases: customer support, scheduling, lead qualification.
Transferable: any conversational voice product.
Recurring patterns across all of these
What you’ll see again and again:
- A small set of strong tools beats many narrow ones.
- RAG with strict citation builds trust.
- Verifier loops (tests, schemas, judges) are the difference between demo and production.
- Streaming and intermediate visibility make agents feel responsive.
- Permissions + audit are foundational, not afterthoughts.
- Two-tier model routing saves cost without quality loss.
- Persistent project context (CLAUDE.md-style files) is huge for repeat use.
- Real-data evals beat synthetic / public benchmarks for product quality.
- Failure-mode-driven iteration: each new failure → eval case → fix.
- Subagents and decomposition for tasks too big for one context.
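A minimal sketch of failure-mode-driven iteration: capture each production failure as a regression eval case before fixing it. The file path and record shape are illustrative:

```python
import json
import pathlib

def record_failure(case_id: str, prompt: str, bad_output: str, expected: str,
                   path: str = "evals/regressions.jsonl") -> None:
    record = {"id": case_id, "input": prompt, "bad_output": bad_output,
              "expected": expected}
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a") as f:
        f.write(json.dumps(record) + "\n")  # the fix is then verified against this set
```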
Anti-patterns that recur
What you’ll see repeatedly going wrong:
- Shipping without evals.
- Expensive models and prompts for trivial tasks.
- Long, unfocused system prompts.
- Tool sprawl (50+ tools).
- No observability.
- No permission gates on destructive ops.
- Confident wrong answers without grounding.
- Forgetting cost monitoring.
- Treating LLM output as ground truth.
If you avoid this list, you’re already ahead.
Reading list (real product blogs)
For any case study above, search for:
- “How we built X at [company]”
- Engineering blog posts (Anthropic, OpenAI, Mistral, Notion, Cursor, Hebbia all have good ones).
- Conference talks (RecSys, NeurIPS workshops, MLOps World).
Real production lessons > academic benchmarks for shipping products.
Your own case study
When your product ships, write up:
- What worked.
- What surprised you.
- What you’d do differently.
- What the eval set looked like.
You’ll teach others, and reinforce your own learning.