Case Studies
Real-world AI products and what’s reusable from them. The point isn’t to learn how Cursor specifically works; it’s to see patterns repeated across products and lift them to your domain.
Case study: Claude Code (Anthropic)
A coding agent that runs in the terminal and IDE.
Architecture (publicly disclosed):
- A small, well-designed tool set: Read, Write, Edit, Glob, Grep, Bash, plus task-specific tools.
- Frontier reasoning model with extended thinking.
- Subagents spawned for parallel research / scoped sub-tasks.
- Persistent project memory via CLAUDE.md files.
- Permission gates on every potentially destructive action (see the sketch after this list).
- Streaming UI with incremental visibility.
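A minimal sketch of the permission-gate idea, assuming a plain dict of tool functions; the tool names and confirmation flow are illustrative, not Claude Code's actual implementation:

```python
# Hypothetical permission gate: destructive tools need explicit approval
# before they run; read-only tools pass straight through.
DESTRUCTIVE_TOOLS = {"write_file", "edit_file", "run_bash"}

def call_tool(name: str, args: dict, registry: dict):
    if name in DESTRUCTIVE_TOOLS:
        answer = input(f"Allow {name}({args})? [y/N] ").strip().lower()
        if answer != "y":
            return {"error": f"user denied {name}"}
    return registry[name](**args)
```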
Lessons:
- A small set of strong tools beats many narrow ones.
- Persistent context (CLAUDE.md) is huge for project work.
- Real-time visibility into agent actions builds trust.
- Permission gates make destructive tools safe to expose.
- Subagents enable parallelism without over-complicating one big context.
Transferable: any domain where a single agent operates over a structured workspace.
Case study: Cursor (and Windsurf)
IDE-replacement editors with AI throughout.
Architecture:
- Custom inline completion model (small, fast, fill-in-the-middle).
- Indexed codebase with hybrid retrieval.
- Chat UI with selectable context.
- Apply UX: chat suggests, user clicks “Apply,” IDE merges.
- Agent mode for multi-file changes.
Lessons:
- Two model classes: tiny + fast for autocomplete, frontier + slow for chat (sketched after this list).
- Apply UX (review-and-accept) keeps the human in control.
- Indexing the codebase upfront unlocks rich context retrieval.
- Conventions files (.cursorrules) preserve project context.
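A minimal sketch of the two-model split, with a stubbed llm_call helper; the model names are placeholders, not Cursor's actual stack:

```python
# Hypothetical two-tier routing: a small, fast model handles inline
# completions; a frontier model handles chat and multi-file edits.
def llm_call(model: str, prompt: str, max_tokens: int) -> str:
    return f"[{model} output]"  # stub: wire up a real provider client here

def route(kind: str, prompt: str) -> str:
    if kind == "completion":    # tight latency budget, low cost per call
        return llm_call("small-fim-model", prompt, max_tokens=64)
    return llm_call("frontier-model", prompt, max_tokens=2048)
```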
Transferable: any “AI in a power-user app” — Notion AI, Figma AI, Excel Copilot.
Case study: Perplexity
AI-powered search with citations.
Architecture:
- Real-time web search.
- Aggressive multi-document retrieval.
- LLM synthesizes with strict citation requirements.
- Follow-up suggestions based on conversation history.
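One way to enforce the citation requirement is a post-generation check; a minimal sketch, assuming answers cite retrieved sources as [1], [2], and so on:

```python
import re

# Reject (or regenerate) any answer whose [n] markers don't map to a
# retrieved source. The bracket format is an assumption, not Perplexity's.
def citations_valid(answer: str, sources: list[str]) -> bool:
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= n <= len(sources) for n in cited)
```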
Lessons:
- Citation discipline builds trust.
- Real-time search complements the model’s static knowledge.
- Domain-specific search modes (Academic, Math, Travel) outperform general for those queries.
- Caching repeated queries is critical for cost.
Transferable: any RAG product where source attribution matters (legal, medical, news).
Case study: ChatGPT browsing / search
OpenAI’s web-augmented chat.
Architecture:
- Search tool integrated into the model’s tool-calling.
- Multiple retrieval rounds for complex questions.
- Synthesis with citations.
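A minimal sketch of the tool-driven loop: the model decides whether to search again or answer. llm_step and web_search are hypothetical stand-ins for a tool-calling model client and a search API:

```python
def web_search(query: str) -> list[str]:
    return [f"stub result for: {query}"]  # replace with a real search API

def llm_step(question: str, context: list[str], force_answer: bool = False) -> dict:
    # stub: a real tool-calling model returns either a search action or an answer
    return {"action": "answer", "text": f"answer using {len(context)} sources"}

def answer(question: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_rounds):
        step = llm_step(question, context)
        if step["action"] == "search":
            context.extend(web_search(step["query"]))  # another retrieval round
        else:
            return step["text"]
    return llm_step(question, context, force_answer=True)["text"]
```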
Lessons:
- Tool-driven RAG (model decides when to search) often beats always-RAG.
- Multi-round retrieval enables multi-hop questions.
- Search quality is a separate problem from generation quality.
Transferable: any “agent that gathers info before answering.”
Case study: Notion AI / Coda AI / Confluence AI
Document-grounded AI in productivity tools.
Architecture:
- Per-document and per-workspace RAG.
- Permissions inherited from the underlying doc system (see the filtering sketch after this list).
- In-context generation (rewrite this, summarize that, brainstorm).
- Page-level chat with workspace search.
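A minimal sketch of permission-aware retrieval: filter candidates by the caller's access rights before anything reaches the prompt. The Chunk shape and user_can_read check are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def user_can_read(user_id: str, doc_id: str) -> bool:
    return True  # hypothetical: delegate to the workspace's real ACL system

def filter_by_permission(user_id: str, candidates: list[Chunk]) -> list[Chunk]:
    # drop anything the caller may not read *before* it can leak into a prompt
    return [c for c in candidates if user_can_read(user_id, c.doc_id)]
```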
Lessons:
- Permissions integration is the hardest part — easy to leak data across orgs.
- “Rewrite” / “Summarize” / “Brainstorm” buttons drive adoption faster than chat.
- Workspace-wide search must be careful with sensitive content (HR docs visible to all).
Transferable: any vertical SaaS adding AI features.
Case study: Devin / Replit Agent / Aider
Autonomous coding agents — “give a spec, get a PR.”
Architecture:
- Plan-and-execute loops.
- Sandboxed execution environment.
- Test-driven verification (run tests; iterate on failures).
- Long-running asynchronous tasks (sometimes hours).
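A minimal sketch of the test-driven loop, assuming a pytest suite and a hypothetical propose_fix call into a coding model:

```python
import subprocess

def propose_fix(failure_log: str) -> None:
    pass  # hypothetical: send the failure log to a coding model, apply its edits

def verify_and_fix(max_attempts: int = 5) -> bool:
    for _ in range(max_attempts):
        result = subprocess.run(["pytest", "-x"], capture_output=True, text=True)
        if result.returncode == 0:
            return True                             # tests green, stop
        propose_fix(result.stdout + result.stderr)  # feed failures back, retry
    return False
```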
Lessons:
- Verifier-loop pattern (test → fix → test) is critical.
- Async + checkpoints required for tasks taking >5 minutes.
- Surfacing intermediate progress builds user trust.
- Strict sandboxing — agents can write any code, including bad code.
Transferable: any “give it a goal, let it run” agent.
Case study: HeyGen / Synthesia / D-ID
AI video avatars — generate spokesperson video from text.
Architecture:
- TTS (often ElevenLabs or in-house).
- Lip-sync model (Wav2Lip-style or proprietary).
- Background and visual generation.
- Studio-style composition pipeline.
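A minimal sketch of the staged pipeline (script → audio → lip-sync → compose); every stage function here is a hypothetical placeholder for a real model or service call:

```python
def synthesize_speech(script: str, voice_id: str) -> bytes:
    return b""  # hypothetical TTS stage

def lip_sync(avatar_id: str, audio: bytes) -> bytes:
    return b""  # hypothetical lip-sync stage

def compose(face_video: bytes, audio: bytes, background: str) -> bytes:
    return b""  # hypothetical compositing stage

def generate_video(script: str, voice_id: str, avatar_id: str) -> bytes:
    audio = synthesize_speech(script, voice_id)
    face = lip_sync(avatar_id, audio)
    return compose(face, audio, background="studio")
```

Keeping the stages separate makes each one testable and swappable, which is part of why staged pipelines beat end-to-end generation here.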
Lessons:
- Multi-stage generation (script → audio → lip-sync → video) beats end-to-end.
- Voice cloning + face cloning enables personalization.
- Quality bars are high — uncanny valley is ruthless.
- Ethical guards (consent, watermarking) are non-optional.
Transferable: any multi-modal generation pipeline.
Case study: Suno / Udio
AI music generation.
Architecture:
- Diffusion or autoregressive on audio tokens.
- Conditioning: text + lyrics + style.
- Web UX with one-click generation.
Lessons:
- Constraining the problem (3-minute songs) hides what is still infeasible at larger scale (10-minute orchestral pieces).
- UX matters: one-click generation drove adoption faster than power-user tools.
- Licensing / training data is a major commercial risk.
Transferable: any “creative one-click generation” product.
Case study: Hebbia / Harvey
Vertical AI for finance and legal analysts.
Architecture:
- Multi-document RAG with strict citation requirements.
- Multi-agent orchestration (research → extract → reason → write; sketched after this list).
- Domain-specific evaluations.
- Strict permissions and data handling for confidential clients.
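A minimal sketch of the staged orchestration, keeping an audit trail of each stage's output; run_agent is a hypothetical helper that runs one scoped agent with its own prompt and tools:

```python
def run_agent(role: str, question: str, material) -> str:
    return f"[{role} output]"  # hypothetical: call one scoped agent here

def analyze(question: str, documents: list[str]) -> tuple[str, list[dict]]:
    audit: list[dict] = []
    material = documents
    for role in ["research", "extract", "reason", "write"]:
        material = run_agent(role, question, material)
        audit.append({"stage": role, "output": material})  # audit trail per stage
    return material, audit
```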
Lessons:
- Vertical depth beats horizontal capability for high-stakes analyst work.
- Citation and audit trails are core, not features.
- Multi-agent for complex multi-step tasks is worth the complexity.
- Custom evals on domain data are critical for trust.
Transferable: any high-stakes vertical (medical, regulatory).
Case study: Real-time voice agents (Vapi, Retell, OpenAI Realtime)
Voice-in-voice-out AI assistants.
Architecture:
- Streaming ASR.
- LLM with low-latency optimization.
- Streaming TTS (or end-to-end voice models).
- Interruption handling.
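A minimal sketch of barge-in handling: cancel TTS playback the moment the caller starts speaking again. stream_tts and user_started_speaking are hypothetical placeholders for the real streaming TTS and ASR/VAD hooks:

```python
import asyncio

async def stream_tts(text: str) -> None:
    await asyncio.sleep(len(text) * 0.05)  # stub: stream audio to the caller

async def user_started_speaking() -> bool:
    return False                           # stub: poll the ASR / VAD stream

async def speak_interruptibly(text: str) -> None:
    playback = asyncio.create_task(stream_tts(text))
    while not playback.done():
        if await user_started_speaking():
            playback.cancel()              # stop talking immediately
            break
        await asyncio.sleep(0.02)          # check roughly every 20ms
```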
Lessons:
- End-to-end latency p95 < 700ms is the bar for “natural feeling.”
- Native voice models (GPT-4o realtime) reduce latency vs ASR+LLM+TTS pipelines.
- Tool use mid-conversation is hard but increasingly available.
- Common use cases: customer support, scheduling, lead qualification.
Transferable: any conversational voice product.
Recurring patterns across all of these
What you’ll see again and again:
- A small set of strong tools beats many narrow ones.
- RAG with strict citation builds trust.
- Verifier loops (tests, schemas, judges) are the difference between demo and production.
- Streaming and intermediate visibility make agents feel responsive.
- Permissions + audit are foundational, not afterthoughts.
- Two-tier model routing saves cost without quality loss.
- Persistent project context (CLAUDE.md-style files) is huge for repeat use.
- Real-data evals beat synthetic / public benchmarks for product quality.
- Failure-mode-driven iteration: each new failure → eval case → fix.
- Subagents and decomposition for tasks too big for one context.
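A minimal sketch of failure-mode-driven iteration: capture each production failure as a regression eval case before fixing it. The file path and record shape are illustrative:

```python
import json
import pathlib

def record_failure(case_id: str, prompt: str, bad_output: str, expected: str,
                   path: str = "evals/regressions.jsonl") -> None:
    record = {"id": case_id, "input": prompt, "bad_output": bad_output,
              "expected": expected}
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a") as f:
        f.write(json.dumps(record) + "\n")  # the fix is then verified against this set
```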
Anti-patterns that recur
What you’ll see repeatedly going wrong:
- Shipping without evals.
- Expensive models and prompts for trivial tasks.
- Long, unfocused system prompts.
- Tool sprawl (50+ tools).
- No observability.
- No permission gates on destructive ops.
- Confident wrong answers without grounding.
- Forgetting cost monitoring.
- Treating LLM output as ground truth.
If you avoid this list, you’re already ahead.
Reading list (real product blogs)
For any case study above, search for:
- “How we built X at [company]”
- Engineering blog posts (Anthropic, OpenAI, Mistral, Notion, Cursor, Hebbia all have good ones).
- Conference talks (RecSys, NeurIPS workshops, MLOps World).
Real production lessons > academic benchmarks for shipping products.
Your own case study
When your product ships, write up:
- What worked.
- What surprised you.
- What you’d do differently.
- What the eval set looked like.
You’ll teach others, and reinforce your own learning.