case-studies 02 / 05 · builds on /ship/09, 10, 12, 13 · 26 min read · 1h 15m hands-on

case study 02 · composes the /ship stack

Code-review agent

Reads a PR diff, runs tests, comments inline, produces a verdict. The product /ship/09–10 wants to be.

agents · code-review · tools

The product

A GitHub App. A PR opens, the bot wakes up, takes ~90 seconds, and posts:

**review-bot** says: ⚠️ Needs work — 1 issue, 2 nits

  ❌ src/api/users.py:42 — `password` is being logged at INFO level.
     This will land in your structured-log pipeline and persist in
     ELK for 90 days. Consider redacting or moving to DEBUG.

  • src/api/users.py:18 — type hint `dict` would be clearer as
    `dict[str, str]`; matches the rest of the file.

  • tests/test_users.py:55 — this test passes but doesn't cover the
    new `force_refresh` parameter. Suggest adding a case where
    `force_refresh=True`.

Tests: 142 passed, 0 failed (ran 1m 22s). Coverage: 81% (+0.3%).

Three kinds of output:

  1. A blocking issue (the password log). Marked with ❌. The kind of thing that should fail review.
  2. A nit (type hint clarity). Marked with •. Suggestion, not blocker.
  3. A suggestion (missing test coverage). Substantive but optional.

Plus the test-run summary: pass/fail counts, runtime, coverage delta. The verdict line at the top is structured: ✅ LGTM / ⚠️ Needs work / ❌ Reject. The agent is allowed to say ✅ with no comments at all when the diff is clean.

This is the product shape behind GitHub Copilot Workspace’s review bot, Sourcery’s PR reviewer, and a half-dozen YC company demos. It’s the second-most-built LLM product after docs Q&A. The /ship agent loop handles 80% of the work; this case study is about the 20% the curriculum doesn’t directly answer.

Architecture

   GitHub PR webhook ──→ FastAPI /webhook handler  (stack/server.py from /ship/05)


                       ┌──────────────────────┐
                       │   ReviewAgent.run()  │  (Agent class from /ship/10)
                       └──────────┬───────────┘
                                  │  loop
            ┌────────────┬────────┼────────────┬────────────────┐
            ▼            ▼        ▼            ▼                ▼
       fetch_diff  list_changed_  read_  run_test_suite   leave_comment
                       files     file                    (deferred queue)


                       ┌────────────────────┐
                       │ post_review_to_PR  │  flushes the comment queue
                       └────────────────────┘

The agent has six tools: the five shown above plus finalize_review, which ends the review. None of them writes to GitHub directly; the closest thing to a write, leave_inline_comment, has a deferred-execution twist explained below. The agent loop runs until the model calls finalize_review; the wrapper then translates the verdict plus the queued comments into a single GitHub API call.

Reuses from /ship: the FastAPI handler (/ship/05), the tools registry and call_with_tools (/ship/09), the Agent class with budgets (/ship/10), the tracing wrappers (/ship/12).

New: the six tools, the comment-queue pattern, and the evaluation rig.
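
A minimal sketch of the webhook-to-agent glue, assuming the /ship/05 handler shape. The endpoint path, the PyGithub wiring, the GITHUB_TOKEN environment variable, and the default-constructed LLM are illustrative; review_pr and post_review are the functions assembled in "Putting it together" below.

# apps/code_review/webhook.py (illustrative glue, not the /ship/05 file itself)
import os

from fastapi import BackgroundTasks, FastAPI, Request
from github import Github  # PyGithub

from apps.code_review.review import post_review, review_pr
from stack.llm import LLM

app = FastAPI()
gh = Github(os.environ["GITHUB_TOKEN"])


@app.post("/webhook")
async def on_pr_event(request: Request, background: BackgroundTasks) -> dict:
    payload = await request.json()
    # Review only freshly opened PRs, and do it in the background so GitHub's
    # webhook delivery isn't left waiting out the ~90 seconds a review takes.
    if payload.get("action") == "opened" and "pull_request" in payload:
        background.add_task(
            run_review,
            payload["repository"]["full_name"],
            payload["pull_request"]["number"],
        )
    return {"ok": True}


def run_review(repo_name: str, pr_number: int) -> None:
    gh_pr = gh.get_repo(repo_name).get_pull(pr_number)
    queue = review_pr(gh_pr, LLM())  # assumes a default-constructible LLM
    post_review(gh_pr, queue)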

The hard parts

Three things the /ship curriculum doesn’t directly teach:

1. Tools that propose, never act

The naïve design gives the agent a leave_inline_comment(file, line, body) tool that hits GitHub immediately. It works on day 1. It catastrophically misbehaves on day 3 — the agent gets stuck in a loop, leaves 47 comments on a single PR, gets reverted by an angry maintainer.

The correct pattern: tools propose; the wrapper acts at the end.

# apps/code_review/tools.py
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class CommentQueue:
    """Holds proposed comments. Flushed by the wrapper after the agent runs."""
    inline: list[dict] = field(default_factory=list)
    verdict: str = "PENDING"
    summary: str = ""

    def add_inline(self, file: str, line: int, severity: str, body: str) -> dict:
        entry = {"file": file, "line": line, "severity": severity, "body": body}
        self.inline.append(entry)
        return {"ok": True, "queued_index": len(self.inline) - 1}


# Closure: the agent's tools all capture the same queue.
def make_review_tools(queue: CommentQueue, gh_pr) -> list:
    """Build the tools the agent will use. Returns a list of Tool objects."""

    def fetch_diff() -> str:
        """Fetch the unified diff for this PR.

        Args:
        """
        return gh_pr.diff()

    def list_changed_files() -> list[dict]:
        """List files changed in this PR with their sizes and statuses.

        Args:
        """
        return [{"path": f.filename, "status": f.status, "additions": f.additions,
                 "deletions": f.deletions} for f in gh_pr.get_files()]

    def read_file(path: str, ref: str = "head") -> str:
        """Read a file at the PR's head or base.

        Args:
            path: Repo-relative path.
            ref: 'head' (PR branch) or 'base' (target branch).
        """
        sha = gh_pr.head.sha if ref == "head" else gh_pr.base.sha
        return gh_pr.repo.get_contents(path, ref=sha).decoded_content.decode()

    def run_test_suite(scope: str = "full") -> dict:
        """Run pytest. Cached per-PR; subsequent calls return cached result.

        Args:
            scope: 'full' or a comma-separated list of test paths.
        """
        from apps.code_review.runner import run_pytest_cached
        return run_pytest_cached(gh_pr.head.sha, scope)

    def leave_inline_comment(
        file: str, line: int, severity: str, body: str,
    ) -> dict:
        """Queue a proposed inline comment. The wrapper posts after review.

        Args:
            file: Repo-relative path.
            line: Line number in the PR's head version.
            severity: 'blocker', 'nit', or 'suggestion'.
            body: The comment text. Be specific. Reference what's wrong and why.
        """
        return queue.add_inline(file, line, severity, body)

    def finalize_review(verdict: str, summary: str) -> dict:
        """Submit the final verdict. Call this last; it ends the review.

        Args:
            verdict: 'LGTM', 'NEEDS_WORK', or 'REJECT'.
            summary: One-sentence top-level summary for the PR comment.
        """
        queue.verdict = verdict
        queue.summary = summary
        return {"ok": True, "review_finalized": True}

    from stack.tools import tool_from_callable
    return [tool_from_callable(fn) for fn in (
        fetch_diff, list_changed_files, read_file,
        run_test_suite, leave_inline_comment, finalize_review,
    )]

Three things this design buys:

  • Reversibility. If the agent misbehaves mid-loop, the wrapper sees an unfinalized queue and posts nothing. No 47-comment incidents.
  • Coalescing. All inline comments go in one GitHub API call instead of N (faster, less rate-limit pressure).
  • A natural termination signal. finalize_review is the only way for the agent to declare itself done; if it never calls it, the wrapper times out the run and treats it as verdict=ERROR. No ambiguity about when a review ends.
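
One helper the tools file references but doesn't show is run_pytest_cached. A minimal sketch, assuming the PR's head commit is already checked out in a local worktree; the cache, timeout, and result shape are illustrative.

# apps/code_review/runner.py (a sketch; the real runner also handles checkout)
import functools
import subprocess


@functools.lru_cache(maxsize=64)
def run_pytest_cached(head_sha: str, scope: str = "full") -> dict:
    """Run pytest once per (head_sha, scope); repeat calls return the cached result.

    head_sha is only used as the cache key here; the worktree for that commit
    is assumed to be checked out already.
    """
    args = ["pytest", "-q"] if scope == "full" else ["pytest", "-q", *scope.split(",")]
    proc = subprocess.run(args, capture_output=True, text=True, timeout=600)
    return {
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        "summary": proc.stdout.splitlines()[-1] if proc.stdout else "",
    }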

2. When not to comment

The hardest behavior to teach an LLM-based reviewer: shut up when the code is fine. Out of the box, an instruction-tuned model handed a diff will always find something to comment on. It’s “helpful” in the model-trained sense — give the user something useful — but it’s dreadful in the reviewer sense, where unnecessary comments add noise and erode trust.

Three layers of mitigation, in order of impact:

(a) Prompt-level: explicit silence is acceptable.

SYSTEM_PROMPT = """\
You are a code reviewer. Your job is to flag genuine problems in this PR
and ignore stylistic preferences.

YOUR PHILOSOPHY:
- Silence is acceptable, even encouraged. A clean PR with zero comments
  is the best review you can leave.
- Every inline comment must clear a "reasonable engineer would fix this"
  bar. Nice-to-haves are not your job.
- The user is paying you to be RIGHT, not COMPREHENSIVE. A 1-comment
  review where the comment is precise beats a 5-comment review where
  4 are noise.

SEVERITY LADDER (use the right one):
- 'blocker': The change is broken, insecure, or violates a written
  guideline. Examples: tests that don't run, exposed secrets, broken
  imports, off-by-one bugs, panic/raise paths missed.
- 'suggestion': Substantive design feedback. Examples: missing test
  coverage on a new code path, poor variable naming on an exported API.
- 'nit': Cosmetic, low-impact. Examples: import ordering, type-hint
  precision. Use sparingly; a PR with only nits should usually have NO
  comments.

If the diff is clean, finalize_review(verdict='LGTM', summary='clean change')
without leaving inline comments. That is a complete review.
"""

The phrase “silence is acceptable” alone moves silent-when-clean rate from ~12% to ~58%. Models are trained to be helpful; you have to give them explicit permission to be quiet.

(b) A post-loop “second-pass” filter.

After the agent finalizes, run a separate small LLM call over the queued comments asking “would a senior engineer agree this comment is worth leaving?” Comments that score below 4 (on a 1–5 scale) get dropped. That removes roughly 20% of the comments that would otherwise have shipped, and in our sample it never discarded a comment worth keeping.

# apps/code_review/filter.py
FILTER_PROMPT = """\
You're a senior engineer reviewing whether each of these comments is
worth leaving on a PR. For each, answer: "would a senior engineer with
limited time keep this comment, or skip it?"

Score 1–5:
  5 = essential, would block on this
  4 = worth leaving, real value
  3 = borderline
  2 = noise, would skip
  1 = clearly unnecessary

Output a JSON array: [{"index": 0, "score": 4, "reason": "..."}, ...]
"""


def filter_comments(comments: list[dict], judge_llm) -> list[dict]:
    if not comments:
        return []
    rendered = "\n\n".join(
        f"[{i}] severity={c['severity']} | {c['file']}:{c['line']}\n  {c['body']}"
        for i, c in enumerate(comments)
    )
    response = judge_llm.chat(messages=[
        {"role": "system", "content": FILTER_PROMPT},
        {"role": "user", "content": rendered},
    ], temperature=0.0)
    # Parse the judge's JSON array and keep comments scoring >= 4.
    # (Assumes .chat() returns the completion text; a sturdier version would
    # validate the JSON and fail open by keeping every comment.)
    import json
    scores = {item["index"]: item["score"] for item in json.loads(response)}
    return [c for i, c in enumerate(comments) if scores.get(i, 0) >= 4]

(c) Eval the silent-when-clean rate. Add a category to your eval set: 30 PRs that have no real issues (mostly small refactors, doc tweaks, dependency bumps). Grade by binary “did the bot stay quiet.” Aim for >70%; below that, the bot is being noisy on clean PRs and users will turn it off.
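
A sketch of that eval, assuming a hand-curated list of known-clean PR objects and the review_pr entry point assembled in “Putting it together” below; grading is the binary “zero inline comments” check.

# evals/silent_on_clean.py (a sketch; clean_prs is the hand-curated 30-PR set)
from apps.code_review.review import review_pr


def silent_on_clean_rate(clean_prs: list, llm) -> float:
    """Fraction of known-clean PRs where the bot left zero inline comments."""
    quiet = sum(1 for gh_pr in clean_prs if not review_pr(gh_pr, llm).inline)
    return quiet / len(clean_prs)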

3. Evaluating “useful feedback” without a human label

Code-review feedback is open-ended. Two reviewers with the same expertise will leave different comments on the same PR. There’s no expected = "the right answer" like there is for a docs-Q&A bot. So how do you grade?

Three signals, in increasing order of cost:

Signal 1 — Test runs as ground truth. When the agent says “this PR breaks the auth flow” and run_test_suite() returns auth_test failed, that’s a strong correctness signal. Wire the agent’s verdict against the test outcome:

| Tests | Agent verdict | Conclusion |
| --- | --- | --- |
| Pass | LGTM | Likely correct |
| Pass | NEEDS_WORK | Investigate (could be valid concerns; could be noise) |
| Fail | LGTM | Wrong (missed a real issue) |
| Fail | NEEDS_WORK | Likely correct |

Compute the “verdict / test-outcome agreement rate” weekly. Below 80% means the agent’s verdicts don’t track test reality and you should investigate.
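
A sketch of that weekly check, assuming each logged run record carries the final verdict and a boolean test outcome from run_test_suite; the field names are illustrative.

# evals/agreement.py (field names are assumptions about your run log)
def verdict_test_agreement(runs: list[dict]) -> float:
    """Share of runs where the verdict tracks the test outcome:
    LGTM with passing tests, or NEEDS_WORK with failing tests."""
    agree = sum(
        1
        for r in runs
        if (r["verdict"] == "LGTM" and r["tests_passed"])
        or (r["verdict"] == "NEEDS_WORK" and not r["tests_passed"])
    )
    return agree / len(runs) if runs else 0.0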

Signal 2 — Maintainer ground-truth labels. Track which PRs the agent commented on that subsequently got force-merged with no further changes (signal: agent’s comments were ignored, presumably as noise) vs. PRs where the human reviewer made the agent’s flagged changes (signal: agent’s comments mattered).

Code-review feedback that’s actioned upstream is good. Feedback that’s ignored is bad. Capture the signal:

# Run as a daily cron over the last week's reviewed PRs.
def label_review(pr) -> str:
    """Returns 'helpful' / 'ignored' / 'unclear'."""
    bot_files = {c["file"] for c in pr.bot_review.inline_comments}
    files_changed_after_review = {
        path
        for commit in pr.commits_after_review
        for path in commit.changed_files
    }
    # Did the human change the same files the bot flagged?
    overlap = bot_files & files_changed_after_review
    if overlap and pr.merged:
        return "helpful"
    if pr.merged and not overlap:
        return "ignored"
    return "unclear"

After three weeks of data: 64% helpful, 22% ignored, 14% unclear. The 22% ignored is your signal — go read those comments and figure out the pattern (usually: the bot is over-flagging style nits or commenting on auto-generated code).

Signal 3 — A small human-graded panel. Once a quarter, have a senior engineer grade 30 random reviews on a 1–5 helpfulness scale. Compute correlation with Signal 1 and Signal 2; this is your sanity check that the cheap signals are tracking what humans care about. Drift in this correlation = your auto-eval has stopped working.
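
A sketch of that quarterly sanity check, assuming the panel scores and a per-review auto-signal (say, per-review action rate) are collected over the same 30 reviews; statistics.correlation needs Python 3.10+.

from statistics import correlation


def panel_vs_auto(panel_scores: list[float], auto_scores: list[float]) -> float:
    """Pearson correlation between human panel ratings and a cheap auto-metric.
    A quarter-over-quarter drop means the auto-eval has stopped tracking
    what humans care about."""
    return correlation(panel_scores, auto_scores)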

Putting it together

The full review entry point:

# apps/code_review/review.py
from stack.llm import LLM
from stack.tools import ToolRegistry
from stack.agent import Agent, AgentConfig
from apps.code_review.tools import CommentQueue, make_review_tools
from apps.code_review.filter import filter_comments


SYSTEM_PROMPT = """..."""   # the prompt from "When not to comment" above


def review_pr(gh_pr, llm: LLM) -> CommentQueue:
    queue = CommentQueue()
    registry = ToolRegistry()
    for t in make_review_tools(queue, gh_pr):
        registry.register(t)

    agent = Agent(
        llm, registry, SYSTEM_PROMPT,
        AgentConfig(
            max_iters=15,        # PRs need more steps than docs Q&A
            max_seconds=120,     # 2-minute hard cap
            max_tokens=20000,    # ~5x a normal agent run
            temperature=0.1,     # near-deterministic for review consistency
        ),
    )
    user_goal = (
        f"Review PR #{gh_pr.number}: {gh_pr.title}\n\n"
        f"Use the available tools to: read the diff, optionally read full "
        f"file contents, run tests, queue inline comments for genuine issues, "
        f"and call finalize_review when done."
    )
    result = agent.run(user_goal)

    # Second-pass filter; drop low-value comments.
    queue.inline = filter_comments(queue.inline, llm)
    return queue


def post_review(gh_pr, queue: CommentQueue) -> None:
    """Flush the queue: one batched API call to GitHub."""
    if queue.verdict == "PENDING":
        # Agent never finalized — likely hit budget cap.
        gh_pr.create_review(
            body="(review bot timed out; please request a re-review)",
            event="COMMENT",
        )
        return

    icon = {"LGTM": "✅", "NEEDS_WORK": "⚠️", "REJECT": "❌"}[queue.verdict]
    body_top = f"**review-bot** says: {icon} {queue.summary}"
    gh_pr.create_review(
        body=body_top,
        event="COMMENT",
        comments=[
            {"path": c["file"], "line": c["line"], "body":
                ("**" + c["severity"].upper() + ":** " if c["severity"] == "blocker" else "")
                + c["body"]}
            for c in queue.inline
        ],
    )

The shape: agent loop runs to completion, queue is filled, second-pass filter drops noise, one batched GitHub call posts everything. ~150 lines including the tools file. Most of the work is decisions, not code.

The eval results

After three weeks of running on a real internal monorepo (~80 PRs/week):

| Metric | Score |
| --- | --- |
| Verdict / test-outcome agreement | 87% |
| Action rate on comments (changed-after-review) | 71% |
| Silent-on-clean-PRs rate (per the 30-clean-PR eval set) | 78% |
| Senior-engineer panel mean (1–5) | 4.05 |
| Mean comments per non-clean review | 2.3 |

The metrics, formalized:

# action_rate — the core quality signal
action_rate = |comments_addressed_in_followup_commits| / |total_comments|
              # "addressed" = the file or near-line was changed
              # within 7 days of the comment landing

# verdict / test-outcome agreement
agreement = | runs where (verdict=LGTM AND tests=pass)
             OR (verdict=NEEDS_WORK AND tests=fail)|
            ────────────────────────────────────────
                          |runs|

# silent-on-clean rate — how often the bot stays quiet on the
# 30-curated PRs that have no real issues
silent_rate = |runs_with_zero_inline_comments| / 30

# the panel score is a 1–5 rating averaged over 30 random reviews
# graded by 3 senior engineers (not a metric the bot optimizes for —
# a sanity check that action_rate isn't being gamed)

Why action_rate is the headline metric: it’s the only one the engineering team can’t quietly hate. A bot with 71% action rate is one whose comments engineers actually fix. A bot at 30% would get disabled within a week even if the 30% catches were brilliant — review fatigue beats catch quality.

At 71%, action rate is high enough that engineers don’t disable the bot and low enough to leave room for v2. Silent-on-clean (78%) crossed the 70% bar that makes the bot tolerable. Mean 2.3 comments per non-clean review is pleasantly tight; v0 (before the second-pass filter) sat at 4.8 with a 52% action rate.

Cost: ~$0.04 per review on Llama-3.1-8B via vLLM at 4-bit quantization (the cost levers from /ship/14 are doing real work here). Roughly $30/month for the internal monorepo, scaling linearly with PR count.

What we’d change in v2

After 3 weeks live, three changes for v2:

  1. Re-review on push. v1 reviews once on PR open. Authors then push fixes; the bot doesn’t re-review. We’d add a “re-run on the latest commit if the diff has materially changed” rule. Tricky: define “materially.” Probably “any change to a file the bot previously commented on” (a predicate is sketched after this list).
  2. Repository-context retrieval. v1 only sees the PR diff. For “is this naming convention consistent with the rest of the codebase?” we’d want the bot to retrieve from the rest of the repo. This is the docs-assistant pipeline (case study 01) re-applied to code: chunk the repo, embed it, retrieve relevant context. Doubles the eval surface but unlocks a new class of feedback.
  3. Author-style preferences. Some authors prefer no nit comments at all; others want them all. A .review-bot.toml per repo lets engineers tune the bot’s verbosity to taste. Adoption depends on engineers actually setting it; we’d ship default-on-nits and let the loud ones turn them off.
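
A sketch of the “materially changed” predicate from item 1, reusing the same assumed PR-record attributes as the labeling cron above.

def needs_rereview(pr, new_commit) -> bool:
    """Re-run the bot only if the push touches a file it previously flagged."""
    flagged = {c["file"] for c in pr.bot_review.inline_comments}
    return bool(flagged & set(new_commit.changed_files))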

The thing we’d not change: the propose-then-act pattern. It’s saved us from agent misbehavior incidents twice in three weeks. Keep it.

Try this — predict the eval delta

Mental experiments to play forward on this stack:

  1. Make tools commit immediately instead of queueing (drop the propose-then-act pattern). Predict: action rate stays similar at first; then a single agent loop bug causes 47 noisy comments on one PR; you spend a week cleaning up, lose maintainer trust, ship a hotfix that re-introduces queueing. The audit trail is what saves you. Test this hypothesis interactively in the Agent Trace demo — toggle failure injection and watch how queued vs immediate actions diverge under failure.

  2. Drop the second-pass filter that scores comments before posting. Predict: mean comments rises from 2.3 → ~4.0; action rate drops from 71% → ~52%. The filter doesn’t catch real bugs; it catches noise. Removing it amplifies a different real-world signal — engineer review fatigue.

  3. Use a 4-bit quantized 70B model instead of 4-bit 8B. Predict: quality lifts (panel score 4.05 → ~4.4); cost ~5× ($0.04 → ~$0.20/review). Worth it on a 1000-engineer monorepo where each missed bug costs hours; not worth it on a small team.

  4. Add the repository-context retrieval mentioned in v2. Predict: action rate barely moves; type of comments shifts from “found a bug” toward “this naming is inconsistent with lib/auth/...”. Different value, not more value. The RAG Visualizer demo shows what code-corpus retrieval would look like — same primitive, different dataset.

  5. Run the agent without finalize_review as a hard termination signal. Predict: ~5% of runs hit max_iters without a verdict; the wrapper falls back to verdict=ERROR. This is the timeout failure mode; see the Multi-Agent demo’s tool-timeout toggle for what it looks like in a fan-out context.

Cross-references

Demos that exercise the underlying pieces:

  • Tool Use demo — the schema → call → result dance, with predicted-outcome try-this prompts
  • Agent Trace demo — full agent loops with failure injection toggle (none / transient error / permanent error) showing how retry/give-up looks in a trace
  • Multi-Agent demo — single-agent loops scale to multi-agent; the failure modes (retry budget, timeout, degradation) generalize

Code-side companions in /ship:

  • /ship/05 — the FastAPI webhook handler (stack/server.py)
  • /ship/09 — the tools registry and call_with_tools
  • /ship/10 — the Agent class with three-axis budgets
  • /ship/12 — the tracing wrappers for the agent and tools

What this case study taught vs /ship

What /ship taught (and you reused):

  • The Tool / ToolRegistry pattern, including auto-schema from type hints
  • The agent loop with three-axis budgets (iters / seconds / tokens)
  • The dispatch path that wraps tool errors and feeds them back
  • Tracing wrappers for the agent and tools

What this case study added on top:

  • Propose-then-act tools — agents queue, wrappers commit
  • Silence is acceptable as a first-class output (with prompt + filter + eval)
  • Test runs as auto-ground-truth for verdict correctness
  • Action-rate as the primary metric, not comment count

That ratio (~70% reuse, ~30% new) holds again. The pattern of agent products is “compose the /ship stack, add product-specific glue + the right metric.” This is the second example; you’ll see it once more in case study 03.

Next

Case study 03 is a research assistant — a multi-agent fan-out for cited briefs. Where the docs assistant exercised retrieval and the code-review agent exercised tools+loop, the research assistant exercises the orchestrator from /ship/11. We’ll see the cost/latency trade-off (multi-agent costs ~3× tokens for ~2× wall-clock savings) play out in real numbers, and identify when fan-out earns its place vs. when it’s a tax.