Observability & Tracing

You can’t fix what you can’t see. LLM systems are non-deterministic, multi-step, and often slow. Without observability, every bug report is a fishing expedition. With it, you replay any session, find the failing step, and fix it.

What to log

For every request:

  • Timestamp, user, session, request_id.
  • Input: full prompt(s), system prompt, RAG context, prior turns.
  • Model: name, version, provider.
  • Parameters: temperature, max_tokens, stop sequences.
  • Output: full response.
  • Tool calls: each call with input + output + duration.
  • Token counts: input, output, total.
  • Cost: dollars per call (computed from tokens × price).
  • Latency: total + per-step.
  • Errors: full stack trace + context.
  • Metadata: feature flag values, A/B test bucket, prompt version, etc.
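
Concretely, one such log record might look like the sketch below (field names are illustrative, not a standard; adapt them to whatever your platform expects):

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json

@dataclass
class LLMCallRecord:
    # Identity and timing
    request_id: str
    session_id: str
    user_id: str
    timestamp: float
    # Model and parameters
    model: str
    provider: str
    temperature: float
    max_tokens: int
    # Payloads
    system_prompt: str
    messages: list[dict[str, str]]              # prior turns plus the current input
    rag_context: list[str]
    output: str
    # Accounting and diagnostics (defaults so partial records are still valid)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0                       # tokens × per-token price
    latency_ms: float = 0.0
    error: str | None = None                    # stack trace + context on failure
    metadata: dict[str, str] = field(default_factory=dict)  # flags, A/B bucket, prompt version

def log_record(record: LLMCallRecord) -> None:
    # One JSON line per call; ship it to your log pipeline or tracing backend.
    print(json.dumps(asdict(record)))
```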

This is a lot. Cap volume:

  • Sampling: keep 100% for errors, 10% for normal traffic.
  • Truncation: very long contexts can be truncated/summarized in logs.
  • Retention: hot for 7–30 days, cold for 90+, archive thereafter.
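
The sampling rule above, sketched (the 10% rate is a knob, not a recommendation; hashing the request ID keeps the keep/drop decision consistent across services):

```python
import hashlib
import random

def should_keep_trace(had_error: bool, request_id: str | None = None) -> bool:
    """Keep 100% of errored requests and ~10% of normal traffic."""
    rate = 1.0 if had_error else 0.10
    if request_id is not None:
        # Deterministic sampling: every service that sees this request makes the same decision.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
        return bucket / 10_000 < rate
    return random.random() < rate
```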

Tools

Observability platforms purpose-built for LLM systems:

  • Langfuse: open-source, self-host or cloud. Strong tracing + eval.
  • LangSmith (LangChain): tracing + datasets + eval.
  • Phoenix (Arize): open-source, evaluation-heavy.
  • Helicone: simple proxy-based tracing.
  • Braintrust: production-grade with eval integration.
  • Traceloop: OTel-native observability.
  • Weights & Biases Weave: integrated with their ML platform.

For general APM (still useful):

  • Datadog, Honeycomb, Grafana Loki, OpenTelemetry.

LLM-specific platforms understand messages, tool calls, and cost natively; general APM treats them as opaque strings.

Tracing patterns

Per-request trace

A trace = full record of one request through your system. Includes:

  • Initial input.
  • Each LLM call with prompt + response.
  • Each tool call.
  • Each retrieval with query + results.
  • Final output.

A trace looks like a flame graph; each node is a step.

Distributed tracing

For multi-service or agent-orchestrated systems, propagate trace IDs across services. OpenTelemetry context propagation is the standard.
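
A minimal sketch of that propagation with the OpenTelemetry Python SDK; the service names, URL, and payload shape are illustrative:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("orchestrator")

# Caller: open a span and inject its context into the outgoing HTTP headers.
def call_retrieval_service(query: str) -> dict:
    with tracer.start_as_current_span("retrieval.request"):
        headers: dict[str, str] = {}
        inject(headers)  # adds the W3C `traceparent` header for the current span
        resp = requests.post("http://retrieval:8080/search",
                             json={"query": query}, headers=headers, timeout=10)
        return resp.json()

# Callee: extract the incoming context so its spans join the caller's trace.
def handle_search(request_headers: dict[str, str], query: str) -> None:
    ctx = extract(request_headers)
    with tracer.start_as_current_span("retrieval.search", context=ctx):
        ...  # run the search; any child spans land under the same trace
```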

Spans and attributes

Each step is a span. Attributes per span:

  • llm.model, llm.provider
  • llm.prompt_tokens, llm.completion_tokens
  • llm.temperature
  • tool.name, tool.args, tool.result
  • retrieval.query, retrieval.k, retrieval.docs

Standardize on the OpenTelemetry semantic conventions for generative AI (first released in 2024 and still evolving).
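
A sketch of one instrumented LLM call using the attribute keys listed above (`llm.*`); the official GenAI conventions use `gen_ai.*` keys, so use whichever your platform reads. `call_model` and the model name are placeholders:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def call_model(prompt: str, model: str) -> tuple[str, dict]:
    # Placeholder for your provider client; returns (response_text, usage_dict).
    raise NotImplementedError

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    # One span per LLM call, annotated with model, parameters, and token counts.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.provider", "openai")
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.temperature", 0.2)
        text, usage = call_model(prompt, model)
        span.set_attribute("llm.prompt_tokens", usage["prompt_tokens"])
        span.set_attribute("llm.completion_tokens", usage["completion_tokens"])
        return text
```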

Replays

A killer feature of LLM observability: replay any past session. Go from a user complaint to a click-through view of every step the system took.

[Trace #abc123] User: "Help with my refund"
  → System decided: route to refund agent
    → Tool call: search_orders(user_id=42) → 5 orders
    → LLM: "Which order?" 
    → User: "the latest"
    → Tool call: get_order(id=last) → ...
    → ...

Replay should be one click in your tracing tool. If it isn’t, fix that.

Datasets from production

Production traces are training and eval data:

  • Sample N requests/day → labeled set for offline eval.
  • Annotate a subset (good / bad / unsure).
  • Build domain-specific eval sets from real failure modes.

Most LLM-observability platforms expose APIs to export traces to dataset format. Use them.
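
A sketch of the export-and-annotate step; `fetch_yesterdays_traces` stands in for whatever export API your platform provides:

```python
import json
import random

def fetch_yesterdays_traces() -> list[dict]:
    # Placeholder: call your observability platform's trace-export API here.
    raise NotImplementedError

def build_eval_candidates(sample_size: int = 200,
                          path: str = "eval_candidates.jsonl") -> None:
    traces = fetch_yesterdays_traces()
    sample = random.sample(traces, min(sample_size, len(traces)))
    with open(path, "w") as f:
        for t in sample:
            f.write(json.dumps({
                "input": t["input"],        # what the user asked
                "output": t["output"],      # what the system answered
                "trace_id": t["trace_id"],  # link back to the full trace
                "label": None,              # filled in during annotation: good / bad / unsure
            }) + "\n")
```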

Live monitoring

Beyond per-request traces, aggregate metrics:

Quality

  • LLM-judge scores (sampled).
  • User feedback (thumbs).
  • Retry rate (proxy for “user dissatisfied”).
  • Faithfulness scores.

Performance

  • Latency: p50, p95, p99.
  • Error rate.
  • Timeout rate.

Cost

  • $/request.
  • $/active user.
  • $/successful task.
  • Tokens per request distribution.

Safety

  • Refusal rate.
  • Filter trigger rate.
  • Jailbreak attempt detection rate.

Dashboards in Grafana / Datadog / your tracing tool’s UI.
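
A sketch of recording these aggregates with the OpenTelemetry metrics API (metric names and the `feature` attribute are illustrative); p50/p95/p99 fall out of the latency histogram on the backend:

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm-app")

request_latency = meter.create_histogram(
    "llm.request.latency", unit="ms", description="End-to-end request latency")
request_cost = meter.create_histogram(
    "llm.request.cost", unit="USD", description="Cost per request")
feedback_events = meter.create_counter(
    "llm.user_feedback", description="Thumbs up/down events")

def record_request(latency_ms: float, cost_usd: float, feature: str) -> None:
    attrs = {"feature": feature}
    request_latency.record(latency_ms, attributes=attrs)
    request_cost.record(cost_usd, attributes=attrs)

def record_feedback(positive: bool, feature: str) -> None:
    feedback_events.add(1, attributes={"feature": feature, "positive": positive})
```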

Anomaly detection

Production drifts. Set up alerts:

  • Cost spike: 2× average → page someone.
  • Latency regression: p95 over threshold for 10 min.
  • Quality regression: judge scores drop > X%.
  • Error rate spike: >5% errors.
  • Filter trigger spike: possible attack or prompt injection wave.
  • Token usage outlier: a single request burning 100k tokens.
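
Most of these rules live in your metrics backend's alerting config, but the logic is simple. The cost-spike check, sketched with example thresholds:

```python
from statistics import mean

def cost_spike(hourly_costs: list[float], current_hour_cost: float,
               spike_factor: float = 2.0) -> bool:
    """True when the current hour costs more than spike_factor x the trailing 24h average."""
    if not hourly_costs:
        return False
    baseline = mean(hourly_costs[-24:])
    return current_hour_cost > spike_factor * baseline
```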

Debugging workflow

When something breaks in production:

  1. Find affected traces (via user ID, time range, error message).
  2. Replay an example end-to-end.
  3. Inspect each step: was the input correct? Did the model return what was expected? Did the tool call succeed?
  4. Identify the failing step.
  5. Reproduce locally with the same inputs.
  6. Fix, write a regression test, ship.

This loop should take 30 minutes for a typical issue, not 3 days.
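
Step 5 is mechanical if the trace recorded full inputs; a sketch, where `load_trace` and `call_model` are placeholders for your platform's export API and your provider client:

```python
def load_trace(trace_id: str) -> dict:
    # Placeholder: fetch the full trace from your observability platform.
    raise NotImplementedError

def call_model(model: str, messages: list[dict], temperature: float) -> str:
    # Placeholder: your provider client.
    raise NotImplementedError

def reproduce_locally(trace_id: str) -> None:
    trace = load_trace(trace_id)
    for step in trace["llm_calls"]:
        # Re-run each recorded call with the exact recorded prompt and parameters.
        replayed = call_model(model=step["model"],
                              messages=step["messages"],
                              temperature=step["temperature"])
        print(f"step {step['name']}")
        print("  recorded:", step["output"][:200])
        print("  replayed:", replayed[:200])
```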

Cost & token attribution

Per-feature, per-customer, per-experiment:

  • Tag every request with feature/experiment ID.
  • Aggregate cost by tag.
  • See which features are unexpectedly expensive.
  • Catch a prompt change that doubled token usage before you get the bill.
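
Concretely: tag at request time, aggregate offline. A sketch (tag keys, example rows, and the pandas aggregation are illustrative):

```python
import pandas as pd
from opentelemetry import trace

# At request time: tag the active span so every cost number carries its attribution.
def tag_request(feature: str, experiment: str, user_id: str) -> None:
    span = trace.get_current_span()
    span.set_attribute("app.feature", feature)
    span.set_attribute("app.experiment", experiment)
    span.set_attribute("app.user_id", user_id)

# Offline: one exported row per request, grouped by tag.
rows = [
    {"feature": "refund_bot", "experiment": "prompt_v2", "cost_usd": 0.013},
    {"feature": "refund_bot", "experiment": "prompt_v1", "cost_usd": 0.006},
    {"feature": "search",     "experiment": "prompt_v1", "cost_usd": 0.002},
]
df = pd.DataFrame(rows)
print(df.groupby(["feature", "experiment"])["cost_usd"].agg(["sum", "mean", "count"]))
```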

Sample-based analysis

Pick 100 random traces from yesterday. Read them. You’ll find:

  • Issues you didn’t know about.
  • Patterns of user behavior you missed.
  • Tool calls that fail in unexpected ways.
  • Unintentional high-cost flows.

Schedule this weekly. Not just engineers — PMs and designers should read traces too.

Privacy

Traces contain everything users send. Treat as sensitive:

  • Encryption at rest.
  • Access control by role.
  • Retention limits per regulation.
  • Redaction of PII before logging where required.
  • Right to delete: user requests removal → purge traces.

Most observability platforms have features for this; configure them.
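
For the redaction step, a minimal regex-based sketch; real deployments usually rely on a dedicated PII detection service, and these two patterns are only illustrative:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace matches with placeholders before the text reaches the trace exporter.
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Reach me at jane@example.com or +1 (415) 555-0100."))
```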

Synthetic monitoring

In addition to logging real traffic, run synthetic probes:

  • Every 5 minutes, send a known prompt → expect a known response.
  • Detects upstream model regressions, network issues, broken deployments.
  • Cheap; alert if probe fails.
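
A sketch of such a probe; the endpoint, payload shape, expected substring, and alerting hook are all assumptions, and the 5-minute schedule would come from cron or your scheduler:

```python
import requests

PROBE_PROMPT = "Reply with exactly the word PONG."
EXPECTED = "PONG"

def alert(message: str) -> None:
    # Placeholder: wire this to PagerDuty / Slack / your paging system.
    print("ALERT:", message)

def run_probe(endpoint: str = "http://llm-app:8000/chat", timeout_s: float = 30.0) -> bool:
    try:
        resp = requests.post(endpoint, json={"message": PROBE_PROMPT}, timeout=timeout_s)
        resp.raise_for_status()
        ok = EXPECTED in resp.json().get("reply", "")
    except requests.RequestException:
        ok = False
    if not ok:
        alert("synthetic probe failed")
    return ok
```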

Pitfalls

  • Logging everything verbose → costs and noise.
  • Logging nothing → flying blind.
  • Trusting metrics without sampling actual traces → misleading.
  • Forgetting cost monitoring → surprise bills.
  • No retention policy → endless storage growth.
  • No PII handling → compliance issues.
  • Alert fatigue → important alerts get ignored.

Practical advice

  1. Pick one observability platform on day one. Don’t add it after the fact.
  2. Trace 100% of errors, sample normal traffic.
  3. Tag traces with feature/experiment/user metadata.
  4. Read traces weekly. Not just on bug reports.
  5. Set cost and quality alerts before launch.

Watch it interactively

  • Observability Trace Viewer — distributed-trace waterfall with perturbation toggle (slow rerank / slow LLM / tool timeout / extra retry). Predict before clicking: flip “slow rerank” and the rerank span balloons, downstream spans shift right, total wall-clock grows; the bottleneck is now obvious. Click any span to see its tags + cost. This is the view your on-call rotation lives in.

Build it in code

  • /ship/12 — observability with Phoenix — OpenTelemetry instrumentation for every LLM call, tool, retrieval, agent step. ~120 lines including helpers for llm_span, tool_span, retrieval_span. Includes head-based sampling, trace IDs in HTTP responses, user attribution.
  • /case-studies/04 — customer-support bot — real product running on top of this observability layer; shows what “debug-by-trace” looks like at scale.

See also