Observability & Tracing

You can’t fix what you can’t see. LLM systems are non-deterministic, multi-step, and often slow. Without observability, every bug report is a fishing expedition. With it, you replay any session, find the failing step, and fix it.

What to log

For every request:

  • Timestamp, user, session, request_id.
  • Input: full prompt(s), system prompt, RAG context, prior turns.
  • Model: name, version, provider.
  • Parameters: temperature, max_tokens, stop sequences.
  • Output: full response.
  • Tool calls: each call with input + output + duration.
  • Token counts: input, output, total.
  • Cost: dollars per call (computed from tokens × price).
  • Latency: total + per-step.
  • Errors: full stack trace + context.
  • Metadata: feature flag values, A/B test bucket, prompt version, etc.
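
Concretely, one such log record might look like the sketch below (field names are illustrative, not a standard; adapt them to whatever your platform expects):

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json

@dataclass
class LLMCallRecord:
    # Identity and timing
    request_id: str
    session_id: str
    user_id: str
    timestamp: float
    # Model and parameters
    model: str
    provider: str
    temperature: float
    max_tokens: int
    # Payloads
    system_prompt: str
    messages: list[dict[str, str]]              # prior turns plus the current input
    rag_context: list[str]
    output: str
    # Accounting and diagnostics (defaults so partial records are still valid)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0                       # tokens × per-token price
    latency_ms: float = 0.0
    error: str | None = None                    # stack trace + context on failure
    metadata: dict[str, str] = field(default_factory=dict)  # flags, A/B bucket, prompt version

def log_record(record: LLMCallRecord) -> None:
    # One JSON line per call; ship it to your log pipeline or tracing backend.
    print(json.dumps(asdict(record)))
```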

This is a lot. Cap volume:

  • Sampling: keep 100% for errors, 10% for normal traffic.
  • Truncation: very long contexts can be truncated/summarized in logs.
  • Retention: hot for 7–30 days, cold for 90+, archive thereafter.
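
The sampling rule above, sketched (the 10% rate is a knob, not a recommendation; hashing the request ID keeps the keep/drop decision consistent across services):

```python
import hashlib
import random

def should_keep_trace(had_error: bool, request_id: str | None = None) -> bool:
    """Keep 100% of errored requests and ~10% of normal traffic."""
    rate = 1.0 if had_error else 0.10
    if request_id is not None:
        # Deterministic sampling: every service that sees this request makes the same decision.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
        return bucket / 10_000 < rate
    return random.random() < rate
```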

Tools

Observability platforms purpose-built for LLM systems:

  • Langfuse: open-source, self-host or cloud. Strong tracing + eval.
  • LangSmith (LangChain): tracing + datasets + eval.
  • Phoenix (Arize): open-source, evaluation-heavy.
  • Helicone: simple proxy-based tracing.
  • Braintrust: production-grade with eval integration.
  • Traceloop: OTel-native observability.
  • Weights & Biases Weave: integrated with their ML platform.

For general APM (still useful):

  • Datadog, Honeycomb, Grafana Loki, OpenTelemetry.

LLM-specific platforms understand messages, tool calls, and cost natively; general APM treats them as opaque strings.

Tracing patterns

Per-request trace

A trace = full record of one request through your system. Includes:

  • Initial input.
  • Each LLM call with prompt + response.
  • Each tool call.
  • Each retrieval with query + results.
  • Final output.

A trace looks like a flame graph; each node is a step.

Distributed tracing

For multi-service or agent-orchestrated systems, propagate trace IDs across services. OpenTelemetry context propagation is the standard.
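
A minimal sketch of that propagation with the OpenTelemetry Python SDK; the service names, URL, and payload shape are illustrative:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("orchestrator")

# Caller: open a span and inject its context into the outgoing HTTP headers.
def call_retrieval_service(query: str) -> dict:
    with tracer.start_as_current_span("retrieval.request"):
        headers: dict[str, str] = {}
        inject(headers)  # adds the W3C `traceparent` header for the current span
        resp = requests.post("http://retrieval:8080/search",
                             json={"query": query}, headers=headers, timeout=10)
        return resp.json()

# Callee: extract the incoming context so its spans join the caller's trace.
def handle_search(request_headers: dict[str, str], query: str) -> None:
    ctx = extract(request_headers)
    with tracer.start_as_current_span("retrieval.search", context=ctx):
        ...  # run the search; any child spans land under the same trace
```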

Spans and attributes

Each step is a span. Attributes per span:

  • llm.model, llm.provider
  • llm.prompt_tokens, llm.completion_tokens
  • llm.temperature
  • tool.name, tool.args, tool.result
  • retrieval.query, retrieval.k, retrieval.docs

Standardize on the OpenTelemetry semantic conventions for generative AI (first released in 2024 and still evolving).
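
A sketch of one instrumented LLM call using the attribute keys listed above (`llm.*`); the official GenAI conventions use `gen_ai.*` keys, so use whichever your platform reads. `call_model` and the model name are placeholders:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def call_model(prompt: str, model: str) -> tuple[str, dict]:
    # Placeholder for your provider client; returns (response_text, usage_dict).
    raise NotImplementedError

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    # One span per LLM call, annotated with model, parameters, and token counts.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.provider", "openai")
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.temperature", 0.2)
        text, usage = call_model(prompt, model)
        span.set_attribute("llm.prompt_tokens", usage["prompt_tokens"])
        span.set_attribute("llm.completion_tokens", usage["completion_tokens"])
        return text
```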

Replays

A killer feature of LLM observability: replay any past session. Go from a user complaint to a click-through view of every step the system took.

[Trace #abc123] User: "Help with my refund"
  → System decided: route to refund agent
    → Tool call: search_orders(user_id=42) → 5 orders
    → LLM: "Which order?" 
    → User: "the latest"
    → Tool call: get_order(id=last) → ...
    → ...

Replay should be one click in your tracing tool. If it isn’t, fix that.

Datasets from production

Production traces are training and eval data:

  • Sample N requests/day → labeled set for offline eval.
  • Annotate a subset (good / bad / unsure).
  • Build domain-specific eval sets from real failure modes.

Most LLM-observability platforms expose APIs to export traces to dataset format. Use them.
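
A sketch of the export-and-annotate step; `fetch_yesterdays_traces` stands in for whatever export API your platform provides:

```python
import json
import random

def fetch_yesterdays_traces() -> list[dict]:
    # Placeholder: call your observability platform's trace-export API here.
    raise NotImplementedError

def build_eval_candidates(sample_size: int = 200,
                          path: str = "eval_candidates.jsonl") -> None:
    traces = fetch_yesterdays_traces()
    sample = random.sample(traces, min(sample_size, len(traces)))
    with open(path, "w") as f:
        for t in sample:
            f.write(json.dumps({
                "input": t["input"],        # what the user asked
                "output": t["output"],      # what the system answered
                "trace_id": t["trace_id"],  # link back to the full trace
                "label": None,              # filled in during annotation: good / bad / unsure
            }) + "\n")
```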

Live monitoring

Beyond per-request traces, aggregate metrics:

Quality

  • LLM-judge scores (sampled).
  • User feedback (thumbs).
  • Retry rate (proxy for “user dissatisfied”).
  • Faithfulness scores.

Performance

  • Latency: p50, p95, p99.
  • Error rate.
  • Timeout rate.

Cost

  • $/request.
  • $/active user.
  • $/successful task.
  • Tokens per request distribution.

Safety

  • Refusal rate.
  • Filter trigger rate.
  • Jailbreak attempt detection rate.

Dashboards in Grafana / Datadog / your tracing tool’s UI.
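
A sketch of recording these aggregates with the OpenTelemetry metrics API (metric names and the `feature` attribute are illustrative); p50/p95/p99 fall out of the latency histogram on the backend:

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm-app")

request_latency = meter.create_histogram(
    "llm.request.latency", unit="ms", description="End-to-end request latency")
request_cost = meter.create_histogram(
    "llm.request.cost", unit="USD", description="Cost per request")
feedback_events = meter.create_counter(
    "llm.user_feedback", description="Thumbs up/down events")

def record_request(latency_ms: float, cost_usd: float, feature: str) -> None:
    attrs = {"feature": feature}
    request_latency.record(latency_ms, attributes=attrs)
    request_cost.record(cost_usd, attributes=attrs)

def record_feedback(positive: bool, feature: str) -> None:
    feedback_events.add(1, attributes={"feature": feature, "positive": positive})
```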

Anomaly detection

Production drifts. Set up alerts:

  • Cost spike: 2× average → page someone.
  • Latency regression: p95 over threshold for 10 min.
  • Quality regression: judge scores drop > X%.
  • Error rate spike: >5% errors.
  • Filter trigger spike: possible attack or prompt injection wave.
  • Token usage outlier: a single request burning 100k tokens.
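
Most of these rules live in your metrics backend's alerting config, but the logic is simple. The cost-spike check, sketched with example thresholds:

```python
from statistics import mean

def cost_spike(hourly_costs: list[float], current_hour_cost: float,
               spike_factor: float = 2.0) -> bool:
    """True when the current hour costs more than spike_factor x the trailing 24h average."""
    if not hourly_costs:
        return False
    baseline = mean(hourly_costs[-24:])
    return current_hour_cost > spike_factor * baseline
```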

Debugging workflow

When something breaks in production:

  1. Find affected traces (via user ID, time range, error message).
  2. Replay an example end-to-end.
  3. Inspect each step: was the input correct? Did the model return what was expected? Did the tool call succeed?
  4. Identify the failing step.
  5. Reproduce locally with the same inputs.
  6. Fix, write a regression test, ship.

This loop should take 30 minutes for a typical issue, not 3 days.
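
Step 5 is mechanical if the trace recorded full inputs; a sketch, where `load_trace` and `call_model` are placeholders for your platform's export API and your provider client:

```python
def load_trace(trace_id: str) -> dict:
    # Placeholder: fetch the full trace from your observability platform.
    raise NotImplementedError

def call_model(model: str, messages: list[dict], temperature: float) -> str:
    # Placeholder: your provider client.
    raise NotImplementedError

def reproduce_locally(trace_id: str) -> None:
    trace = load_trace(trace_id)
    for step in trace["llm_calls"]:
        # Re-run each recorded call with the exact recorded prompt and parameters.
        replayed = call_model(model=step["model"],
                              messages=step["messages"],
                              temperature=step["temperature"])
        print(f"step {step['name']}")
        print("  recorded:", step["output"][:200])
        print("  replayed:", replayed[:200])
```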

Cost & token attribution

Per-feature, per-customer, per-experiment:

  • Tag every request with feature/experiment ID.
  • Aggregate cost by tag.
  • See which features are unexpectedly expensive.
  • Catch a prompt change that doubled token usage before you get the bill.
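
Concretely: tag at request time, aggregate offline. A sketch (tag keys, example rows, and the pandas aggregation are illustrative):

```python
import pandas as pd
from opentelemetry import trace

# At request time: tag the active span so every cost number carries its attribution.
def tag_request(feature: str, experiment: str, user_id: str) -> None:
    span = trace.get_current_span()
    span.set_attribute("app.feature", feature)
    span.set_attribute("app.experiment", experiment)
    span.set_attribute("app.user_id", user_id)

# Offline: one exported row per request, grouped by tag.
rows = [
    {"feature": "refund_bot", "experiment": "prompt_v2", "cost_usd": 0.013},
    {"feature": "refund_bot", "experiment": "prompt_v1", "cost_usd": 0.006},
    {"feature": "search",     "experiment": "prompt_v1", "cost_usd": 0.002},
]
df = pd.DataFrame(rows)
print(df.groupby(["feature", "experiment"])["cost_usd"].agg(["sum", "mean", "count"]))
```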

Sample-based analysis

Pick 100 random traces from yesterday. Read them. You’ll find:

  • Issues you didn’t know about.
  • Patterns of user behavior you missed.
  • Tool calls that fail in unexpected ways.
  • Unintentional high-cost flows.

Schedule this weekly. Not just engineers — PMs and designers should read traces too.

Privacy

Traces contain everything users send. Treat as sensitive:

  • Encryption at rest.
  • Access control by role.
  • Retention limits per regulation.
  • Redaction of PII before logging where required.
  • Right to delete: user requests removal → purge traces.

Most observability platforms have features for this; configure them.
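
For the redaction step, a minimal regex-based sketch; real deployments usually rely on a dedicated PII detection service, and these two patterns are only illustrative:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace matches with placeholders before the text reaches the trace exporter.
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Reach me at jane@example.com or +1 (415) 555-0100."))
```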

Synthetic monitoring

In addition to logging real traffic, run synthetic probes:

  • Every 5 minutes, send a known prompt → expect a known response.
  • Detects upstream model regressions, network issues, broken deployments.
  • Cheap; alert if probe fails.
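
A sketch of such a probe; the endpoint, payload shape, expected substring, and alerting hook are all assumptions, and the 5-minute schedule would come from cron or your scheduler:

```python
import requests

PROBE_PROMPT = "Reply with exactly the word PONG."
EXPECTED = "PONG"

def alert(message: str) -> None:
    # Placeholder: wire this to PagerDuty / Slack / your paging system.
    print("ALERT:", message)

def run_probe(endpoint: str = "http://llm-app:8000/chat", timeout_s: float = 30.0) -> bool:
    try:
        resp = requests.post(endpoint, json={"message": PROBE_PROMPT}, timeout=timeout_s)
        resp.raise_for_status()
        ok = EXPECTED in resp.json().get("reply", "")
    except requests.RequestException:
        ok = False
    if not ok:
        alert("synthetic probe failed")
    return ok
```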

Pitfalls

  • Logging everything verbose → costs and noise.
  • Logging nothing → flying blind.
  • Trusting metrics without sampling actual traces → misleading.
  • Forgetting cost monitoring → surprise bills.
  • No retention policy → endless storage growth.
  • No PII handling → compliance issues.
  • Alert fatigue → important alerts get ignored.

Practical advice

  1. Pick one observability platform on day one. Don’t add it after the fact.
  2. Trace 100% of errors, sample normal traffic.
  3. Tag traces with feature/experiment/user metadata.
  4. Read traces weekly. Not just on bug reports.
  5. Set cost and quality alerts before launch.

Watch it interactively

  • Observability Trace Viewer — distributed-trace waterfall with perturbation toggle (slow rerank / slow LLM / tool timeout / extra retry). Predict before clicking: flip “slow rerank” and the rerank span balloons, downstream spans shift right, total wall-clock grows; the bottleneck is now obvious. Click any span to see its tags + cost. This is the view your on-call rotation lives in.

Build it in code

  • /ship/12 — observability with Phoenix — OpenTelemetry instrumentation for every LLM call, tool, retrieval, agent step. ~120 lines including helpers for llm_span, tool_span, retrieval_span. Includes head-based sampling, trace IDs in HTTP responses, user attribution.
  • /case-studies/04 — customer-support bot — real product running on top of this observability layer; shows what “debug-by-trace” looks like at scale.

See also