Observability & Tracing
You can’t fix what you can’t see. LLM systems are non-deterministic, multi-step, and often slow. Without observability, every bug report is a fishing expedition. With it, you replay any session, find the failing step, and fix it.
What to log
For every request:
- Timestamp, user, session, request_id.
- Input: full prompt(s), system prompt, RAG context, prior turns.
- Model: name, version, provider.
- Parameters: temperature, max_tokens, stop sequences.
- Output: full response.
- Tool calls: each call with input + output + duration.
- Token counts: input, output, total.
- Cost: dollars per call (computed from tokens × price).
- Latency: total + per-step.
- Errors: full stack trace + context.
- Metadata: feature flag values, A/B test bucket, version of prompt, etc.
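A minimal sketch of what one such record might look like as a Python dataclass. Field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class LLMCallRecord:
    # Identity and timing
    request_id: str
    session_id: str
    user_id: str
    timestamp: float
    # Input and model
    system_prompt: str
    messages: list[dict[str, str]]   # prior turns + current input
    rag_context: list[str]
    model: str
    provider: str
    params: dict[str, Any]           # temperature, max_tokens, stop sequences, ...
    # Output and execution
    response: str = ""
    tool_calls: list[dict[str, Any]] = field(default_factory=list)  # input, output, duration each
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    error: str | None = None
    metadata: dict[str, str] = field(default_factory=dict)  # flags, A/B bucket, prompt version

    def cost_usd(self, price_in_per_1k: float, price_out_per_1k: float) -> float:
        # Cost = tokens x price, computed per call
        return (self.input_tokens * price_in_per_1k + self.output_tokens * price_out_per_1k) / 1000
```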
This is a lot. Cap volume:
- Sampling: keep 100% for errors, 10% for normal traffic.
- Truncation: very long contexts can be truncated/summarized in logs.
- Retention: hot for 7–30 days, cold for 90+, archive thereafter.
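A head-sampling decision under those rules fits in a few lines; the 10% rate and the truncation limit below are illustrative values:

```python
import hashlib

SAMPLE_RATE = 0.10         # keep 10% of normal traffic
MAX_LOGGED_CHARS = 20_000  # truncate very long contexts in logs (example limit)

def should_log(request_id: str, had_error: bool) -> bool:
    if had_error:
        return True  # always keep errors
    # Deterministic sampling: the same request_id gets the same decision everywhere
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < SAMPLE_RATE * 100

def truncate_for_log(text: str) -> str:
    if len(text) <= MAX_LOGGED_CHARS:
        return text
    return text[:MAX_LOGGED_CHARS] + f"... [truncated {len(text) - MAX_LOGGED_CHARS} chars]"
```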
Tools
Observability platforms purpose-built for LLM systems:
- Langfuse: open-source, self-host or cloud. Strong tracing + eval.
- LangSmith (LangChain): tracing + datasets + eval.
- Phoenix (Arize): open-source, evaluation-heavy.
- Helicone: simple proxy-based tracing.
- Braintrust: production-grade with eval integration.
- Traceloop: OTel-native observability.
- Weights & Biases Weave: integrated with their ML platform.
For general APM (still useful):
- Datadog, Honeycomb, Grafana Loki, OpenTelemetry.
LLM-specific platforms understand messages, tool calls, and cost natively. General APM treats them as opaque strings.
Tracing patterns
Per-request trace
A trace = full record of one request through your system. Includes:
- Initial input.
- Each LLM call with prompt + response.
- Each tool call.
- Each retrieval with query + results.
- Final output.
A trace looks like a flame graph; each node is a step.
Distributed tracing
For multi-service or agent-orchestrated systems, propagate trace IDs across services. OpenTelemetry context propagation is the standard.
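With the OpenTelemetry Python API, propagation amounts to injecting the current context into outgoing headers and extracting it on the receiving side. A minimal sketch, assuming a configured OTel SDK; the service URL and the tool-service handler are placeholders:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("agent-orchestrator")

# Calling service: inject the active trace context into outgoing HTTP headers.
def call_tool_service(payload: dict) -> dict:
    with tracer.start_as_current_span("call_tool_service"):
        headers: dict[str, str] = {}
        inject(headers)  # adds traceparent/tracestate so the next hop joins the same trace
        return requests.post("http://tools.internal/run", json=payload, headers=headers).json()

# Receiving service: extract the context so its spans attach to the same trace.
def handle_tool_request(headers: dict, payload: dict) -> dict:
    ctx = extract(headers)
    with tracer.start_as_current_span("run_tool", context=ctx):
        ...  # do the actual tool work here
        return {"ok": True}
```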
Spans and attributes
Each step is a span. Attributes per span:
- llm.model, llm.provider
- llm.prompt_tokens, llm.completion_tokens
- llm.temperature
- tool.name, tool.args, tool.result
- retrieval.query, retrieval.k, retrieval.docs
Standardize on OpenTelemetry semantic conventions for GenAI (released 2024–2025).
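Setting these is one call per attribute with the OpenTelemetry API. A sketch using the llm.* names from the list above (the attribute values and the call_model wrapper are illustrative); the gen_ai.* semantic-convention names drop in the same way:

```python
from typing import Callable
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def traced_llm_call(prompt: str, call_model: Callable[[str], tuple[str, dict]],
                    model: str = "gpt-4o") -> str:
    # call_model is your provider client, returning (text, usage_dict); injected here.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.provider", "openai")
        span.set_attribute("llm.temperature", 0.2)
        text, usage = call_model(prompt)
        span.set_attribute("llm.prompt_tokens", usage["prompt_tokens"])
        span.set_attribute("llm.completion_tokens", usage["completion_tokens"])
        return text
```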
Replays
A killer feature of LLM observability: replay any past session. Go from a user complaint to a click-through view of every step the system took.
[Trace #abc123] User: "Help with my refund"
→ System decided: route to refund agent
→ Tool call: search_orders(user_id=42) → 5 orders
→ LLM: "Which order?"
→ User: "the latest"
→ Tool call: get_order(id=last) → ...
→ ...
Replay should be one click in your tracing tool. If it isn’t, fix that.
Datasets from production
Production traces are training and eval data:
- Sample N requests/day → labeled set for offline eval.
- Annotate a subset (good / bad / unsure).
- Build domain-specific eval sets from real failure modes.
Most LLM-observability platforms expose APIs to export traces to dataset format. Use them.
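Even rolled by hand, the export is small: pull traces via your platform's API, sample, write JSONL, annotate. A generic sketch; the trace dict keys are illustrative, not any platform's schema:

```python
import json
import random

def build_eval_set(traces: list[dict], n: int = 100, path: str = "eval_set.jsonl") -> None:
    # traces: exported from your observability platform's API, one dict per trace.
    sample = random.sample(traces, min(n, len(traces)))
    with open(path, "w") as f:
        for t in sample:
            f.write(json.dumps({
                "trace_id": t["id"],
                "input": t["input"],
                "output": t["output"],
                "label": None,  # filled in during annotation: good / bad / unsure
            }) + "\n")
```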
Live monitoring
Beyond per-request traces, aggregate metrics:
Quality
- LLM-judge scores (sampled).
- User feedback (thumbs).
- Retry rate (proxy for “user dissatisfied”).
- Faithfulness scores.
Performance
- Latency: p50, p95, p99.
- Error rate.
- Timeout rate.
Cost
- $/request.
- $/active user.
- $/successful task.
- Tokens per request distribution.
Safety
- Refusal rate.
- Filter trigger rate.
- Jailbreak attempt detection rate.
Dashboards in Grafana / Datadog / your tracing tool’s UI.
Anomaly detection
Production drifts. Set up alerts:
- Cost spike: 2× average → page someone.
- Latency regression: p95 over threshold for 10 min.
- Quality regression: judge scores drop > X%.
- Error rate spike: >5% errors.
- Filter trigger spike: possible attack or prompt injection wave.
- Token usage outlier: a single request burning 100k tokens.
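Most of these alerts reduce to comparing a rolling window against a baseline. A sketch of the cost-spike and token-outlier checks, using the example thresholds above; the paging hook is whatever your alerting stack provides:

```python
from statistics import mean

def check_cost_spike(recent_costs: list[float], baseline_costs: list[float]) -> bool:
    # Page someone if the recent average is 2x the baseline average.
    return mean(recent_costs) > 2 * mean(baseline_costs)

def check_token_outliers(requests: list[dict], limit: int = 100_000) -> list[str]:
    # Flag single requests burning more than `limit` tokens.
    return [r["request_id"] for r in requests if r["total_tokens"] > limit]

# Example: last hour vs. the same hour last week.
assert check_cost_spike(recent_costs=[0.09, 0.11], baseline_costs=[0.04, 0.05])
assert check_token_outliers([{"request_id": "r1", "total_tokens": 250_000}]) == ["r1"]
```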
Debugging workflow
When something breaks in production:
- Find affected traces (via user ID, time range, error message).
- Replay an example end-to-end.
- Inspect each step: was input correct? did the model output what was expected? did the tool work?
- Identify the failing step.
- Reproduce locally with the same inputs.
- Fix, write a regression test, ship.
This loop should take 30 minutes for a typical issue, not 3 days.
Cost & token attribution
Per-feature, per-customer, per-experiment:
- Tag every request with feature/experiment ID.
- Aggregate cost by tag.
- See which features are unexpectedly expensive.
- Catch a prompt change that doubled token usage before you get the bill.
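With tags on every request, attribution is a group-by over the logged records. A minimal sketch; the record shape is illustrative:

```python
from collections import defaultdict

def cost_by_tag(records: list[dict]) -> dict[str, float]:
    # records: [{"tags": {"feature": "refund_bot", "experiment": "v2"}, "cost_usd": 0.0123}, ...]
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["tags"].get("feature", "untagged")] += r["cost_usd"]
    # Most expensive features first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```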
Sample-based analysis
Pick 100 random traces from yesterday. Read them. You’ll find:
- Issues you didn’t know about.
- Patterns of user behavior you missed.
- Tool calls that fail in unexpected ways.
- Unintentional high-cost flows.
Schedule this weekly. Not just engineers — PMs and designers should read traces too.
Privacy
Traces contain everything users send. Treat as sensitive:
- Encryption at rest.
- Access control by role.
- Retention limits per regulation.
- Redaction of PII before logging where required.
- Right to delete: user requests removal → purge traces.
Most observability platforms have features for this; configure them.
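Redaction before logging can be as simple as a regex pass over user-supplied text; crude, but a starting point, and worth replacing with a proper PII service where the stakes are higher. A rough sketch with illustrative patterns:

```python
import re

# Illustrative patterns only; not exhaustive and order-sensitive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "card":  re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```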
Synthetic monitoring
In addition to logging real traffic, run synthetic probes:
- Every 5 minutes, send a known prompt → expect a known response.
- Detects upstream model regressions, network issues, broken deployments.
- Cheap; alert if probe fails.
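A probe is a handful of lines in a cron job or scheduler: fixed prompt in, expected marker out, alert otherwise. A sketch with the model client and paging hook injected as placeholders; the 10-second latency bound is illustrative:

```python
import time
from typing import Callable

PROBE_PROMPT = "Reply with exactly: HEALTHCHECK-OK"

def run_probe(call_model: Callable[[str], str], alert: Callable[[str], None]) -> None:
    # call_model: your model client; alert: your paging hook.
    start = time.monotonic()
    try:
        reply = call_model(PROBE_PROMPT)
        latency = time.monotonic() - start
        if "HEALTHCHECK-OK" not in reply or latency > 10:
            alert(f"probe degraded: latency={latency:.1f}s reply={reply[:80]!r}")
    except Exception as exc:
        alert(f"probe failed: {exc!r}")
```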
Pitfalls
- Logging everything at full verbosity → cost and noise.
- Logging nothing → flying blind.
- Trusting metrics without sampling actual traces → misleading.
- Forgetting cost monitoring → surprise bills.
- No retention policy → endless storage growth.
- No PII handling → compliance issues.
- Alert fatigue → important alerts get ignored.
Practical advice
- Pick one observability platform on day one. Don’t add it after the fact.
- Trace 100% of errors, sample normal traffic.
- Tag traces with feature/experiment/user metadata.
- Read traces weekly. Not just on bug reports.
- Set cost and quality alerts before launch.
Watch it interactively
- Observability Trace Viewer — distributed-trace waterfall with perturbation toggle (slow rerank / slow LLM / tool timeout / extra retry). Predict before clicking: flip “slow rerank” and the rerank span balloons, downstream spans shift right, total wall-clock grows; the bottleneck is now obvious. Click any span to see its tags + cost. This is the view your on-call rotation lives in.
Build it in code
- /ship/12 — observability with Phoenix — OpenTelemetry instrumentation for every LLM call, tool, retrieval, agent step. ~120 lines including helpers for llm_span, tool_span, retrieval_span. Includes head-based sampling, trace IDs in HTTP responses, user attribution.
- /case-studies/04 — customer-support bot — real product running on top of this observability layer; shows what “debug-by-trace” looks like at scale.