Data Systems for AI Products
LLM features look stateless from the outside — prompt in, completion out. The reality is that any non-trivial AI product is a data system: training corpora, RAG indexes, eval sets, prompt caches, conversation logs, feedback streams, model artifacts, all coupled and all evolving. Treat it like a stateless API and you’ll be debugging silent quality decay six months in.
This article is the missing piece between the model-centric content of Stages 8–11 and the deployment mechanics covered in Stage 13’s other articles. It draws heavily on Chip Huyen’s Designing Machine Learning Systems, Gift & Deza’s Practical MLOps, and Ameisen’s Building ML-Powered Applications.
The data flow
Every production AI product has roughly this shape:
┌──────────────────────────────────┐
│ Source of truth: documents, │
│ KB articles, user-generated │
│ content, transactional DB │
└────────────┬─────────────────────┘
│
┌────────────▼─────────────────────┐
│ Ingestion pipeline: │
│ extract, chunk, embed, store │
└────────────┬─────────────────────┘
│
┌────────────────────┼────────────────────────┐
│ │ │
┌────────▼─────┐ ┌─────────▼────────┐ ┌─────────▼──────┐
│ Vector DB + │ │ Eval / golden │ │ Fine-tune / │
│ keyword idx │ │ sets │ │ distill data │
└──────┬───────┘ └──────────────────┘ └────────────────┘
│
┌──────▼──────┐
│ LLM service │ ← prompts, system messages
└──────┬──────┘
│
┌──────▼──────┐ ┌─────────────────────┐
│ User-facing │────────▶│ Trace store + │
│ response │ │ feedback signals │
└─────────────┘ └──────────┬──────────┘
│
┌───────────▼──────────┐
│ Curation pipeline: │
│ failures → eval set; │
│ successes → fine- │
│ tune corpus │
└──────────┬───────────┘
│
(loop back to top)
Every arrow is a data system. Most teams instrument a few of them; the gap between “a few” and “all of them” is the gap between a demo and a product that improves over time.
Pillar 1 — Source-to-index pipeline
The path from “a doc lives somewhere” to “it’s queryable” is more than chunk → embed → store.
What you actually need:
- Idempotent ingestion. Re-running the pipeline on the same source produces the same index. No duplicate chunks, no orphaned vectors. Use content hashes as primary keys.
- Schema versioning. Chunk size, embedder version, metadata fields all change over time. Version the schema and the index together.
- Incremental updates. When 10 docs change in a 100k-doc corpus, don’t re-embed everything. Track (doc_id, content_hash) per source doc and process only the diff (see the sketch after this list).
- Deletion propagation. When a source doc is deleted, its chunks must disappear from the index, the BM25 index, the cache, and any derived structures (graph nodes, summaries). Soft-delete with TTL is usually saner than hard-delete.
- Backfill discipline. When you upgrade the embedder, you need a parallel index, a migration window, and a cutover. Don’t mix vectors from different models — covered in Stage 9 vector databases.
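A minimal sketch of the diff step that covers idempotency, incremental updates, and deletion propagation, assuming a persisted map of previously ingested content hashes and two hypothetical index operations, embed_and_upsert and delete_chunks:

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable content hash; doubles as the idempotency key for one document version.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_ingest(source_docs, ingested, embed_and_upsert, delete_chunks):
    """Re-process only documents whose content changed since the last run.

    source_docs: {doc_id: text} pulled from the source of truth
    ingested:    {doc_id: content_hash} persisted from the previous run
    embed_and_upsert / delete_chunks: hypothetical index operations
    """
    current = {doc_id: content_hash(text) for doc_id, text in source_docs.items()}

    changed = [d for d, h in current.items() if ingested.get(d) != h]
    removed = [d for d in ingested if d not in current]

    for doc_id in changed:
        delete_chunks(doc_id)  # drop stale chunks first so re-runs stay idempotent
        embed_and_upsert(doc_id, source_docs[doc_id], current[doc_id])
    for doc_id in removed:
        delete_chunks(doc_id)  # deletion propagation

    return current  # persist as the ingestion state for the next run
```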
Common failures:
- “It worked on day one” — index built once from a snapshot, never updated; quality decays as docs change.
- “We re-embedded everything weekly” — works until the corpus exceeds a few million docs and the cost stops making sense.
- “We added a metadata field” — old chunks don’t have it; queries that filter on it silently miss them.
The classical-ML version of this is a feature store (Feast, Tecton, in-house). The LLM version is the indexed corpus + embeddings + chunk metadata. Same engineering principle: a stable, versioned, queryable derived view of source-of-truth data.
Pillar 2 — Training-serving skew
Skew is when the data your model sees in production doesn’t match the data it trained on. For LLMs, this shows up in several ways:
- Prompt template drift. You changed the production system prompt; the few-shot examples that worked under v1 now produce subtly different outputs.
- Tokenizer mismatch. Fine-tuned with one chat template, served with another. The model still produces words, but quality craters.
- Retrieval quality drift. Eval set built when the corpus was 1k chunks; production now serves 100k chunks; recall@10 metrics from the smaller corpus mislead.
- Context-length drift. Eval used 4k-token contexts; production users send 50k. Model behaves differently on long context.
Mitigations:
- Pin templates and prompts. Treat the chat template, system prompt, and tool definitions as build artifacts versioned alongside the model.
- Pin tokenizers. When fine-tuning, the exact tokenizer file is part of the artifact.
- Production-derived eval. Sample real production traffic into the eval set monthly so the eval distribution matches what users actually send.
- Shadow runs. Before promoting a new prompt or model, run it on a slice of production traffic alongside the old; compare metrics, then promote.
The classical Huyen frame: “The features used during training must be the same as those used during serving.” For LLMs, “features” means the entire prompt, not just retrieved context. Anything that goes into messages.create() is a feature; any difference is potential skew.
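One low-ceremony way to keep that whole bundle pinned is to treat it as a single artifact and log its fingerprint on every trace. A sketch with hypothetical file paths; nothing here is a required layout:

```python
import hashlib
import json

# Hypothetical paths: everything the prompt is built from lives in version
# control as files, not as strings scattered through application code.
PROMPT_ARTIFACT = {
    "system_prompt": "prompts/support_v12.txt",
    "chat_template": "templates/chat_v3.jinja",
    "tool_definitions": "tools/search_v4.json",
    "model_id": "claude-sonnet-4-6",  # pinned, never a "-latest" alias
}


def artifact_fingerprint(artifact: dict) -> str:
    """Hash the config plus the file contents it points to; any change to the
    prompt, template, or tool schema changes the fingerprint logged with a trace."""
    h = hashlib.sha256(json.dumps(artifact, sort_keys=True).encode("utf-8"))
    for value in sorted(artifact.values()):
        if "/" in value:  # crude "is this a file path" check, fine for the sketch
            with open(value, "rb") as f:
                h.update(f.read())
    return h.hexdigest()[:12]
```

Comparing the fingerprint on a failing production trace with the fingerprint recorded at eval or fine-tune time is a one-line skew check.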
Pillar 3 — Distribution shift detection
Production data drifts. New topics, new user populations, new failure modes appear without warning. Detection options:
- Input drift: monitor distributions of input lengths, language mix, topic clusters (via clustering of query embeddings), tool-call patterns. Alert when a cluster’s share doubles week-over-week (sketched after this list).
- Output drift: monitor output length distributions, refusal rates, structured-output validity rates, citation patterns.
- Quality drift: a small fixed eval set, run nightly. The drift signal is the same set giving different scores over time, which means something upstream changed (model version, prompt, retrieval).
- Feedback drift: thumbs-down rates by topic, user, time of day.
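The week-over-week share check from the input-drift bullet is straightforward once queries carry a cluster label; the clustering itself (e.g. k-means over query embeddings) is assumed to happen upstream. A sketch:

```python
from collections import Counter


def cluster_share_alerts(this_week, last_week, factor=2.0, min_share=0.01):
    """Flag topic clusters whose share of traffic grew by `factor` week-over-week.

    this_week / last_week: one cluster label per production query.
    """
    def shares(labels):
        counts = Counter(labels)
        total = sum(counts.values()) or 1
        return {c: n / total for c, n in counts.items()}

    now, before = shares(this_week), shares(last_week)
    alerts = []
    for cluster, share in now.items():
        baseline = before.get(cluster, min_share)  # brand-new clusters vs. a floor
        if share >= factor * baseline and share >= min_share:
            alerts.append({"cluster": cluster, "before": baseline, "now": share})
    return alerts
```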
For LLM products specifically, the model itself drifting is a real concern: API providers change underlying versions, even at “stable” model IDs. Monthly regression-eval runs catch this. Pin model IDs explicitly (claude-sonnet-4-6 not claude-sonnet-latest).
Tools: arize.com, fiddler.ai, evidentlyai.com for classical ML drift detection; Phoenix / Langfuse / Helicone for LLM-specific monitoring.
Pillar 4 — The feedback loop
The hardest pillar. The point of running in production is that real usage tells you what’s broken — but only if you instrument the loop.
Inputs to the loop:
- Explicit signals: thumbs up/down, user edits to outputs, escalations to humans.
- Implicit signals: retries, conversation abandonment, copy-pastes out of the product, click-through on cited sources.
- Trace replay: filter for low-confidence outputs, slow traces, schema-violation events; sample a few per week, hand-review.
Outputs of the loop:
- Failures → eval set. Every “this answer was wrong” becomes a regression test (a concrete sketch follows below). Over months, your eval set grows from synthetic to deeply real.
- Successes → fine-tune corpus. Outputs the user accepted, especially with edits, are training signal for SFT or DPO.
- Patterns → product changes. A class of queries that consistently fail might need a new tool, a new RAG index, or a UX change to head them off.
This is what Huyen calls “continual learning.” The classical-ML version retrains the model on fresh data; the LLM version more often updates prompts, RAG sources, eval sets — fine-tunes are the heavy hammer.
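As one concrete shape for the failures-to-eval path: a hand-reviewed failure becomes a row in a version-controlled JSONL eval file. The trace field names below are illustrative, not a required schema:

```python
import datetime
import json


def failure_to_eval_case(trace: dict, reviewer_note: str) -> dict:
    """Turn a hand-reviewed production failure into a regression-eval case.
    `trace` is a flattened trace record; field names are illustrative."""
    return {
        "id": trace["trace_id"],
        "added_at": datetime.date.today().isoformat(),
        "input": trace["user_input"],
        "retrieved_chunk_ids": trace.get("chunk_ids", []),
        "bad_output": trace["model_output"],
        "failure_mode": reviewer_note,  # e.g. "cited a retired policy document"
        "expected_behavior": None,      # filled in during review
    }


def append_to_eval_set(case: dict, path: str = "evals/regression.jsonl") -> None:
    # The eval set is a production artifact in version control, not a notebook cell.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
```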
Privacy note: anything captured into the loop must comply with whatever data agreement you have with users. PII redaction at trace ingest, configurable retention, opt-out for eval/training-data inclusion. See enterprise considerations.
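A minimal sketch of redaction at trace ingest; the patterns are illustrative only, and real deployments typically lean on a dedicated tool (e.g. Microsoft Presidio) rather than hand-rolled regexes:

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace obvious PII with typed placeholders before the trace is persisted."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Redaction happens on the way into the trace store, not at query time.
event = {"user_input": redact("reach me at jane@example.com or +1 415 555 0100")}
```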
Pillar 5 — Versioning and lineage
Three things every audit eventually asks for:
- What was the system at time T? — model version, prompt version, RAG index version, eval set version.
- Why did this specific query produce this specific output? — full trace from input to output with intermediate steps.
- Where did this training-data row come from? — source URL, ingestion timestamp, transformations applied, consent status.
Implementing:
- Git for code, prompts, configs. Prompts live as files, not strings in code. Diffs in PR review.
- Model registry (MLflow, Weights & Biases artifacts, HuggingFace Hub, in-house). Every fine-tuned model has metadata: training data hash, hyperparameters, eval metrics, parent model.
- Index registry. Each vector-DB collection version has metadata: source-doc count, embedder version, chunking config, build timestamp.
- Trace retention. Hot 7–30 days, cold 90+, archived to compliance window.
- Data lineage. Source URL → document ID → chunk ID → trace ID. Each chunk in an index points back to its source. Each output cites the chunks. Each chunk’s row metadata includes ingest provenance.
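The lineage chain is easiest to maintain when every chunk row carries its provenance. A sketch of such a record, with illustrative field names:

```python
# Per-chunk metadata that makes "where did this come from?" a lookup, not archaeology.
chunk_record = {
    "chunk_id": "doc_4812#003",
    "doc_id": "doc_4812",
    "source_url": "https://kb.example.com/articles/4812",
    "content_hash": "sha256:1f6b0c9e",   # ties the chunk to one source version
    "ingested_at": "2026-01-12T08:30:00Z",
    "embedder": "embed-model-v3",        # placeholder name
    "chunking": {"size": 512, "overlap": 64},
    "index_version": "kb-index-v7",
    "consent_status": "licensed",        # needed for training-data lineage
}
```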
Without this, “why is the system worse this week?” is unanswerable.
Pillar 6 — Continuous training (or its LLM equivalent)
Classical ML teams retrain models weekly or daily on fresh data. LLM teams rarely retrain that often — but the equivalent operations all need pipelines:
- Eval set refresh — sample new failures into the regression eval.
- RAG index refresh — re-ingest changed source docs.
- Prompt iteration — try a new variant, A/B against the current.
- Periodic fine-tune — quarterly or as accumulated signal warrants.
- Reward model / judge calibration — if you use an LLM-as-judge, calibrate against human labels every quarter; judges drift too.
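For the judge-calibration item, the quarterly check can be as simple as agreement plus Cohen’s kappa between judge verdicts and human labels on the same items. A sketch assuming boolean pass/fail labels:

```python
def judge_calibration(judge: list, human: list) -> dict:
    """Agreement and Cohen's kappa between LLM-as-judge verdicts and human labels
    on the same eval items (both as pass/fail booleans). Falling kappa across
    quarters means the judge has drifted and needs recalibration."""
    assert judge and len(judge) == len(human)
    n = len(judge)
    p_o = sum(j == h for j, h in zip(judge, human)) / n  # observed agreement
    p_j, p_h = sum(judge) / n, sum(human) / n
    p_e = p_j * p_h + (1 - p_j) * (1 - p_h)              # agreement expected by chance
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"agreement": p_o, "kappa": kappa}
```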
These all need CI/CD: a pipeline triggered on data change that runs evals, pushes to staging, and promotes on green metrics. The patterns are the same as classical CI/CD, with a few twists (a promotion-gate sketch follows this list):
- Tests are mostly eval metrics, not unit tests.
- “Green” is metrics within tolerance, not “0 failures.”
- Rollbacks happen by reverting prompt versions, swapping model IDs, or restoring an index snapshot.
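A sketch of the “green means within tolerance” gate, assuming higher-is-better eval metrics; metric names and thresholds are illustrative:

```python
def promote_if_green(candidate: dict, baseline: dict, tolerances: dict) -> bool:
    """Block promotion if any metric regresses beyond its allowed tolerance.

    candidate / baseline: metric name -> score (higher is better)
    tolerances: allowed absolute drop per metric (default 0.0)
    """
    regressions = {}
    for metric, base in baseline.items():
        drop = base - candidate.get(metric, float("-inf"))
        if drop > tolerances.get(metric, 0.0):
            regressions[metric] = round(drop, 4)
    if regressions:
        print(f"blocked: regressions beyond tolerance: {regressions}")
        return False
    return True


# Example: a 2-point dip in groundedness is tolerated, any accuracy drop is not.
promote_if_green(
    candidate={"answer_accuracy": 0.91, "groundedness": 0.88},
    baseline={"answer_accuracy": 0.90, "groundedness": 0.89},
    tolerances={"groundedness": 0.02},
)
```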
Practical MLOps spends most of its pages on this; the patterns transfer directly to LLM workflows.
Build vs. buy
For each pillar, you can build or buy:
| Pillar | Build | Buy |
|---|---|---|
| Source-to-index | Custom Python + Postgres/pgvector | LlamaCloud, Vectara, Pinecone Assistants |
| Skew prevention | Convention + code review | Anyscale, MosaicML pipelines |
| Drift detection | Custom + Grafana | Arize, Fiddler, Evidently |
| Feedback loop | Custom + Phoenix/Langfuse | LangSmith, Helicone, Braintrust |
| Versioning | Git + model registry (MLflow / W&B) | Weights & Biases, ClearML, Comet |
| Continuous training | GitHub Actions + custom | Vertex Pipelines, SageMaker Pipelines, Anyscale |
Most teams in 2026 buy the observability layer (Phoenix or Langfuse) and build the rest with thin wrappers around open-source primitives. At scale, the buy decisions stack up.
Anti-patterns
- Stateless mindset. “It’s just an API call.” No it isn’t; the corpus is a database, the prompt is code, the eval is a regression test, and the trace is an audit log.
- No versioning of prompts. A new system prompt ships, quality drops, nobody can find the diff.
- One-shot ingestion. Built the index once; corpus drifted; no one noticed until quality complaints rolled in.
- Eval set in someone’s local Jupyter. It’s a production artifact. Treat it like one.
- No PII handling in traces. Captured everything; legal called.
- Drift alerts firing constantly, ignored. The classic alert-fatigue problem — alerts that don’t lead to action are noise. Either tune them or remove them.
A starter checklist
Before launching an LLM feature to >100 users:
- Source docs are stored in a known place; their hash is recorded; ingestion is repeatable.
- Index version is tagged and saved; rollback procedure documented.
- Prompts and system messages are in version control, not hard-coded.
- Tokenizer and chat template (if fine-tuned) are pinned with the model artifact.
- Tracing captures input, retrievals, model output, latency, cost.
- PII redaction (if applicable) runs at trace ingest.
- At least one nightly automated eval; one weekly metric review.
- Drift signals on: input length, output length, refusal rate, structured-output validity.
- Feedback path exists: how does a “this is wrong” complaint reach an engineer?
- Failures-to-eval-set pipeline runs at least monthly.
If you can’t check all of these, you have data debt. It’s manageable; track it explicitly.
See also
- Deployment architectures — model serving, infrastructure
- Observability & tracing — what to log and where
- Evaluation & benchmarks — eval discipline
- Stage 9 — Vector databases — the index layer
- Stage 10 — Data & tooling — training data pipelines
References
- Chip Huyen, Designing Machine Learning Systems (O’Reilly 2022) — chapters 3 (Data Engineering), 4 (Training Data), 5 (Feature Engineering), 8 (Distribution Shifts and Monitoring), 10 (Infrastructure and Tooling for MLOps).
- Noah Gift & Alfredo Deza, Practical MLOps (O’Reilly 2021) — for CI/CD patterns, model registries, deployment strategies (canary, blue-green).
- Emmanuel Ameisen, Building Machine Learning Powered Applications (O’Reilly 2020) — for the iterate-deploy-monitor loop.