Text-to-SQL
“Show me revenue by region for Q4” → SELECT region, SUM(revenue) FROM sales WHERE .... A perennial dream; surprisingly hard at production quality. Modern frontier models can do it well with the right scaffolding — without it, they hallucinate column names and produce dangerous queries.
The naive approach (and why it fails)
prompt = f"""
You are a SQL expert. Convert this question to SQL.
Question: {question}
SQL:
"""
sql = llm(prompt)
db.execute(sql) # 🔥
What goes wrong:
- Hallucinated columns: SELECT user_email FROM users when the column is email_address.
- Wrong joins: the model invents foreign keys.
- Missing filters: forgets to scope to current user, current tenant.
- Unrestricted writes: with a read-write connection, naive code can drop tables.
- Dialect mismatches: PostgreSQL vs MySQL vs Snowflake vs BigQuery.
- Performance: full-table scans on huge tables.
Production text-to-SQL is a system, not a single prompt.
The schema in context
The model needs to know what tables and columns exist.
Static schema dump
For small DBs, paste the schema into the prompt:
Tables:
users(id, email_address, created_at, plan_type)
orders(id, user_id, total_cents, status, created_at)
...
Works up to ~50 tables / 500 columns. Beyond that, the prompt bloats.
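A minimal sketch of the static dump, with a hypothetical two-table schema:

```python
# Render a schema dict into a compact prompt block.
# Table and column names here are hypothetical examples.
def render_schema(tables: dict[str, list[str]]) -> str:
    lines = ["Tables:"]
    for name, columns in tables.items():
        lines.append(f"  {name}({', '.join(columns)})")
    return "\n".join(lines)

schema = {
    "users": ["id", "email_address", "created_at", "plan_type"],
    "orders": ["id", "user_id", "total_cents", "status", "created_at"],
}
print(render_schema(schema))
```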
Schema RAG
For larger DBs:
- Embed each table description (name + columns + comments + sample rows + example queries).
- At query time, embed the user question; retrieve top-k tables.
- Include those in prompt.
Now you can scale to 1000s of tables. Critical for warehouses.
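A sketch of the retrieval step. A real system embeds the table descriptions with an embedding model; here a crude token-overlap score stands in so the example is self-contained:

```python
import re

def tokens(text: str) -> set[str]:
    # Crude tokenizer; an embedding model replaces all of this in production.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def retrieve_tables(question: str, table_descriptions: dict[str, str], k: int = 3):
    """Rank tables by Jaccard overlap between question and description."""
    q = tokens(question)
    def score(desc: str) -> float:
        d = tokens(desc)
        return len(q & d) / (len(q | d) or 1)
    ranked = sorted(table_descriptions.items(),
                    key=lambda item: score(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical descriptions: name + columns + a comment per table.
tables = {
    "users": "users: id, email_address, created_at, plan_type. One row per signup.",
    "orders": "orders: id, user_id, total_cents, status. One row per order.",
    "pages": "pages: id, url, title. CMS content.",
}
retrieve_tables("How many users signed up last month?", tables, k=2)
```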
Schema enrichment
The model performs better when the schema includes:
- Comments on tables and columns (-- Reseller of products bought by users).
- Example values (top-5 distinct values for low-cardinality columns).
- Foreign-key relationships explicitly listed.
- Common query patterns as examples.
Invest here. Better schema docs ≈ much better text-to-SQL.
Few-shot examples
Show the model 3–10 (question, SQL) pairs from your domain. Often the largest single quality lift.
Q: "How many users signed up last month?"
A: SELECT COUNT(*) FROM users WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
AND created_at < DATE_TRUNC('month', CURRENT_DATE);
Q: "What's our top-selling product?"
A: SELECT p.name, COUNT(*) as sold FROM orders o JOIN products p ON p.id = o.product_id
GROUP BY p.name ORDER BY sold DESC LIMIT 1;
Q: {user_question}
A:
Dynamic few-shot: retrieve the most similar past questions (with their SQL) for each new query. Even better.
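A sketch of dynamic few-shot selection. Production systems rank by embedding similarity; this stand-in ranks past pairs by token overlap so it runs anywhere:

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def pick_examples(question: str, past_pairs: list[tuple[str, str]], k: int = 3):
    """Return the k past (question, sql) pairs most similar to the new question."""
    q = tokens(question)
    ranked = sorted(past_pairs,
                    key=lambda pair: len(q & tokens(pair[0])),
                    reverse=True)
    return ranked[:k]

past = [
    ("How many users signed up last month?",
     "SELECT COUNT(*) FROM users WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month' AND created_at < DATE_TRUNC('month', CURRENT_DATE)"),
    ("What's our top-selling product?",
     "SELECT p.name, COUNT(*) AS sold FROM orders o JOIN products p ON p.id = o.product_id GROUP BY p.name ORDER BY sold DESC LIMIT 1"),
]
pick_examples("How many users signed up this week?", past, k=1)
```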
Validation and execution
Never execute model-generated SQL blindly.
Syntax validation
Parse the SQL with a parser before executing:
- sqlglot: parses many dialects; can also rewrite/transpile.
- Database-specific EXPLAIN: catches some errors.
Schema validation
Check: every referenced table and column exists. Reject if not.
Read-only enforcement
Use a read-only DB user. The model can’t write; even successful injection can’t delete data.
CREATE USER sql_agent WITH PASSWORD '...';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sql_agent;
Row-level security / scoping
For multi-tenant: enforce tenant filter automatically.
def safe_execute(sql, tenant_id):
    sql = enforce_tenant_filter(sql, tenant_id)
    return db.execute(sql)
Or use database row-level security (Postgres RLS) so the user account can only see their data.
Result limits
Always cap result rows:
sql = inject_limit(sql, max_rows=1000)
Prevents a query from returning a million rows and blowing up the LLM context.
Self-correction
When SQL fails, give the error to the model:
for attempt in range(3):
    sql = generate_sql(question, history)
    try:
        result = db.execute(sql)
        return result
    except DatabaseError as e:
        history.append(f"Previous attempt failed with: {e}")
Surprisingly effective. Given the error, the model can fix typos, missing joins, and wrong functions.
Evaluation
For text-to-SQL, evals matter:
- Execution accuracy: does the generated SQL produce the right result?
- Semantic equivalence: same result even if SQL syntax differs.
- Pass@k: of k generations, did at least one produce a correct result?
Public benchmarks: Spider, BIRD, WikiSQL — useful for comparing models, less useful for your specific schema.
Build a domain eval set: 100 (question, expected_result) pairs from real users.
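A minimal harness for execution accuracy, using an in-memory SQLite database. The generate_sql stub is a stand-in for your real pipeline:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER, plan_type TEXT);
    INSERT INTO users VALUES (1, 'free'), (2, 'pro'), (3, 'pro');
""")

# Each entry: (question, expected result rows).
eval_set = [
    ("How many pro users do we have?", [(2,)]),
]

def generate_sql(question):  # stand-in for the real text-to-SQL pipeline
    return "SELECT COUNT(*) FROM users WHERE plan_type = 'pro'"

def execution_accuracy(eval_set):
    correct = 0
    for question, expected in eval_set:
        try:
            got = db.execute(generate_sql(question)).fetchall()
        except sqlite3.Error:
            continue  # failed queries count as wrong
        # Order-insensitive comparison: correct queries often differ in row order.
        if sorted(got) == sorted(expected):
            correct += 1
    return correct / len(eval_set)

execution_accuracy(eval_set)  # 1.0
```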
Patterns for production
Translation pipeline
User question
↓
Classification: data question? metadata? unrelated?
↓
Schema retrieval (top-k tables)
↓
Generation with schema + few-shot
↓
SQL validation (syntax, schema, safety)
↓
Execution
↓
Result formatting (tables, charts, summaries)
↓
Show user; offer "show SQL" link for power users
Conversation memory
Multi-turn data conversations: “Now break that down by month.”
The model needs prior turns + prior SQL to generate the next query. Maintain a conversation state with table/SQL history.
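A sketch of that state: a list of (question, SQL) turns, rendered into a context block for the next prompt. Names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Carry prior questions and SQL so follow-ups like
    'now break that down by month' have something to refer to."""
    turns: list = field(default_factory=list)  # (question, sql) pairs

    def add(self, question: str, sql: str):
        self.turns.append((question, sql))

    def context_block(self, last_n: int = 3) -> str:
        lines = []
        for q, sql in self.turns[-last_n:]:
            lines.append(f"Earlier question: {q}\nSQL used: {sql}")
        return "\n\n".join(lines)

state = ConversationState()
state.add("Revenue by region for Q4?",
          "SELECT region, SUM(revenue) FROM sales WHERE quarter = 'Q4' GROUP BY region")
# state.context_block() is prepended to the prompt for the follow-up turn
```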
Charting and visualization
After getting results, ask the LLM to choose a viz:
Given this data and original question, what's the best chart?
- bar / line / pie / table
Render with Vega-Lite, Plotly, etc.
Frontier vs specialized
By 2026, frontier models (Claude, GPT-4.x, Gemini) are very strong at SQL out of the box. Specialized models (Defog SQLCoder, etc.) are still catching up and rarely surpass them.
For most teams: frontier model + good schema RAG + few-shot + validation beats a specialized model with worse plumbing.
Real products
Roughly the architecture used by:
- Snowflake Cortex Analyst, Databricks Genie: warehouse-native NL→SQL.
- DataChat, Hex Magic, Mode AI Assistant: BI tool integrations.
- Defog, Vanna AI: open-source frameworks.
Each adds: charting, conversation memory, query optimization, governance.
Pitfalls
- No schema validation: model hallucinates columns; queries fail in prod.
- No read-only user: a jailbreak gets DROP TABLE.
- Forgetting timezone / locale: “yesterday” is ambiguous globally.
- No result limit: query returns 100M rows, app dies.
- Trusting the LLM with PII: the model sees every row in the result set; enforce access controls upstream.
- Forgetting non-SQL data: some questions are about metadata or things that aren’t in the DB. Have a fallback.