Text-to-SQL
“Show me revenue by region for Q4” → SELECT region, SUM(revenue) FROM sales WHERE .... A perennial dream; surprisingly hard at production quality. Modern frontier models can do it well with the right scaffolding — without it, they hallucinate column names and produce dangerous queries.
The naive approach (and why it fails)
prompt = f"""
You are a SQL expert. Convert this question to SQL.
Question: {question}
SQL:
"""
sql = llm(prompt)
db.execute(sql) # 🔥
What goes wrong:
- Hallucinated columns: SELECT user_email FROM users when the column is email_address.
- Wrong joins: the model invents foreign keys.
- Missing filters: forgets to scope to current user, current tenant.
- Unrestricted writes: with a read-write connection, naive code can drop tables.
- Dialect mismatches: PostgreSQL vs MySQL vs Snowflake vs BigQuery.
- Performance: full-table scans on huge tables.
Production text-to-SQL is a system, not a single prompt.
The schema in context
The model needs to know what tables and columns exist.
Static schema dump
For small DBs, paste the schema into the prompt:
Tables:
users(id, email_address, created_at, plan_type)
orders(id, user_id, total_cents, status, created_at)
...
Works up to ~50 tables / 500 columns. Beyond that, the prompt bloats.
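A minimal sketch of the static dump, with a hypothetical two-table schema:

```python
# Render a schema dict into a compact prompt block.
# Table and column names here are hypothetical examples.
def render_schema(tables: dict[str, list[str]]) -> str:
    lines = ["Tables:"]
    for name, columns in tables.items():
        lines.append(f"  {name}({', '.join(columns)})")
    return "\n".join(lines)

schema = {
    "users": ["id", "email_address", "created_at", "plan_type"],
    "orders": ["id", "user_id", "total_cents", "status", "created_at"],
}
print(render_schema(schema))
```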
Schema RAG
For larger DBs:
- Embed each table description (name + columns + comments + sample rows + example queries).
- At query time, embed the user question; retrieve top-k tables.
- Include those in prompt.
Now you can scale to 1000s of tables. Critical for warehouses.
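A sketch of the retrieval step. A real system embeds the table descriptions with an embedding model; here a crude token-overlap score stands in so the example is self-contained:

```python
import re

def tokens(text: str) -> set[str]:
    # Crude tokenizer; an embedding model replaces all of this in production.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def retrieve_tables(question: str, table_descriptions: dict[str, str], k: int = 3):
    """Rank tables by Jaccard overlap between question and description."""
    q = tokens(question)
    def score(desc: str) -> float:
        d = tokens(desc)
        return len(q & d) / (len(q | d) or 1)
    ranked = sorted(table_descriptions.items(),
                    key=lambda item: score(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical descriptions: name + columns + a comment per table.
tables = {
    "users": "users: id, email_address, created_at, plan_type. One row per signup.",
    "orders": "orders: id, user_id, total_cents, status. One row per order.",
    "pages": "pages: id, url, title. CMS content.",
}
retrieve_tables("How many users signed up last month?", tables, k=2)
```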
Schema enrichment
The model performs better when the schema includes:
- Comments on tables and columns (-- Reseller of products bought by users).
- Example values (top-5 distinct values for low-cardinality columns).
- Foreign-key relationships explicitly listed.
- Common query patterns as examples.
Invest here. Better schema docs ≈ much better text-to-SQL.
Few-shot examples
Show the model 3–10 (question, SQL) pairs from your domain. Often the largest single quality lift.
Q: "How many users signed up last month?"
A: SELECT COUNT(*) FROM users WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
AND created_at < DATE_TRUNC('month', CURRENT_DATE);
Q: "What's our top-selling product?"
A: SELECT p.name, COUNT(*) as sold FROM orders o JOIN products p ON p.id = o.product_id
GROUP BY p.name ORDER BY sold DESC LIMIT 1;
Q: {user_question}
A:
Dynamic few-shot: retrieve the most similar past questions (with their SQL) for each new query. Even better.
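A sketch of dynamic few-shot selection. Production systems rank by embedding similarity; this stand-in ranks past pairs by token overlap so it runs anywhere:

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def pick_examples(question: str, past_pairs: list[tuple[str, str]], k: int = 3):
    """Return the k past (question, sql) pairs most similar to the new question."""
    q = tokens(question)
    ranked = sorted(past_pairs,
                    key=lambda pair: len(q & tokens(pair[0])),
                    reverse=True)
    return ranked[:k]

past = [
    ("How many users signed up last month?",
     "SELECT COUNT(*) FROM users WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month' AND created_at < DATE_TRUNC('month', CURRENT_DATE)"),
    ("What's our top-selling product?",
     "SELECT p.name, COUNT(*) AS sold FROM orders o JOIN products p ON p.id = o.product_id GROUP BY p.name ORDER BY sold DESC LIMIT 1"),
]
pick_examples("How many users signed up this week?", past, k=1)
```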
Validation and execution
Never execute model-generated SQL blindly.
Syntax validation
Parse the SQL with a parser before executing:
- sqlglot: parses many dialects; can also rewrite/transpile.
- Database-specific EXPLAIN: catches some errors.
Schema validation
Check: every referenced table and column exists. Reject if not.
Read-only enforcement
Use a read-only DB user. The model can’t write; even successful injection can’t delete data.
CREATE USER sql_agent WITH PASSWORD '...';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sql_agent;
Row-level security / scoping
For multi-tenant: enforce tenant filter automatically.
def safe_execute(sql, tenant_id):
    sql = enforce_tenant_filter(sql, tenant_id)
    return db.execute(sql)
Or use database row-level security (Postgres RLS) so the user account can only see their data.
Result limits
Always cap result rows:
sql = inject_limit(sql, max_rows=1000)
Prevents a query from returning a million rows and blowing up the LLM context.
Self-correction
When SQL fails, give the error to the model:
for attempt in range(3):
    sql = generate_sql(question, history)
    try:
        result = db.execute(sql)
        return result
    except DatabaseError as e:
        history.append(f"Previous attempt failed with: {e}")
Surprisingly effective. Given the error, the model can fix typos, missing joins, and wrong functions.
Evaluation
For text-to-SQL, evals matter:
- Execution accuracy: does the generated SQL produce the right result?
- Semantic equivalence: same result even if SQL syntax differs.
- Pass@k: of k generations, did at least one produce a correct result?
Public benchmarks: Spider, BIRD, WikiSQL — useful for comparing models, less useful for your specific schema.
Build a domain eval set: 100 (question, expected_result) pairs from real users.
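A minimal harness for execution accuracy, using an in-memory SQLite database. The generate_sql stub is a stand-in for your real pipeline:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER, plan_type TEXT);
    INSERT INTO users VALUES (1, 'free'), (2, 'pro'), (3, 'pro');
""")

# Each entry: (question, expected result rows).
eval_set = [
    ("How many pro users do we have?", [(2,)]),
]

def generate_sql(question):  # stand-in for the real text-to-SQL pipeline
    return "SELECT COUNT(*) FROM users WHERE plan_type = 'pro'"

def execution_accuracy(eval_set):
    correct = 0
    for question, expected in eval_set:
        try:
            got = db.execute(generate_sql(question)).fetchall()
        except sqlite3.Error:
            continue  # failed queries count as wrong
        # Order-insensitive comparison: correct queries often differ in row order.
        if sorted(got) == sorted(expected):
            correct += 1
    return correct / len(eval_set)

execution_accuracy(eval_set)  # 1.0
```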
Patterns for production
Translation pipeline
User question
↓
Classification: data question? metadata? unrelated?
↓
Schema retrieval (top-k tables)
↓
Generation with schema + few-shot
↓
SQL validation (syntax, schema, safety)
↓
Execution
↓
Result formatting (tables, charts, summaries)
↓
Show user; offer "show SQL" link for power users
Conversation memory
Multi-turn data conversations: “Now break that down by month.”
The model needs prior turns + prior SQL to generate the next query. Maintain a conversation state with table/SQL history.
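A sketch of that state: a list of (question, SQL) turns, rendered into a context block for the next prompt. Names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Carry prior questions and SQL so follow-ups like
    'now break that down by month' have something to refer to."""
    turns: list = field(default_factory=list)  # (question, sql) pairs

    def add(self, question: str, sql: str):
        self.turns.append((question, sql))

    def context_block(self, last_n: int = 3) -> str:
        lines = []
        for q, sql in self.turns[-last_n:]:
            lines.append(f"Earlier question: {q}\nSQL used: {sql}")
        return "\n\n".join(lines)

state = ConversationState()
state.add("Revenue by region for Q4?",
          "SELECT region, SUM(revenue) FROM sales WHERE quarter = 'Q4' GROUP BY region")
# state.context_block() is prepended to the prompt for the follow-up turn
```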
Charting and visualization
After getting results, ask the LLM to choose a viz:
Given this data and original question, what's the best chart?
- bar / line / pie / table
Render with Vega-Lite, Plotly, etc.
Frontier vs specialized
By 2026, frontier models (Claude, GPT-4.x, Gemini) are very strong at SQL out of the box. Specialized models (Defog SQLCoder, etc.) are still catching up and rarely surpass them.
For most teams: frontier model + good schema RAG + few-shot + validation beats a specialized model with worse plumbing.
Real products
Roughly the architecture used by:
- Snowflake Cortex Analyst, Databricks Genie: warehouse-native NL→SQL.
- DataChat, Hex Magic, Mode AI Assistant: BI tool integrations.
- Defog, Vanna AI: open-source frameworks.
Each adds: charting, conversation memory, query optimization, governance.
Pitfalls
- No schema validation: model hallucinates columns; queries fail in prod.
- No read-only user: a jailbreak gets DROP TABLE.
- Forgetting timezone / locale: “yesterday” is ambiguous globally.
- No result limit: query returns 100M rows, app dies.
- Trusting the LLM with PII: the model sees every row in the result set; enforce access controls upstream.
- Forgetting non-SQL data: some questions are about metadata or things that aren’t in the DB. Have a fallback.