Financial Reasoning

LLMs that read financial documents, run analysis, and support trading and accounting workflows: a high-stakes domain where hallucinations can cost real money. The patterns here transfer to any domain that combines structured data with complex reasoning.

Use cases

  • Document analysis: extract from 10-Ks, contracts, earnings transcripts.
  • Financial Q&A: “What was Apple’s revenue growth Q3 2025?” with citations.
  • Multi-document synthesis: compare three earnings reports side-by-side.
  • Analyst tooling: support for buy/sell research, due diligence.
  • Compliance: scan transactions for AML / fraud signals; produce SAR drafts.
  • Accounting automation: process invoices, reconcile accounts, draft journal entries.
  • Personal finance: budgeting, planning, tax prep.

Why financial is hard

  • Numerical precision: a 3% vs 30% difference matters; hallucinations are catastrophic.
  • Time-sensitive: Q3 2024 vs Q3 2025 must not be confused.
  • Regulatory: outputs may need to be audit-ready, with clear sourcing.
  • Heterogeneous sources: structured (databases) + semi-structured (spreadsheets, tables in PDFs) + unstructured (transcripts).
  • Multi-step reasoning: many calculations build on each other.
  • High-trust contexts: users won’t tolerate confident wrong answers in finance.

Architectural patterns

Reasoning + tools, not generation alone

Don’t have the model do math in its head. Use tools:

  • Calculator / Python code execution for arithmetic.
  • SQL/queries for structured data.
  • Document retrieval for source-grounded facts.

A reasoning model with tools (Stage 07) is the right model class for serious finance work.
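
A minimal sketch of the routing idea, assuming a plain subprocess as the executor (a real deployment needs a proper sandbox). The $97.3B figure is the example used later in this section; the 89.5 prior-year number is illustrative:

import subprocess
import sys

def run_python(code: str, timeout: float = 5.0) -> str:
    """Run model-written code in a subprocess and return stdout.
    Note: a bare subprocess is not a security sandbox; production
    systems need an isolated runtime."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout.strip()

# The model emits the formula; the executor produces the number.
growth_code = "print(round((97.3 - 89.5) / 89.5 * 100, 2))"
print(run_python(growth_code))  # 8.72, computed rather than generated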

Multi-stage retrieval

For “Compare revenue growth across these three companies”:

  1. Retrieve relevant filings for each company.
  2. Extract revenue numbers with explicit citations.
  3. Compute growth rates with code execution.
  4. Synthesize comparison with citations preserved.

Each stage is a separate sub-task; results aggregated at the end.
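
A sketch of the staged flow. retrieve_filings and extract_revenue are hypothetical stubs standing in for the retrieval and extraction sub-tasks, with canned illustrative numbers:

def retrieve_filings(ticker: str) -> list[dict]:
    # Hypothetical stub: in practice, search a filings index.
    return [{"doc_id": f"{ticker}_10Q_2025_Q3"},
            {"doc_id": f"{ticker}_10Q_2024_Q3"}]

def extract_revenue(doc: dict) -> tuple[float, dict]:
    # Hypothetical stub: in practice, a verified-extraction sub-task.
    canned = {"AAPL_10Q_2025_Q3": 97.3, "AAPL_10Q_2024_Q3": 89.5}  # $B, illustrative
    citation = {"doc_id": doc["doc_id"], "quote": "Net sales of ..."}
    return canned[doc["doc_id"]], citation

def compare_growth(tickers: list[str]) -> dict:
    out = {}
    for t in tickers:
        (cur, c1), (prev, c2) = [extract_revenue(d) for d in retrieve_filings(t)]
        growth = (cur - prev) / prev * 100          # stage 3: computed in code
        out[t] = {"growth_pct": round(growth, 2),
                  "citations": [c1, c2]}            # stage 4: citations preserved
    return out

print(compare_growth(["AAPL"]))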

Verified extraction

For pulling numbers from documents:

  • Have the model output (claim, source_quote, doc_id, page).
  • Verify the source_quote actually appears in the doc.
  • Double-extract: ask the model to extract the same fact twice, compare.
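
A sketch of the double-extraction check, where extract_fact is a hypothetical wrapper around the model call (ideally run with different prompts or sampling settings):

def double_extract(doc_text: str, field: str, extract_fact) -> float | None:
    # Two independent passes; accept only on exact agreement.
    first = extract_fact(doc_text, field)
    second = extract_fact(doc_text, field)
    if first is not None and first == second:
        return first
    return None  # disagreement: escalate, re-extract, or flag for review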

Strong citations

Every assertion linked to a source quote, not just a doc reference:

{
  "claim": "Apple Q3 2025 revenue was $97.3B",
  "source": {
    "doc_id": "AAPL_10Q_2025_Q3",
    "page": 5,
    "quote": "Net sales of $97.3 billion in the three months ended..."
  }
}

A linked quote that doesn’t appear in the doc → reject.
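
A sketch of that check, normalizing only whitespace and case so formatting differences don't cause false rejections:

import re

def norm(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

def quote_in_doc(quote: str, doc_text: str) -> bool:
    return norm(quote) in norm(doc_text)

doc_text = "... Net sales of $97.3 billion in the three months ended ..."
assert quote_in_doc("Net sales of  $97.3 billion", doc_text)     # whitespace-tolerant
assert not quote_in_doc("Net sales of $79.3 billion", doc_text)  # reject: not in doc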

Specific application areas

10-K / 10-Q analysis

SEC filings are long, highly structured documents, often running to hundreds of pages. Patterns:

  • Layout-aware parsers (Unstructured, LlamaParse) to extract tables.
  • Section-aware chunking (MD&A, Financials, Risk Factors); see the sketch after this list.
  • Schema-extraction prompts with verification checks.
  • RAG over historical filings for trend analysis.
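
A sketch of section-aware chunking on plain text, keying off the standard "Item N." headings. Real filings need a layout-aware parse first, and the regex here is deliberately simple:

import re

ITEM_HEADING = re.compile(r"^(Item\s+\d+[A-Z]?\.)", re.IGNORECASE | re.MULTILINE)

def split_by_item(filing_text: str) -> dict[str, str]:
    # re.split with a capturing group keeps the headings in the output.
    parts = ITEM_HEADING.split(filing_text)
    return {parts[i].strip(): parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}

filing = """Item 1A. Risk Factors
...risks...
Item 7. Management's Discussion and Analysis
...MD&A..."""
print(list(split_by_item(filing)))  # ['Item 1A.', 'Item 7.']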

Earnings transcripts

Speech → text → reasoning. Handle:

  • Speaker diarization (CFO vs analyst vs CEO).
  • Q&A vs prepared remarks.
  • Cross-reference with reported numbers.
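
A sketch of that cross-referencing step: pull dollar figures mentioned in the transcript and flag any that don't match the filing's verified numbers (the regex, the $24.2B figure, and the tolerance are illustrative):

import math
import re

MONEY = re.compile(r"\$([\d.]+)\s*(billion|million)", re.IGNORECASE)

def mentioned_figures(text: str) -> list[float]:
    scale = {"billion": 1e9, "million": 1e6}
    return [float(v) * scale[u.lower()] for v, u in MONEY.findall(text)]

reported = [97.3e9]  # verified figures from the filing extraction
transcript = "Revenue came in at $97.3 billion, with services at $24.2 billion."
for fig in mentioned_figures(transcript):
    if not any(math.isclose(fig, r, rel_tol=1e-6) for r in reported):
        print(f"flag: ${fig:,.0f} in transcript not found in filing")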

Financial spreadsheets

Excel files with formulas, named ranges, cross-sheet references.

  • Parse with openpyxl / pandas (see the sketch after this list).
  • Or convert to a structured representation (CSV, markdown tables) and pass that to the LLM.
  • For complex models, code execution (the LLM writes Python, runs it, reads the result) often beats direct generation.
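
A sketch of a two-view load with openpyxl (file and sheet names are illustrative): computed values for the model to read, formula strings for auditability:

import openpyxl

# Two views of the same workbook: cached values vs. formula strings.
values_wb = openpyxl.load_workbook("model.xlsx", data_only=True)  # last-saved results
formulas_wb = openpyxl.load_workbook("model.xlsx")                # formula strings

sheet = values_wb["Forecast"]
rows = [[cell.value for cell in row] for row in sheet.iter_rows()]
# Render `rows` as CSV/markdown for the model; keep `formulas_wb` around
# to answer "where did this number come from".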

Regulatory / compliance

Higher bar:

  • Specific output formats required, e.g., SARs (suspicious activity reports).
  • Strict audit trail: who saw what data, when.
  • Conservative refusal for borderline cases.
  • Often requires fine-tuning on regulatory examples.
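
A sketch of a minimal audit record (fields are illustrative, not a regulatory schema); the point is that every access and output is attributable and timestamped:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    user: str
    action: str                # e.g. "viewed_doc", "generated_sar_draft"
    doc_ids: tuple[str, ...]   # exactly which sources were in context
    model: str                 # model name + version used
    output_sha256: str         # hash of the output, stored append-only
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))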

Trading / quant

LLMs rarely drive trades directly; the more common role is research support: summarizing news, scanning disclosures, generating hypotheses for analysts.

Latency-sensitive paths use small specialized models; analytical paths use frontier models with reasoning.

Multi-agent for finance

A common pattern:

Research agent: gathers filings, news, transcripts → corpus
Extractor agent: pulls structured data from each source
Reasoner agent: performs analysis with code execution
Reviewer agent: critiques output, flags issues
Writer agent: produces final report with citations

Each agent has a focused tool set and system prompt. Slow, expensive, but verifiable.
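
A sketch of the orchestration, with each agent as a hypothetical callable wrapping its own system prompt and tool set:

def run_report(question: str, agents: dict) -> str:
    # Sequential pipeline; each agent is a (system prompt + tools) callable.
    corpus = agents["research"](question)                 # filings, news, transcripts
    records = [agents["extract"](doc) for doc in corpus]  # structured data per source
    analysis = agents["reason"](question, records)        # analysis with code execution
    issues = agents["review"](analysis)                   # critique / flag problems
    if issues:                                            # one revision pass on flags
        analysis = agents["reason"](question, records, feedback=issues)
    return agents["write"](question, analysis)            # final report with citations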

Evaluation

Domain-specific harnesses:

  • FinBench, FinanceBench (Patronus): real-world financial questions with reference answers.
  • MMLU finance subsets.
  • Custom: 100+ (question, expected_answer) pairs from your use case.

For verification-heavy use cases:

  • Source-citation accuracy: did the model cite real, relevant passages?
  • Numerical accuracy: are the numbers exactly right? Exact correctness often matters more than approximate plausibility.
  • Calibrated refusal: does the model say “I don’t have data on Q4” when it doesn’t?
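
A sketch of a custom harness over (question, expected_answer) pairs; answer_fn is a hypothetical entry point into your system returning a value plus citations, and docs maps doc_id to text:

def evaluate(cases, answer_fn, docs) -> float:
    """Score exact numeric match AND citation validity per case."""
    passed = 0
    for question, expected in cases:
        value, citations = answer_fn(question)
        numeric_ok = value == expected  # exact, not approximate
        cites_ok = all(
            cite["quote"] in docs.get(cite["doc_id"], "")
            for cite in citations
        )
        passed += numeric_ok and cites_ok
    return passed / len(cases)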

Risks specific to finance

  • Misleading claims: confident wrong analysis can drive real-world losses.
  • Hallucinated numbers: indistinguishable from real ones to non-experts.
  • Data leakage: client portfolio info, M&A discussions — strict data boundaries.
  • Regulation: in some jurisdictions, AI-generated investment advice is regulated.
  • Audit trail: outputs may be subpoenaed; full traceability matters.

Tooling

  • LlamaIndex Financial: prepackaged finance-RAG patterns.
  • OpenBB, FinBERT, FinGPT: open finance ML tools.
  • Bloomberg Terminal (proprietary) increasingly has AI integration.
  • Patronus AI: finance-specialized eval and guardrails.
  • Hebbia, Glean, Harvey: enterprise finance/legal AI products.

Real products (early 2026)

  • Hebbia: agentic AI for financial analysts; multi-document research.
  • Harvey: legal + finance, used by law firms.
  • Rogo, Brightwave, Patronus: specialized analyst tooling.
  • Bridgewater AI initiative, Renaissance experiments: hedge-fund integrations (mostly research).
  • Ramp, Brex: AI for finance ops at SMBs.

Pitfalls

  • No verifier on numbers: model says revenue was $X; nobody checks.
  • Treating all sources equally: a Reddit post and a 10-K have very different credibility.
  • Missing temporal context: “as of when?” matters constantly.
  • Currency / unit confusion: $1.5M vs $1.5B vs €1.5M (see the normalization sketch after this list).
  • Quarterly vs annual confusion.
  • Forgetting macro context: a “20% revenue drop” needs context (was that an industry-wide event?).
  • No human reviewer: high-stakes outputs without expert verification.
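
A sketch of the normalization idea for the currency/unit pitfall above (the suffix table and regex are illustrative, not a full locale-aware parser):

import re

SCALE = {"K": 1e3, "M": 1e6, "MM": 1e6, "B": 1e9, "BN": 1e9,
         "MILLION": 1e6, "BILLION": 1e9}
MONEY = re.compile(r"([$€£])\s*([\d,.]+)\s*([A-Za-z]+)?")

def parse_money(s: str) -> tuple[str, float]:
    m = MONEY.search(s)
    if m is None:
        raise ValueError(f"unparseable amount: {s!r}")
    currency, digits, suffix = m.groups()
    value = float(digits.replace(",", ""))
    if suffix:
        value *= SCALE[suffix.upper()]  # unknown suffix raises, on purpose
    return currency, value

print(parse_money("$1.5M"))  # ('$', 1500000.0)
print(parse_money("$1.5B"))  # ('$', 1500000000.0)
print(parse_money("€1.5M"))  # ('€', 1500000.0), currency stays attached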

Practical advice

  1. Always use tools for math. Calculator, code, SQL — never raw model arithmetic.
  2. Always cite sources with verifiable quotes.
  3. Verify on a domain eval with known answers.
  4. Have humans review anything client-facing or decision-driving.
  5. Lean on reasoning models for multi-step analysis.
  6. Track the audit trail: every input, source, decision, output.

See also