Retrieval-first compliance agent — clause-level RAG, MMR diversity, and a hard confidence gate that declines rather than hallucinates
A production-grade RAG system for RBI circular retrieval — built around the insight that in a compliance context, a clean decline is categorically better than a confident hallucination. The system never calls the LLM when retrieval confidence is too low.
The failure mode in a compliance context isn't ambiguity — it's confident incorrectness.
Compliance analysts manually cross-reference dozens of RBI circulars to answer a single query — a process that's slow, expensive, and error-prone. Naively deploying an LLM accelerates the search but introduces a worse risk: the model produces plausible-sounding answers that aren't grounded in actual circular text, making hallucinated compliance guidance indistinguishable from real guidance.
The system needs to answer accurately or decline explicitly, with no middle ground where it guesses. A false-confident answer in a compliance setting is a legal liability, not just an inconvenience.
Retrieval-first. Gate before LLM. Citations from metadata, not from the model's output.
- RBI circulars are scraped from rbi.org.in and chunked at clause boundaries (500 tokens, 50-token overlap to preserve cross-clause context): 139 circulars across the 2024 calendar year yield ~31,000 chunks, 0 quarantined. Each chunk is embedded with text-embedding-3-small (1536-dim; retrieval quality close to 3-large at several times lower cost) and indexed into Chroma for local development or Pinecone for production.
- On query, MMR retrieval (k=6, λ=0.3) fetches diverse chunks — maximal marginal relevance penalises redundant results, ensuring the retrieved context spans multiple circulars when the answer requires it.
- A confidence gate at 0.65 cosine similarity evaluates retrieval quality before the LLM is called. Below threshold: the system returns a decline message, logs the gate-fire event to PostgreSQL, and never calls the LLM. Above threshold: GPT-4o-mini synthesises a grounded response with citations derived from chunk metadata — not from the model's own output.
- APScheduler re-indexes nightly at 02:00 IST, embedding only documents added or changed since the last run to avoid redundant embedding costs. Every request and retrieval is logged to an append-only PostgreSQL audit table with chunk IDs, similarity scores, and final response.
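The gate-before-LLM flow described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `Chunk` and `Answer` types, the `answer_query` function, and the injected `llm` callable are hypothetical names, and gating on the top similarity score (rather than, say, the mean) is an assumption. Audit logging to PostgreSQL is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

GATE_THRESHOLD = 0.65  # cosine similarity threshold quoted in the text


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    score: float         # cosine similarity between the query and this chunk
    circular_id: str     # metadata attached at index time
    title: str
    effective_date: str


@dataclass(frozen=True)
class Answer:
    declined: bool
    text: str
    citations: tuple  # tuple of dicts built from chunk metadata


def answer_query(
    query: str,
    chunks: Sequence[Chunk],
    llm: Callable[[str], str],  # hypothetical wrapper around the chat model
    threshold: float = GATE_THRESHOLD,
) -> Answer:
    """Check retrieval confidence first; call the LLM only if the gate passes."""
    # Assumption: the gate looks at the best-scoring retrieved chunk.
    top = max((c.score for c in chunks), default=0.0)
    if top < threshold:
        # Gate fired: return an explicit decline and never touch the LLM.
        message = (
            f"No sufficiently relevant circular text was found "
            f"(top similarity {top:.2f}, threshold {threshold:.2f}). "
            f"Try rephrasing the query."
        )
        return Answer(declined=True, text=message, citations=())

    context = "\n\n".join(c.text for c in chunks)
    prompt = f"Answer using only this context.\n\n{context}\n\nQuestion: {query}"

    # Citations come from chunk metadata, never from the model's output.
    seen = set()
    citations = []
    for c in chunks:
        if c.circular_id not in seen:
            seen.add(c.circular_id)
            citations.append(
                {
                    "circular_id": c.circular_id,
                    "title": c.title,
                    "effective_date": c.effective_date,
                }
            )
    return Answer(declined=False, text=llm(prompt), citations=tuple(citations))
```

Because the `llm` callable is injected, a test can pass a stub and assert that it was never invoked on the decline path, which is the property the design depends on.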
Why these choices — each driven by the compliance failure mode, not engineering preference.
- Gate-before-LLM: confidence threshold fires before model invocation. Prevents wasted API calls and — more critically — stops low-confidence retrievals from reaching the LLM where they become hallucinations. The threshold was recalibrated down from an initial 0.72 to 0.65 against measured OpenAI cosine scores: on-topic RBI queries land 0.65–0.73; out-of-scope queries cluster near 0.45–0.52. The gap is wide enough that a hard threshold works cleanly without a fuzzy band.
- No re-ranking layer — MMR handles diversity at retrieval time. A cross-encoder re-ranker would need fine-tuning on Indian banking regulation to add value; a generic ms-marco model would re-sort by surface similarity and likely hurt precision on domain-specific queries. MMR (λ=0.3) already biases toward diverse, high-relevance context, so the added latency and complexity of a re-ranker buys nothing here.
- Clause-level chunking over document-level embedding. Regulatory documents mix provisions, definitions, and exceptions within a single section. Embedding at the document level loses that granularity — a query about a specific provision retrieves the whole circular, diluting the result with unrelated adjacent text. Clause-level chunks keep semantic coherence tight and retrieval precision high.
- PostgreSQL for session history and citation audit log. Compliance teams need a reproducible audit trail of which clauses informed which answer — not just the LLM's output. Every request, including gate-fired declines, is written to an append-only table with retrieved chunk IDs, similarity scores, and final response. The audit log is the ground truth; the LLM output is derived from it.
- Citations injected from chunk metadata, not generated by the model. If the LLM generates its own citations it can hallucinate circular numbers that don't exist. All citations (circular ID, title, effective date) are sourced from the metadata tags attached to the retrieved chunks at index time — the model has no ability to invent them.
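To make the MMR trade-off concrete, here is a small pure-Python sketch of greedy MMR selection. It is an illustration under stated assumptions, not the retrieval code used in the project: `mmr_select` and `diversity_weight` are hypothetical names, and the weight is defined as the penalty on redundancy, so a lower value favours relevance. LangChain's `lambda_mult` uses the opposite convention (1 means minimum diversity), so under that library `lambda_mult` would correspond to `1 - diversity_weight`.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Vector,
    candidates: Sequence[Vector],
    k: int = 6,
    diversity_weight: float = 0.3,
) -> List[int]:
    """Greedy MMR: return indices of up to k candidates.

    Each step picks the candidate maximising
        (1 - w) * sim(query, c) - w * max(sim(c, s) for s already selected)
    where w is diversity_weight (an assumed convention, see lead-in).
    """
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        best_index = remaining[0]
        best_score = -math.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-chosen chunk.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = (1 - diversity_weight) * relevance[i] - diversity_weight * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected
```

With a weight of 0 this reduces to plain top-k by similarity; as the weight rises, near-duplicate chunks are pushed out in favour of chunks that cover different text.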
~31k chunks indexed · Pinecone production · sub-$1 total build cost.
Latency splits cleanly by path: gate-fired queries (no LLM) return in ~4.9s mean; gate-pass queries (LLM hit) run ~7.9s mean. On the gate-pass path the time goes to two sequential OpenAI round-trips (embed the query, then the GPT-4o-mini call); the gate-fired path makes only the embedding call. Retrieval itself is sub-100ms against Pinecone, so it is not the bottleneck. Corpus indexing was a ~$0.20 one-time cost; the entire build, including ~550 dev and demo queries, stayed under a dollar.
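The indexing figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a price of $0.02 per million tokens for text-embedding-3-small; that price is not stated in the text and may have changed, so treat the result as a plausibility check rather than a measurement.

```python
CHUNKS = 31_000                      # approximate chunk count from the text
INDEXING_COST_USD = 0.20             # approximate one-time indexing cost from the text
MAX_CHUNK_TOKENS = 500               # chunk size limit from the text
PRICE_PER_MILLION_TOKENS_USD = 0.02  # ASSUMED price for text-embedding-3-small

# Tokens implied by the reported cost at the assumed price.
tokens_embedded = INDEXING_COST_USD / PRICE_PER_MILLION_TOKENS_USD * 1_000_000
avg_tokens_per_chunk = tokens_embedded / CHUNKS

# Cost if every chunk were filled to the 500-token limit.
upper_bound_cost = CHUNKS * MAX_CHUNK_TOKENS / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD

print(f"implied average chunk size: {avg_tokens_per_chunk:.0f} tokens")
print(f"upper-bound indexing cost: ${upper_bound_cost:.2f}")
```

Under that assumption the reported cost implies an average chunk of a little over 320 tokens, comfortably below the 500-token limit, which is what clause-boundary splitting would be expected to produce.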
What broke during calibration — and what I changed because of it.
- Chroma's default L2 distance quietly broke the confidence gate. The gate scores looked wrong across the board — even clean on-topic queries were landing below 0.65 — until I traced it to Chroma defaulting to L2 (Euclidean) distance, which compressed the score range so everything read as low-confidence. Forcing hnsw:space=cosine at collection creation restored the true cosine scores the threshold was calibrated against. This is what pushed the recalibration from 0.72 down to a stable 0.65.
- The default MMR λ was the wrong setting for a gated pipeline. An early version used the default MMR λ (0.5), which traded too much relevance for diversity: the retrieved chunks were diverse, but the highest-relevance chunk was sometimes excluded, causing the gate to fire on queries that should have passed. Lowering λ to 0.3 biased retrieval back toward relevance, giving the gate a cleaner signal. (This treats λ as the weight on the diversity penalty; some libraries, including LangChain's lambda_mult, define it the other way round, so the direction of the adjustment depends on the implementation.)
- The 500-token chunk size created invisible mid-clause splits. Some RBI circular clauses span more than 500 tokens (a condition, its exception, and a cross-reference). Splitting purely on token count cut these mid-clause, leaving each half missing the other's context. The fix: RecursiveCharacterTextSplitter with clause-boundary separators (section headers, numbered lists) tried first, falling back to token-count splitting only when a piece is still too long.
- Nightly delta re-indexing has a race condition when a circular is updated the same day it's indexed. Early versions used modification date as the freshness signal. If a circular was indexed during the day and then updated before the nightly run, the stale embedded version persisted. Fix: use a content hash as the freshness signal — the nightly run re-embeds any document whose hash doesn't match the stored embedding's metadata hash.
- Gate-fired declines need a human-readable explanation, not just a refusal. Early versions returned a plain "I cannot answer this query" on gate failure. Analysts interpreted this as a system error, not as a confidence gate firing. Fix: the decline message includes the top similarity score and suggests rephrasing — analysts understand the system is deferring on confidence, not broken.
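The content-hash freshness check from the re-indexing fix can be sketched with the standard library alone. This is a minimal illustration under assumptions: `content_hash` and `docs_to_reembed` are hypothetical names, the stored hashes are modelled as a plain dict rather than vector-store metadata, and collapsing whitespace before hashing is an added choice so that purely cosmetic changes in a scraped page do not trigger a re-embed.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """SHA-256 of the document text after collapsing runs of whitespace."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def docs_to_reembed(
    scraped: Dict[str, str],        # doc_id -> freshly scraped text
    stored_hashes: Dict[str, str],  # doc_id -> hash saved when last embedded
) -> List[str]:
    """Return IDs of documents that are new or whose content has changed.

    The modification date plays no part, so a circular that is indexed and
    then updated on the same day is still picked up by the nightly run.
    """
    return sorted(
        doc_id
        for doc_id, text in scraped.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )
```

After a document is re-embedded, its new hash would be written back alongside the embedding so the next run compares against the current content.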