10.13 RAG Pipeline

Retrieval-Augmented Generation: a fixed-order pipeline where an encoder embeds the query, a retriever fetches relevant chunks from a corpus, a reranker filters to the best context window, a generator produces the answer, and a citation validator confirms every claim maps to a retrieved source.


Motivating Scenario

A legal research firm handles 500 contract queries per day — jurisdiction-specific questions about clause language across a 50,000-contract corpus. A single capable model answering from memory produces a 23% hallucination rate on jurisdiction-specific questions: the model confabulates plausible-sounding clauses that do not exist in the actual contracts. The firm faces professional liability on every wrong answer.

After deploying a RAG pipeline, hallucination rate drops to 2.1% and query turnaround is 4x faster than associates. The reduction comes from grounding: every answer is constructed from retrieved clauses, and a citation validator confirms that each factual claim in the output traces to an exact chunk in the retrieved set. The remaining 2.1% failure rate clusters on retrieval gaps — cases where the relevant contract is not in the corpus — not on model hallucination within context.

Structure

Concrete example: legal contract query pipeline

Key Metrics

| Metric | Signal |
|---|---|
| Hallucination rate | Primary quality signal — percentage of claims in validated output not supported by retrieved chunks. Target: below 3% for high-stakes domains. |
| Retrieval recall@K | Does the Retriever return the relevant document in the top-K results? Measured on a held-out query set with known ground truth. Dropping below 0.80 requires corpus or encoder investigation. |
| Reranker precision@N | Of the N chunks passed to the Generator, what fraction are actually relevant? Low precision dilutes the context window and increases generator error rate. |
| Citation validator pass rate | Percentage of draft answers that pass citation validation without requiring modification. A falling pass rate signals generator or reranker degradation before end-to-end quality metrics surface the issue. |
| Corpus coverage lag | Average time between document ingestion and availability for retrieval. For active legal matters, a lag above 4 hours creates material retrieval-gap risk. |
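The first two metrics are mechanical to compute from a labeled query set. A minimal sketch (function names and data shapes are illustrative, not from the text):

```python
def recall_at_k(results, relevant, k=50):
    """Fraction of queries whose ground-truth document appears in the top-k results.

    results:  dict query_id -> ranked list of retrieved doc ids
    relevant: dict query_id -> the known relevant doc id
    """
    hits = sum(1 for q, docs in results.items() if relevant[q] in docs[:k])
    return hits / len(results)


def hallucination_rate(claims):
    """Fraction of output claims not supported by any retrieved chunk.

    claims: list of (claim_text, supported) pairs emitted by the validator
    """
    unsupported = sum(1 for _, supported in claims if not supported)
    return unsupported / len(claims)
```

Both metrics only mean something against a held-out set with known ground truth; computed on live traffic without labels, they degenerate into guesses.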
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Query Encoder | Embeds the raw query text into a dense vector representation matching the corpus embedding space | Raw query string | Query embedding vector |
| Retriever | Runs approximate nearest-neighbor search against the vector store to fetch the top-K most semantically similar chunks | Query embedding + vector store access | Top-50 candidate chunks with similarity scores |
| Reranker | Cross-encodes query against each candidate chunk; filters by jurisdiction and matter type; returns the highest-precision subset that fits the context window | Top-50 chunks + query + metadata filters | Top-8 reranked chunks with relevance scores |
| Generator | Constructs answer grounded in the retrieved context window; instructed to cite chunk IDs and quote directly rather than paraphrase from memory | Top-8 chunks + original query | Draft answer with inline chunk citations |
| Citation Validator | Verifies that each factual claim in the draft answer is directly supported by a cited chunk; rejects or flags unsupported claims | Draft answer + retrieved chunks | Validated answer with grounding status per claim |
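The five stages compose into a single fixed-order function. A minimal sketch with stand-in callables (every name here is illustrative):

```python
from dataclasses import dataclass


@dataclass
class ValidatedAnswer:
    text: str
    grounded: bool  # did every claim trace to a retrieved chunk?


def rag_pipeline(query, encoder, retriever, reranker, generator, validator):
    """Fixed-order composition of the five stages in the table above."""
    embedding = encoder(query)                   # Query Encoder
    candidates = retriever(embedding, k=50)      # Retriever: top-50 chunks
    context = reranker(query, candidates, n=8)   # Reranker: top-8 chunks
    draft = generator(query, context)            # Generator: cited draft
    return validator(draft, context)             # Citation Validator
```

The fixed order is the point: each stage narrows what the next one sees, so an error in an early stage (a bad embedding, a missed document) propagates silently through every later stage.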

When to Use

Use when
Avoid when

Value Profile

| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | Generator node | Answer quality is the product. The Generator's output is the only customer-facing deliverable; every upstream stage exists to make it accurate. Quality is directly proportional to retrieval precision — a Generator with perfect context still fails if the Reranker passed irrelevant chunks. |
| Governance | Citation Validator | The Citation Validator is the trust boundary between model speculation and warranted claims. Outputs crossing it are grounded assertions; outputs that fail it are model opinions not backed by the corpus. This node encodes the firm's liability standard. |
| Risk Exposure | Retriever and corpus | Retrieval gaps — queries where the relevant document is absent from the corpus — are the primary residual failure mode. These failures are invisible to the Citation Validator because the model may generate a plausible answer from other chunks. Corpus coverage is a risk metric, not just an operational concern. |
| Conditional Action | Every stage | Each stage consumes compute before the answer is produced. Encoder and Retriever are cheap; Reranker and Generator are expensive. A query that returns zero relevant chunks still incurs full pipeline cost. Query volume directly drives cost with no fixed-overhead amortization. |
VCM analog: Work Token with data provenance. Each stage in the pipeline earns its position only if its output measurably improves grounding. The Citation Validator is the trust boundary — outputs crossing it are warranted claims backed by retrieved evidence, not model speculation. A pipeline without a Citation Validator produces unwarranted Work Tokens: they look valid but carry no provenance guarantee.

Dynamics and Failure Modes

Retrieval gap (relevant document absent from corpus)

The user asks about a Force Majeure clause in a contract signed last week. The corpus ingestion pipeline runs nightly. The contract is not yet indexed. The Retriever returns the top-50 most similar chunks from existing contracts, none of which are from the relevant document. The Generator constructs an answer that sounds correct — it draws on similar language from other jurisdictions — and the Citation Validator passes it because all citations resolve to real chunks. The answer is factually wrong for this specific contract. Detection requires corpus coverage monitoring: track what percentage of recent documents have been indexed, and alert when coverage drops below a threshold for active matters.
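The coverage monitoring described above is simple to sketch. A minimal version, assuming each document record carries `id`, `received_at`, and `indexed` fields (those names are assumptions):

```python
from datetime import datetime, timedelta


def coverage_alerts(documents, now, max_lag=timedelta(hours=4)):
    """Flag documents that are not yet retrievable past the acceptable lag.

    documents: list of dicts with 'id', 'received_at', and 'indexed' fields.
    Returns ids whose ingestion lag exceeds max_lag — each one is a live
    retrieval-gap risk, since queries about it will silently pull similar
    chunks from other documents instead.
    """
    return [
        d["id"]
        for d in documents
        if not d["indexed"] and now - d["received_at"] > max_lag
    ]
```

Run against active matters on a schedule: an empty result means coverage is within tolerance; any non-empty result should page before a user asks about the missing contract.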

Context window stuffing (too many chunks, diluted signal)

The Reranker passes 20 chunks instead of 8 to fit more context. The Generator attends weakly to the most relevant chunks, which are buried mid-context, and over-weights the first and last chunks (recency and primacy effects). Answer quality degrades compared to a tighter 6-chunk context despite more information being present. Fix: tune Reranker cutoff empirically by measuring answer accuracy as a function of context size. For most LLMs, 6-10 high-precision chunks outperform 20 moderate-precision chunks.
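The empirical tuning above amounts to a sweep over cutoff sizes against a graded evaluation set. A sketch, where `answer_fn` and `grade_fn` are placeholders for the Generator call and an accuracy judge:

```python
def tune_context_size(eval_set, answer_fn, grade_fn, sizes=range(4, 21, 2)):
    """Measure answer accuracy as a function of the reranker cutoff N.

    eval_set: list of (query, ranked_chunks, expected) triples
    answer_fn(query, chunks) -> answer;  grade_fn(answer, expected) -> bool
    Returns the best-scoring N and the full accuracy curve, so the
    degradation past the sweet spot is visible rather than inferred.
    """
    curve = {}
    for n in sizes:
        correct = sum(
            grade_fn(answer_fn(q, chunks[:n]), expected)
            for q, chunks, expected in eval_set
        )
        curve[n] = correct / len(eval_set)
    best = max(curve, key=curve.get)
    return best, curve
```

Rerunning the sweep after a model upgrade matters: the optimal cutoff is a property of the specific generator, not of the corpus.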

Citation hallucination (model ignores retrieved context)

The Generator is instructed to cite chunk IDs but produces an answer that references chunk IDs not in the retrieved set, or that quotes text not present in the cited chunk. This happens when the model's training knowledge contradicts the retrieved context and the model "corrects" the retrieved source. The Citation Validator catches explicit mismatches but may miss paraphrastic hallucinations where the model rewrites a clause to change its meaning while preserving surface similarity. Fix: require verbatim quotation in high-stakes passages and validate exact string match, not semantic similarity.

Embedding drift (query and document embeddings misaligned)

The document corpus was embedded using text-embedding-ada-002. The query encoder was upgraded to text-embedding-3-large. Cosine similarity scores between query embeddings and document embeddings are now unreliable — the spaces are not aligned. Retrieval recall drops from 0.87 to 0.61 with no error signal, only gradual answer quality degradation. Fix: re-embed the entire corpus when the embedding model changes. Track a held-out retrieval benchmark query set with known relevant documents, and alert when recall@K drops more than 5 percentage points.
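The benchmark alert reduces to re-running the held-out query set and comparing against the last known-good recall. A sketch (all parameter names are illustrative):

```python
def recall_drift_alert(benchmark, retrieve, baseline_recall, k=50, threshold=0.05):
    """Re-run the held-out benchmark and flag a drop in recall@K.

    benchmark: list of (query, relevant_doc_id) pairs with known ground truth
    retrieve(query, k) -> ranked doc ids
    baseline_recall: recall@K from the last known-good run
    Returns (current_recall, alert) — alert fires when recall has dropped
    by more than `threshold` (5 percentage points by default).
    """
    hits = sum(1 for query, rel in benchmark if rel in retrieve(query, k))
    recall = hits / len(benchmark)
    return recall, (baseline_recall - recall) > threshold
```

Because misaligned embedding spaces produce no error signal, this scheduled check is often the only way the drift described above becomes visible before users notice.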

Variants

| Variant | Modification | When to use |
|---|---|---|
| HyDE (Hypothetical Document Embedding) | Before retrieval, the Generator produces a hypothetical ideal document matching the query; the encoder embeds that document instead of the raw query | Queries are short or ambiguous; a richer representation improves retrieval recall over direct query embedding |
| Multi-Hop RAG | After the first retrieval cycle, the Generator identifies missing context and the Retriever runs again with refined sub-queries, repeating until context is sufficient | Queries that require connecting information across multiple documents — e.g., "does this clause conflict with the indemnification terms in the master agreement?" |
| Corrective RAG (CRAG) | Citation Validator failure triggers the Retriever to re-query with expanded or reformulated search terms before regenerating | High-precision requirements where a single retrieval pass is insufficient; acceptable to trade latency for grounding quality |
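The Corrective RAG loop is the smallest of the three variants to sketch: validation failure feeds back into retrieval instead of returning a failed answer. All callable names below are illustrative stand-ins for the pipeline stages:

```python
def corrective_rag(query, retrieve, generate, validate, reformulate, max_attempts=3):
    """Corrective RAG: failed validation triggers re-retrieval with a
    reformulated query before regenerating.

    retrieve(q) -> chunks;  generate(query, chunks) -> draft
    validate(draft, chunks) -> bool;  reformulate(query, chunks) -> new query
    """
    q = query
    for _ in range(max_attempts):
        chunks = retrieve(q)
        draft = generate(query, chunks)
        if validate(draft, chunks):
            return draft
        q = reformulate(query, chunks)  # expand or rephrase the search terms
    return None  # grounding never achieved: fail closed rather than guess
```

Returning `None` after `max_attempts` is the latency-for-grounding trade the table describes: the loop caps retries so a hopeless query cannot consume unbounded pipeline cost.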

Related Patterns

| Pattern | Relationship |
|---|---|
| 10.11 Pipeline | RAG is a specialized pipeline with retrieval infrastructure — the same sequential composition principles apply, including cascading error from early stages |
| 10.15 Evaluator-Optimizer | Add a retry loop when the Citation Validator fails: the Evaluator triggers re-retrieval or re-generation rather than returning a failed response |
| 20.23 Orchestrator-Workers | Multi-hop RAG becomes an Orchestrator-Workers pattern when the orchestrator dynamically decides which corpora to query based on intermediate results |

Investment Signal

RAG pipelines are measurable at every stage. Retrieval recall, reranker precision, citation pass rate, and end-to-end hallucination rate are all independently auditable. A firm that can demonstrate sub-3% hallucination on a domain-specific held-out benchmark has a defensible quality moat — the benchmark itself is a due diligence artifact.

The corpus is the primary asset. A vector store built from 5 years of proprietary contract history, customer interactions, or domain-specific documents cannot be replicated by a competitor who buys the same LLM. Corpus quality, coverage, and freshness are the real competitive variables — the retrieval architecture is commoditizing rapidly.

Red flag: a RAG deployment with no per-stage instrumentation is flying blind. If the firm reports only final answer quality and cannot decompose failure into retrieval gaps vs. generation errors vs. citation failures, they cannot diagnose or improve the system and cannot prove the corpus is doing useful work.