10.13 RAG Pipeline

Retrieval-Augmented Generation: a fixed-order pipeline where an encoder embeds the query, a retriever fetches relevant chunks from a corpus, a reranker filters to the best context window, a generator produces the answer, and a citation validator confirms every claim maps to a retrieved source.


Motivating Scenario

A legal research firm handles 500 contract queries per day — jurisdiction-specific questions about clause language across a 50,000-contract corpus. A single capable model answering from memory produces a 23% hallucination rate on jurisdiction-specific questions: the model confabulates plausible-sounding clauses that do not exist in the actual contracts. The firm faces professional liability on every wrong answer.

After deploying a RAG pipeline, hallucination rate drops to 2.1% and query turnaround is 4x faster than associates. The reduction comes from grounding: every answer is constructed from retrieved clauses, and a citation validator confirms that each factual claim in the output traces to an exact chunk in the retrieved set. The remaining 2.1% failure rate clusters on retrieval gaps — cases where the relevant contract is not in the corpus — not on model hallucination within context.

Structure

Concrete example: legal contract query pipeline

Key Metrics

| Metric | Signal |
|---|---|
| Hallucination rate | Primary quality signal — percentage of claims in validated output not supported by retrieved chunks. Target: below 3% for high-stakes domains. |
| Retrieval recall@K | Does the Retriever return the relevant document in the top-K results? Measured on a held-out query set with known ground truth. Dropping below 0.80 requires corpus or encoder investigation. |
| Reranker precision@N | Of the N chunks passed to the Generator, what fraction are actually relevant? Low precision dilutes the context window and increases generator error rate. |
| Citation validator pass rate | Percentage of draft answers that pass citation validation without requiring modification. A falling pass rate signals generator or reranker degradation before end-to-end quality metrics surface the issue. |
| Corpus coverage lag | Average time between document ingestion and availability for retrieval. For active legal matters, a lag above 4 hours creates material retrieval-gap risk. |
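The first two metrics are mechanical to compute from a labeled query set. A minimal sketch (function names and data shapes are illustrative, not from the text):

```python
def recall_at_k(results, relevant, k=50):
    """Fraction of queries whose ground-truth document appears in the top-k results.

    results:  dict query_id -> ranked list of retrieved doc ids
    relevant: dict query_id -> the known relevant doc id
    """
    hits = sum(1 for q, docs in results.items() if relevant[q] in docs[:k])
    return hits / len(results)


def hallucination_rate(claims):
    """Fraction of output claims not supported by any retrieved chunk.

    claims: list of (claim_text, supported) pairs emitted by the validator
    """
    unsupported = sum(1 for _, supported in claims if not supported)
    return unsupported / len(claims)
```

Both metrics only mean something against a held-out set with known ground truth; computed on live traffic without labels, they degenerate into guesses.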
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Query Encoder | Embeds the raw query text into a dense vector representation matching the corpus embedding space | Raw query string | Query embedding vector |
| Retriever | Runs approximate nearest-neighbor search against the vector store to fetch the top-K most semantically similar chunks | Query embedding + vector store access | Top-50 candidate chunks with similarity scores |
| Reranker | Cross-encodes query against each candidate chunk; filters by jurisdiction and matter type; returns the highest-precision subset that fits the context window | Top-50 chunks + query + metadata filters | Top-8 reranked chunks with relevance scores |
| Generator | Constructs answer grounded in the retrieved context window; instructed to cite chunk IDs and quote directly rather than paraphrase from memory | Top-8 chunks + original query | Draft answer with inline chunk citations |
| Citation Validator | Verifies that each factual claim in the draft answer is directly supported by a cited chunk; rejects or flags unsupported claims | Draft answer + retrieved chunks | Validated answer with grounding status per claim |
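The five stages compose into a single fixed-order function. A minimal sketch with stand-in callables (every name here is illustrative):

```python
from dataclasses import dataclass


@dataclass
class ValidatedAnswer:
    text: str
    grounded: bool  # did every claim trace to a retrieved chunk?


def rag_pipeline(query, encoder, retriever, reranker, generator, validator):
    """Fixed-order composition of the five stages in the table above."""
    embedding = encoder(query)                   # Query Encoder
    candidates = retriever(embedding, k=50)      # Retriever: top-50 chunks
    context = reranker(query, candidates, n=8)   # Reranker: top-8 chunks
    draft = generator(query, context)            # Generator: cited draft
    return validator(draft, context)             # Citation Validator
```

The fixed order is the point: each stage narrows what the next one sees, so an error in an early stage (a bad embedding, a missed document) propagates silently through every later stage.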

When to Use

Use when
Avoid when

Value Profile

| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | Generator node | Answer quality is the product. The Generator's output is the only customer-facing deliverable; every upstream stage exists to make it accurate. Quality is directly proportional to retrieval precision — a Generator with perfect context still fails if the Reranker passed irrelevant chunks. |
| Governance | Citation Validator | The Citation Validator is the trust boundary between model speculation and warranted claims. Outputs crossing it are grounded assertions; outputs that fail it are model opinions not backed by the corpus. This node encodes the firm's liability standard. |
| Risk Exposure | Retriever and corpus | Retrieval gaps — queries where the relevant document is absent from the corpus — are the primary residual failure mode. These failures are invisible to the Citation Validator because the model may generate a plausible answer from other chunks. Corpus coverage is a risk metric, not just an operational concern. |
| Conditional Action | Every stage | Each stage consumes compute before the answer is produced. Encoder and Retriever are cheap; Reranker and Generator are expensive. A query that returns zero relevant chunks still incurs full pipeline cost. Query volume directly drives cost with no fixed-overhead amortization. |
VCM analog: Work Token with data provenance. Each stage in the pipeline earns its position only if its output measurably improves grounding. The Citation Validator is the trust boundary — outputs crossing it are warranted claims backed by retrieved evidence, not model speculation. A pipeline without a Citation Validator produces unwarranted Work Tokens: they look valid but carry no provenance guarantee.

Dynamics and Failure Modes

Retrieval gap (relevant document absent from corpus)

The user asks about a Force Majeure clause in a contract signed last week. The corpus ingestion pipeline runs nightly. The contract is not yet indexed. The Retriever returns the top-50 most similar chunks from existing contracts, none of which are from the relevant document. The Generator constructs an answer that sounds correct — it draws on similar language from other jurisdictions — and the Citation Validator passes it because all citations resolve to real chunks. The answer is factually wrong for this specific contract. Detection requires corpus coverage monitoring: track what percentage of recent documents have been indexed, and alert when coverage drops below a threshold for active matters.
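The coverage monitoring described above is simple to sketch. A minimal version, assuming each document record carries `id`, `received_at`, and `indexed` fields (those names are assumptions):

```python
from datetime import datetime, timedelta


def coverage_alerts(documents, now, max_lag=timedelta(hours=4)):
    """Flag documents that are not yet retrievable past the acceptable lag.

    documents: list of dicts with 'id', 'received_at', and 'indexed' fields.
    Returns ids whose ingestion lag exceeds max_lag — each one is a live
    retrieval-gap risk, since queries about it will silently pull similar
    chunks from other documents instead.
    """
    return [
        d["id"]
        for d in documents
        if not d["indexed"] and now - d["received_at"] > max_lag
    ]
```

Run against active matters on a schedule: an empty result means coverage is within tolerance; any non-empty result should page before a user asks about the missing contract.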

Context window stuffing (too many chunks, diluted signal)

The Reranker passes 20 chunks instead of 8 to fit more context. The Generator attends weakly to the most relevant chunks, which are buried mid-context, and over-weights the first and last chunks (recency and primacy effects). Answer quality degrades compared to a tighter 6-chunk context despite more information being present. Fix: tune Reranker cutoff empirically by measuring answer accuracy as a function of context size. For most LLMs, 6-10 high-precision chunks outperform 20 moderate-precision chunks.
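The empirical tuning above amounts to a sweep over cutoff sizes against a graded evaluation set. A sketch, where `answer_fn` and `grade_fn` are placeholders for the Generator call and an accuracy judge:

```python
def tune_context_size(eval_set, answer_fn, grade_fn, sizes=range(4, 21, 2)):
    """Measure answer accuracy as a function of the reranker cutoff N.

    eval_set: list of (query, ranked_chunks, expected) triples
    answer_fn(query, chunks) -> answer;  grade_fn(answer, expected) -> bool
    Returns the best-scoring N and the full accuracy curve, so the
    degradation past the sweet spot is visible rather than inferred.
    """
    curve = {}
    for n in sizes:
        correct = sum(
            grade_fn(answer_fn(q, chunks[:n]), expected)
            for q, chunks, expected in eval_set
        )
        curve[n] = correct / len(eval_set)
    best = max(curve, key=curve.get)
    return best, curve
```

Rerunning the sweep after a model upgrade matters: the optimal cutoff is a property of the specific generator, not of the corpus.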

Citation hallucination (model ignores retrieved context)

The Generator is instructed to cite chunk IDs but produces an answer that references chunk IDs not in the retrieved set, or that quotes text not present in the cited chunk. This happens when the model's training knowledge contradicts the retrieved context and the model "corrects" the retrieved source. The Citation Validator catches explicit mismatches but may miss paraphrastic hallucinations where the model rewrites a clause to change its meaning while preserving surface similarity. Fix: require verbatim quotation in high-stakes passages and validate exact string match, not semantic similarity.

Embedding drift (query and document embeddings misaligned)

The document corpus was embedded using text-embedding-ada-002. The query encoder was upgraded to text-embedding-3-large. Cosine similarity scores between query embeddings and document embeddings are now unreliable — the spaces are not aligned. Retrieval recall drops from 0.87 to 0.61 with no error signal, only gradual answer quality degradation. Fix: re-embed the entire corpus when the embedding model changes. Track a held-out retrieval benchmark query set with known relevant documents, and alert when recall@K drops more than 5 percentage points.
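The benchmark alert reduces to re-running the held-out query set and comparing against the last known-good recall. A sketch (all parameter names are illustrative):

```python
def recall_drift_alert(benchmark, retrieve, baseline_recall, k=50, threshold=0.05):
    """Re-run the held-out benchmark and flag a drop in recall@K.

    benchmark: list of (query, relevant_doc_id) pairs with known ground truth
    retrieve(query, k) -> ranked doc ids
    baseline_recall: recall@K from the last known-good run
    Returns (current_recall, alert) — alert fires when recall has dropped
    by more than `threshold` (5 percentage points by default).
    """
    hits = sum(1 for query, rel in benchmark if rel in retrieve(query, k))
    recall = hits / len(benchmark)
    return recall, (baseline_recall - recall) > threshold
```

Because misaligned embedding spaces produce no error signal, this scheduled check is often the only way the drift described above becomes visible before users notice.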

Variants

| Variant | Modification | When to use |
|---|---|---|
| HyDE (Hypothetical Document Embedding) | Before retrieval, the Generator produces a hypothetical ideal document matching the query; the encoder embeds that document instead of the raw query | Queries are short or ambiguous; a richer representation improves retrieval recall over direct query embedding |
| Multi-Hop RAG | After the first retrieval cycle, the Generator identifies missing context and the Retriever runs again with refined sub-queries, repeating until context is sufficient | Queries that require connecting information across multiple documents — e.g., "does this clause conflict with the indemnification terms in the master agreement?" |
| Corrective RAG (CRAG) | Citation Validator failure triggers the Retriever to re-query with expanded or reformulated search terms before regenerating | High-precision requirements where a single retrieval pass is insufficient; acceptable to trade latency for grounding quality |
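The Corrective RAG loop is the smallest of the three variants to sketch: validation failure feeds back into retrieval instead of returning a failed answer. All callable names below are illustrative stand-ins for the pipeline stages:

```python
def corrective_rag(query, retrieve, generate, validate, reformulate, max_attempts=3):
    """Corrective RAG: failed validation triggers re-retrieval with a
    reformulated query before regenerating.

    retrieve(q) -> chunks;  generate(query, chunks) -> draft
    validate(draft, chunks) -> bool;  reformulate(query, chunks) -> new query
    """
    q = query
    for _ in range(max_attempts):
        chunks = retrieve(q)
        draft = generate(query, chunks)
        if validate(draft, chunks):
            return draft
        q = reformulate(query, chunks)  # expand or rephrase the search terms
    return None  # grounding never achieved: fail closed rather than guess
```

Returning `None` after `max_attempts` is the latency-for-grounding trade the table describes: the loop caps retries so a hopeless query cannot consume unbounded pipeline cost.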

Related Patterns

| Pattern | Relationship |
|---|---|
| 10.11 Pipeline | RAG is a specialized pipeline with retrieval infrastructure — the same sequential composition principles apply, including cascading error from early stages |
| 10.15 Evaluator-Optimizer | Add a retry loop when the Citation Validator fails: the Evaluator triggers re-retrieval or re-generation rather than returning a failed response |
| 20.23 Orchestrator-Workers | Multi-hop RAG becomes an Orchestrator-Workers pattern when the orchestrator dynamically decides which corpora to query based on intermediate results |

Investment Signal

RAG pipelines are measurable at every stage. Retrieval recall, reranker precision, citation pass rate, and end-to-end hallucination rate are all independently auditable. A firm that can demonstrate sub-3% hallucination on a domain-specific held-out benchmark has a defensible quality moat — the benchmark itself is a due diligence artifact.

The corpus is the primary asset. A vector store built from 5 years of proprietary contract history, customer interactions, or domain-specific documents cannot be replicated by a competitor who buys the same LLM. Corpus quality, coverage, and freshness are the real competitive variables — the retrieval architecture is commoditizing rapidly.

Red flag: a RAG deployment with no per-stage instrumentation is flying blind. If the firm reports only final answer quality and cannot decompose failure into retrieval gaps vs. generation errors vs. citation failures, they cannot diagnose or improve the system and cannot prove the corpus is doing useful work.