
Part of the AI system design curriculum

RAG Fundamentals

Why retrieval-augmented generation works, and how to build a pipeline that actually grounds answers.

A language model trained six months ago has never read your internal wiki, your Q3 product specs, or yesterday's incident report. When you ask it about these things, it either refuses or confabulates, generating fluent text that sounds right but is grounded in nothing. Retrieval-Augmented Generation (RAG) fixes this by giving the model a search capability: find the relevant documents at query time, inject them into the prompt, and let the model reason over retrieved source text instead of recalled patterns. The architecture is conceptually simple, but making it work reliably in production is a surprisingly deep engineering problem.

The Grounding Problem

Why LLMs Confabulate

Hallucination is the wrong mental model. A language model does not have opinions or beliefs; it predicts probable continuations. When the training data contains no signal about a specific fact, the model still produces a continuation that fits the surrounding context. The result is confident, plausible text that happens to be wrong.

Think of a brilliant research librarian who has read millions of books but not your company's internal documentation. Ask her about public history, and she's excellent. Ask her about last week's board meeting, and she'll fill the gap with a plausible reconstruction, not because she intends to deceive, but because that's what fluent speakers do when pressed for specifics they don't have.

The problem compounds in domain-specific applications. A legal assistant that confidently cites a statute that does not exist, or a customer support tool that invents a product feature, causes real downstream harm that a plain "I don't know" would not. The deeper issue is that there is no natural signal inside the model to distinguish between remembered fact and inferred plausibility; both produce identical-looking output text.

Fine-tuning does not solve knowledge freshness. When you fine-tune a model on a set of facts, you are encoding those patterns into the weights. That can work for stable, high-frequency information, but it is brittle: one update to a document requires another fine-tuning run, coverage of edge cases depends on how well your training examples represent them, and the model may still confabulate on queries it has not seen enough of during training. RAG replaces weight-encoded memory with retrieval. The document lives outside the model, is easily updated by a re-index operation, and is directly visible in the prompt where the model can read it literally rather than recall it probabilistically.

Fine-tuning changes how a model responds (tone, format, persona). RAG changes what facts it can draw on. These solve different problems. Reaching for fine-tuning when you have a knowledge-freshness problem is like repainting your car because it ran out of gas.

The Core RAG Pipeline

The canonical three-stage pipeline:

Indexing (offline): Split source documents into chunks, embed each chunk into a vector, persist in a vector store
Retrieval (online, per query): Embed the user's query, find the top-K most similar chunks by vector similarity
Generation: Prepend retrieved chunks to the query and pass the full context to the LLM

from sentence_transformers import SentenceTransformer
import numpy as np
 
embedder = SentenceTransformer("all-MiniLM-L6-v2")
 
# --- Indexing phase (done once) ---
documents = [
    "The Paris Agreement aims to hold warming well below 2°C, pursuing a 1.5°C limit above pre-industrial levels.",
    "Photovoltaic panels convert photons into direct current via the photovoltaic effect.",
    "Our Q3 revenue was $4.2M, up 18% year-over-year.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)
 
# --- Retrieval phase (per query) ---
def retrieve(query: str, top_k: int = 2) -> list[str]:
    q_vec = embedder.encode([query], normalize_embeddings=True)
    # Cosine similarity is just dot product for unit vectors
    scores = doc_embeddings @ q_vec.T
    ranked = np.argsort(scores[:, 0])[::-1][:top_k]
    return [documents[i] for i in ranked]
 
# --- Generation phase ---
query = "What was our Q3 financial performance?"
context_chunks = retrieve(query)
# Retrieved chunks go first, then the question they should answer
prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {query}"
# Pass `prompt` to your LLM

This works for demos. For production, every stage has significant failure modes.

Chunking Strategy and Recall

How you split documents directly determines whether the right information surfaces at retrieval time. Chunk too small and you lose sentence context: the embedding of a 50-token fragment rarely captures the full meaning of a paragraph, and cosine similarity scores become noisy because the model is comparing two slivers of meaning rather than two coherent units. Chunk too large and the embedding averages over too much content: a 1000-token chunk that covers both pricing policy and shipping policy scores mediocrely against a question about either topic alone, because the embedding is a centroid of many concepts rather than a precise signal for any single one.

A practical starting point is 256 to 512 tokens with a 40 to 80 token overlap between adjacent chunks. Overlap copies the tail of each chunk into the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk. For documents with natural structural units (numbered contract clauses, API endpoint descriptions, FAQ entries), align chunk boundaries to those units rather than fixed token counts. Structural boundaries are almost always better split points than arbitrary token positions.
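A fixed-size chunker with overlap can be sketched in a few lines. This is a minimal illustration, assuming the document has already been tokenized into a list; the function name `chunk_tokens` and the toy tokens are hypothetical, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 256, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Toy input: ten placeholder tokens, windows of 4 with 1 token shared.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

With these toy settings the last token of each chunk is repeated as the first token of the next, which is the property that keeps a boundary-straddling sentence intact in at least one chunk.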

Two patterns extend the baseline further. Parent-document retrieval indexes small sentence-level chunks (around 128 tokens) to produce sharp, precise embeddings, but stores a pointer from each sentence chunk to its full parent section (512 to 1024 tokens). When a sentence chunk matches at retrieval time, the retriever returns the parent section rather than the small chunk. The small chunk acts as a precise address; the large chunk carries the context the model actually needs to generate a complete answer. Sentence-window expansion is a lighter version of the same idea: index individual sentences, but at retrieval time expand each matched sentence to include the two or three sentences on either side before passing to the LLM. Both patterns improve Recall@K because the index is sharp, and improve generation quality because the context window is fuller.
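Sentence-window expansion is small enough to sketch directly. The version below assumes the document is already split into an ordered list of sentences and that the retriever returns the index of the matched sentence; `expand_window` is a hypothetical helper name, not a library function.

```python
def expand_window(sentences: list[str], hit_index: int, window: int = 2) -> str:
    """Return the matched sentence plus up to `window` neighbors on each side."""
    lo = max(0, hit_index - window)                    # clamp at document start
    hi = min(len(sentences), hit_index + window + 1)   # clamp at document end
    return " ".join(sentences[lo:hi])


sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
context = expand_window(sentences, hit_index=3, window=1)
```

The index stays sentence-sharp, and only the text handed to the LLM grows.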

A useful diagnostic to run on your system: if Recall@10 is above 80% but answer quality is still poor, chunking is probably not the bottleneck and the failure is downstream in generation. If Recall@10 is below 60%, check chunk boundaries and chunk size before blaming the embedding model or retrieval algorithm.
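Recall@K in the sense used here (did at least one correct chunk appear in the top K?) can be computed as follows. This is a sketch under the assumption that you have a labeled set mapping each query ID to its relevant chunk IDs; the data layout and names are illustrative.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of queries with at least one relevant chunk in the top-k results."""
    if not relevant:
        return 0.0
    hits = 0
    for query_id, gold in relevant.items():
        top_k = retrieved.get(query_id, [])[:k]
        if gold & set(top_k):
            hits += 1
    return hits / len(relevant)


# Toy labeled set: q1 is a hit at rank 2, q2 misses entirely.
retrieved = {"q1": ["c7", "c3", "c9"], "q2": ["c1", "c2", "c4"]}
relevant = {"q1": {"c3"}, "q2": {"c8"}}
score = recall_at_k(retrieved, relevant, k=10)
```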

Figure: the full grounding flow in numbered steps (query → embed → retrieve → augment prompt → LLM answer).

The Three Retrieval Gaps

Most RAG failures are retrieval failures, not generation failures. The LLM is perfectly capable of reading a relevant passage and extracting the answer. The problem is that the relevant passage never made it into the context window. There are three distinct ways this happens.

Figure: the three ways RAG retrieval fails (semantic mismatch, fragmented context, and lost-in-the-middle).

Gap 1: Semantic Mismatch

A user asks "which vehicle accelerates fastest?" Your database stores a document titled "Porsche 911 Turbo S 0-60 benchmark results." Pure keyword search finds nothing because no words overlap. Embedding search usually helps (semantic similarity captures paraphrase), but embedding models trained on general text sometimes miss domain-specific vocabulary.

The fix is hybrid retrieval: run both dense (embedding) search and sparse (BM25) keyword search, then merge the ranked results with Reciprocal Rank Fusion (RRF). Each result gets a score of 1 / (k + rank) in each list, and the scores are summed. Sparse search catches exact terminology; dense search catches conceptual equivalence. Their combination beats either alone.

BM25 is a term-frequency and inverse-document-frequency scoring function, with document-length normalization, that has been the backbone of full-text search engines for decades. It is a bag-of-words model: it handles exact matches and rare tokens well but ignores word order and phrase proximity. It uses no learned representation, which makes it robust to domain shift: a BM25 index over legal documents does not need to be trained on legal text, it just counts and weights tokens. Dense retrieval adds semantic generalization but can fail on a specific model number, a contract identifier, or a rare technical term that the embedding model has not seen in sufficient context during training. Neither dominates across all query types; hybrid retrieval extracts the strengths of both.
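To make the "counts and weights tokens" point concrete, here is a minimal BM25 scorer. It is a sketch, assuming pre-tokenized lowercase input, the commonly used defaults k1 = 1.5 and b = 0.75, and a smoothed non-negative IDF variant; these choices are assumptions rather than anything specified above. In production you would rely on Elasticsearch, OpenSearch, or the rank-bm25 package instead.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents need more occurrences to score equally.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["contract", "4471", "indemnification"],
    ["shipping", "policy", "returns"],
    ["contract", "liability", "clause"],
]
scores = bm25_scores(["contract", "4471"], docs)
```

The rare identifier "4471" contributes more than the common word "contract", so the first document ranks highest without any training.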

The RRF formula in full: given ranked lists from multiple retrieval systems, assign each document a score of 1 / (k + rank_in_list) for each list it appears in, then sum those scores across all lists. The constant k is typically 60. It dampens the advantage of the very top ranks, so a single first-place result in one list cannot dominate the fused score when there is uncertainty at the top. RRF also requires no score normalization: BM25 scores and cosine similarities live in completely different numeric ranges, but ranks are always comparable integers. The merged list is then passed to the reranker.
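The formula translates almost line for line into code. The sketch below assumes each retriever returns an ordered list of document IDs (best first); the function name and the toy IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (doc_id, score) pairs."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `doc_b` (ranks 2 and 1) edges out `doc_a` (ranks 1 and 3), and documents found by only one retriever fall to the bottom.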

On the BEIR benchmark (a collection of 18 heterogeneous retrieval datasets spanning domains such as biomedical literature, financial question answering, and fact checking), hybrid retrieval is commonly reported to outperform dense-only retrieval by roughly 2 to 8 NDCG@10 points, with the largest gaps on out-of-domain queries where the embedding model was not specifically trained on that domain's vocabulary.

Gap 2: Fragmented Context

The retriever fetches a chunk that contains a partial answer, but the complete answer spans two adjacent chunks that were split during indexing. The model sees "the interest rate was increased by..." and the completion "...50 basis points" lives in the next chunk, which scored just below the retrieval cutoff.

The fix involves smarter chunking strategies:

Overlapping windows: chunks share N tokens with their neighbors
Parent-document retrieval: index small chunks for precision, but return the full parent section when a small chunk matches
Sentence-window expansion: expand a matched sentence to include its surrounding paragraph at generation time

The effect of overlap is measurable. On a corpus of legal briefs split into 256-token chunks with no overlap, Recall@10 for questions whose answers span a chunk boundary was around 34%. Adding a 64-token overlap raised it to roughly 71%, because a sentence straddling a chunk boundary now appears whole in at least one of the two adjacent chunks, and that chunk can be retrieved on its own.

For parent-document retrieval, a concrete setup: index 128-token sentence-level chunks so that embedding signals are sharp, but store a pointer from each sentence chunk to its 512-token parent section. When the retriever finds the top-K sentence chunks, swap each one for its parent before building the context window. This is a single indirection in the retrieval layer but frequently doubles the completeness of the context passed to the LLM, because the model receives the full paragraph from which the matched sentence came rather than an isolated fragment.
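The swap step can be sketched with two plain dictionaries. This assumes the retriever returns child chunk IDs best first; `child_to_parent` and `parents` are hypothetical lookup tables that a real system would keep in its document store.

```python
def swap_for_parents(child_hits: list[str], child_to_parent: dict[str, str], parents: dict[str, str]) -> list[str]:
    """Replace matched child chunks with their parent sections, keeping rank order."""
    seen: set[str] = set()
    sections = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # two children from one section should not duplicate context
        seen.add(parent_id)
        sections.append(parents[parent_id])
    return sections


child_to_parent = {"s1": "sec_A", "s2": "sec_A", "s3": "sec_B"}
parents = {"sec_A": "Full text of section A.", "sec_B": "Full text of section B."}
context_sections = swap_for_parents(["s2", "s3", "s1"], child_to_parent, parents)
```

Deduplicating by parent matters in practice: several top-K sentences often come from the same section, and repeating it wastes context tokens.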

Gap 3: Lost in the Middle

Even when all relevant content arrives in the context window, LLMs disproportionately use information near the start and end of long prompts, effectively ignoring content buried in the middle. If you retrieve 20 chunks and the key fact is chunk 11, the model may not use it.

The fix is a cross-encoder reranker. After initial retrieval, the reranker scores each candidate chunk against the query using a model that attends jointly to both (not just their individual embeddings). This is slower than embedding lookup but far more accurate at separating relevant from marginally relevant content. Keep only the top 3-5 chunks and move the most relevant ones to the prompt boundaries.
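Moving the most relevant chunks to the prompt boundaries can be done with a simple interleave. This is one possible ordering, sketched under the assumption that the input is already sorted best first by the reranker; the helper name is hypothetical.

```python
def reorder_for_boundaries(ranked_chunks: list[str]) -> list[str]:
    """Place the best chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th... to the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = reorder_for_boundaries(["c1", "c2", "c3", "c4", "c5"])
```

The top-ranked chunk opens the context, the second-ranked closes it, and the lowest-ranked chunk sits in the middle where it is least likely to be used anyway.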

Understanding why cross-encoders are more accurate clarifies when the extra latency is worth it. A bi-encoder (the standard embedding approach) produces one vector for the query and one vector for each document, computed independently, with no interaction between them. Relevance is the dot product of two fixed-length vectors, which collapses all meaning into a single geometric comparison. A cross-encoder instead concatenates the query and the candidate document into a single input sequence, runs the full transformer attention stack over all tokens together, and produces a scalar relevance score. Every query token can attend to every document token. This joint attention catches fine-grained term-level matches, negation, and conditional relationships that a dot product between independent vectors cannot represent.

The cost is one forward pass per candidate. For a corpus of a million documents, running a cross-encoder over every candidate is not feasible at query time. The practical pattern is a two-stage pipeline: a fast bi-encoder retrieves the top 50 candidates (roughly 5 to 10 ms using HNSW approximate nearest-neighbor search), and the cross-encoder rescores those 50 in 100 to 200 ms on a single GPU. Total end-to-end retrieval latency is under 300 ms, which is acceptable for most interactive applications. If latency is tighter, reduce the candidate pool to 20 or 30 and use a smaller reranker model such as cross-encoder/ms-marco-MiniLM-L-6-v2 instead of a full BERT-large variant.
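The two-stage shape is independent of the particular models, so the sketch below takes both scorers as plain callables. The toy scorers (`word_overlap`, `jaccard`) are stand-ins chosen so the example runs without model downloads; in a real pipeline the fast stage would be an approximate nearest-neighbor lookup and the slow stage would wrap a cross-encoder, for example the `predict` method of a sentence-transformers `CrossEncoder` applied to (query, document) pairs.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def word_overlap(query: str, text: str) -> float:
    """Cheap stand-in for the bi-encoder stage: count shared words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def jaccard(query: str, text: str) -> float:
    """Stand-in for the cross-encoder stage: a stricter, slower-style score."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def two_stage_search(query: str, corpus: dict[str, str], fast: Scorer, slow: Scorer,
                     pool_size: int = 50, top_n: int = 5) -> list[str]:
    # Stage 1: cheap scoring over the whole corpus, keep a small candidate pool.
    pool = sorted(corpus, key=lambda d: fast(query, corpus[d]), reverse=True)[:pool_size]
    # Stage 2: expensive scoring over the pool only.
    reranked = sorted(pool, key=lambda d: slow(query, corpus[d]), reverse=True)
    return reranked[:top_n]


corpus = {
    "a": "refund policy for damaged items",
    "b": "refund policy",
    "c": "shipping times for europe",
}
best = two_stage_search("refund policy", corpus, word_overlap, jaccard, pool_size=2, top_n=1)
```

The expensive scorer runs `pool_size` times per query regardless of corpus size, which is what keeps the latency budget bounded.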

Measuring RAG quality end-to-end (final answer accuracy) hides which stage is broken. Build three separate metrics: retrieval recall at K (did the right chunk appear in the top-K results?), context faithfulness (does the generated answer stay grounded in the retrieved text?), and factual correctness. You cannot fix what you cannot isolate.

RAG vs. Long Context Windows

Frontier models now offer context windows of 1M-2M tokens. The obvious question is whether RAG still makes sense when you can simply paste in everything.

The answer is nuanced and depends on your situation:

Dimension	Long-context loading	RAG
Corpus under 50K tokens	Yes, simpler, no infra	Overkill
Corpus larger than the context window	Not feasible in a single prompt	Yes, required
Real-time / hourly updates	Expensive re-load	Yes, just update index
Cost-per-query sensitivity	High (input tokens are priced)	Yes, retrieve ~2K tokens
Source attribution required	Difficult to trace	Yes, chunk-level citations

The practical heuristic: for a static corpus under 50K tokens, in-context loading with prompt caching is simpler and requires no vector infrastructure. For anything that grows, updates frequently, or must stay cost-efficient at scale, RAG is the right choice.

One consideration often overlooked is cost under document churn. Long-context loading sends the full corpus with every query, and a cached prompt prefix only helps while it is unchanged: if 10% of your documents update daily, the cached corpus is invalidated again and again and must be reprocessed. RAG means re-indexing only the changed documents. Most vector databases support upsert by document ID, so re-indexing a changed document is a targeted operation, not a full rebuild. The incremental update cost favors RAG heavily for dynamic corpora.
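The upsert-by-ID pattern can be illustrated with an in-memory stand-in for a vector store. `TinyIndex` is a hypothetical class, not the API of any real database, and it stores raw chunks where a real store would hold embeddings. It uses a content hash so that unchanged documents are skipped.

```python
import hashlib
from typing import Callable


class TinyIndex:
    """In-memory stand-in for a vector store that supports upsert by document ID."""

    def __init__(self) -> None:
        self.chunks: dict[str, list[str]] = {}   # doc_id -> current chunks
        self.hashes: dict[str, str] = {}         # doc_id -> hash of last indexed text

    def upsert(self, doc_id: str, text: str, chunker: Callable[[str], list[str]]) -> bool:
        """Re-index one document only if its content changed. Returns True if work was done."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # unchanged: skip chunking and embedding entirely
        self.chunks[doc_id] = chunker(text)  # replaces any stale chunks for this ID
        self.hashes[doc_id] = digest
        return True


index = TinyIndex()
split_on_period = lambda text: [s.strip() for s in text.split(".") if s.strip()]
first = index.upsert("policy-1", "Returns within 30 days. Free shipping.", split_on_period)
repeat = index.upsert("policy-1", "Returns within 30 days. Free shipping.", split_on_period)
changed = index.upsert("policy-1", "Returns within 60 days. Free shipping.", split_on_period)
```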

A second practical consideration is source attribution. When an LLM reads 200K tokens and makes a claim, pinpointing which passage supports that claim is a post-hoc analysis problem. RAG pipelines surface chunk-level metadata automatically: each retrieved chunk carries its document ID, section heading, and creation timestamp. Attribution is a byproduct of the architecture, not an added audit step.

Prompt caching (available from Anthropic, OpenAI, and Google) can substantially reduce the cost of long-context loading, by up to roughly 90% on cached input tokens depending on the provider, by reusing previously computed KV representations across queries. This shifts the trade-off but does not eliminate it. You still pay at least full price the first time a prefix is processed, and again whenever a document inside that prefix changes.
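A back-of-the-envelope comparison makes the trade-off easier to see. Every number below is a hypothetical placeholder (a flat input price of $3 per million tokens and a 90% discount on cached tokens); substitute your provider's actual rates, which may also include a surcharge for cache writes that this sketch ignores.

```python
def input_cost(input_tokens: int, price_per_million: float,
               cached_fraction: float = 0.0, cache_discount: float = 0.0) -> float:
    """Input-token cost in dollars for one query, with an optional cached share."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable = fresh + cached * (1.0 - cache_discount)
    return billable * price_per_million / 1_000_000


PRICE = 3.0  # hypothetical dollars per million input tokens

long_context_cold = input_cost(50_000, PRICE)                      # first query, nothing cached
long_context_warm = input_cost(50_000, PRICE, cached_fraction=1.0,
                               cache_discount=0.9)                 # fully cached corpus
rag_query = input_cost(2_000, PRICE)                               # ~2K retrieved tokens
```

Under these assumed prices the warm cached query costs a tenth of the cold one, yet still more than a 2K-token RAG query, and every document change pushes the long-context path back toward the cold figure.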

The RAG Quality Ladder

Think of RAG maturity in tiers:

Tier	Name	What it adds	When to use
1	Naive RAG	Embed → cosine search → top-K → generate	Prototypes only
2	Advanced RAG	Hybrid retrieval (BM25 + dense), cross-encoder reranker, context compression	Minimum viable production baseline
3	Agentic RAG	Model controls retrieval, issues follow-up queries, routes between indexes	Highly varied query patterns

Figure: three maturity tiers (Naive → Advanced → Agentic RAG), each adding retrieval sophistication.

Between tiers there are concrete implementation decisions worth understanding. Moving from Tier 1 to Tier 2 is almost always justified. The components are well-supported: BM25 indexing is available in Elasticsearch, OpenSearch, or the rank-bm25 Python package; cross-encoder reranking is a single method call in the sentence-transformers library. Adding a reranker on a 50-candidate pool costs 100 to 200 ms per query, which is acceptable in most interactive applications. The improvement in answer quality is typically significant enough that the latency cost is not a real debate in practice.

Moving to Tier 3 (agentic) is only justified when queries vary so widely in structure that no single retrieval strategy covers them adequately: some queries need a precise single-document lookup, others require aggregating evidence across dozens of documents, and some require iterative refinement where each retrieval round depends on what the previous round found. Agentic RAG typically involves the model issuing a retrieval tool call, inspecting the results, deciding whether to issue a follow-up query with refined terms, and repeating until it has enough information. This loop introduces latency proportional to the number of rounds (each round adds a full retrieval and LLM inference step) and makes the pipeline significantly harder to evaluate systematically, because the retrieval path differs for every query. Exhaust the deterministic Tier 2 options before introducing agentic loops.
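The control flow of that loop can be sketched without any model at all. In the version below, `retrieve`, `is_sufficient`, and `refine` are hypothetical callables; in a real system the last two would be LLM calls, which is where the extra latency and non-determinism come from.

```python
from typing import Callable


def agentic_retrieve(question: str,
                     retrieve: Callable[[str], list[str]],
                     is_sufficient: Callable[[str, list[str]], bool],
                     refine: Callable[[str, list[str]], str],
                     max_rounds: int = 3) -> tuple[list[str], int]:
    """Retrieve, inspect, and re-query until the evidence suffices or rounds run out."""
    query = question
    evidence: list[str] = []
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        if is_sufficient(question, evidence):
            break
        query = refine(question, evidence)  # follow-up query with refined terms
    return evidence, rounds


# Toy stand-ins so the loop can be exercised deterministically.
kb = {
    "indemnification clause": ["Clause 12 covers indemnification."],
    "indemnification cap": ["Clause 14 caps liability at $1M."],
}
evidence, rounds = agentic_retrieve(
    "indemnification clause",
    retrieve=lambda q: kb.get(q, []),
    is_sufficient=lambda question, ev: len(ev) >= 2,
    refine=lambda question, ev: "indemnification cap",
)
```

The hard `max_rounds` cap is the main safeguard: without it, a model that never judges the evidence sufficient would loop indefinitely.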

Reach for Tier 2 before Tier 3. Most production RAG improvements come from better retrieval precision (smarter chunking, hybrid search, a reranker), not from adding agentic loops. Agentic RAG introduces non-determinism and is harder to evaluate; earn it by exhausting deterministic options first.

Worked Example: A Two-Stage Retrieval Pipeline

Consider a legal document search system. Queries come in two shapes: precise ("what is the indemnification clause in contract 4471?") and fuzzy ("what contracts have unusual liability exposure?").

A production pipeline for this system would:

Index each contract clause as an overlapping chunk (200 tokens, 40-token overlap), stored in a vector database alongside a BM25 inverted index
At query time, run both searches in parallel and merge with RRF
Rerank the top-20 merged candidates with a legal-domain cross-encoder, keeping the top-5
Generate a structured answer, citing clause IDs from the retrieved chunks

The clause ID citation is the payoff: because every retrieved chunk carries its source metadata, the generated answer can point back to the exact document section, making the system auditable in a way a pure LLM never could be.
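One way to make citation possible is to put the clause ID in front of each chunk in the prompt. The sketch below assumes each retrieved chunk is a dictionary with `clause_id` and `text` keys; that schema, the ID format, and the instruction wording are all illustrative choices.

```python
def build_cited_prompt(question: str, chunks: list[dict[str, str]]) -> str:
    """Assemble a prompt in which every clause is labeled with its citable ID."""
    lines = [
        "Answer using only the clauses below. Cite clause IDs in square brackets.",
        "",
    ]
    for chunk in chunks:
        lines.append(f"[{chunk['clause_id']}] {chunk['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


chunks = [
    {"clause_id": "4471-12", "text": "The supplier indemnifies the buyer against third-party claims."},
    {"clause_id": "4471-14", "text": "Liability is capped at the total contract value."},
]
prompt = build_cited_prompt("What is the indemnification clause in contract 4471?", chunks)
```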

With Concrete Numbers

A production evaluation on a held-out set of 200 labeled queries showed the following progression after each pipeline change:

Pipeline variant	Recall@10	Faithfulness	Avg context tokens sent to LLM
Naive RAG (dense only, top 10)	61%	74%	2800
Hybrid BM25 + dense, RRF, top 10	79%	81%	2800
Hybrid + cross-encoder reranker, top 5	86%	91%	1400

The retrieval path for each query: BM25 returns its top 25 candidates; dense retrieval returns its own top 25. RRF merges the two lists into a pool of up to 50 unique documents (many candidates appear in both lists, so the actual pool is typically 30 to 40 unique entries). The cross-encoder scores every document in that pool against the query in a single batched forward pass, and the top 5 are passed to the LLM with their contract IDs included as metadata in the prompt.

Faithfulness here is measured by an LLM-as-judge approach: for each sentence in the generated answer, a judge model checks whether a retrieved clause directly supports that sentence. A faithfulness score of 91% means 9% of generated sentences contain claims not traceable to any retrieved chunk. Most of these occur when the model bridges two clauses with an implied inference that is not stated in either. The fix is a prompt instruction to state only what is explicitly supported by the provided clauses, not to derive or infer across them.
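The sentence-level score described above reduces to a simple ratio once a judge is available. In this sketch the judge is passed in as a callable; the substring check used in the example is only a deterministic stand-in for the LLM judge, and it would be far too strict for real answers.

```python
from typing import Callable


def faithfulness(answer_sentences: list[str], retrieved_chunks: list[str],
                 is_supported: Callable[[str, str], bool]) -> float:
    """Share of answer sentences supported by at least one retrieved chunk."""
    if not answer_sentences:
        return 0.0
    supported = sum(
        1 for sentence in answer_sentences
        if any(is_supported(sentence, chunk) for chunk in retrieved_chunks)
    )
    return supported / len(answer_sentences)


retrieved_chunks = [
    "Clause 12: the supplier indemnifies the buyer.",
    "Clause 14: liability is capped at contract value.",
]
answer_sentences = [
    "the supplier indemnifies the buyer",
    "liability is capped at contract value",
    "the cap therefore excludes indemnified claims",  # an inference stated in neither clause
]
toy_judge = lambda sentence, chunk: sentence.lower() in chunk.lower()
score = faithfulness(answer_sentences, retrieved_chunks, toy_judge)
```

The third sentence is exactly the bridging inference described above: plausible, but not stated in either clause, so it counts against the score.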

The drop in context tokens from 2800 to 1400 (from top 10 to top 5 after reranking) reduced per-query cost by roughly 15% and measurably improved response coherence, because the model was processing five tightly relevant clauses rather than five relevant clauses plus five marginally related ones that added noise.

Figure: the production retrieval pattern, in which hybrid recall (BM25 + dense) narrows to top-K and cross-encoder reranking selects top-n.

Interview angle

When should you use RAG versus fine-tuning versus long-context loading?

What they are probing for: choosing the right knowledge injection strategy

These three techniques address different problems and are not mutually exclusive. Fine-tuning updates the model's weights to change how it responds (tone, format, response style, domain vocabulary) but does not inject new facts reliably and becomes stale the moment a document changes. Long-context loading is appropriate for a small, static corpus under 50K tokens where you want simplicity and have no latency or cost constraints. RAG is the right default for any knowledge base that grows, changes frequently, or is queried at scale, because you pay only for retrieved tokens per query, updates are incremental re-indexes, and every claim is traceable to a source chunk. In practice the three approaches compose: fine-tune for response style, use RAG for factual grounding, and use long-context loading for one-off deep analyses on bounded documents.

Answer quality on your RAG system is low. How do you figure out whether the problem is in retrieval or in generation?

What they are probing for: systematic fault isolation in RAG pipelines

Instrument the two stages independently before touching any code. Compute Recall@10 on a labeled evaluation set: for each query, does the correct document appear in the top 10 retrieved chunks? If Recall@10 is below 70%, the failure is in retrieval and you should investigate chunk size, embedding model quality, and whether hybrid search would help. If Recall@10 is high but answer quality is still poor, the failure is in generation. Within the generation layer, compute faithfulness: for each sentence in the generated answer, check whether it is supported by a retrieved chunk. Low faithfulness means the model is generating claims that go beyond the retrieved context, which a stricter system prompt or a smaller top-K often fixes. High faithfulness with wrong answers means the retrieved content itself is outdated or incorrect, which is a data pipeline and index freshness problem.

Why use hybrid retrieval plus a cross-encoder reranker instead of dense retrieval alone?

What they are probing for: understanding retrieval architecture tradeoffs

Dense retrieval generalizes well across paraphrase but fails on exact-match queries involving product codes, proper nouns, or rare technical terms that the embedding model has not seen in sufficient training context. BM25 catches these precisely because it scores on raw term frequency and inverse document frequency, with no learned representations that can drift on domain shift. Combining both via Reciprocal Rank Fusion is cheap and requires no score normalization because you are merging integer ranks rather than calibrating scores across systems. The cross-encoder reranker then applies a higher-quality joint-attention model to the merged shortlist (typically 50 candidates) at 100 to 200 ms on a GPU. The architecture layers cost and accuracy: fast recall from two complementary signals, then precise ranking from a slower but more accurate model that sees token-level interactions between query and document.

How would you evaluate a RAG system end to end?

What they are probing for: evaluation methodology and metric selection

A complete evaluation stack has three layers. First, retrieval quality: Recall@K measures whether the correct chunk appears in the top K results; MRR (Mean Reciprocal Rank) measures how highly it is ranked. Second, context quality: context relevance scores how well the retrieved chunks match the query, flagging retrievals that are technically close in embedding space but do not actually answer the question. Third, generation quality: faithfulness checks whether the answer contains only claims grounded in retrieved chunks (detecting generation-layer hallucination), and answer correctness compares the final answer to a ground-truth reference. RAGAS is a widely used open-source framework that automates these metrics using an LLM as judge. Running all three layers on 200 to 500 representative queries lets you isolate which layer is degrading and direct optimization effort precisely rather than tuning everything at once.
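Of the retrieval metrics named above, MRR is the one most often implemented slightly wrong, so a sketch may help. It assumes a labeled set mapping query IDs to relevant chunk IDs and a best-first result list per query; queries with no relevant result contribute zero.

```python
def mean_reciprocal_rank(retrieved: dict[str, list[str]], relevant: dict[str, set[str]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none found)."""
    if not relevant:
        return 0.0
    total = 0.0
    for query_id, gold in relevant.items():
        for rank, chunk_id in enumerate(retrieved.get(query_id, []), start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)


retrieved = {
    "q1": ["c1", "c2"],              # relevant at rank 1
    "q2": ["c9", "c8", "c7", "c4"],  # relevant at rank 4
    "q3": ["c5", "c6"],              # no relevant result
}
relevant = {"q1": {"c1"}, "q2": {"c4"}, "q3": {"c0"}}
mrr = mean_reciprocal_rank(retrieved, relevant)
```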

Your RAG corpus has documents that are updated daily and some sources that contradict each other. How do you handle this?

What they are probing for: production data quality and consistency

Stale documents are an indexing pipeline problem. Track modification timestamps on source documents and trigger re-embedding on change, using upsert by document ID in the vector store so only changed chunks are replaced rather than the full corpus rebuilt. Conflicting documents require a different approach. One pattern is to include document metadata (version number, timestamp, source authority tier) inside each chunk at index time and prompt the model to prefer the more recent or more authoritative source when it encounters disagreement. A second approach is to detect conflict in the retrieved set before generation: if top-K chunks have high mutual semantic similarity but contain contradictory surface claims, route that query to a human review queue rather than auto-generating an answer. Both patterns require that you index with rich metadata from the start, which is easy to add early and painful to retrofit.
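The metadata-preference pattern can also be applied deterministically before the prompt is built, by ordering retrieved chunks so the most authoritative and most recent come first. The field names (`authority_tier`, where 1 is most authoritative, and `updated_at` as an ISO date string) are hypothetical; this is a sketch of one possible tie-breaking rule, not a full conflict detector.

```python
def order_by_authority(chunks: list[dict]) -> list[dict]:
    """Sort by authority tier (1 = highest), breaking ties by most recent update."""
    # Python's sort is stable, so sort by the secondary key first, then the primary.
    by_recency = sorted(chunks, key=lambda c: c["updated_at"], reverse=True)
    return sorted(by_recency, key=lambda c: c["authority_tier"])


chunks = [
    {"id": "wiki-old", "authority_tier": 2, "updated_at": "2023-01-10", "text": "Refund window is 14 days."},
    {"id": "policy-v1", "authority_tier": 1, "updated_at": "2023-06-01", "text": "Refund window is 30 days."},
    {"id": "policy-v2", "authority_tier": 1, "updated_at": "2024-02-15", "text": "Refund window is 60 days."},
]
ordered_ids = [c["id"] for c in order_by_authority(chunks)]
```

ISO date strings sort correctly as plain text, which is why no date parsing is needed here.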

What signals tell you to adjust your chunking strategy?

What they are probing for: diagnosing retrieval problems empirically

The clearest signal is a mismatch between where logical units of text end and where your chunk boundaries fall. If retrieved chunks frequently contain part of an answer but not all of it, chunks are too small and you are splitting logical units across boundaries: increase chunk size or add overlap of 40 to 64 tokens. If retrieved chunks are large but semantically noisy (the right paragraph is present but surrounded by unrelated content), chunks are too large and the embedding signal is diluted: reduce chunk size or switch to sentence-level indexing with parent-document expansion. A concrete diagnostic: build a test set of questions whose answers fit in a single paragraph, measure Recall@10 with zero overlap versus 64-token overlap, and compare. A recall jump above 15 percentage points confirms that chunk boundary splits are a real problem worth fixing before investigating the embedding model.
