Ch 6 — Retrieval Strategies

Getting the right chunks to the LLM
High level: Query → Dense → Sparse → Hybrid → Rerank → Filter → Context
Why Retrieval Is the Bottleneck
The quality of your RAG output is capped by the quality of retrieved chunks
The Retrieval Problem
The LLM can only answer based on what you put in its context window. If retrieval misses the right chunk, the answer will be wrong — no matter how good the model is. In practice, retrieval quality, more than model choice, tends to dominate RAG answer quality, which makes improving retrieval the highest-leverage optimization you can make.
What Can Go Wrong
Missed relevant chunks: The right information exists but was not retrieved (low recall).

Noisy results: Retrieved chunks are vaguely related but don’t contain the answer (low precision).

Wrong granularity: The chunk is too big (dilutes the answer) or too small (missing context).

Semantic gap: The query and the answer use different words for the same concept.
Retrieval Strategies Overview
This chapter covers the main strategies to improve retrieval:

1. Dense retrieval — Embedding-based semantic search (the default)
2. Sparse retrieval — Keyword-based search (BM25)
3. Hybrid search — Combining dense + sparse
4. Reranking — A second-pass model that re-scores results
5. Metadata filtering — Narrowing the search space
6. Multi-index strategies — Searching across multiple collections
The retrieval pipeline is a funnel. Start broad (retrieve 20–50 candidates), then narrow with reranking and filtering to the final 3–5 chunks that go into the LLM prompt. The broad first pass protects recall; the later stages raise precision.
Dense Retrieval (Semantic Search)
Finding chunks by meaning, not keywords
How It Works
Embed the query using the same model that embedded the chunks. Find the top-k nearest vectors by cosine similarity. This is the default retrieval method in most RAG systems and what you get out of the box with LangChain and LlamaIndex.
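Under the hood this is nearest-neighbor search over vectors. A minimal plain-Python sketch of the idea, using toy 3-dimensional vectors in place of real embeddings (production systems use an approximate nearest-neighbor index rather than this brute-force scan):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, chunk_vecs, k=2):
    # Rank chunk vectors by similarity to the query vector, best first.
    scored = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return scored[:k]

query = [1.0, 0.2, 0.0]            # toy "embedding" of the query
chunks = [[0.9, 0.1, 0.1],         # similar direction to the query
          [0.0, 1.0, 0.0],         # nearly orthogonal
          [1.0, 0.3, 0.1]]         # most similar
best = top_k(query, chunks, k=2)   # [2, 0]
```

Real embedding vectors have hundreds to thousands of dimensions, but the ranking logic is exactly this.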
Strengths
Semantic understanding: “How do I get my money back?” matches “refund policy” even though they share no keywords.

Multilingual: Models like Cohere embed-v3 and BGE-M3 work across languages — query in English, retrieve in Spanish.

Zero-shot: No training on your specific data needed.
Weaknesses
Exact terms: Struggles with product codes (“SKU-4829”), error codes (“E_TIMEOUT”), or proper nouns the model hasn’t seen.

Negation: “What is NOT covered by the warranty?” may retrieve chunks about what IS covered.

Long queries: Embedding a long query into a single vector can dilute the signal.
# Dense retrieval with LangChain
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},
)
docs = retriever.invoke("refund policy")
Dense retrieval is your baseline. Always start here. If it works well enough, you don’t need the complexity of hybrid search or reranking. Only add layers when you identify specific failure modes.
Sparse Retrieval (BM25 / Keyword Search)
The classic information retrieval approach that still works
What Is BM25?
BM25 (Best Matching 25) is the standard keyword-based ranking algorithm used by Elasticsearch, Solr, and Lucene. It scores documents based on term frequency (how often the query term appears in the document) and inverse document frequency (how rare the term is across all documents). It has been the backbone of search engines for decades.
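The scoring formula is compact enough to sketch directly. A toy plain-Python version (k1 and b are BM25's standard free parameters; real engines compute this over an inverted index, not a brute-force scan):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    # doc and each corpus entry are token lists; score doc against the query.
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)  # rarer terms weigh more
        tf = doc.count(term)                               # term frequency in doc
        norm = k1 * (1.0 - b + b * len(doc) / avgdl)       # length normalization
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score

corpus = ["sku-4829 is compatible with model x".split(),
          "refund policy for all products".split(),
          "sku-4829 replacement parts list".split()]
query = "sku-4829 compatibility".split()
ranked = sorted(range(len(corpus)),
                key=lambda i: bm25_score(query, corpus[i], corpus),
                reverse=True)
# ranked == [2, 0, 1]: both sku-4829 docs beat the refund doc
```

Two classic BM25 behaviors show up even in this toy run: "compatibility" never matches "compatible" (no semantic understanding, and no stemming here), and the shorter sku-4829 document ranks first thanks to length normalization.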
Strengths
Exact match: Perfect for product codes, error codes, names, and technical terms.

Transparent: You can see exactly why a document matched (which terms).

Fast: Inverted indexes are extremely efficient. No GPU needed.

No embedding model: Works without any ML model.
Weaknesses
No semantic understanding: “car” does not match “automobile.” Synonyms, paraphrases, and conceptual similarity are invisible.

Vocabulary mismatch: If the user and the document use different terms for the same thing, BM25 fails.
# BM25 with LangChain
from langchain_community.retrievers import BM25Retriever

bm25 = BM25Retriever.from_documents(
    documents=chunks,
    k=10,
)
docs = bm25.invoke("SKU-4829 compatibility")
BM25 is not obsolete. For queries with specific identifiers, codes, or exact phrases, BM25 often outperforms dense retrieval. The best RAG systems use both — that’s hybrid search.
Hybrid Search
Combining the best of dense and sparse retrieval
How It Works
Run both dense (vector) and sparse (BM25) retrieval in parallel. Merge the results using a fusion algorithm. The most common approach is Reciprocal Rank Fusion (RRF): each result gets a score based on its rank in each list, and the final ranking is the sum of these scores.
RRF Formula
score(d) = Σ_i 1 / (k + rank_i(d))

Where k is a constant (typically 60) and rank_i(d) is the rank of document d in the i-th retrieval list. Documents that appear high in both lists get the highest combined score.
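RRF itself is only a few lines. A sketch, assuming each retriever returns a ranked list of document ids, best first:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # score(d) = sum over lists i of 1 / (k + rank_i(d)); ranks start at 1.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["d3", "d1", "d7", "d2"]  # ranked ids from vector search
sparse_hits = ["d1", "d9", "d3", "d5"]  # ranked ids from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# d1 and d3 appear high in both lists, so they lead the fused ranking.
```

Because only ranks matter, RRF needs no score normalization between the two retrievers, which is why it is the default fusion method in most hybrid implementations.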
Native Hybrid Search
Weaviate: Built-in hybrid search with configurable alpha (0 = pure BM25, 1 = pure vector).
Qdrant: Sparse-dense vectors in the same collection.
Pinecone: Sparse-dense vectors via dotproduct.
Elasticsearch: kNN + BM25 in a single query.
# LangChain — Ensemble Retriever (hybrid)
from langchain.retrievers import EnsembleRetriever

hybrid = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.5, 0.5],
)
docs = hybrid.invoke("SKU-4829 refund policy")

# Weaviate — native hybrid
results = collection.query.hybrid(
    query="SKU-4829 refund policy",
    alpha=0.5,  # 0 = BM25, 1 = vector
    limit=10,
)
Hybrid search is the single biggest retrieval improvement for most RAG systems. Research consistently shows hybrid outperforms either dense or sparse alone. Start with equal weights (0.5/0.5) and tune from there. If your queries are mostly keyword-heavy, shift toward BM25; if mostly conceptual, shift toward dense.
Reranking
A second-pass model that dramatically improves precision
What Is Reranking?
A reranker is a cross-encoder model that takes the query and a candidate document together as input and outputs a relevance score. Unlike bi-encoders (which embed query and document separately), cross-encoders see both at once — enabling much deeper understanding of relevance. The trade-off: they are too slow for first-pass retrieval (must compare every pair), so they are used as a second pass on the top-k candidates.
Popular Rerankers
Cohere Rerank — API-based. State-of-the-art quality. Supports 100+ languages. The most popular choice.

Jina Reranker — API and open-weight models. Fast, multilingual.

BGE Reranker (BAAI) — Open-source. Self-hostable. bge-reranker-v2-m3 supports multilingual.

FlashRank — Lightweight, runs locally. Good for prototyping.
# Cohere Rerank with LangChain
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,  # retrieves 20 candidates
)
# Returns top 5 after reranking
docs = reranking_retriever.invoke("refund policy")
The retrieve-then-rerank pattern: Retrieve 20–50 candidates with fast retrieval (dense, sparse, or hybrid), then rerank to the top 3–5 with a cross-encoder. This gives you the recall of broad retrieval with the precision of deep relevance scoring. Cohere reports 30–50% improvement in answer quality from adding reranking.
Metadata Filtering & Multi-Index
Narrowing the search space for better results
Metadata Filtering
Restrict search to relevant subsets before vector search runs. Common filters:

Source type: Only search policies, only search FAQs
Date range: Only documents from the last 6 months
Access control: Only documents the user has permission to see
Department: Only HR docs, only engineering docs

Filters reduce noise and improve precision without any model changes.
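Vector databases apply such filters inside the index so that filtering and similarity search happen together, but the logic itself is simple. A plain-Python sketch with illustrative field names:

```python
def apply_filters(chunks, **filters):
    # Keep only chunks whose metadata matches every filter key/value pair.
    return [c for c in chunks
            if all(c["metadata"].get(key) == value
                   for key, value in filters.items())]

chunks = [
    {"text": "Refunds within 30 days...", "metadata": {"dept": "HR", "year": 2024}},
    {"text": "Blue/green deploys...",     "metadata": {"dept": "ENG", "year": 2024}},
    {"text": "Old refund rules...",       "metadata": {"dept": "HR", "year": 2021}},
]
# Only HR documents from 2024 survive; vector search then runs on this subset.
hr_2024 = apply_filters(chunks, dept="HR", year=2024)
```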
Multi-Index Retrieval
Search across multiple collections or indexes and merge results. Useful when your data has different structures:

Separate by source: One index for docs, one for Slack, one for Jira
Separate by granularity: One index for paragraphs, one for full pages
Separate by language: One index per language
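A simple way to merge results from several indexes is a deduplicating round-robin interleave (rank fusion such as RRF also works). A sketch with stub ranked result lists; the index names and ids are illustrative:

```python
from itertools import chain, zip_longest

def interleave(result_lists, top_n=5):
    # Round-robin over the indexes, skipping duplicates and padding (None).
    merged, seen = [], set()
    for doc_id in chain.from_iterable(zip_longest(*result_lists)):
        if doc_id is not None and doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
        if len(merged) == top_n:
            break
    return merged

# Stub ranked results from three per-source indexes.
docs_hits  = ["docs:refund-policy", "docs:shipping"]
slack_hits = ["slack:thread-812", "slack:thread-530"]
jira_hits  = ["jira:PAY-102"]
merged = interleave([docs_hits, slack_hits, jira_hits])
```

Prefixing ids with the source name, as above, keeps provenance visible after the merge, which helps when assembling citations later in the pipeline.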
# LangChain — self-query retriever
# LLM extracts filters from natural language
from langchain.retrievers.self_query.base import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Company policies",
    metadata_field_info=metadata_fields,
)
# User: "What is the HR refund policy from 2024?"
# LLM extracts: query="refund policy"
#               filter: department="HR" AND year=2024
Self-query retrieval uses the LLM to automatically extract metadata filters from the user’s natural language query. The user says “What’s the HR refund policy from 2024?” and the system automatically filters to department=HR and year=2024 before searching. LangChain’s SelfQueryRetriever implements this pattern.
Putting It All Together
A practical retrieval pipeline for production RAG
The Production Retrieval Pipeline
Stage 1 — Filter: Apply metadata filters (access control, date, source type) to narrow the search space.

Stage 2 — Retrieve: Run hybrid search (dense + BM25) to get 20–50 candidates. Broad recall is the goal here.

Stage 3 — Rerank: Pass candidates through a cross-encoder reranker. Narrow to the top 3–5 most relevant chunks.

Stage 4 — Assemble: Order the final chunks, add source citations, and format them into the LLM prompt context.
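The four stages can be wired together in a few lines. A sketch with toy stand-ins for the retrieve and rerank steps (in production these would be hybrid search and a cross-encoder):

```python
def retrieval_pipeline(query, chunks, user_filters, retrieve, rerank, top_n=3):
    # Stage 1 (filter): narrow the search space by metadata.
    allowed = [c for c in chunks
               if all(c["metadata"].get(k) == v for k, v in user_filters.items())]
    # Stage 2 (retrieve): broad first pass for recall.
    candidates = retrieve(query, allowed)
    # Stage 3 (rerank): score each (query, chunk) pair, keep the best few.
    ranked = sorted(candidates, key=lambda c: rerank(query, c), reverse=True)
    # Stage 4 (assemble): format the winners with source citations.
    return "\n\n".join(f"[{c['metadata']['source']}] {c['text']}"
                       for c in ranked[:top_n])

# Toy stand-ins: keyword containment for retrieval, term counting for reranking.
def toy_retrieve(query, chunks):
    words = query.lower().split()
    return [c for c in chunks if any(w in c["text"].lower() for w in words)]

def toy_rerank(query, chunk):
    return sum(chunk["text"].lower().count(w) for w in query.lower().split())

chunks = [
    {"text": "Refunds are issued within 30 days.",
     "metadata": {"source": "policy.md", "dept": "HR"}},
    {"text": "Refund requests go through the portal. Refund times vary.",
     "metadata": {"source": "faq.md", "dept": "HR"}},
    {"text": "Engineering on-call rotation schedule.",
     "metadata": {"source": "eng.md", "dept": "ENG"}},
]
context = retrieval_pipeline("refund", chunks, {"dept": "HR"},
                             toy_retrieve, toy_rerank)
```

The structure is the point: each stage takes the previous stage's output, and any stage can be upgraded independently as failure modes appear.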
When to Add Complexity
Start simple: Dense retrieval with k=5. Test with real queries.

Add BM25 if: Users search for specific codes, names, or exact phrases that dense retrieval misses.

Add reranking if: You get relevant chunks in the top 20 but not the top 5. Reranking surfaces them.

Add self-query if: Users naturally include filter criteria in their questions (“2024 HR policies”).
Measure before optimizing. Build an evaluation set of 50–100 query-answer pairs. Measure retrieval recall@5 and precision@5. Only add complexity (hybrid, reranking) when you can prove it improves these metrics. Every layer adds latency and cost.
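Both metrics are one-liners once each evaluation query has a gold set of relevant chunk ids. A sketch:

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the relevant chunks that show up in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k=5):
    # Fraction of the top-k results that are actually relevant.
    return len(set(retrieved[:k]) & set(relevant)) / k

retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked ids from the retriever
relevant  = ["c2", "c4", "c8"]              # gold ids for this query
r = recall_at_k(retrieved, relevant)        # 2 of 3 relevant found -> 2/3
p = precision_at_k(retrieved, relevant)     # 2 of 5 results relevant -> 0.4
```

Average these over the full evaluation set and track them on every pipeline change; a layer that doesn't move the numbers isn't earning its latency.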