Ch 7 — Query Transformation

Rewriting the user’s question before retrieval
High Level
Raw Query → Rewrite → Decompose → HyDE → Step-Back → Multi-Query → Retrieve
Why Transform Queries?
User questions are rarely optimal for retrieval
The Problem
Users ask questions in natural, conversational language. But the best retrieval query is often different from what the user typed. Common issues:

Vague queries: “Tell me about that thing we discussed” — no useful search terms.

Complex questions: “How does X compare to Y in terms of A, B, and C?” — too many concepts for a single embedding.

Vocabulary mismatch: The user says “get my money back” but the docs say “refund policy.”

Conversational context: “What about the pricing?” — depends on what was discussed before.
Query Transformation Strategies
This chapter covers the main techniques to improve queries before retrieval:

1. Query Rewriting — LLM rewrites the query for better retrieval
2. Sub-Question Decomposition — Break complex questions into simpler parts
3. HyDE — Generate a hypothetical answer, embed that instead
4. Step-Back Prompting — Ask a broader question first
5. Multi-Query — Generate multiple query variations

All of these use the LLM as a pre-processing step before retrieval.
Query transformation sits between the user and retrieval. The user’s raw question goes in, a better search query (or queries) comes out. This is one of the highest-leverage improvements in advanced RAG — often more impactful than changing the embedding model or vector store.
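Stripped of any specific framework, every strategy in this chapter shares the same shape: expand or rewrite the raw query, retrieve per query, and merge the results. A minimal sketch, with `transform` and `retrieve` as stubs standing in for the LLM call and the vector-store search:

```python
# Generic transform-then-retrieve pipeline (stubbed for illustration).
# `transform` and `retrieve` are placeholders for an LLM call and a
# vector-store search, respectively.

def transform(query: str) -> list[str]:
    # A real implementation would call an LLM and return one or more
    # rewritten queries; here we pass the query through unchanged.
    return [query]

def retrieve(query: str, k: int = 5) -> list[str]:
    # Placeholder for e.g. vectorstore.similarity_search(query, k=k).
    return [f"doc-for:{query}"]

def transformed_retrieval(raw_query: str) -> list[str]:
    docs: list[str] = []
    seen: set[str] = set()
    for q in transform(raw_query):   # 1. rewrite / expand the query
        for doc in retrieve(q):      # 2. retrieve per query
            if doc not in seen:      # 3. deduplicate the union
                seen.add(doc)
                docs.append(doc)
    return docs

print(transformed_retrieval("What about the pricing?"))
```

Each strategy below is a different implementation of `transform`: rewriting returns one better query, multi-query returns several variations, decomposition returns sub-questions, and HyDE returns a hypothetical answer to embed instead.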
Query Rewriting
LLM rewrites the user’s question for better retrieval
How It Works
Send the user’s query to an LLM with a prompt like: “Rewrite this question to be more specific and better suited for searching a knowledge base.” The LLM returns a cleaner, more precise version that retrieves better results.
What Rewriting Fixes
Conversational context: “What about the pricing?” → “What is the pricing for the Enterprise plan discussed earlier?” (using chat history)

Ambiguity: “How do I set it up?” → “How do I set up the Stripe payment integration?”

Typos & grammar: “how too cancle subscrption” → “How to cancel a subscription?”
```python
# LangChain — query rewriting
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

rewrite_prompt = ChatPromptTemplate.from_template(
    """Given the conversation history and the user's question,
rewrite the question to be a standalone, specific search query.

Chat history: {chat_history}

User question: {question}

Rewritten query:"""
)

chain = rewrite_prompt | llm | StrOutputParser()

better_query = chain.invoke({
    "chat_history": history,
    "question": "What about the pricing?"
})
```
Query rewriting is the simplest and most universally useful transformation. It is especially critical for conversational RAG where follow-up questions depend on context. LangChain’s create_history_aware_retriever does this automatically.
Sub-Question Decomposition
Break complex questions into simpler, retrievable parts
When to Use
Complex questions that span multiple topics or require information from different documents. A single retrieval cannot find all the pieces.

Example: “How does our refund policy compare to our competitor’s, and what are the legal implications?”

This needs information from: (1) our refund policy, (2) competitor’s refund policy, (3) legal requirements. No single chunk contains all three.
How It Works
The LLM breaks the question into sub-questions. Each sub-question is retrieved independently. The results are combined and the LLM synthesizes a final answer from all retrieved chunks.
```python
# Sub-question decomposition
decompose_prompt = ChatPromptTemplate.from_template(
    """Break this complex question into 2-4 simpler sub-questions
that can each be answered independently:

Question: {question}

Sub-questions:"""
)

# Input:  "How does our refund policy compare
#          to competitors, and what are the
#          legal implications?"
# Output:
# 1. What is our current refund policy?
# 2. What are our competitors' refund policies?
# 3. What are the legal requirements for
#    refund policies in our jurisdiction?

# Each sub-question → separate retrieval
# All results → combined context → final answer
```
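The fan-out described above can be sketched end to end. In this sketch, `decompose`, `retrieve`, and the final synthesis step are stubs standing in for the LLM and vector-store calls:

```python
def decompose(question: str) -> list[str]:
    # Stub for the LLM decomposition call; a real implementation
    # would send `question` through the decomposition prompt.
    return [
        "What is our current refund policy?",
        "What are our competitors' refund policies?",
        "What are the legal requirements for refund policies?",
    ]

def retrieve(query: str) -> list[str]:
    # Stub for a vector-store search returning chunks.
    return [f"chunk-for:{query}"]

def answer_complex(question: str) -> str:
    sub_questions = decompose(question)
    # Retrieve independently for each sub-question.
    context = [doc for sq in sub_questions for doc in retrieve(sq)]
    # Stub for the synthesis LLM call over all retrieved chunks.
    return f"Answer synthesized from {len(context)} chunks"

print(answer_complex(
    "How does our refund policy compare to competitors, "
    "and what are the legal implications?"
))
```

The key design point is that each sub-question gets its own retrieval pass, so chunks relevant to only one facet of the question still make it into the final context.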
LlamaIndex’s SubQuestionQueryEngine implements this pattern natively. It decomposes the question, routes each sub-question to the appropriate index (if you have multiple), retrieves independently, and synthesizes a final answer. LangChain offers similar functionality via custom chains.
HyDE (Hypothetical Document Embeddings)
Generate a fake answer, embed that instead of the question
The Insight
A question and its answer use very different language. “What is the refund policy?” (question) vs. “Customers may request a full refund within 30 days of purchase…” (answer). The answer is much closer in embedding space to the actual document chunks. HyDE (Gao et al., 2022) exploits this by generating a hypothetical answer and embedding that instead.
How It Works
Step 1: Ask the LLM to generate a hypothetical answer to the question (without any retrieved context).
Step 2: Embed the hypothetical answer (not the original question).
Step 3: Use that embedding to search the vector store.

The hypothetical answer doesn’t need to be correct — it just needs to be in the same “language” as the real documents.
```python
# HyDE implementation
hyde_prompt = ChatPromptTemplate.from_template(
    """Write a short paragraph that would answer this question.
It does not need to be accurate — just write in the style
of a knowledge base article.

Question: {question}

Hypothetical answer:"""
)

# 1. Generate hypothetical answer (StrOutputParser extracts
#    the plain string from the chat model's message)
hypo_answer = (hyde_prompt | llm | StrOutputParser()).invoke(
    {"question": "What is the refund policy?"}
)
# → "Our refund policy allows customers to
#    request a full refund within 30 days..."

# 2. Embed the hypothetical answer
hypo_embedding = embeddings.embed_query(hypo_answer)

# 3. Search with that embedding
docs = vectorstore.similarity_search_by_vector(
    hypo_embedding, k=5
)
```
HyDE works best when there is a large vocabulary gap between questions and documents. It is especially effective for technical documentation, legal text, and academic papers where the document language is very different from how users ask questions. Trade-off: adds one LLM call (~200–500ms) before retrieval.
Step-Back Prompting
Ask a broader question to get better context
The Idea
Step-back prompting (Zheng et al., 2023) asks the LLM to generate a more general, higher-level question from the user’s specific query. The broader question retrieves foundational context that helps answer the specific question.
Example
User query: “Why did the revenue drop in Q3 2024 for the EMEA region?”

Step-back question: “What were the key factors affecting EMEA revenue performance in 2024?”

The step-back question retrieves broader context about EMEA performance, market conditions, and strategic changes — which helps explain the specific Q3 drop.
```python
# Step-back prompting
stepback_prompt = ChatPromptTemplate.from_template(
    """Given this specific question, generate a more general
step-back question that would help retrieve useful
background context.

Specific question: {question}

Step-back question:"""
)

# Retrieve for BOTH the original and step-back
original_docs = retriever.invoke(question)

stepback_q = (stepback_prompt | llm | StrOutputParser()).invoke(
    {"question": question}
)
stepback_docs = retriever.invoke(stepback_q)

# Combine both sets of results
all_docs = original_docs + stepback_docs

# Generate answer with broader context
answer = generate(question, all_docs)
```
Step-back is complementary, not a replacement. You retrieve for both the original query and the step-back query, then combine the results. This gives the LLM both the specific details and the broader context needed for a complete answer.
Multi-Query Retrieval
Generate multiple query variations for broader recall
How It Works
Ask the LLM to generate 3–5 different versions of the user’s question, each phrased differently. Run retrieval for each variation. Deduplicate and combine the results. This captures documents that any single phrasing might miss.
Why It Helps
Different phrasings match different chunks. “How to cancel?” might retrieve the cancellation guide, while “End my subscription” retrieves the account management docs, and “Stop billing” retrieves the payment FAQ. Together, you get a more complete picture.
```python
# LangChain — MultiQueryRetriever
from langchain.retrievers.multi_query import (
    MultiQueryRetriever
)

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# User: "How to cancel my subscription?"
# LLM generates:
# 1. "How do I cancel my subscription?"
# 2. "Steps to end my membership"
# 3. "How to stop recurring billing"

# Each query → retrieval → deduplicate
docs = multi_retriever.invoke(
    "How to cancel my subscription?"
)
```
LangChain’s MultiQueryRetriever handles this end-to-end: generates variations, retrieves for each, deduplicates by document ID, and returns the union. It typically generates 3 variations by default. Combine with reranking to sort the merged results by relevance.
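If you implement the merge yourself rather than relying on MultiQueryRetriever, deduplication is a small helper. This is a sketch; the `"id"` key in each chunk's metadata is an assumption about your chunk schema:

```python
def dedupe_by_id(doc_lists):
    # Merge retrieval results from several query variations,
    # keeping the first occurrence of each document id so the
    # original ranking order is preserved.
    seen = set()
    merged = []
    for docs in doc_lists:
        for doc in docs:
            doc_id = doc["metadata"]["id"]  # assumed chunk schema
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc)
    return merged

results = dedupe_by_id([
    [{"metadata": {"id": "a"}}, {"metadata": {"id": "b"}}],
    [{"metadata": {"id": "b"}}, {"metadata": {"id": "c"}}],
])
print([d["metadata"]["id"] for d in results])  # → ['a', 'b', 'c']
```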
Choosing the Right Strategy
A practical decision framework
Decision Guide
Conversational RAG (follow-up questions):
Use Query Rewriting. Essential for chat-based interfaces where questions depend on context.

Complex multi-part questions:
Use Sub-Question Decomposition. When the answer requires information from multiple different documents.

Technical/specialized documents:
Use HyDE. When document language is very different from how users ask questions.

Specific questions needing background:
Use Step-Back Prompting. When the answer needs broader context to be complete.

General recall improvement:
Use Multi-Query. When you suspect single-query retrieval is missing relevant chunks.
Combining Strategies
These strategies are composable:

Rewrite + Multi-Query: First rewrite for context, then generate variations.

Decompose + HyDE: Break into sub-questions, generate hypothetical answers for each.

Any strategy + Reranking: Transform queries, retrieve broadly, then rerank for precision.
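The Rewrite + Multi-Query combination above can be sketched as a single pipeline. All three functions here are stubs standing in for the LLM and retrieval calls:

```python
def rewrite(question: str, history: list[str]) -> str:
    # Stub LLM call: make the question standalone using chat history.
    return question if not history else f"{question} (re: {history[-1]})"

def variations(query: str, n: int = 3) -> list[str]:
    # Stub LLM call: produce n differently phrased versions.
    return [query] + [f"{query} (variant {i})" for i in range(1, n)]

def retrieve(query: str) -> list[str]:
    # Stub vector-store search.
    return [f"doc:{query}"]

def rewrite_plus_multiquery(question: str, history: list[str]) -> list[str]:
    standalone = rewrite(question, history)  # step 1: resolve context
    queries = variations(standalone)         # step 2: expand phrasings
    seen, docs = set(), []
    for q in queries:                        # step 3: retrieve + dedupe
        for d in retrieve(q):
            if d not in seen:
                seen.add(d)
                docs.append(d)
    return docs
```

The ordering matters: rewriting first means every variation inherits the resolved conversational context instead of paraphrasing an ambiguous question.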
Cost vs Benefit
Every transformation adds one LLM call (~200–500ms, ~$0.001–0.01). Multi-query adds multiple retrieval calls. Decomposition adds multiple LLM + retrieval calls. Start with query rewriting (simplest, most universal), then add others only when you identify specific failure modes.
Measure the impact. Run your evaluation set with and without each transformation. If multi-query improves recall@5 by 10%, the extra latency and cost are justified. If it only improves by 1%, skip it. Always let data drive the decision.
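The recall@k comparison described above is straightforward to compute. A minimal sketch, where `retrieve_fn` is whatever retrieval pipeline you are evaluating and the evaluation set pairs each query with the ids of its known-relevant chunks:

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of relevant chunk ids that appear in the top-k results.
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def evaluate(eval_set, retrieve_fn, k=5):
    # eval_set: list of (query, relevant_ids) pairs.
    scores = [recall_at_k(retrieve_fn(q), rel, k) for q, rel in eval_set]
    return sum(scores) / len(scores)

# Run once per pipeline on the same eval set, then compare:
# baseline = evaluate(eval_set, plain_retrieve)
# improved = evaluate(eval_set, multi_query_retrieve)
```

Running the same evaluation set through the baseline and each transformed pipeline gives you the before/after numbers that justify (or rule out) the extra LLM calls.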