Ch 7 — Query Transformation

Rewriting the user’s question before retrieval
High Level
Raw Query → Rewrite → Decompose → HyDE → Step-Back → Multi-Query → Retrieve
Why Transform Queries?
User questions are rarely optimal for retrieval
The Problem
Users ask questions in natural, conversational language. But the best retrieval query is often different from what the user typed. Common issues:

Vague queries: “Tell me about that thing we discussed” — no useful search terms.

Complex questions: “How does X compare to Y in terms of A, B, and C?” — too many concepts for a single embedding.

Vocabulary mismatch: The user says “get my money back” but the docs say “refund policy.”

Conversational context: “What about the pricing?” — depends on what was discussed before.
Query Transformation Strategies
This chapter covers the main techniques to improve queries before retrieval:

1. Query Rewriting — LLM rewrites the query for better retrieval
2. Sub-Question Decomposition — Break complex questions into simpler parts
3. HyDE — Generate a hypothetical answer, embed that instead
4. Step-Back Prompting — Ask a broader question first
5. Multi-Query — Generate multiple query variations

All of these use the LLM as a pre-processing step before retrieval.
Query transformation sits between the user and retrieval. The user’s raw question goes in, a better search query (or queries) comes out. This is one of the highest-leverage improvements in advanced RAG — often more impactful than changing the embedding model or vector store.
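Stripped of any specific framework, every strategy in this chapter shares the same shape: expand or rewrite the raw query, retrieve per query, and merge the results. A minimal sketch, with `transform` and `retrieve` as stubs standing in for the LLM call and the vector-store search:

```python
# Generic transform-then-retrieve pipeline (stubbed for illustration).
# `transform` and `retrieve` are placeholders for an LLM call and a
# vector-store search, respectively.

def transform(query: str) -> list[str]:
    # A real implementation would call an LLM and return one or more
    # rewritten queries; here we pass the query through unchanged.
    return [query]

def retrieve(query: str, k: int = 5) -> list[str]:
    # Placeholder for e.g. vectorstore.similarity_search(query, k=k).
    return [f"doc-for:{query}"]

def transformed_retrieval(raw_query: str) -> list[str]:
    docs: list[str] = []
    seen: set[str] = set()
    for q in transform(raw_query):   # 1. rewrite / expand the query
        for doc in retrieve(q):      # 2. retrieve per query
            if doc not in seen:      # 3. deduplicate the union
                seen.add(doc)
                docs.append(doc)
    return docs

print(transformed_retrieval("What about the pricing?"))
```

Each strategy below is a different implementation of `transform`: rewriting returns one better query, multi-query returns several variations, decomposition returns sub-questions, and HyDE returns a hypothetical answer to embed instead.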
Query Rewriting
LLM rewrites the user’s question for better retrieval
How It Works
Send the user’s query to an LLM with a prompt like: “Rewrite this question to be more specific and better suited for searching a knowledge base.” The LLM returns a cleaner, more precise version that retrieves better results.
What Rewriting Fixes
Conversational context: “What about the pricing?” → “What is the pricing for the Enterprise plan discussed earlier?” (using chat history)

Ambiguity: “How do I set it up?” → “How do I set up the Stripe payment integration?”

Typos & grammar: “how too cancle subscrption” → “How to cancel a subscription?”
```python
# LangChain — query rewriting
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

rewrite_prompt = ChatPromptTemplate.from_template(
    """Given the conversation history and the user's question,
rewrite the question to be a standalone, specific search query.

Chat history: {chat_history}

User question: {question}

Rewritten query:"""
)

chain = rewrite_prompt | llm | StrOutputParser()

better_query = chain.invoke({
    "chat_history": history,
    "question": "What about the pricing?"
})
```
Query rewriting is the simplest and most universally useful transformation. It is especially critical for conversational RAG where follow-up questions depend on context. LangChain’s create_history_aware_retriever does this automatically.
Sub-Question Decomposition
Break complex questions into simpler, retrievable parts
When to Use
Complex questions that span multiple topics or require information from different documents. A single retrieval cannot find all the pieces.

Example: “How does our refund policy compare to our competitor’s, and what are the legal implications?”

This needs information from: (1) our refund policy, (2) competitor’s refund policy, (3) legal requirements. No single chunk contains all three.
How It Works
The LLM breaks the question into sub-questions. Each sub-question is retrieved independently. The results are combined and the LLM synthesizes a final answer from all retrieved chunks.
```python
# Sub-question decomposition
decompose_prompt = ChatPromptTemplate.from_template(
    """Break this complex question into 2-4 simpler sub-questions
that can each be answered independently:

Question: {question}

Sub-questions:"""
)

# Input:  "How does our refund policy compare
#          to competitors, and what are the
#          legal implications?"
# Output:
# 1. What is our current refund policy?
# 2. What are our competitors' refund policies?
# 3. What are the legal requirements for
#    refund policies in our jurisdiction?

# Each sub-question → separate retrieval
# All results → combined context → final answer
```
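The fan-out described above can be sketched end to end. In this sketch, `decompose`, `retrieve`, and the final synthesis step are stubs standing in for the LLM and vector-store calls:

```python
def decompose(question: str) -> list[str]:
    # Stub for the LLM decomposition call; a real implementation
    # would send `question` through the decomposition prompt.
    return [
        "What is our current refund policy?",
        "What are our competitors' refund policies?",
        "What are the legal requirements for refund policies?",
    ]

def retrieve(query: str) -> list[str]:
    # Stub for a vector-store search returning chunks.
    return [f"chunk-for:{query}"]

def answer_complex(question: str) -> str:
    sub_questions = decompose(question)
    # Retrieve independently for each sub-question.
    context = [doc for sq in sub_questions for doc in retrieve(sq)]
    # Stub for the synthesis LLM call over all retrieved chunks.
    return f"Answer synthesized from {len(context)} chunks"

print(answer_complex(
    "How does our refund policy compare to competitors, "
    "and what are the legal implications?"
))
```

The key design point is that each sub-question gets its own retrieval pass, so chunks relevant to only one facet of the question still make it into the final context.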
LlamaIndex’s SubQuestionQueryEngine implements this pattern natively. It decomposes the question, routes each sub-question to the appropriate index (if you have multiple), retrieves independently, and synthesizes a final answer. LangChain offers similar functionality via custom chains.
HyDE (Hypothetical Document Embeddings)
Generate a fake answer, embed that instead of the question
The Insight
A question and its answer use very different language. “What is the refund policy?” (question) vs. “Customers may request a full refund within 30 days of purchase…” (answer). The answer is much closer in embedding space to the actual document chunks. HyDE (Gao et al., 2022) exploits this by generating a hypothetical answer and embedding that instead.
How It Works
Step 1: Ask the LLM to generate a hypothetical answer to the question (without any retrieved context).
Step 2: Embed the hypothetical answer (not the original question).
Step 3: Use that embedding to search the vector store.

The hypothetical answer doesn’t need to be correct — it just needs to be in the same “language” as the real documents.
```python
# HyDE implementation
hyde_prompt = ChatPromptTemplate.from_template(
    """Write a short paragraph that would answer this question.
It does not need to be accurate — just write in the style
of a knowledge base article.

Question: {question}

Hypothetical answer:"""
)

# 1. Generate hypothetical answer (StrOutputParser extracts
#    the plain string from the chat model's message)
hypo_answer = (hyde_prompt | llm | StrOutputParser()).invoke(
    {"question": "What is the refund policy?"}
)
# → "Our refund policy allows customers to
#    request a full refund within 30 days..."

# 2. Embed the hypothetical answer
hypo_embedding = embeddings.embed_query(hypo_answer)

# 3. Search with that embedding
docs = vectorstore.similarity_search_by_vector(
    hypo_embedding, k=5
)
```
HyDE works best when there is a large vocabulary gap between questions and documents. It is especially effective for technical documentation, legal text, and academic papers where the document language is very different from how users ask questions. Trade-off: adds one LLM call (~200–500ms) before retrieval.
Step-Back Prompting
Ask a broader question to get better context
The Idea
Step-back prompting (Zheng et al., 2023) asks the LLM to generate a more general, higher-level question from the user’s specific query. The broader question retrieves foundational context that helps answer the specific question.
Example
User query: “Why did the revenue drop in Q3 2024 for the EMEA region?”

Step-back question: “What were the key factors affecting EMEA revenue performance in 2024?”

The step-back question retrieves broader context about EMEA performance, market conditions, and strategic changes — which helps explain the specific Q3 drop.
```python
# Step-back prompting
stepback_prompt = ChatPromptTemplate.from_template(
    """Given this specific question, generate a more general
step-back question that would help retrieve useful
background context.

Specific question: {question}

Step-back question:"""
)

# Retrieve for BOTH the original and step-back
original_docs = retriever.invoke(question)

stepback_q = (stepback_prompt | llm | StrOutputParser()).invoke(
    {"question": question}
)
stepback_docs = retriever.invoke(stepback_q)

# Combine both sets of results
all_docs = original_docs + stepback_docs

# Generate answer with broader context
answer = generate(question, all_docs)
```
Step-back is complementary, not a replacement. You retrieve for both the original query and the step-back query, then combine the results. This gives the LLM both the specific details and the broader context needed for a complete answer.
Multi-Query Retrieval
Generate multiple query variations for broader recall
How It Works
Ask the LLM to generate 3–5 different versions of the user’s question, each phrased differently. Run retrieval for each variation. Deduplicate and combine the results. This captures documents that any single phrasing might miss.
Why It Helps
Different phrasings match different chunks. “How to cancel?” might retrieve the cancellation guide, while “End my subscription” retrieves the account management docs, and “Stop billing” retrieves the payment FAQ. Together, you get a more complete picture.
```python
# LangChain — MultiQueryRetriever
from langchain.retrievers.multi_query import (
    MultiQueryRetriever
)

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# User: "How to cancel my subscription?"
# LLM generates:
# 1. "How do I cancel my subscription?"
# 2. "Steps to end my membership"
# 3. "How to stop recurring billing"

# Each query → retrieval → deduplicate
docs = multi_retriever.invoke(
    "How to cancel my subscription?"
)
```
LangChain’s MultiQueryRetriever handles this end-to-end: generates variations, retrieves for each, deduplicates by document ID, and returns the union. It typically generates 3 variations by default. Combine with reranking to sort the merged results by relevance.
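If you implement the merge yourself rather than relying on MultiQueryRetriever, deduplication is a small helper. This is a sketch; the `"id"` key in each chunk's metadata is an assumption about your chunk schema:

```python
def dedupe_by_id(doc_lists):
    # Merge retrieval results from several query variations,
    # keeping the first occurrence of each document id so the
    # original ranking order is preserved.
    seen = set()
    merged = []
    for docs in doc_lists:
        for doc in docs:
            doc_id = doc["metadata"]["id"]  # assumed chunk schema
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc)
    return merged

results = dedupe_by_id([
    [{"metadata": {"id": "a"}}, {"metadata": {"id": "b"}}],
    [{"metadata": {"id": "b"}}, {"metadata": {"id": "c"}}],
])
print([d["metadata"]["id"] for d in results])  # → ['a', 'b', 'c']
```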
Choosing the Right Strategy
A practical decision framework
Decision Guide
Conversational RAG (follow-up questions):
Use Query Rewriting. Essential for chat-based interfaces where questions depend on context.

Complex multi-part questions:
Use Sub-Question Decomposition. When the answer requires information from multiple different documents.

Technical/specialized documents:
Use HyDE. When document language is very different from how users ask questions.

Specific questions needing background:
Use Step-Back Prompting. When the answer needs broader context to be complete.

General recall improvement:
Use Multi-Query. When you suspect single-query retrieval is missing relevant chunks.
Combining Strategies
These strategies are composable:

Rewrite + Multi-Query: First rewrite for context, then generate variations.

Decompose + HyDE: Break into sub-questions, generate hypothetical answers for each.

Any strategy + Reranking: Transform queries, retrieve broadly, then rerank for precision.
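The Rewrite + Multi-Query combination above can be sketched as a single pipeline. All three functions here are stubs standing in for the LLM and retrieval calls:

```python
def rewrite(question: str, history: list[str]) -> str:
    # Stub LLM call: make the question standalone using chat history.
    return question if not history else f"{question} (re: {history[-1]})"

def variations(query: str, n: int = 3) -> list[str]:
    # Stub LLM call: produce n differently phrased versions.
    return [query] + [f"{query} (variant {i})" for i in range(1, n)]

def retrieve(query: str) -> list[str]:
    # Stub vector-store search.
    return [f"doc:{query}"]

def rewrite_plus_multiquery(question: str, history: list[str]) -> list[str]:
    standalone = rewrite(question, history)  # step 1: resolve context
    queries = variations(standalone)         # step 2: expand phrasings
    seen, docs = set(), []
    for q in queries:                        # step 3: retrieve + dedupe
        for d in retrieve(q):
            if d not in seen:
                seen.add(d)
                docs.append(d)
    return docs
```

The ordering matters: rewriting first means every variation inherits the resolved conversational context instead of paraphrasing an ambiguous question.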
Cost vs Benefit
Every transformation adds one LLM call (~200–500ms, ~$0.001–0.01). Multi-query adds multiple retrieval calls. Decomposition adds multiple LLM + retrieval calls. Start with query rewriting (simplest, most universal), then add others only when you identify specific failure modes.
Measure the impact. Run your evaluation set with and without each transformation. If multi-query improves recall@5 by 10%, the extra latency and cost are justified. If it only improves by 1%, skip it. Always let data drive the decision.
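The recall@k comparison described above is straightforward to compute. A minimal sketch, where `retrieve_fn` is whatever retrieval pipeline you are evaluating and the evaluation set pairs each query with the ids of its known-relevant chunks:

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of relevant chunk ids that appear in the top-k results.
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def evaluate(eval_set, retrieve_fn, k=5):
    # eval_set: list of (query, relevant_ids) pairs.
    scores = [recall_at_k(retrieve_fn(q), rel, k) for q, rel in eval_set]
    return sum(scores) / len(scores)

# Run once per pipeline on the same eval set, then compare:
# baseline = evaluate(eval_set, plain_retrieve)
# improved = evaluate(eval_set, multi_query_retrieve)
```

Running the same evaluation set through the baseline and each transformed pipeline gives you the before/after numbers that justify (or rule out) the extra LLM calls.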