Typical Production Stack
A production RAG system is a multi-service architecture:
API layer: FastAPI or Flask serving the RAG endpoint. Handles authentication, rate limiting, and request validation.
Ingestion pipeline: Separate service (or batch job) that processes new documents, chunks, embeds, and upserts to the vector store. Runs on a schedule or triggered by events.
Vector store: Managed service (Pinecone, Qdrant Cloud) or self-hosted (Qdrant, Weaviate on Kubernetes).
LLM: API call to OpenAI/Anthropic/Azure OpenAI, or self-hosted via vLLM/TGI.
```python
# Minimal production RAG API
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    session_id: str = "default"

@app.post("/ask")
async def ask(q: Query):
    # retriever and rag_chain are assumed to be initialized at startup (not shown)
    try:
        docs = await retriever.ainvoke(q.question)
        answer = await rag_chain.ainvoke({
            "input": q.question,
            "context": docs,
        })
    except Exception as exc:
        # Surface backend failures (vector store, LLM) as a clean HTTP error
        raise HTTPException(status_code=502, detail="RAG backend error") from exc
    return {
        "answer": answer,
        "sources": [d.metadata for d in docs],
    }
```
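The handler's core flow — await the retriever, then await the chain with the retrieved context — can be exercised in isolation with asyncio and stub components. This is a minimal sketch; `FakeRetriever`, `FakeChain`, and `FakeDoc` are hypothetical stand-ins for the real vector-store retriever and LLM chain:

```python
# Sketch: the handler's retrieve-then-generate flow with stub components
import asyncio

class FakeDoc:
    def __init__(self, metadata):
        self.metadata = metadata

class FakeRetriever:
    async def ainvoke(self, question):
        # Stands in for an async vector search
        return [FakeDoc({"source": "guide.md", "chunk": 3})]

class FakeChain:
    async def ainvoke(self, inputs):
        # Stands in for the LLM call, grounded in the retrieved context
        return f"Answer to {inputs['input']!r} from {len(inputs['context'])} docs"

async def ask(question):
    retriever, rag_chain = FakeRetriever(), FakeChain()
    docs = await retriever.ainvoke(question)
    answer = await rag_chain.ainvoke({"input": question, "context": docs})
    return {"answer": answer, "sources": [d.metadata for d in docs]}

result = asyncio.run(ask("What is RAG?"))
print(result["answer"])
print(result["sources"])
```

Structuring the handler this way also makes the endpoint unit-testable: swap the stubs for the real components at startup and the flow is unchanged.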
Separate ingestion from serving. The ingestion pipeline (chunking, embedding, indexing) is CPU/GPU-intensive and runs in batch. The serving API is I/O-bound (waiting for vector search and LLM responses). Different scaling characteristics mean different infrastructure. Use a message queue (SQS, RabbitMQ) to decouple them.
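The decoupling pattern above can be sketched with Python's standard-library `queue` standing in for SQS/RabbitMQ; `embed` and `upsert` are hypothetical placeholders for the real embedding and vector-store calls:

```python
# Sketch: decoupling ingestion (batch worker) from serving (enqueue-only)
import queue
import threading

ingest_queue = queue.Queue()

def embed(chunks):
    # Hypothetical stand-in for an embedding-model call
    return [[0.0] * 3 for _ in chunks]

def upsert(vectors):
    # Hypothetical stand-in for a vector-store write; returns count written
    return len(vectors)

def ingestion_worker(results):
    # Batch worker: drains the queue, chunks, embeds, upserts
    while True:
        doc = ingest_queue.get()
        if doc is None:  # sentinel: shut down
            break
        chunks = [doc[i:i + 100] for i in range(0, len(doc), 100)]
        results.append(upsert(embed(chunks)))
        ingest_queue.task_done()

results = []
worker = threading.Thread(target=ingestion_worker, args=(results,))
worker.start()

# The serving side only enqueues; it never blocks on chunking or embedding
ingest_queue.put("a new document " * 50)
ingest_queue.put(None)
worker.join()
print(results)
```

With a real broker, the worker scales out independently of the API replicas, matching the different scaling characteristics described above.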