Ch 10 — RAG Solutions Landscape

Frameworks, platforms, and managed services for building RAG systems
Orchestration Frameworks
Libraries that wire together the RAG pipeline
What They Do
Orchestration frameworks provide the glue code between your LLM, vector store, embedding model, and retrieval logic. They offer pre-built chains, document loaders, text splitters, retrievers, and output parsers so you don't have to build everything from scratch.
LangChain
The most widely adopted RAG framework. Huge ecosystem of integrations (700+ components). LCEL for composable chains. LangGraph for agentic workflows.
Python · TypeScript · Open Source
LlamaIndex
Purpose-built for RAG. Stronger data ingestion and indexing primitives. Excellent for structured data, multi-modal, and complex query engines.
Python · TypeScript · Open Source
Haystack (deepset)
Pipeline-based framework with a visual pipeline editor. Strong focus on production readiness and evaluation. Used by enterprise teams.
Python · Open Source
Semantic Kernel (Microsoft)
Microsoft's SDK for AI orchestration. Deep Azure integration. Supports C#, Python, Java. Built-in memory and planner abstractions.
C# · Python · Java
When to Use a Framework
Use a framework when: You're building a standard RAG pipeline and want pre-built integrations for your LLM, vector store, and embedding model. Frameworks save weeks of boilerplate code.

Skip a framework when: You have a simple use case (single LLM call + one vector store) or need maximum control over every step. A thin wrapper around your LLM API and vector store client may be simpler.
LangChain vs LlamaIndex: LangChain is broader (agents, chains, tools, memory). LlamaIndex is deeper on data (ingestion, indexing, query engines). Many teams use both: LlamaIndex for data pipelines, LangChain/LangGraph for the application layer. They are complementary, not competing.
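The "thin wrapper" alternative can be sketched in a few lines. This is a toy, assuming nothing beyond the standard library: embed() is a stand-in bag-of-words function, Jaccard overlap stands in for cosine similarity, and the final prompt is where a real chat-completion call would go.

```python
# Minimal "thin wrapper" RAG sketch: retrieve top-k chunks, build a grounded
# prompt, then hand it to your LLM client. embed() is a placeholder for a
# real embedding model.

def embed(text: str) -> set[str]:
    # Stand-in "embedding": a bag of lowercase tokens.
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap as a placeholder for cosine similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "Qdrant is a vector database written in Rust.",
    "LangChain provides chains and document loaders.",
    "pgvector adds vector search to PostgreSQL.",
]
query = "What is a vector database written in Rust?"
context = retrieve(query, docs)
prompt = build_prompt(query, context)
```

Swapping in a real embedding API and LLM client turns this into a working pipeline; frameworks earn their keep once you add loaders, splitters, reranking, and streaming on top.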
Vector Databases & Search
Where your embeddings live and get queried
Managed / Cloud-Native
Pinecone
Fully managed, serverless vector database. Zero infrastructure to manage. Fast hybrid search (dense + sparse). Widely used in production.
Managed · Serverless
Weaviate
Open-source vector database with a managed cloud option. Built-in vectorization modules. GraphQL API. Hybrid search with BM25.
Open Source · Cloud
Qdrant
High-performance open-source vector database written in Rust. Rich filtering, payload indexing, quantization. Qdrant Cloud for managed hosting.
Open Source · Rust
Milvus / Zilliz
Open-source vector database for massive scale (billions of vectors). GPU-accelerated indexing. Zilliz Cloud is the managed version.
Open Source · Cloud
Embedded / Add-on
Chroma
Lightweight, in-process vector store. Great for prototyping and small datasets. Python-native. Can run as a server too.
Open Source · Embedded
pgvector (PostgreSQL)
Vector search as a PostgreSQL extension. Keep vectors alongside your relational data. No separate infrastructure needed.
Open Source · SQL
FAISS (Meta)
In-memory similarity search library. Extremely fast. No persistence or API built-in. Best for batch processing and research.
Library · C++/Python
LanceDB
Serverless, embedded vector database built on Lance columnar format. Multi-modal support. Zero-copy access. Good for local development.
Open Source · Embedded
Decision guide:
Prototyping → Chroma or LanceDB
Already using PostgreSQL → pgvector
Production with managed infra → Pinecone or Qdrant Cloud
Massive scale (1B+ vectors) → Milvus/Zilliz
Self-hosted control → Qdrant or Weaviate
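Under the hood, every option above answers the same question: which stored vectors are closest to the query vector? A brute-force cosine scan, using only the standard library and toy 3-dimensional vectors, is essentially what a flat FAISS or Chroma index does; production databases layer ANN indexes (HNSW, IVF), filtering, and persistence on top.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(query: list[float], index: dict[str, list[float]], k: int = 1):
    # Exhaustive scan: score every stored vector, return the top-k ids.
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return scored[:k]

index = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.1, 0.8, 0.3],
    "doc-c": [0.0, 0.2, 0.9],
}
hits = knn([1.0, 0.0, 0.1], index, k=2)  # hits[0][0] is the closest doc id
```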
Embedding Models & Providers
Converting text to vectors for semantic search
Commercial APIs
OpenAI text-embedding-3
Two variants: small (1536d, cheap) and large (3072d, best quality). Matryoshka support for dimension reduction. Most widely used.
API · Matryoshka
Cohere Embed v3
Strong multilingual support (100+ languages). Input type parameter (search_document vs search_query). Compression to int8/binary.
API · Multilingual
Voyage AI
Specialized embedding models for code (voyage-code-3) and for legal/finance domains. High MTEB scores. Acquired by MongoDB in 2025.
API · Domain-Specific
Google Vertex AI
text-embedding-005 model. 768 dimensions. Task-type parameter. Integrated with Google Cloud ecosystem and Gemini.
API · Google Cloud
Open-Source / Self-Hosted
BGE (BAAI)
Beijing Academy of AI. BGE-large-en-v1.5 and BGE-M3 (multilingual, multi-granularity). Top MTEB scores. Apache 2.0 license.
Open Source · HuggingFace
E5 (Microsoft)
E5-mistral-7b-instruct for highest quality. E5-small/base/large for efficiency. Prefix-based query/document asymmetry.
Open Source · MIT
Nomic Embed
nomic-embed-text-v1.5. 768d, Matryoshka support, 8192 token context. Fully open-source (weights + training code + data). Apache 2.0.
Open Source · Long Context
Jina Embeddings v3
8192 token context. Late chunking support. Task-specific LoRA adapters. Multilingual. Available via API and self-hosted.
Open Source · API
Start with OpenAI text-embedding-3-small for prototyping (cheap, good quality). Move to open-source (BGE, Nomic) for cost control at scale. Use Cohere for multilingual. Use Voyage for code/legal domains. Always benchmark on your own data before choosing.
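The Matryoshka support mentioned for OpenAI text-embedding-3 and Nomic means a vector can be truncated to its leading dimensions and renormalized, trading a little quality for storage and query speed. A minimal sketch with a toy 8-dimensional vector (real models return 768 to 3072 dimensions):

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    # Keep only the leading dimensions, then renormalize to unit length
    # so cosine-similarity scores remain comparable.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.4, 0.3, -0.2, 0.5, 0.1, -0.4, 0.2, 0.3]
short = truncate_embedding(full, 4)  # 4-d vector, unit length
```

With OpenAI's API the same effect is available via the `dimensions` request parameter; for Matryoshka-trained open models you truncate the returned vector yourself as above.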
End-to-End RAG Platforms
Managed services that handle the full pipeline
What They Offer
End-to-end platforms handle document ingestion, chunking, embedding, storage, retrieval, and generation as a managed service. You upload documents and get an API or chat interface. No need to choose individual components or manage infrastructure.
LangSmith (LangChain)
Observability, testing, and evaluation platform for LangChain apps. Trace every step, run evaluations, compare prompts. Not a full RAG platform but essential for debugging.
Observability · Eval
LlamaCloud
Managed parsing and indexing by LlamaIndex. LlamaParse for document parsing (PDFs, tables, images). Managed retrieval API. Pairs with LlamaIndex framework.
Managed · Parsing
Vectara
End-to-end RAG-as-a-service. Upload documents, get an API. Built-in chunking, embedding, hybrid search, reranking, and grounded generation. Hallucination detection included.
RAG-as-a-Service
Cohere RAG
Cohere's Command R+ model with built-in RAG capabilities. Grounded generation with inline citations. Connectors for data sources. Enterprise-focused.
LLM + RAG · Citations
Ragas
Open-source RAG evaluation framework. Metrics: faithfulness, answer relevancy, context recall, context precision. Works with any RAG pipeline. Essential for measuring quality.
Evaluation · Open Source
Unstructured
Document parsing and preprocessing platform. Handles PDFs, images, HTML, DOCX, PPTX. Hosted API or self-hosted. The de facto standard for document ingestion.
Ingestion · Open Source
Platforms trade flexibility for speed. Vectara and Cohere RAG get you to production fastest but limit customization. LlamaCloud and LangSmith augment your custom pipeline with managed services for the hardest parts (parsing, observability). For most teams, a framework (LangChain/LlamaIndex) + vector DB + evaluation (Ragas) is the sweet spot.
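To make the Ragas metrics concrete, here is a simplified context-precision computation. It assumes binary relevance labels for the retrieved chunks are already available (Ragas derives them with an LLM judge); the scoring rewards ranking relevant chunks near the top of the retrieved list.

```python
def context_precision(relevant: list[bool]) -> float:
    # Mean of precision@k taken at each position k holding a relevant chunk.
    # relevant[i] is True if the chunk at rank i+1 was judged relevant.
    score, hits = 0.0, 0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

# The same two relevant chunks score higher ranked first than ranked last.
good = context_precision([True, True, False])
bad = context_precision([False, False, True])
```

Rank-awareness is the point: a retriever that buries its one relevant chunk at position 3 scores about 0.33 here, while one that surfaces relevant chunks first scores 1.0.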
Enterprise & Cloud Provider Solutions
RAG built into major cloud platforms
Amazon Bedrock Knowledge Bases
Fully managed RAG on AWS. Upload to S3, auto-chunking, auto-embedding (Titan or Cohere), stores in OpenSearch or Pinecone. Query via Bedrock API with any Bedrock model.
AWS · Managed
Azure AI Search + OpenAI
Azure's enterprise search service (formerly Cognitive Search) with vector search. Integrated with Azure OpenAI. Hybrid search (BM25 + vectors), semantic ranker, built-in chunking. Enterprise-grade.
Azure · Enterprise
Google Vertex AI Search
Google's enterprise search with RAG. Grounding with Google Search or your own data. Integrated with Gemini models. Auto-handles chunking, embedding, and retrieval.
GCP · Managed
Databricks + MosaicML
Vector search built into Databricks lakehouse. Embed with open-source models on Databricks GPUs. Unified data + AI platform. MLflow for experiment tracking.
Lakehouse · Data + AI
When to Choose Cloud Providers
Already on AWS: Bedrock Knowledge Bases is the path of least resistance. Tight S3 integration, IAM security, no new vendors.

Already on Azure: Azure AI Search + Azure OpenAI. Enterprise compliance (SOC2, HIPAA) built in. Best for regulated industries.

Already on GCP: Vertex AI Search with Gemini. Especially strong if you use BigQuery for analytics.

Data-heavy teams: Databricks if your data already lives in the lakehouse.
Cloud provider lock-in is real. These solutions are deeply integrated with their respective clouds. Migrating from Bedrock Knowledge Bases to Azure AI Search is a significant effort. If multi-cloud flexibility matters, use a framework (LangChain) with a standalone vector DB (Pinecone, Qdrant) instead.
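One common hedge against lock-in is to code against a small interface of your own and keep each vendor SDK behind an adapter. The sketch below uses illustrative names, not any real SDK; a PineconeStore or QdrantStore adapter would implement the same two methods around the vendor client.

```python
from typing import Protocol

class VectorStore(Protocol):
    """The narrow surface your application depends on."""
    def upsert(self, doc_id: str, vector: list[float], metadata: dict) -> None: ...
    def query(self, vector: list[float], top_k: int) -> list[str]: ...

class InMemoryStore:
    """Toy adapter used here so the example runs without any service."""
    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def upsert(self, doc_id: str, vector: list[float], metadata: dict) -> None:
        self._rows[doc_id] = vector

    def query(self, vector: list[float], top_k: int) -> list[str]:
        def dot(v: list[float]) -> float:
            return sum(a * b for a, b in zip(vector, v))
        ranked = sorted(self._rows, key=lambda d: dot(self._rows[d]), reverse=True)
        return ranked[:top_k]

store: VectorStore = InMemoryStore()
store.upsert("a", [1.0, 0.0], {})
store.upsert("b", [0.0, 1.0], {})
top = store.query([0.9, 0.1], top_k=1)
```

Migrating then means writing one new adapter rather than touching every call site. Frameworks like LangChain ship this abstraction for you, which is part of their multi-cloud appeal.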
Open-Source RAG Applications
Full-stack open-source RAG you can self-host
RAGFlow
Open-source RAG engine with deep document understanding. Template-based chunking for different document types. Built-in UI for document management and chat.
Open Source · Self-Host
Dify
Open-source LLM application platform with visual workflow builder. RAG pipeline included. Supports multiple LLMs and vector stores. No-code/low-code interface.
Open Source · No-Code
Anything LLM
All-in-one desktop and Docker app for RAG. Chat with documents locally. Supports many LLMs (OpenAI, Ollama, local models) and vector stores.
Open Source · Desktop
Verba (Weaviate)
Open-source RAG chatbot by Weaviate. Built on Weaviate vector store. Simple UI for document upload and chat. Good starting template.
Open Source · Weaviate
Quivr
Open-source "second brain" powered by RAG. Upload documents, chat with them. Supabase + pgvector backend. Active community.
Open Source · Supabase
PrivateGPT
Fully private RAG. Runs 100% locally with no data leaving your machine. Uses Ollama for local LLMs. Good for sensitive documents.
Open Source · Private
Open-source RAG apps are great for prototyping and internal tools. They get you a working RAG chatbot in hours, not weeks. But for production, you'll likely outgrow them and need a custom pipeline built with LangChain/LlamaIndex. Use these to validate the use case, then build custom when you need fine-grained control.
Choosing Your RAG Stack
A practical decision framework
By Team & Stage
Solo developer / prototype:
LangChain or LlamaIndex + Chroma + OpenAI embeddings + GPT-4o. Deploy with FastAPI. Total cost: ~$0 to start.

Startup / small team:
LangChain + Pinecone or Qdrant Cloud + OpenAI embeddings + GPT-4o. LangSmith for observability. Ragas for evaluation.

Enterprise / regulated:
Cloud provider solution (Bedrock/Azure AI Search/Vertex) or LangChain + self-hosted Qdrant/Weaviate + open-source embeddings (BGE). Full audit trail.

Just want it to work:
Vectara or Cohere RAG. Upload docs, get API. Minimal engineering required.
The Recommended Stack (2024-2025)
For most teams building custom RAG:
# The "standard" RAG stack
Orchestration:  LangChain + LangGraph
Embeddings:     OpenAI text-embedding-3-small (or BGE/Nomic for self-hosted)
Vector Store:   Qdrant Cloud or Pinecone (or pgvector if already on Postgres)
LLM:            GPT-4o or Claude 3.5 Sonnet
Reranker:       Cohere Rerank or BGE Reranker
Observability:  LangSmith
Evaluation:     Ragas
Ingestion:      Unstructured (complex docs) or LangChain loaders (simple docs)
Start simple, add complexity only when needed. Begin with the basic stack. Measure retrieval quality with Ragas. If retrieval is the bottleneck, add hybrid search and reranking. If document parsing is failing, add Unstructured or LlamaParse. If you need agents, add LangGraph. Every component you add is a component you maintain.
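"Start simple" applies to ingestion as well: a fixed-size character splitter with overlap is often enough before reaching for semantic or template-based chunking. A minimal sketch in the spirit of LangChain's character splitters, though not its API:

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Slide a fixed-size window across the text; each chunk repeats the
    # last `overlap` characters of the previous one so sentences cut at a
    # boundary still appear whole in at least one chunk.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

pieces = chunk("x" * 500, size=200, overlap=40)
```

Tune size and overlap against retrieval metrics (Ragas context recall/precision) rather than guessing; the right values depend on your documents and embedding model's context window.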