Ch 5 — Vector Stores & Indexing

Where vectors live and how they are searched
High Level

Vectors → Store → Index → Metadata → Query → Filter → Results
What Is a Vector Store?
A database purpose-built for storing and searching vectors
The Core Problem
You have millions of embedding vectors (each 1536 floats). A user sends a query, you embed it, and now you need to find the top-k most similar vectors in your collection. A brute-force comparison against every vector is too slow. Vector stores solve this with specialized approximate nearest neighbor (ANN) indexes.
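The brute-force baseline that ANN indexes replace can be written in a few lines of plain Python. This is a toy sketch (real stores run optimized native code over millions of vectors); the entry shape and names are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_top_k(query_vec, entries, k=5):
    """Score every stored vector against the query and keep the k best.
    O(n * dim) per query -- exact, but too slow at millions of vectors."""
    scored = [
        (cosine_similarity(query_vec, e["vector"]), e["id"])
        for e in entries
    ]
    scored.sort(reverse=True)
    return scored[:k]

entries = [
    {"id": "chunk_1", "vector": [1.0, 0.0, 0.0]},
    {"id": "chunk_2", "vector": [0.9, 0.1, 0.0]},
    {"id": "chunk_3", "vector": [0.0, 1.0, 0.0]},
]
top = brute_force_top_k([1.0, 0.0, 0.0], entries, k=2)
```

Every ANN index in this chapter is an attempt to get roughly the same `top` list without scoring all of `entries`.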
What They Store
Each entry in a vector store contains:
1. The vector — the embedding (e.g., 1536 floats)
2. The text — the original chunk content
3. Metadata — source, page, title, tags, timestamps
4. An ID — unique identifier for updates and deletes
```python
# What a vector store entry looks like
{
    "id": "chunk_42",
    "vector": [0.023, -0.041, ...],  # 1536 floats
    "text": "Customers may request a refund...",
    "metadata": {
        "source": "policies/refunds.pdf",
        "page": 7,
        "department": "customer_service"
    }
}
```
Vector stores are not traditional databases. A Postgres table with a vector column (pgvector) is a hybrid. A purpose-built vector DB like Pinecone or Qdrant is optimized entirely for vector operations — faster search, better scaling, built-in ANN indexes.
The Vector Store Landscape
Managed services, self-hosted databases, and in-process libraries
Managed Cloud Services
Pinecone — Fully managed, serverless option. No infrastructure to manage. Scales automatically. The most popular choice for teams that want zero ops.

Weaviate Cloud — Managed Weaviate. Supports hybrid search (vector + keyword) natively. GraphQL API.

Qdrant Cloud — Managed Qdrant. Strong filtering, payload indexing, and quantization support.
Self-Hosted Databases
Weaviate — Open-source. Docker deployment. Hybrid search, multi-tenancy, modules for auto-embedding.

Qdrant — Open-source, Rust-based. Fast, memory-efficient. Excellent filtering performance.

Milvus — Open-source by Zilliz. Designed for billion-scale. GPU-accelerated indexing.

pgvector — PostgreSQL extension. Use your existing Postgres. Good for < 5M vectors.
In-Process Libraries
Chroma — Open-source, Python-native. Runs in-process (no server needed). SQLite backend. Perfect for prototyping and small datasets.

FAISS — By Meta. C++ library with Python bindings. The gold standard for ANN research. No metadata storage — just vectors and IDs.

LanceDB — Serverless, embedded. Stores vectors in Lance columnar format. Zero-copy reads.
Start with Chroma for prototyping, graduate to a managed service for production. Chroma runs in 3 lines of Python with no server. When you need persistence, scaling, or team access, move to Pinecone, Qdrant Cloud, or Weaviate Cloud.
How ANN Indexes Work
Finding similar vectors without checking every single one
The Speed Problem
Brute-force search compares your query against every vector. With 1M vectors at 1536 dims, that is 6 billion floating-point operations per query. At 100 QPS, that is 600 billion ops/sec. ANN indexes trade a tiny bit of accuracy for massive speed gains — typically 99%+ recall at 100x speed.
HNSW (Most Popular)
Hierarchical Navigable Small World graphs. Builds a multi-layer graph where each vector is connected to its nearest neighbors. Search starts at the top layer (few nodes, long jumps) and descends to the bottom layer (all nodes, short jumps). Used by Pinecone, Qdrant, Weaviate, pgvector, and Chroma.
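The core move inside HNSW — greedy hops on a proximity graph toward the query — can be sketched without the layer hierarchy. This is a heavily simplified illustration, not the full algorithm: one layer, brute-force graph construction, and raw dot product as the similarity:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_knn_graph(vectors, m=4):
    """Connect each vector to its m nearest neighbors (brute force, build time only)."""
    graph = {}
    for i, v in enumerate(vectors):
        ranked = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: -dot(v, vectors[j]),
        )
        graph[i] = ranked[:m]
    return graph

def greedy_search(query, vectors, graph, start=0):
    """Hop to whichever neighbor is closer to the query; stop at a local optimum.
    HNSW runs this on several layers, coarse to fine, to avoid bad local optima."""
    current, best = start, dot(query, vectors[start])
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            score = dot(query, vectors[nb])
            if score > best:
                current, best, improved = nb, score, True
    return current, best

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
graph = build_knn_graph(vectors, m=6)
query = vectors[17]  # query identical to a stored vector
found, score = greedy_search(query, vectors, graph, start=0)
```

Because each accepted hop strictly increases the score, the search always terminates; the multi-layer structure in real HNSW is what makes the greedy walk land near the true nearest neighbors.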
IVF (Inverted File Index)
Clusters vectors into groups using k-means. At query time, only searches the nearest clusters instead of all vectors. Used by FAISS and Milvus. IVF + PQ (Product Quantization) compresses vectors for massive scale.
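The same idea in miniature: a toy k-means build plus an inverted list of ids per cluster, probing only the nearest clusters at query time. A sketch of the concept (assuming NumPy and unit-normalized vectors so dot product equals cosine), not FAISS's implementation:

```python
import numpy as np

def build_ivf(vectors, n_clusters=8, iters=10, seed=0):
    """Toy IVF build: k-means centroids plus an inverted list of ids per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(vectors @ centroids.T, axis=1)  # nearest centroid per vector
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
                centroids[c] /= np.linalg.norm(centroids[c])
    assign = np.argmax(vectors @ centroids.T, axis=1)
    inverted_lists = {c: np.flatnonzero(assign == c) for c in range(n_clusters)}
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, n_probe=2, k=5):
    """Score only the vectors in the n_probe nearest clusters, not all of them."""
    probe = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 32)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit vectors
centroids, lists = build_ivf(vectors)
hits = ivf_search(vectors[0], vectors, centroids, lists, n_probe=2, k=5)
```

Raising `n_probe` trades speed for recall — the same knob FAISS exposes as `nprobe`.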
Flat (Brute Force)
No index — compares against every vector. 100% accurate (exact nearest neighbors). Fine for < 50K vectors. Used as a baseline and for small datasets where speed is not a concern.
HNSW is the default for most RAG applications. It offers the best balance of speed, accuracy, and simplicity. You rarely need to think about index type — most vector stores use HNSW automatically. Only consider IVF+PQ at billion-scale.
Metadata Filtering
Combining vector search with structured filters
Why Filtering Matters
Vector similarity alone is not enough. You often need to restrict search to specific subsets: "Only search HR documents", "Only docs from 2024", "Only this customer’s data". Metadata filtering lets you combine semantic search with structured conditions.
Pre-filtering vs Post-filtering
Pre-filtering: Apply metadata filter first, then search only matching vectors. Faster when the filter is very selective. Used by Pinecone, Qdrant.

Post-filtering: Search all vectors first, then filter results. Simpler but may return fewer than k results if many matches are filtered out.
```python
# Pinecone — metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "department": {"$eq": "HR"},
        "year": {"$gte": 2024}
    },
    include_metadata=True
)

# Chroma — where clause
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "department": "HR",
        "year": {"$gte": 2024}
    }
)
```
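The difference between the two strategies can be shown in plain Python. A toy sketch over hypothetical `entries` dicts shaped like the store entries described earlier, with raw dot product standing in for the similarity score:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def post_filter_search(query_vec, entries, predicate, k=3):
    """Post-filtering: rank ALL entries, take top-k, then drop non-matches.
    May return fewer than k results if the filter is selective."""
    ranked = sorted(entries, key=lambda e: -dot(query_vec, e["vector"]))
    return [e["id"] for e in ranked[:k] if predicate(e["metadata"])]

def pre_filter_search(query_vec, entries, predicate, k=3):
    """Pre-filtering: restrict to matching entries first, then rank only those."""
    matching = [e for e in entries if predicate(e["metadata"])]
    ranked = sorted(matching, key=lambda e: -dot(query_vec, e["vector"]))
    return [e["id"] for e in ranked[:k]]

entries = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"department": "HR"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"department": "sales"}},
    {"id": "c", "vector": [0.8, 0.2], "metadata": {"department": "sales"}},
    {"id": "d", "vector": [0.1, 0.9], "metadata": {"department": "HR"}},
]
is_hr = lambda m: m["department"] == "HR"
query = [1.0, 0.0]

post = post_filter_search(query, entries, is_hr, k=3)  # only "a" survives the cut
pre = pre_filter_search(query, entries, is_hr, k=3)    # "a" and "d"
```

With the same filter and the same k, post-filtering returns one result and pre-filtering returns two — exactly the "fewer than k" failure mode described above.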
Design your metadata schema upfront. Think about what filters your users will need: department, document type, date range, access level, customer ID. Add these as metadata at indexing time. Retrofitting metadata later means re-indexing everything.
CRUD Operations
Adding, updating, and deleting vectors
Upsert (Insert/Update)
Most vector stores use upsert — insert if new, update if the ID already exists. This is the primary way to add data. Batch upserts (hundreds or thousands at once) are much faster than individual inserts.
Delete
Delete by ID or by metadata filter. When a source document changes, delete all chunks from the old version and upsert the new chunks. Some stores support namespaces or collections to isolate different datasets.
Re-indexing
If you change your embedding model or chunking strategy, you must re-embed and re-index everything. Vectors from different models are incompatible — you cannot mix them in the same index. Plan for this by keeping your original text accessible.
```python
# Chroma — full workflow
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add (upsert)
collection.add(
    ids=["chunk_1", "chunk_2"],
    embeddings=[vec1, vec2],
    documents=["text 1", "text 2"],
    metadatas=[
        {"source": "doc.pdf", "page": 1},
        {"source": "doc.pdf", "page": 2}
    ]
)

# Query
results = collection.query(
    query_embeddings=[query_vec],
    n_results=5
)

# Delete
collection.delete(ids=["chunk_1"])
```
Always store the original text alongside the vector. You need it for the LLM prompt (the generation step), for debugging retrieval, and for re-indexing if you change models. Some stores (Chroma, Weaviate) store text natively; others (FAISS) require a separate text store.
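Putting the pieces together, a minimal in-memory store showing upsert, query, and delete semantics with text and metadata kept alongside each vector. A teaching sketch in plain Python, not a substitute for a real vector store:

```python
import math

class TinyVectorStore:
    """Entries keyed by id; upsert overwrites, delete removes, query ranks by cosine."""

    def __init__(self):
        self._entries = {}

    def upsert(self, doc_id, vector, text, metadata=None):
        # Insert if new, overwrite if the id already exists
        self._entries[doc_id] = {"vector": vector, "text": text, "metadata": metadata or {}}

    def delete(self, ids):
        for doc_id in ids:
            self._entries.pop(doc_id, None)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def query(self, vector, k=5):
        scored = [
            (self._cosine(vector, e["vector"]), doc_id, e)
            for doc_id, e in self._entries.items()
        ]
        scored.sort(key=lambda t: -t[0])
        # Return text + metadata alongside the score, as real stores do
        return [
            {"id": doc_id, "score": s, "text": e["text"], "metadata": e["metadata"]}
            for s, doc_id, e in scored[:k]
        ]

store = TinyVectorStore()
store.upsert("chunk_1", [1.0, 0.0], "text 1", {"page": 1})
store.upsert("chunk_2", [0.0, 1.0], "text 2", {"page": 2})
store.upsert("chunk_1", [0.9, 0.1], "text 1 (edited)")  # same id: upsert = overwrite
hits = store.query([1.0, 0.0], k=1)
store.delete(["chunk_2"])
```

Because the text rides along with the vector, the query result is directly usable as LLM context — no second lookup against a separate document store.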
Framework Integration
Using vector stores with LangChain and LlamaIndex
LangChain VectorStore Interface
LangChain provides a unified VectorStore interface. All stores (Chroma, Pinecone, Qdrant, pgvector, FAISS) implement the same methods: add_documents(), similarity_search(), as_retriever(). Switch stores by changing one line of code.
LlamaIndex VectorStoreIndex
LlamaIndex wraps vector stores in a VectorStoreIndex. It handles chunking, embedding, and storage in one call. The as_query_engine() method returns a ready-to-use retriever + generator pipeline.
```python
# LangChain — Chroma
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)

# Search
docs = vectorstore.similarity_search(
    "refund policy", k=5
)

# As a retriever (for chains)
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)
```
The framework handles embedding automatically. When you call from_documents(), LangChain embeds all chunks using the provided embedding model and stores them. When you call similarity_search(), it embeds the query and searches. You never touch raw vectors directly.
Choosing the Right Vector Store
A practical decision framework
Decision Framework
Prototyping / < 100K vectors:
Use Chroma. In-process, no server, 3 lines of code. Persists to disk with SQLite.

Production / < 10M vectors:
Use Pinecone (managed, zero ops) or Qdrant Cloud (managed, strong filtering). Or self-host Qdrant/Weaviate if you need data control.

Already using Postgres:
Use pgvector. No new infrastructure. Good enough for < 5M vectors with HNSW index.

Billion-scale:
Use Milvus (GPU-accelerated) or Pinecone (serverless scales automatically).

Offline / research:
Use FAISS. Fastest raw ANN performance. No metadata — pair with a separate store.
Key Questions to Ask
How many vectors? Under 100K = anything works. Over 10M = need careful index tuning.

Do you need metadata filtering? If yes, avoid FAISS (no metadata). Pinecone, Qdrant, and Weaviate have excellent filtering.

Managed or self-hosted? Managed = less ops, more cost. Self-hosted = more control, more work.

Hybrid search needed? Weaviate and Qdrant support vector + BM25 keyword search natively. Pinecone added sparse-dense support.
The vector store is rarely the bottleneck. Most RAG quality issues come from bad chunking or bad embeddings, not the vector store. Pick one that fits your ops model, get it running, and focus your optimization energy on the retrieval pipeline above the store.