Ch 5 — Vector Stores & Indexing

Where vectors live and how they are searched
High Level

Vectors → Store → Index → Metadata → Query → Filter → Results
What Is a Vector Store?
A database purpose-built for storing and searching vectors
The Core Problem
You have millions of embedding vectors (each 1536 floats). A user sends a query, you embed it, and now you need to find the top-k most similar vectors in your collection. A brute-force comparison against every vector is too slow. Vector stores solve this with specialized approximate nearest neighbor (ANN) indexes.
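The brute-force baseline that ANN indexes replace can be written in a few lines of plain Python. This is a toy sketch (real stores run optimized native code over millions of vectors); the entry shape and names are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_top_k(query_vec, entries, k=5):
    """Score every stored vector against the query and keep the k best.
    O(n * dim) per query -- exact, but too slow at millions of vectors."""
    scored = [
        (cosine_similarity(query_vec, e["vector"]), e["id"])
        for e in entries
    ]
    scored.sort(reverse=True)
    return scored[:k]

entries = [
    {"id": "chunk_1", "vector": [1.0, 0.0, 0.0]},
    {"id": "chunk_2", "vector": [0.9, 0.1, 0.0]},
    {"id": "chunk_3", "vector": [0.0, 1.0, 0.0]},
]
top = brute_force_top_k([1.0, 0.0, 0.0], entries, k=2)
```

Every ANN index in this chapter is an attempt to get roughly the same `top` list without scoring all of `entries`.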
What They Store
Each entry in a vector store contains:
1. The vector — the embedding (e.g., 1536 floats)
2. The text — the original chunk content
3. Metadata — source, page, title, tags, timestamps
4. An ID — unique identifier for updates and deletes
```python
# What a vector store entry looks like
{
    "id": "chunk_42",
    "vector": [0.023, -0.041, ...],  # 1536 floats
    "text": "Customers may request a refund...",
    "metadata": {
        "source": "policies/refunds.pdf",
        "page": 7,
        "department": "customer_service"
    }
}
```
Vector stores are not traditional databases. A Postgres table with a vector column (pgvector) is a hybrid. A purpose-built vector DB like Pinecone or Qdrant is optimized entirely for vector operations — faster search, better scaling, built-in ANN indexes.
The Vector Store Landscape
Managed services, self-hosted databases, and in-process libraries
Managed Cloud Services
Pinecone — Fully managed, serverless option. No infrastructure to manage. Scales automatically. The most popular choice for teams that want zero ops.

Weaviate Cloud — Managed Weaviate. Supports hybrid search (vector + keyword) natively. GraphQL API.

Qdrant Cloud — Managed Qdrant. Strong filtering, payload indexing, and quantization support.
Self-Hosted Databases
Weaviate — Open-source. Docker deployment. Hybrid search, multi-tenancy, modules for auto-embedding.

Qdrant — Open-source, Rust-based. Fast, memory-efficient. Excellent filtering performance.

Milvus — Open-source by Zilliz. Designed for billion-scale. GPU-accelerated indexing.

pgvector — PostgreSQL extension. Use your existing Postgres. Good for < 5M vectors.
In-Process Libraries
Chroma — Open-source, Python-native. Runs in-process (no server needed). SQLite backend. Perfect for prototyping and small datasets.

FAISS — By Meta. C++ library with Python bindings. The gold standard for ANN research. No metadata storage — just vectors and IDs.

LanceDB — Serverless, embedded. Stores vectors in Lance columnar format. Zero-copy reads.
Start with Chroma for prototyping, graduate to a managed service for production. Chroma runs in 3 lines of Python with no server. When you need persistence, scaling, or team access, move to Pinecone, Qdrant Cloud, or Weaviate Cloud.
How ANN Indexes Work
Finding similar vectors without checking every single one
The Speed Problem
Brute-force search compares your query against every vector. With 1M vectors at 1536 dims, that is 6 billion floating-point operations per query. At 100 QPS, that is 600 billion ops/sec. ANN indexes trade a tiny bit of accuracy for massive speed gains — typically 99%+ recall at 100x speed.
HNSW (Most Popular)
Hierarchical Navigable Small World graphs. Builds a multi-layer graph where each vector is connected to its nearest neighbors. Search starts at the top layer (few nodes, long jumps) and descends to the bottom layer (all nodes, short jumps). Used by Pinecone, Qdrant, Weaviate, pgvector, and Chroma.
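The core move inside HNSW — greedy hops on a proximity graph toward the query — can be sketched without the layer hierarchy. This is a heavily simplified illustration, not the full algorithm: one layer, brute-force graph construction, and raw dot product as the similarity:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_knn_graph(vectors, m=4):
    """Connect each vector to its m nearest neighbors (brute force, build time only)."""
    graph = {}
    for i, v in enumerate(vectors):
        ranked = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: -dot(v, vectors[j]),
        )
        graph[i] = ranked[:m]
    return graph

def greedy_search(query, vectors, graph, start=0):
    """Hop to whichever neighbor is closer to the query; stop at a local optimum.
    HNSW runs this on several layers, coarse to fine, to avoid bad local optima."""
    current, best = start, dot(query, vectors[start])
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            score = dot(query, vectors[nb])
            if score > best:
                current, best, improved = nb, score, True
    return current, best

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
graph = build_knn_graph(vectors, m=6)
query = vectors[17]  # query identical to a stored vector
found, score = greedy_search(query, vectors, graph, start=0)
```

Because each accepted hop strictly increases the score, the search always terminates; the multi-layer structure in real HNSW is what makes the greedy walk land near the true nearest neighbors.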
IVF (Inverted File Index)
Clusters vectors into groups using k-means. At query time, only searches the nearest clusters instead of all vectors. Used by FAISS and Milvus. IVF + PQ (Product Quantization) compresses vectors for massive scale.
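The same idea in miniature: a toy k-means build plus an inverted list of ids per cluster, probing only the nearest clusters at query time. A sketch of the concept (assuming NumPy and unit-normalized vectors so dot product equals cosine), not FAISS's implementation:

```python
import numpy as np

def build_ivf(vectors, n_clusters=8, iters=10, seed=0):
    """Toy IVF build: k-means centroids plus an inverted list of ids per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(vectors @ centroids.T, axis=1)  # nearest centroid per vector
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
                centroids[c] /= np.linalg.norm(centroids[c])
    assign = np.argmax(vectors @ centroids.T, axis=1)
    inverted_lists = {c: np.flatnonzero(assign == c) for c in range(n_clusters)}
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, n_probe=2, k=5):
    """Score only the vectors in the n_probe nearest clusters, not all of them."""
    probe = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 32)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit vectors
centroids, lists = build_ivf(vectors)
hits = ivf_search(vectors[0], vectors, centroids, lists, n_probe=2, k=5)
```

Raising `n_probe` trades speed for recall — the same knob FAISS exposes as `nprobe`.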
Flat (Brute Force)
No index — compares against every vector. 100% accurate (exact nearest neighbors). Fine for < 50K vectors. Used as a baseline and for small datasets where speed is not a concern.
HNSW is the default for most RAG applications. It offers the best balance of speed, accuracy, and simplicity. You rarely need to think about index type — most vector stores use HNSW automatically. Only consider IVF+PQ at billion-scale.
Metadata Filtering
Combining vector search with structured filters
Why Filtering Matters
Vector similarity alone is not enough. You often need to restrict search to specific subsets: "Only search HR documents", "Only docs from 2024", "Only this customer’s data". Metadata filtering lets you combine semantic search with structured conditions.
Pre-filtering vs Post-filtering
Pre-filtering: Apply metadata filter first, then search only matching vectors. Faster when the filter is very selective. Used by Pinecone, Qdrant.

Post-filtering: Search all vectors first, then filter results. Simpler but may return fewer than k results if many matches are filtered out.
```python
# Pinecone — metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "department": {"$eq": "HR"},
        "year": {"$gte": 2024}
    },
    include_metadata=True
)

# Chroma — where clause
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "department": "HR",
        "year": {"$gte": 2024}
    }
)
```
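The difference between the two strategies can be shown in plain Python. A toy sketch over hypothetical `entries` dicts shaped like the store entries described earlier, with raw dot product standing in for the similarity score:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def post_filter_search(query_vec, entries, predicate, k=3):
    """Post-filtering: rank ALL entries, take top-k, then drop non-matches.
    May return fewer than k results if the filter is selective."""
    ranked = sorted(entries, key=lambda e: -dot(query_vec, e["vector"]))
    return [e["id"] for e in ranked[:k] if predicate(e["metadata"])]

def pre_filter_search(query_vec, entries, predicate, k=3):
    """Pre-filtering: restrict to matching entries first, then rank only those."""
    matching = [e for e in entries if predicate(e["metadata"])]
    ranked = sorted(matching, key=lambda e: -dot(query_vec, e["vector"]))
    return [e["id"] for e in ranked[:k]]

entries = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"department": "HR"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"department": "sales"}},
    {"id": "c", "vector": [0.8, 0.2], "metadata": {"department": "sales"}},
    {"id": "d", "vector": [0.1, 0.9], "metadata": {"department": "HR"}},
]
is_hr = lambda m: m["department"] == "HR"
query = [1.0, 0.0]

post = post_filter_search(query, entries, is_hr, k=3)  # only "a" survives the cut
pre = pre_filter_search(query, entries, is_hr, k=3)    # "a" and "d"
```

With the same filter and the same k, post-filtering returns one result and pre-filtering returns two — exactly the "fewer than k" failure mode described above.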
Design your metadata schema upfront. Think about what filters your users will need: department, document type, date range, access level, customer ID. Add these as metadata at indexing time. Retrofitting metadata later means re-indexing everything.
CRUD Operations
Adding, updating, and deleting vectors
Upsert (Insert/Update)
Most vector stores use upsert — insert if new, update if the ID already exists. This is the primary way to add data. Batch upserts (hundreds or thousands at once) are much faster than individual inserts.
Delete
Delete by ID or by metadata filter. When a source document changes, delete all chunks from the old version and upsert the new chunks. Some stores support namespaces or collections to isolate different datasets.
Re-indexing
If you change your embedding model or chunking strategy, you must re-embed and re-index everything. Vectors from different models are incompatible — you cannot mix them in the same index. Plan for this by keeping your original text accessible.
```python
# Chroma — full workflow
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add (upsert)
collection.add(
    ids=["chunk_1", "chunk_2"],
    embeddings=[vec1, vec2],
    documents=["text 1", "text 2"],
    metadatas=[
        {"source": "doc.pdf", "page": 1},
        {"source": "doc.pdf", "page": 2}
    ]
)

# Query
results = collection.query(
    query_embeddings=[query_vec],
    n_results=5
)

# Delete
collection.delete(ids=["chunk_1"])
```
Always store the original text alongside the vector. You need it for the LLM prompt (the generation step), for debugging retrieval, and for re-indexing if you change models. Some stores (Chroma, Weaviate) store text natively; others (FAISS) require a separate text store.
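Putting the pieces together, a minimal in-memory store showing upsert, query, and delete semantics with text and metadata kept alongside each vector. A teaching sketch in plain Python, not a substitute for a real vector store:

```python
import math

class TinyVectorStore:
    """Entries keyed by id; upsert overwrites, delete removes, query ranks by cosine."""

    def __init__(self):
        self._entries = {}

    def upsert(self, doc_id, vector, text, metadata=None):
        # Insert if new, overwrite if the id already exists
        self._entries[doc_id] = {"vector": vector, "text": text, "metadata": metadata or {}}

    def delete(self, ids):
        for doc_id in ids:
            self._entries.pop(doc_id, None)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def query(self, vector, k=5):
        scored = [
            (self._cosine(vector, e["vector"]), doc_id, e)
            for doc_id, e in self._entries.items()
        ]
        scored.sort(key=lambda t: -t[0])
        # Return text + metadata alongside the score, as real stores do
        return [
            {"id": doc_id, "score": s, "text": e["text"], "metadata": e["metadata"]}
            for s, doc_id, e in scored[:k]
        ]

store = TinyVectorStore()
store.upsert("chunk_1", [1.0, 0.0], "text 1", {"page": 1})
store.upsert("chunk_2", [0.0, 1.0], "text 2", {"page": 2})
store.upsert("chunk_1", [0.9, 0.1], "text 1 (edited)")  # same id: upsert = overwrite
hits = store.query([1.0, 0.0], k=1)
store.delete(["chunk_2"])
```

Because the text rides along with the vector, the query result is directly usable as LLM context — no second lookup against a separate document store.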
Framework Integration
Using vector stores with LangChain and LlamaIndex
LangChain VectorStore Interface
LangChain provides a unified VectorStore interface. All stores (Chroma, Pinecone, Qdrant, pgvector, FAISS) implement the same methods: add_documents(), similarity_search(), as_retriever(). Switch stores by changing one line of code.
LlamaIndex VectorStoreIndex
LlamaIndex wraps vector stores in a VectorStoreIndex. It handles chunking, embedding, and storage in one call. The as_query_engine() method returns a ready-to-use retriever + generator pipeline.
```python
# LangChain — Chroma
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)

# Search
docs = vectorstore.similarity_search(
    "refund policy", k=5
)

# As a retriever (for chains)
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)
```
The framework handles embedding automatically. When you call from_documents(), LangChain embeds all chunks using the provided embedding model and stores them. When you call similarity_search(), it embeds the query and searches. You never touch raw vectors directly.
Choosing the Right Vector Store
A practical decision framework
Decision Framework
Prototyping / < 100K vectors:
Use Chroma. In-process, no server, 3 lines of code. Persists to disk with SQLite.

Production / < 10M vectors:
Use Pinecone (managed, zero ops) or Qdrant Cloud (managed, strong filtering). Or self-host Qdrant/Weaviate if you need data control.

Already using Postgres:
Use pgvector. No new infrastructure. Good enough for < 5M vectors with HNSW index.

Billion-scale:
Use Milvus (GPU-accelerated) or Pinecone (serverless scales automatically).

Offline / research:
Use FAISS. Fastest raw ANN performance. No metadata — pair with a separate store.
Key Questions to Ask
How many vectors? Under 100K = anything works. Over 10M = need careful index tuning.

Do you need metadata filtering? If yes, avoid FAISS (no metadata). Pinecone, Qdrant, and Weaviate have excellent filtering.

Managed or self-hosted? Managed = less ops, more cost. Self-hosted = more control, more work.

Hybrid search needed? Weaviate and Qdrant support vector + BM25 keyword search natively. Pinecone added sparse-dense support.
The vector store is rarely the bottleneck. Most RAG quality issues come from bad chunking or bad embeddings, not the vector store. Pick one that fits your ops model, get it running, and focus your optimization energy on the retrieval pipeline above the store.