FAISS vector stores in production RAG
You reached for Pinecone before checking if FAISS already does the job
You started a new RAG project. The tutorial you followed used Pinecone. You signed up, got API keys, set up billing, and spent 2 hours on auth and rate limits. Your corpus is 20k documents. FAISS would have handled this in-process in 30 lines of Python with zero infrastructure. You did not even consider it because the tutorial said "use a vector database."
This post is the FAISS decision tree and production setup for RAG: when FAISS is the right choice, which index type to pick, how to persist the index across restarts, and the 4 settings that determine if you should stick with FAISS or graduate to a managed store.
Why is FAISS underrated for production RAG?
Because tutorials skip it in favor of managed services that require accounts, API keys, and monthly bills. For small-to-medium corpuses (under 1-5 million vectors), FAISS is faster, cheaper, and simpler than any managed alternative. 3 specific advantages:
- In-process latency. No network hop. FAISS queries land in under 5 ms for 100k vectors vs 50-200 ms for a managed service.
- Zero infra cost. No monthly bill, no rate limits, no multi-tenant noisy neighbors.
- Trivial deployment. Ships inside your service binary or as a memory-mapped file. No separate service to maintain.
The trade-off: FAISS is a library, not a service, so you handle persistence, sharding, and replication yourself. Above ~5 million vectors or ~10k queries per second, managed services start earning their keep.
```mermaid
graph TD
    Corpus[Document corpus] --> Embed[Embedding model]
    Embed --> Vecs[float32 vectors]
    Vecs --> Index[FAISS index]
    Index -->|in-process| Search["Query response < 5ms"]
    Query[Query text] --> EmbedQ[Same embedding model]
    EmbedQ --> QV[Query vector]
    QV --> Search
    Index -.->|save| Disk[(index.faiss file on disk)]
    Disk -.->|load on startup| Index
    style Index fill:#dbeafe,stroke:#1e40af
    style Search fill:#dcfce7,stroke:#15803d
```
Which FAISS index type should you use?
4 index types cover 95 percent of RAG workloads. Pick based on corpus size.
| Index | Corpus size | Trade-off |
|---|---|---|
| `IndexFlatL2` or `IndexFlatIP` | Under 50k | Exact search, simplest, no training needed |
| `IndexIVFFlat` | 50k to 1M | Approximate search, 10x faster at 95 percent recall |
| `IndexIVFPQ` | 1M to 10M | Quantized, 10x smaller memory at some recall loss |
| `IndexHNSWFlat` | Any, up to ~5M | Very fast, no training, higher memory |
For most RAG projects, start with `IndexFlatIP` if your corpus is under 50k documents. It is exact, simple, and fast enough. Move to `IndexIVFFlat` or `IndexHNSWFlat` only when query latency becomes noticeable.
For the broader embedding model selection that feeds the index, see the Choosing an embedding model for RAG post.
What does the production FAISS setup look like?
30 lines to build, save, load, and query.
```python
# filename: app/rag/faiss_store.py
# description: Production FAISS setup for RAG. Build once, save, load on startup.
import faiss
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer


class FaissStore:
    def __init__(self, dim: int, index_path: str):
        self.dim = dim
        self.index_path = Path(index_path)
        self.index = None
        self.docs: list[str] = []  # keep raw docs alongside vectors

    def build(self, docs: list[str], embedder: SentenceTransformer) -> None:
        vectors = embedder.encode(docs, normalize_embeddings=True)
        self.index = faiss.IndexFlatIP(self.dim)
        self.index.add(np.asarray(vectors, dtype='float32'))
        self.docs = docs

    def save(self) -> None:
        faiss.write_index(self.index, str(self.index_path))
        self.index_path.with_suffix('.docs.txt').write_text('\n\x1e\n'.join(self.docs))

    def load(self) -> None:
        self.index = faiss.read_index(str(self.index_path))
        self.docs = self.index_path.with_suffix('.docs.txt').read_text().split('\n\x1e\n')

    def search(self, query_vector: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        scores, idxs = self.index.search(query_vector.reshape(1, -1).astype('float32'), k)
        return [(self.docs[i], float(scores[0, rank])) for rank, i in enumerate(idxs[0])]
```
3 decisions matter. `IndexFlatIP` uses inner product similarity, which matches most sentence-transformer models out of the box (after normalization). The `docs` list is stored alongside the index so a single load restores both. `normalize_embeddings=True` at encode time means you do not need to normalize query vectors before search.
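The normalization point is easy to verify directly: once vectors are unit-length, inner product and cosine similarity are the same number, so `IndexFlatIP` over normalized vectors is cosine search. A small numpy check, independent of any embedding model:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(384), rng.standard_normal(384)

# Cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, as normalize_embeddings=True does at encode time
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n  # plain inner product, what IndexFlatIP computes

assert np.isclose(cosine, inner)
```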
How do you persist and reload the index across restarts?
Build once, save to disk, load on startup via FastAPI lifespan. The index becomes part of your service's initialization, not something you rebuild every time.
```python
# filename: app/main.py
# description: FastAPI lifespan loads the FAISS index once per worker.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

from app.rag.faiss_store import FaissStore


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.embedder = SentenceTransformer('BAAI/bge-small-en-v1.5')
    app.state.store = FaissStore(dim=384, index_path='data/corpus.faiss')
    app.state.store.load()
    yield


app = FastAPI(lifespan=lifespan)
```
Load time is typically 100-500 ms for a 100k-document index. Happens once per worker at startup. Zero cost per request afterward.
For the FastAPI lifespan pattern that this fits into, see the FastAPI lifespan for agentic services post.
What are the 4 settings that decide FAISS vs a managed store?
Ask these questions before choosing:
- How many vectors? Under 1 million: FAISS. 1-10 million: FAISS with `IndexIVFPQ`. Over 10 million: managed store starts to make sense.
- How many queries per second? Under 100: FAISS. 100-1000: FAISS with careful index choice. Over 1000: managed store or custom sharding.
- Do you need live updates? FAISS is build-once, read-many. If you need to insert vectors in real time without rebuilding, managed stores (Pinecone, Qdrant) win.
- Do you need multi-tenancy? FAISS has no native tenancy. If you need to isolate vectors per customer, either build one index per tenant or use a managed store that supports it natively.
If all 4 answers favor FAISS, use FAISS. If any one favors managed, evaluate both.
For the trade-off comparison across vector databases, see the Choosing a vector database post.
What to do Monday morning
- If your current RAG project uses a managed vector store and has under 1 million vectors, build a FAISS prototype in parallel. It will likely be faster and cheaper.
- Pick `IndexFlatIP` for under 50k vectors, `IndexIVFFlat` or `IndexHNSWFlat` for 50k-1M. Both are good defaults.
- Store the index as a file and load it at startup via FastAPI `lifespan`. No separate service to run.
- Benchmark query latency and recall on your real corpus. If p95 latency is under 10ms and recall@5 is above 0.90, you are done.
- Revisit the decision annually. If the corpus crosses 5-10 million vectors or you need live updates, graduate to a managed store.
The headline: FAISS is the right default for small-to-medium RAG corpuses. 30 lines of Python, zero infrastructure, sub-5ms latency. Reach for a managed store only when the corpus, query rate, or tenancy needs genuinely exceed FAISS capabilities.
Frequently asked questions
When should I use FAISS instead of Pinecone or Weaviate?
For corpuses under 1-5 million vectors, query rates under 100 per second, and single-tenant or static-tenant setups. FAISS is faster (no network), cheaper (no bill), and simpler (no service to maintain). Move to a managed store when you cross those thresholds or need live vector updates in real time.
Which FAISS index type should I start with?
`IndexFlatIP` for under 50k vectors: exact search, no training, simple. `IndexIVFFlat` for 50k-1M: approximate search, 10x faster at 95 percent recall. `IndexHNSWFlat` for any size up to ~5M: very fast but higher memory. For most RAG projects, start flat and graduate only when latency becomes noticeable.
How do I persist a FAISS index across service restarts?
Call `faiss.write_index(index, 'corpus.faiss')` after building. On startup, call `faiss.read_index('corpus.faiss')`. Store the raw document texts in a sibling file so you can restore both vectors and source material with one load. Load time is typically 100-500 ms for a 100k-document index.
Can FAISS handle concurrent queries?
Yes, FAISS indexes are thread-safe for read operations. Multiple workers can search the same in-memory index concurrently. Writes (adding or removing vectors) require external synchronization. For high-write workloads, consider a managed store; for read-heavy RAG, FAISS scales linearly with CPU cores.
How does FAISS compare to pgvector for RAG?
FAISS is faster for pure vector search (in-process, optimized C++). pgvector is better when you need to combine vector search with SQL filters (e.g., "find similar docs where user_id = X and created_at > Y"). For pure similarity search, FAISS wins; for hybrid filter+vector workloads, pgvector or a dedicated store like Weaviate wins.
Key takeaways
- FAISS is underrated for small-to-medium RAG corpuses. Faster, cheaper, and simpler than any managed vector store for under 1-5 million vectors.
- Pick the index type from corpus size: `IndexFlatIP` under 50k, `IndexIVFFlat` up to 1M, `IndexHNSWFlat` for high-query-rate setups.
- Persist the index as a file, load at startup via FastAPI `lifespan`. No separate service to run.
- Normalize embeddings at encode time so cosine similarity becomes a simple inner product.
- Revisit the FAISS-vs-managed decision annually. Graduate when corpus, QPS, live updates, or multi-tenancy push you past what FAISS handles cleanly.
- To see FAISS wired into a full production RAG pipeline with reranking and evaluation, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the full FAISS documentation covering all index types, GPU support, and quantization, see the FAISS wiki.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.