FAISS vector stores in production RAG
You reached for Pinecone before checking if FAISS already does the job
You started a new RAG project. The tutorial you followed used Pinecone. You signed up, got API keys, set up billing, and spent 2 hours on auth and rate limits. Your corpus is 20k documents. FAISS would have handled this in-process in 30 lines of Python with zero infrastructure. You did not even consider it because the tutorial said "use a vector database."
This post is the FAISS decision tree and production setup for RAG: when FAISS is the right choice, which index type to pick, how to persist the index across restarts, and the 4 settings that determine if you should stick with FAISS or graduate to a managed store.
Why is FAISS underrated for production RAG?
Because tutorials skip it in favor of managed services that require accounts, API keys, and monthly bills. For small-to-medium corpuses (under 1-5 million vectors), FAISS is faster, cheaper, and simpler than any managed alternative. 3 specific advantages:
- In-process latency. No network hop. FAISS queries land in under 5 ms for 100k vectors vs 50-200 ms for a managed service.
- Zero infra cost. No monthly bill, no rate limits, no multi-tenant noisy neighbors.
- Trivial deployment. Ships inside your service binary or as a memory-mapped file. No separate service to maintain.
The trade-off: FAISS is a library, not a service, so you handle persistence, sharding, and replication yourself. Above ~5 million vectors or ~10k queries per second, managed services start earning their keep.
```mermaid
graph TD
    Corpus[Document corpus] --> Embed[Embedding model]
    Embed --> Vecs[float32 vectors]
    Vecs --> Index[FAISS index]
    Index -->|in-process| Search["Query response < 5ms"]
    Query[Query text] --> EmbedQ[Same embedding model]
    EmbedQ --> QV[Query vector]
    QV --> Search
    Index -.->|save| Disk[(index.faiss file on disk)]
    Disk -.->|load on startup| Index
    style Index fill:#dbeafe,stroke:#1e40af
    style Search fill:#dcfce7,stroke:#15803d
```
Which FAISS index type should you use?
4 index types cover 95 percent of RAG workloads. Pick based on corpus size.
| Index | Corpus size | Trade-off |
|---|---|---|
| `IndexFlatL2` or `IndexFlatIP` | Under 50k | Exact search, simplest, no training needed |
| `IndexIVFFlat` | 50k to 1M | Approximate search, 10x faster at 95 percent recall |
| `IndexIVFPQ` | 1M to 10M | Quantized, 10x smaller memory at some recall loss |
| `IndexHNSWFlat` | Any, up to ~5M | Very fast, no training, higher memory |
For most RAG projects, start with `IndexFlatIP` if your corpus is under 50k documents. It is exact, simple, and fast enough. Move to `IndexIVFFlat` or `IndexHNSWFlat` only when query latency becomes noticeable.
For the broader embedding model selection that feeds the index, see the Choosing an embedding model for RAG post.
What does the production FAISS setup look like?
30 lines to build, save, load, and query.
```python
# filename: app/rag/faiss_store.py
# description: Production FAISS setup for RAG. Build once, save, load on startup.
import faiss
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer


class FaissStore:
    def __init__(self, dim: int, index_path: str):
        self.dim = dim
        self.index_path = Path(index_path)
        self.index = None
        self.docs: list[str] = []  # keep raw docs alongside vectors

    def build(self, docs: list[str], embedder: SentenceTransformer) -> None:
        vectors = embedder.encode(docs, normalize_embeddings=True)
        self.index = faiss.IndexFlatIP(self.dim)
        self.index.add(np.asarray(vectors, dtype='float32'))
        self.docs = docs

    def save(self) -> None:
        faiss.write_index(self.index, str(self.index_path))
        self.index_path.with_suffix('.docs.txt').write_text('\n\x1e\n'.join(self.docs))

    def load(self) -> None:
        self.index = faiss.read_index(str(self.index_path))
        self.docs = self.index_path.with_suffix('.docs.txt').read_text().split('\n\x1e\n')

    def search(self, query_vector: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        scores, idxs = self.index.search(query_vector.reshape(1, -1).astype('float32'), k)
        return [(self.docs[i], float(scores[0, rank])) for rank, i in enumerate(idxs[0])]
```
3 decisions matter. `IndexFlatIP` uses inner product similarity, which matches most sentence-transformer models out of the box (after normalization). The `docs` list is stored alongside the index so a single load restores both. `normalize_embeddings=True` at encode time means you do not need to normalize query vectors before search.
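The normalization point is easy to verify directly: once vectors are unit-length, inner product and cosine similarity are the same number, so `IndexFlatIP` over normalized vectors is cosine search. A small numpy check, independent of any embedding model:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(384), rng.standard_normal(384)

# Cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, as normalize_embeddings=True does at encode time
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n  # plain inner product, what IndexFlatIP computes

assert np.isclose(cosine, inner)
```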
How do you persist and reload the index across restarts?
Build once, save to disk, load on startup via FastAPI lifespan. The index becomes part of your service's initialization, not something you rebuild every time.
```python
# filename: app/main.py
# description: FastAPI lifespan loads the FAISS index once per worker.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

from app.rag.faiss_store import FaissStore


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.embedder = SentenceTransformer('BAAI/bge-small-en-v1.5')
    app.state.store = FaissStore(dim=384, index_path='data/corpus.faiss')
    app.state.store.load()
    yield


app = FastAPI(lifespan=lifespan)
```
Load time is typically 100-500 ms for a 100k-document index. Happens once per worker at startup. Zero cost per request afterward.
For the FastAPI lifespan pattern that this fits into, see the FastAPI lifespan for agentic services post.
What are the 4 settings that decide FAISS vs a managed store?
Ask these questions before choosing:
- How many vectors? Under 1 million: FAISS. 1-10 million: FAISS with `IndexIVFPQ`. Over 10 million: managed store starts to make sense.
- How many queries per second? Under 100: FAISS. 100-1000: FAISS with careful index choice. Over 1000: managed store or custom sharding.
- Do you need live updates? FAISS is build-once, read-many. If you need to insert vectors in real time without rebuilding, managed stores (Pinecone, Qdrant) win.
- Do you need multi-tenancy? FAISS has no native tenancy. If you need to isolate vectors per customer, either build one index per tenant or use a managed store that supports it natively.
If all 4 answers favor FAISS, use FAISS. If any one favors managed, evaluate both.
For the trade-off comparison across vector databases, see the Choosing a vector database post.
What to do Monday morning
- If your current RAG project uses a managed vector store and has under 1 million vectors, build a FAISS prototype in parallel. It will likely be faster and cheaper.
- Pick `IndexFlatIP` for under 50k vectors, `IndexIVFFlat` or `IndexHNSWFlat` for 50k-1M. Both are good defaults.
- Store the index as a file and load it at startup via FastAPI `lifespan`. No separate service to run.
- Benchmark query latency and recall on your real corpus. If p95 latency is under 10ms and recall@5 is above 0.90, you are done.
- Revisit the decision annually. If the corpus crosses 5-10 million vectors or you need live updates, graduate to a managed store.
The headline: FAISS is the right default for small-to-medium RAG corpuses. 30 lines of Python, zero infrastructure, sub-5ms latency. Reach for a managed store only when the corpus, query rate, or tenancy needs genuinely exceed FAISS capabilities.
Frequently asked questions
When should I use FAISS instead of Pinecone or Weaviate?
For corpuses under 1-5 million vectors, query rates under 100 per second, and single-tenant or static-tenant setups. FAISS is faster (no network), cheaper (no bill), and simpler (no service to maintain). Move to a managed store when you cross those thresholds or need live vector updates in real time.
Which FAISS index type should I start with?
`IndexFlatIP` for under 50k vectors: exact search, no training, simple. `IndexIVFFlat` for 50k-1M: approximate search, 10x faster at 95 percent recall. `IndexHNSWFlat` for any size up to ~5M: very fast but higher memory. For most RAG projects, start flat and graduate only when latency becomes noticeable.
How do I persist a FAISS index across service restarts?
Call `faiss.write_index(index, 'corpus.faiss')` after building. On startup, call `faiss.read_index('corpus.faiss')`. Store the raw document texts in a sibling file so you can restore both vectors and source material with one load. Load time is typically 100-500 ms for a 100k-document index.
Can FAISS handle concurrent queries?
Yes, FAISS indexes are thread-safe for read operations. Multiple workers can search the same in-memory index concurrently. Writes (adding or removing vectors) require external synchronization. For high-write workloads, consider a managed store; for read-heavy RAG, FAISS scales linearly with CPU cores.
How does FAISS compare to pgvector for RAG?
FAISS is faster for pure vector search (in-process, optimized C++). pgvector is better when you need to combine vector search with SQL filters (e.g., "find similar docs where user_id = X and created_at > Y"). For pure similarity search, FAISS wins; for hybrid filter+vector workloads, pgvector or a dedicated store like Weaviate wins.
Key takeaways
- FAISS is underrated for small-to-medium RAG corpuses. Faster, cheaper, and simpler than any managed vector store for under 1-5 million vectors.
- Pick the index type from corpus size: `IndexFlatIP` under 50k, `IndexIVFFlat` up to 1M, `IndexHNSWFlat` for high-query-rate setups.
- Persist the index as a file, load at startup via FastAPI `lifespan`. No separate service to run.
- Normalize embeddings at encode time so cosine similarity becomes a simple inner product.
- Revisit the FAISS-vs-managed decision annually. Graduate when corpus, QPS, live updates, or multi-tenancy push you past what FAISS handles cleanly.
- To see FAISS wired into a full production RAG pipeline with reranking and evaluation, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the full FAISS documentation covering all index types, GPU support, and quantization, see the FAISS wiki.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.