Which Vector Database Should You Use for RAG?
Start with pgvector inside your existing Postgres instance. For most production RAG systems handling under 5 million vectors with moderate query load, pgvector is the correct answer, and switching to a dedicated vector database later is far cheaper than managing a second infrastructure dependency today.
I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software since 2010. I founded Sista AI, where choosing and living with vector databases for a production workforce of autonomous agents has been a year-long, opinionated education. Through my AI Architecture advisory practice, I have designed RAG pipelines and vector search systems for startups and scale-ups. The advice below reflects real decisions, not benchmarks run in isolation. See more about my background on the about page.
Why Most Teams Should Not Start With a Dedicated Vector DB
The dedicated vector database ecosystem (Pinecone, Weaviate, Qdrant, Milvus, Chroma) is well-marketed and technically impressive. It is also operationally heavier than it needs to be for most teams at the point they are evaluating it.
Here is what you actually take on when you add a dedicated vector store on day one:
- A second persistence layer to back up, monitor, and tune
- A second auth surface to secure and rotate credentials for
- Data sync logic between your relational records and your vector store (document IDs, metadata, soft deletes)
- An extra network hop and potential point of inconsistency
- A new operational runbook your on-call rotation has to learn
None of these are dealbreakers at scale. They are all unnecessary costs at the prototyping and early-production stage when your corpus is under a few million chunks and your query rate is measured in hundreds per minute, not tens of thousands.
The question is never 'which vector database is best.' It is 'what is the cheapest solution that meets my requirements today, with a clear upgrade path tomorrow.'
What pgvector Actually Handles Well
pgvector adds a vector column type and approximate nearest-neighbor (ANN) indexes (HNSW and IVFFlat) directly to Postgres. Since Postgres 15 and pgvector 0.5+, the HNSW index performs within a few percent of dedicated databases on standard benchmarks for corpora under 5 million vectors.
Concrete capabilities you get immediately:
- Vector similarity search with cosine, L2, or inner product distance
- HNSW index with tunable `m` and `ef_construction` parameters
- Hybrid search: combine `ts_rank` BM25-style full-text with vector similarity in a single SQL query
- Standard Postgres filters on metadata columns with proper indexes, no separate filter pass
- Transactional consistency: your document row and its embedding update in the same commit
- Existing Postgres tooling: logical replication, pg_dump, VACUUM, RLS, pgBouncer
The hybrid search point matters more than most teams realize. A query like 'find the 20 most relevant chunks about invoice disputes, where tenant_id = 42 and created_at > 2024-01-01' executes cleanly in one SQL statement. In a dedicated vector store, the metadata filter either runs before (pre-filter, reduces recall) or after (post-filter, wastes compute) the ANN pass. Postgres plans both together.
The Decision Framework: When to Upgrade
Use this table as a starting point. The thresholds are based on patterns I see in production, not vendor documentation.
| Signal | Stay with pgvector | Evaluate dedicated store |
|---|---|---|
| Corpus size | Under 5M vectors | 5M to 50M+ vectors |
| Query throughput | Under 500 QPS | 500+ QPS with p99 SLA |
| Embedding dimensions | Up to 1536 (OpenAI text-embedding-ada-002, text-embedding-3-small) | 3072+ or multiple embedding models per record |
| Existing stack | Already running Postgres | No Postgres, or using MongoDB/Cassandra |
| Filtering complexity | Simple predicates on a handful of columns | Dynamic faceted filtering across dozens of attributes |
| Multi-tenancy isolation | Row-level security is sufficient | Hard namespace isolation per tenant required |
| Geo/latency requirements | Single region, normal latency | Multi-region with sub-20ms p99 anywhere |
The honest version of this table: if you are reading this article before launching your first RAG feature, you almost certainly belong in the left column.
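The table can also be read as a checklist. Here is a minimal Python sketch of that reading, assuming a hypothetical `WorkloadSignals` record whose field names are mine and not part of any library. The thresholds are copied from the table, and only the rows with a clear yes/no or numeric threshold are encoded.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSignals:
    """Hypothetical description of a workload; field names are illustrative."""
    vector_count: int
    peak_qps: float
    embedding_dims: int
    runs_postgres: bool
    multi_region: bool


def upgrade_signals(w: WorkloadSignals) -> list[str]:
    """Return the table rows that point toward evaluating a dedicated store."""
    reasons = []
    if w.vector_count >= 5_000_000:
        reasons.append("corpus size")
    if w.peak_qps >= 500:
        reasons.append("query throughput")
    if w.embedding_dims >= 3072:
        reasons.append("embedding dimensions")
    if not w.runs_postgres:
        reasons.append("existing stack")
    if w.multi_region:
        reasons.append("geo/latency requirements")
    return reasons


# A typical first RAG feature: small corpus, low traffic, Postgres already in place.
early_stage = WorkloadSignals(800_000, 40, 1536, True, False)
print(upgrade_signals(early_stage))
```

An empty list means every encoded row lands in the left column. Treat a non-empty list as a prompt to investigate, not as an automatic migration decision.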
When a Dedicated Vector Database Earns Its Complexity
There are real scenarios where pgvector becomes the wrong tool:
Very large corpora with frequent updates
At 50 million vectors, HNSW index rebuild time during schema migration or reindexing becomes painful. Dedicated stores like Qdrant and Weaviate support online index updates without downtime. Milvus was designed from the ground up for this case.
Multiple embedding models per document
If you need to store a dense embedding (OpenAI), a sparse embedding (learned term weights, for example from SPLADE), and a late-interaction embedding (ColBERT) for the same document, and query across all three at retrieval time, pgvector syntax gets awkward. Qdrant's named vectors handle this natively.
Managed SaaS with no Postgres expertise on the team
Pinecone Serverless genuinely abstracts away operations. If you have no one to tune HNSW parameters or interpret VACUUM behavior, and you are already paying for managed everything, the operational simplicity can justify the cost and the sync overhead.
Real-time streaming ingestion at high volume
If you are ingesting 10,000 new vectors per second from an event stream, Milvus or Qdrant handle high write throughput better than Postgres. With pgvector, Postgres write throughput becomes your bottleneck before the ANN index does.
Hybrid Search: Why Pure Vector Retrieval is Rarely Enough
A retrieval mistake I see repeatedly: teams build a RAG pipeline using cosine similarity only, then wonder why their chatbot confidently gives wrong answers to exact-match queries like product codes, legal clause references, or error codes.
Vector similarity captures semantic meaning. BM25 captures exact lexical matches. You need both.
In pgvector, hybrid search looks like this conceptually:
```sql
-- $1 = query embedding (vector), $2 = query text, $3 = tenant id
SELECT id, content,
       (1 - (embedding <=> $1)) * 0.6
       + ts_rank(search_vector, plainto_tsquery($2)) * 0.4 AS score
FROM documents
WHERE tenant_id = $3
ORDER BY score DESC
LIMIT 20;
```

The weights (0.6 / 0.4) are a starting point. Tune them by running your eval set against both extremes and picking the split that maximizes your chosen retrieval metric (typically NDCG@10 or Recall@10). This tuning step is not optional if you care about answer quality.
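The weight tuning step can be sketched in a few lines of Python. This is an illustration under assumptions, not a drop-in tool: it assumes each eval case already holds per-candidate vector and text scores normalized to the 0 to 1 range, plus the set of relevant document IDs. The function names and the `eval_cases` shape are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def rank_with_weight(candidates, vector_weight):
    """candidates: list of (doc_id, vector_score, text_score), scores in [0, 1]."""
    text_weight = 1.0 - vector_weight
    scored = [
        (vector_weight * v + text_weight * t, doc_id)
        for doc_id, v, t in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


def best_vector_weight(eval_cases, k=10, steps=11):
    """Sweep vector_weight over [0, 1] and return (weight, mean Recall@k)."""
    best = (0.0, -1.0)
    for i in range(steps):
        weight = i / (steps - 1)
        mean_recall = sum(
            recall_at_k(rank_with_weight(c["candidates"], weight), c["relevant"], k)
            for c in eval_cases
        ) / len(eval_cases)
        if mean_recall > best[1]:  # ties keep the earlier (lower) weight
            best = (weight, mean_recall)
    return best
```

Swap `recall_at_k` for NDCG@10 if that is the metric you report. The sweep is deliberately coarse. With 50 to 100 eval cases, a finer grid mostly fits noise.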
Dedicated databases handle hybrid search differently. Weaviate has a built-in alpha parameter on its hybrid operator. Qdrant uses sparse vector support via its named vectors API. Neither approach is more accurate than a well-tuned pgvector hybrid query. The difference is ergonomics, not capability, at moderate scale.
What Teams Get Wrong When Building RAG Retrieval
In AI architecture advisory engagements, I see the same mistakes repeatedly. None of them are database choices.
Skipping evals entirely
Teams ship a RAG pipeline without a single eval dataset. They have no idea if retrieval quality improved or regressed when they changed chunking strategy, embedding model, or reranker. Build a minimum eval set of 50 to 100 question-context pairs before touching production. Tools like RAGAS or LangSmith evals take a day to set up and pay back immediately.
Ignoring chunking strategy
The vector database you pick has almost no effect on retrieval quality compared to your chunking decisions. Fixed-size 512-token chunks with no sentence boundary awareness will produce bad retrieval regardless of whether you use pgvector or Pinecone. Use sentence-aware splitting, overlap adjacent chunks by 10 to 15 percent, and store the parent document ID so you can fetch wider context after retrieval.
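A minimal sketch of sentence-aware chunking with overlap, under two loud assumptions: word counts stand in for tokens, and the regex splitter is naive (it will break on abbreviations). The function names and the chunk dictionary shape are illustrative.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text, parent_id, max_words=200, overlap_ratio=0.15):
    """Pack whole sentences into chunks, carrying trailing sentences forward."""
    sentences = split_sentences(text)
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append({"parent_id": parent_id, "text": " ".join(current)})
            # Keep trailing sentences worth roughly overlap_ratio of the budget.
            budget, carried = int(max_words * overlap_ratio), []
            for prev in reversed(current):
                if sum(len(s.split()) for s in carried) + len(prev.split()) > budget:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append({"parent_id": parent_id, "text": " ".join(current)})
    return chunks
```

Every chunk keeps its `parent_id`, so after retrieval you can pull neighbouring chunks or the whole parent document. In production, replace the word count with your embedding model's tokenizer and the regex with a proper sentence segmenter.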
Embedding model mismatch
Do not use text-embedding-ada-002 for a multilingual corpus. Do not use a 1536-dimension model for a 200-word corpus where a 384-dimension model matches your scale and costs 4x less per query. Model choice depends on language, domain specificity, and embedding dimension versus recall tradeoff. Benchmark at least three options on your actual data before committing.
No reranker in the retrieval pipeline
Retrieve 20 to 50 candidates from the vector index, then pass them through a cross-encoder reranker (Cohere Rerank, a local BGE reranker, or Jina Reranker) before sending the top 5 to the LLM context. Reranking consistently improves answer quality by 15 to 30 percent in my experience, at a fraction of the cost of using a larger LLM.
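The retrieve-then-rerank shape is simple enough to sketch without committing to a vendor. In the Python below, `retrieve` and `score_pair` are placeholders for your own vector index call and cross-encoder. The toy stand-ins exist only so the example runs on its own and say nothing about real reranker quality.

```python
def retrieve_then_rerank(query, retrieve, score_pair, fetch_k=30, top_n=5):
    """Fetch a wide candidate set, rescore each (query, chunk) pair, keep the best.

    retrieve(query, k) -> list of chunk strings from the vector index.
    score_pair(query, chunk) -> float, higher means more relevant.
    Both callables are placeholders for your own index and cross-encoder.
    """
    candidates = retrieve(query, fetch_k)
    rescored = sorted(
        candidates,
        key=lambda chunk: score_pair(query, chunk),
        reverse=True,
    )
    return rescored[:top_n]


# Toy stand-ins so the sketch runs without any external service.
def fake_retrieve(query, k):
    corpus = [
        "Refund policy for annual plans",
        "How to dispute an invoice",
        "Invoice dispute escalation steps",
        "Office holiday schedule",
    ]
    return corpus[:k]


def word_overlap(query, chunk):
    """Crude lexical score: share of query words present in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)


top = retrieve_then_rerank("invoice dispute", fake_retrieve, word_overlap, top_n=2)
print(top)
```

Keeping the reranker behind a plain callable means you can A/B a hosted reranker against a local one on your eval set without touching the pipeline.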
Storing raw embeddings without versioning
You will change your embedding model. When you do, you need to re-embed everything. Store the model name and version alongside every vector. A schema column like embedding_model varchar(64) next to the vector column takes five minutes to add and saves hours of confusion during migration.
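Once the model name is stored per row, finding what still needs re-embedding is a one-line filter. A small Python sketch, assuming a hypothetical `StoredChunk` record that mirrors the `embedding_model` column and model labels that are only examples:

```python
from dataclasses import dataclass


@dataclass
class StoredChunk:
    chunk_id: int
    embedding_model: str  # mirrors the embedding_model column described above


def stale_chunk_ids(chunks, current_model):
    """Ids whose vectors came from a different model and need re-embedding."""
    return [c.chunk_id for c in chunks if c.embedding_model != current_model]


rows = [
    StoredChunk(1, "model-v1"),
    StoredChunk(2, "model-v2"),
    StoredChunk(3, "model-v1"),
]
print(stale_chunk_ids(rows, "model-v2"))
```

The same column also lets you restrict queries to a single model during a migration, so vectors from different models are never compared in one search.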
Observability and Guardrails in Production RAG
Retrieval quality is not static. Corpus drift, embedding model changes, and query distribution shifts all degrade quality silently. Production RAG needs instrumentation the same way a production API does.
Minimum viable observability for a RAG system:
- Log retrieval scores per query. Track cosine similarity of the top-k chunks returned. A sudden drop in average top-1 score signals corpus quality issues or query distribution shift.
- Log chunk sources. Know which document chunks are cited by the LLM. If one chunk is cited in 40 percent of answers, your retrieval is not actually diverse.
- Measure retrieval latency separately from generation latency. A slow index query looks like a slow LLM response unless you instrument the pipeline stages independently.
- Run evals on a schedule. Re-run your eval set weekly on production data. Alert when NDCG@10 drops by more than 5 percent from baseline.
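The scheduled eval check can be sketched as follows. Two assumptions are baked in: the ideal ordering is computed only from the relevance labels of the retrieved list (a common simplification that ignores relevant documents that were never retrieved), and "5 percent" is read as a relative drop from baseline. Function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for graded relevances in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0


def should_alert(current, baseline, max_relative_drop=0.05):
    """True when the metric fell more than max_relative_drop below baseline."""
    return baseline > 0 and (baseline - current) / baseline > max_relative_drop
```

Average `ndcg_at_k` across the eval set, store the weekly value, and pass the latest and baseline numbers to `should_alert`. If you mean an absolute drop of five points rather than a relative one, change the comparison accordingly.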
On guardrails: always apply a similarity threshold before passing retrieved chunks to the LLM. If your top-1 result has a cosine similarity below 0.70 (threshold depends on your embedding model and corpus), return 'I don't have information about that' rather than hallucinating from low-signal context. This single guardrail eliminates a large class of confident wrong answers.
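A sketch of that guardrail in Python, assuming results arrive as `(chunk_text, cosine_similarity)` pairs sorted best first. The 0.70 default is the example value from the paragraph above, not a recommendation for every embedding model.

```python
FALLBACK = "I don't have information about that"


def select_context(results, min_similarity=0.70, top_n=5):
    """results: list of (chunk_text, cosine_similarity), best first.

    Returns the chunks to send to the LLM, or None when even the best
    match is below the threshold and the caller should answer with FALLBACK.
    """
    if not results or results[0][1] < min_similarity:
        return None
    return [text for text, sim in results[:top_n] if sim >= min_similarity]


context = select_context([("refund steps", 0.82), ("billing FAQ", 0.71), ("holidays", 0.40)])
answer_directly = FALLBACK if context is None else None
print(context)
```

Returning `None` instead of an empty list keeps the "nothing good enough" case explicit, so the caller cannot accidentally send an empty context to the model.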
Frequently Asked Questions
Is pgvector production-ready for RAG in 2025?
Yes. pgvector with HNSW indexing is production-ready and is used in production by companies processing millions of queries per day. The main constraints are write-heavy workloads at very high vector counts (50M+) and multi-region latency requirements. For the majority of RAG applications, pgvector with a well-sized Postgres instance handles the load with no issues.
What is the difference between Pinecone, Weaviate, Qdrant, and Chroma?
Pinecone is a fully managed SaaS with no self-hosting option. Its Serverless tier is cheap at low scale and expensive at high scale. Weaviate is open-source with a managed cloud offering and strong hybrid search support built in. Qdrant is open-source, Rust-based, has excellent named vector support for multiple embedding models per document, and has the best self-hosted performance I have tested. Chroma is primarily a development and prototyping tool and should not be your first choice for production. For most self-hosted scenarios, Qdrant is the strongest option if you have outgrown pgvector.
How many vectors can pgvector handle before I need to migrate?
The practical limit depends on your query latency requirements. In testing on a reasonable Postgres instance (16 vCPU, 64GB RAM), pgvector with HNSW handles 5 million 1536-dimension vectors at under 50ms p99 for queries. At 20 million vectors, you will start seeing p99 latency climb unless you partition aggressively. At 50 million vectors, plan the migration to a dedicated store.
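A quick back-of-envelope calculation makes these thresholds easier to reason about. The Python below estimates raw vector storage only, assuming single-precision floats plus 8 bytes of per-vector overhead (my reading of the pgvector README). It excludes the HNSW graph, row overhead, and everything else in your database, so treat the result as a lower bound.

```python
def vector_storage_gib(num_vectors, dims, bytes_per_float=4, per_vector_overhead=8):
    """Approximate raw storage for single-precision vectors, in GiB."""
    total_bytes = num_vectors * (dims * bytes_per_float + per_vector_overhead)
    return total_bytes / 2**30


for count in (5_000_000, 20_000_000, 50_000_000):
    print(f"{count:>11,} vectors: {vector_storage_gib(count, 1536):6.1f} GiB")
```

At 5 million 1536-dimension vectors the raw data is a little under 29 GiB, which fits comfortably in 64GB of RAM. At 20 million it is well over 100 GiB, so the working set no longer fits in memory on that instance. That is one plausible reason latency climbs without partitioning.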
Do I need a vector database if I use a framework like LangChain or LlamaIndex?
The framework is independent of the database choice. LangChain and LlamaIndex both support pgvector as a vector store backend. Using a framework does not force you to use a dedicated vector database. Start with the pgvector integration in whichever framework you use, and swap the backend later if you hit the limits described above.
Should I use sparse vectors, dense vectors, or hybrid for RAG?
For general-purpose document retrieval, start with dense vectors from a quality embedding model plus BM25 full-text search combined in a hybrid query. Add sparse vectors (SPLADE, BM25-weighted) explicitly only if you have a domain with heavy exact-term requirements (legal, medical, code search). ColBERT late-interaction is worth evaluating if your corpus is highly technical and you can afford the per-query compute cost. Most teams should start with dense plus BM25 hybrid and measure before adding complexity.
What is the cheapest way to run a production RAG system?
pgvector on a Postgres instance you already operate, plus a small open-source embedding model (BGE-M3 or nomic-embed-text) running on a single GPU node for embedding generation, plus a reranker (BGE reranker or Jina). For generation, choose your LLM by task: use Haiku-class models for retrieval scoring and summarization, Sonnet-class for final answer generation. This stack handles millions of queries per month at a fraction of the cost of fully managed alternatives.
Work With Me on Your RAG Architecture
Vector database selection is a small decision in the context of building a production AI system. The decisions that actually determine system quality are chunking strategy, embedding model selection, hybrid search weighting, reranker choice, eval methodology, observability design, and cost modeling across the full inference pipeline.
If you are designing a RAG system and want an independent review of your architecture before you commit to infrastructure and vendors, I offer focused AI Architecture advisory engagements. Engagements typically run 4 to 8 weeks and cover retrieval design, model selection, cost modeling, and production readiness. I do not resell tools or take referral fees. The advice is independent.
Reach out via the contact page with a brief description of your system and current blockers. Or go directly to the service details: Book an AI Architecture Advisory Session.







