Zalt Blog

Deep Dives into Code & Architecture

AT SCALE

Embeddings and Vector Databases Explained for Non-ML Builders

By Mahmoud Zalt
Insights
12m read
Most teams reach for Pinecone or Weaviate before they need them. pgvector inside your existing Postgres handles millions of rows fine, and embeddings are simpler than the ML crowd makes them sound.

Do You Need a Vector Database? Probably Not Yet

Embeddings are floating-point coordinates that encode meaning, and for most production teams under a few million rows, pgvector inside Postgres is all you need. A dedicated vector database is a scaling decision, not an architectural prerequisite.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software since 2010. For the past year I have run a production workforce of autonomous agents at Sista AI, the company I founded, where embeddings and vector stores are load-bearing infrastructure, not theory. I now help engineering teams design and ship AI systems that actually work in production. If you are evaluating where vector search fits in your architecture, I cover this directly in my AI agent development and systems work.

What Embeddings Actually Are

An embedding is a list of numbers, typically 768 to 3072 floats, that positions a piece of text (or image, or audio) inside a high-dimensional space such that semantically similar things land close together. That is the whole idea. The number 0.82 in position 417 of the vector means nothing on its own. The relative distance between two vectors is everything.

Concretely: the sentence 'how do I cancel my subscription' and 'I want to stop being billed' will produce vectors that are very close in cosine distance, even though they share no words. A keyword search using LIKE or full-text search would miss that match entirely. That is the gap embeddings close.
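
To make "close in cosine distance" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented toy values, not real model output (real embeddings have hundreds or thousands of dimensions), and the function names are my own; only the arithmetic is the point.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors, made up for illustration only.
cancel_subscription = [0.9, 0.1, 0.2]
stop_billing = [0.8, 0.2, 0.3]
pizza_recipe = [0.1, 0.9, 0.1]

near = cosine_distance(cancel_subscription, stop_billing)
far = cosine_distance(cancel_subscription, pizza_recipe)
print(near < far)  # the two billing sentences sit closer together
```

The individual numbers carry no meaning; the comparison between distances is what retrieval relies on.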

How they are generated

You pass text to an embedding model (OpenAI text-embedding-3-small, Cohere embed-v3, or a local model like nomic-embed-text) and get back a fixed-length array. You store that array. Later, you embed a query the same way and find the stored vectors with the smallest cosine distance (or, for normalized vectors, the largest dot product). That retrieval step is nearest neighbor search; once an index such as HNSW is involved it becomes approximate nearest neighbor (ANN) search, which trades a little recall for a lot of speed.

A minimal worked example

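import openai
import psycopg2

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect('dbname=app')  # placeholder connection string
cur = conn.cursor()

def embed(text):
    # use the same model for documents and queries
    response = client.embeddings.create(
        model='text-embedding-3-small',
        input=text
    )
    return response.data[0].embedding  # 1536 floats

# embed a document at index time
vector = embed('Cancel my subscription')

# store in postgres with pgvector (docs.embedding is a vector(1536) column)
cur.execute(
    'INSERT INTO docs (content, embedding) VALUES (%s, %s::vector)',
    ('Cancel my subscription', vector)
)
conn.commit()

# query at runtime
query_vec = embed('stop being billed')  # same model
cur.execute(
    # the ::vector cast lets psycopg2 pass a plain Python list
    'SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT 5',
    (query_vec,)
)
rows = cur.fetchall()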

The <=> operator is pgvector cosine distance. That is the entire retrieval pipeline, running inside ordinary Postgres, no extra infrastructure.

pgvector vs. a Dedicated Vector Database: The Honest Comparison

The vector database market (Pinecone, Weaviate, Qdrant, Milvus, Chroma) exploded because embedding search is genuinely useful and VCs funded a lot of tooling. That created a pressure to adopt dedicated infrastructure before it is warranted. Here is the real picture.

| Dimension | pgvector (Postgres) | Dedicated vector DB |
| --- | --- | --- |
| Setup cost | One extension install | New service, new ops burden |
| Query performance | Good to ~5M rows with HNSW index | Excellent at 50M+ rows |
| Filtering on metadata | Native SQL joins, full SQL planner | Varies, payload filtering often limited |
| Transactions | Full ACID | Usually none |
| Operational complexity | You already run Postgres | Additional deployment, backups, auth |
| Cost at small scale | Near zero (existing DB) | $70-$700+/month for managed services |
| When it breaks down | Hundreds of millions of rows, sub-10ms P99 SLA at high QPS | Rarely, at scale |

My default recommendation: start with pgvector. If you hit more than ~5 million embeddings and your P99 query latency degrades past your SLA, then evaluate Qdrant (self-hosted, excellent performance, Apache 2.0) or Pinecone (managed, easy, pricey). Do not pre-optimize for a scale you have not reached.
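
Starting with pgvector amounts to a few SQL statements. The sketch below collects them in Python. The table name docs matches the earlier example, while the helper name setup_pgvector, the index name, and the 1536-dimension column are assumptions for illustration. The cursor can be anything with an execute method, such as a psycopg2 cursor.

```python
# SQL statements for a minimal pgvector setup. The docs table and its
# columns are illustrative; match the dimension to your embedding model.
SETUP_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS docs (
           id bigserial PRIMARY KEY,
           content text NOT NULL,
           embedding vector(1536)
       )""",
    # HNSW index over cosine distance, matching the <=> operator.
    """CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
           ON docs USING hnsw (embedding vector_cosine_ops)""",
]

def setup_pgvector(cur):
    """Run each setup statement on a DB-API cursor (for example psycopg2)."""
    for statement in SETUP_STATEMENTS:
        cur.execute(statement)
```

If latency or recall drifts as the table grows, the index build options (m, ef_construction) and the hnsw.ef_search query setting are the usual knobs to try before considering a migration.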

Choosing an Embedding Model

The model choice matters more than most people realize because you cannot change it later without re-embedding your entire corpus. The embedding model defines your coordinate space. Documents embedded with model A are incompatible with queries embedded with model B.

Practical decision tree

  • Default choice: OpenAI text-embedding-3-small. 1536 dimensions, excellent quality, $0.02 per million tokens. Hard to beat for most English-language tasks.
  • Higher accuracy: text-embedding-3-large (3072 dims). Roughly 6.5x the cost ($0.13 per million tokens), measurably better on retrieval benchmarks (MTEB). Use when retrieval precision is critical, for example in a medical or legal context.
  • Multilingual or privacy-sensitive: Cohere embed-v3 (multilingual variant) or a locally hosted model like nomic-embed-text via Ollama. Local models eliminate the API call and keep data on-premises.
  • High throughput, cost-sensitive: Cohere embed-v3 supports batching up to 96 inputs per call. For indexing pipelines processing millions of documents, this matters.

Dimension reduction

OpenAI text-embedding-3 models support Matryoshka representation learning, meaning you can truncate the vector to fewer dimensions (say 256 or 512) and trade a small accuracy loss for significantly faster ANN search and smaller storage. At 256 dimensions you cut storage by 6x vs the full 1536. For most applications the accuracy loss is under 5% on MTEB benchmarks and completely worth it.
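
Shortening a Matryoshka-style embedding by hand means keeping the first k values and re-normalizing to unit length, so that cosine and dot-product comparisons still behave. A minimal sketch with a made-up vector follows; the function name is hypothetical. OpenAI's embeddings endpoint also accepts a dimensions parameter that does the shortening server-side, which is usually the simpler route.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # toy 4-dimensional unit vector
short = truncate_embedding(full, 2)  # keep the first 2 dimensions

# Storage scales linearly with dimension count: 1536 -> 256 is a 6x saving.
storage_ratio = 1536 / 256
```

Whatever dimension you pick, the pgvector column has to be declared with that same size, and documents and queries must be truncated identically.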

Retrieval-Augmented Generation: Where Embeddings Do Real Work

The primary production use case for embeddings is RAG (retrieval-augmented generation): grounding an LLM answer in specific documents rather than its training weights. The pipeline is: embed a user query, retrieve the top-k relevant chunks from your store, stuff those chunks into the LLM context window, generate an answer.

What teams get wrong in RAG

The retrieval step is where most RAG systems fail. Common mistakes:

  • Chunking too large or too small. A 3000-token chunk buries the relevant sentence in noise. A 50-token chunk loses context. 300-500 tokens with 50-100 token overlap is a solid starting point. The right size depends on your documents; measure it.
  • Skipping hybrid search. Pure vector search misses exact matches. A product SKU, a person's name, a specific error code: keyword search finds these better. Hybrid search (BM25 + vector, fused with reciprocal rank fusion) consistently outperforms either alone. pgvector combined with Postgres full-text search handles this in a single query.
  • No reranking. Top-k ANN retrieval returns the geometrically closest vectors, not necessarily the most contextually relevant ones. A cross-encoder reranker (Cohere rerank, or a local cross-encoder) re-scores the top 20-50 candidates and dramatically improves final answer quality. This single step can raise answer accuracy by 15-25% in my experience.
  • Evaluating with vibes. You need evals. At minimum: context precision (did you retrieve the right chunks?), context recall (did you miss relevant chunks?), and answer faithfulness (did the LLM hallucinate beyond the context?). RAGAS is a solid open-source framework for these metrics. Run evals on a golden set of 50-100 query/answer pairs before shipping.
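
Reciprocal rank fusion, mentioned under hybrid search above, is small enough to sketch in full. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The document IDs are invented, and k=60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; score = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["sku-123", "doc-a", "doc-b"]  # e.g. from Postgres full-text search
vector_hits = ["doc-a", "doc-c", "sku-123"]   # e.g. from a pgvector query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc-a', 'sku-123', 'doc-c', 'doc-b']
```

Documents that rank well in both lists rise to the top, while a strong hit in only one list (the exact SKU match) still stays near the top instead of being dropped.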

Production Considerations: Guardrails, Observability, Cost

Getting embeddings to work in a notebook takes an afternoon. Getting them to work reliably in production is a different problem. Here is what I track on every deployment.

Observability

Log the full RAG trace: query text, retrieved chunk IDs with their similarity scores, token count sent to the LLM, and the final answer. Without this, debugging a bad answer is guesswork. Tools like LangSmith, Langfuse (open-source, self-hostable), or a simple structured JSON log to your existing stack are all valid. The key is that every inference call is traceable end-to-end.
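
The "simple structured JSON log" option might look like the sketch below. The field names and the build_rag_trace helper are assumptions, not a standard schema; the point is that one record carries everything needed to replay a bad answer.

```python
import json
import time
import uuid

def build_rag_trace(query, retrieved, prompt_tokens, answer):
    """Return one JSON line describing a single RAG call.

    `retrieved` is a list of (chunk_id, similarity_score) pairs.
    """
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {"chunk_id": chunk_id, "score": score}
            for chunk_id, score in retrieved
        ],
        "prompt_tokens": prompt_tokens,
        "answer": answer,
    }
    return json.dumps(record)

line = build_rag_trace(
    "how do I cancel?",
    [("doc-17#3", 0.83), ("doc-02#1", 0.79)],  # made-up chunk IDs and scores
    1240,
    "Go to Billing and choose Cancel.",
)
print(line)
```

Writing one line per call to your existing log pipeline is enough to start; a dedicated tracing tool can come later.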

Guardrails

Input guardrails: check query length (very long queries often indicate prompt injection attempts) and optionally classify intent before retrieval. Output guardrails: detect hallucinations with a cheap second LLM call that checks whether the answer is grounded in the retrieved context. An 'I don't know' response when confidence is low is better than a confident hallucination.

Cost management

Embedding is cheap. Running retrieved context through GPT-4o is not. Cost almost always lives in the generation step, not retrieval. Typical breakdown: embedding a 500-token query costs $0.00001 with text-embedding-3-small. Generating a 600-token answer with gpt-4o-mini costs $0.00036 in output tokens alone, before counting the retrieved context you send as input. At 100k queries/day that is $36/day for generation vs. $1/day for query embedding. Optimize the LLM call first: use a smaller model for straightforward queries, cache frequent queries (semantic caching with a similarity threshold of ~0.95 works well), and trim retrieved context aggressively.
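
The arithmetic behind that breakdown, as a sketch you can adapt. The embedding price is the one quoted earlier in this article; the output-token price is the one implied by $0.00036 for 600 tokens. Both are assumptions that may be out of date, so check current pricing before relying on them.

```python
EMBED_PRICE_PER_M = 0.02   # USD per 1M tokens, text-embedding-3-small (as quoted above)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens, implied by $0.00036 / 600 tokens

def cost_usd(tokens, price_per_million):
    """Cost of `tokens` tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000

queries_per_day = 100_000
embed_per_query = cost_usd(500, EMBED_PRICE_PER_M)      # 500-token query
generate_per_query = cost_usd(600, OUTPUT_PRICE_PER_M)  # 600-token answer

embed_per_day = embed_per_query * queries_per_day
generate_per_day = generate_per_query * queries_per_day
print(round(embed_per_day, 2), round(generate_per_day, 2))  # prints: 1.0 36.0
```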

Security

Embeddings themselves do not contain the original text, but your vector store almost certainly stores the source chunks alongside them. Treat the chunk store with the same access controls as your primary data store. Namespace embeddings by tenant if you have a multi-tenant product: never let one tenant's query retrieve another's chunks.
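
In pgvector, tenant isolation can be an ordinary WHERE clause on the same query that does the vector search. A minimal sketch, assuming a hypothetical tenant_id column on the docs table and a helper name of my own:

```python
def tenant_scoped_search(tenant_id, query_vec, top_k=5):
    """Build a parameterized pgvector query restricted to one tenant.

    Assumes a docs table with tenant_id, content and embedding columns.
    Returns (sql, params), ready for cursor.execute(sql, params).
    """
    sql = (
        "SELECT content FROM docs "
        "WHERE tenant_id = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (tenant_id, query_vec, top_k)

sql, params = tenant_scoped_search("tenant-42", [0.1, 0.2, 0.3])
```

The tenant ID should come from the authenticated session, never from anything in the user's message, and it should be passed as a bound parameter as shown rather than formatted into the SQL string.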

When You Actually Do Need a Dedicated Vector Database

I push back on premature vector-DB adoption, but there are real signals that mean it is time to move off pgvector.

  • Corpus size over 5-10 million rows and P99 ANN latency is degrading past your SLA even after HNSW index tuning. At this scale Qdrant or Weaviate will serve queries in single-digit milliseconds where pgvector starts struggling.
  • Very high QPS with strict latency requirements. If you need to handle thousands of vector queries per second at P99 under 10ms, a purpose-built system with purpose-built memory layout will outperform a general-purpose relational engine.
  • Multi-modal search at scale. Searching across text, images, and structured data simultaneously. Some vector databases have native multi-modal support that is awkward to replicate in Postgres.
  • You need real-time index updates at high write throughput. pgvector's HNSW index is maintained incrementally on every insert, which makes writes comparatively expensive; heavy concurrent writes while querying can degrade latency. Qdrant's on-disk HNSW handles this better.

If none of these apply to you today, pgvector is the correct choice. Add complexity only when you have measured evidence that you need it.

Embeddings in Agent Systems: Tool-Calling and MCP

Beyond RAG, embeddings appear in agent systems as part of memory and tool routing. If you are building an AI agent that calls multiple tools or APIs, you often need to route the user intent to the right tool. Embedding the user query and finding the nearest tool description is a fast, reliable way to do this, especially when you have more than 10-15 tools where stuffing all tool descriptions into context becomes expensive and noisy.

In an MCP (Model Context Protocol) architecture, the same pattern applies: embed all available resource descriptions at startup, then at query time retrieve the top-k most relevant resources before injecting them into the model context. This keeps the context window lean and focused.

A concrete pattern I use:

  • At startup: embed all tool/resource descriptions, store in pgvector or in-memory (NumPy for small sets).
  • At query time: embed the user message, retrieve top 3-5 tools by cosine similarity, inject only those into the system prompt.
  • Threshold: if the best match cosine similarity is below 0.5, return 'I cannot handle this request' rather than hallucinating a tool call.

This fallback at the retrieval threshold, which can also serve as a hand-off point to a human, is cheap and prevents entire categories of agent failures.
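
The routing pattern above can be sketched in a few lines. The tool names and three-dimensional vectors are toy stand-ins for real embedded tool descriptions, and route_tools is a hypothetical helper; in practice the vectors would come from your embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route_tools(query_vec, tool_vecs, top_k=3, min_score=0.5):
    """Return up to top_k tool names, or [] if the best match is below min_score."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), name) for name, vec in tool_vecs.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return []  # caller should answer "I cannot handle this request"
    return [name for _, name in scored[:top_k]]

# Toy vectors standing in for embedded tool descriptions.
tools = {
    "refund_order": [0.9, 0.1, 0.0],
    "track_shipment": [0.1, 0.9, 0.0],
    "reset_password": [0.0, 0.1, 0.9],
}
print(route_tools([0.8, 0.2, 0.1], tools, top_k=2))  # ['refund_order', 'track_shipment']
print(route_tools([-0.5, -0.5, -0.5], tools))        # []
```

Only the names returned here get their descriptions injected into the system prompt; an empty list is the signal to decline rather than guess.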

Frequently Asked Questions

What are embeddings in simple terms?

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Text with similar meaning produces vectors that are close together in space, which lets you find semantically related content without exact keyword matching.

Do I need a vector database for my AI app?

Probably not yet. If your dataset is under a few million rows, pgvector running inside Postgres gives you vector search with full SQL, transactions, and no extra infrastructure. Dedicated vector databases like Pinecone or Qdrant are a scaling tool for high QPS or very large corpora, not a starting point.

pgvector vs. Pinecone: which is better?

pgvector is better for teams that already run Postgres, have under 5 million vectors, and want to avoid operational overhead. Pinecone is better when you need managed infrastructure, are at 50M+ vectors, or need sub-10ms P99 at thousands of QPS. Start with pgvector and migrate only when you have measured evidence you need to.

How do I choose an embedding model?

For most English-language use cases, start with OpenAI text-embedding-3-small. It is cheap, high quality, and widely supported. If you need multilingual support or data privacy, use a local model like nomic-embed-text via Ollama. Never mix models in the same index, and choose deliberately up front, because switching models later requires re-embedding your entire corpus.

What is RAG and how does it use embeddings?

RAG (retrieval-augmented generation) is a pattern where you embed a user query, retrieve the most relevant document chunks from a vector store, and pass those chunks as context to an LLM before generating an answer. Embeddings power the retrieval step, letting the system find relevant content by meaning rather than keywords.

Can embeddings be used for anything other than search?

Yes. Beyond retrieval, embeddings are used for clustering similar documents, deduplication, anomaly detection, semantic caching (cache LLM responses when a new query is nearly identical to a cached one), and routing in agent systems where the user intent needs to be matched to the right tool or workflow.

Work With Someone Who Has Done This in Production

Embeddings and vector search are genuinely useful primitives, but they are also an area where the tooling ecosystem moves fast and the default advice leans toward over-engineering. The teams I work with consistently ship faster when they start simple (pgvector, a good embedding model, hybrid search, real evals) and add complexity only when the metrics demand it.

If you are building an AI system that involves retrieval, agents, or tool-calling and you want a clear-eyed assessment of what your architecture actually needs, I cover this in depth as part of my AI agent development work. You can also read more about my background on the about page or browse past projects. When you are ready to talk through your specific situation, reach out directly.

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
