Zalt Blog

Deep Dives into Code & Architecture

AT SCALE

RAG Explained Simply: How AI Answers From Your Own Documents

By Mahmoud Zalt
Insights
12m read

Most RAG demos work fine on 10 documents and fall apart on 10,000. The failure is almost never the LLM. It is bad chunking, missing reranking, and no evals. Here is the full production picture.

What Is RAG and How Does It Work?

Retrieval-augmented generation (RAG) gives a language model an open-book exam instead of a closed one: before the model writes its answer, a retrieval step pulls the most relevant passages from your documents and injects them into the prompt as context. The model then generates a grounded response based on those passages rather than relying solely on what it memorised during training.

That is the citable definition. Everything else in this article is about why naive implementations break in production and what to do about it.

I am Mahmoud Zalt, an independent senior AI systems architect with 16-plus years of building production software (since 2010). At Sista AI, the company I founded, my agents retrieve and ground their answers against live data every day, so what follows comes from a year of running retrieval in production rather than from a diagram or a tutorial. I now help product teams design and ship AI systems that actually hold up at scale. You can read more about my background on the about page.

The Four Stages of a RAG Pipeline

Every RAG system, regardless of framework or cloud vendor, runs through four conceptual stages: chunk, embed, retrieve, ground. Understanding each stage independently is how you debug failures later.

1. Chunk

You split source documents into passages small enough to fit inside a prompt alongside the question and the generated answer. Common strategies are fixed-size token windows (512 to 1024 tokens with 10-20 percent overlap), recursive character splitting, and semantic splitting that tries to respect paragraph or section boundaries. The overlap prevents a fact that straddles a boundary from disappearing from both chunks.
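The fixed-size strategy with overlap can be sketched in a few lines. This is a minimal illustration, not a production splitter: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace-separated words stand in for model tokens.
words = "one two three four five six seven eight nine ten".split()
windows = chunk_tokens(words, size=4, overlap=1)
for window in windows:
    print(window)
```

Each window repeats the last token of the previous one, which is how a fact that straddles a boundary stays intact in at least one chunk.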

2. Embed

Each chunk is converted to a dense vector by an embedding model. The embedding model maps meaning into a high-dimensional space so that semantically similar text ends up near each other geometrically. Popular choices include OpenAI text-embedding-3-small (1536 dimensions, cheap), Cohere Embed v3 (with input-type flags for query vs. document), and open-weight models like bge-m3 for on-premise setups. The vectors are stored in a vector database: Pinecone, pgvector on Postgres, Weaviate, or Qdrant are all reasonable choices depending on your existing stack.
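To make "near each other geometrically" concrete, here is a small sketch of cosine similarity, the measure most vector stores use by default. The three-dimensional vectors are hand-written, hypothetical stand-ins; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" for three chunks.
refund_policy = [0.9, 0.1, 0.0]
returns_process = [0.8, 0.2, 0.1]
holiday_calendar = [0.0, 0.2, 0.9]

close = cosine_similarity(refund_policy, returns_process)
far = cosine_similarity(refund_policy, holiday_calendar)
print(f"refund vs returns: {close:.3f}")
print(f"refund vs holiday: {far:.3f}")
```

The two refund-related vectors point in almost the same direction, so their similarity is close to 1, while the unrelated one scores near 0.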

3. Retrieve

At query time, the user question is embedded with the same model, and you run an approximate nearest-neighbour search to find the top-k most similar chunks. k is typically 5 to 20. Most teams also add a sparse-retrieval layer (BM25 keyword search, for example through Elasticsearch) and combine the scores, a technique called hybrid retrieval. Sparse retrieval catches exact keyword matches that dense vectors sometimes miss.
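A rough sketch of the retrieve stage follows, under two stated assumptions. First, an exact brute-force search stands in for the approximate nearest-neighbour index a real vector database would use. Second, the dense and sparse results are merged with reciprocal rank fusion, which combines ranks rather than raw scores; it is one common way to build a hybrid ranking, not necessarily the score combination described above. All names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def dense_top_k(query_vec, index, k=5):
    """Exact nearest-neighbour search over {chunk_id: vector}."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


index = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
dense = dense_top_k([1.0, 0.1], index, k=2)
sparse = ["c2", "c1"]  # assumed output of a keyword (BM25) search
fused = reciprocal_rank_fusion([dense, sparse])
print(dense, fused)
```

A chunk that appears in both rankings (here `c1`) rises to the top, while chunks found by only one retriever still make it into the fused list.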

4. Ground

The retrieved chunks are assembled into the prompt as a context block. A typical prompt template looks like: 'Answer the question using only the context below. If the answer is not in the context, say so.', followed by the context passages, then the question. The model generates its answer constrained to that material. The instruction to say 'I do not know' when the answer is absent is not optional; without it, the model will hallucinate from training memory instead of declining.

Why RAG Demos Fall Apart in Production

I have seen this pattern repeatedly: a team builds a RAG proof of concept over a weekend, it works impressively on a curated 20-document test set, and then they ingest 50,000 internal documents and retrieval quality collapses. The culprits are almost always the same.

Bad Chunking Destroys Recall

The single most common mistake is chunking by a fixed character count without respecting document structure. A 512-token hard cut through the middle of a table, a code block, or a numbered list produces chunks that are individually meaningless. The embedding of a half-table is not semantically useful. The fix is recursive or semantic chunking that respects natural boundaries: paragraphs, headings, code fences, and table rows. For structured documents (PDFs, HTML, Markdown), parse the structure first and chunk within sections, not across them.
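As an illustration of chunking within sections, the sketch below splits Markdown on blank lines, keeps fenced code blocks intact, and starts a new chunk at every heading. It is a simplified example with hypothetical function names; it counts words rather than tokens and does not handle tables or nested structures.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def split_blocks(markdown):
    """Split Markdown into blocks on blank lines, never inside a code fence."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_blocks(blocks, max_words=200):
    """Greedily merge blocks into chunks; a heading always starts a new chunk."""
    chunks, current, count = [], [], 0
    for block in blocks:
        words = len(block.split())
        if current and (block.startswith("#") or count + words > max_words):
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(block)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "\n".join([
    "# Leave policy", "", "Staff get 25 days.", "",
    FENCE, "days = 25", "", "carry = 5", FENCE, "",
    "# Expenses", "", "Keep receipts.",
])
chunks = pack_blocks(split_blocks(doc))
print(len(chunks))
```

The blank line inside the code fence does not split the block, and the second heading forces a new chunk, so no chunk crosses a section boundary.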

Nearest Neighbours Are Not the Same As Relevant

Cosine similarity between vectors measures semantic proximity, not factual relevance to the specific question. A chunk that is generally 'about' the same topic as the question will score well even if it does not answer it. The standard fix is a reranker: a cross-encoder model (Cohere Rerank, bge-reranker-large, or a fine-tuned model) that scores each (question, chunk) pair jointly and re-orders the top-k results. Adding a reranker on top of a mediocre retriever typically improves answer quality more than switching embedding models.

Context Window Stuffing

Passing all top-k chunks into one prompt regardless of their individual relevance dilutes the signal. If chunks 6 through 20 are weakly relevant, they add noise and cost. A reranker plus a relevance score threshold (discard anything below 0.6, for example) keeps the context clean. For long documents, map-reduce patterns work: retrieve per section, summarise each independently, then synthesise.
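The rerank-then-threshold step might look like the sketch below. The scoring function is injected, so a cross-encoder can be plugged in; the `word_overlap` scorer used here is only a toy stand-in, and the 0.6 cutoff is the example value from the paragraph above, which would need calibrating against whatever score scale your reranker produces.

```python
def rerank_and_filter(question, chunks, score_fn, min_score=0.6, top_n=5):
    """Score each (question, chunk) pair, drop weak ones, return the best first."""
    scored = [(chunk, score_fn(question, chunk)) for chunk in chunks]
    kept = [pair for pair in scored if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


def word_overlap(question, chunk):
    """Toy relevance score: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


candidates = [
    "vacation rota for the office",
    "policy on vacation carry over",
    "the vacation days policy grants 25 days",
]
context = rerank_and_filter("vacation days policy", candidates, word_overlap)
for chunk, score in context:
    print(f"{score:.2f}  {chunk}")
```

Only the two chunks above the threshold reach the prompt, ordered by score, so weakly related passages never dilute the context.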

No Grounding Instruction

If you do not explicitly instruct the model to answer only from the provided context, it will blend retrieval with parametric memory. The answer may be factually correct but not sourced from your documents, which makes it unauditable and introduces confidently-stated hallucinations when the documents contradict the model's training data.

Worked Example: Internal Policy Q and A

Here is a concrete end-to-end sketch for a company that wants employees to query 200 internal HR policy PDFs.

# Sketch only: semantic_splitter, embed_model, vector_db, reranker, llm and
# format_chunks are placeholders for whichever libraries you use.

# Indexing (run once per document update, for each PDF)
chunks = semantic_splitter.split(pdf_text, max_tokens=800, overlap=80)
for chunk in chunks:
    # Documents and queries are embedded with different input types.
    vector = embed_model.embed(chunk.text, input_type='document')
    vector_db.upsert(id=chunk.id, vector=vector, metadata={
        'text': chunk.text,  # keep the passage so it can be placed in the prompt
        'source': chunk.source_file,
        'section': chunk.section_heading,
        'page': chunk.page_number,
    })

# Query (runs per user question)
query_vector = embed_model.embed(user_question, input_type='query')
candidates = vector_db.query(query_vector, top_k=20)  # wide net first
reranked = reranker.rerank(user_question, candidates, top_n=5)  # keep the best 5

# Grounding prompt: restrict the model to the excerpts and require citations.
prompt = f'''
Answer the question using only the HR policy excerpts below.
If the policy does not address the question, say: 'This is not covered in current policy.'
Cite the source document and section for each claim.

Policy excerpts:
{format_chunks(reranked)}

Question: {user_question}
'''
answer = llm.generate(prompt)

Notice the metadata stored alongside the vector: source file, section heading, page number. This lets you render citations in the UI so employees can verify the answer themselves. That citation layer is not a nice-to-have; it is what makes the system auditable and trustworthy enough for HR use.

The input_type distinction matters for models like Cohere Embed: documents and queries are embedded differently to improve retrieval precision. Skipping this flag and embedding both with the same mode degrades recall by 5-15 percent in typical benchmarks.

Evals and Observability: How You Know It Is Working

A RAG system without evals is a system you cannot improve. The three metrics that matter most in production are retrieval recall, answer faithfulness, and answer relevance.

| Metric | What it measures | How to measure | Target |
| --- | --- | --- | --- |
| Retrieval recall | Were the ground-truth passages retrieved in the top-k? | Labelled QA set; check whether the correct chunk appears in the results | > 0.80 |
| Answer faithfulness | Is every claim in the answer supported by the retrieved context? | LLM-as-judge: present answer plus context, ask for unsupported claims | > 0.90 |
| Answer relevance | Does the answer actually address the question asked? | LLM-as-judge or human panel on a sample | > 0.85 |

For tracing, every query should log: the raw question, the top-k chunk IDs and their scores, the reranked order, the final prompt, the model response, and latency at each stage. Tools like LangSmith, Langfuse, and Arize Phoenix all support this trace structure. Without per-stage latency, you cannot tell whether slow responses come from retrieval, reranking, or generation.
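A trace record with per-stage latency can be captured with nothing more than the standard library. The sketch below is a generic illustration, not the API of any of the tools named above; the `QueryTrace` class and its field names are hypothetical.

```python
import json
import time
from contextlib import contextmanager


class QueryTrace:
    """Collects what happened during one query, plus latency per stage."""

    def __init__(self, question):
        self.record = {"question": question, "latency_ms": {}}

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and store the duration under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            self.record["latency_ms"][name] = elapsed

    def to_json(self):
        return json.dumps(self.record)


trace = QueryTrace("How many leave days can I carry over?")
with trace.stage("retrieval"):
    trace.record["candidates"] = [["c7", 0.82], ["c2", 0.71]]  # chunk ID, score
with trace.stage("rerank"):
    trace.record["reranked"] = ["c2", "c7"]
with trace.stage("generation"):
    trace.record["response"] = "Up to five days."
print(trace.to_json())
```

Because each stage is timed separately, a slow response can be attributed to retrieval, reranking, or generation from a single log line.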

Run evals on a golden dataset of 50 to 200 human-labelled question-answer pairs before every deployment. Regression on retrieval recall is the earliest signal that a chunking or embedding change was harmful.
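Retrieval recall over a golden dataset is simple to compute. In this sketch a question counts as a hit when at least one of its labelled chunks appears in the top-k; the `retrieve` callable and the toy data are hypothetical stand-ins for your real retriever and labelled set.

```python
def recall_at_k(golden, retrieve, k=5):
    """Share of questions whose top-k results contain a labelled relevant chunk."""
    if not golden:
        return 0.0
    hits = 0
    for question, relevant_ids in golden:
        retrieved = set(retrieve(question, k))
        if retrieved & set(relevant_ids):
            hits += 1
    return hits / len(golden)


# Toy golden set: (question, IDs of the chunks that answer it).
golden = [("q1", {"c1"}), ("q2", {"c4"}), ("q3", {"c7"}), ("q4", {"c9"})]

# Canned results standing in for a real retriever.
canned = {"q1": ["c1", "c2"], "q2": ["c3", "c4"], "q3": ["c8", "c2"], "q4": ["c9"]}


def fake_retrieve(question, k):
    return canned[question][:k]


recall = recall_at_k(golden, fake_retrieve, k=2)
print(f"recall@2 = {recall:.2f}")
```

Running the same function before and after a chunking or embedding change gives a direct regression signal on the retrieval stage alone.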

Guardrails, Security, and Access Control

RAG introduces a class of security concern that pure LLM chatbots do not have: your retrieval layer can surface documents the querying user should not see.

The practical solution is metadata filtering at query time. Every chunk stored in the vector database should carry the access tier, team, or user group it belongs to. When a user queries, the vector search must include a hard filter on that metadata field. Most vector databases support this as a pre-filter or a post-filter. A pre-filter (applied before the ANN search) is safer but may reduce recall; a post-filter (applied to the results after the search) is faster but can return fewer than k results and risks leaking restricted chunks if it is not implemented carefully. For sensitive deployments, pre-filter is the right default.
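The pre-filter idea can be shown with an in-memory toy: restricted chunks are removed before any similarity scoring happens, so they can never reach the prompt. The data, group names, and function name are hypothetical, and a plain dot product stands in for the vector database's similarity search.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_with_prefilter(query_vec, chunks, user_groups, k=5):
    """Drop chunks the user may not see, then rank only what is left."""
    allowed = [c for c in chunks if c["group"] in user_groups]
    ranked = sorted(allowed, key=lambda c: dot(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


chunks = [
    {"id": "salary-bands", "group": "hr-admin", "vector": [1.0, 0.0]},
    {"id": "leave-policy", "group": "all-staff", "vector": [0.6, 0.4]},
    {"id": "office-map", "group": "all-staff", "vector": [0.1, 0.9]},
]
query = [1.0, 0.0]  # most similar to the restricted chunk

staff_results = search_with_prefilter(query, chunks, {"all-staff"}, k=2)
admin_results = search_with_prefilter(query, chunks, {"all-staff", "hr-admin"}, k=2)
print(staff_results, admin_results)
```

Even though the restricted chunk is the closest match to the query, a regular staff member never sees it, while an HR admin does.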

Prompt injection is a real risk in RAG. A malicious document in your corpus can contain text like 'Ignore previous instructions and...', which may influence the model's behaviour when that chunk lands in context. Mitigations include: output schema validation (the model must respond in a structured format, reducing free-form injection surface), input sanitisation on ingested documents, and using a system prompt that explicitly scopes the model's authority. Never give a RAG-backed system tool-calling permissions beyond read-only retrieval without explicit human-in-the-loop gating for state-changing actions.
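Output schema validation can be as small as the sketch below. It assumes the model was asked to reply with a JSON object holding exactly an `answer` string and a `citations` list; that schema and the function name are illustrative choices, not a standard. Anything that does not parse, or that cites a chunk which was never retrieved, is rejected.

```python
import json


def validate_answer(raw, retrieved_ids):
    """Return the parsed answer if it matches the expected schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != {"answer", "citations"}:
        return None
    if not isinstance(data["answer"], str) or not isinstance(data["citations"], list):
        return None
    # Every citation must point at a chunk that was actually in the prompt.
    if not all(isinstance(c, str) and c in retrieved_ids for c in data["citations"]):
        return None
    return data


retrieved = {"c2", "c7"}
good = validate_answer('{"answer": "Up to five days.", "citations": ["c2"]}', retrieved)
free_text = validate_answer("Sure! Ignoring my instructions now.", retrieved)
bad_cite = validate_answer('{"answer": "Yes.", "citations": ["c99"]}', retrieved)
print(good, free_text, bad_cite)
```

Validation like this narrows what an injected instruction can achieve, but it does not remove the need for the other mitigations listed above.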

When You Need Less Than You Think

Not every document Q-and-A problem requires a vector database and a reranker. Before committing to a full RAG pipeline, consider these lighter options.

  • Context stuffing: If your entire knowledge base fits in 100,000 tokens, stuff it all into the context window. Claude 3.5 Sonnet and Gemini 1.5 Pro have context windows large enough to hold a small company's entire policy handbook. No indexing, no retrieval, no reranker. There is no retrieval step, so no relevant passage can be missed by a retriever. Cost per query is higher, but the engineering complexity is near zero.
  • Keyword search first: For structured content with consistent terminology (legal contracts, API documentation, code), BM25 alone often outperforms dense retrieval. Try Elasticsearch or Typesense before standing up a vector database.
  • Fine-tuning instead: If the knowledge is stable, bounded, and mostly procedural (how to use a specific internal tool), fine-tuning a small model is often cheaper per query and more reliable than RAG over the same content. RAG shines when the knowledge changes frequently or when users need citations.

The honest framing is this: RAG is the right architecture when your documents change faster than you can fine-tune, when you need citations for auditability, or when your corpus is too large for context stuffing. Those three conditions cover most enterprise document Q-and-A use cases, but not all of them.

Advanced Patterns: HyDE, Multi-Hop, and MCP Tool Calls

Once basic RAG is working, these three patterns are worth knowing.

HyDE (Hypothetical Document Embeddings)

Instead of embedding the raw user question, you ask the LLM to generate a hypothetical answer first, then embed that hypothetical answer. The hypothesis is often closer in embedding space to the actual document chunk than the raw question is. This is particularly useful for question-style queries against document-style corpora. The cost is one extra LLM call per query.
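In code, HyDE is a two-step wrapper around the models you already have. The sketch below takes the LLM and the embedding model as injected callables; the prompt wording and the stub functions are hypothetical and exist only so the example runs on its own.

```python
def hyde_query_vector(question, generate, embed):
    """Embed an LLM-written hypothetical answer instead of the raw question."""
    prompt = (
        "Write a short passage that plausibly answers the question.\n"
        f"Question: {question}"
    )
    hypothetical_answer = generate(prompt)  # the one extra LLM call per query
    return embed(hypothetical_answer)


# Stubs standing in for a real LLM and a real embedding model.
def fake_generate(prompt):
    return "Employees may carry over up to five unused leave days."


def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


vector = hyde_query_vector("How many leave days can I carry over?", fake_generate, fake_embed)
print(vector)
```

The resulting vector is then used for the nearest-neighbour search exactly as a normal query vector would be; the hypothetical answer itself is discarded and never shown to the user.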

Multi-Hop Retrieval

Some questions require chaining two retrievals: 'What is the refund policy for products sold through our reseller programme?' may require first retrieving the reseller programme definition, then retrieving the refund policy filtered by programme type. A simple single-shot RAG will fail this. The solution is an agent loop: retrieve, read, identify missing information, retrieve again, then synthesise. LangGraph and similar orchestration frameworks handle this loop with explicit state tracking.
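The retrieve, read, retrieve-again loop can be sketched without any framework. Here `retrieve`, `next_query`, and `synthesise` are injected callables (in practice the last two would be LLM calls); the toy index and its policy text are invented purely for illustration.

```python
def multi_hop_answer(question, retrieve, next_query, synthesise, max_hops=3):
    """Retrieve, check what is still missing, retrieve again, then synthesise."""
    context = []
    query = question
    for _ in range(max_hops):
        context.extend(retrieve(query))
        query = next_query(question, context)  # None means nothing is missing
        if query is None:
            break
    return synthesise(question, context)


# Toy stand-ins so the loop can run end to end.
TOY_INDEX = {
    "reseller refund policy": ["Reseller programme sales fall under contract type B."],
    "refund policy contract type B": ["Contract type B: refunds within 30 days."],
}


def retrieve(query):
    return TOY_INDEX.get(query, [])


def next_query(question, context):
    if not any("refunds within" in passage for passage in context):
        return "refund policy contract type B"
    return None


def synthesise(question, context):
    return " ".join(context)


answer = multi_hop_answer("reseller refund policy", retrieve, next_query, synthesise)
print(answer)
```

The `max_hops` cap matters: without it, a question the corpus cannot answer would keep the loop retrieving indefinitely.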

RAG as an MCP Tool

In modern AI agent architectures, the retrieval pipeline is exposed as a tool via the Model Context Protocol (MCP) or similar. The agent decides when to call the retrieval tool rather than having retrieval always fire. This is the right architecture when the agent has multiple knowledge sources (an internal wiki, a CRM, a code repository) and needs to route queries to the appropriate one. The tool description matters enormously: a well-written tool description that tells the model exactly what kind of questions this retrieval source can answer improves routing accuracy substantially over generic names like 'search'.
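To illustrate the point about tool descriptions, here are two hypothetical tool definitions written as Python dictionaries. The `name`, `description`, and `inputSchema` fields follow the shape of an MCP tool listing as I understand it; treat the exact structure as an assumption and check it against the protocol specification before relying on it.

```python
# A generic definition: the model has little to go on when routing a query.
vague_tool = {
    "name": "search",
    "description": "Search documents.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# A specific definition: says what the source covers and what it does not.
policy_search_tool = {
    "name": "search_hr_policies",
    "description": (
        "Search the internal HR policy handbook (leave, expenses, benefits). "
        "Use for questions about employee entitlements and procedures. "
        "Do not use for customer data or source code questions."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A natural-language question about HR policy.",
            }
        },
        "required": ["query"],
    },
}

for tool in (vague_tool, policy_search_tool):
    print(tool["name"], "-", tool["description"])
```

With several retrieval sources registered, the second style gives the agent an explicit basis for choosing between them.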

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

Fine-tuning bakes knowledge into the model weights permanently. RAG retrieves knowledge at inference time from an external index. Use fine-tuning for stable procedural knowledge and stylistic consistency. Use RAG when documents change frequently, when you need citations, or when the corpus is too large to encode in weights. The two are not mutually exclusive: fine-tuning a model's instruction-following behaviour while using RAG for factual grounding is a common production pattern.

What chunk size is best for RAG?

There is no universal answer. Start with 512 to 800 tokens with 10 to 15 percent overlap and measure retrieval recall on your golden dataset. Shorter chunks (256 tokens) improve precision for narrow factual queries. Longer chunks (1024 tokens or more) preserve more context for complex reasoning questions. The right size depends on your document type and query type, not on a benchmark from someone else's corpus.

How do I stop a RAG system from hallucinating?

Four things together: include an explicit grounding instruction telling the model to answer only from the provided context; use a reranker so only high-confidence chunks reach the prompt; set a faithfulness eval that flags answers with unsupported claims; and monitor retrieval recall so you know when the retriever is failing to find the right chunks. A faithfulness score below 0.85 almost always means either the right chunk was not retrieved or the grounding instruction is too weak.

Do I need a vector database, or can I use Postgres?

pgvector on Postgres handles millions of vectors comfortably and supports hybrid search with full-text indexes alongside vector indexes. For most teams under 10 million chunks, pgvector is the correct starting point because it eliminates a separate operational dependency. Move to a dedicated vector database (Pinecone, Qdrant, Weaviate) when you need sub-10ms p99 latency at high query volume, advanced filtering, or multi-tenancy isolation that pgvector cannot provide cleanly.

How do I evaluate my RAG pipeline before shipping?

Build a golden dataset of 50 to 200 question-answer-source triples, labelled by humans who know the corpus. Run three automated metrics: retrieval recall (is the correct chunk in the top-k), answer faithfulness (is every claim in the answer traceable to the retrieved context), and answer relevance (does the answer address the question). Use an LLM-as-judge prompt for faithfulness and relevance, but always validate the judge's scoring on 20 to 30 samples manually to confirm it matches human judgment on your domain.

Need Help Building a Production RAG System?

Most RAG projects stall not because the technology is hard, but because the gap between a demo and a reliable production system requires production judgment: proper chunking strategy for your document types, evals before you ship, access control from day one, and a reranker that actually improves recall on your specific corpus. That work is the part tutorials skip.

I work with product teams as an independent architect to design and build these systems end to end. If you are moving from prototype to production, or if a RAG pipeline you already have is underperforming, get in touch and we will start with a direct technical review.

Work with me on your AI system architecture

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
