RAG Inside an AI Agent: Giving Agents the Right Context

Most teams bolt RAG onto agents as a static context prefix. That is the wrong architecture. Here is how retrieval should actually work inside a production AI agent.

Mahmoud Zalt

The Short Answer: Retrieval Is a Tool, Not a Prefix

Add RAG to your AI agent by exposing retrieval as a callable tool the agent invokes when it decides it needs external context, not by stuffing a static document blob into the system prompt before every call. That one architectural shift changes your chunking strategy, your eval design, and your cost profile in ways that matter at production scale.

I am Mahmoud Zalt, an independent senior AI systems architect with 16 years of production software experience since 2010. I founded Sista AI, where retrieval sits inside agents that have run in production for the past year, so every RAG tradeoff here is one I have paid for in latency or wrong answers. I now design and build production AI agent systems for product teams and enterprises. If you are integrating retrieval into an agent and want it done right, you can read more about my work on my background page or go directly to my AI Agent Development service page.

Why the Static Prefix Pattern Fails in Agents

The classic RAG pattern, retrieve top-k chunks at request time and prepend them to the prompt, works well for a single-turn QA system. It breaks down inside an agent for three concrete reasons.

  • Context budget waste. A multi-step agent spends 6 to 12 tool calls completing a task. If you inject 4,000 tokens of retrieved context on every call regardless of relevance, you burn context budget on steps that do not need it (a math calculation step, a formatting step, a routing decision). That is 24,000 to 48,000 extra input tokens per task. At the $2.50 per million input tokens used for GPT-4o in the cost section below, that is roughly $0.06 to $0.12 per task, which is real money across millions of agent runs.
  • Retrieval mismatch. The user query that kicks off the agent is often not the right retrieval query for later steps. By turn 3, the agent knows things the user did not say. A static prefix retrieved from the original query is stale.
  • No selective grounding. Some steps need retrieval (policy lookup, product spec check), others do not (date arithmetic, JSON formatting). A prefix-based system cannot make that distinction. The agent calls everything with the same bloated context, which degrades generation quality on simple steps.

The fix is to treat retrieval as a first-class tool in the agent's tool registry, callable on demand with a query the agent constructs itself at the moment it decides it needs grounding.

Wiring Retrieval as an Agent Tool

In practice this means defining a retrieve_context tool (or multiple specialized retrieval tools) in your agent's tool schema, alongside your other tools like run_query, send_email, or call_api. The agent's planner decides when to call it and what query to pass.

Minimal Tool Definition (OpenAI-style JSON Schema)

{
  "name": "retrieve_context",
  "description": "Search the internal knowledge base for facts, policies, or specifications relevant to the current step. Call this before answering any question that requires factual grounding.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "The specific question or topic to search for." },
      "top_k": { "type": "integer", "default": 5 },
      "source_filter": { "type": "string", "enum": ["docs", "policies", "products", "all"], "default": "all" }
    },
    "required": ["query"]
  }
}

A few design notes on that schema. The description is load-bearing: the LLM decides whether to call this tool based almost entirely on it. Be explicit about when to call it. The source_filter lets you route to specialized vector indexes without the agent needing to know the underlying storage architecture. The query is agent-constructed, which is the key advantage: by step 4 the agent can ask 'what is the refund policy for enterprise contracts signed before 2024' rather than just echoing the user's original message.
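On the application side, something has to receive the tool call, apply the schema defaults, and route source_filter to the right index. Here is a minimal sketch of that handler, assuming each index is exposed as a plain search function returning dicts with a score field. The handle_retrieve_context name, the SearchFn type, and the hit shape are illustrative assumptions, not part of any SDK. It also assumes scores are comparable across indexes, which you should verify before merging results this way.

```python
import json
from typing import Callable

# Hypothetical per-domain search function: (query, top_k) -> list of hits.
# Each hit is assumed to be a dict with at least a "score" key.
SearchFn = Callable[[str, int], list[dict]]

ALLOWED_SOURCES = ("docs", "policies", "products", "all")


def handle_retrieve_context(arguments_json: str, indexes: dict[str, SearchFn]) -> list[dict]:
    """Validate the model-supplied arguments and route to the right index."""
    args = json.loads(arguments_json)

    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("retrieve_context requires a non-empty 'query'")

    # Apply the schema defaults here; the model may omit optional fields.
    top_k = int(args.get("top_k", 5))
    source = args.get("source_filter", "all")
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source_filter: {source!r}")

    # "all" fans out to every registered index; otherwise query just one.
    targets = list(indexes) if source == "all" else [source]
    results: list[dict] = []
    for name in targets:
        for hit in indexes[name](query, top_k):
            results.append({**hit, "source": name})

    # Merge by score and keep the best top_k overall.
    results.sort(key=lambda hit: hit["score"], reverse=True)
    return results[:top_k]
```

The agent never sees which index answered unless you choose to return the source field, which keeps the storage layout out of the prompt.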

MCP as the Retrieval Transport Layer

If your team is standardizing on the Model Context Protocol, you can expose your vector store as an MCP server. The agent calls it through the same tool-calling interface and you get a clean separation between agent logic and retrieval infrastructure. I covered the practical tradeoffs of MCP-based tool architectures in detail in other pieces on this blog.

Chunking Strategy Changes When Retrieval Is Agentic

Standard RAG chunking advice (512 tokens, 10% overlap, chunk by paragraph) is optimized for a single retrieval call against a user query. Agentic retrieval has different properties and needs a different chunking approach.

What Changes

Property | Classic RAG | Agentic RAG
Query source | User's original message | Agent-constructed query, mid-task
Query specificity | Often vague, conversational | Usually precise, task-scoped
Retrieval frequency | Once per user turn | 1 to N times per task, as needed
Chunk use | Support a single answer | May feed into further tool calls
Context budget pressure | Moderate | High (many active tool results)

Practical Chunking Rules for Agents

  • Chunk by semantic unit, not token count. A policy paragraph is one chunk. A product spec section is one chunk. If you split mid-clause to hit 512 tokens you will retrieve half an answer and the agent will hallucinate the rest.
  • Include rich metadata at index time. Source document, section heading, date, version, and any domain tags. The agent can pass these as filters. Retrieval with a metadata filter is 3 to 5x more precise than embedding-only retrieval on structured knowledge bases.
  • Prefer smaller, self-contained chunks over larger overlapping ones. Agentic queries are specific. A 200-token chunk that fully answers a precise question beats a 1,000-token chunk that partially answers it and adds noise. Test with your actual agent-generated queries, not your users' raw messages.
  • Build a separate index per knowledge domain. Product specs, legal policies, and support history have very different retrieval semantics. Mixing them into one index forces the embedding model to represent them in the same space, which degrades recall on the less-frequent domain.
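To make the first two rules concrete, here is a minimal sketch of chunking by semantic unit with metadata attached at index time. It assumes paragraphs are separated by blank lines and section headings start with '#'. Real documents will need a parser suited to their format. The Chunk class and chunk_by_section function are illustrative names, not a library API.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_section(document: str, source: str, version: str) -> list[Chunk]:
    """One chunk per paragraph, tagged with the heading it appeared under."""
    chunks: list[Chunk] = []
    heading = ""
    for block in document.split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        # Assumption: a line starting with '#' is a section heading.
        if lines[0].startswith("#"):
            heading = lines[0].lstrip("#").strip()
            lines = lines[1:]
        body = "\n".join(lines).strip()
        if body:
            chunks.append(
                Chunk(
                    text=body,
                    metadata={"source": source, "section": heading, "version": version},
                )
            )
    return chunks
```

Because each chunk carries its section and version, the agent's retrieval tool can filter on those fields instead of relying on embedding similarity alone.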

Evaluating Retrieval in an Agentic Context

This is where most teams get it wrong. They evaluate retrieval in isolation (did we retrieve the right chunks?) and skip evaluation of retrieval inside the agent loop (did the agent decide to retrieve at the right moment, construct a good query, and use the result correctly?). Both matter, but the second one matters more for production quality.

Three Eval Layers You Need

1. Retrieval quality (offline). Build a golden set of 50 to 200 (query, expected chunk IDs) pairs. Measure recall@5 and mean reciprocal rank. Run this on every index change or embedding model upgrade. Target recall@5 above 0.85 before wiring retrieval into an agent.

2. Tool-call decision quality (trace-based). Record agent traces. For each trace, annotate whether the agent called retrieve_context when it should have, skipped it when it should have, and passed a reasonable query. A simple rubric: correct call / correct skip / wrong call / missed call. You want wrong call plus missed call below 10% of steps on your task distribution.

3. Answer faithfulness (LLM-as-judge). For each final answer that used retrieved context, check whether every factual claim in the answer is grounded in the retrieved chunks. I use a 3-point scale: fully grounded / partially grounded / hallucinated. Flag any task where a hallucinated answer reached the user. Target fully grounded above 90% on your priority task types.
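The offline layer (layer 1) is the easiest to automate. Here is a minimal sketch of recall@k and mean reciprocal rank over a golden set, assuming a search function that returns ranked chunk IDs for a query. The function names and the shape of the golden set are illustrative assumptions.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    """Fraction of expected chunk IDs that appear in the top k results."""
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in expected:
            return 1.0 / rank
    return 0.0


def evaluate(
    golden: list[tuple[str, set[str]]],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average recall@k and MRR over (query, expected chunk IDs) pairs."""
    if not golden:
        raise ValueError("golden set is empty")
    recalls, ranks = [], []
    for query, expected in golden:
        retrieved = search(query)
        recalls.append(recall_at_k(retrieved, expected, k))
        ranks.append(reciprocal_rank(retrieved, expected))
    return {"recall_at_k": sum(recalls) / len(golden), "mrr": sum(ranks) / len(golden)}
```

Running this on every index change or embedding upgrade gives you a single number to compare against the 0.85 target.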

A Quick Worked Example

A team I worked with had an enterprise support agent. Retrieval recall@5 was 0.91, well above threshold. But tool-call decision quality (the share of steps labeled correct call or correct skip) was 0.67: the agent was skipping retrieval on pricing questions because the tool description said 'search for facts and policies' and the model assumed it already knew pricing from its training data. Fix: update the description to 'always call this for any pricing, contract, or entitlement question.' Decision quality went to 0.89 in the next eval run without touching the index or embedding model.

Guardrails and Observability for Production Retrieval

Running retrieval inside an agent loop without observability is flying blind. These are the specific things I instrument on every production deployment.

What to Log on Every Retrieval Tool Call

  • The agent-constructed query (not the user message)
  • The top-k chunk IDs and their similarity scores
  • Whether the score crossed your confidence threshold (I use 0.72 cosine similarity as a soft floor for most domains)
  • Time to retrieve in milliseconds
  • The downstream agent step that consumed the result

With these five fields you can answer the core production questions: why did the agent say that, was the retrieval accurate, is latency degrading, and which documents are actually being used versus indexed but never retrieved?
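One way to capture those fields is a small wrapper around the search call. This is a minimal sketch, assuming a search function that returns (chunk_id, score) pairs. The RetrievalLog fields and the logged_search name are illustrative, and the 0.72 default simply reuses the soft floor mentioned above.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class RetrievalLog:
    query: str             # the agent-constructed query, not the user message
    chunk_ids: list[str]   # top-k chunk IDs, in rank order
    scores: list[float]    # similarity score for each chunk
    above_threshold: bool  # did any chunk clear the confidence floor?
    latency_ms: float      # time spent inside the search call
    consumer_step: str     # the agent step that will consume the result


def logged_search(
    search: Callable[[str], list[tuple[str, float]]],
    query: str,
    consumer_step: str,
    threshold: float = 0.72,
) -> tuple[list[tuple[str, float]], RetrievalLog]:
    """Run the search and build a structured log record alongside the hits."""
    start = time.perf_counter()
    hits = search(query)
    latency_ms = (time.perf_counter() - start) * 1000
    scores = [score for _, score in hits]
    record = RetrievalLog(
        query=query,
        chunk_ids=[chunk_id for chunk_id, _ in hits],
        scores=scores,
        above_threshold=any(score >= threshold for score in scores),
        latency_ms=latency_ms,
        consumer_step=consumer_step,
    )
    return hits, record


def to_json_line(record: RetrievalLog) -> str:
    """Serialize the record as one JSON line for whatever log sink you use."""
    return json.dumps(asdict(record))
```

Emitting one JSON line per retrieval call makes it straightforward to join retrieval records with the rest of the agent trace later.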

Retrieval Guardrails

Low-confidence fallback. If all top-k chunks score below your threshold, return an explicit 'no confident match found' signal to the agent rather than returning low-quality chunks. The agent should be prompted to acknowledge uncertainty or escalate rather than hallucinate against weak context.

Source diversity check. If all 5 retrieved chunks come from the same document, surface a warning. It usually means the query is too narrow or the index has a coverage gap.

Human-in-the-loop trigger. For high-stakes agent actions (sending a contract, issuing a refund, modifying account state), require that the retrieval step returned at least one high-confidence chunk before the agent is allowed to proceed. If retrieval confidence is low, route to a human review queue. This one guardrail has prevented the most expensive production incidents I have seen.
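The source diversity check and the human-in-the-loop trigger are both a few lines of code once hits carry a document ID and a score. Here is a minimal sketch. The action names in HIGH_STAKES_ACTIONS, the doc_id and score keys, and the reuse of 0.72 as the high-confidence bar are all assumptions for illustration.

```python
# Hypothetical names for actions that should never run on weak grounding.
HIGH_STAKES_ACTIONS = {"send_contract", "issue_refund", "modify_account"}


def same_document_warning(hits: list[dict]) -> bool:
    """True when several hits were returned and all come from one document."""
    doc_ids = {hit["doc_id"] for hit in hits}
    return len(hits) > 1 and len(doc_ids) == 1


def gate_action(action: str, hits: list[dict], high_confidence: float = 0.72) -> str:
    """Return 'proceed' or 'human_review' for a proposed agent action."""
    if action not in HIGH_STAKES_ACTIONS:
        return "proceed"
    # High-stakes actions need at least one high-confidence chunk behind them.
    if any(hit["score"] >= high_confidence for hit in hits):
        return "proceed"
    return "human_review"
```

Keeping the gate outside the model means a persuasive but ungrounded plan cannot talk its way past it.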

Cost and Latency: What Actually Moves the Numbers

Retrieval adds latency and cost. Here is how to think about each honestly.

Latency

A single vector search on a well-run managed index (Pinecone, Weaviate, pgvector on RDS) returns in 20 to 80ms at the p99 for indexes up to 10 million vectors. That is cheap. The expensive part is what you do with the retrieved chunks: if you pass all 5 chunks (averaging 300 tokens each) back into the LLM context, you are adding 1,500 tokens of input on that tool step. At multiple retrieval calls per task, that adds up both in latency and in cost. The fix is to rerank before injecting: run a fast cross-encoder reranker (Cohere Rerank, a local Sentence Transformers model) and pass only the top 2 or 3 chunks. For most tasks, top-2 after reranking outperforms top-5 without reranking on both quality and cost.
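The rerank-then-truncate step looks roughly like this. It is a minimal sketch with a pluggable scoring function. The token_overlap scorer below is a toy stand-in so the example runs without dependencies. It is not a cross-encoder, and in practice you would pass in a function that calls your actual reranker.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 2,
) -> list[str]:
    """Score every (query, chunk) pair and keep only the best few chunks."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:keep]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query tokens that also appear in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


candidates = [
    "shipping times for standard orders",
    "refund policy for enterprise contracts",
    "enterprise pricing tiers",
]
top = rerank("enterprise refund policy", candidates, token_overlap, keep=2)
```

Only the chunks in top go back into the LLM context, which is where the token savings come from.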

Cost

Agentic RAG costs come from three places: embedding new documents (one-time, cheap), vector queries at runtime (very cheap per query, pennies per thousand), and the LLM tokens consumed by the retrieved context (the real cost driver). On a GPT-4o task that averages 8 agent steps and 3 retrieval calls, each injecting 600 tokens, you add roughly 1,800 input tokens per task. At $2.50 per million input tokens that is $0.0045 per task. Across 100,000 tasks per month that is $450 per month from retrieval context alone. Know that number for your system before you scale.
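The arithmetic is easy to rerun with your own numbers. Here is a minimal sketch, assuming each retrieval call's tokens are billed once as input. The function and parameter names are illustrative, not part of any pricing API. In a tool-calling loop where earlier tool results stay in the message history and are re-sent on later steps, the real figure will be higher, so treat the result as a lower bound.

```python
def monthly_retrieval_cost(
    calls_per_task: int,
    tokens_per_call: int,
    price_per_million_usd: float,
    tasks_per_month: int,
) -> dict[str, float]:
    """Estimate the input-token cost added by retrieved context."""
    tokens_per_task = calls_per_task * tokens_per_call
    tokens_per_month = tokens_per_task * tasks_per_month
    return {
        "tokens_per_task": tokens_per_task,
        "cost_per_task_usd": tokens_per_task * price_per_million_usd / 1_000_000,
        "cost_per_month_usd": tokens_per_month * price_per_million_usd / 1_000_000,
    }


# Inputs taken from the example in the text: 3 retrieval calls per task,
# 600 tokens per call, $2.50 per million input tokens, 100,000 tasks per month.
estimate = monthly_retrieval_cost(
    calls_per_task=3,
    tokens_per_call=600,
    price_per_million_usd=2.50,
    tasks_per_month=100_000,
)
print(estimate)
```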

What Teams Get Wrong: The Five Most Common Agentic RAG Mistakes

  • Using the user message as the retrieval query. The user said 'help me with my account.' The agent knows by step 3 that the user has an enterprise contract expiring next month and wants to discuss renewal pricing. Retrieve against that specific context, not the original vague message.
  • One giant mixed index. Mixing product docs, support tickets, legal contracts, and internal runbooks into one embedding index is a recall disaster. Build domain-specific indexes and let the agent pick the right one via the source_filter parameter or by calling domain-specific retrieval tools.
  • Skipping the reranker. Embedding similarity is a first-pass filter, not a precision ranking. A cross-encoder reranker that sees the full query and the full chunk together will outperform top-k cosine similarity by 15 to 30 percentage points on faithfulness evals. It is a 50ms addition that pays for itself.
  • No citation in the agent output. When an agent makes a factual claim, it should reference which chunk it retrieved. This is not cosmetic: it is the only way users and operators can audit and trust the output. Build citation into your agent output schema from day one.
  • Evaluating retrieval offline only. Recall@5 on your golden set is necessary but not sufficient. The agent's ability to construct a good retrieval query and use the result correctly is a separate skill that only shows up in end-to-end traces. Evaluate both layers separately.
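For the citation point in the list above, one option is to make citations part of the output schema and reject answers that cite nothing or cite chunks the agent never retrieved. Here is a minimal sketch. The Claim structure and the unsupported_claims helper are illustrative assumptions, not an established format.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str             # one factual statement from the agent's answer
    chunk_ids: list[str]  # IDs of the retrieved chunks that support it


def unsupported_claims(claims: list[Claim], retrieved_ids: set[str]) -> list[Claim]:
    """Claims with no citation, or citing chunk IDs that were never retrieved."""
    flagged = []
    for claim in claims:
        cited = set(claim.chunk_ids)
        if not cited or not cited <= retrieved_ids:
            flagged.append(claim)
    return flagged
```

This only checks that citations exist and point at real retrievals. Whether the cited chunk actually supports the claim is still the job of the faithfulness eval.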

Frequently Asked Questions

What is the difference between RAG and agentic RAG?

Standard RAG retrieves once per user turn and prepends the chunks to the prompt. Agentic RAG treats retrieval as a tool the agent calls on demand, potentially multiple times per task, with queries the agent constructs based on its evolving understanding of the task. The agent decides when to retrieve, what to search for, and how to use the result, rather than having retrieval happen automatically on every call.

Which vector database should I use for an AI agent knowledge base?

For most teams starting out: pgvector on your existing Postgres instance if you are already running Postgres, or Pinecone Serverless if you want zero infrastructure management. I only reach for dedicated vector databases like Weaviate or Qdrant when I need multi-tenancy, hybrid search (BM25 plus dense), or the ability to store and query structured metadata alongside vectors at scale. Do not over-engineer the vector store before you have validated your chunking and retrieval quality.

How do I prevent my agent from hallucinating when retrieval returns nothing useful?

Return an explicit low-confidence signal to the agent when all chunks score below your similarity threshold. In the system prompt, instruct the agent: 'If the retrieve_context tool returns no confident match, say so explicitly and do not proceed with actions that require factual grounding.' Pair this with a human-in-the-loop guardrail on high-stakes steps so that low-confidence retrieval triggers escalation rather than a confident but wrong answer.
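Here is a minimal sketch of that signal, assuming each hit is a dict with a score field and reusing 0.72 as the floor. The status values and the apply_confidence_floor name are illustrative, and the message text simply echoes the system prompt instruction above.

```python
def apply_confidence_floor(hits: list[dict], threshold: float = 0.72) -> dict:
    """Return the hits unchanged, or an explicit no-match signal if none clears the floor."""
    if any(hit["score"] >= threshold for hit in hits):
        return {"status": "ok", "chunks": hits}
    return {
        "status": "no_confident_match",
        "chunks": [],
        "message": (
            "No confident match found. Say so explicitly and do not proceed "
            "with actions that require factual grounding."
        ),
    }
```

Returning an empty chunk list rather than weak chunks removes the temptation for the model to build an answer on them.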

What chunk size should I use for an agent knowledge base?

Chunk by semantic unit, not by fixed token count. A policy paragraph, a product spec subsection, a procedure step: these are natural chunk boundaries. If you must use a token budget, 200 to 400 tokens per chunk works better for agentic retrieval than the standard 512-token advice, because agent-constructed queries are precise and a smaller, fully-answering chunk beats a larger, noisier one. Always test with real agent-generated queries, not your users' raw messages.

How do I evaluate whether my agent is calling retrieval at the right times?

Record full agent traces and annotate each step as: correct call, correct skip, wrong call (retrieved when not needed), or missed call (should have retrieved but did not). A small human-annotated set of 100 to 200 traces gives you a reliable decision-quality metric. Target wrong call plus missed call below 10% before going to production. If missed calls are high, improve the tool description. If wrong calls are high, add negative examples to the description or tighten the agent's system prompt.
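Once the annotations exist, the metric itself is a simple count. Here is a minimal sketch, assuming one label per annotated step using the four categories above. The label strings and the function name are illustrative.

```python
from collections import Counter

LABELS = {"correct_call", "correct_skip", "wrong_call", "missed_call"}


def decision_error_rate(step_labels: list[str]) -> float:
    """Share of annotated steps labeled wrong_call or missed_call."""
    if not step_labels:
        raise ValueError("no annotated steps")
    unknown = set(step_labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(step_labels)
    return (counts["wrong_call"] + counts["missed_call"]) / len(step_labels)


def breakdown(step_labels: list[str]) -> dict[str, int]:
    """Per-label counts, useful for telling missed calls from wrong calls."""
    counts = Counter(step_labels)
    return {label: counts[label] for label in sorted(LABELS)}
```

The breakdown matters as much as the rate, because missed calls and wrong calls point to different fixes.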

Can I use RAG with open-source models in an AI agent?

Yes. The retrieval-as-tool pattern works with any model that supports function calling or tool use: Mistral, Llama 3.1, Qwen2.5, Command R+. The main difference is that smaller open-source models are less reliable at deciding when to call retrieval tools and at constructing precise retrieval queries. You may need to be more prescriptive in your system prompt and do more eval work on tool-call decision quality. For production agents handling complex tasks, I typically use a frontier model for the planner step and can route simpler steps to a smaller model.

Build Retrieval-Augmented Agents That Actually Work in Production

Retrieval inside an agent is not a feature you bolt on. It is an architectural decision that touches your chunking strategy, your tool schema, your eval pipeline, your observability layer, and your cost model. Get the architecture right from the start and you end up with an agent that grounds itself precisely when it needs to, stays fast and cheap on steps that do not need retrieval, and gives you the traceability to audit every factual claim it makes.

I design and build these systems for product teams and enterprises. If you are building an AI agent and want retrieval done right the first time, see how I work on the about page or review past work on the projects page. When you are ready to talk, reach out via the contact page.

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
