What Are the Ongoing Running Costs of an AI Agent?
The ongoing running cost of an AI agent is dominated by per-request token spend, not infrastructure. A production agent handling 10,000 requests per day can easily cost $500 to $10,000 per month in LLM API fees alone, depending on model choice, prompt design, and how many tool calls each request triggers. Compute, storage, and retrieval are real but secondary.
I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software since 2010. I founded Sista AI, and the line between build cost and run cost is something I live with daily, paying the monthly bill to keep a workforce of autonomous agents running in production. I now design and build AI agent systems for product teams as a solo independent consultant. You can read more about me here or go straight to my AI Agent Development service page if you are already scoping a project.
The Build vs. Run Trap Nobody Warns You About
Engineering teams budget the build. They estimate developer time, infrastructure setup, and integration work. What they do not budget is the operational token spend that starts the moment the agent goes live and scales with every user interaction.
This is not a small rounding error. I have seen projects where the annual run cost exceeded the build cost within six months of launch. The problem is structural: most teams treat the LLM as a fixed dependency, like a database, when it is actually a variable-cost compute layer billed by the token, by the call, and by the second.
The pattern is always the same. The prototype works beautifully on a small model with short prompts. Then the team adds context, adds tools, adds memory, adds retries, and suddenly the production agent is sending 4,000-token prompts, calling three tools per request, and routing through a frontier model that costs 15x what the prototype used. Nobody modeled it because the prototype was cheap.
The fix is to build a cost model before you finalize the architecture, not after.
The Per-Request Cost Model
Every agent request has a predictable cost structure. Break it into four components and you can estimate your monthly bill before writing a line of production code.
Component 1: LLM Token Spend
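This is your biggest lever. Input and output tokens are priced separately, so token cost = (input tokens x input price + output tokens x output price) / 1,000,000, with both prices quoted per million tokens for your chosen model. As of mid-2025, rough tiers look like this: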
| Model Tier | Input (per 1M tokens) | Output (per 1M tokens) | Typical Use |
|---|---|---|---|
| Frontier (GPT-4o, Claude Opus) | $5 to $15 | $15 to $75 | Complex reasoning, ambiguous tasks |
| Mid-tier (Claude Sonnet, GPT-4o-mini) | $0.15 to $3 | $0.60 to $15 | Most production agents |
| Small/fast (Haiku, GPT-4.1-mini) | $0.08 to $0.80 | $0.30 to $4 | High-volume classification, routing |
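Because input and output tokens are priced separately, the per-call arithmetic can be sketched as a small helper. This is a minimal illustration, not a billing tool: the function name `llm_call_cost` and the example token counts are hypothetical, and the prices are the rough mid-tier figures from the table above.

```python
def llm_call_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one LLM call; both prices are quoted per 1M tokens."""
    return (
        input_tokens * input_price_per_m
        + output_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical call: 2,000 input and 300 output tokens at $3 / $15 per 1M.
example_cost = llm_call_cost(2_000, 300, 3.00, 15.00)
print(f"${example_cost:.4f} per call")
```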
Component 2: Tool-Call Multiplier
Every tool call an agent makes is a separate LLM inference round. An agent that makes three tool calls per user request is running four LLM inferences: one to decide what to do, three to execute and observe. Your cost multiplier is roughly equal to your average tool-call depth plus one. A deep research agent with five tool calls per request is running at 6x the base token cost of a simple Q&A agent.
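The multiplier described above can be written down in a couple of lines. This is a rough sketch that assumes every inference in a request costs about the same; in practice tool results accumulate in the context, so treat the result as a lower bound. The function names and the base cost are illustrative, not taken from any library.

```python
def inference_count(tool_calls: int) -> int:
    """One planning inference plus one inference per tool call."""
    return tool_calls + 1


def request_cost(base_call_cost: float, tool_calls: int) -> float:
    """Rough per-request cost, assuming each inference costs base_call_cost."""
    return base_call_cost * inference_count(tool_calls)


# Hypothetical base cost of $0.011 per inference.
simple_qa = request_cost(0.011, tool_calls=0)
deep_research = request_cost(0.011, tool_calls=5)
print(f"Q&A: ${simple_qa:.3f}, research: ${deep_research:.3f}")
```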
Component 3: Retrieval and Memory
Vector search (Pinecone, pgvector, Weaviate) is cheap per query, typically $0.002 to $0.01, but the retrieved chunks feed back into your context window and inflate your input token count. A retrieval step that returns 2,000 tokens of context per request adds real cost at scale. Persistent memory stored and retrieved per session compounds this further.
Component 4: Infrastructure and Observability
This is the smallest line item but the one most teams forget to include: API gateway, compute (Lambda or container), logging, tracing (LangSmith, Langfuse, Helicone), and any human-in-the-loop queuing infrastructure. Budget $50 to $300 per month for a mid-scale deployment. It is not your main cost, but it is not zero.
Worked Example: A Customer Support Agent at 10,000 Requests per Day
Let me show you how to apply this model to a concrete scenario. Suppose you are building a customer support agent with the following profile:
- Average user message: 120 tokens
- System prompt plus context: 800 tokens
- RAG retrieval: 1,500 tokens per request
- Average tool calls: 2 (order lookup, knowledge base search)
- Average output: 250 tokens per LLM call
- Model: Claude Sonnet at $3 input / $15 output per 1M tokens
Per request breakdown:
Each of the 3 LLM calls (1 main + 2 tool calls) receives roughly 2,420 input tokens (120 + 800 + 1,500) and produces 250 output tokens. This is a simplification, since tool results add to the context on later calls. Total per request: [(3 x 2,420 x $3) + (3 x 250 x $15)] / 1,000,000 = $0.022 + $0.011 = $0.033 per request.
At 10,000 requests per day: $330/day, roughly $10,000/month. That is before infra and observability.
The same agent on Claude Haiku ($0.80 input / $4 output per 1M, estimates): roughly $0.009 per request, or about $2,600/month. Model selection alone is a nearly 4x cost lever. That is the kind of decision you want to make with real numbers before you pick your stack.
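The worked example is easy to turn into a script you can rerun with your own numbers. The sketch below recomputes the figures without intermediate rounding, so results may differ slightly from the rounded prose. The 30-day month and all constant and function names are assumptions made for illustration.

```python
# Request profile from the worked example.
USER_TOKENS = 120
SYSTEM_TOKENS = 800
RAG_TOKENS = 1_500
TOOL_CALLS = 2
OUTPUT_TOKENS = 250
REQUESTS_PER_DAY = 10_000
DAYS_PER_MONTH = 30  # assumption


def per_request_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, assuming every LLM call sees the same input size."""
    llm_calls = TOOL_CALLS + 1
    input_tokens = USER_TOKENS + SYSTEM_TOKENS + RAG_TOKENS
    per_call = (
        input_tokens * input_price_per_m + OUTPUT_TOKENS * output_price_per_m
    ) / 1_000_000
    return llm_calls * per_call


def monthly_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Monthly LLM spend at the profile's request volume."""
    per_request = per_request_cost(input_price_per_m, output_price_per_m)
    return per_request * REQUESTS_PER_DAY * DAYS_PER_MONTH


for name, (in_price, out_price) in {
    "Sonnet": (3.00, 15.00),
    "Haiku": (0.80, 4.00),
}.items():
    print(
        f"{name}: ${per_request_cost(in_price, out_price):.4f}/request, "
        f"${monthly_cost(in_price, out_price):,.0f}/month"
    )
```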
The Six Levers That Actually Cut Your Token Bill
Teams reach for 'use a smaller model' as the first and only lever. It helps, but it is one of six. Using all six together typically cuts costs by 60 to 85 percent without degrading output quality, if you apply them correctly.
1. Prompt Compression
Audit your system prompt. I routinely see 1,500-token system prompts that do the same job as a 400-token prompt after editing. Every token you remove from the system prompt saves money on every single request. This is the highest-ROI optimization in most codebases.
2. Model Routing
Not every step in your agent needs the smartest model. Use a small, fast model to classify intent and route. Use a mid-tier model for the main reasoning step. Reserve frontier models for genuinely ambiguous or high-stakes decisions. A routing layer that costs $0.001 per request can cut your average inference cost by 40 percent.
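A routing layer can be very small. The sketch below uses a keyword heuristic as a stand-in for the small classifier model, and the model names, keywords, and word-count threshold are placeholders I made up for illustration; a real router would call a cheap model and be tuned against your eval suite.

```python
# Placeholder model identifiers, not real API model names.
MODEL_BY_TIER = {
    "simple": "small-model",
    "standard": "mid-tier-model",
    "complex": "frontier-model",
}

# Hypothetical high-stakes topics that justify the most capable model.
HIGH_STAKES_KEYWORDS = ("refund", "dispute", "legal")


def classify_intent(message: str) -> str:
    """Stand-in for a small-model classifier; a keyword heuristic for illustration."""
    text = message.lower()
    if any(keyword in text for keyword in HIGH_STAKES_KEYWORDS):
        return "complex"
    if len(text.split()) <= 8:
        return "simple"
    return "standard"


def route(message: str) -> str:
    """Pick the cheapest tier the classifier considers sufficient."""
    return MODEL_BY_TIER[classify_intent(message)]


print(route("Where is my order?"))
print(route("I want to dispute this charge"))
```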
3. Caching
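Both Anthropic and OpenAI offer prompt caching for repeated context blocks (system prompts, static documents, long tool definitions). If your static prefix is 2,000 tokens and you are running 10,000 requests per day, caching that prefix typically cuts input costs by 50 to 90 percent on the cached portion. Two caveats: providers only cache prefixes above a minimum length (around 1,024 tokens at the time of writing), and some bill cache writes at a small premium. Even so, this is close to free money. Enable it first.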
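To see what caching is worth before enabling it, you can estimate the daily cost of resending a static prefix with and without cache hits. This is a back-of-the-envelope sketch: the 2,000-token prefix, the 95 percent hit rate, and the 90 percent discount are assumed inputs, and it ignores any premium a provider charges for cache writes.

```python
def daily_prefix_cost(
    prefix_tokens: int,
    requests_per_day: int,
    input_price_per_m: float,
    cache_hit_rate: float = 0.0,
    cached_discount: float = 0.0,
) -> float:
    """Daily cost of a static prefix; cache hits are billed at (1 - cached_discount)."""
    full_price = prefix_tokens * requests_per_day * input_price_per_m / 1_000_000
    return full_price * (1 - cache_hit_rate * cached_discount)


# Hypothetical: 2,000-token prefix, 10,000 requests/day, $3 per 1M input tokens.
uncached = daily_prefix_cost(2_000, 10_000, 3.00)
cached = daily_prefix_cost(2_000, 10_000, 3.00, cache_hit_rate=0.95, cached_discount=0.90)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
```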
4. Tool-Call Depth Control
Set a hard maximum on tool-call iterations per request. An agent without a ceiling can spiral into 10+ calls on a complex task. Four is usually the right production ceiling for most use cases. Above that, either the task needs a different architecture (multi-agent with human-in-the-loop) or the agent is confused and about to produce garbage anyway.
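A hard ceiling is simplest to enforce in the agent loop itself. In this sketch, `next_action` and `execute_tool` are hypothetical callables standing in for your model call and tool dispatcher; the loop stops either when the model asks for no more tools or when it requests one beyond the ceiling, at which point the request is escalated instead of continuing.

```python
from typing import Callable, Optional

MAX_TOOL_CALLS = 4  # ceiling suggested in the text; tune per use case


def run_agent(
    next_action: Callable[[list], Optional[str]],
    execute_tool: Callable[[str], str],
    max_tool_calls: int = MAX_TOOL_CALLS,
) -> dict:
    """Run tool calls until the model is done or the ceiling is reached."""
    observations: list = []
    while True:
        # Hypothetical model call: returns a tool name, or None when finished.
        tool = next_action(observations)
        if tool is None:
            return {"status": "done", "tool_calls": len(observations)}
        if len(observations) >= max_tool_calls:
            # The model wants another call beyond the ceiling: stop and escalate.
            return {"status": "escalate", "tool_calls": len(observations)}
        observations.append(execute_tool(tool))


# A confused agent that never stops asking for tools gets cut off at the ceiling.
runaway = run_agent(lambda obs: "search", lambda tool: "result")
print(runaway)
```

Returning an explicit `escalate` status, rather than silently truncating, gives you a hook for handing the request to a human or a different architecture.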
5. Context Window Management
Do not dump the entire conversation history into every request; unbounded history is the single biggest source of token bloat in long-running agents. Use sliding window summarization: keep the last two to three turns verbatim and summarize older turns into a 200-token digest. For multi-session agents, store summaries in a memory layer and retrieve only what is relevant to the current intent.
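Sliding window summarization comes down to splitting the history in two. The sketch below assumes a `summarize` callable that you supply (in practice an LLM call capped at around 200 output tokens); here it is left as a parameter, and the function name and summary prefix are illustrative.

```python
from typing import Callable


def build_context(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    keep_last: int = 3,
) -> list[str]:
    """Keep the most recent turns verbatim and fold older ones into one digest."""
    if len(turns) <= keep_last:
        return list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    digest = f"Summary of earlier conversation: {summarize(older)}"
    return [digest] + recent


# Stand-in summarizer for demonstration; a real one would call a small model.
history = ["turn 1", "turn 2", "turn 3", "turn 4", "turn 5"]
context = build_context(history, summarize=lambda older: f"{len(older)} earlier turns")
print(context)
```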
6. Output Length Constraints
Instruct your model to be concise in the system prompt, and set max_tokens to a realistic ceiling for your use case. Output tokens are typically three to five times more expensive than input tokens on frontier models. A prompt that says 'respond in 2-3 sentences' is a cost optimization, not just a UX choice.
What Teams Get Wrong When They Budget AI Agents
After reviewing a lot of agent architectures, the mistakes cluster around a small set of patterns. Here are the ones that actually sink projects.
Prototyping on GPT-4o and forgetting to re-evaluate
You pick the frontier model to get the prototype working quickly. It works well. You push to production without re-running your eval suite on a mid-tier model. Six months later you are paying frontier prices for tasks that Sonnet handles just as well. Always re-run evals on cheaper models before finalizing your production stack.
No cost observability from day one
If you cannot see your token spend broken down by agent step, by user segment, and by time of day, you cannot optimize it. Set up Langfuse, Helicone, or LangSmith on day one. The cheapest problems to fix are the ones you can see early.
Treating agent retries as free
When an LLM call fails or returns a malformed response, your retry logic runs the full inference again. A 2 percent error rate with up to three retries means the affected requests can cost up to 4x the base price (the original attempt plus three retries). Guard your tool outputs with schemas (Pydantic, Zod), validate before retrying, and track retry rates as a cost signal, not just a reliability signal.
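It helps to separate the average cost of retries from the worst case. The sketch below assumes each attempt fails independently with the same probability, which is a simplification (malformed outputs often repeat on the same input), and the function names are illustrative.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average inferences per call when each attempt fails independently."""
    # Attempt k+1 only runs if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


def worst_case_attempts(max_retries: int) -> int:
    """The original attempt plus every allowed retry."""
    return max_retries + 1


average = expected_attempts(failure_rate=0.02, max_retries=3)
print(f"average multiplier: {average:.4f}, worst case: {worst_case_attempts(3)}x")
```

Under these assumptions the average overhead is small; the budget risk comes from the tail and from error rates that climb after a prompt or model change, which is why retry rate belongs on the cost dashboard.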
Ignoring context bloat in retrieval pipelines
RAG pipelines that return too many chunks, or that retrieve on every turn regardless of whether retrieval is actually needed, inflate input tokens silently. Build a retrieval gate: only call the vector store when a lightweight classifier determines the query requires external knowledge. This one change typically reduces retrieval calls by 30 to 60 percent on conversational agents.
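A retrieval gate is a conditional in front of the vector store call. In this sketch the classifier is replaced by a small-talk lookup purely for illustration, and `search` is a hypothetical callable wrapping whatever vector store you use; none of these names come from a real library.

```python
from typing import Callable

# Illustrative list of messages that never need external knowledge.
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "bye"}


def needs_retrieval(message: str) -> bool:
    """Stand-in for a classifier: skip retrieval for obvious small talk."""
    normalized = message.lower().strip(" .!?")
    return normalized not in SMALL_TALK


def retrieve_if_needed(message: str, search: Callable[[str], list[str]]) -> list[str]:
    """Call the vector store only when the gate says the query needs it."""
    if not needs_retrieval(message):
        return []
    return search(message)


def fake_search(query: str) -> list[str]:
    """Placeholder for a real vector store query."""
    return [f"chunk relevant to: {query}"]


print(retrieve_if_needed("Thanks!", fake_search))
print(retrieve_if_needed("How do I reset my password?", fake_search))
```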
Security, Guardrails, and the Costs They Add
Guardrails have a cost, and that cost is worth paying, but you should model it explicitly. A content moderation call on every user input adds latency and a small per-call fee (typically $0.001 to $0.003 using a small classifier model). A human-in-the-loop queue for high-risk actions adds infrastructure cost and latency. Neither is optional in production, but both need to appear in your cost model.
The guardrails I consider non-negotiable for production agents:
- Input validation: schema-check and sanitize every user message before it touches your prompt template. Prompt injection is a real attack vector.
- Output validation: parse and validate structured outputs before acting on them. An agent that executes a malformed tool call in production is a security incident, not just a bug.
- Tool permission scoping: each tool should have the minimum permissions it needs. An agent with read-only database access cannot corrupt or delete data even if the LLM is manipulated, and limiting which tables it can read limits what it can exfiltrate.
- Rate limiting per user: prevent cost amplification attacks where a single user drives unbounded token spend.
- Human-in-the-loop gates for irreversible actions: anything that sends an email, charges a card, deletes a record, or calls an external API with side effects should require a confirmation step with a timeout. This is architecture, not just policy.
Model these costs as a fixed overhead per request: roughly $0.003 to $0.008 depending on how many guardrail layers you run. It is small but it changes your break-even math.
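Of the guardrails listed above, per-user rate limiting is the one most directly tied to cost, and a token budget is one simple way to implement it. This is an in-memory sketch with a hypothetical `TokenBudget` class; a production version would persist counters in a shared store and reset them on a daily schedule, both of which are omitted here.

```python
from collections import defaultdict


class TokenBudget:
    """Caps how many tokens each user can spend per day (reset logic omitted)."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self.used: defaultdict = defaultdict(int)

    def try_spend(self, user_id: str, tokens: int) -> bool:
        """Record the spend and return True, or return False if it would exceed the cap."""
        if self.used[user_id] + tokens > self.daily_limit:
            return False
        self.used[user_id] += tokens
        return True


budget = TokenBudget(daily_limit=10_000)
print(budget.try_spend("user-1", 6_000))  # within budget
print(budget.try_spend("user-1", 5_000))  # would exceed the cap, rejected
```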
Evals and Observability as Cost Control
The teams that control their run costs long-term are the ones who treat evals as an ongoing engineering practice, not a one-time pre-launch check. Here is the operational setup I recommend:
Build a regression eval suite of 50 to 200 representative inputs with expected outputs scored on a rubric (correctness, format compliance, tool-call count, output length). Run this suite against every model and prompt change before deploying. This is what lets you safely downgrade models or compress prompts: you have a signal for when quality drops below acceptable thresholds.
In production, trace every request with a correlation ID through your observability layer. The metrics that matter for cost control are: average input tokens, average output tokens, average tool-call depth, retry rate, cache hit rate, and cost per request by agent type. Alert when any of these drift more than 20 percent from baseline. A prompt change that silently inflates token counts by 30 percent is a budget incident, and you want to catch it in hours, not on your next monthly invoice.
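The 20 percent drift rule can be checked with a short comparison against a stored baseline. The metric names and values below are made up for illustration; in practice both dictionaries would come from your observability tool's export or API.

```python
def drifted_metrics(baseline: dict, current: dict, threshold: float = 0.20) -> dict:
    """Return the relative change for each metric that moved more than threshold."""
    alerts = {}
    for name, base_value in baseline.items():
        if base_value == 0 or name not in current:
            continue  # cannot compute a relative change
        change = (current[name] - base_value) / base_value
        if abs(change) > threshold:
            alerts[name] = change
    return alerts


# Hypothetical numbers: a prompt change inflated input tokens by 30 percent.
baseline = {"avg_input_tokens": 2400, "avg_output_tokens": 250, "retry_rate": 0.020}
current = {"avg_input_tokens": 3120, "avg_output_tokens": 255, "retry_rate": 0.021}

for metric, change in drifted_metrics(baseline, current).items():
    print(f"ALERT {metric}: {change:+.0%} vs baseline")
```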
Frequently Asked Questions
How much does it cost to run an AI agent per month?
It depends almost entirely on request volume, model choice, and tool-call depth. A low-volume internal tool at 1,000 requests per day on a mid-tier model might cost $30 to $100 per month in LLM fees. A customer-facing agent at 50,000 requests per day on a frontier model can easily run $15,000 to $50,000 per month. The right answer is to build a per-request cost model before you pick your stack.
What is the cheapest model for production AI agents?
For most production agents, Claude Haiku or GPT-4.1-mini offer the best cost-to-quality ratio on high-volume, well-defined tasks. Reserve mid-tier models (Sonnet, GPT-4o-mini) for tasks requiring multi-step reasoning or nuanced judgment. Only use frontier models (Opus, GPT-4o) when evals show cheaper models failing on your specific workload. Model routing between tiers is often more cost-effective than picking one model for everything.
Do AI agents cost more than traditional software to run?
Yes, typically by a meaningful margin, but the comparison depends on what the agent is replacing. If it replaces human labor at $30 to $50 per hour, even a $0.05 per request agent is dramatically cheaper at scale. If it replaces a simple rule-based system, the LLM overhead is usually unjustifiable unless the task genuinely requires language understanding. Build the cost model for both and compare them honestly.
How do I reduce AI agent token costs without losing quality?
In priority order: enable prompt caching, compress your system prompt, implement context window management with summarization, add model routing so only complex steps use expensive models, set a tool-call depth ceiling, and constrain output length in your prompt. Applied together, these typically cut costs 60 to 85 percent. Run your eval suite after each change to confirm quality holds.
What observability tools should I use to track AI agent costs?
Langfuse (open source, self-hostable), Helicone, and LangSmith are the three I see most in production. All three give you per-request token breakdowns, latency traces, and cost attribution by agent step. Pick one, instrument it from day one, and set cost-per-request alerts. The tool matters less than the discipline of looking at the data weekly.
Should I use an agent framework or build from scratch to control costs?
Frameworks like LangChain and CrewAI add abstraction layers that can obscure token spend and make prompt compression harder. For cost-sensitive production deployments, I usually recommend building a thin custom orchestration layer: a routing function, a tool registry, a context manager, and a retry policy. It is 200 to 400 lines of code and gives you full visibility into every token that leaves your system.
Ready to Model and Build Your Agent the Right Way?
If you are scoping an AI agent project and want to know what it will actually cost to run before you commit to an architecture, that is exactly the kind of engagement I take on. I work with product teams as a solo independent AI systems architect, helping them design agent systems that are cost-predictable, observable, and secure from day one.
You can read more about how I work on my about page or browse past work on my projects page. If you are ready to scope your agent architecture, reach out via the contact page or go straight to the service details.