How Reliable Can an AI Agent Actually Be? Setting Honest Accuracy Targets

How Reliable Are AI Agents in Production?

In production, a well-engineered AI agent on a narrowly scoped task can hit 85 to 95% accuracy per step. But end-to-end, across a multi-step workflow, reliability compounds downward fast: a 95%-per-step agent running 10 steps delivers correct final output roughly 60% of the time. That is the number your team needs to plan around before you ship.

I am Mahmoud Zalt, an independent senior AI systems architect with 16 years building production software. I created Laradock (tens of millions of Docker pulls) and Apiato, and I founded Sista AI. I consult on AI Agent Development for engineering teams that need these systems working in the real world, not just in demos. You can read more about my background on my about page.

The Compounding Error Problem: The Math Teams Miss

Most engineers benchmark a single LLM call and feel good about a 92% accuracy score. What they don't model is that the success probabilities of all the steps multiply, so each added step shrinks the chance that the whole run is correct.

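The formula is simple: end-to-end reliability = per-step accuracy raised to the power of the number of steps. It assumes that every step must succeed, that steps fail independently, and that they all share the same accuracy. Real workflows deviate from all three, but the estimate is a useful planning baseline.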

Per-Step Accuracy	5 Steps	10 Steps	20 Steps
99%	95%	90%	82%
95%	77%	60%	36%
90%	59%	35%	12%
80%	33%	11%	1%
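
The table can be reproduced in a few lines of Python. This is a minimal sketch under the same simplifying assumption of independent steps with identical accuracy; the function name end_to_end_reliability is illustrative and not taken from any library.

```python
def end_to_end_reliability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps
    that all share the same per-step accuracy."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Print one row per accuracy level for 5, 10, and 20 steps.
    for accuracy in (0.99, 0.95, 0.90, 0.80):
        row = [f"{end_to_end_reliability(accuracy, n):.0%}" for n in (5, 10, 20)]
        print(f"{accuracy:.0%}", *row, sep="\t")
```

Plugging in your own measured per-step accuracy and planned step count gives a quick first estimate before any architecture work starts.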

This is not a benchmark problem. It is an architecture problem. A 10-step research-and-draft agent with 90% per-step accuracy will produce a correct final document roughly 35% of the time. That is not good enough for any business process. The fix is not a better model. The fix is designing the system so that errors are caught and corrected before they compound.

What Actually Drives Per-Step Reliability

Before you can improve end-to-end numbers, you need to know where per-step reliability comes from. In my experience across production deployments, these five factors dominate.

1. Task Scope and Specificity

Narrow, well-defined steps with deterministic success criteria score 15 to 25 percentage points higher than open-ended ones. 'Extract the invoice total from this PDF and return it as a number' is not the same complexity class as 'research this company and summarize their risk profile.'

2. Context Quality

Retrieval quality is the single biggest lever after task scope. Garbage context degrades even a strong model. In RAG-backed agents, the retrieval step is often where the chain actually breaks. A wrong chunk retrieved at step 2 poisons every downstream step.

3. Model Selection Per Step

Routing simple steps to a faster, cheaper model and reserving a stronger model for high-stakes reasoning steps is not just a cost optimization: it is a reliability strategy. A classification step does not need GPT-4o or Claude Sonnet. A legal-risk assessment step does.

4. Prompt Engineering and Output Constraints

Structured outputs with schema enforcement (JSON mode, function-calling schemas, constrained decoding) eliminate an entire class of parse-and-validate failures. I treat unstructured free-text LLM output as a reliability antipattern for any step that feeds another step.
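
To make this concrete, here is a minimal sketch of the consuming side: strict parsing of a step output that is supposed to be a JSON object. The InvoiceTotal shape and the parse_invoice_total name are assumptions for illustration. Provider-side schema enforcement (JSON mode, function-calling schemas) is not shown and would sit in front of this check.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceTotal:
    amount: float
    currency: str


def parse_invoice_total(raw_output: str) -> InvoiceTotal:
    """Parse and validate a model response that should look like
    {"amount": 1250.0, "currency": "USD"}. Raises ValueError otherwise."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")

    amount = data.get("amount")
    currency = data.get("currency")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("'amount' must be a number")
    if amount < 0:
        raise ValueError("'amount' must not be negative")
    if not isinstance(currency, str) or len(currency) != 3:
        raise ValueError("'currency' must be a 3-letter code")
    return InvoiceTotal(amount=float(amount), currency=currency.upper())
```

A free-text reply such as "The total is $1250" fails at json.loads, so the bad output is rejected at this step instead of reaching the next one.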

5. Tool and Integration Reliability

An agent is only as reliable as its tools. If each call to an external API succeeds 99.5% of the time and the agent calls it at 5 steps, roughly 2.5% of runs hit at least one tool failure (1 - 0.995^5) before the model has made a single mistake. You need retry logic, timeout budgets, and graceful degradation at every tool boundary.

Design Patterns That Actually Improve End-to-End Reliability

Knowing the compounding math, here are the patterns I use in production systems to push end-to-end reliability to acceptable levels.

Reduce Step Count Aggressively

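The most effective reliability improvement is removing steps. Because end-to-end reliability falls exponentially with step count, every step you eliminate removes one multiplication by the per-step accuracy: at 90% per step, going from 7 steps to 6 lifts end-to-end reliability from about 48% to about 53%. Before adding a step, ask: can this be done as part of the preceding step, or can it be replaced with a deterministic function? LLM calls should only exist where soft judgment is genuinely needed.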

Checkpoint and Validate Between Steps

Insert schema validation, rule-based checks, or a lightweight LLM-as-judge call at step outputs before they feed the next step. A step that produces malformed output should fail loudly at the checkpoint, not silently corrupt the chain. I typically implement this as a validation layer in the agent graph that runs after every LLM node.
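
A minimal sketch of that validation layer, assuming a purely linear pipeline where each step is a plain callable paired with a validator. The names run_with_checkpoints and CheckpointError are hypothetical; in a real agent graph the steps would be LLM or tool nodes from your framework.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]
Validator = Callable[[Any], bool]


class CheckpointError(RuntimeError):
    """Raised when a step output fails its validation checkpoint."""


def run_with_checkpoints(
    initial_input: Any,
    pipeline: List[Tuple[str, Step, Validator]],
) -> Any:
    """Run steps in order and validate each output before it feeds the next step."""
    value = initial_input
    for name, step, is_valid in pipeline:
        value = step(value)
        if not is_valid(value):
            # Fail loudly here instead of passing bad data downstream.
            raise CheckpointError(f"step '{name}' produced invalid output: {value!r}")
    return value


if __name__ == "__main__":
    # Stand-in steps: plain functions instead of LLM calls.
    demo = [
        ("tokenize", lambda text: text.split(), lambda out: len(out) > 0),
        ("count", len, lambda n: n > 0),
    ]
    print(run_with_checkpoints("flag non-standard clauses", demo))
```

The validator can be a schema check, a rule, or a call to a judge model. What matters is that it runs before the next step consumes the output.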

Use Reversible Steps and Rollback Windows

Design side effects to be reversible where possible. Write to a staging area first, confirm, then commit. For agents that take actions (send email, update record, call API), a human-in-the-loop gate at high-stakes points is not a weakness: it is a reliability mechanism.
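
One way to sketch the stage, confirm, commit flow is below. The ActionStager class and its approve callback are assumptions for illustration: the callback stands in for whatever human approval channel you use, and the staged actions live in memory only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StagedAction:
    description: str
    execute: Callable[[], None]
    high_stakes: bool = False


@dataclass
class ActionStager:
    # Called for each high-stakes action; returns True if a human approved it.
    approve: Callable[[StagedAction], bool]
    staged: List[StagedAction] = field(default_factory=list)

    def stage(self, action: StagedAction) -> None:
        """Record the action without performing any side effect."""
        self.staged.append(action)

    def discard(self) -> None:
        """Roll back: drop everything that has not been committed."""
        self.staged.clear()

    def commit(self) -> List[str]:
        """Execute staged actions. High-stakes actions run only if approved;
        unapproved ones stay staged. Returns descriptions of executed actions."""
        executed: List[str] = []
        remaining: List[StagedAction] = []
        for action in self.staged:
            if action.high_stakes and not self.approve(action):
                remaining.append(action)
                continue
            action.execute()
            executed.append(action.description)
        self.staged = remaining
        return executed
```

Nothing touches the outside world until commit is called, so a failed validation earlier in the run can simply call discard.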

Retry With Backoff on Transient Failures

Transient failures (API rate limits, flaky context retrieval, network timeouts) account for a meaningful fraction of real-world failures. Exponential backoff with jitter and a maximum retry budget of 2 to 3 attempts recovers a substantial fraction of these without looping infinitely.
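
A minimal sketch of this pattern using only the standard library. TransientError is a hypothetical stand-in for whatever exceptions your client raises on rate limits and timeouts, and the default delays are illustrative values, not recommendations.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for rate limits, timeouts, and similar retryable failures."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to `max_attempts` times, sleeping between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # Budget exhausted: surface the failure to the caller.
            # Full jitter: random wait up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Only exceptions listed in retry_on are retried. A validation failure or a permission error should propagate immediately, because repeating the same call will not fix it.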

Parallelize Independent Steps

When steps are independent, run them in parallel. This does not improve per-step accuracy, and if the final result needs every branch, the branch success probabilities still multiply at the join. What it buys you is isolation: an error in one branch cannot feed bad input into another, and a failed branch can be retried on its own without rerunning the rest. It also cuts latency, which matters for user-facing agents.
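
A minimal sketch of running independent branches concurrently with the standard library's thread pool. The run_branches helper is a hypothetical name, and the branches here are plain callables standing in for LLM or tool calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple


def run_branches(
    branches: Dict[str, Callable[[], Any]],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run independent branches concurrently.

    Returns (results, errors) keyed by branch name, so a failed branch can be
    retried or escalated without rerunning the ones that succeeded."""
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    if not branches:
        return results, errors
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn) for name, fn in branches.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

The join step then decides what to do with a non-empty errors dict: retry that branch, degrade gracefully, or escalate.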

Evals: The Only Honest Way to Know Your Real Numbers

Teams ship agents without an eval suite and then discover reliability problems in production. Evals are not optional engineering overhead. They are how you know whether your reliability is 60% or 85% before your users find out.

What a Minimal Eval Suite Looks Like

For any production agent, I define: a golden dataset of 50 to 200 representative inputs with known correct outputs; a set of adversarial inputs designed to trigger common failure modes; and a set of edge cases at the boundary of the agent's intended scope.

For each input, I measure: did the agent produce the correct final output (end-to-end pass rate), did each intermediate step produce a valid output (per-step pass rates), and did the agent stay within its latency and cost budget. This gives three separate reliability numbers, all of which matter.
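
The three numbers can be computed from a list of per-input results. This is a sketch only: the EvalResult shape and the summarize function are assumptions, and your eval harness will have its own record format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalResult:
    final_correct: bool
    step_valid: Dict[str, bool]  # step name -> produced a valid output
    latency_s: float
    cost_usd: float


def summarize(
    results: List[EvalResult],
    max_latency_s: float,
    max_cost_usd: float,
) -> Dict[str, object]:
    """Compute end-to-end, per-step, and budget pass rates as fractions."""
    if not results:
        raise ValueError("results must not be empty")
    total = len(results)
    step_runs: Dict[str, int] = {}
    step_passes: Dict[str, int] = {}
    for result in results:
        for step, ok in result.step_valid.items():
            step_runs[step] = step_runs.get(step, 0) + 1
            step_passes[step] = step_passes.get(step, 0) + int(ok)
    within_budget = sum(
        r.latency_s <= max_latency_s and r.cost_usd <= max_cost_usd
        for r in results
    )
    return {
        "end_to_end_pass_rate": sum(r.final_correct for r in results) / total,
        "per_step_pass_rate": {s: step_passes[s] / step_runs[s] for s in step_runs},
        "budget_pass_rate": within_budget / total,
    }
```

Keeping per-step pass rates separate from the end-to-end rate is what lets you see which step is dragging the chain down.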

LLM-as-Judge for Subjective Outputs

When the output is a document, a summary, or a recommendation, there is no deterministic correct answer. I use a separate evaluator model (LLM-as-judge) with a rubric that specifies: factual accuracy, instruction following, format compliance, and absence of hallucinations. The rubric matters more than the judge model choice.

Regression Gating

Every change to prompts, retrieval, or model version runs the eval suite before deployment. A drop of more than 2 percentage points in end-to-end pass rate blocks the deployment. This sounds strict, but without it, you will slowly erode reliability over months of incremental changes and not notice until something goes wrong in production.
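
The gate itself is a one-line comparison. A sketch, assuming pass rates are stored as fractions between 0 and 1; the function name and the rounding guard are my own choices for illustration.

```python
def should_block_deployment(
    baseline_pass_rate: float,
    candidate_pass_rate: float,
    max_drop_points: float = 2.0,
) -> bool:
    """Return True if the candidate's end-to-end pass rate dropped by more
    than `max_drop_points` percentage points relative to the baseline.

    Pass rates are fractions in [0, 1]; the threshold is in percentage points."""
    for rate in (baseline_pass_rate, candidate_pass_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("pass rates must be between 0 and 1")
    # Round to avoid float noise turning an exact 2.0-point drop into 2.0000000000000018.
    drop_points = round((baseline_pass_rate - candidate_pass_rate) * 100, 6)
    return drop_points > max_drop_points
```

In CI this would run after the eval suite, with the baseline read from the last deployed version's stored results.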

Observability in Production: What You Need to Measure

Evals tell you reliability before deployment. Observability tells you what is actually happening in production. The two are not substitutes for each other.

At minimum, a production agent needs traces. Every step in the agent graph should emit a structured trace that includes: the input and output of the step, the model and version used, the latency, the token count, the tool calls made and their results, and a success or failure classification. Tools like LangSmith, Arize Phoenix, and Langfuse give you this out of the box for common frameworks. If you are building a custom agent, instrument it yourself with OpenTelemetry.
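
As a sketch of what one such trace record might contain, here is a plain dataclass serialized to JSON. The field names are assumptions that mirror the list above. It deliberately does not use the OpenTelemetry API, so in a real system these fields would become span attributes or log fields in whichever tracing tool you adopt.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class StepTrace:
    step: str
    model: str
    model_version: str
    input: Any
    output: Any
    latency_ms: float
    token_count: int
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)
    success: bool = True
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON line; non-JSON values fall back to str()."""
        return json.dumps(asdict(self), default=str, sort_keys=True)


if __name__ == "__main__":
    trace = StepTrace(
        step="retrieve_policy_clauses",
        model="example-model",  # hypothetical model name
        model_version="2024-01",  # hypothetical version label
        input={"section": "4.2"},
        output={"clauses": ["P-17"]},
        latency_ms=412.0,
        token_count=1830,
        tool_calls=[{"tool": "vector_search", "ok": True}],
    )
    print(trace.to_json())
```

One JSON line per step is enough to compute per-step failure rates later, as long as the success flag is set consistently.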

The metrics I track in production are: end-to-end success rate (by task type, not aggregate), per-step failure rates, latency at p50 and p95, cost per completed task, and human-escalation rate. The escalation rate is especially important: if your human-in-the-loop gate is triggering 40% of the time, your agent is not reliable enough to be useful.
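
A minimal sketch of computing several of these from run records. The record shape (a dict with success, escalated, latency_s, and cost_usd keys) is hypothetical, and the percentile uses the simple nearest-rank method. To get numbers by task type, group the runs first and call the function once per group.

```python
from typing import Any, Dict, List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]


def production_metrics(runs: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate success, escalation, latency, and cost for one task type."""
    if not runs:
        raise ValueError("runs must not be empty")
    latencies = [run["latency_s"] for run in runs]
    completed = sum(1 for run in runs if run["success"])
    total_cost = sum(run["cost_usd"] for run in runs)
    return {
        "success_rate": completed / len(runs),
        "escalation_rate": sum(1 for run in runs if run["escalated"]) / len(runs),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        # Failed runs still cost money, so divide total spend by completed tasks.
        "cost_per_completed_task_usd": total_cost / completed if completed else float("inf"),
    }
```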

Set alerts on end-to-end success rate drops and on per-step failure spikes. A retrieval step that starts returning irrelevant chunks shows up as a per-step failure spike 24 to 48 hours before it causes a visible end-to-end regression. Catching it at the step level is much cheaper than diagnosing a downstream failure.

Guardrails, Security, and What Reliability Actually Requires

Reliability is not just about producing the right answer. It includes not producing a wrong answer that causes harm, not leaking information, and not taking unauthorized actions. These are part of the reliability envelope in any serious production system.

Input and Output Guardrails

Input guardrails screen for prompt injection, jailbreak attempts, and out-of-scope queries before the agent starts processing. Output guardrails run on the final response before it is delivered: checking for PII exposure, policy violations, hallucinated citations, and format compliance. I implement both as separate, lightweight components, not as part of the main agent prompt.
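
As an illustration of an output guardrail living outside the main prompt, here is a deliberately simple sketch. The two regular expressions are rough heuristics for email addresses and US-style SSNs, not a complete PII detector, and output_violations is a hypothetical name. Production systems typically use a dedicated detection service or library here.

```python
import re
from typing import List

# Rough heuristics only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def output_violations(response: str, max_chars: int = 4000) -> List[str]:
    """Return a list of violation codes; an empty list means the response may be delivered."""
    violations: List[str] = []
    if EMAIL_RE.search(response):
        violations.append("pii:email")
    if US_SSN_RE.search(response):
        violations.append("pii:ssn")
    if len(response) > max_chars:
        violations.append("format:too_long")
    return violations
```

Because the check is a separate function, it can be tested and versioned independently of the agent prompt.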

Tool Calling and MCP Security

Agents that use tools via the Model Context Protocol or direct function calling need a permission model. Each tool should declare its access scope. Tools that write data, send communications, or call external APIs should require explicit capability grants that are validated at runtime, not just declared at prompt time. An agent that can call arbitrary tools is not a reliable system: it is an attack surface.
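
A minimal sketch of runtime-validated capability grants. This is not the MCP API or any framework's API: Tool, ToolRegistry, and the scope strings are hypothetical names chosen to show the shape of the check.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Tool:
    name: str
    required_scopes: FrozenSet[str]  # access scope the tool declares
    run: Callable[..., Any]


class PermissionDenied(Exception):
    """Raised when a tool call is not covered by the granted scopes."""


class ToolRegistry:
    def __init__(self, granted_scopes: Iterable[str]) -> None:
        self._granted = frozenset(granted_scopes)
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        """Validate grants at call time, then run the tool."""
        tool = self._tools.get(name)
        if tool is None:
            # Unregistered tools are never callable, whatever the model asks for.
            raise PermissionDenied(f"unknown tool: {name}")
        missing = tool.required_scopes - self._granted
        if missing:
            raise PermissionDenied(f"tool '{name}' needs scopes: {sorted(missing)}")
        return tool.run(**kwargs)
```

The check sits in the code path that executes the tool, so a prompt that talks the model into requesting send_email still hits PermissionDenied if that scope was never granted.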

Scope Containment

The most common reliability failure I see in production is scope creep: the agent attempts to solve a problem outside its defined scope and produces a confident but wrong answer. Hard scope boundaries, implemented as routing logic or a classification step before the main agent, prevent this. 'I don't know, escalating' is a reliable answer. A hallucinated answer is not.
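
The routing gate can be sketched as a thin wrapper around whatever classifier you use. Everything here is an assumption for illustration: the classifier is injected as a callable that returns a label and a confidence, and the 0.8 threshold is a placeholder to be tuned on your eval set.

```python
from typing import Callable, Set, Tuple

# A classifier takes the query and returns (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]

ESCALATE = "escalate"


def route(
    query: str,
    classify: Classifier,
    in_scope_labels: Set[str],
    min_confidence: float = 0.8,
) -> str:
    """Return the in-scope label to handle, or ESCALATE when the query is
    out of scope or the classifier is not confident enough."""
    label, confidence = classify(query)
    if label in in_scope_labels and confidence >= min_confidence:
        return label
    return ESCALATE
```

Anything that comes back as escalate goes to a human or a fallback path and never reaches the main agent.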

Worked Example: A 7-Step Document Review Agent

Here is a concrete case. A client wanted an agent to review vendor contracts and flag non-standard clauses. The initial design had 7 steps: ingest PDF, parse structure, retrieve relevant policy clauses, compare each section, classify risk for each section, aggregate risk score, generate summary report.

At 90% per-step accuracy, that is a 48% end-to-end pass rate. Not acceptable for legal review.

Here is what we changed. First, we merged parse-and-retrieve into one step using a structured extraction prompt with a strict JSON schema, reducing to 6 steps. Second, we added a validation checkpoint after the comparison step that verified every section had been evaluated (a deterministic check, not an LLM call). Third, we added an LLM-as-judge evaluator on the risk classification step that re-scored any classification with confidence below 0.85. Fourth, we routed the final summary generation to the strongest available model and gave it the full intermediate outputs as context.
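
The confidence-gated re-evaluation in the third change can be sketched as follows. The RiskClassification shape and the rescore callback are hypothetical stand-ins, not the client's actual code. The 0.85 threshold is the one described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RiskClassification:
    section_id: str
    risk: str          # for example "low", "medium", or "high"
    confidence: float  # 0.0 to 1.0, as reported by the classification step


def rescore_low_confidence(
    items: List[RiskClassification],
    rescore: Callable[[RiskClassification], RiskClassification],
    threshold: float = 0.85,
) -> List[RiskClassification]:
    """Keep confident classifications as they are and send the rest to the
    evaluator (for example an LLM-as-judge call) for a second opinion."""
    return [
        item if item.confidence >= threshold else rescore(item)
        for item in items
    ]
```

Only the uncertain sections pay for the second model call, which keeps the cost of the re-evaluation loop proportional to how often the classifier is unsure.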

Result: 6 steps instead of 7, with a validation checkpoint and a confidence-gated re-evaluation loop. Per-step accuracy on the remaining steps measured at 96 to 98% in evals. End-to-end pass rate: 83%, which is about what compounding predicts for 6 steps at roughly 97% each. Still not 99%, but acceptable for a human-reviewed workflow, with clear traceability at every step so the reviewing lawyer knows exactly where to check.

Honest Accuracy Targets by Use Case

Different use cases warrant different targets. Here are the benchmarks I use when advising clients on what is realistic and what the architecture needs to achieve it.

Use Case	Minimum Acceptable	Achievable with Good Engineering	Primary Lever
Internal automation (low stakes)	70%	85 to 90%	Step reduction, schema outputs
Customer-facing triage / routing	85%	92 to 95%	Scope containment, guardrails
Research and summarization	80%	88 to 93%	Retrieval quality, evals
Document extraction and structuring	90%	95 to 98%	Schema constraints, validation
Autonomous action (write/send/commit)	95%	97 to 99%	Human-in-the-loop gates, scope limits

The 'autonomous action' row deserves emphasis. Any agent that takes irreversible real-world actions needs to be held to a much higher standard, and human-in-the-loop is not a fallback: it is part of the design from the start. An agent that sends emails or commits code at 90% accuracy is not a 90% reliable system. It is a system that does the wrong thing 1 in 10 times, with real consequences each time.

Frequently Asked Questions

How accurate are AI agents in production?

Per-step accuracy for well-scoped agents on narrow tasks typically ranges from 85 to 95%. End-to-end accuracy across a multi-step workflow is substantially lower due to compounding: a 10-step agent at 95% per step delivers correct final output roughly 60% of the time. Real-world numbers depend heavily on task scope, retrieval quality, and how many steps the workflow requires.

Why do AI agents fail so often in production?

The most common causes are: steps that are too broad and open-ended, poor retrieval quality feeding bad context into the chain, no validation between steps so errors compound silently, and no eval suite so teams don't measure actual performance before shipping. Compounding error is the structural problem: each step's success probability multiplies with that of every preceding step, so the chance of a fully correct run shrinks as the workflow gets longer.

How do I improve AI agent reliability?

The highest-leverage actions are: reduce step count by merging or replacing LLM steps with deterministic logic where possible, add schema-constrained structured outputs at every step, insert validation checkpoints between steps, build an eval suite with 50+ representative cases before shipping, and implement LLM-as-judge re-evaluation for any step where confidence is below threshold. Observability with per-step traces lets you find and fix failures at the source rather than diagnosing from end-to-end failures.

What is a realistic accuracy target for an AI agent?

For internal automation with human review, 80 to 90% end-to-end is acceptable. For customer-facing workflows, you need 90%+ end-to-end, which takes roughly 98% per-step accuracy in a 5-step workflow (0.98^5 is about 0.90) and about 99% at 10 steps, so keep the workflow short and per-step accuracy high. For any autonomous action that is hard to reverse, target 95%+ end-to-end and gate irreversible steps behind a human-in-the-loop confirmation.
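
To turn an end-to-end target into a per-step requirement, invert the compounding formula. A sketch under the same independence assumption as the table earlier; both function names are illustrative.

```python
import math


def required_per_step_accuracy(target_end_to_end: float, steps: int) -> float:
    """Per-step accuracy p such that p ** steps equals the end-to-end target."""
    if not 0.0 < target_end_to_end <= 1.0:
        raise ValueError("target_end_to_end must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_end_to_end ** (1.0 / steps)


def max_steps(per_step_accuracy: float, target_end_to_end: float) -> int:
    """Largest step count that still meets the target at a given per-step accuracy."""
    if not 0.0 < per_step_accuracy < 1.0:
        raise ValueError("per_step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target_end_to_end <= 1.0:
        raise ValueError("target_end_to_end must be in (0, 1]")
    return math.floor(math.log(target_end_to_end) / math.log(per_step_accuracy))


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, f"{required_per_step_accuracy(0.90, n):.1%}")
    print(max_steps(0.97, 0.90))
```

Running this for a 90% target shows how quickly the per-step bar rises with workflow length, and how few steps a 97% per-step agent can afford.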

Do more powerful models make agents more reliable?

Partially. A stronger model raises per-step accuracy, which helps. But it doesn't fix compounding error structurally. A GPT-4-class model at 95% per step over 10 steps is still a 60% end-to-end system. Model selection matters most for high-stakes individual steps. The architectural fixes (fewer steps, validation checkpoints, structured outputs, evals) have more leverage than model upgrades alone.

What is human-in-the-loop and when do AI agents need it?

Human-in-the-loop is a design pattern where the agent pauses and requests human confirmation before taking a high-stakes or irreversible action. It is required any time the consequence of a wrong action is material: sending communications, writing to production databases, committing or deploying code, making financial transactions, or generating content for legal or compliance review. It is not a fallback for an unreliable agent. It is a reliability mechanism built into the design from the start.

Ready to Build AI Agents That Are Actually Reliable?

Reliability in production AI agents is an architecture decision, not a model choice. If your team is designing a multi-step agent workflow and you want honest numbers and a design that holds up, I can help. I work directly with engineering teams as an independent consultant, from architecture through production deployment, including evals, observability, and guardrails.

See the full scope of how I work on AI Agent Development, or reach out directly to talk through your specific system. If you want to know more about my background and previous projects, start at my about page or projects.
