Zalt Blog

Deep Dives into Code & Architecture

AT SCALE

Build AI Agents on Observability, Not Around It

By Mahmoud Zalt
Insights
11m read
AI agents do not crash. They loop, drift, and quietly degrade with no error code. Here is why observability has to be the foundation you build agents on, not a layer you bolt on at the end.

Why Observability Has to Come First for AI Agents

You build agents on observability, not around it, because an agent that fails in production almost never crashes. It loops, it picks the wrong tool, it acts on stale context, or it slowly drifts off the goal it started with. None of that throws an exception. So the only thing standing between you and a silent failure is whether you can reconstruct, after the fact, what the agent saw and why it chose what it did. If you cannot, you are not debugging. You are guessing.

I am Mahmoud Zalt, an independent senior AI systems architect who has been building production software since 2010. I founded Sista AI, where a year of running autonomous agents in production has convinced me that you cannot operate what you cannot observe. That lesson reshaped how I build every agent, and it is the one most teams learn the expensive way.

The Old Debugging Playbook Quietly Breaks

In normal software, failure is explicit. A request times out, a service returns a 500, an exception fires. Debugging is mostly deterministic: reproduce locally, read the trace, fix the line. You can even add the missing log line after the incident, because the same input gives you the same output every time.

Agents take that away. The same input can produce different reasoning, a different tool choice, and a different outcome on every run. And the failures that hurt do not announce themselves. They surface as worse output, higher latency, or a cost spike, long after the decision was made. The question you actually need to answer is no longer what error fired. It is what did the agent see, and why did it choose that. That is a different kind of question, and it demands a different kind of system underneath it.

Capture Everything, Because You Cannot Log It Later

The first rule is the one everyone skips: you cannot retroactively log what was never recorded. When an agent does something wrong at 2pm and you notice at 6pm, going back to add instrumentation is not an option. The signal either existed or it is gone.

So capture the full context in real time. The reasoning, every tool call, every retrieved document, every retry, every intermediate decision. Storage is cheap. An unreconstructable decision is expensive. And there is a second payoff most teams never collect on: the trace you captured to debug a failure is the exact same data that makes the next run smarter. It becomes an evaluation case and it becomes context. The data is not overhead. It is fuel.
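To make "capture the full context in real time" concrete, here is a minimal sketch of an append-only trace recorder. The names `TraceEvent`, `TraceRecorder`, and the event kinds are assumptions made for illustration, not an API from any particular library, and the in-memory list stands in for whatever durable storage you actually use.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TraceEvent:
    """One step of an agent run, recorded at the moment it happens."""
    run_id: str
    kind: str  # hypothetical kinds: "reasoning", "tool_call", "retrieval", "retry"
    payload: dict
    ts: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TraceRecorder:
    """Append-only recorder. Events are kept in memory here; a real system
    would stream them to durable storage instead."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events = []

    def record(self, kind: str, **payload: Any) -> TraceEvent:
        event = TraceEvent(run_id=self.run_id, kind=kind, payload=payload)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """Serialize the run as JSON Lines, one event per line."""
        return "\n".join(json.dumps(asdict(e), default=str) for e in self.events)


# Record every intermediate decision as it happens, not after the incident.
recorder = TraceRecorder(run_id="run-001")
recorder.record("reasoning", text="User wants a refund status; look up the order.")
recorder.record("tool_call", tool="lookup_order", args={"order_id": "A17"}, attempt=1)
recorder.record("retry", tool="lookup_order", reason="timeout", attempt=2)
recorder.record("retrieval", doc_ids=["policy-12"], query="refund window")
trace_jsonl = recorder.to_jsonl()
```

The same JSON Lines output can later be replayed as an evaluation case or fed back as context, which is why the recorder keeps the payload unfiltered.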

One Stack, One Identity

You do not need an exotic platform for this. You need one tool per layer, unified into a single pane, and full ownership of your own data so a traffic spike does not turn into an unsustainable bill. The boring, well-worn layers are the right ones on purpose. The discipline is not in the tooling. It is in deciding, up front, that nothing ships unless it is observable.

The layers are unremarkable: infrastructure metrics, structured application logs, tracing for prompts and latency and cost, product analytics for real behavior, and release-aware error tracking so every error pins to the exact deployed commit. But the stack is only half of it. The piece that makes the whole thing usable is a single correlation ID that follows one request across every service, worker, queue, tool, and external API it touches, carrying tenant, user, session, and execution state with it. When a failure shows up hours later, one query has to rebuild the entire execution end to end. Logs without that shared identity are noise. Logs that all carry it are a time machine. If you want this portable as the ecosystem keeps shifting, lean on the emerging open standards for agent telemetry, such as OpenTelemetry and its generative AI semantic conventions, rather than a proprietary schema.

Evaluate Decisions, Not Just Outputs

Traces tell you what happened. They do not tell you how well it happened. For that you need a continuous evaluation layer running on top of live traffic, not a benchmark you ran once before launch and never looked at again.

Score a sample of production traces with model-graded judges, custom scorers, and plain rule-based assertions. Track tool accuracy, grounding, and whether the agent is still serving the goal it was given. Catch the regression the moment a prompt or model changes, instead of finding it in a customer complaint a week later. The most valuable signal here is not the final answer. It is the path the agent took to get there, measured against the goal.
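As a rough illustration of the rule-based end of that scoring, the sketch below samples traces, scores tool accuracy and grounding, and flags metrics that fell below a baseline. The trace shape, the metric definitions (a set-overlap ratio for tools, a citation check for grounding), and the tolerance are all assumptions chosen for the example. Model-graded judges would slot in as additional scorer functions.

```python
import random
from statistics import mean

# Hypothetical trace shape: which tools a reviewer expected, which the agent
# called, which documents it retrieved, and which it cited in the answer.
traces = [
    {"run_id": "r1", "expected_tools": ["lookup_order"],
     "tools_called": ["lookup_order"],
     "retrieved_docs": ["policy-12"], "cited_docs": ["policy-12"]},
    {"run_id": "r2", "expected_tools": ["lookup_order"],
     "tools_called": ["lookup_order", "send_email"],
     "retrieved_docs": ["policy-12"], "cited_docs": ["policy-99"]},
]


def sample_traces(all_traces, rate, seed=0):
    """Keep roughly `rate` of the traces; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [t for t in all_traces if rng.random() < rate]


def tool_accuracy(trace) -> float:
    """Overlap between expected and called tools (1.0 means identical sets)."""
    expected, called = set(trace["expected_tools"]), set(trace["tools_called"])
    union = expected | called
    return len(expected & called) / len(union) if union else 1.0


def grounding(trace) -> float:
    """Rule-based assertion: every cited document must have been retrieved."""
    return 1.0 if set(trace["cited_docs"]) <= set(trace["retrieved_docs"]) else 0.0


def score(batch) -> dict:
    # Assumes a non-empty batch; mean() raises on an empty list.
    return {
        "tool_accuracy": mean([tool_accuracy(t) for t in batch]),
        "grounding": mean([grounding(t) for t in batch]),
    }


def regressed(report, baseline, tolerance=0.05):
    """Names of metrics that dropped more than `tolerance` below baseline."""
    return [name for name, value in report.items()
            if value < baseline[name] - tolerance]


report = score(sample_traces(traces, rate=1.0))
regressions = regressed(report, baseline={"tool_accuracy": 0.9, "grounding": 0.5})
```

Running this on each prompt or model change, rather than once before launch, is what turns the scores into a regression signal.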

Monitor for Agent Failures, Not Just Dead Boxes

CPU and uptime tell you the box is alive. They tell you nothing about whether your agents are behaving. The failures that actually hurt are agent-shaped, and most are silent, so error-log monitoring never sees them. You have to watch for each one deliberately:

  • Infinite loops. The agent keeps working but repeats itself, burning cost with no progress. A step ceiling and no-progress detection stop it. You need both, because a loop can technically progress while going nowhere.
  • Tool misuse. It calls the wrong tool, or the right tool with bad parameters, or exceeds the scope the task warranted. Tool-call accuracy scoring and least-privilege permissions catch it.
  • Goal drift. No single step fails, but the cumulative effect of small deviations produces an output that no longer serves the original intent. Compare the reasoning at the final step against the goal it started with.
  • Silent degradation. Quality slowly drops with no error and no crash. Only continuous eval scoring on live traces surfaces it.
  • Cost and latency anomalies. A spike with no obvious cause. Metrics with real thresholds on spend and tail latency page you before it compounds.
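The first bullet's pairing of a step ceiling with no-progress detection can be sketched as a small guard that the agent loop consults before each step. `LoopGuard`, its thresholds, and the idea of a caller-supplied state fingerprint are assumptions for illustration. How you summarize "state" depends on your agent.

```python
import hashlib
import json
from typing import Optional


class LoopGuard:
    """Stops a run on either of two conditions: too many steps overall, or
    the same (tool, args, state) combination seen too many times."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 3) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = {}

    def check(self, tool: str, args: dict, state_fingerprint: str) -> Optional[str]:
        """Return a stop reason, or None if the run may continue."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step_ceiling"
        raw = json.dumps([tool, args, state_fingerprint], sort_keys=True)
        key = hashlib.sha256(raw.encode()).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] >= self.max_repeats:
            return "no_progress"
        return None


# Case 1: the agent repeats the exact same call against unchanged state.
guard = LoopGuard(max_steps=5, max_repeats=3)
same_call = [guard.check("search", {"q": "refund"}, "s0") for _ in range(3)]

# Case 2: the arguments change every time, so nothing ever repeats, but the
# run still goes nowhere. Only the step ceiling catches this one.
guard2 = LoopGuard(max_steps=5, max_repeats=3)
varied_call = [guard2.check("search", {"q": f"refund {i}"}, "s0") for i in range(6)]
```

The second case is the reason both checks are needed: varied arguments defeat repeat detection, and a generous step ceiling alone lets an exact repeat burn budget longer than it should.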

Then route the alerts like you mean it. Tiered channels keep a broken user journey separate from a noisy background warning, so the page that wakes you is always the one that matters. And the alert teams forget is the one for silence: a dead man's switch that fires if the telemetry pipeline itself goes quiet. The worst outage is the one where your monitoring went down too and never told you. At this layer observability stops being a passive dashboard and becomes a control system, deciding when a run gets stopped, escalated, or paused before it cascades into the next agent.
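A dead man's switch and tiered routing can both be sketched in a few lines. The class name, the alert kinds, and the channel names below are assumptions for illustration. The injectable clock exists only so the behavior can be demonstrated without real waiting.

```python
import time


class DeadMansSwitch:
    """Reports silence when no heartbeat has arrived within `max_silence_s`."""

    def __init__(self, max_silence_s: float, clock=time.monotonic) -> None:
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the telemetry pipeline each time it delivers data."""
        self._last_beat = self._clock()

    def is_silent(self) -> bool:
        return self._clock() - self._last_beat > self.max_silence_s


def route(alert_kind: str) -> str:
    """Tiered routing: only user-facing breakage and telemetry silence page
    a human. Everything else goes to a lower-urgency channel."""
    paging = {"broken_user_journey", "telemetry_silent"}
    return "page" if alert_kind in paging else "ticket"


# Demonstration with a fake clock instead of real time.
now = [0.0]
switch = DeadMansSwitch(max_silence_s=60, clock=lambda: now[0])

now[0] = 30.0
quiet_at_30 = switch.is_silent()   # 30s since the last beat: still fine
switch.beat()                      # pipeline delivers data at t=30

now[0] = 100.0
quiet_at_100 = switch.is_silent()  # 70s since the last beat: silent
channel = route("telemetry_silent") if quiet_at_100 else None
```

The watchdog should run somewhere that does not share a failure mode with the pipeline it watches, otherwise both go quiet together.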

Close the Loop: Telemetry as Fuel the System Can Use

This is where agentic systems pull ahead of everything that came before. In classic software, telemetry is for humans staring at dashboards. In an agentic system, telemetry is fuel the system itself can consume. A reliable feedback loop has four stages: detect, diagnose, decide, deploy. Observability owns the first two, and increasingly the agents can drive the rest.

  • Monitoring agents watch the telemetry stream and act on anomalies without waiting for a human.
  • Failed traces and low eval scores flow to coding agents that propose a root-cause fix, which a human reviews before it ships.
  • Execution history becomes dynamic context, so the next run starts smarter than the last one did.

The strongest pattern in the field right now is turning a production failure straight into a permanent regression test. A trace that went wrong becomes an eval case that runs in CI on the next change, so the same mistake cannot ship twice, and the loop from incident to guardrail shrinks from days to minutes. None of it is possible without rich telemetry underneath. Evals need traces to score. Self-correction needs history to learn from. Take observability away and the whole self-improvement story collapses into wishful thinking.
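Turning a failed trace into a regression test can be as simple as freezing its input and context next to the behavior you expected. This is a minimal sketch under assumed names: the trace fields, `trace_to_eval_case`, `run_case`, and the two stand-in agents are all hypothetical, and a real check would likely compare more than the tool list.

```python
def trace_to_eval_case(trace: dict, expected_tools: list, note: str) -> dict:
    """Freeze a failed production trace into a replayable eval case."""
    return {
        "case_id": f"regress-{trace['run_id']}",
        "input": trace["input"],
        "context": trace["retrieved_docs"],
        "expected_tools": expected_tools,
        "note": note,
    }


def run_case(case: dict, agent) -> bool:
    """Replay the case against an agent callable; pass if the tools match."""
    result = agent(case["input"], case["context"])
    return result["tools_called"] == case["expected_tools"]


# The trace as captured in production, including the wrong decision.
failed_trace = {
    "run_id": "run-001",
    "input": "Where is my refund for order A17?",
    "retrieved_docs": ["policy-12"],
    "tools_called": ["send_email"],
}

case = trace_to_eval_case(
    failed_trace,
    expected_tools=["lookup_order"],
    note="Agent emailed the customer without checking the order first.",
)


# Stand-ins for the agent before and after the fix.
def buggy_agent(user_input, context):
    return {"tools_called": ["send_email"]}


def fixed_agent(user_input, context):
    return {"tools_called": ["lookup_order"]}


fails_before_fix = not run_case(case, buggy_agent)
passes_after_fix = run_case(case, fixed_agent)
```

Stored alongside the code and run in CI, a case like this fails the build if the same mistake reappears.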

Build the Layer Underneath Before You Tune a Single Prompt

If you take one thing from this: before you optimize a prompt, build the observability layer beneath it. Prompts improve what your agents say. Observability is what lets them improve themselves. In deterministic software it tells you what happened. In an agentic system it is the only thing that tells you why, and the only thing your agents can actually learn from. Build on it, not around it.

If you are putting agents into production and want them to hold up without quietly going off the rails, I work with teams directly as an independent architect. See my background and the systems I have shipped. The fastest path is a focused engagement on your specific system, not a generic audit. Reach out through the contact page.

Work with me to build AI agents you can actually trust in production.

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
