How to Test a Non-Deterministic AI Agent

The Short Answer: Stop Asserting Exact Outputs

You test a non-deterministic AI agent by replacing exact-match assertions with behavioral contracts: eval suites that check intent, LLM-as-judge scoring that rates quality, and trajectory checks that verify the agent took the right steps regardless of the exact words it used. The output changes every run. The behavior should not.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software. I founded Sista AI, and the past year of holding non-deterministic agents to account in production is exactly where my approach to testing them was forged. I build production AI agents and evaluation pipelines for teams that need them to actually work. If you are shipping an agent and need this done right, I offer AI Agent Development as a standalone engagement. You can read more about my background here.

Why Classic Unit Tests Break on AI Agents

A traditional unit test looks like this: given input X, assert output equals Y. That contract is meaningless when Y is generated by a language model. The same prompt can return 'The answer is 42', '42 is correct', and 'Based on the data, I would say 42' on three consecutive calls. All three are correct. A string equality check would fail two of them.
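To make the failure concrete, here is a minimal sketch using the three phrasings above. The `contains_answer` helper is a hypothetical name used only for illustration, and a real behavioral check would be richer than a substring test.

```python
# Three phrasings an agent might return for the same prompt.
outputs = [
    "The answer is 42",
    "42 is correct",
    "Based on the data, I would say 42",
]

expected = "The answer is 42"

# Exact-match assertion: only the first phrasing survives.
exact_matches = [out == expected for out in outputs]


def contains_answer(output: str, required: str) -> bool:
    """Hypothetical behavioral check: the required fact appears somewhere."""
    return required in output


behavioral_matches = [contains_answer(out, "42") for out in outputs]

print(exact_matches)       # [True, False, False]
print(behavioral_matches)  # [True, True, True]
```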

The deeper problem is that agents are not pure functions. They call tools, maintain state across turns, retrieve documents, and branch on intermediate results. The surface you need to test is not a string but a trajectory: did the agent call the right tools in a sensible order, pass the right parameters, handle failures gracefully, and arrive at an answer that satisfies the original intent?

Teams that try to bolt unit tests onto agents often waste months fighting flakiness and eventually give up on testing entirely. The fix is not better unit tests. It is a different testing philosophy built around four layers: golden dataset evals, LLM-as-judge scoring, trajectory assertions, and regression baselines.

The Four-Layer Eval Stack for Production Agents

Here is the stack I use on every agent I build. Each layer catches a different class of failure.

Layer	What it checks	Tooling	When to run
1. Golden dataset evals	Does the agent answer known questions correctly?	Custom harness, LangSmith, Braintrust	Every PR
2. LLM-as-judge scoring	Is the answer high quality, grounded, on-topic?	GPT-4o or Claude as evaluator, custom rubric	Every PR + nightly
3. Trajectory assertions	Did the agent call the right tools in the right order?	Trace inspection, OpenTelemetry + OTEL-AI	Every PR
4. Regression baseline	Did scores drop vs last stable release?	Stored eval run snapshots, threshold alerts	Every PR + before deploy

You do not need all four on day one. Start with a golden dataset and a regression baseline. Add LLM-as-judge once you have at least 50 examples. Add trajectory assertions once your agent uses two or more tools.

Building a Golden Dataset That Actually Catches Regressions

A golden dataset is a set of (input, expected behavior) pairs where 'expected behavior' is a rubric, not a string. For each example you define: what the answer must contain, what it must not contain, which tools it must or must not call, and a minimum quality score from your judge.

A minimal golden dataset entry looks like this:

{
  "id": "order-status-1",
  "input": "What is the status of order #8821?",
  "must_call_tools": ["get_order_status"],
  "must_not_call_tools": ["send_email"],
  "answer_must_contain": ["order", "#8821"],
  "answer_must_not_contain": ["error", "I do not know"],
  "min_judge_score": 0.8
}

Thirty examples like this, covering your core use cases and known edge cases, give you a meaningful regression signal. I typically build the first 30 by running the agent manually on real queries, reviewing the traces, and encoding what I observed as rubric constraints.
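One way to turn an entry like this into a check is sketched below. The `AgentRun` record and the `check_entry` function are hypothetical names, not part of any eval library. The sketch assumes your harness has already collected the final answer, the list of tool names called, and a judge score for the run. Matching is plain case-insensitive substring search, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent run collected by your harness."""
    answer: str
    tools_called: list[str]
    judge_score: float


def check_entry(entry: dict, run: AgentRun) -> list[str]:
    """Return a list of rubric violations; an empty list means the entry passes."""
    failures = []
    answer = run.answer.lower()
    for tool in entry.get("must_call_tools", []):
        if tool not in run.tools_called:
            failures.append(f"missing tool call: {tool}")
    for tool in entry.get("must_not_call_tools", []):
        if tool in run.tools_called:
            failures.append(f"forbidden tool call: {tool}")
    for phrase in entry.get("answer_must_contain", []):
        if phrase.lower() not in answer:
            failures.append(f"answer missing: {phrase}")
    for phrase in entry.get("answer_must_not_contain", []):
        if phrase.lower() in answer:
            failures.append(f"answer contains forbidden phrase: {phrase}")
    if run.judge_score < entry.get("min_judge_score", 0.0):
        failures.append(f"judge score {run.judge_score} below minimum")
    return failures


entry = {
    "id": "order-status-1",
    "input": "What is the status of order #8821?",
    "must_call_tools": ["get_order_status"],
    "must_not_call_tools": ["send_email"],
    "answer_must_contain": ["order", "#8821"],
    "answer_must_not_contain": ["error", "I do not know"],
    "min_judge_score": 0.8,
}

# Illustrative runs, invented for this sketch.
good_run = AgentRun("Order #8821 shipped yesterday and arrives Friday.", ["get_order_status"], 0.92)
bad_run = AgentRun("I do not know the status of that order.", ["send_email"], 0.4)

print(check_entry(entry, good_run))  # []
for failure in check_entry(entry, bad_run):
    print(failure)
```

Returning the list of violations rather than a single boolean keeps the failure report readable when several constraints break at once.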

What teams get wrong: they build golden datasets from synthetic data they invented themselves. That misses the actual failure modes. Use real queries from real users or from the product spec. Synthetic data is fine to pad coverage once you have a real baseline.

LLM-as-Judge: Writing Rubrics That Do Not Hallucinate Pass Grades

LLM-as-judge means using a second language model to score your agent's outputs against a rubric. It tolerates surface variance (phrasing, order, length) while still catching quality regressions. Done wrong, it is a rubber stamp. Done right, it is the closest thing to a human reviewer you can automate.

The rubric is everything. A bad rubric asks 'Is this a good answer? Score 1-10.' A useful rubric asks specific binary questions:

Does the answer directly address the user's question? (yes/no)
Is every factual claim in the answer grounded in the retrieved context? (yes/no)
Does the answer include information not present in the source documents? (yes/no; this is the hallucination check, so here 'yes' is a failure)
Is the tone appropriate for the stated persona? (yes/no)
Is the answer complete, or does it defer without reason? (yes/no)

Score each dimension independently. Aggregate to a composite. Set a per-dimension minimum, not just an average minimum. An answer that scores perfectly on four dimensions but fails the hallucination check should not pass by averaging.
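The difference between an average-only gate and a per-dimension gate can be sketched in a few lines. The dimension names and thresholds below are assumptions chosen for illustration. Each dimension is scored 1.0 when the answer passes that check and 0.0 when it fails, so the hallucination question is recorded as `no_unsupported_claims`.

```python
def composite(scores: dict[str, float]) -> float:
    """Unweighted mean of all dimension scores."""
    return sum(scores.values()) / len(scores)


def passes_average_only(scores: dict[str, float], min_composite: float) -> bool:
    """Gate on the composite alone."""
    return composite(scores) >= min_composite


def passes_with_floors(
    scores: dict[str, float], floors: dict[str, float], min_composite: float
) -> bool:
    """Every dimension with a floor must meet it, and the composite must also hold."""
    meets_floors = all(scores[dim] >= floor for dim, floor in floors.items())
    return meets_floors and composite(scores) >= min_composite


# Hypothetical judge verdicts: four passes and one hallucination failure.
scores = {
    "addresses_question": 1.0,
    "grounded": 1.0,
    "no_unsupported_claims": 0.0,
    "tone": 1.0,
    "complete": 1.0,
}
floors = {"grounded": 1.0, "no_unsupported_claims": 1.0}

print(passes_average_only(scores, min_composite=0.8))         # True
print(passes_with_floors(scores, floors, min_composite=0.8))  # False
```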

Use a different model as your judge than the one powering your agent. If your agent is Claude Sonnet, judge with GPT-4o. This reduces the risk that the judge shares a systematic blind spot with the agent and waves through its failure patterns.

One concrete example: on a customer support agent I built, the agent would occasionally answer a question about a return policy using a slightly outdated version of the policy retrieved from a stale chunk. String matching never caught it because the phrasing was plausible. A grounding check in the judge rubric caught it consistently because the specific policy dates were verifiable in the source.

Trajectory Assertions: Testing What the Agent Did, Not Just What It Said

The answer an agent returns is the last thing it does. For complex agents, the important failures happen in the middle: wrong tool called, wrong parameters passed, tool result misread, loop taken twice when once was correct, retrieval step skipped entirely.

Trajectory assertions inspect the agent's execution trace and assert on the sequence of operations. You need observability instrumentation for this. I use OpenTelemetry with an AI-aware span schema, or platform-native tracing if I am on LangSmith or Langfuse.

Example trajectory assertions for a research agent:

Tool call order: search_web must be called before synthesize_answer
Parameter integrity: search_web must receive a query derived from the user's input, not a hardcoded string
Retry behavior: if get_document returns a 404, the agent must not call it again with the same ID
Loop guard: total tool calls must be below a threshold (I use 20 as a default cap)
Handoff correctness: if the agent hands off to a sub-agent, the handoff payload must include the required fields

Trajectory assertions are deterministic checks even when outputs are not: the same trace always produces the same verdict. The agent may phrase the final answer differently each time, but it should almost always call the same tools in the same order for the same class of query. When it does not, that is a signal worth investigating.
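Three of the assertions listed above (tool call order, retry behavior, loop guard) can be sketched against a trace represented as a list of dictionaries. The trace shape, the `status` field, and the function names are assumptions made for this example. Real traces from OpenTelemetry, LangSmith, or Langfuse need to be mapped into whatever shape your harness uses.

```python
from typing import Any

# Hypothetical trace shape: {"tool": str, "args": dict, "status": int}
ToolCall = dict[str, Any]


def called_before(trace: list[ToolCall], first: str, second: str) -> bool:
    """True if `first` is called at least once before the first call to `second`."""
    names = [call["tool"] for call in trace]
    if second not in names:
        return False
    return first in names[: names.index(second)]


def no_retry_after_404(trace: list[ToolCall], tool: str) -> bool:
    """True if `tool` is never called again with args that already returned a 404."""
    failed_args = []
    for call in trace:
        if call["tool"] != tool:
            continue
        if call["args"] in failed_args:
            return False
        if call["status"] == 404:
            failed_args.append(call["args"])
    return True


def within_call_budget(trace: list[ToolCall], cap: int = 20) -> bool:
    """True if the total number of tool calls is strictly below the cap."""
    return len(trace) < cap


# Illustrative traces, invented for this sketch.
good_trace = [
    {"tool": "search_web", "args": {"query": "battery recycling rates"}, "status": 200},
    {"tool": "get_document", "args": {"id": "doc-7"}, "status": 404},
    {"tool": "get_document", "args": {"id": "doc-9"}, "status": 200},
    {"tool": "synthesize_answer", "args": {}, "status": 200},
]
bad_trace = [
    {"tool": "synthesize_answer", "args": {}, "status": 200},
    {"tool": "search_web", "args": {"query": "battery recycling rates"}, "status": 200},
    {"tool": "get_document", "args": {"id": "doc-7"}, "status": 404},
    {"tool": "get_document", "args": {"id": "doc-7"}, "status": 404},
]

print(called_before(good_trace, "search_web", "synthesize_answer"))  # True
print(called_before(bad_trace, "search_web", "synthesize_answer"))   # False
print(no_retry_after_404(bad_trace, "get_document"))                 # False
```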

Regression Baselines and CI Integration

A single eval run tells you the current score. A regression baseline tells you whether the score got worse. The workflow is simple: store the eval results from your last stable release, and fail the PR if any dimension drops by more than a threshold you set deliberately.

My default thresholds:

Judge composite score: fail if it drops by more than 0.05 (5 percentage points on a 0-1 scale)
Golden dataset pass rate: fail if it drops below 90%
Any individual dimension: fail if it drops below its per-dimension floor
Trajectory: fail if any mandatory tool-call assertion goes from passing to failing
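A gate applying these four thresholds might look like the sketch below. The snapshot layout (`judge_composite`, `pass_rate`, `dimensions`, `trajectory`) and all the numbers are assumptions for illustration. Adapt the keys to however your harness stores eval results.

```python
def regression_failures(
    baseline: dict,
    current: dict,
    floors: dict[str, float],
    max_composite_drop: float = 0.05,
    min_pass_rate: float = 0.90,
) -> list[str]:
    """Compare the current eval run to the stored baseline; return reasons to fail the PR."""
    failures = []
    drop = baseline["judge_composite"] - current["judge_composite"]
    if drop > max_composite_drop:
        failures.append(f"judge composite dropped by {drop:.3f}")
    if current["pass_rate"] < min_pass_rate:
        failures.append(f"pass rate {current['pass_rate']:.2f} below {min_pass_rate:.2f}")
    for dim, floor in floors.items():
        if current["dimensions"][dim] < floor:
            failures.append(f"dimension {dim} below floor {floor}")
    for name, passed_before in baseline["trajectory"].items():
        # Only flag assertions that went from passing to failing.
        if passed_before and not current["trajectory"].get(name, False):
            failures.append(f"trajectory assertion regressed: {name}")
    return failures


# Illustrative snapshots, invented for this sketch.
baseline = {
    "judge_composite": 0.90,
    "pass_rate": 0.96,
    "dimensions": {"grounded": 0.95, "on_topic": 0.93},
    "trajectory": {"order_status_calls_get_order_status": True},
}
current = {
    "judge_composite": 0.88,
    "pass_rate": 0.93,
    "dimensions": {"grounded": 0.82, "on_topic": 0.94},
    "trajectory": {"order_status_calls_get_order_status": False},
}
floors = {"grounded": 0.85, "on_topic": 0.85}

for reason in regression_failures(baseline, current, floors):
    print(reason)
```

In this example the composite and the pass rate both stay inside their thresholds, so an aggregate-only gate would let the change through. The grounding floor and the regressed trajectory assertion are what fail it.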

For CI, I run the eval suite as a step in the GitHub Actions pipeline on every PR that touches agent code, prompts, retrieval logic, or tool definitions. I skip it for pure infrastructure changes. Eval runs cost money, so I keep the golden dataset under 100 examples for the CI gate and run the full suite (300-500 examples) nightly.

Cost reality check: 100 examples with GPT-4o as judge at roughly 1k tokens per evaluation costs about $0.30 per CI run at current pricing. That is not a reason to skip testing. That is a rounding error next to the cost of shipping a broken agent.
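The arithmetic behind that estimate can be reproduced with a short calculation. The token split and the per-million-token prices below are assumptions, not quoted figures. Check your provider's current price list and your own token counts before relying on the result.

```python
# Assumed inputs; replace with your own measurements and current pricing.
EXAMPLES_PER_RUN = 100
INPUT_TOKENS_PER_EVAL = 950       # assumption: rubric + retrieved context + answer
OUTPUT_TOKENS_PER_EVAL = 50       # assumption: short yes/no verdicts
INPUT_PRICE_PER_MILLION = 2.50    # USD per 1M input tokens, assumed
OUTPUT_PRICE_PER_MILLION = 10.00  # USD per 1M output tokens, assumed


def cost_per_run(examples: int = EXAMPLES_PER_RUN) -> float:
    """Estimated judge cost in USD for one CI eval run."""
    per_eval = (
        INPUT_TOKENS_PER_EVAL * INPUT_PRICE_PER_MILLION
        + OUTPUT_TOKENS_PER_EVAL * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000
    return examples * per_eval


print(f"${cost_per_run():.2f} per CI run")  # $0.29 per CI run
```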

Guardrails, Security, and Human-in-the-Loop Testing

Behavioral testing is about quality. Guardrail testing is about safety and security. They are different and both are required before you go to production.

For every agent I build, I run a separate suite of adversarial tests:

Prompt injection: does the agent follow instructions embedded in retrieved documents or tool outputs? It should not.
Scope creep: does the agent perform actions outside its defined scope when a user asks it to? A customer support agent should not be able to initiate a refund just because a user typed 'please issue a full refund' forcefully.
Data exfiltration: does the agent leak system prompt content, internal document IDs, or other users' data when asked?
Loop exploitation: can a malicious input cause the agent to loop until it hits rate limits or costs the operator money?

Human-in-the-loop (HITL) testing deserves its own mention. If your agent takes irreversible actions (sends emails, places orders, modifies records), you need test cases that verify the confirmation step works. The agent must pause and request confirmation before crossing irreversible action thresholds. Test that the pause fires. Test that a 'cancel' at that step actually cancels. These are integration tests you run against a staging environment with mocked downstream systems.
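The two checks named above (the pause fires, and a cancel really cancels) can be sketched against a stand-in confirmation gate. Everything here is hypothetical: `run_action`, the `IRREVERSIBLE` set, and the callbacks stand in for your agent's real gate. An actual test would drive the agent in staging with mocked downstream systems.

```python
from typing import Callable

# Hypothetical set of actions that must never run without confirmation.
IRREVERSIBLE = {"send_email", "place_order", "modify_record"}


def run_action(
    action: str,
    execute: Callable[[str], None],
    confirm: Callable[[str], bool],
) -> str:
    """Stand-in gate: irreversible actions run only after explicit confirmation."""
    if action in IRREVERSIBLE and not confirm(action):
        return "cancelled"
    execute(action)
    return "executed"


def test_pause_fires_and_cancel_cancels():
    executed, prompts = [], []

    def deny(action: str) -> bool:
        prompts.append(action)  # record that the confirmation step fired
        return False

    result = run_action("send_email", executed.append, deny)
    assert prompts == ["send_email"]  # the pause fired
    assert result == "cancelled"
    assert executed == []             # nothing reached the downstream system


def test_confirmed_action_executes():
    executed = []
    result = run_action("send_email", executed.append, lambda action: True)
    assert result == "executed"
    assert executed == ["send_email"]


def test_reversible_action_skips_confirmation():
    executed, prompts = [], []

    def record(action: str) -> bool:
        prompts.append(action)
        return False

    assert run_action("lookup_order", executed.append, record) == "executed"
    assert prompts == []  # no confirmation requested for a reversible action
```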

Worked Example: Eval Pipeline for a Support Agent

Here is how I would set up testing for a customer support agent that answers questions about orders, products, and return policies using retrieval-augmented generation (RAG).

Step 1: Build the golden dataset. Take 50 real support queries from the product team. For each one, define the rubric: which knowledge base articles should be retrieved, whether a tool call is required, what the answer must and must not say.

Step 2: Add a judge prompt. Write a system prompt for your judge model that includes the rubric dimensions specific to support: grounded in retrieved context, does not invent policy details, does not promise things outside documented policy, matches the brand tone guide.

Step 3: Add trajectory assertions. For queries that require a tool call (like order status lookups), assert that the correct tool is called with the correct parameter type. Assert that the agent does not call a tool when the answer is available in retrieved context.

Step 4: Add adversarial cases. Add 10 adversarial queries: users trying to get the agent to reveal the system prompt, users asking the agent to bypass return policy, documents seeded with injection attempts.

Step 5: Baseline and gate. Run once to establish baseline. Set thresholds. Wire into CI. Run nightly with the full 200-example suite.

Total setup time for this pipeline on a greenfield agent: about two days of engineering work. The payoff is that you can iterate on prompts, retrieval chunking, model versions, and tool definitions with confidence that regressions surface immediately rather than in production.

Frequently Asked Questions

How do I test an AI agent when the output changes every time?

Use behavioral contracts instead of string assertions. Define what the answer must contain, must not contain, which tools must be called, and a minimum quality score from an LLM-as-judge. The exact phrasing can vary. The behavior should not. A passing test suite means all behavioral contracts are satisfied, not that the output is identical to a stored snapshot.

What is LLM-as-judge, and does it actually work?

LLM-as-judge uses a second language model to evaluate your agent's output against a rubric you define. It works well when the rubric is specific and binary (yes/no per dimension), when you use a different model as the judge than the one you are testing, and when you validate the judge's decisions against human ratings on at least a sample of your golden dataset. It breaks down when rubrics are vague or when you ask the judge to give holistic scores without criteria.

How many test cases do I need to test an AI agent properly?

Start with 30 to 50 golden dataset examples covering your core use cases and the failure modes you already know about. That is enough to build a meaningful regression baseline. Add 10 adversarial cases for safety testing. Scale to 200-300 examples once the agent is in production and you can seed from real queries. More examples are better, but 30 well-chosen examples beat 500 synthetic ones that all look the same.

Can I use pytest to test an AI agent?

Yes, pytest works as a test runner for AI agent evals. You write test functions that call your agent, collect the trace and output, run your judge, and assert that scores and trajectory constraints are satisfied. The assertions are not on strings; they are on the structured eval results. Plugins like pytest-asyncio handle async agent calls cleanly. The eval logic itself (judge prompts, rubrics, trace parsing) lives in your own harness or in a platform like Braintrust or LangSmith.
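A minimal sketch of what such a test function can look like is below. `fake_agent` is a hypothetical stand-in that returns a canned structured result. In a real suite it would call your agent and your judge. Saved in a file named like `test_agent.py`, the function is picked up by pytest's default discovery rules.

```python
def fake_agent(query: str) -> dict:
    """Hypothetical stand-in for an agent call plus judge scoring."""
    return {
        "answer": "Order #8821 is out for delivery.",
        "tools_called": ["get_order_status"],
        "judge_score": 0.91,
    }


def test_order_status_contract():
    result = fake_agent("What is the status of order #8821?")
    # Assertions target the structured eval result, not the exact wording.
    assert "get_order_status" in result["tools_called"]
    assert "send_email" not in result["tools_called"]
    assert "#8821" in result["answer"]
    assert result["judge_score"] >= 0.8
```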

How do I prevent AI agent regressions when I change the prompt?

Run your full golden dataset eval suite before and after the prompt change and compare scores dimension by dimension. A prompt change that improves the aggregate score but drops a specific dimension (like grounding or scope adherence) is still a regression in that dimension. Store eval results as artifacts in CI so you can diff any two runs. Never ship a prompt change without running evals first, even if the change looks trivially safe.

What tools do teams use to evaluate AI agents in production?

The most common options I see in production are: Braintrust (strong eval framework, good CI integration), LangSmith (native LangChain tracing plus evals), Langfuse (open source, good for self-hosted requirements), and Weave from Weights and Biases. For teams that want full control, a custom harness built on OpenTelemetry with eval results stored in a database is entirely viable and gives you the most flexibility. The tool matters less than having a rubric, a golden dataset, and a regression gate in CI.

Build AI Agents That Hold Up Under Testing

Testing non-deterministic agents is not harder than testing deterministic software. It is different. Once you shift from exact-match assertions to behavioral contracts, eval suites, and trajectory checks, you get a test suite that is actually informative: it tells you when quality drops, when the agent goes off-script, and when a prompt or model change introduced a regression you did not intend.

If you are building an agent and need the eval infrastructure done correctly from the start, or if you have an agent in production and need to know why it fails, I take on AI Agent Development engagements as a solo architect. No junior handoffs, no agency overhead. Get in touch at /contact and tell me what you are building.
