How to Test an LLM Application When Outputs Are Not Deterministic
You test an LLM application by building a curated dataset of representative inputs, defining a scoring function for each behavior you care about, and running those scored checks automatically on every prompt change. Determinism is a red herring: you do not need identical outputs, you need outputs that reliably satisfy your requirements.
I am Mahmoud Zalt, an independent AI systems architect who has been building production software since 2010. As the founder of Sista AI, I spend my days holding a workforce of autonomous agents to a quality bar in production, and eval-driven development is the discipline that makes that possible. I also consult with engineering teams on designing, building, and evaluating LLM systems that hold up under production load. This article explains the eval-driven development approach I use on every project. If your prompt changes still feel like coin flips, read on.
Evals Are the Unit Tests of AI Systems
In traditional software, a unit test checks that a function returns the correct value for a given input. In an LLM system, 'correct' is often a range: the response must be factually accurate, stay in scope, follow the tone guide, and not hallucinate a company name. That is four distinct behaviors, each needing its own check.
The analogy holds in practice. A good eval suite has:
- Coverage: inputs that represent every user intent your system is supposed to handle.
- Regression protection: a check that a prompt tweak did not silently break a previously passing case.
- Fast feedback: results in minutes, not days, so developers ship with confidence.
Where teams go wrong is treating evals as a one-time QA gate rather than a living artifact. The eval dataset should grow every time a production failure is reported. Think of it the same way you treat a bug fix: the fix is the code change, the regression test is the new eval case.
What teams get wrong first
Most teams start by measuring accuracy on a benchmark dataset they downloaded from somewhere. That dataset does not represent their users. Benchmark scores give you a false sense of coverage while your actual failure modes go undetected. Build your own dataset from real traffic first, then supplement with synthetic cases for edge conditions.
Building a Representative Eval Dataset
Your eval dataset is the foundation. If it does not represent your real distribution of inputs, every score you compute is measuring the wrong thing.
Step 1: Start with real traffic
Pull 200 to 500 real queries from your logs within the first two weeks of any new feature. Cluster them by intent (classification, extraction, summarization, generation, refusal). Sample proportionally so each intent cluster is represented. If you have no traffic yet, generate synthetic cases from your product spec, then have a human review them for plausibility.
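The proportional sampling in Step 1 can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `(text, intent)` pair format and the `sample_by_intent` name are assumptions, and the intent labels are presumed to come from an earlier clustering or manual labeling pass.

```python
import random
from collections import defaultdict


def sample_by_intent(queries, total, seed=0):
    """Sample roughly `total` queries while keeping each intent's traffic share.

    `queries` is a list of (text, intent) pairs (a hypothetical format).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_intent = defaultdict(list)
    for text, intent in queries:
        by_intent[intent].append(text)

    sample = []
    for intent, texts in sorted(by_intent.items()):
        # Proportional share, but never fewer than one case per intent,
        # so rare intents are not dropped from the dataset.
        share = max(1, round(total * len(texts) / len(queries)))
        picked = rng.sample(texts, min(share, len(texts)))
        sample.extend((text, intent) for text in picked)
    return sample
```

Because every intent gets at least one case, the result can slightly exceed `total` when there are many rare intents.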
Step 2: Add adversarial and edge cases
After covering the happy path, add:
- Queries that should trigger a refusal (off-topic, harmful, out-of-scope).
- Ambiguous queries where the correct behavior is to ask a clarifying question.
- Long inputs near the context limit.
- Inputs in languages your system was not explicitly tuned for.
- Inputs that previously caused production failures (your regression suite).
Step 3: Attach expected behavior, not expected output
This is the most important shift. Instead of storing the 'correct answer,' store a behavioral assertion: 'the response must mention the refund window,' 'the response must not include the competitor name,' 'the response must be under 100 words.' Assertions compose, scale, and survive prompt rewrites in a way that exact-match expectations do not.
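A minimal sketch of what such a case might look like in code, assuming plain Python callables as assertions. The helper names, the "30 days" refund window, and the competitor name "AcmePay" are all hypothetical placeholders.

```python
def must_mention(phrase):
    """Assertion: the response contains `phrase` (case-insensitive)."""
    return lambda response: phrase.lower() in response.lower()


def must_not_mention(phrase):
    """Assertion: the response does not contain `phrase` (case-insensitive)."""
    return lambda response: phrase.lower() not in response.lower()


def max_words(limit):
    """Assertion: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


# One eval case: an input plus named behavioral assertions, no "golden" output.
EVAL_CASE = {
    "input": "Can I still get my money back?",
    "assertions": {
        "mentions refund window": must_mention("30 days"),
        "omits competitor name": must_not_mention("AcmePay"),
        "under 100 words": max_words(100),
    },
}


def run_assertions(case, response):
    """Return a dict of assertion name -> True/False for one model response."""
    return {name: check(response) for name, check in case["assertions"].items()}
```

Calling `run_assertions(EVAL_CASE, response)` on any candidate response yields per-behavior results, so a prompt rewrite that changes the wording but keeps the behavior still passes.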
Dataset size guidance
| System complexity | Minimum dataset size | Coverage goal |
|---|---|---|
| Single-task (e.g. classifier) | 150 to 300 cases | All label classes, class-balanced |
| Multi-turn chat assistant | 400 to 800 cases | All intent clusters, adversarial 15%+ |
| Agentic / tool-calling system | 300 to 600 cases | Each tool path, multi-hop chains, error recovery |
Choosing the Right Scoring Method
Not every behavior needs the same kind of check. Using a single scoring method across all evals is the second most common mistake I see. Here is the decision tree I use.
Exact match
Use when the output is structured and there is objectively one correct answer: JSON field values, extracted entities, classification labels, SQL snippets, yes/no responses. Fast, cheap, deterministic. Implement it in 10 lines of Python. If you are extracting a date from a document, exact match is the right tool and using anything heavier is waste.
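For illustration, an exact-match scorer can be about this small. The whitespace trimming is an assumption about what counts as "the same" answer; drop it if surrounding whitespace matters in your output format.

```python
def exact_match(expected, actual):
    """True when the two values are equal after trimming whitespace on strings."""
    def normalize(value):
        return value.strip() if isinstance(value, str) else value
    return normalize(expected) == normalize(actual)


def exact_match_rate(cases):
    """`cases` is a list of (expected, actual) pairs; returns the fraction that match."""
    if not cases:
        return 0.0
    return sum(exact_match(expected, actual) for expected, actual in cases) / len(cases)
```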
Regex and rule-based checks
Use for format compliance: 'response starts with a capital letter,' 'response does not contain the string [INST],' 'response is valid JSON,' 'response length is between 50 and 200 tokens.' Layer these on top of other checks. They catch regressions that an overly lenient LLM judge lets through.
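The rules quoted above could be sketched with the standard library alone. Note one simplification, labeled as an assumption: the length check counts whitespace-separated words as a stand-in for tokens, since real token counts depend on the model's tokenizer.

```python
import json
import re


def is_valid_json(response):
    """True when the response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def starts_with_capital(response):
    """True when the first character is an ASCII capital letter."""
    return bool(re.match(r"[A-Z]", response))


def has_no_template_leak(response):
    """True when the prompt-template marker [INST] did not leak into the output."""
    return "[INST]" not in response


def length_in_range(response, low=50, high=200):
    """Assumption: words approximate tokens; swap in a tokenizer for exact counts."""
    return low <= len(response.split()) <= high


def failed_rules(response, rules):
    """Return the names of the rules in `rules` (name -> callable) that fail."""
    return [name for name, rule in rules.items() if not rule(response)]
```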
Embedding similarity
Use for semantic equivalence when paraphrase is acceptable. Embed the output and the reference answer, compute cosine similarity, threshold at 0.85 or higher. Useful for QA systems where 'The window is 30 days' and 'You have 30 days to request a refund' should both pass. Not useful when the exact phrasing carries legal or brand meaning.
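A minimal sketch of the comparison step, assuming the output and the reference have already been turned into vectors by whatever embedding model you use (that call is deliberately left out, since it depends on your provider).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantically_equivalent(output_vec, reference_vec, threshold=0.85):
    """Pass when the similarity meets the threshold (0.85 is the value from the text)."""
    return cosine_similarity(output_vec, reference_vec) >= threshold
```

The right threshold varies by embedding model, so it is worth checking it against a handful of pairs you have labeled by hand before trusting it.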
LLM-as-judge
Use for subjective qualities: tone, coherence, helpfulness, groundedness, instruction-following on open-ended tasks. Have a separate model (often a stronger one than the one you are testing) score the output on a 1-to-5 scale or as pass/fail with a rubric. The rubric is the critical part. A prompt that says 'rate this response' gives you noise. A prompt that says 'rate whether this response answers the user question using only information from the provided context, where 1 = fabricated facts and 5 = fully grounded' gives you signal.
A minimal LLM-as-judge prompt structure I use:
You are an evaluator. Given the user query, the retrieved context, and the assistant response, score the response on GROUNDEDNESS from 1 to 5.
Rubric:
1 - Response contains facts not present in the context.
2 - Response mostly fabricated with some grounded elements.
3 - Response mixes grounded and fabricated elements.
4 - Response is mostly grounded with minor unsupported details.
5 - Every claim in the response is directly supported by the context.
User query: {query}
Context: {context}
Response: {response}
Return JSON: {"score": <integer 1-5>, "reason": "<short explanation>"}
Human evaluation
Use for calibrating your LLM judge and for high-stakes decisions. Run human evals on a 10 to 15 percent sample monthly. Use the human scores to audit judge agreement. If your judge agrees with humans less than 80 percent of the time, the rubric needs refinement or a different judge model. Human evals are not a replacement for automated evals: they are the ground truth that keeps your automated pipeline honest.
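The judge audit can be as simple as a raw agreement rate over the human-reviewed sample. The sketch below assumes both the judge and the humans produce a comparable label per case (pass/fail here); the function names are illustrative.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def judge_needs_work(judge_labels, human_labels, min_agreement=0.80):
    """True when agreement falls below the 80 percent bar described above."""
    return agreement_rate(judge_labels, human_labels) < min_agreement
```

Raw agreement can look flattering when almost every case passes; a chance-corrected statistic such as Cohen's kappa is a reasonable next step if your labels are heavily imbalanced.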
Scoring method selection table
| Behavior to test | Method | Cost |
|---|---|---|
| Structured output correctness | Exact match / JSON schema | Near zero |
| Format and safety constraints | Regex / rule checks | Near zero |
| Semantic equivalence | Embedding similarity | Low |
| Groundedness, tone, helpfulness | LLM-as-judge | Medium |
| Judge calibration, high-stakes | Human review | High |
Running Evals in CI: Stop Treating Prompt Changes as Coin Flips
The goal is to make a prompt change feel like a code change: reviewable, testable, and reversible. Here is the pipeline I set up for teams.
The eval CI loop
- Store prompts as versioned artifacts. Every prompt template lives in source control alongside the code that calls it. A prompt change is a PR. This alone eliminates most 'we changed the prompt and something broke in prod' incidents.
- Run evals on every PR. On each pull request that modifies a prompt, the CI job runs the eval suite against the candidate prompt. It reports pass rate, score distribution, and a diff against the baseline (the current main branch scores). Fail the PR if pass rate drops more than 2 percentage points.
- Cache model responses. For speed and cost control, cache model responses for eval inputs that have not changed. A cold run of 500 cases at GPT-4o pricing costs roughly $0.50 to $2.00, depending on input and output length. With caching, repeat runs on unchanged cases cost near zero.
- Track metrics over time. Log every eval run to a time-series store (even a simple SQLite table or a Weights and Biases project). Visualize pass rate, average judge score, and latency percentiles (p50/p95). Regressions that do not break the CI threshold show up as gradual drift that you catch in weekly reviews.
- Gate on regressions, not perfection. Do not set your threshold at 100 percent pass rate. LLM systems have irreducible variance. Set the gate at your current baseline minus an acceptable tolerance (usually 2 to 5 percent). The goal is catching regressions, not chasing a score.
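The gating step in the loop above might look like the following sketch. It assumes each eval run has been reduced to a list of booleans (one per case) and that the baseline results from the main branch are available to the CI job; both are assumptions about your setup.

```python
def pass_rate(results):
    """Pass rate in percent for a list of per-case booleans."""
    return 100.0 * sum(results) / len(results)


def gate(baseline_results, candidate_results, tolerance_points=2.0):
    """Compare candidate against baseline; fail when the drop exceeds the tolerance."""
    baseline = pass_rate(baseline_results)
    candidate = pass_rate(candidate_results)
    drop = baseline - candidate  # positive means the candidate is worse
    return {
        "baseline": baseline,
        "candidate": candidate,
        "drop": drop,
        "passed": drop <= tolerance_points,
    }
```

In a CI job, the script would exit with a non-zero status when `gate(...)["passed"]` is false, which is what blocks the pull request.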
Worked example: a support chatbot
A team I worked with had a support chatbot that handled billing queries. They had a 340-case eval dataset covering six intent clusters. Their CI pipeline ran on every prompt PR and reported three metrics: intent accuracy (exact match on extracted intent), groundedness score (LLM-as-judge, 1-5), and refusal rate on out-of-scope queries (rule check). When a developer rewrote the system prompt to sound 'more friendly,' the CI run showed intent accuracy dropped from 94 percent to 87 percent on the billing-dispute cluster. The PR was declined in review, so the regression was caught before merge rather than surfacing through a customer complaint three days later.
Evaluating Agentic Systems and Tool-Calling Pipelines
Single-turn evals are straightforward. Agentic systems are harder because the failure can happen at any step in a multi-hop chain, and the final output can look correct even when the path was wrong.
What to evaluate in an agentic system
- Tool selection accuracy: did the agent call the right tool for the user intent?
- Tool argument correctness: were the arguments passed to the tool valid and appropriate?
- Step count efficiency: did the agent complete the task in a reasonable number of steps, or did it loop?
- Trajectory correctness: does the sequence of tool calls match a reference trajectory for that task?
- Final answer quality: did the agent produce the right final output regardless of path?
For MCP-based systems (Model Context Protocol), I evaluate at the protocol boundary: log every tool call and response, replay the trace in evals, and assert on both the call sequence and the final synthesis. This gives you coverage at the integration point where most real failures occur.
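As a sketch of asserting on a logged trace, assume each trace is a list of dicts with `tool` and `args` keys. That trace format and the tool names in the usage note are hypothetical, not part of any protocol.

```python
def tool_names(trace):
    """Ordered list of tool names called in a trace."""
    return [call["tool"] for call in trace]


def trajectory_matches(trace, reference):
    """True when the agent called exactly the reference tools in the same order."""
    return tool_names(trace) == tool_names(reference)


def tool_selection_accuracy(trace, reference):
    """Fraction of reference steps where the same tool was called at the same position."""
    if not reference:
        return 1.0 if not trace else 0.0
    hits = sum(a == b for a, b in zip(tool_names(trace), tool_names(reference)))
    return hits / len(reference)


def arguments_match(trace, reference):
    """True when every call also carries the reference arguments, step by step."""
    return trajectory_matches(trace, reference) and all(
        call["args"] == ref["args"] for call, ref in zip(trace, reference)
    )


def within_step_budget(trace, max_steps):
    """True when the agent finished in at most `max_steps` tool calls (no looping)."""
    return len(trace) <= max_steps
```

For example, a reference of `lookup_invoice` then `issue_refund` against a trace that calls `lookup_invoice` twice before `issue_refund` fails the strict trajectory check while still passing a step budget of three. Strict position matching is one choice among several; tasks with multiple valid paths may need a set of acceptable reference trajectories.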
The stubbed environment pattern
Do not run evals against live APIs or databases. Stub every external tool with deterministic responses for eval runs. This makes evals fast, free, and reproducible. The real integration is tested in a separate integration test suite with actual calls, run less frequently against a staging environment.
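One way to sketch the pattern is a small callable that returns canned responses and records every call. `StubTool`, `args_key`, and the invoice example are illustrative names; how the stub gets wired into your agent depends on your framework.

```python
def args_key(**kwargs):
    """Build a hashable lookup key from keyword arguments (values must be hashable)."""
    return tuple(sorted(kwargs.items()))


class StubTool:
    """Stands in for an external tool: deterministic responses, recorded calls."""

    def __init__(self, name, responses, default=None):
        self.name = name
        self.responses = responses  # maps args_key(...) -> canned response
        self.default = default      # returned for arguments with no canned response
        self.calls = []             # every call is logged for later assertions

    def __call__(self, **kwargs):
        self.calls.append(kwargs)
        key = args_key(**kwargs)
        if key in self.responses:
            return self.responses[key]
        if self.default is not None:
            return self.default
        raise KeyError(f"{self.name}: no stubbed response for {kwargs}")
```

Because the stub logs its calls, the same object supports both halves of an agent eval: it feeds the agent deterministic data and lets the eval assert on which arguments the agent actually passed.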
RAG-Specific Evals: Testing Retrieval Separately from Generation
Retrieval-augmented generation has two independently failing components. Most teams only measure the end-to-end answer quality and cannot diagnose whether a failure is a retrieval problem or a generation problem. Separate the two.
Retrieval evals
For each eval query, store the set of document chunks that contain the correct answer. Measure:
- Recall@k: does the correct chunk appear in the top-k retrieved results? Use k values of 3, 5, and 10.
- Mean Reciprocal Rank (MRR): how high in the ranked list does the correct chunk appear?
- Context precision: what fraction of the retrieved chunks are actually relevant? High recall with low precision means the LLM is drowning in irrelevant context.
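The three retrieval metrics can be sketched per query as below, assuming chunks are identified by string ids and `relevant` is the stored set of correct chunk ids. Recall@k is implemented here in the hit-style sense used above (did any correct chunk make the top k), which is one of several definitions in use.

```python
def recall_at_k(retrieved, relevant, k):
    """1.0 when at least one relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(chunk in relevant for chunk in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 when none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def context_precision(retrieved, relevant):
    """Fraction of the retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)


def mean(values):
    """Average over queries; MRR is mean(reciprocal_rank(...) for each query)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```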
Generation evals
Given the retrieved context (fixed for the eval run, not re-retrieved), measure:
- Groundedness: every claim in the response is supported by the provided context (LLM-as-judge).
- Answer completeness: the response addresses all parts of the user query (LLM-as-judge).
- Faithfulness: the response does not contradict any statement in the context.
When you split these, diagnosis becomes fast. Groundedness failures with good retrieval scores point to the generation prompt. Recall failures with good generation scores point to the embedding model or chunking strategy. Chasing a single end-to-end metric hides both.
Observability, Guardrails, and Human-in-the-Loop
Evals run before deployment. Observability catches what gets through. Both are required in a production AI system.
What to log in production
- Every request and response (with retention policy for privacy compliance).
- Latency, token count, cost per call.
- Any guardrail trigger (input moderation, output filter, refusal).
- User feedback signals (thumbs up/down, explicit corrections, session abandonment).
Guardrails as assertions at runtime
Guardrails are your eval assertions running live. An input guardrail checks for prompt injection, off-topic queries, or PII before the model sees the input. An output guardrail checks the response for hallucinated entities, policy violations, or toxic content before it reaches the user. Libraries like Guardrails AI and NeMo Guardrails provide the scaffolding. The rules inside them should mirror your eval rubrics so you are testing the same behaviors you are enforcing.
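To illustrate the idea of reusing one rule set offline and at runtime, here is a plain-Python sketch. It deliberately does not use the APIs of the libraries named above; the email pattern is a rough PII stand-in, and the banned term "AcmePay" and the fallback message are hypothetical.

```python
import re

# Rough pattern for email addresses; an assumption, not a complete PII detector.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_output(response, banned_terms=("AcmePay",)):
    """Return the list of violated rules; an empty list means the response may be sent."""
    violations = []
    if EMAIL_PATTERN.search(response):
        violations.append("contains_email_address")
    for term in banned_terms:
        if term.lower() in response.lower():
            violations.append(f"mentions_banned_term:{term}")
    return violations


def guarded_reply(response, fallback="Sorry, I can't help with that here."):
    """Runtime guardrail: send the response only when no rule is violated."""
    violations = check_output(response)
    if violations:
        return fallback, violations
    return response, violations
```

The same `check_output` function can be called from the eval suite, so a rule change shows up in CI before it changes what users see.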
Human-in-the-loop design
For high-stakes actions (sending an email on behalf of a user, executing a financial transaction, publishing content), always require explicit human confirmation before the action runs. The confirmation UI should show what the agent is about to do in plain language, not 'confirm action.' This is not a fallback for when the system fails: it is a first-class part of the system design for anything with irreversible consequences.
Closing the feedback loop
Production failures and low-confidence outputs should feed back into your eval dataset automatically. When a user clicks 'that answer is wrong,' the input goes into a triage queue. A human reviews it weekly, labels the expected behavior, and it enters the eval suite. Your dataset gets stronger with every production failure rather than accumulating silent debt.
Cost Discipline in Eval Pipelines
Eval pipelines have a cost problem if you design them naively. Running 500 cases through GPT-4o for every PR is fine. Running 5,000 cases with a chain of three LLM calls each is not sustainable at $0.015 per 1k output tokens.
The approach I use is tiered evals. Cheap checks run on every commit: exact match, regex, embedding similarity. These cost near zero and catch most regressions. Medium-cost checks (LLM-as-judge with a fast model like GPT-4o-mini or Haiku) run on every PR. Expensive checks (stronger judge model, human sampling) run nightly or weekly. You get fast feedback on the common case and thorough coverage on a schedule.
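The tiering could be expressed as a small lookup from CI trigger to the checks that run. The trigger names and check names are placeholders, and the sketch assumes tiers are cumulative (a PR run also includes the cheap checks), which the description above does not spell out.

```python
CHEAP = ["exact_match", "regex", "embedding_similarity"]
MEDIUM = ["fast_llm_judge"]
EXPENSIVE = ["strong_llm_judge", "human_sample"]


def checks_for(trigger):
    """Return the checks to run for a CI trigger (hypothetical trigger names)."""
    if trigger == "commit":
        return list(CHEAP)
    if trigger == "pull_request":
        return CHEAP + MEDIUM
    if trigger == "scheduled":  # nightly or weekly
        return CHEAP + MEDIUM + EXPENSIVE
    raise ValueError(f"unknown trigger: {trigger}")
```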
A practical cost target: a full eval run on a 500-case suite should cost under $5 and complete in under 10 minutes. If it costs more or takes longer, the pipeline will get skipped or disabled under deadline pressure. Design for sustainability from the start.
Model selection for judges
The judge model should be stronger than, or at least as strong as, the model under test. Using GPT-4o-mini to judge GPT-4o outputs introduces systematic blind spots. For most production systems I use Claude Sonnet or GPT-4o as the judge when the tested model is anything smaller. Reserve the strongest available model (for example Opus or o3) for calibration runs and high-stakes audits, not for every CI eval.
Frequently Asked Questions
How many eval cases do I need before my evals are meaningful?
A set of 150 cases covering all your intent clusters gives you a meaningful signal for a focused single-task system. Under 50 cases, your pass rate is too noisy to detect real regressions: two failing cases are a 4 percent drop across a 50-case suite but a 40 percent drop inside a five-case intent cluster. Start with 150, grow to 500 over the first quarter, and prioritize coverage over volume. Ten cases per intent cluster is a reasonable floor.
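To put a rough number on that noise, one option is the binomial standard error of a pass rate. This is a simplifying model that treats cases as independent pass/fail trials, which real eval cases only approximate.

```python
import math


def pass_rate_standard_error(pass_rate, n_cases):
    """Binomial standard error of a pass rate (as a fraction) measured on n_cases."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_cases)
```

At a true pass rate of 90 percent, this gives a standard error of about 4.2 percentage points with 50 cases, about 2.4 with 150, and about 1.3 with 500, which is why a 2-point regression gate is hard to trust on a very small suite.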
Can I use the same model to both generate outputs and judge them?
You can, but it introduces a systematic bias: models tend to rate their own outputs more favorably than a different model would. For daily CI evals where speed matters, same-family models with slightly different judge prompts are acceptable. For calibration runs and any eval result you report externally, use a different model family as the judge. The disagreement rate between same-model and cross-model judges is usually 10 to 20 percent on subjective criteria.
How do I handle non-determinism when I need reproducible eval results?
Set temperature to 0 for eval runs. This does not guarantee identical outputs across model versions or API updates, but it minimizes run-to-run variance within a version. Store both the model version and the full model response for every eval run so you can reproduce and audit past results. When the model provider upgrades the underlying model (which happens silently on some APIs), your stored responses let you detect the shift.
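A minimal sketch of the stored record and a drift check between two runs. The field names and the `EvalRecord` class are assumptions; the actual model call is omitted because it depends on your provider.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRecord:
    case_id: str
    prompt_version: str
    model_version: str  # the version string reported by the API, stored verbatim
    temperature: float
    response: str       # the full model response, kept for audits

    def fingerprint(self):
        """Stable hash of the response text, used for cheap comparisons."""
        return hashlib.sha256(self.response.encode("utf-8")).hexdigest()


def changed_cases(old_records, new_records):
    """Case ids whose response differs between two runs of the same suite."""
    old = {record.case_id: record.fingerprint() for record in old_records}
    return sorted(
        record.case_id
        for record in new_records
        if record.case_id in old and old[record.case_id] != record.fingerprint()
    )


def to_jsonl(records):
    """Serialize records as JSON Lines for an append-only run log."""
    return "\n".join(json.dumps(asdict(record), sort_keys=True) for record in records)
```

If `changed_cases` returns a long list for two runs with the same prompt version, that is a hint the underlying model changed, though some variation can occur even at temperature 0.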
What is the difference between evals and monitoring, and do I need both?
Evals are offline checks you run before a change reaches production. Monitoring is online observation of the live system after deployment. You need both. Evals catch regressions before users see them. Monitoring catches distribution shifts, new failure modes, and edge cases your eval dataset did not cover. They are complementary, not alternatives. A team that has great evals but no monitoring is flying blind after deployment. A team with great monitoring but no evals is discovering problems reactively instead of preventing them.
How do I write a rubric for an LLM judge that actually produces consistent scores?
The rubric must define every point on the scale with a concrete example or a precise criterion, not an adjective. 'Good' is not a criterion. 'The response answers every sub-question the user asked, with no fabricated details' is a criterion. Include one or two few-shot examples in the judge prompt showing a score-1 response and a score-5 response for your specific task. Measure the judge's self-consistency on a sample of 50 cases: if two runs of the same judge prompt on the same inputs disagree more than 15 percent of the time, the rubric is ambiguous. Tighten it before using it at scale.
When should I stop tuning and accept the eval scores I have?
Stop tuning when marginal prompt changes produce less than one percentage point of improvement on your eval suite and the current scores meet your product threshold. Over-optimizing evals is a real risk: you can fit your prompt to the eval dataset the same way a model can overfit to a training set. Treat your eval suite as a test set, not a training signal. If you are making changes specifically because they improve eval scores rather than because they improve real behavior, you are overfitting. The right signal is user satisfaction in production, with evals as the leading indicator.
Ready to Build an Eval Pipeline That Actually Catches Regressions?
Eval-driven development is not a research practice reserved for teams with ML engineers on staff. It is an engineering discipline any team shipping an LLM feature needs from week one. The teams that skip it spend their sprints firefighting production surprises instead of shipping. The teams that invest in it early move faster, not slower, because they can change prompts and models with confidence.
I work directly with founding teams and engineering leads as an independent AI consultant to design eval frameworks, set up CI pipelines for LLM systems, and build the observability layer that keeps production systems trustworthy. If your team is shipping AI features and prompt changes still feel like gambling, I can help you fix that. See more about how I work on my about page, and the projects I have shipped on my projects page.
Work with me as your AI systems architect, or reach out directly through the contact page to talk through your current eval setup.