How to Evaluate an AI Vendor Quote (and Spot the Padding)

How to Tell if an AI Agency Quote is Legit

A legitimate AI proposal prices discrete, verifiable deliverables: evals, a retrieval pipeline, a defined tool surface, guardrails, and a handoff plan. If the quote is heavy on 'agentic orchestration,' 'multi-model routing,' and 'AI transformation' but light on acceptance criteria and post-launch costs, you are looking at padding.

I am Mahmoud Zalt, an independent senior AI systems architect with 16 years of production software behind me since 2010. Running Sista AI, the company I founded, has had me building and pricing autonomous agents in production for a year, so I can read a vendor quote and see what it really costs to deliver. I do this work solo, not as an agency, which means I have no incentive to inflate scope. If you want a straight read on a proposal before you sign, I offer that as part of my AI consultancy and strategy work. The rest of this article is the framework I use.

The Anatomy of a Padded Six-Figure Proposal

I have reviewed dozens of AI proposals, and the padding concentrates in four places. Let me walk through a composite example: a real-estate company that received a $160,000 proposal for an 'AI-powered property assistant.' Here is a stripped-down version of what the line items looked like.

Line item	Quoted	Reality check
Discovery and architecture design	$22,000 / 4 weeks	Legitimate if it produces an ADR and a test plan. Padding if it is a slide deck.
Multi-agent orchestration layer	$45,000 / 6 weeks	Almost always unnecessary at this scale. A single LLM with well-scoped tools beats three agents 90% of the time.
Vector database setup and RAG pipeline	$28,000 / 3 weeks	Reasonable if the corpus is large and dirty. $28k for clean, small data is 3x the fair rate.
UI integration	$18,000 / 2 weeks	Legitimate line item. Often underquoted, not overquoted.
Testing and QA	$12,000 / 2 weeks	No mention of evals. This is unit tests on wrapper code, not actual model quality measurement.
Deployment and 'hypercare'	$35,000 / 3 weeks	Vague. Legitimate if it includes observability setup, cost dashboards, and runbooks. Padding if it is 'we will watch it.'
Ongoing retainer (optional)	$6,000/month	No SLA, no scope. Walk away from month-to-month retainers with no defined deliverables.

Total: $160,000 in build plus an open-ended retainer. The multi-agent line alone is $45,000 for a capability the system did not need.
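The totals are easy to check yourself. Here is a minimal sketch that sums the quoted figures from the table and shows each item's share of the build; the variable names are mine and purely illustrative.

```python
# Quoted build line items from the table above: (name, price in USD, weeks).
line_items = [
    ("Discovery and architecture design", 22_000, 4),
    ("Multi-agent orchestration layer", 45_000, 6),
    ("Vector database setup and RAG pipeline", 28_000, 3),
    ("UI integration", 18_000, 2),
    ("Testing and QA", 12_000, 2),
    ("Deployment and hypercare", 35_000, 3),
]

build_total = sum(price for _, price, _ in line_items)
total_weeks = sum(weeks for _, _, weeks in line_items)

for name, price, weeks in line_items:
    share = price / build_total          # fraction of the build budget
    weekly = price / weeks               # implied weekly burn for this item
    print(f"{name:<40} ${price:>7,}  {share:6.1%}  ${weekly:>9,.0f}/week")

print(f"Build total: ${build_total:,} over {total_weeks} quoted weeks")
```

Running the same breakdown on any proposal makes it obvious which single line item carries the most budget, and that is the first one to question.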

The Multi-Agent Inflation Trap

Multi-agent architecture is the single most reliable signal of an inflated proposal in 2024 and 2025. Vendors sell it because it sounds sophisticated, justifies weeks of orchestration work, and is hard for a non-technical buyer to challenge.

Here is the honest rule: use multiple agents when you have genuinely parallel, independent tasks that cannot share a context window without degrading performance. A customer-service bot does not meet that bar. A document-intake system that processes 50,000 PDFs concurrently might.

What legitimate multi-agent work looks like

Parallel subagent calls with a defined aggregation step (fan-out, fan-in).
Separate agents for separate domains with incompatible system prompts (a legal-review agent and a tone-rewrite agent should not share a context).
Human-in-the-loop checkpoints between agent handoffs, with explicit approval gates.
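To make the fan-out, fan-in pattern concrete, here is a minimal sketch using Python's asyncio. The `review_document` function is a hypothetical stand-in for one independent subagent call, and the flagging rule is made up for illustration; a real system would call a model there.

```python
import asyncio


async def review_document(doc_id: str) -> dict:
    """Hypothetical subagent: stands in for one independent model call."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    return {"doc_id": doc_id, "flagged": doc_id.endswith("7")}


async def fan_out_fan_in(doc_ids: list[str], max_parallel: int = 5) -> dict:
    semaphore = asyncio.Semaphore(max_parallel)  # cap concurrent calls

    async def bounded(doc_id: str) -> dict:
        async with semaphore:
            return await review_document(doc_id)

    # Fan-out: each subagent runs independently on its own input.
    results = await asyncio.gather(*(bounded(d) for d in doc_ids))

    # Fan-in: one explicit aggregation step that a human can inspect.
    flagged = [r["doc_id"] for r in results if r["flagged"]]
    return {"processed": len(results), "flagged": flagged}


summary = asyncio.run(fan_out_fan_in([f"doc-{i}" for i in range(10)]))
print(summary)
```

Note that nothing here needs an agent framework. If a proposal's 'orchestration layer' reduces to this shape, it should be priced like this shape.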

What architecture theatre looks like

An 'orchestrator agent' that calls a 'retrieval agent' that calls a 'response agent.' That is a pipeline with extra API calls and extra failure modes.
Agent A and Agent B both have access to the same tool set. If they share tools, there is no reason to split them.
The proposal says 'LangGraph' or 'AutoGen' without specifying why the simpler alternative was ruled out.

If the proposal cannot explain what breaks if the system had a single agent instead of three, the complexity is decorative.

What Real Evals Cost (and Why They Are Always Missing)

This is the most important section for a buyer to read. Evals are the mechanism by which you know whether the AI system is working. They are almost never in a first-draft proposal because they are unglamorous, they require domain knowledge from your team, and they expose whether the vendor can actually define 'good.'

A legitimate eval framework for a production RAG system has three layers.

Retrieval quality. Precision@K, recall@K, mean reciprocal rank. You need a labeled question set (100 to 500 queries is realistic) and a measurement harness. Budget: 1 to 2 weeks of engineering plus time from a domain expert on your team.
Generation quality. Faithfulness (does the answer contradict the retrieved context?), answer relevance, and citation accuracy. Tools like RAGAS or a custom LLM-as-judge prompt can automate this. Budget: 1 week to set up, ongoing cost of running the judge model (typically $50 to $200/month for a medium corpus).
Regression testing. A fixed golden set of 50 to 100 (question, expected answer) pairs that runs on every deployment. If the vendor does not include this, you have no way to know whether a model upgrade breaks your system. Budget: half a week to build, near-zero to run.
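The retrieval metrics in the first layer are small enough to write by hand. This is a minimal sketch under the standard definitions of precision@K, recall@K, and mean reciprocal rank; the two labeled queries are invented toy data, not results from a real project.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Toy labeled set: (ranked retrieval output, documents a domain expert marked relevant).
labeled_queries = [
    (["d1", "d4", "d2"], {"d1", "d2"}),
    (["d9", "d3", "d5"], {"d3"}),
]

K = 3
n = len(labeled_queries)
mean_precision = sum(precision_at_k(r, rel, K) for r, rel in labeled_queries) / n
mean_recall = sum(recall_at_k(r, rel, K) for r, rel in labeled_queries) / n
mrr = sum(reciprocal_rank(r, rel) for r, rel in labeled_queries) / n

print(f"precision@{K}={mean_precision:.2f}  recall@{K}={mean_recall:.2f}  MRR={mrr:.2f}")
```

The code is the cheap part. The 1 to 2 weeks of budget goes into building the labeled question set with your domain expert.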

Total fair cost for evals on a mid-size project: $15,000 to $25,000. Total cost in most proposals: $0, or buried inside 'QA' with no specifics.

Ask every vendor: 'What is your eval plan, and what does a regression look like?' If the answer is 'we will monitor it in production,' that is not an eval plan.
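For reference, this is roughly what a regression gate looks like in code. It is a minimal sketch: `answer_question` is a hypothetical stand-in for the deployed system, the golden set is invented, and the substring check is a deliberately crude pass criterion that a real harness would likely replace with an LLM-as-judge or a stricter matcher.

```python
def answer_question(question: str) -> str:
    """Hypothetical stand-in for the system under test."""
    canned = {
        "What is the notice period?": "The notice period is 30 days.",
        "Who pays the transfer fee?": "The buyer pays the transfer fee.",
    }
    return canned.get(question, "I do not know.")


# Fixed golden set: (question, phrase the answer must contain).
GOLDEN_SET = [
    ("What is the notice period?", "30 days"),
    ("Who pays the transfer fee?", "buyer"),
    ("Is subletting allowed?", "not allowed"),
]


def run_regression(golden: list[tuple[str, str]], min_pass_rate: float) -> dict:
    """Run every golden question and fail the deployment below the threshold."""
    failures = [
        question
        for question, expected in golden
        if expected.lower() not in answer_question(question).lower()
    ]
    pass_rate = 1 - len(failures) / len(golden)
    return {"pass_rate": pass_rate, "failures": failures, "ok": pass_rate >= min_pass_rate}


report = run_regression(GOLDEN_SET, min_pass_rate=0.95)
print(report)
```

A regression, in this framing, is a named question that passed on the last deployment and fails on this one. That is the level of specificity to ask the vendor for.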

The Missing Maintenance Math

AI systems have a cost structure that is unlike traditional software, and proposals routinely hide or misprice it. Here is the math a buyer needs to do before signing.

Model inference cost

Get the vendor to give you an estimated token budget per user interaction. For a typical RAG assistant: 2,000 input tokens (system prompt plus retrieved context) and 400 output tokens. At GPT-4o list pricing as of mid-2025 (about $2.50 per million input tokens and $10 per million output tokens), that is roughly $0.009 per call. At 10,000 calls/month, that is about $90/month. At 200,000 calls/month, that is about $1,800/month. This is manageable. The number that catches buyers off guard is the eval and re-embedding cost when you update the corpus.
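The per-call arithmetic is worth redoing with the vendor's own numbers. Here is a minimal sketch; the per-million-token prices are my assumption of GPT-4o list pricing in mid-2025, so check them against the provider's current price sheet before relying on the output.

```python
def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """USD cost of one call, given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Assumed list prices in USD per million tokens; replace with current figures.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

# Token budget from the text: 2,000 input tokens and 400 output tokens.
per_call = cost_per_call(2_000, 400, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M)
print(f"Per call: ${per_call:.4f}")

for calls_per_month in (10_000, 200_000):
    print(f"{calls_per_month:>7,} calls/month: ${per_call * calls_per_month:,.2f}")
```

Ask the vendor to fill in the two token counts for your use case. If they cannot, they have not designed the prompt yet.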

Re-embedding cost

If your knowledge base changes frequently, you pay to re-embed. A 100,000-document corpus costs about $1.30 to embed once with text-embedding-3-small (at $0.02 per million tokens, that assumes roughly 650 tokens per document). Full re-embeds monthly are cheap. But if the vendor has proposed a custom fine-tuned embedding model, that changes the math entirely, and the maintenance cost belongs in the proposal.
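The re-embedding figure comes from a one-line calculation. This is a minimal sketch: the $0.02 per million tokens price for text-embedding-3-small and the 650 tokens per document average are assumptions chosen to reproduce the figure above, so substitute your own corpus statistics.

```python
def embedding_cost(documents: int, avg_tokens_per_doc: int, price_per_m_tokens: float) -> float:
    """USD cost of embedding the whole corpus once."""
    total_tokens = documents * avg_tokens_per_doc
    return total_tokens * price_per_m_tokens / 1_000_000


# Assumptions: average document length and embedding price (USD per million tokens).
AVG_TOKENS_PER_DOC = 650
PRICE_PER_M_TOKENS = 0.02

one_pass = embedding_cost(100_000, AVG_TOKENS_PER_DOC, PRICE_PER_M_TOKENS)
print(f"One full embed: ${one_pass:.2f}")
print(f"Monthly full re-embeds for a year: ${one_pass * 12:.2f}")
```

At these prices the API bill is noise. The real cost of a corpus update is the engineering time to re-run evals afterwards.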

Prompt drift and model upgrades

Models change. GPT-4o mini behaves differently from GPT-3.5-turbo. OpenAI and Anthropic deprecate models on 6-to-12 month cycles. Every deprecation requires a re-eval run and potentially prompt rework. That is 1 to 4 days of engineering per upgrade cycle. A responsible proposal includes a line item for this or explicitly calls it out of scope.
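To size that line item, multiply it out. A minimal sketch, assuming an 8-hour engineering day and the $150-to-$200/hour senior rate I use in the FAQ of this article; both are assumptions, not figures from any specific proposal.

```python
HOURS_PER_DAY = 8                 # assumption: one engineering day
RATE_LOW, RATE_HIGH = 150, 200    # assumption: USD/hour for a senior engineer

# Best case: one deprecation per year (12-month cycle), 1 day of work.
# Worst case: two deprecations per year (6-month cycle), 4 days each.
days_low = 1 * 1
days_high = 2 * 4

cost_low = days_low * HOURS_PER_DAY * RATE_LOW
cost_high = days_high * HOURS_PER_DAY * RATE_HIGH

print(f"Upgrade maintenance: {days_low} to {days_high} days/year, "
      f"${cost_low:,} to ${cost_high:,} per year")
```

The range is wide, which is exactly why it should be written down in the proposal rather than discovered on the first deprecation notice.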

Observability

Production AI systems need traces. At minimum: input/output logging (with PII redaction), latency per call, token usage per call, and error rates by failure mode. Tools like Langfuse, Helicone, or a custom OpenTelemetry setup cost $0 to $200/month depending on volume. If the proposal does not mention observability, the system will be a black box in production. That is a support cost you will pay later.
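Here is a minimal sketch of the trace record I mean, using only the standard library. The `fake_model` function and its usage dictionary are hypothetical stand-ins for a real provider client, and the two regular expressions are illustrative only; they are nowhere near a complete PII detector.

```python
import json
import re
import time

# Illustrative patterns only; real PII redaction needs a dedicated tool.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace obvious emails and phone numbers before anything is logged."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def traced_call(model_fn, prompt: str) -> dict:
    """Call the model and return one trace record for the log pipeline."""
    start = time.perf_counter()
    record = {"prompt": redact(prompt), "error": None}
    try:
        output, usage = model_fn(prompt)
        record["output"] = redact(output)
        record["input_tokens"] = usage["input_tokens"]
        record["output_tokens"] = usage["output_tokens"]
    except Exception as exc:  # record the failure mode instead of losing it
        record["error"] = type(exc).__name__
    record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
    return record


def fake_model(prompt: str):
    """Hypothetical model client returning (text, token usage)."""
    return "Reply sent to jane@example.com", {"input_tokens": 12, "output_tokens": 7}


trace = traced_call(fake_model, "Email jane@example.com or call +1 415 555 0100")
print(json.dumps(trace, indent=2))
```

Whether the records go to Langfuse, Helicone, or an OpenTelemetry collector matters less than the fact that these four fields exist for every call.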

Security and Data Handling Gaps to Check

AI proposals from agencies with roots in front-end or product work frequently skip security architecture entirely. These are the questions to ask.

Where does user input go? Is it sent directly to OpenAI or Anthropic, or does it pass through a proxy? If you are in a regulated industry (healthcare, finance, legal), you need a data processing agreement with the model provider, and you need to know whether training on your data is opted out.
Is PII stripped before it hits the LLM? A responsible pipeline has a pre-processing step that redacts or tokenizes PII before embedding or prompting. If this is not in the proposal, it needs to be.
What is the prompt injection surface? Any system that takes user input and inserts it into a prompt is a prompt injection target. Ask the vendor how they validate tool calls that result from model output. 'We trust the model' is not an answer.
Are API keys in environment variables only? Basic, but ask. Leaked keys in a public repo are the most common AI security incident I see.
What is the data retention policy for logs? If you are logging inputs and outputs for observability (you should be), those logs may contain sensitive data. The proposal should specify retention limits and access controls.
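On the prompt injection question, the answer I want to hear is some version of 'model output is untrusted input.' Here is a minimal sketch of that idea: an allowlist with typed arguments, checked before any tool runs. The tool names and argument schemas are hypothetical, invented for a property-assistant example.

```python
# Hypothetical tool surface: tool name -> {argument name: expected type}.
ALLOWED_TOOLS = {
    "search_listings": {"city": str, "max_price": int},
    "get_listing": {"listing_id": str},
}


def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    name = call.get("name")
    if name not in ALLOWED_TOOLS:
        return [f"tool not allowed: {name!r}"]

    schema = ALLOWED_TOOLS[name]
    args = call.get("arguments", {})
    errors = []

    for key in args:
        if key not in schema:
            errors.append(f"unexpected argument: {key}")
    for key, expected_type in schema.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected_type):
            errors.append(f"wrong type for {key}")
    return errors


good = {"name": "search_listings", "arguments": {"city": "Dubai", "max_price": 500_000}}
injected = {"name": "delete_listing", "arguments": {"listing_id": "42"}}

print(validate_tool_call(good))
print(validate_tool_call(injected))
```

This does not stop every injection, but it bounds the damage to what the allowlisted tools can do, and that is a design decision the proposal should state.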

None of this is exotic. A senior engineer should be able to answer these in a 30-minute call. If the vendor cannot, that is a signal about the quality of the rest of the work.

How to Benchmark a Quote: A Five-Point Checklist

Here is the concrete checklist I use when reviewing a proposal for a client or evaluating a vendor for my own work.

Deliverables, not activities. Every line item should map to a shipped artifact: an eval harness, a deployed retrieval pipeline, a system prompt document with version history, a runbook. 'Architecture design' is an activity. 'Architecture decision record covering retrieval strategy, model selection rationale, and fallback behavior' is a deliverable.
Acceptance criteria exist. How do you know when the retrieval pipeline is good enough? The proposal should name a metric and a threshold ('precision@5 greater than 0.82 on the golden eval set').
Complexity is justified. For every architectural component, ask the vendor to explain what problem it solves and what simpler alternative they considered. If they cannot name a simpler alternative, they have not thought it through.
Post-launch costs are itemized. Model inference, re-embedding, observability tooling, prompt maintenance, and eval re-runs should all appear as estimated ongoing costs, even if they are out of scope for the build contract.
The team is named. Not 'a team of senior engineers.' Named individuals with verifiable work history. AI is a small field. You can check GitHub, LinkedIn, and prior project work before signing.

What You Probably Need Less of Than You Think

This is the part that does not appear in agency proposals, because saying it costs them revenue. Most buyers at the 'exploring AI' stage need far less than a six-figure build.

If your core use case is document Q&A, internal knowledge retrieval, or a customer-facing assistant over a bounded corpus, a well-configured RAG pipeline on a managed embedding service plus a carefully written system prompt will cover 80% of the value. That is a $15,000 to $40,000 project, not a $150,000 one. The remaining 20% of value (nuanced multi-step reasoning, complex tool chains, real-time data integration) is where spend scales up, and it should scale up because the problem actually requires it, not because the proposal template requires it.

The test I use: could a senior engineer who has not used an AI framework before build this in two weeks with the OpenAI API, a Postgres vector extension (pgvector), and a deployment script? If yes, the proposal should reflect that scope. If a vendor says otherwise, ask them to point to the specific requirement that breaks that simpler path.

I am not arguing for cutting corners on evals, security, or observability. Those belong in any serious project. I am arguing that the compute and orchestration layers are where the unnecessary complexity lives, and a confident vendor will tell you when you need less, not more.

Frequently Asked Questions

How do I know if an AI vendor is overcharging?

Compare the deliverables in the proposal to what a senior engineer could produce in the quoted time at a $150-to-$200/hour rate. If the math does not close, ask the vendor to itemize hours per deliverable. Padding usually becomes visible when you ask for a time breakdown. Also check whether the multi-agent and 'orchestration' layers are justified by a concrete problem statement, or whether they are there to fill weeks.
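Here is a minimal sketch of that comparison, applied to the $45,000, 6-week multi-agent line item from the composite example earlier in this article. The function name is mine; the rates are the $150-to-$200/hour range above.

```python
def implied_hours(price: float, low_rate: float = 150, high_rate: float = 200) -> tuple[float, float]:
    """Hours the quoted price buys at the high and low hourly rates."""
    return price / high_rate, price / low_rate


quoted_price, quoted_weeks = 45_000, 6
min_hours, max_hours = implied_hours(quoted_price)

print(f"Implied effort: {min_hours:.0f} to {max_hours:.0f} hours")
print(f"Per week: {min_hours / quoted_weeks:.1f} to {max_hours / quoted_weeks:.1f} hours")
```

If the implied effort is one full-time senior engineer for the whole period, ask what that person ships at the end of each week. The answer tells you whether the hours are real.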

What is a reasonable price for an AI RAG system?

For a well-scoped RAG system (single corpus, one user-facing interface, standard retrieval, basic evals), expect $25,000 to $60,000 depending on corpus size and integration complexity. Projects that quote above $80,000 for this scope should be able to justify the additional complexity clearly. Projects below $15,000 are skipping evals, observability, or both.

Should I pay for a multi-agent AI system?

Only if the vendor can describe, in plain terms, what task each agent handles that the others cannot, and why a single agent with multiple tools would fail. Most enterprise use cases in 2025 do not require multi-agent architecture. The main legitimate use cases are high-throughput parallel processing, long-horizon tasks with genuinely distinct sub-tasks, and systems that must maintain separate contexts for security or compliance reasons.

What should an AI proposal always include?

At minimum: named deliverables with acceptance criteria, an eval plan with named metrics, a post-launch cost estimate (inference, re-embedding, observability), a security and data-handling section, and a named team. If any of these are missing, ask for them before signing. A vendor who cannot produce them is not ready to do the work.

How do I evaluate an AI agency before hiring?

Ask for a prior project where they can describe: the eval metrics they used, a failure they caught in production and how they resolved it, and what they decided not to build and why. Production judgment shows in restraint, not in feature count. Also ask to speak directly with the engineer who will do the work, not the account manager who wrote the proposal.

What red flags are in AI consulting proposals?

The clearest red flags: vague line items with no deliverables ('AI strategy and planning'), multi-agent complexity with no justification, zero mention of evals, a retainer clause with no defined scope, and references to frameworks (LangChain, AutoGen, LangGraph) without a problem statement that requires them. None of these are automatically wrong, but each one deserves a direct question before you commit.

Get a Straight Read Before You Sign

If you have a proposal in hand and you are not sure whether it is priced fairly, scoped correctly, or missing the pieces that will cause pain after launch, I can review it. I do this as part of my AI consultancy and strategy work. I will give you a line-item assessment: what is justified, what is inflated, what is missing, and what the realistic project cost and scope should look like.

I have no agency overhead, no sales team, and no incentive to recommend more complexity than your problem needs. If the proposal is fair, I will tell you. If it is not, I will show you exactly where and by how much.

You can read more about how I work on my about page and see prior projects at /projects. When you are ready to talk, reach out at /contact.

Request a proposal review or AI strategy consultation
