Build, Buy, or Wait: How to Decide on Any AI Capability

Build, Buy, or Wait: The Short Answer

If your AI capability is a commodity task (summarization, classification, extraction, basic Q&A), wait or buy. If it is a genuine competitive differentiator tied to proprietary data or workflow, build a thin layer over a foundation model. Almost nobody needs a fully custom model in 2025, and the 'build' option costs far more than vendors quote and delivers far less than benchmarks suggest.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years of experience building production software. For the past year I have run Sista AI, the company I founded, keeping a fleet of autonomous agents alive in production. As an independent AI strategy consultant, I have helped engineering teams avoid six-figure mistakes by applying the framework below. You can read more about me here.

Why 'Wait' Is the Most Underrated Option

Vendors do not sell waiting. Analysts do not write reports titled 'do nothing yet.' But model capability is improving fast enough that custom work loses value month by month. In late 2023, GPT-4 Turbo needed a retrieval pipeline, prompt engineering, and a fine-tuning run to hit 85% accuracy on legal clause extraction. By mid-2025 a well-prompted call to a frontier model hits that baseline out of the box at a tenth of the cost.

Concretely: if the capability you want is on a frontier model roadmap (multimodal reasoning, longer context, structured outputs, code execution), waiting 3 to 6 months has a real dollar value. Estimate the engineering weeks to build it now, multiply by your fully loaded weekly engineering cost, then subtract what you would spend in model API fees after waiting. That delta, the value of waiting, is often $80k to $200k for a mid-size team.
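The estimate described above fits in a few lines. This is a minimal sketch: `value_of_waiting` is an illustrative name, and the inputs in the usage line are hypothetical placeholders, not figures from a real engagement.

```python
def value_of_waiting(build_weeks: float,
                     weekly_eng_cost: float,
                     api_fees_after_waiting: float) -> float:
    """Dollar value of waiting instead of building now.

    build_weeks            -- estimated engineering weeks to build today
    weekly_eng_cost        -- fully loaded cost of the team per week
    api_fees_after_waiting -- expected model API spend once a frontier
                              model covers the capability natively
    """
    build_cost_now = build_weeks * weekly_eng_cost
    return build_cost_now - api_fees_after_waiting


# Hypothetical inputs: 10 weeks of work, a team costing $12k per week,
# and $15k of API fees over the comparison period.
delta = value_of_waiting(build_weeks=10, weekly_eng_cost=12_000,
                         api_fees_after_waiting=15_000)
print(f"Value of waiting: ${delta:,.0f}")  # Value of waiting: $105,000
```

If the result is negative, waiting costs more than building, and the rest of the framework applies.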

Three signals that 'wait' is the right answer:

The task is purely language-based with no proprietary data advantage.
You cannot write a deterministic eval suite today. If you cannot measure it, you cannot maintain it.
The business timeline for ROI is longer than 9 months. Model costs will drop further; your custom infra costs will not.

The Decision Framework: Four Questions Before You Commit

Run every proposed AI capability through these four questions in order. The first 'no' terminates the build path.

Question	If yes	If no
1. Does this require proprietary data that no vendor can access?	Continue to Q2	Buy or wait
2. Is the workflow differentiated enough that off-the-shelf tools cannot be composed?	Continue to Q3	Buy a composable tool and configure it
3. Can you write a repeatable eval suite before you write a line of model code?	Continue to Q4	Wait until you understand the problem well enough to measure it
4. Is the expected ROI positive within 6 months at realistic (not best-case) performance?	Build a thin integration layer	Wait or run a 2-week spike first

Notice what is not in the table: 'is this technically interesting,' 'did a competitor announce something,' and 'can we use this in a press release.' Those are the three most common reasons teams build when they should wait.
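The four questions can also be encoded as a short function, which is handy if you want the decision recorded next to each proposal. This is a sketch of the table, nothing more; the `Capability` class and its field names are my own labels for the four questions.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    needs_proprietary_data: bool       # Q1
    workflow_is_differentiated: bool   # Q2
    can_write_evals_first: bool        # Q3
    roi_positive_in_6_months: bool     # Q4, at realistic performance


def decide(c: Capability) -> str:
    """Walk the questions in order; the first 'no' ends the build path."""
    if not c.needs_proprietary_data:
        return "Buy or wait"
    if not c.workflow_is_differentiated:
        return "Buy a composable tool and configure it"
    if not c.can_write_evals_first:
        return "Wait until you can measure the problem"
    if not c.roi_positive_in_6_months:
        return "Wait or run a 2-week spike first"
    return "Build a thin integration layer"


print(decide(Capability(True, True, True, True)))
# Build a thin integration layer
print(decide(Capability(True, False, True, True)))
# Buy a composable tool and configure it
```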

What 'Build' Actually Means in 2025

When the framework says build, it does not mean train a model. It means write a thin, observable integration layer over a foundation model API. The components are: retrieval (RAG or structured DB lookup), a tool-calling / MCP layer for actions, guardrails for output validation, an eval harness, and observability (traces, latency, cost per call). That is the full stack for 95% of production AI features.

A short worked example

A B2B SaaS client wanted AI-assisted contract review. The initial proposal was a fine-tuned model trained on their historical contracts. I ran the four-question framework:

Proprietary data? Yes, they had 5,000 annotated contracts.
Differentiated workflow? Yes, their clause taxonomy was non-standard.
Can you write evals? Yes, they had a gold set of 200 reviewed contracts with known outputs.
ROI in 6 months? Marginal. Reviewers spent 2 hours per contract; AI assistance needed to save at least 45 minutes to justify cost.

Decision: build a RAG pipeline over the contract corpus plus a structured extraction prompt with a JSON schema output contract, not a fine-tuned model. We skipped fine-tuning entirely. Total build: 3 weeks. Accuracy on their eval set: 91%. Fine-tuning would have taken 8 weeks and was unlikely to exceed 94% on the same eval. The time savings paid back in week 7.

The key architectural decisions in any 'build' engagement:

Evals first. Write the eval harness before the prompt. This is non-negotiable.
Tool-calling over prompt stuffing. Give the model tools (MCP or function calling) for actions; do not encode workflow logic in a 3,000-token system prompt.
Guardrails at the output boundary. Schema validation, hallucination probes, and a human-in-the-loop escalation path for low-confidence outputs.
Cost instrumentation from day one. Log tokens in and out per call, per user, per feature. You cannot optimize what you do not measure.
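To make the guardrail point concrete, here is a minimal sketch of schema validation at the output boundary using only the standard library. The field names in `REQUIRED_FIELDS` are hypothetical, and a real system would likely use a full JSON Schema validator and add hallucination probes on top of this check.

```python
import json
from typing import Optional

# Hypothetical output contract for a clause-extraction call:
# field name -> exact Python type expected after JSON decoding.
REQUIRED_FIELDS = {"clause_type": str, "text": str, "page": int}


def check_output(raw: str) -> tuple[str, Optional[dict]]:
    """Return ("accept", data) or ("escalate", None) for a raw model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "escalate", None        # not valid JSON at all
    if not isinstance(data, dict):
        return "escalate", None        # valid JSON, but not an object
    for field, expected_type in REQUIRED_FIELDS.items():
        # Strict type check so that a JSON boolean does not pass as an int.
        if type(data.get(field)) is not expected_type:
            return "escalate", None    # missing field or wrong type
    return "accept", data


status, data = check_output(
    '{"clause_type": "indemnity", "text": "Each party shall...", "page": 3}'
)
print(status)  # accept
```

Anything that comes back as "escalate" goes to the human queue instead of the downstream system.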

What 'Buy' Actually Means and Where It Goes Wrong

Buying an AI tool is not just paying a SaaS invoice. It is a configuration, integration, and evaluation project. Teams consistently underestimate three costs:

Eval cost. You still need to write an eval suite for a bought tool. If the vendor upgrades their underlying model, you need to know immediately whether your use case regressed. Teams that skip this discover regressions in production.
Integration cost. Most AI tools have APIs that are designed for demos, not for production workflow integration. Budget 2 to 4 weeks of senior eng time for any non-trivial integration, plus ongoing maintenance.
Lock-in cost. Vendor-specific prompt formats, proprietary retrieval indexes, and non-exportable fine-tunes create switching costs that make the TCO calculation look very different at renewal.

The buy option makes clear sense when: the vendor's core loop solves the whole problem (not 70% of it), the data you feed it is not a competitive asset, and the vendor has production SLAs you can hold them to. Good current examples: document OCR with structure extraction, meeting transcription and summarization, code review suggestions in CI. All commodity, all better bought.

Cost, Security, and the Hidden Tax of Early Movers

Two factors that almost never appear in 'build vs buy' analyses but dominate the real TCO:

Cost deflation

Model API costs have dropped roughly 10x every 18 months since GPT-4 launched. That means a capability that costs $0.50 per call today will cost approximately $0.05 in 18 months. If your build decision is justified at $0.50 per call, rerun the numbers at $0.05. Does the business case survive? If not, wait.
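Rerunning the numbers is a one-line formula. The sketch below assumes the rough 10x-per-18-months trend holds as a smooth exponential, which is a simplification; `projected_cost` and its parameter names are illustrative.

```python
def projected_cost(cost_today: float, months: float,
                   drop_factor: float = 10.0,
                   period_months: float = 18.0) -> float:
    """Cost per call after `months`, if cost falls by `drop_factor`
    every `period_months` (assumed smooth exponential decline)."""
    return cost_today / drop_factor ** (months / period_months)


print(f"${projected_cost(0.50, 18):.2f}")  # $0.05
print(f"${projected_cost(0.50, 9):.3f}")   # $0.158
```

Plug the projected figure into the same business case you used to justify the build, and check whether the margin over a plain API call still exists.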

Security and data governance

Every AI integration is a new data flow. Before approving any build or buy, answer four questions: where does user or customer data go, does it leave your cloud boundary, is it used for vendor model training, and what is the breach notification SLA? These are not paranoid questions. They are basic due diligence, and three of my clients learned that too late, after signing contracts with AI vendors whose default data retention policies were incompatible with their GDPR or SOC 2 requirements. Read the data processing addendum, not just the main agreement.

On the build side: if you are routing sensitive data through a foundation model API, you need a data classification layer upstream of the LLM call. Never route PII, PHI, or secrets into a prompt without explicit stripping and audit logging.
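A stripped-down version of that upstream layer might look like the sketch below. The two regex patterns are illustrative only and far from complete; a real classifier needs much broader coverage (names, addresses, PHI, API keys). The logger name `llm.audit` and the `redact` function are assumptions for the example.

```python
import logging
import re

audit_log = logging.getLogger("llm.audit")

# Illustrative patterns only, not a complete PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str, request_id: str) -> str:
    """Replace matched spans with placeholders and log what was removed."""
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            # Log the category and count, never the sensitive value itself.
            audit_log.info("request=%s redacted=%s count=%d",
                           request_id, label, count)
    return text


prompt = redact("Contact jane@example.com, SSN 123-45-6789", "req-1")
print(prompt)  # Contact [EMAIL], SSN [SSN]
```

Only the redacted string should ever reach the prompt; the audit log gives you a record of what was stripped without storing the values.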

Observability and Evals: The Work That Makes Everything Else Work

Production AI without evals is not a product, it is a demo. The minimum viable observability stack for any AI feature in production:

Tracing: every LLM call gets a trace ID and records the full prompt, model version, latency, token counts, and cost. Use LangSmith, Langfuse, or a simple structured log shipped to your existing observability stack.
Eval harness: a golden dataset of 50 to 200 input/output pairs, run automatically on every model version change or prompt change. Alert on regressions above a 2% threshold.
Human-in-the-loop escalation: low-confidence outputs (measured by your guardrail layer, not model logprobs) route to a human queue. Track the escalation rate as a product health metric. A rising escalation rate is an early warning before user complaints.
Cost dashboard: daily spend by feature, by user cohort. Set a hard budget cap per user session. Runaway prompt injection or a misconfigured agent loop can produce $10k API bills overnight.

This stack takes one sprint to build. Every team that skips it regrets it by month three.
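Two pieces of that stack are small enough to sketch here: the regression check and the per-session budget cap. The code makes two assumptions worth stating: the 2% threshold is read as two percentage points of exact-match accuracy, and the cap is enforced in-process. All names are illustrative.

```python
from typing import Callable

REGRESSION_THRESHOLD = 0.02  # assumed: 2 percentage points of accuracy


def accuracy(predict: Callable[[str], str],
             golden: list[tuple[str, str]]) -> float:
    """Fraction of golden (input, expected) pairs predicted exactly."""
    hits = sum(1 for x, expected in golden if predict(x) == expected)
    return hits / len(golden)


def has_regressed(baseline_acc: float, candidate_acc: float) -> bool:
    """True when the candidate drops more than the threshold below baseline."""
    return baseline_acc - candidate_acc > REGRESSION_THRESHOLD


class SessionBudget:
    """Hard spend cap for one user session, in dollars."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse the call before spending, not after.
        if self.spent_usd + cost_usd > self.cap_usd:
            raise RuntimeError("session budget exceeded; refusing LLM call")
        self.spent_usd += cost_usd


# Toy golden set and a stand-in "model" (str.upper) for illustration.
golden = [("a", "A"), ("b", "B")]
print(accuracy(str.upper, golden))   # 1.0
print(has_regressed(0.91, 0.85))     # True
```

Run `accuracy` on every prompt or model version change, compare against the stored baseline with `has_regressed`, and call `charge` before each LLM request so a runaway loop fails fast.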

Human-in-the-Loop: When to Automate and When to Gate

The clearest heuristic I use: if the cost of a wrong output to a user or downstream system exceeds the cost of a human review, gate on human approval. This is not a temporary measure while the model matures. It is a permanent architectural decision for high-stakes outputs.

A practical segmentation:

Automate fully: summarization, classification, tagging, draft generation for human review, internal search ranking. Wrong outputs are annoying, not damaging.
Human-in-the-loop: customer-facing decisions (loan pre-qualification, medical triage routing, legal advice drafts), any action that modifies production data, any output that triggers a financial transaction.
Human-only: final approval on regulated outputs. The AI assists, it does not sign.

Teams that skip the middle tier because 'the model is good enough' are typically the ones I get called in to fix 6 months later.
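The heuristic and the three tiers translate directly into a routing function. This is a sketch; the dollar figures in the usage lines are hypothetical, and estimating the cost of a wrong output is the hard part that no code solves.

```python
from enum import Enum


class Tier(Enum):
    AUTOMATE = "automate fully"
    HUMAN_IN_LOOP = "human-in-the-loop"
    HUMAN_ONLY = "human-only"


def route(cost_of_wrong_output: float,
          cost_of_human_review: float,
          is_regulated_final_approval: bool = False) -> Tier:
    """Pick a tier using the cost comparison from the heuristic above."""
    if is_regulated_final_approval:
        return Tier.HUMAN_ONLY          # the AI assists, it does not sign
    if cost_of_wrong_output > cost_of_human_review:
        return Tier.HUMAN_IN_LOOP       # gate on human approval
    return Tier.AUTOMATE                # wrong outputs are annoying only


# Hypothetical costs in dollars.
print(route(cost_of_wrong_output=0.50, cost_of_human_review=4.00).value)
# automate fully
print(route(cost_of_wrong_output=5_000, cost_of_human_review=4.00).value)
# human-in-the-loop
```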

Frequently Asked Questions

Should we build our own AI model or use an existing one?

Almost certainly use an existing foundation model. Training your own model requires tens of millions of high-quality labeled examples, significant GPU infrastructure, and an ongoing fine-tuning pipeline. The realistic scenario where this pays off is a large enterprise with a truly unique domain (genomics, novel materials, proprietary financial signals) and 12+ months of runway to reach production quality. For everyone else, a well-engineered RAG and tool-calling layer over a frontier model API will outperform a custom-trained model at a fraction of the cost.

How do we know if our use case justifies AI at all?

Write the eval suite first. If you cannot define success in measurable terms (precision, recall, task completion rate, time saved per user), you do not yet understand the problem well enough to build anything. A well-defined eval set also doubles as your business case: run the eval on a frontier model with a basic prompt before committing to a build. If the baseline is already 80%+, the ROI of further investment is often marginal.
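For a yes/no task, the baseline check can be as simple as the sketch below. The labels are made-up toy data, and `precision_recall` is an illustrative helper rather than any particular library's API.

```python
def precision_recall(predicted: list[bool],
                     actual: list[bool]) -> tuple[float, float]:
    """Precision and recall for binary labels (True = positive)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # true positives
    fp = sum(1 for p, a in pairs if p and not a)    # false positives
    fn = sum(1 for p, a in pairs if not p and a)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: model predictions versus gold labels for five items.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=0.67
```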

What does an AI strategy consultant actually do versus an AI vendor?

A vendor sells you their product. An independent consultant like me helps you decide whether to buy it, build it yourself, or wait. I have no inventory to move. My job is to save you money and compress your timeline by applying a framework built on real production experience, including knowing which capabilities are genuinely differentiating and which are commodities that will be free in 18 months.

Is waiting really a valid business strategy for AI?

Yes, for the majority of commodity capabilities. The companies that rushed to build custom summarization pipelines in 2023 spent 3 to 6 months of engineering time on something that GPT-4o handles natively today for $0.01 per call. Waiting is not the same as ignoring AI. It means investing that time in building proprietary data assets, writing eval suites, and identifying the 20% of your use cases that are genuinely differentiating, so you are ready to build fast when the moment is right.

How much should we budget for an AI integration project?

A production-grade AI feature (RAG pipeline, eval harness, guardrails, observability, human-in-the-loop queue) built by a small experienced team runs $80k to $200k in total engineering cost for the first feature, dropping to $20k to $60k for subsequent features that reuse the infrastructure. Bought SaaS tools range from $2k to $50k per year depending on volume, but add 2 to 4 weeks of integration work at senior eng rates. Fine-tuning projects start at $150k and rarely finish on time or budget. Use these as sanity checks against vendor proposals.

What is the biggest mistake teams make when evaluating AI tools?

Evaluating on demos instead of on their own data. Every AI vendor demo uses cherry-picked inputs. The only way to evaluate a tool honestly is to run it against your golden dataset, the same one you should have written before starting any evaluation. Teams that skip this step sign 12-month contracts and discover the tool performs at 60% on their actual data, not the 95% shown in the demo. Build the eval first. Always.

Work With an Independent AI Strategist

Most AI decisions are made under vendor pressure, competitor anxiety, or executive enthusiasm, not under a clear framework. I work with engineering teams and founders to apply the build-buy-wait analysis before a dollar is committed, identify the proprietary data and workflow advantages that actually justify building, and design the minimum viable AI stack (evals, guardrails, observability, retrieval, tool-calling) that ships to production without becoming a maintenance liability.

If you are facing an AI build decision and want an independent opinion grounded in 16+ years of production engineering and real AI deployments, get in touch here or read more about the AI consultancy and strategy service.

Get an independent AI strategy review before you commit the budget.
