
Zalt Blog

Deep Dives into Code & Architecture

AT SCALE

When Building Your Own AI Is a Mistake (and the Cheaper Alternative)

By Mahmoud Zalt
Insights
12m read
Most companies building custom AI are wasting money. Buy the wrapper, own your data and evals, and only build custom when your unit economics or data moat actually require it.
The Short Answer: Buy the Wrapper, Own Your Data

For most companies, building custom AI software from scratch is a mistake. Buy the wrapper, own your data layer and your evals, and treat the model and the inference infrastructure as a commodity. The few situations where custom development is genuinely defensible come down to three things: a proprietary data moat your competitors cannot replicate, latency requirements that no hosted provider can meet, or unit economics that break at your volume.

I am Mahmoud Zalt, an independent senior AI systems architect who has been building production software for 16 years, since 2010. My own company, Sista AI, has run a workforce of autonomous agents in production for the past year, so I have made the build-versus-buy call with my own money on the line. I advise founders and product teams on exactly this question through my AI Consultancy practice. I have seen both sides: teams that wasted six months and $200k building what a $500/month SaaS would have done, and teams that correctly identified a data advantage and built something defensible. This article gives you the framework to tell the difference before you commit.

What 'Buy the Wrapper' Actually Means

When I say buy the wrapper, I do not mean subscribe to a generic chatbot tool and call it your AI strategy. I mean: use a hosted foundation model (OpenAI, Anthropic, Gemini, or an open-weight model via a managed inference provider), drop your domain context in via retrieval-augmented generation (RAG), write prompt templates your team controls, and put your effort into the one thing that actually compounds over time: your evaluation harness.

The architecture looks like this: a thin application layer your engineers own, a retrieval pipeline over your proprietary documents and data, a set of frozen eval cases drawn from real user traffic, and a CI step that runs your evals on every prompt change. That is your real IP. The model is a utility; it will be 30 percent cheaper and 20 percent better in eighteen months regardless of what you do today.
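To make the shape of that thin application layer concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reference implementation: the keyword-overlap scorer stands in for a real vector search, and the names `Chunk`, `retrieve`, `build_prompt`, and the template wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


def score(query: str, chunk: Chunk) -> int:
    """Toy relevance score: count of distinct query words found in the chunk.
    A real pipeline would use embedding similarity plus metadata filters."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.text.lower().split())
    return len(query_words & chunk_words)


def retrieve(query: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return the k highest-scoring chunks, dropping chunks with no overlap."""
    ranked = sorted(corpus, key=lambda c: score(query, c), reverse=True)
    return [c for c in ranked[:k] if score(query, c) > 0]


# The prompt template lives in your repository, so your team controls it.
PROMPT_TEMPLATE = (
    "Answer using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)


def build_prompt(question: str, corpus: list[Chunk], k: int = 2) -> str:
    """Assemble the text that would be sent to the hosted model."""
    chunks = retrieve(question, corpus, k)
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


corpus = [
    Chunk("policy-1", "Refunds are issued within 14 days of purchase"),
    Chunk("policy-2", "Shipping takes 3 to 5 business days"),
    Chunk("faq-7", "Contact support to change a delivery address"),
]
prompt = build_prompt("how many days do refunds take", corpus)
print(prompt)
```

Nothing in this sketch depends on which model answers the prompt, which is the point: the retrieval step and the template are the parts you own.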

The off-the-shelf stack that covers 90 percent of use cases

  • Hosted LLM: OpenAI GPT-4o, Anthropic Claude, or Gemini. All have function-calling, tool use, and multi-modal input. Pick based on your eval results, not hype.
  • RAG layer: PostgreSQL with pgvector, Pinecone, or Weaviate. Chunking strategy and metadata filters matter more than which vector DB you choose.
  • Orchestration: LangChain, LlamaIndex, or a thin hand-rolled router. For most teams, hand-rolled is maintainable; frameworks add abstraction before you understand what you are abstracting.
  • Observability: Langfuse or Braintrust for trace-level logging, cost attribution, and eval scoring.
  • MCP / tool-calling: Model Context Protocol servers if your agents need to interact with external systems. This is commodity infrastructure now.

A team of two engineers can wire this up in three to four weeks and have a system that handles real user traffic. That is the baseline you should compare custom development against.

What Teams Get Wrong When They Decide to Build

The most expensive mistake I see is confusing 'we have a unique use case' with 'we need a custom model.' Almost every use case is unique in the prompt, not in the weights. You do not need a fine-tuned model to handle your specific document format or your industry terminology. You need a well-structured system prompt, a retrieval pipeline over your domain corpus, and a few dozen eval cases that encode what good looks like for your users.

The three wrong reasons teams build custom

| Wrong reason | Why it is wrong | What actually fixes it |
| --- | --- | --- |
| 'We need it to understand our jargon' | A good system prompt and retrieval layer handles this for 95 percent of domains | RAG over your knowledge base, domain-specific prompt templates |
| 'We need it to be private' | Hosted providers offer enterprise data agreements and zero-retention options | Anthropic Business, OpenAI Enterprise, or a managed self-hosted model (vLLM on your VPC) |
| 'We need it to be cheaper at scale' | Unit economics almost never favor custom before 10M tokens/day in most verticals | Model routing (Haiku/Flash for simple tasks, larger models only when needed), prompt caching, batch inference |
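The model-routing fix for the cost objection can start as a plain lookup. A minimal sketch, assuming you already tag each request with a task type; the tier identifiers, the task labels, and the 8,000-character cutoff are placeholders to tune against your own evals, not provider-specific values.

```python
# Hypothetical tier identifiers; map these to whatever models your evals favor.
SMALL_TIER = "small-fast-model"
LARGE_TIER = "large-model"

# Assumed set of tasks that a small model handles well enough.
SIMPLE_TASKS = {"classification", "extraction", "routing"}


def pick_model(task: str, prompt_chars: int, max_small_chars: int = 8000) -> str:
    """Send short, simple tasks to the cheap tier and everything else
    to the larger model."""
    if task in SIMPLE_TASKS and prompt_chars <= max_small_chars:
        return SMALL_TIER
    return LARGE_TIER


print(pick_model("classification", 500))
print(pick_model("contract_review", 500))
```

Whether a task belongs in the simple set is an eval question: move it there only once the small tier's score on your frozen cases is acceptable.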

A fourth wrong reason deserves its own mention: 'our competitors are building their own models so we should too.' This is the most expensive form of mimicry in tech. Your competitor may have a training data corpus, an ML team, and an evaluation infrastructure you do not. Building a model without those is not competitive parity, it is a $500k distraction.

The Three Cases Where Custom Is Genuinely Defensible

I said three cases. Here they are precisely. If your situation does not map cleanly to one of these, you are probably building custom for the wrong reasons.

1. Proprietary data moat

You have labeled training data that no one else can replicate at your scale and quality, and that data directly encodes a judgment that is commercially valuable. Legal contract risk scoring with 50,000 annotated contracts from your firm's case history. Medical triage routing with 10 years of clinical outcome data tied to specific presentations. Financial fraud detection with your institution's proprietary transaction graph. In these cases, fine-tuning or continued pre-training on that corpus can yield a model that is materially better than a prompted general model, and the gap is durable because competitors cannot acquire the same data.

The threshold I use: if your eval harness shows a prompted GPT-4o achieving 85 percent accuracy on your task, and your proprietary fine-tuned model achieves 94 percent, and that 9-point gap translates to a measurable business outcome (fewer escalations, lower claim payout, higher conversion), then fine-tuning is justified. If the gap is 2-3 points and you cannot tie it to revenue impact, you are fine-tuning for engineering satisfaction, not business value.

2. Latency that no hosted provider can meet

Hosted inference round-trip latency sits at roughly 300 to 800 milliseconds for a typical GPT-4o call. For most applications this is fine. For real-time voice assistants, sub-100ms response loops, or latency-sensitive trading applications where model reasoning is in the critical path, hosted inference is genuinely insufficient. In these cases, self-hosted open-weight models (Llama 3, Qwen, Mistral) on dedicated GPU infrastructure, optimized with vLLM and speculative decoding, can get sub-100ms for 7B to 13B parameter models. This is a legitimate technical requirement, not a preference.

3. Unit economics that break at volume

If you are processing a very high token volume on a narrow, well-defined task (classification, extraction, structured output generation), the math on hosted inference can become prohibitive. The volume has to be large, though. At $0.003 per 1k input tokens, 50M tokens/day costs only $150/day, or roughly $55k/year. The bill reaches $4,500/day, or roughly $1.6M/year, at about 1.5 billion tokens/day at that price, or at 50M tokens/day if your blended price is closer to $0.09 per 1k tokens. A well-tuned 7B model on three A100 GPUs running 24/7 costs around $180k/year fully loaded including engineering overhead, assuming that hardware can sustain your throughput. At a $1.6M/year hosted bill and that task specificity, the infrastructure investment pays back in under four months. Below 5M tokens/day on most tasks, the crossover point does not exist.
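It is worth running this arithmetic with your own numbers, because per-1k and per-token prices are easy to mix up. The sketch below is a rough calculator, assuming a single flat price per 1k tokens and a fixed annual self-hosting cost; the function names are illustrative.

```python
def hosted_cost_per_day(tokens_per_day: float, price_per_1k: float) -> float:
    """Daily hosted spend for a flat price quoted per 1,000 tokens."""
    return tokens_per_day / 1000 * price_per_1k


def hosted_cost_per_year(tokens_per_day: float, price_per_1k: float) -> float:
    """Annual hosted spend, assuming the same volume every day."""
    return hosted_cost_per_day(tokens_per_day, price_per_1k) * 365


def breakeven_tokens_per_day(self_hosted_per_year: float, price_per_1k: float) -> float:
    """Daily volume at which hosted spend equals the self-hosted annual cost."""
    return self_hosted_per_year / 365 / price_per_1k * 1000


PRICE_PER_1K = 0.003            # USD per 1k tokens, the figure used above
SELF_HOSTED_PER_YEAR = 180_000  # USD, fully loaded, the figure used above

daily = hosted_cost_per_day(50_000_000, PRICE_PER_1K)
breakeven = breakeven_tokens_per_day(SELF_HOSTED_PER_YEAR, PRICE_PER_1K)
print(f"Hosted cost at 50M tokens/day: ${daily:,.0f}/day")
print(f"Break-even volume: {breakeven / 1e6:,.0f}M tokens/day")
```

With these two inputs the break-even lands near 164M tokens/day. Use your blended price rather than the input price alone, since output tokens are typically billed at a higher rate.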

Your Real IP Is Your Eval Harness, Not Your Model

This is the single most important reframe in this article. The thing that makes your AI product defensible is not the model. It is your evaluation infrastructure, your labeled test cases, and your understanding of what 'good' means for your specific users on your specific tasks. That is the asset that competitors cannot copy and that keeps your product quality high as models and providers change.

A production eval harness has three components. First, a frozen test set: 100 to 500 input/output pairs drawn from real user sessions, annotated by subject matter experts for quality on the dimensions that matter (correctness, helpfulness, safety, formatting). Second, automated scoring: LLM-as-judge rubrics, embedding similarity checks, or structured output validators that can run the full test set in under five minutes on a CI server. Third, a regression gate: a CI step that blocks deploys when eval score drops more than two percentage points from the prior baseline.
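The regression gate is the smallest of the three pieces. A minimal sketch, assuming each frozen case has already been scored as pass or fail and that the prior baseline is stored somewhere your CI job can read; the function names and the way results arrive are assumptions for illustration.

```python
def eval_score(results: list[bool]) -> float:
    """Percentage of frozen test cases that passed."""
    if not results:
        raise ValueError("frozen test set is empty")
    return 100.0 * sum(results) / len(results)


def gate(current: float, baseline: float, max_drop: float = 2.0) -> bool:
    """True when the deploy may proceed: the score has not dropped by
    more than max_drop percentage points from the baseline."""
    return (baseline - current) <= max_drop


def ci_exit_code(results: list[bool], baseline: float) -> int:
    """0 lets the pipeline continue; 1 blocks the deploy."""
    current = eval_score(results)
    ok = gate(current, baseline)
    status = "PASS" if ok else "BLOCKED"
    print(f"eval score {current:.1f} vs baseline {baseline:.1f}: {status}")
    return 0 if ok else 1


# Example: 180 of 200 frozen cases pass against a stored baseline of 93.0.
example_results = [True] * 180 + [False] * 20
exit_code = ci_exit_code(example_results, baseline=93.0)
```

In a CI job you would pass the return value to `sys.exit` so that a three-point drop like the one in the example fails the build.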

A worked example: at Sista AI, before shipping any prompt change to the voice assistant, the eval suite ran against a 200-case frozen set covering edge cases in interruption handling, topic switching, and low-confidence disambiguation. Regressions that looked like improvements in demo conditions would show up immediately in eval scores on the frozen cases. That infrastructure, not the underlying model choice, is what kept quality predictable across a dozen provider and model changes over eighteen months.

Building this harness takes two to three weeks for a focused engineer. It pays back on the first prompt regression it catches. Teams that skip it are flying blind, regardless of whether they built custom or bought a wrapper.

Guardrails and Observability Are Non-Negotiable in Production

Whether you build custom or buy the wrapper, two engineering disciplines are non-negotiable before you call a system production-ready: guardrails and observability. I see teams skip both, ship, and then spend three months firefighting incidents they could have predicted.

Guardrails

Guardrails are the constraints you put on model behavior to prevent outputs that are harmful, off-brand, or simply wrong in ways your users will notice. They operate at three levels. Input guardrails filter or transform user input before it reaches the model (PII redaction, prompt injection detection, topic restriction). Output guardrails validate model output before it reaches the user (structured output schema enforcement, confidence thresholding, content policy checks). Behavioral guardrails limit what an agent can do (read-only tool use by default, human-in-the-loop gates on irreversible writes, token and cost caps per session).

For most applications, a combination of a model-level system prompt, a schema validator on structured outputs (Pydantic or Zod), and a simple content filter covers 90 percent of what you need. Dedicated guardrail libraries like Guardrails AI or NeMo Guardrails are worth evaluating if you have complex content policies, but do not add a framework dependency before you understand the failure modes you are defending against.
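As a rough illustration of the input and output levels, the sketch below uses only the Python standard library so it runs as-is; in a real service a Pydantic model would normally replace the hand-written checks. The email pattern, the `answer` and `confidence` field names, and the 0.7 threshold are assumptions for the example.

```python
import json
import re
from typing import Optional

# Deliberately simple pattern; real PII redaction needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(user_input: str) -> str:
    """Input guardrail: mask email addresses before the text reaches the model."""
    return EMAIL_RE.sub("[EMAIL]", user_input)


def validate_output(raw: str, min_confidence: float = 0.7) -> Optional[dict]:
    """Output guardrail: accept only a JSON object with a string 'answer' and a
    numeric 'confidence' at or above the threshold. Returns None otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    answer = data.get("answer")
    confidence = data.get("confidence")
    if not isinstance(answer, str):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if confidence < min_confidence:
        return None
    return data


print(redact_pii("mail me at jane.doe@example.com please"))
print(validate_output('{"answer": "42", "confidence": 0.9}'))
print(validate_output("not json"))
```

Returning `None` rather than raising keeps the caller's fallback path explicit: retry, downgrade to a canned response, or escalate to a human.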

Observability

Every production LLM call should log: the full prompt and completion, the model and version, latency in milliseconds, token counts and cost, the user session or trace ID, and any tool calls with their arguments and results. This is not optional. Without it, you cannot debug failures, quantify regressions, or explain behavior to a non-technical stakeholder. Langfuse is open source and can be self-hosted if data residency is a concern; for Braintrust, check the current self-hosting options against your requirements. The integration is a one-day effort for any standard orchestration setup.

Cost attribution deserves a specific callout. Instrument cost at the feature level, not just the account level. When your monthly inference bill is $8,000, 'model spend' as a single number tells you nothing. 'Document summarization feature: $4,200, customer support copilot: $2,800, internal search: $1,000' tells you where to optimize and whether individual features have positive unit economics.
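Feature-level attribution mostly comes down to tagging every call record with the feature that triggered it. A minimal sketch, assuming you build one record per LLM call; the `LLMCallRecord` fields mirror the logging list above, and the feature names, model identifiers, and per-call costs are made-up sample data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCallRecord:
    trace_id: str
    feature: str        # the product feature that triggered the call
    model: str
    prompt: str
    completion: str
    latency_ms: int
    input_tokens: int
    output_tokens: int
    cost_usd: float


def cost_by_feature(records: list[LLMCallRecord]) -> dict[str, float]:
    """Sum spend per feature so each feature's unit economics is visible."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += record.cost_usd
    return dict(totals)


records = [
    LLMCallRecord("t1", "doc_summarization", "model-a", "Summarize...", "...", 820, 1200, 300, 0.25),
    LLMCallRecord("t2", "doc_summarization", "model-a", "Summarize...", "...", 790, 1100, 280, 0.5),
    LLMCallRecord("t3", "support_copilot", "model-b", "Draft a reply...", "...", 450, 600, 150, 0.125),
]
totals = cost_by_feature(records)
for feature, cost in sorted(totals.items()):
    print(f"{feature}: ${cost:.3f}")
```

The same records can feed latency and error dashboards, so the feature tag is worth adding at the point where the call is made rather than reconstructing it later.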

Security and Data: The Questions Boards Actually Ask

When a company moves from 'we are experimenting with AI' to 'AI is in our production product,' the questions from legal, security, and the board change rapidly. Two topics dominate: where does our data go, and what happens when the model says something wrong.

Data residency and vendor agreements

Both OpenAI (Enterprise) and Anthropic (Business API) offer zero-data-retention agreements under which prompt and completion data is not used for training and is deleted after the API call completes. If you are in a regulated industry (healthcare, finance, legal), verify the specific DPA terms and BAA availability before architecture decisions, not after. For EU-based operations, check data processing geography. Anthropic processes in the US by default; if EU data residency is required, check whether your provider or cloud platform offers EU-region processing under your contract, and otherwise plan for a self-hosted open-weight model on EU infrastructure (via a provider like OVH or Hetzner, or your own GCP europe-west region).

The common mistake is treating 'we need data privacy' as automatically requiring custom infrastructure. Hosted enterprise agreements with appropriate DPAs cover most real compliance requirements. Run the analysis before committing to self-hosting; self-hosting has real operational costs (GPU instance management, model upgrades, scaling events) that are not free.

Liability and the human-in-the-loop question

When your AI system makes a consequential decision (a medical recommendation, a financial suggestion, a legal document draft), you need a clear policy on human review. The engineering pattern is simple: flag outputs that fall below a confidence threshold, or that would trigger an irreversible action, for human approval before they take effect; log every decision with its model rationale; and make the override mechanism obvious in the UI. The harder question is organizational: who reviews, at what volume, and what happens when no one is available. These are not engineering questions; they are product and legal questions. Design the human-in-the-loop gate before you ship to production, not after the first incident.
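The gate itself is a few lines of code. This sketch assumes that the outputs held for review are the low-confidence or irreversible ones, that the model returns a confidence value you trust enough to threshold, and that an in-memory list can stand in for your audit store; all names and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_APPLY = "auto_apply"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class Decision:
    output: str
    confidence: float   # assumed to be in [0, 1]
    irreversible: bool  # e.g. sends money, files a document
    rationale: str      # model-provided explanation, kept for the audit trail


audit_log: list[dict] = []  # stand-in for a durable audit store


def route(decision: Decision, min_confidence: float = 0.9) -> Route:
    """Hold irreversible or low-confidence outputs for a human."""
    if decision.irreversible or decision.confidence < min_confidence:
        return Route.HUMAN_REVIEW
    return Route.AUTO_APPLY


def decide(decision: Decision) -> Route:
    """Route the decision and record it with its rationale."""
    chosen = route(decision)
    audit_log.append({
        "output": decision.output,
        "confidence": decision.confidence,
        "rationale": decision.rationale,
        "route": chosen.value,
    })
    return chosen


print(decide(Decision("Approve refund of $40", 0.97, False, "Within policy window")))
print(decide(Decision("Wire $12,000 to vendor", 0.99, True, "Invoice matched")))
```

What this code cannot answer is what happens when the review queue is empty of reviewers, which is why the staffing and escalation policy has to be decided alongside it.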

Frequently Asked Questions

Is it worth building custom AI software or should we use off-the-shelf?

For most companies, off-the-shelf is the right starting point. Use a hosted foundation model, add a retrieval layer over your proprietary data, build an eval harness, and ship. Only move toward custom development (fine-tuning or self-hosting) when you have a specific, quantified reason: a data moat that produces measurable quality improvement, a latency requirement that hosted inference cannot meet, or unit economics that break at your actual production volume. The majority of teams that build custom do so before they have data to justify it.

When does fine-tuning an LLM actually make sense?

Fine-tuning makes sense when you have high-quality labeled examples (typically 500 to 10,000 annotated input/output pairs) for a specific narrow task, and your eval harness shows a prompted general model has a quality ceiling you cannot overcome with better prompting or retrieval. Good candidates: style transfer to a specific brand voice with many labeled examples, structured extraction from a document format that is highly domain-specific, or classification tasks where the label space is narrow and fixed. Bad candidates: anything where retrieval would work, anything where you have fewer than a few hundred quality examples, anything where the task evolves frequently (fine-tuned models require re-training when the task changes).

How much does it cost to build a custom AI system versus using an API?

A production-ready application built on hosted APIs (OpenAI, Anthropic) with RAG, guardrails, observability, and an eval harness typically costs $30,000 to $80,000 in engineering time for initial delivery, plus $500 to $5,000 per month in inference costs depending on volume. A custom fine-tuned model adds $20,000 to $80,000 in training and evaluation effort before it is ready for production, plus ongoing GPU infrastructure costs of $2,000 to $15,000 per month depending on scale. Self-hosted open-weight inference on dedicated GPUs runs $5,000 to $20,000 per month for meaningful production capacity. The API-first path is almost always faster and cheaper to a first production deployment; the question is whether the custom path pays back at your specific volume and quality requirements.

What do companies actually own when they use off-the-shelf AI?

You own your data, your retrieval and indexing pipeline, your prompt templates, your evaluation harness, your application logic, your user experience, and your operational runbooks. These are the durable assets. The model itself is a commodity that will be replaced by something better and cheaper; your eval harness is what lets you migrate safely when that happens. Companies that treat the model as their IP are building on sand. Companies that treat their evals and their data pipeline as their IP are building on rock.

Should we use RAG or fine-tuning to get the model to know our business domain?

Start with RAG. Retrieval-augmented generation lets you inject domain knowledge at inference time without any model training, and the knowledge is immediately updatable when your documents change. Fine-tuning encodes knowledge in model weights, which means stale information requires a retraining cycle and updates have a latency of days to weeks. The pattern I recommend: use RAG for factual domain knowledge (product documentation, internal policies, case history), use fine-tuning only for behavioral adaptation (tone, output format, reasoning style) after RAG is already working. Do not fine-tune for knowledge; retrieve it.

How do I measure whether our AI feature is actually working?

Define your success metric before you build, not after. For generative features: eval score on a frozen test set (LLM-as-judge with explicit rubrics, human spot-check on 10 percent of cases). For retrieval features: recall at K and mean reciprocal rank on a labeled query set. For agent features: task completion rate, error rate, and cost per successful completion. Instrument cost, latency, and error rate per feature from day one. A dashboard that shows those three numbers plus the quality metric for each feature, updated daily, is worth more than any amount of post-hoc analysis. Teams that skip measurement ship features with unknown quality and cannot prioritize improvements rationally.
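The two retrieval metrics are short enough to write by hand. A minimal sketch, assuming each labeled query comes with a ranked list of retrieved document IDs and a set of relevant IDs; the function names and the sample IDs are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over a labeled query set."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Each entry: (ranked retrieved IDs, set of relevant IDs) for one labeled query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d2", "d4"}),
]
print([recall_at_k(ret, rel, k=2) for ret, rel in runs])  # [1.0, 0.5]
print(mean_reciprocal_rank(runs))                         # 0.75
```

Track both: recall at K shows whether the right documents reach the prompt at all, while mean reciprocal rank shows how high the first useful one is ranked.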

What to Do Next

If you are facing a build-vs-buy decision for AI, the cheapest thing you can do is spend one day with someone who has seen both paths fail and succeed. Not a vendor with a product to sell you, and not a large agency with an incentive to maximize scope. Most companies need less than they think, and the decisions that matter (eval design, data ownership, vendor agreements, when to fine-tune) can be resolved in a focused strategy session.

I work with founders and product teams through my AI Consultancy practice. A one-day advisory session covers architecture review, build-vs-buy analysis, vendor selection, and a written decision brief your team can act on. If you want to talk through your specific situation first, reach out directly.


Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
