How to Validate an AI Idea Is Feasible Before You Build It

Is Your AI Idea Actually Feasible? Here Is How to Find Out in One Week

Your AI idea is technically feasible if a frontier model can solve the core task at acceptable accuracy on a 50-sample eval set, your data exists and is licensable, and the cost-per-call math closes at production volume. If any of those three fail, you do not have a feasibility problem you can sprint past. You have a project to cancel or reshape before you burn real money.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software since 2010. Founding Sista AI and running autonomous agents in production for the last year has sharpened my instinct for which AI ideas are feasible and which only look good on a slide. I work with product teams and founders as an AI strategy consultant to prevent expensive AI misfires. What follows is the exact one-week feasibility spike protocol I run before any real build begins. Learn more about me here.

Why Most AI Projects Fail Before the First Sprint Ends

The most common failure mode I see is teams that skip feasibility entirely. They hire engineers, stand up infrastructure, and discover in week six that the model cannot reliably extract the structured fields they need from messy PDFs. Or they find out the training data they assumed existed is actually locked in a vendor contract. Or the per-query cost at their projected volume is four times the revenue per transaction.

None of those surprises require six weeks to surface. They surface in one week if you run the right spike. The spike is not a prototype. It is a narrow, disposable investigation designed to answer one binary question per dimension of risk: can the model do this task, does the data exist, and do the unit economics work?

The Three Dimensions of AI Feasibility

Capability: Can a frontier model perform the core task at a quality threshold that makes the product useful?
Data: Do you have, or can you legally obtain, the data required to ground, fine-tune, or evaluate the system?
Economics: Do the cost-per-call and latency numbers close at your realistic volume and price point?

All three must pass. A two-out-of-three result is a no-go, not a partial green light.

The One-Week Feasibility Spike Protocol

This protocol fits inside five working days. It requires one engineer (or a senior individual contributor) and access to at least one frontier model API. No infrastructure, no databases, no product code.

Day 1: Write the Eval Set First

Before you touch a model, write 50 representative input-output pairs for the core task. Do this manually. If you cannot write 50 examples, you do not understand the task well enough to build it. The eval set is the single most valuable artifact of the entire spike. It becomes your regression suite, your acceptance criteria, and your benchmark when comparing models or prompting strategies later.

Good eval sets for a document extraction task look like: 10 clean inputs, 15 moderately noisy inputs, 10 edge cases, 10 adversarial or out-of-distribution inputs, 5 known-hard cases. Assign a binary pass/fail per output plus a severity label for failures (cosmetic, functional, critical).
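One way to hold that composition in code is a small record per example plus a check that the bucket counts match the target mix. This is a minimal sketch: the field names, bucket labels, and the `check_composition` helper are illustrative assumptions, not part of any library.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Target mix from the text above; the bucket names are illustrative labels.
TARGET_MIX = {
    "clean": 10,
    "noisy": 15,
    "edge": 10,
    "adversarial": 10,
    "known_hard": 5,
}
SEVERITIES = {"cosmetic", "functional", "critical"}


@dataclass
class EvalCase:
    case_id: str
    bucket: str                      # one of TARGET_MIX's keys
    input_text: str
    expected_output: str
    passed: Optional[bool] = None    # filled in after a run
    severity: Optional[str] = None   # only set when passed is False

    def grade(self, passed: bool, severity: Optional[str] = None) -> None:
        """Record a binary verdict; a failure must carry a severity label."""
        if not passed and severity not in SEVERITIES:
            raise ValueError("a failure needs a severity: " + ", ".join(sorted(SEVERITIES)))
        self.passed = passed
        self.severity = None if passed else severity


def check_composition(cases: list[EvalCase]) -> dict[str, int]:
    """Return each bucket's distance from its target count (0 means on target)."""
    counts = Counter(case.bucket for case in cases)
    return {bucket: counts.get(bucket, 0) - target for bucket, target in TARGET_MIX.items()}
```

A negative number from `check_composition` means the bucket is still short of examples, which is a quick way to see what remains to be written on day 1.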

Day 2: Capability Probe on a Frontier Model

Run your 50-sample eval set against the best available frontier model (currently GPT-4o, Claude 3.7 Sonnet, or Gemini 1.5 Pro, depending on the task modality). Use zero-shot first, then one-shot, then a structured system prompt. Log every output. Measure pass rate on your eval set.

Interpret results this way: a pass rate above 85% with zero-shot or one-shot prompting means the capability exists and you are building a product, not solving a research problem. Between 60% and 85% means the capability exists but you will need retrieval, fine-tuning, or better prompt engineering. Below 60% means the task is either poorly defined (rewrite the eval) or the capability is not there yet (do not build).
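The three bands can be written down as a small function so the verdict is not argued case by case. This is a sketch under one assumption of mine: exactly 85% falls into the middle band, since the text says "above 85%". The function name and return strings are hypothetical.

```python
def interpret_pass_rate(passed: int, total: int) -> str:
    """Map an eval-set pass count onto the three bands described above."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    # Integer comparisons avoid floating-point surprises at the band edges.
    if passed * 100 > 85 * total:
        return "capability exists: build the product"
    if passed * 100 >= 60 * total:
        return "capability exists with help: plan for retrieval, fine-tuning, or prompt work"
    return "below threshold: rewrite the eval or do not build"
```

On a 50-sample set, 43 passes is the first count that clears the top band and 30 passes is the floor of the middle band.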

This is also when you probe latency. If the task requires real-time interaction and p95 latency on day-2 tests is already 8 seconds, you have a UX problem baked in before you write a line of product code.
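Measuring p95 during the probe takes a few lines. The sketch below times each call and uses the nearest-rank definition of a percentile; `call_model` is a hypothetical stand-in for whatever client function sends one prompt to your model, not a real library call.

```python
import time
from typing import Callable


def measure_latencies(call_model: Callable[[str], str], prompts: list[str]) -> list[float]:
    """Time each call in seconds. `call_model` is a placeholder for your API client."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least 95% of samples at or below it."""
    if not latencies_s:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based, in integer math
    return ordered[rank - 1]
```

With 50 samples, this reports the 48th-fastest call, so two or three slow outliers are enough to move the number. That is the point: real-time UX is set by the slow tail, not the average.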

Day 3: Data Availability Audit

Map every data source the production system would require. For each source, answer four questions: Does it exist? Who owns it? Can you use it under your intended license or terms of service? Is it in a format the model can consume without heroic preprocessing?

Common failures here: CRM data that is legally owned by the customer, not your client. Scraped data that violates terms of service. Historical data that exists but is stored in a format (scanned images, proprietary binary) that adds three months of preprocessing work. Internal documents that contain PII and cannot be fed to a third-party API without a DPA.

Output of day 3 is a data matrix: each source mapped to availability, license status, format, estimated preprocessing effort, and a red/amber/green status.
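The data matrix does not need more than a list of records. Here is a minimal sketch; the field names and the rule that turns answers into red/amber/green are my assumptions for illustration (in particular the two-week preprocessing cutoff), so adjust them to your own risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    exists: bool
    owner: str
    license_ok: bool            # usable under your intended license or terms
    model_ready_format: bool    # consumable without heavy preprocessing
    preprocessing_weeks: float  # estimated effort

    def status(self) -> str:
        """Assumed rule: missing or unlicensed is red, costly preparation is amber."""
        if not self.exists or not self.license_ok:
            return "red"
        if not self.model_ready_format or self.preprocessing_weeks > 2:
            return "amber"
        return "green"


def overall_status(sources: list[DataSource]) -> str:
    """The data verdict is only as good as the worst source."""
    order = {"green": 0, "amber": 1, "red": 2}
    return max((s.status() for s in sources), key=order.__getitem__)
```

Taking the worst status across sources mirrors the all-must-pass logic of the spike: one red source the product depends on blocks the data verdict.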

Day 4: Unit Economics Model

Build a simple spreadsheet. Columns: projected monthly active users, average queries per user per day, average tokens per query (input plus output), cost per 1k tokens for the model you probed on day 2, cost per call for any retrieval infrastructure (vector search, reranking), and total monthly AI cost at three volume scenarios (low, mid, 10x mid).

Now compare that to your revenue model. If you are charging $20/month per user and your AI cost at median usage is $18/user/month, the idea is not viable at this model tier. Your options are: cache aggressively, switch to a cheaper model for common queries and escalate to frontier only for hard ones, or reprice. Model-tier switching (a small model handles 80% of queries, frontier handles the 20% that fail) typically reduces cost by 60-70% at production volume.
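The spreadsheet arithmetic is short enough to sanity-check in code. The sketch below uses hypothetical inputs chosen to land on the $18 figure above, and a simplified tiering model in which every query hits the small model first and only escalated queries also pay for a frontier call. All numbers and function names are illustrative assumptions.

```python
def monthly_ai_cost_per_user(
    queries_per_day: float,
    tokens_per_query: float,
    cost_per_1k_tokens: float,
    retrieval_cost_per_call: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """AI cost for one user for one month, in the same currency as the inputs."""
    cost_per_query = tokens_per_query / 1000 * cost_per_1k_tokens + retrieval_cost_per_call
    return queries_per_day * days_per_month * cost_per_query


def tiered_cost_fraction(small_model_cost_ratio: float, escalation_rate: float) -> float:
    """Tiered cost as a fraction of the all-frontier cost.

    Assumes every query pays for the small model and the escalated
    share additionally pays for one frontier call.
    """
    return small_model_cost_ratio + escalation_rate


# Hypothetical usage: 20 queries/day, 3,000 tokens each, $0.01 per 1k tokens.
cost = monthly_ai_cost_per_user(queries_per_day=20, tokens_per_query=3000, cost_per_1k_tokens=0.01)
margin = 20.00 - cost  # against a $20/month price

# Hypothetical tiering: small model priced at 15% of frontier, 20% of queries escalate.
fraction = tiered_cost_fraction(small_model_cost_ratio=0.15, escalation_rate=0.20)

# Total monthly AI cost at three volume scenarios (low, mid, 10x mid), in users.
scenarios = {"low": 200, "mid": 1_000, "10x_mid": 10_000}
totals = {name: users * cost for name, users in scenarios.items()}
```

Under these assumed numbers the tiered setup costs about 35% of the all-frontier bill. The saving shrinks quickly if the small model is pricier relative to frontier or escalates more often, so plug in the escalation rate you actually measured on the eval set.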

Day 5: Write the Go/No-Go Memo

One page. Three sections: capability verdict (pass rate, model used, prompt strategy, identified failure modes), data verdict (sources, blockers, effort), economics verdict (cost per user at median, break-even volume). Attach the eval set as an appendix. The memo should take 30 minutes to write because the previous four days produced all the inputs.

A go-decision means all three sections are green and you have identified the highest-risk unknowns to address in the first build sprint. A no-go is not a failure. It is the spike doing its job. It saved you months of work.
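The decision rule itself is deliberately blunt, and it can be stated in a few lines. This is a sketch with hypothetical names; it only encodes the all-three-green rule from the memo and leaves the judgment behind each verdict to the people writing it.

```python
from dataclasses import dataclass

REQUIRED_DIMENSIONS = {"capability", "data", "economics"}


@dataclass
class Verdict:
    dimension: str   # "capability", "data", or "economics"
    green: bool
    notes: str = ""


def go_no_go(verdicts: list[Verdict]) -> tuple[str, list[str]]:
    """Return ("go", []) only if every required dimension is present and green."""
    seen = {v.dimension for v in verdicts}
    missing = REQUIRED_DIMENSIONS - seen
    if missing:
        raise ValueError("memo is incomplete, missing: " + ", ".join(sorted(missing)))
    blockers = sorted(v.dimension for v in verdicts if not v.green)
    return ("go" if not blockers else "no-go", blockers)
```

Returning the list of blockers alongside the verdict keeps a no-go actionable: it names which section of the memo has to change before the question is asked again.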

What Teams Get Wrong When They Skip the Spike

The most expensive mistake is conflating a demo with a feasibility result. A demo is cherry-picked. It shows the model working on the five inputs the engineer chose because they looked good. An eval set is the opposite. It is deliberately hard. It includes the cases that make the model fail. If your feasibility argument is 'we showed it to the CEO and it looked great,' you do not have a feasibility result.

The second mistake is running the spike on a toy dataset that does not reflect production distribution. Teams building document processing systems test on clean PDFs when production will be faxed invoices scanned at 150 DPI. The eval set must reflect the actual input distribution, including noise, edge cases, and adversarial inputs.

The third mistake is ignoring the data audit entirely and discovering mid-build that the assumed data source is unavailable. I have seen teams build six weeks of retrieval pipeline before realizing the internal knowledge base they planned to index requires sign-off from legal in three countries. The data audit takes one day. The legal review takes three months. Run it first.

Baking Guardrails and Observability Into the Feasibility Assessment

The feasibility spike is also the right moment to identify where guardrails and observability are non-negotiable. If you cannot instrument the model's outputs for quality during the spike, you cannot instrument them in production. Observability is not an operational concern you add after launch. It is a technical capability you validate during feasibility.

For every failure mode your eval set surfaces, classify it: is this a model failure (wrong output), a retrieval failure (wrong context retrieved), a prompt failure (ambiguous instruction), or a data failure (missing information)? This taxonomy becomes your logging schema. In production, every query gets tagged with failure type on the way out so you can triage regressions without reading individual logs.
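As a sketch of how that taxonomy can become a logging schema, the four failure types map onto an enum and one small record per query. The record fields and helper names here are assumptions for illustration, not a prescribed format.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class FailureType(str, Enum):
    MODEL = "model"          # wrong output
    RETRIEVAL = "retrieval"  # wrong context retrieved
    PROMPT = "prompt"        # ambiguous instruction
    DATA = "data"            # missing information


def log_record(query_id: str, passed: bool, failure_type: Optional[FailureType] = None) -> dict:
    """Build one log entry; a failure must be tagged, a pass must not be."""
    if passed and failure_type is not None:
        raise ValueError("a passing query cannot carry a failure type")
    if not passed and failure_type is None:
        raise ValueError("a failing query needs a failure type")
    return {
        "query_id": query_id,
        "passed": passed,
        "failure_type": failure_type.value if failure_type else None,
    }


def failure_breakdown(records: list[dict]) -> Counter:
    """Count failures per type so regressions can be triaged in aggregate."""
    return Counter(r["failure_type"] for r in records if not r["passed"])
```

Running `failure_breakdown` over the graded eval set gives the first version of the dashboard you will want in production.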

Guardrails to validate during the spike: output schema enforcement (does the model reliably return the JSON structure you need, or does it hallucinate extra fields), confidence proxies (does lower self-reported confidence correlate with actual failures in your eval set), and refusal behavior (does the model refuse edge cases you need it to handle, or handle edge cases you need it to refuse). These are binary questions you can answer with your 50-sample eval before writing a single line of product code.
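The schema-enforcement check is the easiest of these to automate. Below is a minimal sketch using only the standard library; the `SCHEMA` fields are a made-up invoice example, and a real project might prefer a schema library instead of this hand-rolled check.

```python
import json

# Hypothetical target structure for a document-extraction task.
SCHEMA = {"invoice_number": str, "total": (int, float)}


def check_schema(raw_output: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, expected_type in schema.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}")
    # Fields the model invented count as failures too.
    for key in data:
        if key not in schema:
            problems.append(f"unexpected field: {key}")
    return problems
```

Run it over all 50 raw outputs and count how many return a non-empty list. That count answers the schema question directly, separately from whether the extracted values are correct.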

Retrieval, Tool-Calling, and MCP: Validate the Architecture, Not Just the Model

If your AI idea requires retrieval-augmented generation (RAG), tool-calling, or integrations via MCP (Model Context Protocol), the feasibility spike must probe these specifically. A model that scores 88% on a pure language task may drop to 62% when it has to retrieve relevant context from a noisy knowledge base and synthesize an answer. Those are different tasks and they need separate eval sets.

For retrieval, the spike question is: does the retriever surface the right chunks for the hard cases in your eval set? Run your 50 queries against a small prototype index (200-500 documents is enough for a spike). Measure recall at k=3 and k=10. If recall at k=10 is below 70% for your hard cases, your retrieval architecture needs work before your model architecture matters at all.
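Recall at k is simple to compute once each eval query has a hand-labeled set of relevant chunk IDs. A minimal sketch, assuming the retriever returns a ranked list of IDs; the function names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one labeled relevant chunk")
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall at k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)
```

Compute the mean separately for the hard-case bucket of the eval set, since that is the subset the 70% threshold above refers to.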

For tool-calling, the spike question is: does the model reliably select the right tool and form valid parameters for the cases in your eval set? Test tool-selection accuracy separately from tool-execution accuracy. A model that picks the right tool 90% of the time but produces malformed parameters 30% of the time has a prompt engineering problem, not a capability problem.
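Keeping the two numbers separate is mostly a matter of recording them separately. In this sketch, parameter validity is measured only over calls where the right tool was chosen, which is one reasonable reading of the split described above; the record shape is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToolCallResult:
    expected_tool: str
    chosen_tool: str
    params_valid: bool  # did the arguments pass validation for the chosen tool


def tool_metrics(results: list[ToolCallResult]) -> dict[str, float]:
    """Report tool-selection accuracy and parameter validity as separate rates."""
    if not results:
        raise ValueError("need at least one result")
    correct = [r for r in results if r.chosen_tool == r.expected_tool]
    selection_accuracy = len(correct) / len(results)
    # Validity is only meaningful when the right tool was picked.
    param_validity = sum(r.params_valid for r in correct) / len(correct) if correct else 0.0
    return {"selection_accuracy": selection_accuracy, "param_validity": param_validity}
```

High selection accuracy with low parameter validity points at tool descriptions and argument schemas in the prompt, not at the model's ability to choose.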

For MCP integrations, validate that the external systems your agent needs to call are actually callable with the latency and reliability your product requires. An MCP server wrapping a legacy internal API that times out 15% of the time at peak load is a feasibility blocker, not an implementation detail.

When to Design Human-in-the-Loop Into the Architecture From Day One

The spike tells you where the model fails. For every failure category that is both frequent and high-consequence, the first-version architecture should route to a human, not retry the model. This is not a compromise. It is a design decision that ships a reliable product faster than trying to solve every hard case with more prompting.

A concrete rule: if a failure type appears in more than 10% of your eval set and the consequence of that failure is a user-visible error or a compliance risk, design a human review queue for that failure type in v1. Automate it in v2 once you have production data about what the failures look like at scale.
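That rule is mechanical enough to write down. A sketch with hypothetical names, where "high consequence" is a flag the team sets per failure type (user-visible error or compliance risk):

```python
def needs_human_review(
    failure_count: int,
    eval_size: int,
    high_consequence: bool,
    threshold_pct: int = 10,
) -> bool:
    """True when a failure type is both frequent (> threshold) and high-consequence."""
    if eval_size <= 0:
        raise ValueError("eval_size must be positive")
    frequent = failure_count * 100 > threshold_pct * eval_size  # strictly more than 10%
    return frequent and high_consequence
```

On a 50-sample eval set, a failure type needs at least 6 occurrences to cross the 10% line; exactly 5 does not.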

This also affects your go-decision. A 72% pass rate on a high-stakes task is a no-go for a fully automated pipeline. It is a green light for a human-assisted pipeline where the model handles the 72% and queues the rest. Whether that architecture fits the product vision and unit economics is a product decision, not a technical one. The spike surfaces the choice. The team makes it.

Security and Compliance Checks That Belong in the Spike

Three security questions must be answered during the feasibility spike, not deferred to a later phase.

First: does the task require sending sensitive or regulated data to a third-party model API? If yes, you need a DPA (Data Processing Agreement) with the provider, and you need to confirm the provider's data residency and retention policies are compatible with your obligations (GDPR, HIPAA, SOC 2, or sector-specific rules). This is a blocker. It cannot be addressed with better engineering.

Second: does the model's output get rendered anywhere that could create XSS, injection, or prompt injection risk? If a user can influence the model's input and the model's output is rendered as HTML or executed as code, the spike must include adversarial prompt injection tests in the eval set. This is not a security audit. It is a basic check that the architecture is not fundamentally unsafe.

Third: what is your data handling policy for eval set data? If you built your eval set from real user data or production documents, you need to handle it under the same policies as production data. Eval sets are routinely treated as throwaway scratch data and stored insecurely. They are not throwaway. They contain your most sensitive inputs by design.

Frequently Asked Questions

How do I know if our AI idea is actually technically feasible?

Run a one-week feasibility spike: build a 50-sample eval set on day one, probe a frontier model against it on day two, audit your data sources on day three, model the unit economics on day four, and write a go/no-go memo on day five. If the model clears an 85% pass rate, your data is available and licensable, and the cost-per-user math closes at your price point, the idea is feasible. If any of those three fail, you do not have a build problem. You have a requirements problem to resolve first.

What pass rate on an eval set means an AI task is feasible?

Above 85% on a well-constructed 50-sample eval set (including edge cases and adversarial inputs) using zero-shot or one-shot prompting means the capability exists and you are building a product. Between 60% and 85% means the capability exists but requires retrieval, fine-tuning, or significant prompt engineering. Below 60% is a no-go unless you suspect the eval set itself is poorly written, in which case rewrite it before drawing conclusions.

Can I validate AI feasibility without a data science team?

Yes. The one-week spike requires one engineer with API access and the ability to write a spreadsheet. You do not need a data scientist, a GPU, or any infrastructure. The eval set is hand-written. The probe is API calls. The data audit is a spreadsheet. The economics model is arithmetic. The value is in the rigor of the questions, not the sophistication of the tooling.

How much does a proper AI feasibility assessment cost?

The direct costs are minimal: frontier model API calls for a 50-sample eval rarely exceed $10-20 in tokens, plus one week of a senior engineer's time. The indirect cost of skipping it is typically 8-16 weeks of wasted build effort and the organizational credibility hit of a failed AI initiative. I run these as focused engagements for clients as part of my AI consultancy work, typically scoped to two to five days of structured investigation.

What are the most common reasons an AI idea fails the feasibility spike?

In order of frequency: (1) the core task has a pass rate below 60% because it requires reasoning the current model generation cannot reliably do, (2) the data source turns out to be legally unavailable or practically inaccessible, (3) the unit economics do not close at the intended price point and volume, and (4) latency at the required interaction modality (real-time, streaming) is incompatible with the UX requirement. Data availability failures are the most consistently underestimated.

Is a working demo the same as a feasibility result?

No. A demo is cherry-picked. It shows the model on inputs chosen to look good. A feasibility result is based on a representative eval set that includes noisy, edge-case, and adversarial inputs and measures pass rate across all of them. Treating a successful demo as a go-decision is the most expensive mistake I see teams make. Build the eval set first, always.

Work With Me to Validate Your AI Idea Before You Build

If your team is sitting on an AI idea and the honest answer to 'is this feasible?' is 'we think so, the demo looked good,' you are one week away from a real answer. The spike protocol above is what I run with clients before any architecture gets drawn or any engineer gets assigned.

I work as an independent AI strategy and systems consultant, not an agency. That means you get direct judgment from someone who has built and shipped AI systems in production, not a team of junior consultants using your project as a learning exercise. If you want to run this spike together, reach out here.

Zalt Blog
