How to Build an AI Roadmap That Survives Contact With Reality

Prioritize AI initiatives by scoring each one on value, feasibility, data-readiness, and reversibility, then ship the boring high-certainty wins first.

That single sentence is the whole answer. Everything below is the scaffolding that makes it stick in a real organization with real politics, legacy data, and a CTO who just saw a competitor demo GPT-4o on stage.

I am Mahmoud Zalt, an independent senior AI systems architect who has been building production software since 2010 (16+ years). I founded Sista AI and run its workforce of autonomous agents in production, which is where I learned to ship the boring high-certainty wins first. I work as an independent AI consultant and strategist, not an agency, which means I have skin in the outcome of every roadmap I touch. You can read more about my background or browse what I have shipped.

Why AI Roadmaps Die Before Quarter Two

I have reviewed more than two dozen AI roadmaps in the last three years. The failure pattern is almost always the same: the roadmap was built backwards. Someone showed leadership a compelling demo, leadership said 'we need that,' and a project was funded before anyone asked three basic questions: what data do we actually have, how long will integration take, and what happens if the model is wrong?

The result is predictable. Six months in, the flagship initiative is stuck in a data-quality spiral. A smaller, more tractable project that would have shipped in eight weeks and proved ROI is sitting at the bottom of the backlog. Morale drops. Budget gets questioned. The next AI proposal gets ten times more scrutiny than it deserves.

The fix is a scoring model applied before a single line of code is written, not after the demo has already seduced the stakeholders.

The Four-Factor Scoring Model

Score each candidate initiative on four dimensions, each on a 1-to-5 scale, and multiply the four scores together. The product is your priority score. Initiatives with scores above 200 go into the first planning cycle; those between 100 and 200 wait for the next cycle while prep work is done. Below 100, kill or defer without guilt.

Factor	What it measures	Score 1 (bad)	Score 5 (good)
Value	Revenue impact, cost reduction, or risk reduction if it works perfectly	Vanity metric, no clear dollar link	Direct revenue or quantified cost line
Feasibility	Engineering complexity given your current stack and team	Requires capabilities you do not have and cannot hire in 60 days	Off-the-shelf model, existing infra, team has done it before
Data-readiness	Is the training or retrieval data clean, labeled, and accessible right now?	Data is siloed, unlabeled, or legally blocked	Structured, labeled, accessible via existing API
Reversibility	How bad is the worst-case failure and can you roll back?	Irreversible customer-facing action, regulatory exposure	Internal tool, human review before output ships

Max score: 625. The distribution in practice is tight: most genuinely fundable initiatives land between 120 and 350. Anything below 80 is a bet, not a plan.
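
The arithmetic fits in a spreadsheet, but a minimal Python sketch makes the thresholds explicit. The names `priority_score` and `planning_bucket` are illustrative, not part of any library, and treating a score of exactly 100 or 200 as "next cycle" is an assumption, since the rule above only says "above 200" and "below 100".

```python
from math import prod

FACTORS = ("value", "feasibility", "data_readiness", "reversibility")


def priority_score(scores: dict[str, int]) -> int:
    """Multiply the four 1-to-5 factor scores into a single priority score."""
    for factor in FACTORS:
        if not 1 <= scores[factor] <= 5:
            raise ValueError(f"{factor} must be between 1 and 5")
    return prod(scores[factor] for factor in FACTORS)


def planning_bucket(score: int) -> str:
    """Map a priority score onto the thresholds used in this article."""
    if score > 200:
        return "first planning cycle"
    if score < 100:
        return "kill or defer"
    # Assumption: the 100-to-200 band waits a cycle while prep work happens.
    return "next cycle"


perfect = dict.fromkeys(FACTORS, 5)
print(priority_score(perfect), planning_bucket(priority_score(perfect)))
```

Because the score is a product rather than a sum, a single factor at 1 caps the total at 125 no matter how strong the other three are. That is the intended behavior: one fatal weakness should not be averaged away.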

Worked Example: Killing Two Initiatives at a Mid-Size SaaS Company

A B2B SaaS company came to me with nine AI initiatives on their roadmap. Leadership wanted to start with two: an AI sales coach that would analyze call recordings and give reps real-time suggestions, and a GPT-powered contract redlining tool. Both had been demoed internally and both had executive champions.

Here is what the scoring looked like after a two-day discovery session:

Initiative	Value	Feasibility	Data-readiness	Reversibility	Score
AI sales coach (real-time)	4	2	1	2	16
Contract redlining (GPT)	5	3	2	1	30
Support ticket classifier	3	5	5	5	375
Churn-risk scoring (weekly batch)	5	4	4	5	400

The sales coach scored a 16. The call recordings were in three different formats across two vendors, had no consent framework for AI processing (legal blocker), and real-time inference at sub-300ms latency required infra the team had never operated. The data-readiness score of 1 alone should have killed it. The reversibility score of 2 reflected that bad real-time advice in a live sales call is visible to a customer and hard to walk back.

The contract redlining tool scored a 30. The core problem was reversibility: contract errors have legal liability, and the team had no hallucination-mitigation plan. A human-in-the-loop review layer could have raised reversibility from 1 to 4, which would have taken the score to 120 and made it fundable, but that review layer was not scoped and would have doubled the project cost. It was not a bad idea, it was just not ready.

The boring winners: a support ticket classifier (375) and a churn-risk scoring model (400). Both used structured internal data, both had human review before any action triggered, both had clear cost or revenue links (reduced support headcount and targeted retention spend), and both could be rolled back by turning off a feature flag. The team shipped the classifier in six weeks and the churn model in ten. Both were in production before the sales coach would have finished its legal review.
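
To check the table, here is a short sketch that recomputes the four scores and the redlining what-if. The factor values come straight from the scoring table; the variable names are illustrative.

```python
from math import prod

# Factor order: value, feasibility, data-readiness, reversibility
initiatives = {
    "AI sales coach (real-time)": (4, 2, 1, 2),
    "Contract redlining (GPT)": (5, 3, 2, 1),
    "Support ticket classifier": (3, 5, 5, 5),
    "Churn-risk scoring (weekly batch)": (5, 4, 4, 5),
}

# Rank the initiatives by the product of their factor scores.
ranked = sorted(initiatives.items(), key=lambda item: prod(item[1]), reverse=True)
for name, factors in ranked:
    print(f"{prod(factors):>4}  {name}")

# What-if: a scoped human review layer lifts redlining reversibility from 1 to 4.
value, feasibility, data_readiness, _ = initiatives["Contract redlining (GPT)"]
redlining_with_review = prod((value, feasibility, data_readiness, 4))
print(redlining_with_review)
```

The what-if is the useful part. Re-scoring a single factor shows exactly which piece of missing work separates a deferred initiative from a fundable one.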

Data-Readiness Is the Factor Teams Lie to Themselves About

Value and feasibility are easy to score honestly because they feel abstract. Data-readiness is where wishful thinking creeps in. I have seen teams score their data a 4 because 'we have the data in the warehouse,' only to discover in week two that half the records are missing, the schema changed three times and nobody documented it, and the column they planned to use as the label was filled in inconsistently by five different sales ops people over four years.

Before scoring data-readiness above a 3, confirm all five of the following:

Volume: you have at least the minimum viable sample size for the task (for a classifier, that typically means 1,000 to 5,000 labeled examples per class; for RAG, at least 50 to 100 high-quality source documents per domain).
Labeling: the target variable exists and was produced by a consistent process, not inferred or back-filled.
Access: you can query the data today without a data-governance ticket that takes six weeks to resolve.
Legal clearance: there is no consent, privacy, or contractual barrier to using this data for model training or inference.
Freshness: the data distribution today resembles the data distribution you will see in production; a model trained on 2021 behavior and deployed into a market that shifted in 2024 will degrade silently.

Fail any one of these and your data-readiness score drops to 2 or below, regardless of how much data you technically have.
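
One way to keep the scoring honest is to make the cap mechanical. The sketch below assumes each of the five checks has been answered yes or no by someone who actually queried the data; `DataReadinessChecks` and `capped_data_readiness` are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass
class DataReadinessChecks:
    volume: bool           # enough labeled examples or source documents
    labeling: bool         # target variable produced by a consistent process
    access: bool           # queryable today without a long governance ticket
    legal_clearance: bool  # no consent, privacy, or contractual barrier
    freshness: bool        # current data resembles expected production data


def failed_checks(checks: DataReadinessChecks) -> list[str]:
    """Return the names of the checks that did not pass."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def capped_data_readiness(proposed: int, checks: DataReadinessChecks) -> int:
    """Cap the proposed 1-to-5 score at 2 when any check fails."""
    return min(proposed, 2) if failed_checks(checks) else proposed


warehouse = DataReadinessChecks(
    volume=True, labeling=False, access=True, legal_clearance=True, freshness=True
)
print(capped_data_readiness(4, warehouse), failed_checks(warehouse))
```

Returning the failed check names alongside the capped score gives the team a concrete prep-work list instead of just a lower number.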

Reversibility Is Your Insurance Policy: Build It In from Day One

Reversibility is not just about rollback flags. It is about the blast radius when the model is confidently wrong, which it will be. The scoring factor captures three things: the severity of a bad output, the speed of detection, and the cost of correction.

A practical reversibility checklist for any AI initiative:

Human-in-the-loop gate: is there a human review step before the output triggers an irreversible action (sending an email, updating a contract, canceling an account)?
Confidence thresholding: does the system abstain or escalate when the model confidence falls below a calibrated threshold, rather than always returning an answer?
Observability: are you logging inputs, outputs, and confidence scores with enough context to debug a failure three weeks after it happens?
Eval suite: do you have a golden-set evaluation that you can run in under five minutes to detect regression before a deploy goes to production?
Kill switch: can you disable the AI layer with a single feature flag and fall back to the previous behavior?

Initiatives that score a 5 on reversibility typically have all five. Initiatives that score a 1 typically have none, and their authors have not thought about failure at all.
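
Three of the checklist items (kill switch, confidence thresholding, and the human gate) can share one routing function. This is a minimal sketch, not a production design: the 0.8 threshold is a placeholder you would calibrate on your own golden set, the model confidence is assumed to be a calibrated probability, and `route_output` is a hypothetical name.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("ai_gate")


@dataclass
class Prediction:
    label: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]


def route_output(prediction: Prediction, ai_enabled: bool, threshold: float = 0.8) -> str:
    """Decide what happens to a model output before anything irreversible runs."""
    if not ai_enabled:
        decision = "fallback"      # kill switch: use the previous non-AI behavior
    elif prediction.confidence < threshold:
        decision = "escalate"      # abstain and hand the case to a person
    else:
        decision = "human_review"  # confident outputs still pass a review gate
    # Observability: record enough context to debug the decision later.
    logger.info(
        "label=%s confidence=%.2f decision=%s",
        prediction.label, prediction.confidence, decision,
    )
    return decision


print(route_output(Prediction("billing", 0.55), ai_enabled=True))
```

Note that no branch returns "act automatically". For a first initiative, routing even confident outputs through review is what earns the 5 on reversibility.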

Sequencing: How to Turn Scores Into a Quarterly Roadmap

Once every initiative has a score, sequencing follows three rules:

Rule 1: Ship a win in the first 90 days. Pick the highest-scoring initiative that can reach production (not demo, production) within 90 days. This builds organizational credibility and funds the next cycle. If no initiative can ship in 90 days, the roadmap is too ambitious and needs to be cut.

Rule 2: Run no more than two AI initiatives in parallel per engineering team. AI projects have a compounding context cost. Each additional parallel initiative degrades the team's ability to run proper evals, monitor production behavior, and respond to model drift. Two is a hard ceiling until you have a dedicated ML platform team.

Rule 3: Schedule a kill review at 30 days. For every running initiative, hold a 30-day checkpoint with the original scoring sheet. Re-score based on what you now know. If the score has dropped below 100 because a data assumption was wrong or a feasibility assumption was wrong, kill it and move to the next item in the queue. Sunk cost is not a reason to continue.

The output is a simple three-column table: this quarter (scored above 200, ships in 90 days), next quarter (scored 100-200, needs prep work), and deferred (scored below 100, revisit in six months with fresh data).
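
The three rules are simple enough to sketch in code. The version below assumes a single engineering team and sends any initiative scoring 100 or more that misses the 90-day window or the two-initiative ceiling to next quarter; that overflow behavior is my assumption, and `build_roadmap` and `should_kill` are hypothetical names. The day estimates for the two winners come from the worked example (six and ten weeks); the other two are placeholders that do not affect the result.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    name: str
    score: int
    days_to_production: int  # the team's own estimate


def build_roadmap(initiatives: list[Initiative], max_parallel: int = 2) -> dict[str, list[str]]:
    """Sort initiatives into the three roadmap columns, highest score first."""
    roadmap: dict[str, list[str]] = {"this quarter": [], "next quarter": [], "deferred": []}
    for item in sorted(initiatives, key=lambda i: i.score, reverse=True):
        ships_in_time = item.days_to_production <= 90
        has_capacity = len(roadmap["this quarter"]) < max_parallel
        if item.score > 200 and ships_in_time and has_capacity:
            roadmap["this quarter"].append(item.name)
        elif item.score >= 100:
            # Assumption: strong ideas that miss the window or the ceiling wait a quarter.
            roadmap["next quarter"].append(item.name)
        else:
            roadmap["deferred"].append(item.name)
    return roadmap


def should_kill(rescored: int) -> bool:
    """30-day kill review: stop if the re-scored initiative is below 100."""
    return rescored < 100


backlog = [
    Initiative("Support ticket classifier", 375, 42),
    Initiative("Churn-risk scoring (weekly batch)", 400, 70),
    Initiative("Contract redlining (GPT)", 30, 180),    # placeholder estimate
    Initiative("AI sales coach (real-time)", 16, 270),  # placeholder estimate
]
print(build_roadmap(backlog))
```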

Cost Discipline and Vendor Selection in the Scoring Model

Cost does not appear as a separate scoring factor because it is already embedded in value (lower cost raises net value) and feasibility (prohibitive cost reduces feasibility). But two cost traps are worth naming explicitly.

Trap 1: Confusing API cost with total cost. A GPT-4o call costs fractions of a cent. The engineering cost to build reliable prompt chains, eval harnesses, fallback logic, observability, and human review workflows costs months of senior engineering time. I have seen teams approve an AI initiative based on a $200/month API estimate and then discover the true first-year cost is $300,000 once engineering is counted properly.
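
A back-of-the-envelope calculation shows how small the API line is. The $200/month figure is from the example above; the 16 engineer-months and the $18,600 loaded monthly cost are hypothetical inputs chosen only so the total lands on the $300,000 mentioned, so substitute your own numbers.

```python
def first_year_cost(api_per_month: float, engineer_months: float,
                    cost_per_engineer_month: float) -> float:
    """Twelve months of API spend plus the engineering effort to productionize."""
    return api_per_month * 12 + engineer_months * cost_per_engineer_month


api_only = 200 * 12  # the $200/month estimate, annualized
# Hypothetical inputs: 16 engineer-months at a $18,600 loaded monthly cost.
total = first_year_cost(200, engineer_months=16, cost_per_engineer_month=18_600)
api_share = api_only / total

print(api_only, total, f"{api_share:.1%}")
```

Under these assumptions the API bill is under one percent of the first-year cost, which is why an approval based on the API estimate alone is approving the wrong number.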

Trap 2: Defaulting to the most capable model. For a support ticket classifier, you do not need GPT-4o. On a narrow, well-defined task with good training data, a fine-tuned smaller model or even a well-prompted GPT-4o-mini equivalent will often match or outperform a more expensive model, at roughly one-tenth the inference cost. The right question is not 'which model is best' but 'which model is sufficient for this task and this latency requirement at this cost point.'

On vendor selection: build your scoring model, pick your top-three initiatives, then select tooling. Not the other way around. Committing to a vendor or a model before you know what you are building is one of the most common and most expensive mistakes I see.

Frequently Asked Questions

How do I prioritize which AI initiatives to do first?

Score each initiative on value (business impact), feasibility (engineering complexity), data-readiness (is the data clean, labeled, and accessible today), and reversibility (how bad is a wrong answer and can you roll back). Multiply the four scores together. Ship the highest-scoring initiative that can reach production in 90 days. Kill anything below 100 without guilt.

What makes an AI initiative high priority vs low priority?

High-priority initiatives have a direct revenue or cost link, use structured data you already own, require capabilities your team already has or can acquire in days, and fail gracefully when the model is wrong. Low-priority initiatives are high on impressiveness and low on all four of those dimensions. The flashiest demo is almost never the highest-priority initiative.

How do I build an AI strategy without getting distracted by hype?

Apply the scoring model before any demo reaches leadership. If an initiative scores below 100, do not let it into the roadmap regardless of how compelling the demo was. The scoring model exists precisely to create a defensible, repeatable reason to say no that is not personal and is not political.

How long should the first AI initiative take to ship?

If it cannot reach production (not a demo, production with real users and real monitoring) in 90 days, it is too large for a first initiative. Break it down or pick a smaller scope. The 90-day target is not arbitrary: it is the minimum cycle for organizational credibility. Miss it and the next AI proposal gets twice the skepticism.

When should I use RAG vs fine-tuning vs a pre-trained model out of the box?

Start with a pre-trained model and good prompting. If accuracy on your specific task is still below your threshold after prompt engineering, add RAG if the gap is knowledge-related (the model does not know your domain). Fine-tune only if the gap is style or format related (the model knows the facts but cannot produce the right output shape) and you have at least 500 to 1,000 high-quality labeled examples. Most production use cases are solved at the RAG layer or earlier. Fine-tuning is expensive and maintenance-heavy; reserve it for when you have evidence it is needed.
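
The decision order above can be written as a small function. This is a sketch of the heuristic, not a substitute for evals: `next_step` is a hypothetical name, the 500-example floor is the low end of the range given above, and the final fallback branch is my assumption for cases the heuristic does not cover.

```python
def next_step(meets_threshold: bool, gap: str, labeled_examples: int) -> str:
    """Pick the next technique after prompting a pre-trained model.

    gap is "knowledge" (the model lacks domain facts) or
    "format" (it knows the facts but produces the wrong output shape).
    """
    if meets_threshold:
        return "ship the prompted pre-trained model"
    if gap == "knowledge":
        return "add RAG"
    if gap == "format" and labeled_examples >= 500:
        return "fine-tune"
    # Assumption: without enough labeled data, fine-tuning is premature.
    return "collect more labeled examples or narrow the scope"


print(next_step(meets_threshold=False, gap="knowledge", labeled_examples=0))
```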

What does an AI roadmap engagement with a consultant actually deliver?

In my AI strategy engagements, the deliverable is a scored initiative backlog, a 90-day first-ship plan, a data-readiness report for the top three initiatives, and a technical architecture sketch for each. The scoring process itself surfaces assumptions that would otherwise become expensive surprises at week six. Most teams leave with fewer initiatives than they came in with, which is the point.

Ready to Build a Roadmap That Ships?

If your organization has a list of AI ideas and no disciplined way to choose between them, the scoring model above is where to start. Apply it to your current list this week. If every initiative scores below 200, that is important information: you need better data infrastructure or a more constrained scope before any of them are worth funding.

If you want a senior independent perspective on your specific initiatives, I do focused AI strategy and roadmap engagements as an independent consultant. No agency overhead, no upsell pressure, direct access to someone who has built and shipped production AI systems. You can reach me at the contact page or review my work at /projects.

Get an independent AI roadmap review.
