
Zalt Blog

Deep Dives into Code & Architecture

AT SCALE

How to Build an AI Roadmap That Survives Contact With Reality

By Mahmoud Zalt
Insights
12m read

Most AI roadmaps die because teams chase the flashiest demo, not the highest-certainty win. I use a 4-factor scoring model to kill bad bets early and ship the boring initiatives that actually compound.


Prioritize AI initiatives by scoring each one on value, feasibility, data-readiness, and reversibility, then ship the boring high-certainty wins first.

That single sentence is the whole answer. Everything below is the scaffolding that makes it stick in a real organization with real politics, legacy data, and a CTO who just saw a competitor demo GPT-4o on stage.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years of building production software (since 2010). I founded Sista AI and run its workforce of autonomous agents in production, which is where I learned to ship the boring high-certainty wins first. I work as an independent AI consultant and strategist, not an agency, which means I have skin in the outcome of every roadmap I touch. You can read more about my background or browse what I have shipped.

Why AI Roadmaps Die Before Quarter Two

I have reviewed more than two dozen AI roadmaps in the last three years. The failure pattern is almost always the same: the roadmap was built backwards. Someone showed leadership a compelling demo, leadership said 'we need that,' and a project was funded before anyone asked three basic questions: what data do we actually have, how long will integration take, and what happens if the model is wrong?

The result is predictable. Six months in, the flagship initiative is stuck in a data-quality spiral. A smaller, more tractable project that would have shipped in eight weeks and proved ROI is sitting at the bottom of the backlog. Morale drops. Budget gets questioned. The next AI proposal gets ten times more scrutiny than it deserves.

The fix is a scoring model applied before a single line of code is written, not after the demo has already seduced the stakeholders.

The Four-Factor Scoring Model

Score each candidate initiative on four dimensions, each on a 1-to-5 scale. Multiply them together. The product is your priority score. Initiatives with scores above 200 go into the first planning cycle. Scores from 100 to 200 wait a quarter while the prep work gets done. Below 100, kill or defer without guilt.

| Factor | What it measures | Score 1 (bad) | Score 5 (good) |
| --- | --- | --- | --- |
| Value | Revenue impact, cost reduction, or risk reduction if it works perfectly | Vanity metric, no clear dollar link | Direct revenue or quantified cost line |
| Feasibility | Engineering complexity given your current stack and team | Requires capabilities you do not have and cannot hire in 60 days | Off-the-shelf model, existing infra, team has done it before |
| Data-readiness | Is the training or retrieval data clean, labeled, and accessible right now? | Data is siloed, unlabeled, or legally blocked | Structured, labeled, accessible via existing API |
| Reversibility | How bad is the worst-case failure and can you roll back? | Irreversible customer-facing action, regulatory exposure | Internal tool, human review before output ships |

Max score: 625. The distribution in practice is tight: most genuinely fundable initiatives land between 120 and 350. Anything below 80 is a bet, not a plan.
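The arithmetic is simple enough to put in a spreadsheet, but a minimal Python sketch makes the thresholds explicit. The names `priority_score` and `triage` are hypothetical, and the sketch assumes each factor is an integer from 1 to 5. The cut-offs are the ones used in this article: above 200, 100 to 200, and below 100.

```python
from math import prod

# The four factors from the table above, in a fixed order.
FACTORS = ("value", "feasibility", "data_readiness", "reversibility")


def priority_score(scores: dict[str, int]) -> int:
    """Multiply the four 1-to-5 factor scores into one priority score."""
    for name in FACTORS:
        s = scores[name]
        if not isinstance(s, int) or not 1 <= s <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {s!r}")
    return prod(scores[name] for name in FACTORS)


def triage(score: int) -> str:
    """Map a priority score to one of the article's three outcomes."""
    if score > 200:
        return "first planning cycle"
    if score >= 100:
        return "next quarter"
    return "kill or defer"
```

Multiplying rather than adding is what gives a single weak factor veto power: with one factor at 1, the best possible score is 5 × 5 × 5 × 1 = 125, which can never reach the first planning cycle.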

Worked Example: Killing Two Initiatives at a Mid-Size SaaS Company

A B2B SaaS company came to me with nine AI initiatives on their roadmap. Leadership wanted to start with two: an AI sales coach that would analyze call recordings and give reps real-time suggestions, and a GPT-powered contract redlining tool. Both had been demoed internally and both had executive champions.

Here is what the scoring looked like for four of the nine after a two-day discovery session:

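| Initiative | Value | Feasibility | Data-readiness | Reversibility | Score |
| --- | --- | --- | --- | --- | --- |
| AI sales coach (real-time) | 4 | 2 | 1 | 2 | 16 |
| Contract redlining (GPT) | 5 | 3 | 2 | 1 | 30 |
| Support ticket classifier | 3 | 5 | 5 | 5 | 375 |
| Churn-risk scoring (weekly batch) | 5 | 4 | 4 | 5 | 400 |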

The sales coach scored a 16. The call recordings were in three different formats across two vendors, had no consent framework for AI processing (legal blocker), and real-time inference at sub-300ms latency required infra the team had never operated. The data-readiness score of 1 alone should have killed it. The reversibility score of 2 reflected that bad real-time advice in a live sales call is visible to a customer and hard to walk back.

The contract redlining tool scored a 30. The core problem was reversibility: contract errors have legal liability, and the team had no hallucination-mitigation plan. A human-in-the-loop review layer could have raised reversibility from 1 to 4, which would have taken the score to 120 and made it fundable, but that review layer was not scoped and would have doubled the project cost. It was not a bad idea, it was just not ready.

The boring winners: a support ticket classifier (375) and a churn-risk scoring model (400). Both used structured internal data, both had human review before any action triggered, both had clear revenue links (reduced support headcount and targeted retention spend), and both could be rolled back by turning off a feature flag. The team shipped the classifier in six weeks and the churn model in ten. Both were in production before the sales coach would have finished its legal review.
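For readers who want to check the numbers, here is a short sketch that recomputes the table and the contract-redlining what-if. The factor values come straight from the table above; the variable names are only illustrative.

```python
from math import prod

# Factor order: value, feasibility, data-readiness, reversibility.
initiatives = {
    "AI sales coach (real-time)": (4, 2, 1, 2),
    "Contract redlining (GPT)": (5, 3, 2, 1),
    "Support ticket classifier": (3, 5, 5, 5),
    "Churn-risk scoring (weekly batch)": (5, 4, 4, 5),
}

# Priority score is the product of the four factors.
scores = {name: prod(factors) for name, factors in initiatives.items()}

# Highest score first.
ranked = sorted(scores, key=scores.get, reverse=True)

# What-if: a human review layer lifts contract redlining's reversibility from 1 to 4.
value, feasibility, data_readiness, _ = initiatives["Contract redlining (GPT)"]
redlining_with_review = prod((value, feasibility, data_readiness, 4))

for name in ranked:
    print(f"{scores[name]:>4}  {name}")
print(f"Contract redlining with human review: {redlining_with_review}")
```

The what-if is the useful part: changing one factor from 1 to 4 quadruples the score, which is why scoping the review layer matters more than polishing the demo.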

Data-Readiness Is the Factor Teams Lie to Themselves About

Value and feasibility are easy to score honestly because they feel abstract. Data-readiness is where wishful thinking creeps in. I have seen teams score their data a 4 because 'we have the data in the warehouse,' only to discover in week two that half the records are missing, the schema changed three times and nobody documented it, and the column they planned to use as the label was filled in inconsistently by five different sales ops people over four years.

Before scoring data-readiness above a 3, confirm all five of the following:

  • Volume: you have at least the minimum viable sample size for the task (for a classifier, that typically means 1,000 to 5,000 labeled examples per class; for RAG, at least 50 to 100 high-quality source documents per domain).
  • Labeling: the target variable exists and was produced by a consistent process, not inferred or back-filled.
  • Access: you can query the data today without a data-governance ticket that takes six weeks to resolve.
  • Legal clearance: there is no consent, privacy, or contractual barrier to using this data for model training or inference.
  • Freshness: the data distribution today resembles the data distribution you will see in production; a model trained on 2021 behavior and deployed into a market that shifted in 2024 will degrade silently.

Fail any one of these and your data-readiness score drops to 2 or below, regardless of how much data you technically have.
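One way to keep the team honest is to make the cap mechanical. The sketch below is an assumption about how you might encode the five checks, not a prescribed tool: `DataReadinessChecks` and `cap_data_readiness` are hypothetical names, and each check is reduced to a yes/no answer that a human still has to verify.

```python
from dataclasses import dataclass, fields


@dataclass
class DataReadinessChecks:
    volume: bool           # enough labeled examples or source documents
    labeling: bool         # target variable produced by a consistent process
    access: bool           # queryable today, no long governance ticket
    legal_clearance: bool  # no consent, privacy, or contractual barrier
    freshness: bool        # today's distribution resembles production


def cap_data_readiness(proposed: int, checks: DataReadinessChecks) -> tuple[int, list[str]]:
    """Return the allowed data-readiness score and the names of failed checks.

    Any failed check caps the score at 2, however much data exists.
    """
    if not 1 <= proposed <= 5:
        raise ValueError(f"score must be from 1 to 5, got {proposed!r}")
    failed = [f.name for f in fields(checks) if not getattr(checks, f.name)]
    allowed = min(proposed, 2) if failed else proposed
    return allowed, failed
```

Returning the failed check names alongside the score gives the scoring sheet a record of why a 4 became a 2, which is useful at the 30-day review.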

Reversibility Is Your Insurance Policy: Build It In from Day One

Reversibility is not just about rollback flags. It is about the blast radius when the model is confidently wrong, which it will be. The scoring factor captures three things: the severity of a bad output, the speed of detection, and the cost of correction.

A practical reversibility checklist for any AI initiative:

  • Human-in-the-loop gate: is there a human review step before the output triggers an irreversible action (sending an email, updating a contract, canceling an account)?
  • Confidence thresholding: does the system abstain or escalate when the model confidence falls below a calibrated threshold, rather than always returning an answer?
  • Observability: are you logging inputs, outputs, and confidence scores with enough context to debug a failure three weeks after it happens?
  • Eval suite: do you have a golden-set evaluation that you can run in under five minutes to detect regression before a deploy goes to production?
  • Kill switch: can you disable the AI layer with a single feature flag and fall back to the previous behavior?

Initiatives that score a 5 on reversibility typically have all five. Initiatives that score a 1 typically have none, and their authors have not thought about failure at all.
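Three of the checklist items (confidence thresholding, observability, and the kill switch) fit in one small wrapper. This is a minimal sketch under stated assumptions: `gated_predict`, `Decision`, and the `ai_enabled` flag are hypothetical names, the model is any callable returning a label and a confidence, and the 0.8 threshold is a placeholder that has to be calibrated on your own eval set.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

logger = logging.getLogger("ai_gate")


@dataclass
class Decision:
    label: Optional[str]                # None means the system abstained
    source: str                         # "model", "fallback", or "human_review"
    confidence: Optional[float] = None


def gated_predict(
    text: str,
    model: Callable[[str], Tuple[str, float]],
    fallback: Callable[[str], str],
    ai_enabled: bool,
    threshold: float = 0.8,             # placeholder value, not a recommendation
) -> Decision:
    if not ai_enabled:
        # Kill switch: one flag restores the previous, non-AI behavior.
        decision = Decision(fallback(text), "fallback")
    else:
        label, confidence = model(text)
        if confidence < threshold:
            # Abstain and escalate instead of always returning an answer.
            decision = Decision(None, "human_review", confidence)
        else:
            decision = Decision(label, "model", confidence)
    # Observability: log input, output, and confidence for later debugging.
    logger.info("input=%r decision=%r", text, decision)
    return decision
```

The wrapper does not replace the human-in-the-loop gate or the eval suite. An output tagged "model" should still pass human review before it triggers anything irreversible.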

Sequencing: How to Turn Scores Into a Quarterly Roadmap

Once every initiative has a score, sequencing follows three rules:

Rule 1: Ship a win in the first 90 days. Pick the highest-scoring initiative that can reach production (not demo, production) within 90 days. This builds organizational credibility and funds the next cycle. If no initiative can ship in 90 days, the roadmap is too ambitious and needs to be cut.

Rule 2: Run no more than two AI initiatives in parallel per engineering team. AI projects have a compounding context cost. Each additional parallel initiative degrades the team's ability to run proper evals, monitor production behavior, and respond to model drift. Two is a hard ceiling until you have a dedicated ML platform team.

Rule 3: Schedule a kill review at 30 days. For every running initiative, hold a 30-day checkpoint with the original scoring sheet. Re-score based on what you now know. If the score has dropped below 100 because a data or feasibility assumption was wrong, kill it and move to the next item in the queue. Sunk cost is not a reason to continue.

The output is a simple three-column table: this quarter (scored above 200, ships in 90 days), next quarter (scored 100-200, needs prep work), and deferred (scored below 100, revisit in six months with fresh data).
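The three rules can be sketched as a small planning function. The names below (`Initiative`, `build_roadmap`, `kill_review`) are hypothetical, and the sketch makes some assumptions the rules do not spell out: there is a single engineering team, an initiative above 200 that cannot ship in 90 days moves to next quarter, and so does any above-200 initiative beyond the two-in-parallel ceiling.

```python
from dataclasses import dataclass

MAX_PARALLEL = 2   # Rule 2: at most two AI initiatives in parallel per team
SHIP_WINDOW = 90   # Rule 1: must reach production within 90 days


@dataclass
class Initiative:
    name: str
    score: int
    days_to_production: int  # the team's own estimate


def build_roadmap(initiatives: list[Initiative]) -> dict[str, list[str]]:
    """Sort initiatives into the three-column table, highest score first."""
    roadmap: dict[str, list[str]] = {"this_quarter": [], "next_quarter": [], "deferred": []}
    for item in sorted(initiatives, key=lambda i: i.score, reverse=True):
        fits_now = (
            item.score > 200
            and item.days_to_production <= SHIP_WINDOW
            and len(roadmap["this_quarter"]) < MAX_PARALLEL
        )
        if item.score < 100:
            roadmap["deferred"].append(item.name)
        elif fits_now:
            roadmap["this_quarter"].append(item.name)
        else:
            roadmap["next_quarter"].append(item.name)
    return roadmap


def kill_review(rescored: int) -> bool:
    """Rule 3: at the 30-day checkpoint, kill anything re-scored below 100."""
    return rescored < 100
```

If `this_quarter` comes back empty, that is Rule 1 telling you the roadmap is too ambitious and needs to be cut.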

Cost Discipline and Vendor Selection in the Scoring Model

Cost does not appear as a separate scoring factor because it is already embedded in value (lower cost raises net value) and feasibility (prohibitive cost reduces feasibility). But two cost traps are worth naming explicitly.

Trap 1: Confusing API cost with total cost. A GPT-4o call costs fractions of a cent. Building reliable prompt chains, eval harnesses, fallback logic, observability, and human review workflows takes months of senior engineering time. I have seen teams approve an AI initiative based on a $200/month API estimate and then discover the true first-year cost is $300,000 once engineering is counted properly.
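The gap is easier to see as arithmetic. The sketch below uses the two figures from the example above and assumes the $200/month API estimate holds flat for twelve months and is included in the $300,000 total; `first_year_breakdown` is a hypothetical helper name.

```python
def first_year_breakdown(api_monthly_usd: float, first_year_total_usd: float) -> dict[str, float]:
    """Split a first-year total into API spend and everything else."""
    api_year = api_monthly_usd * 12
    return {
        "api_usd": api_year,
        "non_api_usd": first_year_total_usd - api_year,
        "api_share": api_year / first_year_total_usd,
    }


breakdown = first_year_breakdown(200, 300_000)
print(f"API: ${breakdown['api_usd']:,.0f}")
print(f"Everything else: ${breakdown['non_api_usd']:,.0f}")
print(f"API share of total: {breakdown['api_share']:.1%}")
```

Under those assumptions the API line is $2,400 of $300,000, or 0.8% of the first-year cost. An approval based on the API estimate alone is pricing less than one percent of the project.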

Trap 2: Defaulting to the most capable model. For a support ticket classifier, you do not need GPT-4o. On a narrow, well-defined task with good training data, a fine-tuned smaller model, or even a well-prompted model in the GPT-4o-mini class, will often match or beat a more expensive model at roughly one-tenth the inference cost. The right question is not 'which model is best' but 'which model is sufficient for this task and this latency requirement at this cost point.'

On vendor selection: build your scoring model, pick your top three initiatives, then select tooling. Not the other way around. Committing to a vendor or a model before you know what you are building is one of the most common and most expensive mistakes I see.

Frequently Asked Questions

How do I prioritize which AI initiatives to do first?

Score each initiative on value (business impact), feasibility (engineering complexity), data-readiness (is the data clean, labeled, and accessible today), and reversibility (how bad is a wrong answer and can you roll back). Multiply the four scores together. Ship the highest-scoring initiative that can reach production in 90 days. Kill anything below 100 without guilt.

What makes an AI initiative high priority vs low priority?

High-priority initiatives have a direct revenue or cost link, use structured data you already own, require capabilities your team already has or can acquire in days, and fail gracefully when the model is wrong. Low-priority initiatives are high on impressiveness and low on all four of those dimensions. The flashiest demo is almost never the highest-priority initiative.

How do I build an AI strategy without getting distracted by hype?

Apply the scoring model before any demo reaches leadership. If an initiative scores below 100, do not let it into the roadmap regardless of how compelling the demo was. The scoring model exists precisely to create a defensible, repeatable reason to say no that is not personal and is not political.

How long should the first AI initiative take to ship?

If it cannot reach production (not a demo, production with real users and real monitoring) in 90 days, it is too large for a first initiative. Break it down or pick a smaller scope. The 90-day target is not arbitrary: it is the minimum cycle for organizational credibility. Miss it and the next AI proposal gets twice the skepticism.

When should I use RAG vs fine-tuning vs a pre-trained model out of the box?

Start with a pre-trained model and good prompting. If accuracy on your specific task is still below your threshold after prompt engineering, add RAG if the gap is knowledge-related (the model does not know your domain). Fine-tune only if the gap is style- or format-related (the model knows the facts but cannot produce the right output shape) and you have at least 500 to 1,000 high-quality labeled examples. Most production use cases are solved at the RAG layer or earlier. Fine-tuning is expensive and maintenance-heavy; reserve it for when you have evidence it is needed.
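That decision order can be written down as a short function. This is a sketch, not a rule engine: `Gap` and `next_step` are hypothetical names, the 500-example default is the lower end of the range quoted above, and the final fallback (keep iterating on prompts or collect more labels) is an assumption for cases this answer does not cover.

```python
from enum import Enum


class Gap(Enum):
    KNOWLEDGE = "knowledge"              # the model does not know your domain
    STYLE_OR_FORMAT = "style_or_format"  # knows the facts, wrong output shape


def next_step(
    accuracy: float,
    target: float,
    gap: Gap,
    labeled_examples: int,
    min_examples: int = 500,  # lower end of the 500-to-1,000 range
) -> str:
    """Suggest the cheapest next move, in the order described above."""
    if accuracy >= target:
        return "stay with the pre-trained model and prompting"
    if gap is Gap.KNOWLEDGE:
        return "add RAG"
    if gap is Gap.STYLE_OR_FORMAT and labeled_examples >= min_examples:
        return "fine-tune"
    # Assumption: not enough labeled data to justify fine-tuning yet.
    return "iterate on prompts or collect more labeled examples"
```

For example, `next_step(0.78, 0.90, Gap.KNOWLEDGE, labeled_examples=0)` suggests adding RAG, while the same accuracy with a format gap and only 200 labeled examples falls through to the last branch.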

what does an AI roadmap engagement with a consultant actually deliver?

In my AI strategy engagements, the deliverable is a scored initiative backlog, a 90-day first-ship plan, a data-readiness report for the top three initiatives, and a technical architecture sketch for each. The scoring process itself surfaces assumptions that would otherwise become expensive surprises at week six. Most teams leave with fewer initiatives than they came in with, which is the point.

Ready to Build a Roadmap That Ships?

If your organization has a list of AI ideas and no disciplined way to choose between them, the scoring model above is where to start. Apply it to your current list this week. If every initiative scores below 200, that is important information: you need better data infrastructure or a more constrained scope before any of them are worth funding.

If you want a senior independent perspective on your specific initiatives, I do focused AI strategy and roadmap engagements as an independent consultant. No agency overhead, no upsell pressure, direct access to someone who has built and shipped production AI systems. You can reach me at the contact page or review my work at /projects.


Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
