From "We Should Use AI" to a 90-Day Roadmap: A Step-by-Step Plan

Turn 'We Should Use AI' Into a 90-Day Plan

The answer is a five-phase sequence: one week of ruthless scoping, two weeks of building the eval harness, roughly three and a half weeks of shipping a single thin vertical slice into production, two weeks of evals and observability, and then a scaling decision grounded in real data. You do not need a steering committee, a vendor bake-off, or a pilot programme that never ships. You need one use case live, measured, and defensible by day 45.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software. I created Laradock (Docker tooling used by millions of developers) and Apiato, and I founded Sista AI. I run a solo AI consultancy that helps engineering teams and founders go from 'we should use AI' to working systems in production, without the six-month discovery theatre. You can read more about my background on the about page.

Why Most AI Roadmaps Never Leave the Whiteboard

The failure mode is always the same: a company holds three workshops, produces a twenty-slide strategy deck, picks five use cases, and tries to run them in parallel. Six months later, nothing is live and someone is proposing a new round of discovery.

The root causes are predictable:

Too many use cases open at once. Teams spread thin across five ideas deliver zero production value. One focused team delivers one working system.
No forcing function. Without a hard deadline for something real in production, 'the pilot' becomes a permanent state. Pilots do not create organizational learning. Production systems do.
Evals written after the fact. Teams build, demo, get excited, and only then ask 'how do we know it is working?' By that point, the goalposts have moved and there is no baseline to compare against.
Wrong first use case. Teams pick the highest-value use case, which is almost always the hardest. The right first use case is the one you can instrument, evaluate, and ship in four weeks with the team you already have.

The 90-day plan below is designed to eliminate all four failure modes. It is not a template. It is a forcing function.

The Week-by-Week 90-Day Sequence

Days 1 to 7: Scoping Sprint

The goal of week one is not ideation. It is elimination. You arrive at the end of week one with exactly one use case selected, one measurable success metric defined, and one team member who owns it.

The scoping criteria I use:

Data ready today. If you need six weeks to get data access, that use case is not first.
Evaluable automatically. You must be able to write an eval harness before you write a line of model code. If you cannot define correctness without a human reading every output, that use case is not first.
User-facing or operator-facing, not internal research. The feedback loop needs to be tight. Internal tools where no one will notice if it is slightly wrong are graveyard bait.
Bounded scope. One input type, one output type, one workflow step. Not 'AI-powered onboarding.' Something like: classify inbound support tickets into seven categories with a confidence score, and route the low-confidence ones to a human queue.

Week one deliverable: a one-page scoping document with the use case, the eval metric (precision/recall, BLEU, LLM-as-judge score, human review rate, whatever is appropriate), the data source, and the 'done' definition for day 45.

Days 8 to 21: Eval Harness Before Model Code

Before you call a single API, you build the eval harness. This is the step teams skip and always regret.

A minimal eval harness for a classification task looks like this: a golden dataset of 200 to 500 labelled examples drawn from real production data, a script that runs the model against that dataset and reports precision, recall, and a confusion matrix, and a threshold definition: what score is 'good enough to ship' versus 'needs human review' versus 'do not use.'
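
As a minimal sketch of that harness, assume the model has already been run over the golden dataset and you hold a list of (gold label, predicted label) pairs. The function names, the tiny sample data, and the two threshold values are illustrative assumptions, not numbers from a real project:

```python
from collections import Counter

def confusion_matrix(pairs):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(pairs)

def per_class_metrics(pairs):
    """Return {label: (precision, recall)} from (gold, predicted) pairs."""
    cm = confusion_matrix(pairs)
    labels = {g for g, _ in pairs} | {p for _, p in pairs}
    metrics = {}
    for label in labels:
        tp = cm[(label, label)]
        predicted = sum(n for (_, p), n in cm.items() if p == label)
        actual = sum(n for (g, _), n in cm.items() if g == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

def verdict(score, ship_at=0.92, review_at=0.80):
    """Map a score onto the three documented outcomes (thresholds are hypothetical)."""
    if score >= ship_at:
        return "ship"
    if score >= review_at:
        return "needs human review"
    return "do not use"

# Tiny illustrative golden set: (gold, predicted)
golden = [
    ("billing", "billing"), ("billing", "login"),
    ("login", "login"), ("login", "login"),
]
```

In practice you would load the 200 to 500 labelled examples from a file and agree the `ship_at` and `review_at` values with stakeholders before looking at any model output.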

If your task is generative (summarisation, drafting, extraction), your eval harness includes an LLM-as-judge scorer with a rubric you have validated against 50 human-judged examples. The rubric needs to be specific enough that two annotators agree on 85% of cases. If they do not agree, your rubric is not specific enough, and your use case is probably not scoped tightly enough.
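
The 85% agreement check can be sketched as a few lines of Python. This assumes each annotator's rubric verdicts are stored as a list in the same example order. It computes raw percent agreement, not a chance-corrected statistic such as Cohen's kappa, and the function names are hypothetical:

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of examples on which two raters gave the same verdict."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def rubric_is_specific_enough(labels_a, labels_b, required=0.85):
    """Apply the 85% agreement bar to two annotators' rubric verdicts."""
    return agreement_rate(labels_a, labels_b) >= required
```

The same `agreement_rate` function can compare the LLM judge's verdicts against the 50 human-judged examples when you validate the rubric.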

Week two to three deliverable: eval harness running, baseline score established using a simple heuristic or a fine-tuned smaller model as the lower bound, and the 'ship threshold' documented and agreed on with your stakeholders.

Days 22 to 45: Thin Vertical Slice Into Production

Now you build. The mandate is a thin vertical slice: the simplest possible version of the feature that exercises the full stack from input to model to output to the user, with logging, with guardrails, with a fallback path.

What 'production' means here is not 'everyone uses it.' It means real users, real data, real load, with observability. A 5% traffic slice is production. A shadow mode where the model runs but the output goes to a dashboard for human review is production. An internal team using the tool for their actual work is production.

What it does not mean: a demo environment, a Jupyter notebook, a static screenshot, a Slack channel where you post outputs manually.

The engineering checklist for this phase:

Structured logging on every model call: input hash, model version, latency, token count, cost, eval score if you can run it cheaply inline
Guardrails at input (length limits, PII stripping, content classification if user-generated) and output (length, format validation, refusal detection)
A human-in-the-loop escape hatch: any output below your confidence threshold routes to a human queue, not to the user
A rollback switch: a feature flag that cuts the model out of the path entirely with one config change, no deploy required
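
A rough sketch of how three of those checklist items (structured logging, the escape hatch, and the rollback switch) might fit around a single model call. The environment variable name, the shape of the `classify` return value, and the 0.78 default (borrowed from the worked example later in this article) are all assumptions for illustration:

```python
import hashlib
import json
import os
import time

def model_enabled():
    """Rollback switch: read at call time, so flipping it needs no deploy.
    The variable name is an assumption for this sketch."""
    return os.environ.get("AI_CLASSIFIER_ENABLED", "true").lower() == "true"

def log_record(text, model_version, latency_ms, tokens, cost_usd):
    """Build one structured log line; the raw input is hashed, not stored."""
    return json.dumps({
        "input_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "latency_ms": round(latency_ms, 1),
        "tokens": tokens,
        "cost_usd": cost_usd,
    })

def handle(text, classify, threshold=0.78, emit=print):
    """Run the model if enabled; send low-confidence results to the human queue."""
    if not model_enabled():
        return {"route": "human_queue", "reason": "model_disabled"}
    start = time.perf_counter()
    # classify is a hypothetical callable returning a dict with
    # label, confidence, tokens, cost_usd, and model_version.
    result = classify(text)
    latency_ms = (time.perf_counter() - start) * 1000
    emit(log_record(text, result["model_version"], latency_ms,
                    result["tokens"], result["cost_usd"]))
    if result["confidence"] < threshold:
        return {"route": "human_queue", "reason": "low_confidence"}
    return {"route": "auto", "label": result["label"]}
```

Passing `emit` as a parameter keeps the sketch independent of any particular logging backend; in a real service it would be your log shipper rather than `print`.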

Day 45 deliverable: the feature is live on real traffic, the eval harness is running against a sample of production outputs daily, and you have a cost-per-task number.

Days 46 to 60: Evals, Observability, and the First Honest Retrospective

This is the phase most teams skip on the way to 'scaling.' Do not skip it.

You now have two weeks of production data. The questions you answer in this phase:

Does the eval score on production data match your golden dataset score? If not, why? Distribution shift is the most common answer: your golden set was not representative of real inputs.
What is the actual human review rate? If it is higher than you projected, you need to understand whether the threshold is wrong, the model is wrong, or the use case is harder than it looked.
What is the cost per task versus the value per task? For a support ticket classifier routing 1000 tickets per day, you need a number like '$0.003 per ticket' and a comparison to the manual triage cost.
What are the failure modes you did not anticipate? Run a sample of human-reviewed outputs through a failure analysis. Cluster the errors. The top two or three error clusters become your improvement backlog.
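
For a classification task, the error clustering step can start as simply as counting which (predicted, corrected) pairs occur most often in the human-reviewed sample. This is a sketch under that assumption; the record fields are hypothetical, and free-text failure analysis for generative tasks would need more than a pair count:

```python
from collections import Counter

def top_error_clusters(reviews, n=3):
    """Group misclassifications by (predicted, corrected) pair; return the n largest."""
    errors = Counter(
        (r["predicted"], r["corrected"])
        for r in reviews
        if r["predicted"] != r["corrected"]
    )
    return errors.most_common(n)

def score_gap(golden_score, production_score):
    """Positive gap: production is doing worse than the golden set suggested."""
    return round(golden_score - production_score, 4)
```

A large positive `score_gap` combined with one dominant error pair is a typical signature of the distribution shift described above.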

Day 60 deliverable: a two-page honest retrospective with the real eval numbers, the real cost, the real human review rate, and a backlog of improvements tagged by type and ranked by impact.

Days 61 to 90: Scaling Decision and Next Use Case Selection

By day 60, you have enough data to make a real decision. The options are:

Scale this use case. If eval scores are at target, cost is acceptable, and human review rate is below threshold, you expand traffic, harden the integration, and potentially fine-tune to reduce cost or improve edge case handling.
Improve before scaling. If there is a clear, bounded improvement that would move a specific metric, you do that first. One sprint, one metric, re-evaluate.
Retire and move on. If the use case is fundamentally harder than the data suggested, or the economics do not work, you stop. This is not failure. This is the system working correctly. You learned cheaply. A six-month pilot graveyard would have cost ten times as much to reach the same conclusion.
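
The three options can be written down as an explicit rule so the decision is made on numbers rather than mood. A minimal sketch, assuming the targets were fixed in the scoping document; the parameter names are hypothetical, and the judgment about whether a bounded fix exists stays with a human:

```python
def scaling_decision(precision, review_rate, cost_per_task, *,
                     target_precision, max_review_rate, max_cost_per_task,
                     bounded_fix_identified):
    """Return "scale", "improve", or "retire" from the day-60 numbers."""
    at_target = (
        precision >= target_precision
        and review_rate < max_review_rate
        and cost_per_task <= max_cost_per_task
    )
    if at_target:
        return "scale"
    if bounded_fix_identified:
        return "improve"  # one sprint, one metric, then re-evaluate
    return "retire"
```

Writing the rule before day 60 makes it harder to move the goalposts once the real numbers arrive.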

In parallel, starting around day 75, you run a second scoping sprint for the next use case. This time it is easier because you have a working eval harness pattern, a logging infrastructure, guardrail patterns, and organizational credibility from the first shipped system.

Day 90 deliverable: a scaling decision documented with the supporting data, and a scoping document for the second use case.

Worked Example: Support Ticket Classifier

Here is the full 90-day plan applied to a concrete case: a B2B SaaS company with 800 to 1200 inbound support tickets per day, a five-person support team, and an average first-response time of four hours.

Days 1 to 7: Scoping

Use case selected: classify tickets into nine categories (billing, login, integration, data export, API error, feature request, abuse/spam, onboarding, other) and assign a confidence score. Low-confidence tickets go to a 'needs human triage' queue. Success metric: human triage rate below 15% (meaning 85% of tickets are classified with enough confidence to route automatically), with precision above 92% on the auto-routed tickets.

Days 8 to 21: Eval Harness

500 tickets labelled by the support team lead. A Python script that calls the model with a structured prompt, logs the predicted category and confidence, and computes precision, recall, and the confusion matrix per category. Baseline using keyword matching: 61% precision. Target: 92% precision at 85% auto-route rate.
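
One way to connect the precision target to a confidence cut-off is to sweep candidate thresholds over the labelled set and take the lowest one that meets the target, since the lowest passing threshold auto-routes the most tickets. This is a sketch of that idea, not the script used in the example; it assumes each prediction is stored as a (confidence, is_correct) pair:

```python
def sweep_threshold(predictions, target_precision=0.92):
    """predictions: list of (confidence, is_correct) pairs.
    Returns (threshold, precision, auto_route_rate) for the lowest
    threshold that meets the precision target, or None if none does."""
    for threshold in sorted({conf for conf, _ in predictions}):
        routed = [ok for conf, ok in predictions if conf >= threshold]
        precision = sum(routed) / len(routed)
        if precision >= target_precision:
            return threshold, precision, len(routed) / len(predictions)
    return None
```

If the returned auto-route rate falls short of 85%, the targets cannot both be met with the current prompt and labels, which is worth knowing before anything ships.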

Days 22 to 45: Production Slice

A webhook on the ticketing system calls a lightweight service that classifies the ticket, attaches the category and confidence as metadata, and routes it if confidence exceeds 0.78. Below 0.78, the ticket goes to the human triage queue with the model's top two guesses shown as suggested categories. Logs go to Datadog with a custom dashboard. Cost: $0.0028 per ticket at GPT-4o-mini pricing at the time, or about $2.24 per day at 800 tickets.
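
The routing rule and the daily cost figure can be sketched as follows. It assumes the classifier returns a confidence per category; the function names and the returned dictionary shape are hypothetical, and a ticket at exactly 0.78 is sent to triage here because the text only says "exceeds":

```python
def route_ticket(scores, threshold=0.78):
    """scores: {category: confidence}. Auto-route above the threshold,
    otherwise queue for human triage with the top two guesses attached."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_conf = ranked[0]
    if top_conf > threshold:
        return {"queue": top_label, "confidence": top_conf}
    return {"queue": "human_triage",
            "suggested": [label for label, _ in ranked[:2]]}

def daily_cost(tickets_per_day, cost_per_ticket=0.0028):
    """Model cost per day in dollars, rounded to cents."""
    return round(tickets_per_day * cost_per_ticket, 2)
```

At the quoted per-ticket price, `daily_cost(800)` gives 2.24 and `daily_cost(1200)` gives 3.36, which brackets the example's daily volume.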

Days 46 to 60: Honest Retrospective

Production precision: 89% (below the 92% target). Human triage rate: 19% (above the 15% target). Root cause: 'integration' tickets split into two clusters the labelling missed: Zapier integrations versus native API integrations. The model was confused by the overlap. Fix: relabel 80 examples and add a subcategory split. After one week of relabelling and a prompt update, precision moves to 93%, triage rate to 13%.

Days 61 to 90: Scale and Next Use Case

Traffic expanded to 100%. First-response time drops from 4 hours to 38 minutes for auto-routed tickets. Second use case scoped: draft a suggested reply for the most common category (login issues, 22% of tickets), using the ticket text and the user's account data as context. Scoping sprint starts day 75.

What Teams Get Wrong at Each Phase

Phase	Common Mistake	Consequence
Scoping	Picking the highest-value use case first	Three months of work, nothing shipped, stakeholder trust gone
Eval harness	Skipping it and using 'vibe checks'	No way to detect regression, no basis for the scaling decision
Production slice	Calling a demo environment 'production'	No real feedback, no cost data, no distribution shift signal
Retrospective	Skipping it to go straight to scaling	Scaling a broken system, compounding the errors
Scaling decision	Scaling on stakeholder enthusiasm instead of eval data	Reliability incidents, cost overruns, loss of user trust

Retrieval, Tool Calling, and MCP: When to Add Them

A mistake I see constantly: teams decide they need RAG and an MCP server before they have a working baseline. Retrieval and tool calling are complexity multipliers. Add them only when the baseline system has a measured, specific gap that they fix.

The decision tree is simple: if your model is failing because it lacks access to information that exists in your systems (knowledge bases, live data, user-specific context), add retrieval. If it is failing because it needs to take an action (write to a database, call an external API, update a record), add tool calling. If it is failing for a reason you have not diagnosed yet, go back to the eval harness.

For MCP specifically: it is valuable when you have multiple agents or tools that need to share context and capabilities in a standardised way. It is not valuable as a first step. Get the single-use-case system working and evaluated before you standardise the infrastructure for ten use cases.

The same applies to fine-tuning. Fine-tuning is a late-stage optimisation, not a starting point. The sequence is: prompt engineering first, then retrieval augmentation if needed, then fine-tuning if you have a large labelled dataset and a specific, measurable gap that prompting cannot close.

Cost, Security, and Guardrails in the First 90 Days

Cost

Track cost per task from day one, not cost per month. 'We spent $400 this month on the AI feature' is not actionable. '$0.003 per ticket classified, and manual triage costs $0.85 per ticket, and we are auto-routing 85% of 30,000 tickets per month' is a business case. Build the cost-per-task metric into your logging from the first day of the production slice.
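
To make the arithmetic behind that business case explicit, here is a small sketch using the numbers from the paragraph above. It assumes the model is called on every ticket while manual triage cost is only avoided for the auto-routed share; the function name and the returned fields are hypothetical:

```python
def monthly_business_case(tickets, auto_route_rate,
                          model_cost_per_ticket, manual_cost_per_ticket):
    """Compare monthly model spend with the manual triage cost it replaces."""
    model_cost = tickets * model_cost_per_ticket
    auto_routed = tickets * auto_route_rate
    manual_cost_avoided = auto_routed * manual_cost_per_ticket
    return {
        "model_cost": round(model_cost, 2),
        "auto_routed": round(auto_routed),
        "manual_cost_avoided": round(manual_cost_avoided, 2),
        "net_saving": round(manual_cost_avoided - model_cost, 2),
    }

case = monthly_business_case(30_000, 0.85, 0.003, 0.85)
```

With those inputs the model costs about $90 per month against roughly $21,675 of manual triage avoided on 25,500 auto-routed tickets.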

Model selection matters more than most teams realise. For a classification task with a well-engineered prompt and a golden dataset for few-shot examples, a smaller, faster model (GPT-4o-mini, Haiku, Gemini Flash) will often match a frontier model at one-tenth the cost. Test on your eval harness. Let the numbers decide.
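
"Let the numbers decide" can be as simple as picking the cheapest model that clears the ship threshold on the eval harness. A sketch, assuming each candidate has already been scored; the model names and figures in the usage example are placeholders, not benchmark results:

```python
def pick_model(results, ship_threshold):
    """results: list of dicts with name, score, and cost_per_task.
    Returns the name of the cheapest model meeting the threshold, or None."""
    passing = [r for r in results if r["score"] >= ship_threshold]
    if not passing:
        return None
    return min(passing, key=lambda r: r["cost_per_task"])["name"]

# Placeholder numbers for illustration only.
candidates = [
    {"name": "small-model", "score": 0.93, "cost_per_task": 0.003},
    {"name": "frontier-model", "score": 0.95, "cost_per_task": 0.03},
    {"name": "tiny-model", "score": 0.80, "cost_per_task": 0.001},
]
```

Here `pick_model(candidates, 0.92)` returns `"small-model"`: the frontier model scores higher, but the extra points are not needed to ship.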

Security and Guardrails

For user-facing AI features in production, the minimum viable guardrail set is:

Input length and format validation. Hard limits on input length. Structural validation where the input type is known (JSON schema validation, for example).
PII detection on inputs. If your use case involves user-generated text going to an external model API, you need to either strip PII before sending or use a model deployed in your own infrastructure.
Output format validation. If the model is supposed to return structured data, validate the structure. Do not pass unvalidated model output to downstream systems.
Refusal and off-topic detection. For any use case where the model could be prompted to behave in ways that are off-scope, add a lightweight classifier or a prompt-based check on the output before it reaches the user.
Rate limiting. Per-user and per-tenant limits on model calls. An unprotected AI feature is an open cost sink.
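
A compact sketch of four of these guardrails using only the standard library. The length limit, the label set, and the email-only regex are assumptions for illustration; real PII detection needs to cover far more than email addresses, and a multi-process service would need a shared store for rate limits:

```python
import json
import re
import time
from collections import defaultdict, deque

MAX_INPUT_CHARS = 4000                           # hypothetical hard limit
ALLOWED_LABELS = {"billing", "login", "other"}   # illustrative subset
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def check_input(text):
    """Enforce length limits and mask email addresses before the model call."""
    if not text.strip():
        raise ValueError("empty input")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    return EMAIL_RE.sub("[EMAIL]", text)

def parse_output(raw):
    """Validate that the model returned the expected JSON structure."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label, confidence = data.get("label"), data.get("confidence")
    if label not in ALLOWED_LABELS:
        raise ValueError("unexpected label")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a number between 0 and 1")
    return label, float(confidence)

class RateLimiter:
    """Sliding-window limit on model calls per user (in-memory only)."""
    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = defaultdict(deque)

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        recent = self.calls[user_id]
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # drop calls that fell out of the window
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True
```

Anything that raises in `check_input` or `parse_output` should take the fallback path rather than reach the user or a downstream system.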

Human-in-the-Loop Is Not a Fallback, It Is the Design

The instinct is to treat human review as the failure case: the model failed, so a human steps in. That framing is backwards. Human-in-the-loop is the architecture, especially in the first 90 days.

Every AI system in production should have a defined set of conditions under which it escalates to a human: low confidence, edge case patterns, high-stakes outputs (anything involving money, legal language, medical context, account changes). The human queue is not the consolation prize. It is the mechanism by which the system gets better over time.

The outputs your human reviewers handle are your most valuable training data. Log every human-reviewed output with the reviewer's decision. That log is the next iteration of your golden dataset. If you are not collecting it, you are leaving the most important signal on the floor.

In the 90-day plan, I budget 5 to 10% of total scope for building the human review interface. It is usually a simple queue with the model's output, the confidence score, the top alternative guesses, and a one-click accept/edit/reject. Nothing fancy. But it has to exist from day one of the production slice.
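
The review log can feed the golden dataset with very little machinery. A minimal sketch, assuming a classification task and a one-click accept/edit/reject interface; the record fields and function names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Review:
    item_id: str                  # reference to the original input
    model_label: str
    confidence: float
    decision: str                 # "accept", "edit", or "reject"
    corrected_label: Optional[str] = None

def to_golden_example(review):
    """Turn one reviewer decision into a labelled example, if it yields one."""
    if review.decision == "accept":
        return {"item_id": review.item_id, "label": review.model_label}
    if review.decision == "edit" and review.corrected_label:
        return {"item_id": review.item_id, "label": review.corrected_label}
    # A bare reject says the label was wrong, not what the right one is.
    return None

def extend_golden_set(golden, reviews):
    """Return a new golden set with every usable review appended."""
    new_examples = [ex for ex in map(to_golden_example, reviews) if ex is not None]
    return golden + new_examples
```

Because the queue only sees low-confidence and escalated items, examples gathered this way skew toward hard cases, so it is worth tracking them separately from the original random sample.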

Frequently Asked Questions

How long does the scoping phase actually take and can I skip it?

One week, and no. The scoping week is the highest-leverage week of the entire 90 days. Teams that skip it and go straight to building almost always build the wrong thing or build the right thing without the eval harness, which means they cannot measure whether it is working. One week of structured scoping eliminates months of wasted build time. If leadership pressure is forcing you to skip it, that is a governance problem, not a timeline problem, and it needs to be surfaced explicitly.

What if we do not have enough labelled data to build an eval harness?

200 examples is enough for a first eval harness on most classification or extraction tasks. If you genuinely cannot get 200 labelled examples in two weeks, that is a signal that your data access problem is the real blocker, not the AI work. Fix the data access first. For generative tasks where labelling is expensive, an LLM-as-judge approach with a validated rubric can substitute for large human-labelled datasets, but you still need 50 human-judged examples to validate the rubric itself.

Do I need a dedicated ML engineer to run this plan?

No. A senior backend engineer with Python skills and API integration experience can execute this plan. The eval harness is a Python script. The production slice is an API integration with logging. The hard part is not the engineering; it is the discipline: writing the evals before the model code, doing the retrospective honestly, and making the scaling decision on data rather than enthusiasm. Those are process and judgment issues, not machine learning engineering issues.

How do I get stakeholder buy-in for the 90-day timeline?

Show the failure mode you are avoiding. Most stakeholders have seen at least one AI pilot that ran for six months and produced a demo. The 90-day plan promises something different: a real system on real traffic with a real cost number and a real eval score by day 45, and a documented scaling decision by day 90. That is a concrete, verifiable commitment. Compare it to the alternative: a multi-use-case strategy programme with no production system until month six. The 90-day plan is the faster path to a defensible outcome.

What happens if the first use case fails?

If 'fails' means the eval scores never reach target and the economics do not work, you stop the use case at the day-60 retrospective, document what you learned, and select a different first use case. You have spent 60 days and a small amount of compute cost. That is a cheap lesson. The alternative, running a six-month programme on a use case that was never viable, is far more expensive. The 90-day plan is designed to make failure fast and cheap, not to guarantee success on the first pick.

Should I use an off-the-shelf AI platform or build the integration myself?

For the first 90 days, integrate directly with a model API and own the integration code. Off-the-shelf platforms add abstraction layers that make it harder to instrument, debug, and understand what is happening. Once you have a working, evaluated system and a clear picture of where the complexity lives, you can make an informed decision about whether a platform layer saves you time or adds cost without proportional value. Do not make that decision on day one based on vendor demos.

Ready to Build Your 90-Day AI Roadmap?

The difference between a 90-day roadmap that ships and one that dies in committee is someone who has done this before: someone who keeps the scoping honest and the evals rigorous, and who stops stakeholder pressure from forcing premature scaling decisions. If you want a working AI system in production by day 45 and a scaling plan grounded in real data by day 90, that is exactly what I do through my AI consultancy practice.

I work with engineering teams and founders as a solo independent architect, not as an agency. Engagements are direct and hands-on. You can see examples of what I have built on the projects page and read more about how I work on the about page. If you are ready to scope your first use case, get in touch and we can run the scoping sprint together.
