The First 30-60-90 Days of a Fractional AI Officer: A Concrete Deliverables Plan

What a Fractional AI Officer Should Deliver in the First 90 Days

A fractional AI officer should deliver three concrete things in the first 90 days: a documented audit and at least one quick win by day 30; a costed, prioritized roadmap with guardrails and a signed governance charter by day 60; and at least one shipped pilot with evals in CI, observability, and the charter in operation by day 90. If you reach day 90 with only strategy documents, the engagement has failed.

I am Mahmoud Zalt, an independent AI systems architect with 16+ years building production software. As founder of Sista AI, I have spent the past year keeping a workforce of autonomous agents running in production, which is the same discipline a 30-60-90 plan demands. I have worked as a Fractional AI Officer for companies ranging from 20-person startups to 500-person scale-ups. This plan is what I actually execute, not what sounds good in a proposal. Learn more about my background or see what I have shipped.

Why Most Fractional AI Engagements Fail Early

The failure pattern is almost always the same: the officer spends the first month in meetings, produces a 40-slide strategy deck, and then discovers no one has budget approval authority or a working data pipeline. Month two becomes unblocking month one. By month three the company questions the ROI.

The fix is sequencing. You must complete the audit before you propose roadmap items, because the audit will kill half the ideas you walked in with. You must ship one working pilot before you ask leadership to fund a second one, because a live demo destroys more organizational resistance than any presentation.

What Teams Get Wrong at the Start

Skipping the data audit. Proposing RAG pipelines before knowing whether the document corpus is versioned, access-controlled, or even clean is the fastest way to build something that cannot go to production.
Treating 'AI strategy' as the deliverable. Strategy is an input to the plan. The deliverable is a working system with measured outcomes.
Choosing the wrong first pilot. High-visibility, high-complexity pilots fail publicly. The first pilot should be low-stakes, high-frequency, and already have a baseline metric to beat.
No eval framework on day one. If you cannot measure the model's output quality before you ship, you cannot defend the system when something goes wrong.

Days 1 to 30: Audit, Context, and One Quick Win

The first 30 days are about learning fast and demonstrating that the role is not just advisory. Every conversation should produce a documented artifact. Every artifact should feed the roadmap that ships in week six.

Week 1 and 2: The Technical and Organizational Audit

I run a structured audit across five dimensions. Each one becomes a scored section in the audit report that gets presented to leadership at day 30.

Dimension	What I Assess	Output
Data readiness	Schema quality, access control, versioning, PII presence, volume and freshness	Data readiness score (1-5) with blockers listed
Current AI usage	Existing tools, vendor contracts, shadow AI usage, prompt engineering maturity	Inventory of live AI touchpoints and cost per month
Infrastructure	Cloud provider, MLOps tooling, CI/CD maturity, secrets management, observability stack	Gap list with effort estimates
Team capability	Who can prompt, who can fine-tune, who owns production incidents	Skills matrix and hiring/training needs
Risk and compliance	Data residency, GDPR/SOC2 scope, third-party model data handling, acceptable-use policy existence	Risk register with severity ratings

Week 3: Quick Win Selection and Execution

By day 15 I have enough context to pick one quick win. The selection criteria: the process happens more than 20 times per week, the team currently spends more than 30 minutes per instance, and a working prototype can be built in under 40 hours of engineering time. A support ticket triage classifier, a first-pass code review summarizer, or an internal document Q&A over a bounded corpus all fit this profile.
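The three selection thresholds can be applied as a simple filter over the candidate list. The following is a minimal sketch, not a prescribed tool: the `Candidate` fields, the helper names, and the sample candidates are all hypothetical and chosen only to illustrate the criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate quick win. Field names are hypothetical."""
    name: str
    runs_per_week: int         # how often the process happens
    minutes_per_instance: int  # current manual time per run
    build_hours: int           # estimated engineering time for a prototype


def is_quick_win(c: Candidate) -> bool:
    """Apply the three thresholds from the text."""
    return (
        c.runs_per_week > 20
        and c.minutes_per_instance > 30
        and c.build_hours < 40
    )


def weekly_hours_at_stake(c: Candidate) -> float:
    """Manual hours per week the process currently consumes."""
    return c.runs_per_week * c.minutes_per_instance / 60


# Illustrative values only, not taken from a real audit.
candidates = [
    Candidate("ticket triage classifier", 60, 35, 32),
    Candidate("quarterly board report drafting", 1, 480, 20),
    Candidate("contract clause extraction", 40, 45, 120),
]

# Keep only candidates that pass all three checks, largest time sink first.
shortlist = sorted(
    (c for c in candidates if is_quick_win(c)),
    key=weekly_hours_at_stake,
    reverse=True,
)
for c in shortlist:
    print(f"{c.name}: {weekly_hours_at_stake(c):.1f} manual hours/week")
```

Sorting by hours at stake is one reasonable tie-breaker when several candidates pass; the text itself only specifies the three thresholds.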

The quick win ships as a real, if limited, system. Not a demo. It connects to real data, runs in a staging environment, and has at least one eval: a human-reviewed sample of 50 outputs rated good or bad. That eval baseline is used every sprint from this point forward.
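The baseline described here is just a pass rate over human labels. A minimal sketch, assuming the reviewers' ratings are stored as the strings "good" and "bad" (the storage format and the function name `eval_baseline` are assumptions, and the sample ratings are made up):

```python
from collections import Counter


def eval_baseline(ratings: list[str], min_sample: int = 50) -> dict:
    """Summarize human good/bad ratings into a baseline record."""
    counts = Counter(ratings)
    unknown = set(counts) - {"good", "bad"}
    if unknown:
        raise ValueError(f"unexpected rating labels: {sorted(unknown)}")
    if len(ratings) < min_sample:
        raise ValueError(
            f"need at least {min_sample} rated outputs, got {len(ratings)}"
        )
    return {
        "sample_size": len(ratings),
        "good": counts["good"],
        "bad": counts["bad"],
        "pass_rate": counts["good"] / len(ratings),
    }


# Illustrative data only: 41 good and 9 bad out of 50 reviewed outputs.
ratings = ["good"] * 41 + ["bad"] * 9
baseline = eval_baseline(ratings)
print(baseline)
```

Storing the record alongside the prompt and model version it was measured on makes later sprint-over-sprint comparisons meaningful.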

Week 4: Audit Presentation

The day-30 audit report contains: the five-dimension audit scorecard, the quick win in staging with its eval results, a list of 8 to 12 candidate roadmap items with a rough effort and impact matrix, and the three biggest blockers that need executive action. The presentation is 20 minutes, not a deck marathon. Decisions are made in the room.

Days 31 to 60: Costed Roadmap and Guardrails

Month two is about turning the audit findings into a plan that can survive budget approval and about putting the technical and organizational guardrails in place before any pilot goes live. You cannot add guardrails after a model is in production without a rewrite.

The Costed Roadmap Format

Every roadmap item gets four fields: the business outcome it improves and its current baseline metric, the technical approach in one sentence, the total cost estimate broken into model inference cost per month, engineering days, and any third-party tooling, and the risk tier (low/medium/high) based on data sensitivity and user-facing surface area. A roadmap without cost estimates is a wish list, not a plan.
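One way to keep roadmap items honest is to make the four fields a typed record, so an item without a cost or risk tier cannot be created. This is a sketch under stated assumptions: the class and field names are hypothetical, tooling is treated as a monthly cost, and the day rate and engineering days in the example are invented for illustration (only the USD 280 inference figure appears elsewhere in this article).

```python
from dataclasses import dataclass

RISK_TIERS = {"low", "medium", "high"}


@dataclass(frozen=True)
class RoadmapItem:
    """One costed roadmap entry. Field names are hypothetical."""
    outcome: str                    # business outcome it improves
    baseline_metric: str            # current measured baseline
    approach: str                   # technical approach in one sentence
    inference_usd_per_month: float
    engineering_days: float
    tooling_usd_per_month: float    # assumption: tooling billed monthly
    risk_tier: str                  # "low", "medium", or "high"

    def __post_init__(self) -> None:
        if self.risk_tier not in RISK_TIERS:
            raise ValueError(f"risk_tier must be one of {sorted(RISK_TIERS)}")

    def first_year_cost(self, day_rate_usd: float) -> float:
        """Build cost plus twelve months of run cost."""
        build = self.engineering_days * day_rate_usd
        run = 12 * (self.inference_usd_per_month + self.tooling_usd_per_month)
        return build + run


item = RoadmapItem(
    outcome="Faster first response on top support categories",
    baseline_metric="3.8 hours average first-response time",
    approach="Draft replies with a hosted model, gated by human review",
    inference_usd_per_month=280.0,
    engineering_days=15,          # assumed for illustration
    tooling_usd_per_month=0.0,
    risk_tier="medium",
)
print(item.first_year_cost(day_rate_usd=800))  # day rate is an assumption
```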

For a typical 50-person B2B SaaS company, the month-two roadmap looks like: two low-risk pilots approved to proceed, one medium-risk item moved to quarter two pending a data cleanup prerequisite, and two items killed because the ROI does not survive the inference cost math.

Guardrails That Must Be in Place Before Any Pilot Ships to Production

Input validation and output filtering. Every prompt going to a hosted model passes through a content classifier that blocks PII and off-topic injection attempts. Output is checked for hallucination markers specific to your domain before it reaches the user.
Observability. Every LLM call is logged with: timestamp, model version, prompt hash, token count, latency, and the eval score if one runs inline. I use a structured log format that feeds into whatever the team already uses, whether that is Datadog, CloudWatch, or a Postgres table. No proprietary observability vendor lock-in in month two.
Cost alerting. A hard budget cap at the API provider level and a Slack alert at 60% of monthly budget. Teams consistently underestimate inference cost at scale. A RAG pipeline that processes 500 queries per day at 4k tokens per query (2 million tokens per day) on a frontier model costs roughly USD 45 per day at June 2026 pricing. That is USD 1,350 per 30-day month before any caching. Budget this before you demo to the board.
Model version pinning. Every deployment specifies an exact model version, not 'latest'. Provider model updates have broken production evals without warning. Pin the version. Schedule a quarterly review to upgrade deliberately.
Human-in-the-loop gates. Any output that scores below a confidence threshold of 0.75 (on your internal eval rubric) routes to a human queue. This is not optional for customer-facing systems in month two. You do not have enough eval data yet to trust autonomous operation.

The Governance Charter

By day 60, one document exists and is signed by the CTO or equivalent: the AI governance charter. It covers acceptable use, prohibited use cases, data handling rules for AI systems, the incident response process for model failures, and who has authority to approve new AI deployments. One page. Not a committee report. This document becomes the yes/no gate for every future AI initiative.

Days 61 to 90: Shipped Pilots, Evals in CI, and Governance Live

Month three is when the engagement proves its value. At least one pilot ships to real users with a measurement framework. The eval suite runs in CI so regressions are caught before deployment. The governance charter is operationalized, not just signed.

What 'Shipped' Means

Shipped means real users are using it, there is a feedback loop, and someone owns the on-call for it. A pilot in a sandbox with five internal testers is not shipped. Shipped has: a rollout plan (percentage-based or segment-based), a rollback procedure documented and tested, a live dashboard showing eval scores and cost, and a defined success threshold, for example, 'support ticket first-response time drops from 4 hours to under 30 minutes for 80% of tickets in category A'.
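A success threshold phrased like the example above is easy to check mechanically, which is what lets it sit on a dashboard. A minimal sketch, assuming first-response times are available as a list of minutes for one ticket category (the function name and the sample values are hypothetical):

```python
def meets_threshold(response_minutes: list[float], limit_minutes: float = 30,
                    required_share: float = 0.80) -> bool:
    """True if enough tickets were answered in under the limit."""
    if not response_minutes:
        return False  # no data is treated as not meeting the threshold
    within = sum(1 for m in response_minutes if m < limit_minutes)
    return within / len(response_minutes) >= required_share


# Illustrative sample: 9 of 10 tickets answered in under 30 minutes.
category_a = [12, 18, 22, 25, 9, 14, 28, 95, 21, 17]
print(meets_threshold(category_a))
```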

Evals in CI: The Minimum Viable Setup

By day 90 the eval pipeline runs automatically on every pull request that touches a prompt, a retrieval config, or a model version. The pipeline samples 100 representative inputs from the production log, runs the new prompt or model version against them, scores the outputs using the same rubric the human reviewers used in week three, and blocks the merge if any of the three eval metrics drops more than 5% below the current production baseline.

This is not expensive to build. The eval runner is 200 lines of Python. The rubric is a JSON file in the repo. The CI step adds under 3 minutes to the pipeline. Teams that skip this discover regressions from users, not from tests.
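The merge-blocking step is the smallest part of that runner. The sketch below shows only the comparison against the baseline, not the sampling or scoring. It assumes the 5% limit is a relative drop, and the metric names and scores are hypothetical.

```python
MAX_RELATIVE_DROP = 0.05  # assumption: the 5% limit is relative to baseline


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                max_drop: float = MAX_RELATIVE_DROP) -> dict[str, float]:
    """Return the metrics whose relative drop exceeds max_drop."""
    failed = {}
    for name, base in baseline.items():
        if name not in candidate:
            failed[name] = 1.0  # a missing metric counts as a full regression
            continue
        drop = (base - candidate[name]) / base if base > 0 else 0.0
        if drop > max_drop:
            failed[name] = drop
    return failed


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Print failures and return a process exit code (0 = pass, 1 = block)."""
    failed = regressions(baseline, candidate)
    for name, drop in failed.items():
        print(f"FAIL {name}: dropped {drop:.1%} vs production baseline")
    return 1 if failed else 0


# Hypothetical metric names and scores.
production = {"accuracy": 0.90, "helpfulness": 0.82, "format_validity": 0.98}
pull_request = {"accuracy": 0.91, "helpfulness": 0.76, "format_validity": 0.97}

exit_code = gate(production, pull_request)
# In a CI job, finish with: raise SystemExit(exit_code)
```

A non-zero exit code is what most CI systems use to fail a step, so no CI-specific integration is needed beyond running the script.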

Retrieval and Tool-Calling Quality

If a shipped pilot uses RAG or tool calling (MCP or otherwise), two additional eval metrics are tracked for each. For retrieval: context recall at k=5 and context precision at k=5, measured against a golden dataset of 50 question-answer pairs the team assembled in week two. For tool calling: tool selection accuracy (did the model call the right tool?) and argument validity rate (were the arguments parseable and in-range?). A pilot that passes the output quality eval but fails at retrieval recall is surfacing the wrong documents to the model and will degrade invisibly over time.
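These four metrics can be computed with plain set arithmetic once each golden question lists its relevant document IDs. The sketch below uses simple ID-matching definitions of recall and precision at k; some eval frameworks use rank-weighted or LLM-judged variants instead, so treat this as one reasonable definition rather than the only one. The function names, the record layout for tool calls, and the sample data are hypothetical.

```python
def context_recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def context_precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def tool_call_scores(calls: list[dict]) -> tuple[float, float]:
    """Return (tool selection accuracy, argument validity rate).

    Each call is a dict with "expected_tool", "called_tool", and "args_valid".
    """
    if not calls:
        return 0.0, 0.0
    selection = sum(c["called_tool"] == c["expected_tool"] for c in calls) / len(calls)
    validity = sum(bool(c["args_valid"]) for c in calls) / len(calls)
    return selection, validity


# One golden question: three relevant documents, six retrieved in ranked order.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d2", "d3", "d4"}
print(context_recall_at_k(retrieved, relevant))
print(context_precision_at_k(retrieved, relevant))

calls = [
    {"expected_tool": "lookup_invoice", "called_tool": "lookup_invoice", "args_valid": True},
    {"expected_tool": "lookup_invoice", "called_tool": "search_docs", "args_valid": True},
    {"expected_tool": "search_docs", "called_tool": "search_docs", "args_valid": False},
    {"expected_tool": "search_docs", "called_tool": "search_docs", "args_valid": True},
]
print(tool_call_scores(calls))
```

Averaging the per-question scores over the 50 golden pairs gives the numbers to track alongside the output quality eval.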

The Day-90 Readout

The 90-day readout is a 30-minute business review with three artifacts: the live pilot dashboard showing real metrics against the success threshold, the eval scorecard showing the trend from week three through week twelve, and the quarter-two roadmap updated with what the pilots taught you. Every item on the Q2 roadmap should trace back to a finding from the audit or a lesson from the pilots. Items that cannot make that trace get cut.

Worked Example: B2B SaaS Support Triage

Here is how the 90-day plan played out for a 60-person B2B SaaS company with a 6-person support team handling 200 tickets per day. The company was spending 40% of support engineering time on ticket routing and first-response drafting.

Day 30 audit finding: Zendesk data was clean and tagged with category labels going back 18 months. No PII in ticket bodies (confirmed by a one-hour scan). Current AI usage: one engineer using ChatGPT manually, no API integration. Data readiness score: 4 out of 5. Quick win chosen: a category classifier that auto-tags incoming tickets and routes them to the correct queue, replacing a 47-step manual decision tree.

Day 60 roadmap item approved: First-response draft generation for the top 3 ticket categories (billing, onboarding, API errors), which together account for 65% of volume. Cost estimate: USD 280 per month at projected volume with GPT-4o-mini, which was 60% less than the manual engineering time cost per month. Guardrails: output filtered for any text matching financial claim patterns (legal requirement), confidence gate at 0.80 sending low-confidence drafts to human review, all prompts version-pinned.

Day 90 result: Classifier live for 3 weeks. First-response drafts live for 10 days. Classifier accuracy: 91% on held-out test set, 88% in production (within tolerance). First-response eval score: 4.1 out of 5 on a human-reviewed sample of 200 drafts, up from 3.6 at launch after two prompt iterations. Support team routing time reduced by 74%. First-response time for the top 3 categories: from 3.8 hours average to 22 minutes average. The pilot paid for the entire 90-day engagement in the first month of operation.
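The headline numbers in this example can be sanity-checked with a few lines of arithmetic. The inputs below are the figures quoted above. The derived values are not stated in the original results, and reading "within tolerance" against the 5% regression limit used for the CI gate is an assumption.

```python
# Figures taken from the worked example.
tickets_per_day = 200
top3_share = 0.65
before_minutes = 3.8 * 60   # 3.8 hours average first response
after_minutes = 22
heldout_accuracy = 0.91
production_accuracy = 0.88

# Derived values (not stated in the original results).
drafted_tickets_per_day = tickets_per_day * top3_share
response_time_reduction = (before_minutes - after_minutes) / before_minutes
relative_accuracy_drop = (heldout_accuracy - production_accuracy) / heldout_accuracy

print(f"Tickets eligible for drafts: {drafted_tickets_per_day:.0f} per day")
print(f"First-response time reduction: {response_time_reduction:.1%}")
print(f"Held-out to production accuracy drop: {relative_accuracy_drop:.1%}")
```

This works out to about 130 tickets per day in the top three categories, a first-response time reduction of roughly 90%, and a relative accuracy drop of about 3.3% between the held-out set and production.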

How to Hold the Role Accountable

If you are the buyer, you should expect to review progress against four metrics at each monthly checkpoint. These are the numbers I commit to tracking from week one.

Audit completion by day 30. All five audit dimensions completed and scored, with a written report delivered and presented. Binary: done or not done.
Pilot in staging with a baseline eval by day 45. Not a demo. A system in staging, connected to real or representative data, with a recorded eval baseline. If this slips to day 55, ask why.
Governance charter signed by day 60. Not drafted, signed. If the charter is not signed by day 60, the organization is not ready to scale AI safely, and that is a finding, not an excuse.
One pilot in production by day 90 with a live dashboard. Real users, real data, real metric visible to the leadership team without a screen-share request.

The role should not be evaluated on the number of meetings attended, the length of the strategy document, or the number of tools evaluated. It should be evaluated on shipped systems with measured outcomes. If you are getting anything else, renegotiate the scope or end the engagement.

Frequently Asked Questions

What does a fractional AI officer actually deliver in the first 30 days?

A structured audit across data readiness, current AI usage, infrastructure, team capability, and compliance risk, plus one working quick win deployed to a staging environment with a human-reviewed eval baseline of at least 50 outputs. Not a strategy deck. A scored report and a live, if limited, system.

How many hours per week does a fractional AI officer typically work?

Engagements I run are structured at 2 to 3 days per week, which is roughly 16 to 24 hours assuming eight-hour days. Weeks one and two tend to run at the higher end because the audit requires breadth. Weeks five through eight shift toward deep technical work on pilot architecture. The key is that the engagement contract specifies deliverables, not just hours. If you are paying for hours with no output milestones, restructure the contract.

What is the difference between a fractional CTO and a fractional AI officer?

A fractional CTO owns the full engineering organization: hiring, architecture, processes, vendor relationships, and product-engineering alignment. A fractional AI officer has a narrower mandate: identify where AI creates measurable leverage, build the systems to capture that leverage, put governance in place so the company scales AI safely, and upskill the team. The roles can overlap but the AI officer role does not require authority over the engineering org. In many companies the fractional AI officer reports to the CTO and augments rather than replaces that function.

Can a fractional AI officer work at a company with no existing AI team?

Yes, and this is often the highest-value engagement. A company with no AI team has no bad habits to undo and no competing internal priorities on AI tooling. The audit phase is faster because the baseline is zero. The main risk is that there is no internal engineer who can own the pilot after the engagement ends. The 90-day plan must include a knowledge transfer component in weeks ten through twelve: documented architecture decisions, runbooks, and at least two internal engineers who have been hands-on with the system before the engagement concludes.

What should a fractional AI officer NOT be doing in the first 90 days?

Fine-tuning a model (almost never necessary and almost always a distraction from the real problem, which is retrieval quality and prompt engineering), building a custom MLOps platform (use managed services until you have 10+ models in production), committing to a single AI vendor for all use cases (keep optionality until you know your workload), and presenting roadmaps without cost estimates (if you cannot price the inference, you cannot defend the investment).

Ready to Start Your 90-Day AI Plan?

If you are evaluating a Fractional AI Officer engagement, the plan above is exactly what I deliver. No slide-deck strategy. No vendor evaluations that go nowhere. Audit, quick win, costed roadmap, guardrails, shipped pilot, evals in CI, governance charter. All in 90 days with clear accountability at each checkpoint.

I work with a small number of companies at a time so the engagement gets real attention, not a junior team executing a template. If your company is at the point where AI leverage is real but the path is not clear, reach out directly.

See how the Fractional AI Officer engagement works
