The AI Engineering Skills Roadmap: What Order Actually Matters
Start with prompting, evals, and retrieval. Skip model training entirely until you have shipped at least two production AI features. That single reordering will save you six to twelve months of chasing skills that almost no company needs you to have on day one.
I am Mahmoud Zalt, an independent senior AI systems architect with 16 years building production software since 2010. I learned what dependency-ordered skill-building looks like the hard way, shipping Apiato, an open-source framework other engineers build on, long before founding Sista AI, where I run a workforce of autonomous agents in production. I now work with engineers one-on-one through my AI engineer mentoring program to help them make this transition without burning months on the wrong material. Read more about me or see my projects.
Why Most AI Roadmaps Send You in the Wrong Direction
The most popular roadmaps online are written by academics or ML researchers. They start with linear algebra, then statistics, then Python for data science, then PyTorch, then CNNs, then transformers from scratch. That is a fine path if your goal is a research role at a lab. It is the wrong path if your goal is to build AI-powered products at a company.
The confusion happens because 'AI engineer' conflates two very different jobs:
- ML researcher / model trainer: writes training loops, designs loss functions, works on pretraining and fine-tuning at scale. Rare role. Requires deep math.
- AI systems engineer (what most companies are actually hiring): integrates foundation models into products, builds retrieval pipelines, writes evals, manages latency and cost, wires up tool-calling and MCP, handles guardrails and observability. Does not require training a single model.
If you are reading this article, you almost certainly want the second role. Almost every job posting labeled 'AI engineer' in 2024 and 2025 is describing the second role. The roadmap below is for that role.
The Dependency-Ordered Roadmap (Do These in Order)
Each layer depends on the one before it. Do not skip ahead. The order is not arbitrary.
Layer 1: Prompting Fundamentals (Week 1-3)
Before anything else, you need to understand how LLMs actually respond to instructions. This is not soft knowledge. Prompt structure directly determines output quality, and badly structured prompts cannot be fixed by switching to a bigger model.
- System prompt vs user turn structure
- Zero-shot vs few-shot prompting with concrete examples
- Chain-of-thought (CoT) and when it helps versus when it costs tokens for nothing
- Role prompting, output format constraints (JSON mode, structured outputs)
- Context window budgeting: what goes in system, what goes in user, what you never include
Worked example: A team I worked with was getting inconsistent JSON from GPT-4o. The root cause was putting format instructions in the user turn, not the system prompt, so the model treated them as optional context. Moving format constraints to the system prompt and adding one few-shot example reduced malformed output from 12% to under 0.5%.
Layer 2: Evals (Week 3-6)
This is the most under-taught skill in every roadmap I have seen. You cannot improve a system you cannot measure. Ship nothing without evals.
- Deterministic evals: exact-match, regex, JSON schema validation
- LLM-as-judge evals: when to use them, how to calibrate the judge, how to avoid judge gaming
- Regression evals: catching when a prompt change breaks previously passing cases
- Building a golden dataset of 50 to 200 hand-labeled examples
- Tools:
promptfoo,braintrust,langsmitheval harnesses
The dependency is strict: you need prompting (Layer 1) to write the system being evaluated, and you need evals before you change any prompt or model, otherwise you are guessing.
Layer 3: Retrieval and RAG (Week 6-10)
Most real AI features require grounding the model in your data. RAG (retrieval-augmented generation) is the dominant pattern. This is where most engineers get stuck because they implement the naive version, see poor results, and think RAG does not work. RAG works. Naive RAG does not.
- Chunking strategy: fixed vs semantic vs document-structure-aware
- Embedding models: OpenAI
text-embedding-3-smallvslargevs open models. Know the tradeoffs. - Vector databases: pgvector (start here), Pinecone, Qdrant. Do not over-engineer the DB choice early.
- Retrieval quality: top-k selection, MMR (maximal marginal relevance), hybrid search (BM25 + dense)
- Re-ranking:
cohere-rerankor a cross-encoder before passing chunks to the LLM - Eval loop for retrieval: measure recall@k before you ever measure answer quality
What teams get wrong: They tune the generation prompt obsessively while leaving retrieval broken. A badly retrieved chunk cannot be recovered in the generation step. Fix retrieval first, measure it with recall@k, then worry about the generation prompt.
Layer 4: Tool-Calling and Agent Patterns (Week 10-14)
Once you can prompt well and retrieve reliably, agents become tractable. Before that, they are chaos.
- OpenAI function-calling / tools API: schema design, required vs optional params
- MCP (Model Context Protocol): how servers expose tools to models, client-server contract
- ReAct loop: reason, act, observe, repeat. Understand the failure modes (loops, hallucinated tool names)
- Deterministic vs LLM-routed tool selection: know when to let the model pick and when to hard-code routing
- Human-in-the-loop checkpoints: when to pause and confirm before a destructive tool call
Layer 5: Observability, Guardrails, Cost (Week 14-18)
This layer is what separates engineers who can demo from engineers who can operate.
- Tracing every LLM call:
langsmith,langfuse, orarize phoenix. Log prompt, completion, latency, cost, model version. - Input guardrails: prompt injection detection, PII stripping before the model sees user input
- Output guardrails: hallucination scoring, schema validation, content policy checks
- Cost modeling: tokens per request times price per million times daily volume. Know your burn rate before launch.
- Latency budgeting: streaming vs batch, where caching helps (semantic cache with embeddings)
Layer 6: Infrastructure and Deployment (Week 18-22)
Now you are ready for the infra layer. Not before.
- API gateway patterns for LLM traffic (rate limiting, key rotation, model fallback)
- Async job queues for long-running agent runs
- Model versioning and prompt versioning: treat prompts as code, version them in git
- Fine-tuning: only reach for this after RAG plus evals have failed to meet your quality bar. Fine-tuning is not a shortcut. It requires a labeled dataset, a training loop, and ongoing maintenance.
What to Skip (At Least for Now)
Being opinionated about the skip list is as important as the roadmap itself. Here is what I tell every engineer I mentor to defer until they have shipped something real.
| Topic | Why to Skip It Now | When to Revisit |
|---|---|---|
| Training your own LLM | Costs millions in compute. Not a skill gap for 99% of roles. | If you join a lab or a company with a model training team. |
| PyTorch from scratch | You will use APIs, not training loops. Time-to-value is terrible. | If you move into research or fine-tuning at scale. |
| MLOps (Kubeflow, MLflow, etc.) | Designed for the training pipeline, not the inference pipeline. | After you are running model training jobs in production. |
| Every new model on release day | Model-hopping wastes weeks. The prompting and eval skills transfer. | Use benchmarks. Upgrade on eval regression, not on hype. |
| AutoGen / CrewAI / every new agent framework | Abstractions change every quarter. Understand the primitives first. | After you have built at least one agent from primitives. |
| Diffusion model internals | Unless you are building image generation features specifically. | Domain-specific need only. |
The Toolkit That Actually Ships
These are the specific tools I see doing real work in production AI systems in 2025. Not exhaustive. Not every tool for every job. The smallest set that covers the most ground.
- LLM APIs: OpenAI (GPT-4o, o3), Anthropic (Claude Sonnet / Opus), Google (Gemini 1.5 Pro). Know all three. Lock-in is a real cost.
- Embeddings:
text-embedding-3-smallfor most workloads. Step up tolargeonly if evals show it helps on your data. - Vector storage: Start with pgvector on your existing Postgres. Migrate to Qdrant or Pinecone when you have scale evidence.
- Eval harness:
promptfoofor fast iteration,braintrustfor team-scale eval management. - Observability:
langfuse(open source, self-hostable) orlangsmith. Pick one and use it from day one. - Orchestration: Plain Python functions before any framework. Then LangChain if you need the integrations. Then custom if the abstraction fights you.
- MCP: Build at least one MCP server before reaching for a higher-level agent framework. It forces you to understand the tool-model contract.
Production Judgment: What Textbooks Do Not Teach
This is the gap between knowing the skills on paper and being trusted to own an AI system in production. It comes from shipping, not studying.
Evals before refactoring
Before you change a prompt, run the current prompt through your eval suite and record the baseline score. Then change the prompt. Then compare. If you skip the baseline, you have no evidence you improved anything, and you will introduce regressions you will not catch until a user reports them.
Human-in-the-loop is not a failure mode, it is a feature
The pressure to automate everything fully is real, but the right answer for consequential agent actions (sending emails, modifying databases, making API calls with side effects) is often a confirmation step. I wire human-in-the-loop checkpoints for any tool call that is not trivially reversible. This is not a limitation, it is how you keep the system trustworthy while the eval coverage grows.
Security: the attack surface LLM docs skip
Prompt injection is a first-class threat. If your agent processes user-controlled text and then acts on tool outputs, an attacker can embed instructions in a document or database record that hijack the agent's behavior. Mitigations: sanitize inputs before the model sees them, privilege-separate tool calls (the model requests, a separate layer validates and executes), and never let the model see raw outputs from tools it just called without a validation pass.
Cost surprises happen at scale, not at demo
A feature that costs $0.002 per request looks free in a demo. At 500,000 daily active users with 3 requests each, that is $3,000 per day. Model your cost per request before launch, not after. Caching common queries with a semantic cache (embed the query, look up near-duplicate completions) can cut 30-60% of LLM calls on high-repeat workloads.
Where to Actually Learn This (Without the Noise)
I am not going to list 40 resources. Here is the shortest path that covers the actual roadmap above.
- Prompting and structured outputs: OpenAI and Anthropic prompt engineering guides. Read them fully. They are authoritative and free.
- Evals: The
promptfoodocumentation is the best practical eval primer available. Read the concepts section, not just the quickstart. - RAG deep dive: Jerry Liu's (LlamaIndex) writings on advanced RAG patterns. Specific, production-oriented, not theoretical.
- Agent primitives: Build a ReAct agent from scratch in plain Python using the raw OpenAI tools API. Do this before using any framework. It takes two to four hours and teaches you more than a week of reading.
- MCP: The official MCP specification and the reference servers in the modelcontextprotocol GitHub org. Build one server before anything else.
- Observability: Langfuse docs and their blog. They cover the observability patterns that matter for LLM systems specifically.
Do not buy a $2,000 course before you have finished the free official documentation. The documentation is better than most courses for this stack.
Frequently Asked Questions
Do I need to know Python to become an AI engineer?
Yes, practically speaking. The entire LLM tooling ecosystem (LangChain, LlamaIndex, OpenAI SDK, HuggingFace) has Python as its first-class language. TypeScript/JavaScript is a legitimate second choice if you are coming from web development, and the OpenAI and Anthropic SDKs have strong TS support. But if you are starting from zero, Python is faster to reach productivity in this domain.
Do I need a math background for AI engineering?
Not for the role described in this roadmap. You need enough statistics to understand what an embedding is (a vector of numbers representing semantic meaning) and what precision and recall mean in an eval context. You do not need to derive backpropagation or understand transformer attention from first principles to ship production RAG systems and agents. The math requirement is genuine for model training roles. It is largely unnecessary for AI systems engineering roles.
How long does it take to become job-ready as an AI engineer?
With focused effort, 4 to 6 months if you already have software engineering experience. The skills in Layers 1 through 4 of this roadmap are achievable in that window if you are building, not just reading. If you are coming from a non-engineering background, add 3 to 6 months for Python fundamentals and basic software design. The fastest path is always building a real project alongside the learning, not finishing all the reading before writing code.
Should I learn LangChain or build from primitives?
Build from primitives first. Make at least one RAG pipeline and one agent using raw API calls and plain functions before reaching for a framework. LangChain solves real problems but it also hides what is actually happening, and when something breaks in production you need to know what is happening. Once you understand the primitives, use whatever framework saves you time on your specific project.
What is the difference between an AI engineer and an ML engineer?
In practice: an ML engineer builds and trains models. An AI engineer integrates and operates foundation models in products. The skills overlap at the edges (both care about evals, both need to understand model behavior) but the core skill sets are different. Most companies hiring aggressively right now are hiring AI engineers, not ML engineers. ML engineering roles are fewer, more specialized, and concentrated at labs and large tech companies.
Is fine-tuning a skill I should learn early?
No. Fine-tuning is a last resort, not a first tool. The order of operations is: prompt engineering, then RAG, then few-shot examples, then fine-tuning. Most quality problems that engineers blame on 'needing fine-tuning' are actually retrieval problems or prompt structure problems. Fine-tuning requires a high-quality labeled dataset, ongoing maintenance as the base model updates, and meaningful compute cost. Reach for it only after evals show that RAG and prompting cannot close the gap.
Work With Me Directly
If you are an engineer making this transition and you want a structured path instead of guessing what to learn next, that is exactly what I do in my AI engineer mentoring program. We work through the dependency-ordered roadmap above, you build real things, and I give you direct feedback on your evals, your RAG pipelines, and your agent designs from someone who has shipped these systems in production.
I work with a small number of engineers at a time. If you are serious about this transition, get in touch and tell me where you are on the roadmap and what you are trying to build.







