What an Engineer Should Actually Learn First About LLMs (Not Transformers Math)

What to Learn First When Building with LLMs

Start with context engineering, structured outputs, and writing a real eval. Those three skills will take you from zero to shipping a working LLM feature in production faster than any other learning path. Transformer architecture, attention math, and fine-tuning theory can wait; most production LLM work never needs them.

I am Mahmoud Zalt, an independent senior AI systems architect with 16-plus years building production software since 2010. Before LLMs, I built and open-sourced Apiato, a PHP framework that engineers still ship APIs on, so I know what a learning path that actually compounds looks like. Today I run Sista AI, the company I founded, where a workforce of autonomous agents operates in production. I have helped engineers ramp up on LLM systems through my AI Engineer Mentoring service. Every engineer I work with who starts by reading the transformers paper wastes at least three weeks before writing anything that runs. This article is the shortcut I give them instead. You can also read more about my background on the about page.

Why Starting with Transformer Math Is a Mistake

The transformer paper, 'Attention Is All You Need,' is brilliant computer science. It is also almost entirely irrelevant to writing production LLM features as an application engineer. You are not training models. You are calling an API, shaping inputs, and handling outputs. The mental model you need comes from systems thinking, not from linear algebra.

Here is what happens when engineers go theory-first. They spend two weeks on attention mechanisms, then another week on tokenization internals, then they read about RLHF. By week four they still have not written a prompt that calls a real model. They have optimized for feeling prepared rather than for shipping. That is a trap.

The correct framing: an LLM is a probabilistic text-completion function that accepts a context window and returns tokens. You need to understand that interface deeply. You do not need to understand the matrix multiplications that produce it. A senior backend engineer does not need to understand CPU branch prediction to write fast database queries. Same principle applies here.

What 'Understanding the Interface' Actually Means

How the context window works and why every token in it costs you something
How system prompts, user messages, and assistant turns are structured in the chat format
What temperature and top-p actually do to output distribution (testable in five minutes)
Why the model does not 'remember' anything between API calls by default
What a token is well enough to reason about prompt length and cost

Skill 1: Context Engineering

Context engineering is the practice of deciding what information goes into the context window, in what order, and in what form, to maximize the quality of the model's output. It is the highest-leverage skill in LLM application development. Get this wrong and no amount of prompt tweaking fixes it.

The Core Insight

The model can only reason about what is in its context. If a user asks 'is my order late?' and your context contains no order data, the model will hallucinate or refuse. The prompt is not the bottleneck. The missing retrieval step is. This is why RAG (retrieval-augmented generation) exists: not because models are bad at knowledge, but because you need to inject the right facts at call time.

A Concrete Worked Example

Say you are building a support bot. Naive implementation: system prompt with company name and tone, then raw user message. The model has nothing to work with beyond its training data. Better implementation: system prompt describing the assistant role, then a retrieved block of the three most relevant knowledge-base articles (ranked by embedding similarity to the user query), then the last four turns of the conversation for continuity, then the user message. That structure answers 80 percent of support questions without any fine-tuning at all.

The practical rule: before tuning the model, ask whether the answer is even in the context. Most 'the model is hallucinating' complaints are actually 'I never gave the model the data it needed.'

What Engineers Get Wrong

Stuffing the full knowledge base into every prompt. Use retrieval to be selective; do not blast the whole document.
Putting instructions after the data. Models attend more reliably to instructions near the start and near the end of the context. Put critical instructions in both places for long contexts.
Ignoring conversation history shape. Truncating history from the wrong end (cutting the most recent turns) destroys continuity. Always truncate from the oldest turns first.
Not counting tokens before deploying. A context that fits your dev dataset may overflow on real user data. Budget token counts explicitly.

Skill 2: Structured Outputs

Free-form text is hard to parse reliably. The moment you need to do anything programmatic with model output, you need structured outputs. This means instructing the model to return JSON (or another machine-readable format) and enforcing that schema at the API level.

Why This Matters More Than Prompt Clarity

You can write a beautifully clear prompt that asks for a JSON object and the model will occasionally return markdown code fences around it, or add an apologetic sentence before the JSON, or use slightly different key names than you specified. In production, any of those variations breaks your downstream code. The fix is not more prompt engineering. The fix is using the model's native structured-output or function-calling feature, which constrains the output to a schema before it ever hits your code.

The Practical Stack

OpenAI and Anthropic both support constrained JSON output (via response_format with a JSON schema, or via tool-calling). Use it. Pair it with a schema validation library in your application layer (Zod in TypeScript, Pydantic in Python) so that even if the model returns something unexpected, you get a loud, catchable error rather than silent bad data flowing through your system.

A minimal pattern in TypeScript:

// Define schema with Zod
const SentimentSchema = z.object({
  label: z.enum(['positive', 'negative', 'neutral']),
  confidence: z.number().min(0).max(1),
  reason: z.string()
});

// Call model with JSON mode enabled, then validate
const raw = await callLLM(prompt, { json: true });
const result = SentimentSchema.parse(JSON.parse(raw));

Now your downstream code gets a typed object, not a string. That is the difference between a prototype and a system you can maintain.

Common Mistake

Asking for JSON in plain text instructions without enabling structured output mode. It works 95 percent of the time in testing and breaks on edge cases in production. Always use the API-level constraint, not prose instructions alone.

Skill 3: Writing a Real Eval

An eval is a test suite for your LLM feature. It answers the question: 'Did this prompt change make things better or worse?' Without evals, you are flying blind. You push a prompt change, it feels better in two test cases, and you ship it. Then it silently regresses on the 20 percent of inputs you did not check.

The Minimum Viable Eval

You do not need a fancy framework to start. You need four things:

A dataset of 20 to 50 representative inputs covering normal cases, edge cases, and known failure modes. Collect these from real usage as fast as you can.
A scoring function for each input: either a human-written expected output with a comparison function, a rule-based check (does the output contain the required JSON key?), or an LLM-as-judge call scoring the output on a 1-to-5 rubric.
A script that runs all inputs, scores them, and reports an aggregate score (pass rate, average score).
A gate: if the score drops below your threshold, the prompt change does not ship.

LLM-as-Judge Is Legitimate

Using a model to evaluate model output sounds circular, but it works well in practice for subjective qualities like tone, completeness, and factual consistency, as long as you also have rule-based checks for objective properties. The pattern: write a grading prompt that gives the judge model a rubric (1 = wrong or harmful, 3 = acceptable, 5 = excellent) and ask it to score with a one-sentence justification. Sample 10 percent of results and spot-check the judge's scores against your own to calibrate it.

What Teams Get Wrong

The most common mistake is treating evals as a one-time task done at launch. Your eval dataset should grow continuously. Every production bug that reaches a user is an eval case you did not have. Add it immediately. The teams with reliable LLM features treat eval datasets like regression test suites: permanent, growing, and blocking on failure.

The Actual Learning Order I Recommend

Here is the sequence I walk engineers through in my mentoring work. Each step produces something real before moving to the next.

Week	What to Build	What You Learn
1	A CLI tool that takes user input and calls a model API with a structured prompt	API interface, token counting, basic prompt structure, cost per call
2	Add structured JSON output and validate it with a schema library	Structured outputs, schema design, error handling for malformed responses
3	Add a retrieval step: embed a small document set, retrieve top-k chunks, inject into context	Context engineering, embedding similarity, RAG fundamentals
4	Write an eval script with 30 test cases and run it against your feature	Evaluation design, scoring functions, how to detect prompt regressions
5	Add a tool-calling (MCP) step so the model can call a real function	Tool/function calling, multi-step agent patterns, surface area of risk
6	Add basic observability: log every prompt, output, latency, token count, and score	Production monitoring, cost tracking, debugging real failures

At the end of six weeks you have a real pipeline in production. You understand the failure modes from experience, not theory. That is the foundation everything else builds on.

Production Concepts That Matter Early

These are not advanced topics. They are things you will hit in your first production feature, so learn them alongside the basics rather than treating them as 'level 2.'

Guardrails

A guardrail is a check that runs on model input or output to catch harmful, off-topic, or policy-violating content before it reaches the user. Implement at minimum: input length limits, a topic filter for your use case (reject prompts clearly outside scope), and an output check for any content your platform cannot show (PII, harmful language, or confidential data patterns). Libraries like Guardrails AI and NeMo Guardrails exist, but a simple regex plus a fast classification model call covers most production needs at the start.

Cost and Latency Budgets

Set explicit budgets before you start building. What is the maximum acceptable cost per user action? What is the maximum acceptable latency? These constraints will drive every architecture decision: whether to use a small fast model or a large slow one, whether to cache completions, whether to stream responses. Engineers who skip this step build features that are technically correct but economically unshippable.

Human-in-the-Loop Checkpoints

Not every LLM decision should be automated. For consequential actions (sending an email, modifying a record, making a payment), route the model's proposed action to a human confirmation step before execution. The right question is: 'What is the blast radius if this goes wrong?' If it is large, require human approval. This is not a weakness in your system. It is correct engineering for the current state of LLM reliability.

Security: Prompt Injection

Prompt injection is the LLM equivalent of SQL injection. A user crafts input that overwrites your system instructions ('Ignore all previous instructions and do X'). Mitigate by: never concatenating raw user input directly into privileged instruction sections, using structural separators that mark user content clearly, and never giving the model access to tools that can exfiltrate data without a human approval step.

What to Skip in Your First Three Months

Being specific about what NOT to spend time on is as useful as the positive list. Here is what I tell engineers to defer:

Fine-tuning. Fine-tuning costs money, requires a clean labeled dataset you probably do not have, and is almost always beaten by better context engineering on the base model. Do not touch it until you have exhausted prompt and retrieval improvements. Most production systems never need it.
Transformer architecture deep-dives. Read a one-page conceptual overview so the vocabulary does not trip you up. That is enough. You are not writing a training loop.
Agent frameworks. LangChain, LlamaIndex, and similar frameworks add abstraction layers that hide failure modes. Learn the primitives directly first. Build a few manual chains. Then evaluate whether a framework earns its complexity.
Custom embedding models. OpenAI and Cohere embeddings are good enough for the vast majority of retrieval use cases. Do not train your own until you have measured that off-the-shelf embeddings are the bottleneck, which they rarely are.
Quantization and inference optimization. Unless you are self-hosting models at scale, this is not your problem. API providers handle it.

The pattern: skip anything that belongs to the model provider's side of the interface. Focus everything on the application layer where your leverage actually lives.

Frequently Asked Questions

Do I need to understand transformers to build LLM applications?

No. You need a conceptual model of the interface: context window, token limits, temperature, and the chat format. The underlying architecture is the model provider's concern. Application engineers who spend time on transformer internals are optimizing for the wrong layer.

What is context engineering and why does it matter more than prompt engineering?

Context engineering is deciding what data goes into the model's context window before each call: what documents to retrieve, how to format them, how much conversation history to include, and where to place instructions. Prompt engineering (the wording of instructions) matters, but the context is the foundation. A well-worded prompt on an empty context produces worse results than a plain instruction with the right retrieved facts. Fix the context before tuning the words.

How do I know if my LLM feature is working well enough to ship?

You need an eval. Run your feature against at least 20 to 30 representative inputs, score each one (either human review or LLM-as-judge with a rubric), and set a pass-rate threshold you would not ship below. If you cannot measure quality, you cannot improve it and you cannot safely ship it. 'It looks good in testing' is not a ship standard for production LLM features.

What is the difference between RAG and fine-tuning and when should I use each?

RAG (retrieval-augmented generation) injects relevant information into the context at inference time. Fine-tuning bakes information or behavior into the model weights via additional training. Use RAG first: it is cheaper, faster to iterate, and handles dynamic or private data naturally. Use fine-tuning only when you need the model to adopt a very specific style or format consistently, or when you have a large labeled dataset showing the model doing the right thing and context-based approaches have hit their ceiling.

How long does it take to get productive with LLM engineering as a software engineer?

Six weeks of deliberate practice building real features, not tutorials. By the end of week one you should have a working API call with structured output. By week four you should have an eval suite running. The engineers I mentor who follow a structured build-first path are shipping production LLM features by week six. The ones who do theory-first take three to four months to reach the same point.

What observability do I need for LLM features in production?

Log at minimum: the full prompt (system plus user), the model response, latency in milliseconds, token counts for input and output, cost per call, and any eval score you can compute automatically. Feed these into a dashboard so you can see cost trends, latency spikes, and quality drift over time. Without this data, debugging production failures is guesswork. Tools like LangSmith, Helicone, and Braintrust make the logging side easier, but even a structured log table in Postgres gives you most of what you need to start.

Work with Me Directly

If you are an engineer who wants to ramp up on LLM systems fast and build something that actually ships, this is exactly what I cover in my AI Engineer Mentoring program. We skip the theory detours and build a real eval-backed LLM pipeline together, from your first API call to a production-grade feature with observability, guardrails, and structured retrieval. I have done this hands-on for 16-plus years, and the mentoring is direct and specific, not a course you watch alone.

You can read more about how I work on the about page, see examples on the projects page, or get in touch directly to ask whether the program is a fit for where you are right now.

Start building LLM systems the right way. See the AI Engineer Mentoring program.

What an Engineer Should Actually Learn First About LLMs (Not Transformers Math)

Are you a software engineer moving into AI?

AI Personal Assistant

AI Marketing Manager

AI Sales Representative

AI Support Specialist