When Autoregressive Loops Stay Friendly

We're examining how llama/generation.py turns a massive sharded Transformer into a usable Llama interface without sacrificing performance. The core model and tokenizer live elsewhere; this file is the orchestration layer that drives inference.

I'm Mahmoud Zalt, an AI solutions architect, and we'll look at how this module keeps the autoregressive generation loop fast while still readable and extensible. The central lesson is simple: you can keep an autoregressive generation loop performant without turning it into an unmaintainable black box.

We’ll follow the path a request takes through this file: how a Llama instance is built, how the generation loop is structured, how chat dialogs are formatted into tokens, and where device/dtype and operational concerns show up. Along the way, we’ll call out patterns you can reuse and a few sharp edges to avoid.

Setting the scene: a tiny facade over a huge model

In the LLaMA codebase, the heavy lifting lives in model.py and tokenizer.py. generation.py sits on top of them as the service layer: it knows how to load checkpoints, talk to GPUs, batch work, apply sampling, and expose simple completion APIs.

llama/ (project root)
├─ model.py        # Defines ModelArgs, Transformer
├─ tokenizer.py    # Defines Tokenizer
└─ generation.py   # This file
   ├─ Llama
   │  ├─ build()           # loads checkpoints, creates model+tokenizer
   │  ├─ generate()        # core autoregressive loop
   │  ├─ text_completion() # text API
   │  └─ chat_completion() # chat API
   └─ sample_top_p()       # nucleus sampling helper

generation.py as the orchestration and facade layer.

The Llama class exposes three main entry points:

Llama.build(...), a factory that initializes distributed state, loads checkpoint shards, constructs Transformer and Tokenizer, and returns a ready-to-use Llama instance.
Llama.text_completion(...), "text in, text out" for standard completions.
Llama.chat_completion(...), dialog-shaped input in, assistant message out, with instruction-style formatting.

How Llama.build wires up distributed and model parallelism

The build method is where most of the setup work lands. It uses torch.distributed and FairScale to create a model-parallel world, then maps checkpoint shards onto ranks:

@staticmethod
def build(
    ckpt_dir: str,
    tokenizer_path: str,
    max_seq_len: int,
    max_batch_size: int,
    model_parallel_size: Optional[int] = None,
    seed: int = 1,
) -> "Llama":
    if not torch.distributed.is_initialized():
        torch.distributed.init_process_group("nccl")
    if not model_parallel_is_initialized():
        if model_parallel_size is None:
            model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
        initialize_model_parallel(model_parallel_size)

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    torch.manual_seed(seed)

    if local_rank > 0:
        sys.stdout = open(os.devnull, "w")

    start_time = time.time()
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
    assert model_parallel_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} "
        f"but world size is {model_parallel_size}"
    )
    ckpt_path = checkpoints[get_model_parallel_rank()]
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())

    model_args: ModelArgs = ModelArgs(
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        **params,
    )
    tokenizer = Tokenizer(model_path=tokenizer_path)
    model_args.vocab_size = tokenizer.n_words
    torch.set_default_tensor_type(torch.cuda.HalfTensor)
    model = Transformer(model_args)
    model.load_state_dict(checkpoint, strict=False)
    print(f"Loaded in {time.time() - start_time:.2f} seconds")

    return Llama(model, tokenizer)

For a short method, this sets up process groups, selects the shard for the current rank, loads JSON config, seeds RNGs, constructs the model and tokenizer, and returns a facade. The factory keeps that complexity in one place, which is exactly what you want for model loading.

Rule of thumb: if model loading spans files, devices, distributed, and typing, hide it behind a single factory like Llama.build. Keep an eye on what global state it mutates; we’ll revisit that when we talk about devices and dtypes.

The core loop: a fast typist with a mask

Once Llama is built, everything flows through Llama.generate. This is the hot path and the part that determines both performance and how approachable the code feels.

Conceptually, generate is a very fast typist working over a batch:

They see all tokens so far for each sequence (prompt plus generated tokens).
They ask the model for logits for the next position.
They either take the argmax (greedy) or sample using temperature and top‑p.
They append the chosen token, advance the cursor, and repeat until done.

The typist has to handle padding, end-of-sequence tokens, optional log probabilities, and early stopping. The core looks like this:

pad_id = self.tokenizer.pad_id
tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
for k, t in enumerate(prompt_tokens):
    tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
if logprobs:
    token_logprobs = torch.zeros_like(tokens, dtype=torch.float)

prev_pos = 0
eos_reached = torch.tensor([False] * bsz, device="cuda")
input_text_mask = tokens != pad_id

for cur_pos in range(min_prompt_len, total_len):
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
    if temperature > 0:
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_token = sample_top_p(probs, top_p)
    else:
        next_token = torch.argmax(logits[:, -1], dim=-1)

    next_token = next_token.reshape(-1)
    next_token = torch.where(
        input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
    )
    tokens[:, cur_pos] = next_token

    if logprobs:
        token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
            input=logits.transpose(1, 2),
            target=tokens[:, prev_pos + 1 : cur_pos + 1],
            reduction="none",
            ignore_index=pad_id,
        )

    eos_reached |= (~input_text_mask[:, cur_pos]) & (
        next_token == self.tokenizer.eos_id
    )
    prev_pos = cur_pos
    if all(eos_reached):
        break

The autoregressive loop: sliding window over tokens with masks and EOS tracking.

This loop dominates cost: complexity is roughly O(B * L * C) where B is batch size, L is generated length, and C is the cost of model.forward. Every structural choice here directly affects latency and throughput.

Batching and masks: keeping control explicit

Two tensors make this loop much easier to extend safely:

input_text_mask marks prompt vs. padding. Later, when deciding whether to overwrite a position, the code uses this mask so prompt tokens remain untouched. Whether you "echo" the prompt or not becomes a decoding concern, not a loop concern.
eos_reached tracks, per sequence, whether an eos_id has been generated beyond the prompt. Once every row has reached EOS, the loop breaks early and avoids work.

Sampling as a pluggable policy

The choice of the next token is cleanly factored into a policy:

Temperature zero: pure greedy decoding via argmax.
Temperature > 0: softmax plus a call to sample_top_p.

The loop itself doesn’t know anything about the details of top‑p; it just calls a helper. The helper stays small and focused:

def sample_top_p(probs, p):
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    next_token = torch.gather(probs_idx, -1, next_token)
    return next_token

Top‑p (nucleus) sampling means: sort tokens by probability, keep the smallest prefix whose cumulative mass exceeds p, zero the rest, renormalize, and sample from the survivors. The key design decision is not the algorithm itself, but that it lives in a dedicated function. That makes it easy to drop in top‑k, penalties, or custom constraints without touching the loop.

Keeping complexity from creeping up

generate already has a non-trivial cyclomatic complexity. Every new feature you add here, new stopping conditions, penalty terms, streaming, competes for that mental budget.

A pragmatic refactor is to extract helpers for:

initializing token tensors and masks,
choosing the next token (sampling policy),
logprob bookkeeping.

Then the loop becomes "advance positions; stop when all sequences are done", which is far easier for the next engineer to reason about at a glance.

Chat formatting: scripting the conversation

On top of the raw generation loop sits chat_completion, which is responsible for turning role-based dialogs into instruction-style prompts and tokens. This is where format, contracts, and lightweight safety checks live.

Think of chat_completion as a script formatter. It takes a dialog such as:

system → user → assistant → user

and produces a single token sequence with special instruction and system tags. The core formatting logic looks like this:

prompt_tokens = []
unsafe_requests = []
for dialog in dialogs:
    unsafe_requests.append(
        any([tag in msg["content"] for tag in SPECIAL_TAGS for msg in dialog])
    )
    if dialog[0]["role"] == "system":
        dialog = [
            {
                "role": dialog[1]["role"],
                "content": B_SYS
                + dialog[0]["content"]
                + E_SYS
                + dialog[1]["content"],
            }
        ] + dialog[2:]
    assert all([msg["role"] == "user" for msg in dialog[::2]]) and all(
        [msg["role"] == "assistant" for msg in dialog[1::2]]
    ), (
        "model only supports 'system', 'user' and 'assistant' roles, "
        "starting with 'system', then 'user' and alternating (u/a/u/a/u...)"
    )
    dialog_tokens: List[int] = sum(
        [
            self.tokenizer.encode(
                f"{B_INST} {(prompt['content']).strip()} {E_INST} "
                f"{(answer['content']).strip()} ",
                bos=True,
                eos=True,
            )
            for prompt, answer in zip(dialog[::2], dialog[1::2])
        ],
        [],
    )
    assert (
        dialog[-1]["role"] == "user"
    ), f"Last message must be from user, got {dialog[-1]['role']}"
    dialog_tokens += self.tokenizer.encode(
        f"{B_INST} {(dialog[-1]['content']).strip()} {E_INST}",
        bos=True,
        eos=False,
    )
    prompt_tokens.append(dialog_tokens)

Chat dialog → instruction-style token sequence, with role and safety checks.

This code enforces a clear dialog contract:

Only system, user, and assistant roles are supported.
If present, a leading system message is folded into the first user turn using system tags.
Roles must alternate user/assistant/user/assistant...
The last message must be from the user.

Violations fail fast via assertions instead of surfacing later as odd model behavior, which is valuable when you’re debugging integration issues.

Safety as a formatting concern

The module also defends against prompt injection that tries to smuggle internal control tags into user text. It defines:

SPECIAL_TAGS = [B_INST, E_INST, "<>", "<>"]
UNSAFE_ERROR = "Error: special tags are not allowed as part of the prompt."

For each dialog, it checks whether any of these tags appear in message content. If they do, the dialog is marked "unsafe": generation still runs through generate, but the decoded assistant response is replaced with UNSAFE_ERROR instead of the model output.

Case	Contains SPECIAL_TAGS?	Result
Normal dialog	No	Formatted into tokens and passed to `generate`; decoded response returned.
Dialog with `[INST]` in content	Yes	Tokens still generated, but response content replaced by `UNSAFE_ERROR`.

The subtle but important point is that safety decisions sit at the formatting layer, where the structure is explicit, not buried inside the model. That keeps the core generation loop focused on tokens and probabilities, and makes it easier to adjust safety policies as your templates evolve.

Why a dedicated _format_dialog helper helps

Right now, chat_completion mixes unsafe-tag detection, role validation, system-message folding, string templating, and tokenization. Extracting these concerns into a helper makes them trivial to unit test with a stub tokenizer.

That pays off the moment you introduce new roles (for tools, functions, etc.) or change tag schemes between model versions. The generation loop and model stay untouched; only formatting tests and code move.

Devices, dtypes, and hidden globals

So far we’ve looked at how generation.py stays friendly while driving a large model. The main trade-offs appear around devices, dtypes, and validation: the code is optimized for a specific deployment shape, and that leaks into its interfaces.

Two choices stand out:

Hard-coded CUDA allocations: tensor creation in generate and related methods uses device="cuda" directly.
Global default tensor type: Llama.build calls torch.set_default_tensor_type(torch.cuda.HalfTensor).

Both are convenient if every process that imports this code is a GPU-only, single-purpose worker. They become liabilities in more complex services and tests.

Why global defaults are a smell

Changing the default tensor type effectively says: "any code in this process that creates tensors without specifying device/dtype will now get CUDA half-precision." That’s invisible global configuration.

If you're embedding Llama into a larger system, that can break unrelated components in surprising ways. The safer pattern is to carry device and dtype as configuration of the Llama instance and use them explicitly whenever you allocate tensors.

The suggested refactor is straightforward:

Add device and dtype parameters to Llama.build.
Store them on self.device and self.dtype in Llama.__init__.
Replace device="cuda" with device=self.device in generate and other allocations.
Remove the global torch.set_default_tensor_type call.

You keep the same performance characteristics, but you gain the ability to run CPU-only tests, experiment with other accelerators, and avoid polluting global PyTorch state.

Assertions vs. explicit errors

The file uses assert for several runtime checks:

Checkpoint existence and shard/world-size alignment.
Batch size and prompt length within model limits.
Dialog role ordering and last-message role.

Assertions are fine for developer-only invariants, but they disappear under Python’s optimization flags and don’t give operators much to work with. For user-facing contracts, API arguments, dialog structure, configuration, a descriptive ValueError or custom exception type makes integration failures faster to diagnose.

None of this changes performance, but it makes the same code noticeably friendlier when it’s used as a library instead of just a script.

Takeaways you can apply today

Looking at llama/generation.py as a case study, we can see how to balance a high-throughput autoregressive loop with code that engineers can still reason about and extend.

Treat the generation loop as an API surface, not a dumping ground. Keep masks, done flags, and sampling policies explicit. If generate starts to feel like a maze, extract helpers so the loop reads as "advance cursor and stop when done." That preserves both performance and maintainability.
Centralize formatting and safety at the edges. The dialog-to-token path in chat_completion enforces role contracts and guards against control-tag abuse in a single place. Mirroring that pattern in your own stack, one formatter per interface, pays off when you change templates or add new roles.
Be explicit about devices, dtypes, and validation. Avoid hidden globals like default tensor types and avoid leaning on assert for behavior that matters in production. Thread device/dtype through your facades and raise clear exceptions for bad inputs or configurations.

The primary lesson from this module is that performance and friendliness don’t have to be opposed. With a thin facade like Llama, a disciplined generation loop, and clear boundaries for formatting and configuration, you can drive large models at scale and keep the inference code approachable for the next engineer who has to touch it.

Zalt Blog

When Autoregressive Loops Stay Friendly

Are you a software engineer moving into AI?

AI Personal Assistant

AI Marketing Manager

AI Sales Representative

AI Support Specialist

Setting the scene: a tiny facade over a huge model

The core loop: a fast typist with a mask

Batching and masks: keeping control explicit

Sampling as a pluggable policy

Chat formatting: scripting the conversation

Safety as a formatting concern

Devices, dtypes, and hidden globals

Why global defaults are a smell

Assertions vs. explicit errors

Takeaways you can apply today

Full Source Code

Read More

In the Loop vs On the Loop: How I Decide How Much to Trust an Agent

Keeping API Keys Safe: The Vibecoder's Handbook Rules

Free AI Tools

About the Author

Support this content

Share this article