Every high-throughput AI system eventually runs into the same dilemma: do we keep the code simple, or do we squeeze every last drop of performance out of the hardware? In the Ollama llamarunner, we get to watch that trade-off play out in a single Go file that does everything from HTTP routing to GPU-bound batching. I'm Mahmoud Zalt, and in this walkthrough we'll use this runner as a case study in how to batch tokens efficiently without turning your core loop into an unmaintainable knot.
We'll unpack how the runner juggles concurrent sequences on a single llama context, where the design shines, and where complexity starts to leak. By the end, you'll have a concrete mental model for building your own batched inference loop—and a checklist to keep it healthy over time.
The Scene: One Runner, Many Requests
Before we dig into the batching logic, we need a clear picture of what this runner is responsible for. Conceptually, it's a small HTTP service that exposes four endpoints—/load, /completion, /embedding, and /health—and funnels all model work through a single llama context and KV cache.
runner/
llamarunner/
runner.go <-- this file
Ollama Core Server
|
| HTTP (localhost)
v
+-----------------------+
| Server (runner.go) |
| - modelPath |
| - model *llama.Model |
| - lc *llama.Context |
| - cache *InputCache |
| - seqs []*Sequence |
+-----------+-----------+
|
| manages sequences & batching
v
+-------------+
| Sequence | (per request)
| - inputs |
| - cache slot|
| - sampling |
| - channels |
+------+------+
|
| batched tokens/embeds
v
+-----------+
| llama C++ |
| backend |
+-----------+
The file revolves around two types: Server and Sequence. Everything starts in Execute, the CLI entrypoint. It parses flags, initializes logging and the llama backend, and then spins up a Server with:
- a *llama.Model and *llama.Context (once loaded),
- an InputCache that wraps the KV cache,
- a slice of *Sequence slots capped by parallel, and
- a background goroutine, run(), that continuously calls processBatch.
In other words, this file is both the HTTP edge and the scheduling layer for GPU-bound inference. The core narrative is how it batches heterogeneous work across concurrent sequences, while keeping each request isolated.
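To keep that state in view for the rest of the walkthrough, here is a minimal sketch of the Server struct, reconstructed from the fields the diagram and the later excerpts refer to. The exact types (notably the semaphore, status, and ready fields) and anything not mentioned in this article are assumptions, not the verbatim definition in runner.go.

// Sketch of Server, reconstructed from the fields this article references.
// Exact types and any omitted fields are assumptions.
type Server struct {
	modelPath string
	model     *llama.Model
	lc        *llama.Context

	// cache wraps the llama KV cache and hands out per-sequence slots.
	cache *InputCache

	// seqs holds active requests; its length is capped by the parallel flag.
	seqs    []*Sequence
	seqsSem *semaphore.Weighted // admission control for new sequences
	nextSeq int                 // round-robin cursor used by processBatch

	// mu protects seqs, cache, and nextSeq; cond wakes run() when work arrives.
	mu   sync.Mutex
	cond *sync.Cond

	batchSize int
	status    ServerStatus   // reported by /health (type name assumed)
	ready     sync.WaitGroup // released once the model has finished loading
}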
Sequence: The Per-Request Brain
With the scene set, let’s zoom into the Sequence type. This struct is where the runner encodes the lifecycle of a single request: its prompt, its KV cache slot, its sampling context, and its streaming state.
type Sequence struct {
// batch index
iBatch int
// number of tokens predicted so far
numPredicted int
// prompt inputs left to evaluate
inputs []input
// inputs that have been added to a batch but not yet submitted to Decode
pendingInputs []input
// tokens that have been generated but not returned yet (e.g. for stop sequences)
pendingResponses []string
// logprobs for tokens that haven't been returned yet
pendingLogprobs []llm.Logprob
// input cache being used by this sequence
cache *InputCacheSlot
// channel to send responses over
responses chan response
// channel to stop decoding (such as if the remote connection is closed)
quit chan bool
// number of tokens to predict
numPredict int
samplingCtx *llama.SamplingContext
// channel to send back the embedding if embedding only
embedding chan []float32
// stop sequences
stop []string
// number of inputs to keep at the beginning when shifting context window
numKeep int
// true if an embedding should be returned instead of text generation
embeddingOnly bool
// shift if context window is exceeded
shift bool
doneReason llm.DoneReason
// logprobs configuration
logprobs bool
topLogprobs int
// Metrics
processingDuration time.Duration
generationDuration time.Duration
numDecoded int
numPromptInputs int
}
Sequence encapsulates everything about a single request’s journey through the model and cache. This is a nice example of request-level encapsulation. All the shared, global state lives on Server, but each request has its own:
- Input queue (inputs and pendingInputs) that feeds the batcher,
- KV cache slot (*InputCacheSlot) inside the shared InputCache,
- Streaming channels (responses, embedding, quit), and
- Stop & sampling configuration (stop sequences, logprobs, prediction limit, etc.).
The construction of a sequence happens in NewSequence, which quietly solves one of the hardest problems in LLM serving: context management.
func (s *Server) NewSequence(prompt string, images []llm.ImageData, params NewSequenceParams) (*Sequence, error) {
s.ready.Wait()
inputs, err := s.inputs(prompt, images)
if err != nil {
return nil, fmt.Errorf("failed to process inputs: %w", err)
} else if len(inputs) == 0 {
return nil, errors.New("no input provided")
}
if params.numKeep < 0 {
params.numKeep = len(inputs)
}
if s.model.AddBOSToken() {
params.numKeep += 1
}
// Ensure that at least 1 input can be discarded during shift
params.numKeep = min(params.numKeep, s.cache.numCtx-1)
if len(inputs) > s.cache.numCtx {
discard := len(inputs) - s.cache.numCtx
if !params.truncate {
return nil, errorInputTooLong
}
newInputs := inputs[:params.numKeep]
newInputs = append(newInputs, inputs[params.numKeep+discard:]...)
slog.Warn("truncating input prompt", "limit", s.cache.numCtx, "prompt", len(inputs), "keep", params.numKeep, "new", len(newInputs))
inputs = newInputs
}
var sc *llama.SamplingContext
if params.samplingParams != nil {
sc, err = llama.NewSamplingContext(s.model, *params.samplingParams)
if err != nil {
return nil, err
}
for _, input := range inputs {
if input.embed == nil {
sc.Accept(input.token, false)
}
}
}
return &Sequence{ /* ... fields ... */ }, nil
}
NewSequence enforces context length and initializes sampling state up front. A few key lessons from this construction:
- Context bounds are enforced early. If the prompt would exceed s.cache.numCtx and truncate is false, we fail fast with a clear errorInputTooLong. That error is mapped to HTTP 400 in the handler.
- Truncation is explicit and logged. When truncation is allowed, the code keeps numKeep tokens from the start (including an optional BOS token) and discards the middle, logging the decision with sizes (see the sketch after this list). This is a pragmatic way to preserve some initial context while fitting into the window.
- Sampling state is warmed up with the prompt. For non-embedding inputs, the sampling context Accepts prompt tokens before generation starts. That way repetition penalties, temperature, and other dynamics are conditioned on the full prompt.
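To see the truncation arithmetic on its own, here is a small, self-contained sketch of the keep-the-head, drop-the-middle strategy. The function name and the plain int token type are illustrative; the real code operates on the runner's input values inside NewSequence.

// truncateMiddle keeps the first numKeep tokens, drops just enough from the
// middle to fit numCtx, and keeps the most recent tokens. Assumes
// numKeep <= numCtx-1, which NewSequence enforces before truncating.
func truncateMiddle(tokens []int, numCtx, numKeep int) []int {
	if len(tokens) <= numCtx {
		return tokens
	}
	discard := len(tokens) - numCtx
	out := make([]int, 0, numCtx)
	out = append(out, tokens[:numKeep]...)
	out = append(out, tokens[numKeep+discard:]...)
	return out
}

For example, a 10-token prompt with numCtx = 6 and numKeep = 2 keeps tokens 0–1, drops tokens 2–5, and keeps tokens 6–9.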
The Batch Loop: Where Complexity Hides
Now we’re ready to step into the heart of the runner: the batching engine. This is where the desire for maximum throughput meets the reality of shared mutable state and evolving feature requirements.
The long-lived run goroutine pre-allocates llama batches and calls processBatch in a tight loop:
func (s *Server) run(ctx context.Context) {
s.ready.Wait()
// allocate shared batches once
tokenBatch, err := llama.NewBatch(s.batchSize, len(s.seqs), 0)
// ... optional embedBatch ...
for {
select {
case <-ctx.Done():
return
default:
err := s.processBatch(tokenBatch, embedBatch)
if err != nil {
panic(err)
}
tokenBatch.Clear()
embedBatch.Clear()
}
}
}
This is intentionally single-threaded around the llama context: one loop, one context, batched work from many sequences. The interesting part is how processBatch decides what to feed into each batch.
func (s *Server) processBatch(tokenBatch *llama.Batch, embedBatch *llama.Batch) error {
s.mu.Lock()
for s.allNil() {
s.cond.Wait() // Wait until an item is added
}
defer s.mu.Unlock()
var batch *llama.Batch
var numOutputs int
seqIdx := s.nextSeq - 1
for range s.seqs {
seqIdx = (seqIdx + 1) % len(s.seqs)
seq := s.seqs[seqIdx]
if seq == nil { continue }
// if past the num predict limit
if seq.numPredict > 0 && seq.numPredicted >= seq.numPredict {
s.removeSequence(seqIdx, llm.DoneReasonLength)
continue
}
for i, input := range seq.inputs {
if len(seq.cache.Inputs)+len(seq.pendingInputs)+1 > s.cache.numCtx {
// handle shift / eviction, or abort
}
embedding := input.embed != nil
if batch == nil {
if !embedding { batch = tokenBatch } else { batch = embedBatch }
} else if embedding != batch.IsEmbedding() {
s.nextSeq = seqIdx
break
}
if i >= batch.Size() { break }
output := i+1 == len(seq.inputs)
batch.Add(input.token, input.embed, len(seq.cache.Inputs)+len(seq.pendingInputs), output, seq.cache.Id)
if output { numOutputs++ }
seq.pendingInputs = append(seq.pendingInputs, input)
seq.iBatch = batch.NumTokens() - 1
}
seq.inputs = seq.inputs[len(seq.pendingInputs):]
}
if batch == nil || batch.NumTokens() == 0 { return nil }
t := time.Now()
if err := s.lc.Decode(batch); err != nil {
return fmt.Errorf("failed to decode batch: %w", err)
}
if numOutputs > 0 { s.lc.Synchronize() }
for i, seq := range s.seqs {
if seq == nil { continue }
// ... move pendingInputs into cache, sampling, stop detection, flushing ...
}
return nil
}
processBatch builds a batch, calls Decode, then post-processes logits and responses. This function drives nearly everything:
- Fairness: it walks s.seqs in a round-robin fashion using s.nextSeq, avoiding starvation of later sequences.
- Context safety: it checks whether adding another input would overflow s.cache.numCtx and either shifts the cache window via ShiftCacheSlot or terminates the sequence.
- Heterogeneous batching: it alternates between token and embedding batches based on the actual input type, ensuring each batch is homogeneous (tokens-only or embeddings-only).
- Output selection: it marks some tokens as output=true to tell llama when to emit logits.
From a throughput (how much work we do per unit time) perspective, this is exactly what we want: always keep the model busy with as large a batch as possible, across many concurrent sequences.
This is the heart of the article’s lesson: performance-driven batching is powerful, but if you pack every concern into the same loop, you’ll pay for it in maintainability and testability.
Context Shifting and Reprocessing
One subtle part of this logic is how it handles context exhaustion:
- If the next input would overflow the context and there are no pendingInputs, the code either terminates the sequence (if shift is false) or calls ShiftCacheSlot to slide the window forward while preserving the first numKeep inputs.
- ShiftCacheSlot may return ErrReprocessInputs, in which case previous inputs are re-queued at the front of seq.inputs for another pass.
That’s a clever mechanism to handle shifting without losing logical continuity. But because it lives inside the core loop, changing or debugging this behavior requires understanding several interdependent invariants at once: cache.Inputs length, pendingInputs, numKeep, and shifting semantics.
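In rough strokes, that branch inside the batch-building loop looks like the sketch below. The shape follows the description above; the exact signatures of ShiftCacheSlot and ErrReprocessInputs, and the done reason used on termination, are assumptions.

// Sketch of the overflow branch, simplified from processBatch.
if len(seq.cache.Inputs)+len(seq.pendingInputs)+1 > s.cache.numCtx {
	if len(seq.pendingInputs) != 0 {
		// This sequence already contributed tokens to the batch; decode
		// those first and revisit the overflow on the next pass.
		break
	}
	if !seq.shift {
		// The request opted out of shifting: end the sequence here.
		s.removeSequence(seqIdx, llm.DoneReasonLength) // done reason assumed
		break
	}
	err := s.cache.ShiftCacheSlot(seq.cache, seq.numKeep) // signature assumed
	if err != nil {
		var reprocess *ErrReprocessInputs
		if !errors.As(err, &reprocess) {
			return err
		}
		// Inputs evicted from the cache must be evaluated again: put them
		// back at the front of the queue and revisit on the next pass.
		seq.inputs = append(reprocess.Inputs, seq.inputs...)
		break
	}
}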
If we were to refactor this, the report suggests introducing helpers like:
- buildNextBatchLocked (allocate and fill a batch while holding the mutex), and
- updateSequencesLocked (apply logits to sequences, handle sampling and stopping).
We’ll come back to that when we talk refactors.
Stop Tokens, Unicode, and Trustworthy Streams
Once logits are available, the loop switches from “fill the GPU” mode to “deliver a high-quality stream” mode. This is where stop sequences, logprobs, and UTF-8 handling enter the picture.
After sampling a token, the runner converts it to a piece (a string fragment), appends that to pendingResponses, and then treats the concatenation as the current partial output:
seq.pendingResponses = append(seq.pendingResponses, piece)
sequence := strings.Join(seq.pendingResponses, "")
if ok, stop := common.FindStop(sequence, seq.stop); ok {
// truncate pendingResponses and logprobs to remove stop sequence
// adjust cache length to match
s.removeSequence(i, llm.DoneReasonStop)
continue
}
if common.ContainsStopSuffix(sequence, seq.stop) {
continue
}
if common.IncompleteUnicode(sequence) {
continue
}
if !flushPending(seq) {
s.removeSequence(i, llm.DoneReasonConnectionClosed)
}
There’s a lot going on in this small snippet:
- Stop detection is string-based. FindStop scans the assembled string for any configured stop sequence and returns both a flag and the matched stop value.
- Partial matches are respected. ContainsStopSuffix checks whether the current tail of the string could form a stop sequence if more tokens arrive, so the loop holds off on flushing (sketched below).
- Unicode integrity is enforced. IncompleteUnicode gate-keeps the stream to avoid sending invalid UTF-8 to clients.
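The partial-match check is the subtlest of the three, so here is a sketch of the idea behind ContainsStopSuffix. It illustrates the technique only and is not the implementation in the common package.

// containsStopSuffix reports whether the tail of the pending output could
// still grow into one of the stop sequences, in which case flushing must
// wait for more tokens. Sketch of the idea only.
func containsStopSuffix(pending string, stops []string) bool {
	for _, stop := range stops {
		// Check every proper prefix of the stop sequence against the tail
		// of the pending text, longest first.
		for i := len(stop) - 1; i > 0; i-- {
			if strings.HasSuffix(pending, stop[:i]) {
				return true
			}
		}
	}
	return false
}

With a stop sequence of "</answer>", a pending string ending in "</ans" returns true, so the runner buffers rather than streaming a fragment it might later need to cut.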
The final safety net is flushPending:
func flushPending(seq *Sequence) bool {
joined := strings.Join(seq.pendingResponses, "")
logprobs := seq.pendingLogprobs
seq.pendingResponses = []string{}
seq.pendingLogprobs = []llm.Logprob{}
// ensure valid UTF-8
for !utf8.ValidString(joined) {
joined = joined[:len(joined)-1]
}
if len(joined) == 0 {
return true
}
select {
case seq.responses <- response{content: joined, logprobs: logprobs}:
return true
case <-seq.quit:
return false
}
}
This function guarantees two properties that are extremely important for clients:
- Every chunk is valid UTF‑8. Anything else could break downstream JSON parsers or terminal renderers.
- Logprobs stay aligned with content. When stop sequences cause truncation, the code trims pendingLogprobs by the same number of tokens removed from pendingResponses.
From a story perspective, this is where we see how much responsibility processBatch has accumulated. It’s not just scheduling GPU work; it’s enforcing protocol-level guarantees about what clients receive. That improves performance (no extra goroutines or channels) but makes changes—like adding a new stop condition or supporting alternative encodings—riskier.
Performance, Contention, and Operations
So far, we’ve looked at what the code does. Let’s connect that to how it behaves in production: where it might bottleneck, and what metrics we’d want to observe.
Single Dispatcher, Many Sequences
The runner uses a classic producer–consumer pattern: HTTP handlers produce sequences, and a single goroutine (run) consumes them by repeatedly calling processBatch. This has a few important implications:
- Throughput is bounded by one llama context. All sequences share that context and its mutex, so scaling beyond one GPU pipeline requires multiple runner processes.
- s.mu is a contention point. The mutex protects s.seqs, s.cache, and related state. Today it is held while we both build the batch and call Decode. That simplifies correctness but can block new requests from being admitted in the middle of a large batch.
- seqsSem limits concurrency. Before a handler inserts a sequence into s.seqs, it acquires a weighted semaphore. This acts as a coarse backpressure mechanism: too many active sequences and new requests block (sketched after this list).
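Here is what that admission path can look like with golang.org/x/sync/semaphore, which is what a weighted semaphore usually means in Go. The handler shape, the error response, and the variable names are assumptions; the point is the acquire-then-insert-then-signal pattern.

// Sketch of sequence admission inside a handler. seq is the *Sequence built
// by NewSequence; cache-slot allocation and response streaming are omitted.
if err := s.seqsSem.Acquire(r.Context(), 1); err != nil {
	// The client disconnected (or the server shut down) while waiting.
	http.Error(w, "server busy", http.StatusServiceUnavailable)
	return
}
defer s.seqsSem.Release(1)

s.mu.Lock()
for i := range s.seqs {
	if s.seqs[i] == nil {
		s.seqs[i] = seq  // hand the sequence to the batch loop
		s.cond.Signal()  // wake run() so processBatch picks it up
		break
	}
}
s.mu.Unlock()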
The report calls out processBatch and llama.Context.Decode as the hot paths, which matches our mental model.
Which Metrics Actually Matter?
If we’re running this in production, we want quantitative feedback that our batching strategy is working. The report suggests several useful metrics; let’s highlight three that directly relate to our story:
| Metric | Why it matters | What to look for |
|---|---|---|
| runner_active_sequences | Shows how full s.seqs is compared to parallel. | Under steady load, aim for 50–80% occupancy to keep headroom for spikes. |
| runner_decode_batch_size | Average batch.NumTokens() per Decode call. | If averages stay below ~30–40% of the configured batchSize, batching isn’t effective. |
| runner_request_latency_ms | End-to-end latency for /completion and /embedding. | Track p95 time-to-first-token and total latency; spikes can signal contention or under-batching. |
These metrics let us validate (or falsify) the assumptions built into processBatch. If latency is high but batch sizes are small, we may be idling the GPU. If active sequences are always at parallel and latency climbs, we likely need horizontal scaling.
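None of these metrics exist in the runner today. If you wanted a low-ceremony starting point with only the standard library, an expvar sketch might look like the following; the variable names mirror the table above and are otherwise made up.

// Hypothetical counters; the runner does not export these today.
var (
	activeSequences = expvar.NewInt("runner_active_sequences")
	lastBatchTokens = expvar.NewInt("runner_decode_batch_size")
	decodeCalls     = expvar.NewInt("runner_decode_calls_total")
)

// Inside processBatch, just before s.lc.Decode(batch):
//	lastBatchTokens.Set(int64(batch.NumTokens()))
//	decodeCalls.Add(1)
//
// And wherever sequences are added or removed:
//	activeSequences.Add(1)  // on insert into s.seqs
//	activeSequences.Add(-1) // in removeSequence

Latency is the odd one out: a p95 needs a histogram, so runner_request_latency_ms would call for a metrics library (or at least a ring buffer of durations) rather than a single counter.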
Operational Rough Edges
Two operational choices are worth calling out:
- Panics as error handling. Both run and loadModel use panic on decode or model-load errors. That’s convenient to implement but means a transient error will crash the whole runner, relying on external supervision to restart it.
- No explicit HTTP timeouts. The http.Server is created without ReadTimeout, WriteTimeout, or IdleTimeout. Slow or misbehaving clients can tie up connections indefinitely.
Refactors That Preserve Speed
Given this tour, where would we improve the design without sacrificing the batching performance that makes the runner worthwhile? The report surfaces three concrete refactors that align closely with our narrative.
1. Split the Batch Loop into Orchestrator + Helpers
Right now, processBatch is responsible for:
- Scanning sequences and building a batch,
- Handling context shifts and reprocessing,
- Calling Decode and Synchronize,
- Moving inputs into the cache,
- Sampling tokens and computing logprobs,
- Detecting stop sequences and adjusting cache/logprobs, and
- Flushing responses and removing finished sequences.
That’s a lot for one function. The suggested refactor keeps the batching behavior identical but separates concerns:
- buildNextBatchLocked (requires s.mu): choose which tokens/embeddings to add to the next batch and update seq.pendingInputs, s.nextSeq, etc.
- updateSequencesLocked (requires s.mu): after decode, apply logits to each sequence: embeddings, sampling, stop handling, metrics, and removal.
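The resulting processBatch reads as a thin orchestrator. The sketch below shows the intended shape only; the helper signatures are assumptions built on the report's suggested names, and nothing like it exists in the repository today.

// Sketch of processBatch after the split. Helper signatures are assumed.
func (s *Server) processBatch(tokenBatch, embedBatch *llama.Batch) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.allNil() {
		s.cond.Wait() // wait until a handler adds a sequence
	}

	batch, numOutputs, err := s.buildNextBatchLocked(tokenBatch, embedBatch)
	if err != nil {
		return err
	}
	if batch == nil || batch.NumTokens() == 0 {
		return nil
	}

	if err := s.lc.Decode(batch); err != nil {
		return fmt.Errorf("failed to decode batch: %w", err)
	}
	if numOutputs > 0 {
		s.lc.Synchronize()
	}

	// Sampling, stop handling, flushing, and sequence removal live here now.
	return s.updateSequencesLocked(batch)
}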
This has three concrete benefits:
- You can unit-test batch construction separately from sampling logic.
- You can reason about fairness and context shifting without mentally simulating post-decode behavior.
- Future features, like alternate sampling strategies or richer stop conditions, can live in updateSequencesLocked without touching the hot, performance-sensitive batch construction loop.
2. Turn Model-Load Panics into Errors and Status
loadModel currently panics on any error while loading weights, creating the context, applying LoRA adapters, initializing the image projector, or creating the cache. The refactor proposes returning an error instead and updating s.status accordingly:
- loadModel becomes func (...) error.
- /load runs it in a goroutine and, on error, logs and sets ServerStatusError.
This doesn’t change the happy path at all, but it makes failure modes far friendlier: /health can reflect a persistent failure, logs carry the specific error, and supervisors don’t see opaque panics.
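A sketch of the handler side of that change follows. The ServerStatusError name comes from the description above; everything else (the handler and field names, the ready-status constant, the locking) is an assumption rather than current code.

// Sketch: load the model asynchronously and surface failures via status
// instead of panicking. Names and signatures are illustrative.
func (s *Server) load(w http.ResponseWriter, r *http.Request) {
	// ... decode the load request into params ...

	go func() {
		if err := s.loadModel(params); err != nil {
			slog.Error("failed to load model", "error", err)
			s.mu.Lock()
			s.status = ServerStatusError
			s.mu.Unlock()
			return
		}
		s.mu.Lock()
		s.status = ServerStatusReady
		s.mu.Unlock()
	}()

	w.WriteHeader(http.StatusOK)
}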
3. Add HTTP Timeouts and Graceful Shutdown
At the HTTP level, a small change to Execute can drastically improve robustness: configure ReadTimeout, WriteTimeout, and IdleTimeout, and treat http.ErrServerClosed as a normal shutdown instead of a fatal error.
Even with generous values, timeouts protect the runner from clients that read slowly or never consume streamed responses, and they make it easier to add a proper shutdown path later (for example, tied to a context or OS signal).
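A minimal sketch of that change in Execute, where mux and listener stand in for whatever handler and net.Listener the function already builds, and the timeout values are placeholders rather than recommendations:

// Sketch: give the HTTP server explicit timeouts and treat a clean
// shutdown as a non-error. Values are placeholders.
srv := &http.Server{
	Handler:      mux,
	ReadTimeout:  30 * time.Second,
	WriteTimeout: 5 * time.Minute, // generous: completions stream for a while
	IdleTimeout:  2 * time.Minute,
}

if err := srv.Serve(listener); err != nil && !errors.Is(err, http.ErrServerClosed) {
	return fmt.Errorf("server error: %w", err)
}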
Practical Takeaways You Can Reuse
We’ve walked through the Ollama llama runner from HTTP entrypoints to GPU-bound batching and back out as a streamed response. The real story isn’t just how the code works; it’s the design lessons we can carry into our own systems.
1. Centralize Context and Sequence Management
If you’re serving an LLM with a fixed context window, treat context enforcement as a first-class concern. A constructor like NewSequence that owns tokenization, truncation, and sampling warmup vastly reduces the surface area for off-by-one and overflow bugs.
2. Separate “Keep the GPU Busy” from “Shape the Response”
Batch construction and decode scheduling care about throughput and fairness. Stop sequences, Unicode validity, and logprob alignment care about correctness at the API boundary. It’s tempting to collapse them into a single tight loop, but even extracting small helpers can make future changes much safer.
3. Prefer Explicit States Over Panics for Operational Errors
When a model fails to load or a decode call errors, you usually want:
- a log entry with details,
- a status flag that /health can expose, and
- a path for a supervising system to decide whether to restart or reroute traffic.
Turning panics into structured errors plus a ServerStatusError state gives you all three.
4. Measure What Your Batcher Is Actually Doing
Exposing metrics like active sequences, average batch size, and request latency lets you validate that your clever batch loop is paying off. Without them, it’s easy to end up with complex code that doesn’t actually improve throughput in practice.
Most importantly, when you’re pushing for performance, remember that you (or someone on your team) will need to change this code in six months. Batching tokens doesn’t have to cost you your sanity. With clear boundaries, careful invariants, and a few well-placed helpers, you can keep both the GPU and the future maintainers happy.



