Every high-throughput AI system eventually runs into the same dilemma: do we keep the code simple, or do we squeeze every last drop of performance out of the hardware? In the Ollama llamarunner, we get to watch that trade-off play out in a single Go file that does everything from HTTP routing to GPU-bound batching. I'm Mahmoud Zalt, and in this walkthrough we'll use this runner as a case study in how to batch tokens efficiently without turning your core loop into an unmaintainable knot.
We'll unpack how the runner juggles concurrent sequences on a single llama context, where the design shines, and where complexity starts to leak. By the end, you'll have a concrete mental model for building your own batched inference loop—and a checklist to keep it healthy over time.
The Scene: One Runner, Many Requests
Before we dig into the batching logic, we need a clear picture of what this runner is responsible for. Conceptually, it's a small HTTP service that exposes four endpoints—/load, /completion, /embedding, and /health—and funnels all model work through a single llama context and KV cache.
runner/
llamarunner/
runner.go <-- this file
Ollama Core Server
|
| HTTP (localhost)
v
+-----------------------+
| Server (runner.go) |
| - modelPath |
| - model *llama.Model |
| - lc *llama.Context |
| - cache *InputCache |
| - seqs []*Sequence |
+-----------+-----------+
|
| manages sequences & batching
v
+-------------+
| Sequence | (per request)
| - inputs |
| - cache slot|
| - sampling |
| - channels |
+------+------+
|
| batched tokens/embeds
v
+-----------+
| llama C++ |
| backend |
+-----------+
The file revolves around two types: Server and Sequence. Everything starts in Execute, the CLI entrypoint. It parses flags, initializes logging and the llama backend, and then spins up a Server with:
- a *llama.Model and *llama.Context (once loaded),
- an InputCache that wraps the KV cache,
- a slice of *Sequence slots capped by parallel, and
- a background goroutine, run(), that continuously calls processBatch.
In other words, this file is both the HTTP edge and the scheduling layer for GPU-bound inference. The core narrative is how it batches heterogeneous work across concurrent sequences, while keeping each request isolated.
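To keep that state in view for the rest of the walkthrough, here is a minimal sketch of the Server struct, reconstructed from the fields the diagram and the later excerpts refer to. The exact types (notably the semaphore, status, and ready fields) and anything not mentioned in this article are assumptions, not the verbatim definition in runner.go.

// Sketch of Server, reconstructed from the fields this article references.
// Exact types and any omitted fields are assumptions.
type Server struct {
	modelPath string
	model     *llama.Model
	lc        *llama.Context

	// cache wraps the llama KV cache and hands out per-sequence slots.
	cache *InputCache

	// seqs holds active requests; its length is capped by the parallel flag.
	seqs    []*Sequence
	seqsSem *semaphore.Weighted // admission control for new sequences
	nextSeq int                 // round-robin cursor used by processBatch

	// mu protects seqs, cache, and nextSeq; cond wakes run() when work arrives.
	mu   sync.Mutex
	cond *sync.Cond

	batchSize int
	status    ServerStatus   // reported by /health (type name assumed)
	ready     sync.WaitGroup // released once the model has finished loading
}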
Sequence: The Per-Request Brain
With the scene set, let’s zoom into the Sequence type. This struct is where the runner encodes the lifecycle of a single request: its prompt, its KV cache slot, its sampling context, and its streaming state.
type Sequence struct {
// batch index
iBatch int
// number of tokens predicted so far
numPredicted int
// prompt inputs left to evaluate
inputs []input
// inputs that have been added to a batch but not yet submitted to Decode
pendingInputs []input
// tokens that have been generated but not returned yet (e.g. for stop sequences)
pendingResponses []string
// logprobs for tokens that haven't been returned yet
pendingLogprobs []llm.Logprob
// input cache being used by this sequence
cache *InputCacheSlot
// channel to send responses over
responses chan response
// channel to stop decoding (such as if the remote connection is closed)
quit chan bool
// number of tokens to predict
numPredict int
samplingCtx *llama.SamplingContext
// channel to send back the embedding if embedding only
embedding chan []float32
// stop sequences
stop []string
// number of inputs to keep at the beginning when shifting context window
numKeep int
// true if an embedding should be returned instead of text generation
embeddingOnly bool
// shift if context window is exceeded
shift bool
doneReason llm.DoneReason
// logprobs configuration
logprobs bool
topLogprobs int
// Metrics
processingDuration time.Duration
generationDuration time.Duration
numDecoded int
numPromptInputs int
}
Sequence encapsulates everything about a single request’s journey through the model and cache. This is a nice example of request-level encapsulation. All the shared, global state lives on Server, but each request has its own:
- Input queue (inputs and pendingInputs) that feeds the batcher,
- KV cache slot (*InputCacheSlot) inside the shared InputCache,
- Streaming channels (responses, embedding, quit), and
- Stop & sampling configuration (stop sequences, logprobs, prediction limit, etc.).
The construction of a sequence happens in NewSequence, which quietly solves one of the hardest problems in LLM serving: context management.
func (s *Server) NewSequence(prompt string, images []llm.ImageData, params NewSequenceParams) (*Sequence, error) {
s.ready.Wait()
inputs, err := s.inputs(prompt, images)
if err != nil {
return nil, fmt.Errorf("failed to process inputs: %w", err)
} else if len(inputs) == 0 {
return nil, errors.New("no input provided")
}
if params.numKeep < 0 {
params.numKeep = len(inputs)
}
if s.model.AddBOSToken() {
params.numKeep += 1
}
// Ensure that at least 1 input can be discarded during shift
params.numKeep = min(params.numKeep, s.cache.numCtx-1)
if len(inputs) > s.cache.numCtx {
discard := len(inputs) - s.cache.numCtx
if !params.truncate {
return nil, errorInputTooLong
}
newInputs := inputs[:params.numKeep]
newInputs = append(newInputs, inputs[params.numKeep+discard:]...)
slog.Warn("truncating input prompt", "limit", s.cache.numCtx, "prompt", len(inputs), "keep", params.numKeep, "new", len(newInputs))
inputs = newInputs
}
var sc *llama.SamplingContext
if params.samplingParams != nil {
sc, err = llama.NewSamplingContext(s.model, *params.samplingParams)
if err != nil {
return nil, err
}
for _, input := range inputs {
if input.embed == nil {
sc.Accept(input.token, false)
}
}
}
return &Sequence{ /* ... fields ... */ }, nil
}
NewSequence enforces context length and initializes sampling state up front. A few key lessons from this construction:
- Context bounds are enforced early. If the prompt would exceed s.cache.numCtx and truncate is false, we fail fast with a clear errorInputTooLong. That error is mapped to HTTP 400 in the handler.
- Truncation is explicit and logged. When truncation is allowed, the code keeps numKeep tokens from the start (including an optional BOS token) and discards the middle, logging the decision with sizes (see the sketch after this list). This is a pragmatic way to preserve some initial context while fitting into the window.
- Sampling state is warmed up with the prompt. For non-embedding inputs, the sampling context Accepts prompt tokens before generation starts. That way repetition penalties, temperature, and other dynamics are conditioned on the full prompt.
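To see the truncation arithmetic on its own, here is a small, self-contained sketch of the keep-the-head, drop-the-middle strategy. The function name and the plain int token type are illustrative; the real code operates on the runner's input values inside NewSequence.

// truncateMiddle keeps the first numKeep tokens, drops just enough from the
// middle to fit numCtx, and keeps the most recent tokens. Assumes
// numKeep <= numCtx-1, which NewSequence enforces before truncating.
func truncateMiddle(tokens []int, numCtx, numKeep int) []int {
	if len(tokens) <= numCtx {
		return tokens
	}
	discard := len(tokens) - numCtx
	out := make([]int, 0, numCtx)
	out = append(out, tokens[:numKeep]...)
	out = append(out, tokens[numKeep+discard:]...)
	return out
}

For example, a 10-token prompt with numCtx = 6 and numKeep = 2 keeps tokens 0–1, drops tokens 2–5, and keeps tokens 6–9.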
The Batch Loop: Where Complexity Hides
Now we’re ready to step into the heart of the runner: the batching engine. This is where the desire for maximum throughput meets the reality of shared mutable state and evolving feature requirements.
The long-lived run goroutine pre-allocates llama batches and calls processBatch in a tight loop:
func (s *Server) run(ctx context.Context) {
s.ready.Wait()
// allocate shared batches once
tokenBatch, err := llama.NewBatch(s.batchSize, len(s.seqs), 0)
// ... optional embedBatch ...
for {
select {
case <-ctx.Done():
return
default:
err := s.processBatch(tokenBatch, embedBatch)
if err != nil {
panic(err)
}
tokenBatch.Clear()
embedBatch.Clear()
}
}
}
This is intentionally single-threaded around the llama context: one loop, one context, batched work from many sequences. The interesting part is how processBatch decides what to feed into each batch.
func (s *Server) processBatch(tokenBatch *llama.Batch, embedBatch *llama.Batch) error {
s.mu.Lock()
for s.allNil() {
s.cond.Wait() // Wait until an item is added
}
defer s.mu.Unlock()
var batch *llama.Batch
var numOutputs int
seqIdx := s.nextSeq - 1
for range s.seqs {
seqIdx = (seqIdx + 1) % len(s.seqs)
seq := s.seqs[seqIdx]
if seq == nil { continue }
// if past the num predict limit
if seq.numPredict > 0 && seq.numPredicted >= seq.numPredict {
s.removeSequence(seqIdx, llm.DoneReasonLength)
continue
}
for i, input := range seq.inputs {
if len(seq.cache.Inputs)+len(seq.pendingInputs)+1 > s.cache.numCtx {
// handle shift / eviction, or abort
}
embedding := input.embed != nil
if batch == nil {
if !embedding { batch = tokenBatch } else { batch = embedBatch }
} else if embedding != batch.IsEmbedding() {
s.nextSeq = seqIdx
break
}
if i >= batch.Size() { break }
output := i+1 == len(seq.inputs)
batch.Add(input.token, input.embed, len(seq.cache.Inputs)+len(seq.pendingInputs), output, seq.cache.Id)
if output { numOutputs++ }
seq.pendingInputs = append(seq.pendingInputs, input)
seq.iBatch = batch.NumTokens() - 1
}
seq.inputs = seq.inputs[len(seq.pendingInputs):]
}
if batch == nil || batch.NumTokens() == 0 { return nil }
t := time.Now()
if err := s.lc.Decode(batch); err != nil {
return fmt.Errorf("failed to decode batch: %w", err)
}
if numOutputs > 0 { s.lc.Synchronize() }
for i, seq := range s.seqs {
if seq == nil { continue }
// ... move pendingInputs into cache, sampling, stop detection, flushing ...
}
return nil
}
processBatch builds a batch, calls Decode, then post-processes logits and responses. This function drives nearly everything:
- Fairness: it walks s.seqs in a round-robin fashion using s.nextSeq, avoiding starvation of later sequences.
- Context safety: it checks whether adding another input would overflow s.cache.numCtx and either shifts the cache window via ShiftCacheSlot or terminates the sequence.
- Heterogeneous batching: it alternates between token and embedding batches based on the actual input type, ensuring each batch is homogeneous (tokens-only or embeddings-only).
- Output selection: it marks some tokens as output=true to tell llama when to emit logits.
From a throughput (how much work we do per unit time) perspective, this is exactly what we want: always keep the model busy with as large a batch as possible, across many concurrent sequences.
This is the heart of the article’s lesson: performance-driven batching is powerful, but if you pack every concern into the same loop, you’ll pay for it in maintainability and testability.
Context Shifting and Reprocessing
One subtle part of this logic is how it handles context exhaustion:
- If the next input would overflow the context and there are no pendingInputs, the code either terminates the sequence (if shift is false) or calls ShiftCacheSlot to slide the window forward while preserving the first numKeep inputs.
- ShiftCacheSlot may return ErrReprocessInputs, in which case previous inputs are re-queued at the front of seq.inputs for another pass.
That’s a clever mechanism to handle shifting without losing logical continuity. But because it lives inside the core loop, changing or debugging this behavior requires understanding several interdependent invariants at once: cache.Inputs length, pendingInputs, numKeep, and shifting semantics.
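In rough strokes, that branch inside the batch-building loop looks like the sketch below. The shape follows the description above; the exact signatures of ShiftCacheSlot and ErrReprocessInputs, and the done reason used on termination, are assumptions.

// Sketch of the overflow branch, simplified from processBatch.
if len(seq.cache.Inputs)+len(seq.pendingInputs)+1 > s.cache.numCtx {
	if len(seq.pendingInputs) != 0 {
		// This sequence already contributed tokens to the batch; decode
		// those first and revisit the overflow on the next pass.
		break
	}
	if !seq.shift {
		// The request opted out of shifting: end the sequence here.
		s.removeSequence(seqIdx, llm.DoneReasonLength) // done reason assumed
		break
	}
	err := s.cache.ShiftCacheSlot(seq.cache, seq.numKeep) // signature assumed
	if err != nil {
		var reprocess *ErrReprocessInputs
		if !errors.As(err, &reprocess) {
			return err
		}
		// Inputs evicted from the cache must be evaluated again: put them
		// back at the front of the queue and revisit on the next pass.
		seq.inputs = append(reprocess.Inputs, seq.inputs...)
		break
	}
}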
If we were to refactor this, the report suggests introducing helpers like:
- buildNextBatchLocked (allocate and fill a batch while holding the mutex), and
- updateSequencesLocked (apply logits to sequences, handle sampling and stopping).
We’ll come back to that when we talk refactors.
Stop Tokens, Unicode, and Trustworthy Streams
Once logits are available, the loop switches from “fill the GPU” mode to “deliver a high-quality stream” mode. This is where stop sequences, logprobs, and UTF-8 handling enter the picture.
After sampling a token, the runner converts it to a piece (a string fragment), appends that to pendingResponses, and then treats the concatenation as the current partial output:
seq.pendingResponses = append(seq.pendingResponses, piece)
sequence := strings.Join(seq.pendingResponses, "")
if ok, stop := common.FindStop(sequence, seq.stop); ok {
// truncate pendingResponses and logprobs to remove stop sequence
// adjust cache length to match
s.removeSequence(i, llm.DoneReasonStop)
continue
}
if common.ContainsStopSuffix(sequence, seq.stop) {
continue
}
if common.IncompleteUnicode(sequence) {
continue
}
if !flushPending(seq) {
s.removeSequence(i, llm.DoneReasonConnectionClosed)
}
There’s a lot going on in this small snippet:
- Stop detection is string-based. FindStop scans the assembled string for any configured stop sequence and returns both a flag and the matched stop value.
- Partial matches are respected. ContainsStopSuffix checks whether the current tail of the string could form a stop sequence if more tokens arrive, so the loop holds off on flushing (sketched below).
- Unicode integrity is enforced. IncompleteUnicode gate-keeps the stream to avoid sending invalid UTF-8 to clients.
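The partial-match check is the subtlest of the three, so here is a sketch of the idea behind ContainsStopSuffix. It illustrates the technique only and is not the implementation in the common package.

// containsStopSuffix reports whether the tail of the pending output could
// still grow into one of the stop sequences, in which case flushing must
// wait for more tokens. Sketch of the idea only.
func containsStopSuffix(pending string, stops []string) bool {
	for _, stop := range stops {
		// Check every proper prefix of the stop sequence against the tail
		// of the pending text, longest first.
		for i := len(stop) - 1; i > 0; i-- {
			if strings.HasSuffix(pending, stop[:i]) {
				return true
			}
		}
	}
	return false
}

With a stop sequence of "</answer>", a pending string ending in "</ans" returns true, so the runner buffers rather than streaming a fragment it might later need to cut.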
The final safety net is flushPending:
func flushPending(seq *Sequence) bool {
joined := strings.Join(seq.pendingResponses, "")
logprobs := seq.pendingLogprobs
seq.pendingResponses = []string{}
seq.pendingLogprobs = []llm.Logprob{}
// ensure valid UTF-8
for !utf8.ValidString(joined) {
joined = joined[:len(joined)-1]
}
if len(joined) == 0 {
return true
}
select {
case seq.responses <- response{content: joined, logprobs: logprobs}:
return true
case <-seq.quit:
return false
}
}
This function guarantees two properties that are extremely important for clients:
- Every chunk is valid UTF‑8. Anything else could break downstream JSON parsers or terminal renderers.
- Logprobs stay aligned with content. When stop sequences cause truncation, the code trims pendingLogprobs by the same number of tokens removed from pendingResponses.
From a story perspective, this is where we see how much responsibility processBatch has accumulated. It’s not just scheduling GPU work; it’s enforcing protocol-level guarantees about what clients receive. That improves performance (no extra goroutines or channels) but makes changes—like adding a new stop condition or supporting alternative encodings—riskier.
Performance, Contention, and Operations
So far, we’ve looked at what the code does. Let’s connect that to how it behaves in production: where it might bottleneck, and what metrics we’d want to observe.
Single Dispatcher, Many Sequences
The runner uses a classic producer–consumer pattern: HTTP handlers produce sequences, and a single goroutine (run) consumes them by repeatedly calling processBatch. This has a few important implications:
- Throughput is bounded by one llama context. All sequences share that context and its mutex, so scaling beyond one GPU pipeline requires multiple runner processes.
- s.mu is a contention point. The mutex protects s.seqs, s.cache, and related state. Today it is held while we both build the batch and call Decode. That simplifies correctness but can block new requests from being admitted in the middle of a large batch.
- seqsSem limits concurrency. Before a handler inserts a sequence into s.seqs, it acquires a weighted semaphore. This acts as a coarse backpressure mechanism: too many active sequences and new requests block (sketched after this list).
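Here is what that admission path can look like with golang.org/x/sync/semaphore, which is what a weighted semaphore usually means in Go. The handler shape, the error response, and the variable names are assumptions; the point is the acquire-then-insert-then-signal pattern.

// Sketch of sequence admission inside a handler. seq is the *Sequence built
// by NewSequence; cache-slot allocation and response streaming are omitted.
if err := s.seqsSem.Acquire(r.Context(), 1); err != nil {
	// The client disconnected (or the server shut down) while waiting.
	http.Error(w, "server busy", http.StatusServiceUnavailable)
	return
}
defer s.seqsSem.Release(1)

s.mu.Lock()
for i := range s.seqs {
	if s.seqs[i] == nil {
		s.seqs[i] = seq  // hand the sequence to the batch loop
		s.cond.Signal()  // wake run() so processBatch picks it up
		break
	}
}
s.mu.Unlock()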
The report calls out processBatch and llama.Context.Decode as the hot paths, which matches our mental model.
Which Metrics Actually Matter?
If we’re running this in production, we want quantitative feedback that our batching strategy is working. The report suggests several useful metrics; let’s highlight three that directly relate to our story:
| Metric | Why it matters | What to look for |
|---|---|---|
| runner_active_sequences | Shows how full s.seqs is compared to parallel. | Under steady load, aim for 50–80% occupancy to keep headroom for spikes. |
| runner_decode_batch_size | Average batch.NumTokens() per Decode call. | If averages stay below ~30–40% of the configured batchSize, batching isn’t effective. |
| runner_request_latency_ms | End-to-end latency for /completion and /embedding. | Track p95 time-to-first-token and total latency; spikes can signal contention or under-batching. |
These metrics let us validate (or falsify) the assumptions built into processBatch. If latency is high but batch sizes are small, we may be idling the GPU. If active sequences are always at parallel and latency climbs, we likely need horizontal scaling.
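None of these metrics exist in the runner today. If you wanted a low-ceremony starting point with only the standard library, an expvar sketch might look like the following; the variable names mirror the table above and are otherwise made up.

// Hypothetical counters; the runner does not export these today.
var (
	activeSequences = expvar.NewInt("runner_active_sequences")
	lastBatchTokens = expvar.NewInt("runner_decode_batch_size")
	decodeCalls     = expvar.NewInt("runner_decode_calls_total")
)

// Inside processBatch, just before s.lc.Decode(batch):
//	lastBatchTokens.Set(int64(batch.NumTokens()))
//	decodeCalls.Add(1)
//
// And wherever sequences are added or removed:
//	activeSequences.Add(1)  // on insert into s.seqs
//	activeSequences.Add(-1) // in removeSequence

Latency is the odd one out: a p95 needs a histogram, so runner_request_latency_ms would call for a metrics library (or at least a ring buffer of durations) rather than a single counter.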
Operational Rough Edges
Two operational choices are worth calling out:
- Panics as error handling. Both run and loadModel use panic on decode or model-load errors. That’s convenient to implement but means a transient error will crash the whole runner, relying on external supervision to restart it.
- No explicit HTTP timeouts. The http.Server is created without ReadTimeout, WriteTimeout, or IdleTimeout. Slow or misbehaving clients can tie up connections indefinitely.
Refactors That Preserve Speed
Given this tour, where would we improve the design without sacrificing the batching performance that makes the runner worthwhile? The report surfaces three concrete refactors that align closely with our narrative.
1. Split the Batch Loop into Orchestrator + Helpers
Right now, processBatch is responsible for:
- Scanning sequences and building a batch,
- Handling context shifts and reprocessing,
- Calling Decode and Synchronize,
- Moving inputs into the cache,
- Sampling tokens and computing logprobs,
- Detecting stop sequences and adjusting cache/logprobs, and
- Flushing responses and removing finished sequences.
That’s a lot for one function. The suggested refactor keeps the batching behavior identical but separates concerns:
- buildNextBatchLocked (requires s.mu): choose which tokens/embeddings to add to the next batch and update seq.pendingInputs, s.nextSeq, etc.
- updateSequencesLocked (requires s.mu): after decode, apply logits to each sequence: embeddings, sampling, stop handling, metrics, and removal.
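The resulting processBatch reads as a thin orchestrator. The sketch below shows the intended shape only; the helper signatures are assumptions built on the report's suggested names, and nothing like it exists in the repository today.

// Sketch of processBatch after the split. Helper signatures are assumed.
func (s *Server) processBatch(tokenBatch, embedBatch *llama.Batch) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.allNil() {
		s.cond.Wait() // wait until a handler adds a sequence
	}

	batch, numOutputs, err := s.buildNextBatchLocked(tokenBatch, embedBatch)
	if err != nil {
		return err
	}
	if batch == nil || batch.NumTokens() == 0 {
		return nil
	}

	if err := s.lc.Decode(batch); err != nil {
		return fmt.Errorf("failed to decode batch: %w", err)
	}
	if numOutputs > 0 {
		s.lc.Synchronize()
	}

	// Sampling, stop handling, flushing, and sequence removal live here now.
	return s.updateSequencesLocked(batch)
}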
This has three concrete benefits:
- You can unit-test batch construction separately from sampling logic.
- You can reason about fairness and context shifting without mentally simulating post-decode behavior.
- Future features, like alternate sampling strategies or richer stop conditions, can live in updateSequencesLocked without touching the hot, performance-sensitive batch construction loop.
2. Turn Model-Load Panics into Errors and Status
loadModel currently panics on any error while loading weights, creating the context, applying LoRA adapters, initializing the image projector, or creating the cache. The refactor proposes returning an error instead and updating s.status accordingly:
- loadModel becomes func (...) error.
- /load runs it in a goroutine and, on error, logs and sets ServerStatusError.
This doesn’t change the happy path at all, but it makes failure modes far friendlier: /health can reflect a persistent failure, logs carry the specific error, and supervisors don’t see opaque panics.
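A sketch of the handler side of that change follows. The ServerStatusError name comes from the description above; everything else (the handler and field names, the ready-status constant, the locking) is an assumption rather than current code.

// Sketch: load the model asynchronously and surface failures via status
// instead of panicking. Names and signatures are illustrative.
func (s *Server) load(w http.ResponseWriter, r *http.Request) {
	// ... decode the load request into params ...

	go func() {
		if err := s.loadModel(params); err != nil {
			slog.Error("failed to load model", "error", err)
			s.mu.Lock()
			s.status = ServerStatusError
			s.mu.Unlock()
			return
		}
		s.mu.Lock()
		s.status = ServerStatusReady
		s.mu.Unlock()
	}()

	w.WriteHeader(http.StatusOK)
}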
3. Add HTTP Timeouts and Graceful Shutdown
At the HTTP level, a small change to Execute can drastically improve robustness: configure ReadTimeout, WriteTimeout, and IdleTimeout, and treat http.ErrServerClosed as a normal shutdown instead of a fatal error.
Even with generous values, timeouts protect the runner from clients that read slowly or never consume streamed responses, and they make it easier to add a proper shutdown path later (for example, tied to a context or OS signal).
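A minimal sketch of that change in Execute, where mux and listener stand in for whatever handler and net.Listener the function already builds, and the timeout values are placeholders rather than recommendations:

// Sketch: give the HTTP server explicit timeouts and treat a clean
// shutdown as a non-error. Values are placeholders.
srv := &http.Server{
	Handler:      mux,
	ReadTimeout:  30 * time.Second,
	WriteTimeout: 5 * time.Minute, // generous: completions stream for a while
	IdleTimeout:  2 * time.Minute,
}

if err := srv.Serve(listener); err != nil && !errors.Is(err, http.ErrServerClosed) {
	return fmt.Errorf("server error: %w", err)
}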
Practical Takeaways You Can Reuse
We’ve walked through the Ollama llama runner from HTTP entrypoints to GPU-bound batching and back out as a streamed response. The real story isn’t just how the code works; it’s the design lessons we can carry into our own systems.
1. Centralize Context and Sequence Management
If you’re serving an LLM with a fixed context window, treat context enforcement as a first-class concern. A constructor like NewSequence that owns tokenization, truncation, and sampling warmup vastly reduces the surface area for off-by-one and overflow bugs.
2. Separate “Keep the GPU Busy” from “Shape the Response”
Batch construction and decode scheduling care about throughput and fairness. Stop sequences, Unicode validity, and logprob alignment care about correctness at the API boundary. It’s tempting to collapse them into a single tight loop, but even extracting small helpers can make future changes much safer.
3. Prefer Explicit States Over Panics for Operational Errors
When a model fails to load or a decode call errors, you usually want:
- a log entry with details,
- a status flag that /health can expose, and
- a path for a supervising system to decide whether to restart or reroute traffic.
Turning panics into structured errors plus a ServerStatusError state gives you all three.
4. Measure What Your Batcher Is Actually Doing
Exposing metrics like active sequences, average batch size, and request latency lets you validate that your clever batch loop is paying off. Without them, it’s easy to end up with complex code that doesn’t actually improve throughput in practice.
Most importantly, when you’re pushing for performance, remember that you (or someone on your team) will need to change this code in six months. Batching tokens doesn’t have to cost you your sanity. With clear boundaries, careful invariants, and a few well-placed helpers, you can keep both the GPU and the future maintainers happy.



