We’re examining how Ollama turns a local LLM into a self-contained microservice. Ollama is a system for running large language models locally, with an emphasis on clean APIs and predictable performance. At the core of that experience is llm/server.go, which treats each model as its own isolated runner process with a small HTTP API on 127.0.0.1. I’m Mahmoud Zalt, an AI software engineer, and we’ll use this file as a case study in designing heavyweight components—like LLMs—as robust, resource-aware local services.
The core lesson: treat each model as a local microservice with a clear interface, explicit resource planning, and strong guardrails around behavior and observability. We’ll walk from the public LlamaServer interface, through GPU and memory planning, into the HTTP load protocol, streaming completions, and the operational patterns that make the whole thing manageable in production.
LLM as a Local Microservice
llm/server.go is not just a thin wrapper over a library call. It turns each model into a self-contained runner process, reachable over HTTP on localhost, with its own lifecycle, resource budget, and failure modes.
```
Project (ollama)
└── llm/
    └── server.go (this file)
        ├── Interface: LlamaServer
        ├── Implementations:
        │   ├── llamaServer (legacy llama.cpp runner + ggml)
        │   └── ollamaServer (new Ollama engine + textProcessor)
        ├── Process management:
        │   └── StartRunner() --> spawns `ollama runner` subprocess
        ├── HTTP protocol (to runner on 127.0.0.1:port):
        │   ├── GET /health (getServerStatus)
        │   ├── POST /load (initModel)
        │   ├── POST /completion (Completion)
        │   └── POST /embedding (Embedding)
        └── GPU layout engine:
            ├── createLayout()
            ├── buildLayout()
            ├── assignLayers()
            ├── findBestFit()
            └── greedyFit()
```
llm/server.go acts as a client SDK for a per-model runner subprocess. The central abstraction is the LlamaServer interface:
```go
type LlamaServer interface {
    ModelPath() string
    Load(ctx context.Context, systemInfo ml.SystemInfo, gpus []ml.DeviceInfo, requireFull bool) ([]ml.DeviceID, error)
    Ping(ctx context.Context) error
    WaitUntilRunning(ctx context.Context) error
    Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error
    Embedding(ctx context.Context, input string) ([]float32, int, error)
    Tokenize(ctx context.Context, content string) ([]int, error)
    Detokenize(ctx context.Context, tokens []int) (string, error)
    Close() error
    VRAMSize() uint64
    TotalSize() uint64
    VRAMByGPU(id ml.DeviceID) uint64
    Pid() int
    GetPort() int
    GetDeviceInfos(ctx context.Context) []ml.DeviceInfo
    HasExited() bool
}
```
LlamaServer is a façade over an entire mini-system: process management, GPU planning, HTTP RPC, and memory accounting. Callers see simple operations—Load, Completion, Embedding, Tokenize, Detokenize, Close—while the messy details stay behind the interface.
A concrete llmServer struct backs this interface. It owns:
- the runner subprocess (an `*exec.Cmd` and a `done` channel),
- port allocation and health checks,
- a semaphore to cap concurrent requests per runner, and
- an `*ml.BackendMemory` structure that tracks VRAM and CPU usage for the loaded model.
Two implementations embed this base behavior:
- `llamaServer` for the legacy llama.cpp + GGML backend.
- `ollamaServer` for the newer engine with a `TextProcessor`.
This is a straightforward Strategy pattern: a single interface, multiple concrete strategies that can be swapped at runtime. That pattern is what lets Ollama evolve the engine without changing the rest of the codebase.
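The shape is easy to reproduce in miniature. In this sketch, `completer`, `legacyRunner`, and `ollamaRunner` are toy stand-ins for the real interface and implementations, not Ollama's actual types:

```go
package main

import "fmt"

// completer stands in for the LlamaServer interface; callers only
// ever see this contract, never the concrete engine behind it.
type completer interface {
	Completion(prompt string) string
}

type legacyRunner struct{} // stand-in for llamaServer (llama.cpp + GGML)
type ollamaRunner struct{} // stand-in for ollamaServer (new engine + TextProcessor)

func (legacyRunner) Completion(p string) string { return "legacy:" + p }
func (ollamaRunner) Completion(p string) string { return "ollama:" + p }

// newRunner selects a strategy at construction time; swapping engines
// requires no change anywhere else in the codebase.
func newRunner(useOllamaEngine bool) completer {
	if useOllamaEngine {
		return ollamaRunner{}
	}
	return legacyRunner{}
}

func main() {
	fmt.Println(newRunner(true).Completion("hi"))
	fmt.Println(newRunner(false).Completion("hi"))
}
```

The choice of strategy is a single boolean at construction; everything downstream is written against the interface alone.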
Planning GPUs and Memory Explicitly
Once you treat the model as a local microservice, the next problem is resource planning: how to map model layers onto GPUs and CPU so the runner fits within the machine’s budget.
Computing layer costs
The GPU layout engine lives in functions like buildLayout, assignLayers, findBestFit, and greedyFit. It starts by computing how big each layer is in bytes:
```go
func (s *llmServer) buildLayout(systemGPUs []ml.DeviceInfo, memory *ml.BackendMemory,
    requireFull bool, backoff float32) (ml.GPULayersList, []uint64) {
    gpus := append(make([]ml.DeviceInfo, 0, len(systemGPUs)), systemGPUs...)
    sort.Sort(sort.Reverse(ml.ByFreeMemory(gpus)))

    layers := make([]uint64, len(memory.CPU.Weights))
    for i := range layers {
        for j := range memory.GPUs {
            layers[i] += memory.GPUs[j].Weights[i]
            layers[i] += memory.GPUs[j].Cache[i]
        }
        layers[i] += memory.CPU.Weights[i]
        layers[i] += memory.CPU.Cache[i]
        logutil.Trace("layer to assign", "layer", i, "size", format.HumanBytes2(layers[i]))
    }

    // ... then calls assignLayers(...)
}
```
This builds a slice layers where layers[i] is the total bytes required for layer i across CPU and GPUs, including weights and cache. That gives a stable, backend-agnostic view of costs before making placement decisions.
Packing layers onto GPUs
With layer sizes and per-GPU free memory, the planner decides where to put each layer:
- `assignLayers` chooses how many GPUs to involve and whether some layers (like the output layer) must stay on CPU when VRAM is tight.
- `findBestFit` binary-searches a "capacity factor" to balance utilization across GPUs instead of overfilling one device.
- `greedyFit` implements the actual packing, iterating layers (typically from the end) and dropping them onto GPUs until their free space is exhausted.
The algorithm is purposely heuristic: roughly O(L * G) for L layers and G GPUs, which is fine because model loads are rare relative to inference. The tradeoff favors predictable, debuggable behavior over optimality.
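A toy version of the greedy packing step makes the O(L * G) shape concrete. The function below is a sketch under invented inputs (per-layer byte sizes and per-GPU free budgets), not Ollama's actual `greedyFit`; it walks layers from the end and spills whatever doesn't fit to CPU:

```go
package main

import "fmt"

// greedyFit assigns layers (iterated from the end) to GPUs until each
// GPU's free space is exhausted. It returns, per GPU, the indices of
// layers placed there; layers no GPU can hold fall back to CPU.
func greedyFit(layers []uint64, gpuFree []uint64) (placed [][]int, cpu []int) {
	placed = make([][]int, len(gpuFree))
	free := append([]uint64(nil), gpuFree...)
	g := 0
	for i := len(layers) - 1; i >= 0; i-- {
		// advance to the next GPU with enough room for this layer
		for g < len(free) && free[g] < layers[i] {
			g++
		}
		if g == len(free) {
			cpu = append(cpu, i) // no GPU has room: keep on CPU
			continue
		}
		placed[g] = append(placed[g], i)
		free[g] -= layers[i]
	}
	return placed, cpu
}

func main() {
	layers := []uint64{4, 4, 4, 4} // toy per-layer sizes in bytes
	gpus := []uint64{10, 6}        // toy free VRAM per GPU
	placed, cpu := greedyFit(layers, gpus)
	fmt.Println(placed, cpu)
}
```

Each layer is visited once and each GPU's cursor only moves forward, which is exactly why the heuristic stays cheap and its decisions are easy to replay from a log.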
Verifying the plan against reality
After computing a candidate layout, verifyLayout checks whether the plan is actually safe for the machine:
- accumulate VRAM usage for graphs and offloaded layers per device,
- compute total CPU memory requirements,
- compare CPU usage to `systemInfo.FreeMemory` and `FreeSwap` (with a macOS-specific swap exception), and
- when `requireFull` is true, enforce that all layers must fit, otherwise return `ErrLoadRequiredFull`.
This is the city planner sanity check: even if the trunks (GPUs) can technically hold all suitcases (layers), the total load must still respect system-level constraints like RAM and swap.
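A miniature of that sanity check, with invented inputs and only the sentinel name borrowed from the text, might look like:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrLoadRequiredFull = errors.New("model does not fully fit on GPUs")

// verify is a toy version of the plan check: gpuUse maps device name to
// bytes the plan would allocate there, cpuUse is what stays in RAM.
func verify(gpuUse, gpuFree map[string]uint64,
	cpuUse, freeMemory, freeSwap uint64, requireFull bool) error {
	for dev, use := range gpuUse {
		if use > gpuFree[dev] {
			return fmt.Errorf("device %s over budget", dev)
		}
	}
	if cpuUse > freeMemory+freeSwap {
		return errors.New("insufficient system memory")
	}
	if requireFull && cpuUse > 0 {
		return ErrLoadRequiredFull // caller demanded full GPU residency
	}
	return nil
}

func main() {
	err := verify(map[string]uint64{"gpu0": 8 << 30},
		map[string]uint64{"gpu0": 10 << 30},
		2<<30, 16<<30, 4<<30, false)
	fmt.Println(err)
}
```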
From Layout to Load Protocol
Planning where layers should live is not enough. The server has to negotiate with the runner process to allocate memory, load weights, and react when reality doesn’t match estimates. That negotiation is encoded as a simple load state machine over HTTP.
A state machine for loading
The load lifecycle is expressed by a LoadOperation enum:
```go
type LoadOperation int

const (
    LoadOperationFit    LoadOperation = iota // Return memory requirements but do not allocate
    LoadOperationAlloc                       // Allocate memory but do not load the weights
    LoadOperationCommit                      // Load weights - further changes cannot be made
    LoadOperationClose                       // Close model and free memory
)
```
The protocol follows a Fit → Alloc → Commit → Close flow. Fit lets the runner report memory requirements without committing. Alloc reserves memory. Commit actually loads the model, and Close tears it down. This gives the server a safe way to probe and refine its layout before it locks in.
Two loading strategies, one abstraction
Both backends implement Load via this protocol, but differ in sophistication:
- `llamaServer.Load` performs a single-pass layout based on GGML estimates, chooses GPU graph sizes, derives options like `UseMmap` from OS/backend, then sends a `LoadOperationCommit` and waits for readiness.
- `ollamaServer.Load` implements an iterative negotiation loop. It sends `Fit` and `Alloc` requests, reads back actual usage, adjusts the layout with a backoff factor when allocations fail, and only then commits the final plan.
How the iterative negotiation behaves
The new engine tracks past allocations keyed by a layout hash, and uses a backoff factor to gradually shrink its assumptions about free VRAM when allocations fail. When it detects oscillation between layouts (for example, 39 vs. 41 layers offloaded), it explores intermediate options to break the cycle. The pattern is a feedback loop: measure, adapt, avoid retrying known-bad states.
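Here is a deliberately simplified sketch of that feedback loop. The fake runner, the 10% backoff factor, and the `failed` map are assumptions chosen to illustrate measure, adapt, and remember known-bad states; they are not Ollama's actual constants or bookkeeping:

```go
package main

import (
	"errors"
	"fmt"
)

type LoadOperation int

const (
	LoadOperationFit LoadOperation = iota
	LoadOperationAlloc
	LoadOperationCommit
)

// runner is a stand-in for the subprocess: Alloc fails whenever the
// planned VRAM exceeds the runner's real (hidden) free memory.
type runner struct{ freeVRAM uint64 }

func (r *runner) load(op LoadOperation, plannedVRAM uint64) error {
	if op == LoadOperationAlloc && plannedVRAM > r.freeVRAM {
		return errors.New("allocation failed")
	}
	return nil
}

// negotiate probes with Fit, retries Alloc with a backoff-shrunk plan,
// remembers known-bad plans so they are never retried, and commits only
// once an allocation has actually succeeded.
func negotiate(r *runner, planned uint64) (uint64, error) {
	_ = r.load(LoadOperationFit, planned) // Fit: measure, allocate nothing
	failed := map[uint64]bool{}
	for planned > 0 {
		if failed[planned] {
			planned = planned * 9 / 10 // skip a known-bad state
			continue
		}
		if err := r.load(LoadOperationAlloc, planned); err != nil {
			failed[planned] = true // remember the failure
			planned = planned * 9 / 10
			continue
		}
		return planned, r.load(LoadOperationCommit, planned) // lock it in
	}
	return 0, errors.New("no layout fits")
}

func main() {
	final, err := negotiate(&runner{freeVRAM: 800}, 1000)
	fmt.Println(final, err)
}
```

Starting from a plan of 1000 against 800 bytes of real free VRAM, the loop shrinks through 900 and 810 before 729 allocates and commits.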
The runner’s HTTP surface
The wire protocol between llmServer and the runner is intentionally boring REST on 127.0.0.1:
- `POST /load` with a `LoadRequest` and a `LoadOperation`, returning a `LoadResponse`.
- `GET /health` for `ServerStatus` and load progress.
- `POST /completion` for token streaming.
- `POST /embedding` for embeddings.
initModel wraps the /load call:
```go
func (s *llmServer) initModel(ctx context.Context, req LoadRequest,
    operation LoadOperation) (*LoadResponse, error) {
    req.Operation = operation

    data, err := json.Marshal(req)
    if err != nil {
        return nil, fmt.Errorf("error marshaling load data: %w", err)
    }

    r, err := http.NewRequestWithContext(ctx, http.MethodPost,
        fmt.Sprintf("http://127.0.0.1:%d/load", s.port), bytes.NewBuffer(data))
    if err != nil {
        return nil, fmt.Errorf("error creating load request: %w", err)
    }
    r.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(r)
    // ... read body, handle status >= 400, unmarshal LoadResponse
}
```
All the interesting logic lives in layout and state management, not in the HTTP details. That separation is deliberate: it keeps the protocol simple and moves complexity into testable, in-process functions.
Bootstrapping the runner process
To turn a model into a microservice, StartRunner spawns a subprocess of the current binary with a runner subcommand on a chosen port:
```go
func StartRunner(ollamaEngine bool, modelPath string, gpuLibs []string,
    out io.Writer, extraEnvs map[string]string) (cmd *exec.Cmd, port int, err error) {
    exe, err := os.Executable()
    // ... find a free localhost port

    params := []string{"runner"}
    if ollamaEngine {
        params = append(params, "--ollama-engine")
    }
    if modelPath != "" {
        params = append(params, "--model", modelPath)
    }
    params = append(params, "--port", strconv.Itoa(port))

    cmd = exec.Command(exe, params...)
    // configure environment, GPU library paths, IO pipes
    // start process and return (cmd, port)
}
```
The rest of the system only sees PIDs and ports behind the LlamaServer interface. That boundary—“runner as separate process on localhost with a tiny API”—is what makes the model feel like a true microservice, not just a linked library.
Streaming Completions with Guardrails
Once a model is loaded and healthy, completions dominate the hot path. The design here is to keep the API simple while surrounding it with cheap guardrails: concurrency caps, format validation, bounded output, and basic protection against pathological token streams.
From request struct to streaming loop
A completion request captures the prompt and configuration:
```go
type CompletionRequest struct {
    Prompt  string
    Format  json.RawMessage
    Images  []ImageData
    Options *api.Options

    Grammar string
    Shift   bool
    Truncate bool

    Logprobs    bool
    TopLogprobs int
}
```
Format has special handling for JSON. If it is the string "json", the server injects a built-in JSON grammar. If it’s a JSON object, it is treated as JSON Schema and converted to a grammar via llama.SchemaToGrammar. Callers get structured outputs using a single parameter, without learning grammar internals.
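The dispatch can be sketched as follows; `grammarFor`, `schemaToGrammar`, and the placeholder grammar string are assumptions for illustration (`schemaToGrammar` stands in for `llama.SchemaToGrammar`, whose real output is a full sampling grammar):

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

const jsonGrammar = "root ::= ..." // built-in JSON grammar (contents elided)

// schemaToGrammar is a placeholder for llama.SchemaToGrammar, which
// compiles a JSON Schema into a sampling grammar.
func schemaToGrammar(schema json.RawMessage) string {
	return "grammar-for:" + string(schema)
}

// grammarFor mirrors the Format handling: the literal string "json"
// selects the built-in JSON grammar, a JSON object is treated as a
// schema, and anything else is rejected.
func grammarFor(format json.RawMessage) (string, error) {
	switch {
	case len(format) == 0:
		return "", nil // no structured-output constraint requested
	case bytes.Equal(format, []byte(`"json"`)):
		return jsonGrammar, nil
	case bytes.HasPrefix(bytes.TrimSpace(format), []byte("{")):
		return schemaToGrammar(format), nil
	}
	return "", errors.New("unsupported format")
}

func main() {
	g, err := grammarFor(json.RawMessage(`"json"`))
	fmt.Println(g, err)
}
```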
The main Completion method does a small but important sequence of steps:
- Interpret `Format` and set `Grammar` accordingly.
- Acquire a semaphore slot (`s.sem`) to limit per-runner concurrency.
- Clamp `NumPredict` to a multiple of the context window (for example `10 * NumCtx`) to avoid unbounded runs.
- Wait for the runner to be `Ready` via `getServerStatusRetry`.
- Send `POST /completion` and read a streaming response line by line.
- On each chunk, unmarshal JSON and forward content to the user callback.
- Abort on context cancellation or when a token repetition heuristic fires.
The streaming loop looks like this (simplified):
```go
scanner := bufio.NewScanner(res.Body)
buf := make([]byte, 0, maxBufferSize)
scanner.Buffer(buf, maxBufferSize)

var lastToken string
var tokenRepeat int

for scanner.Scan() {
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
        line := scanner.Bytes()
        if len(line) == 0 {
            continue
        }

        evt, ok := bytes.CutPrefix(line, []byte("data: "))
        if !ok {
            evt = line
        }

        var c CompletionResponse
        if err := json.Unmarshal(evt, &c); err != nil {
            return fmt.Errorf("error unmarshalling llm prediction response: %v", err)
        }

        switch {
        case strings.TrimSpace(c.Content) == lastToken:
            tokenRepeat++
        default:
            lastToken = strings.TrimSpace(c.Content)
            tokenRepeat = 0
        }

        if tokenRepeat > 30 {
            slog.Debug("prediction aborted, token repeat limit reached")
            return ctx.Err()
        }

        if c.Content != "" {
            fn(CompletionResponse{Content: c.Content, Logprobs: c.Logprobs})
        }

        if c.Done {
            fn(c)
            return nil
        }
    }
}
```
One weakness the internal report surfaces: when the repetition limit triggers, the method returns ctx.Err(), making it indistinguishable from a client-side cancellation. A more precise design would return a dedicated error (for example ErrTokenRepeatLimit), so logs and callers can tell heuristic aborts from user-initiated cancellations.
Concurrency and resource control
Both Completion and Embedding share the same semaphore. That per-runner concurrency limit, set at construction time via numParallel, is a simple but effective control:
- it bounds the load each runner can generate on GPUs and CPU,
- backpressure shows up naturally as calls blocking on the semaphore, and
- higher layers can observe saturation via metrics and adjust `numParallel`.
This fits the general theme of the file: keep the public API simple, but make resource usage and guardrails explicit inside the implementation.
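A dependency-free sketch of that cap, using a buffered channel as a counting semaphore (a simplification chosen for the example; the blocking acquire, and the backpressure it produces, is the point):

```go
package main

import (
	"fmt"
	"sync"
)

// runnerSem is a counting semaphore built from a buffered channel.
type runnerSem chan struct{}

func (s runnerSem) acquire() { s <- struct{}{} } // blocks when full: backpressure
func (s runnerSem) release() { <-s }

// runBatch pushes n fake completions through a semaphore of size
// parallel and reports the peak in-flight count it observed.
func runBatch(n, parallel int) int {
	sem := make(runnerSem, parallel)
	var mu sync.Mutex
	inFlight, peak := 0, 0

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem.acquire()
			defer sem.release()

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			// ... a real runner would issue POST /completion here ...

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak in-flight:", runBatch(8, 2))
}
```

No matter how many callers arrive, in-flight work never exceeds the configured parallelism; the excess simply waits on `acquire`.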
Operational Lessons Beyond LLMs
The final piece of treating a model as a local microservice is operability: health checks, progress reporting, memory visibility, and a code structure that remains understandable as the system evolves.
Health and load progress
WaitUntilRunning is a good pattern for supervising a long startup:
- poll `/health` with a short per-request timeout,
- track `ServerStatus` and only log when it changes to avoid noise,
- monitor `loadProgress` (0–100%) and reset a timer whenever it increases, and
- fail when a configurable `LoadTimeout` elapses without progress, including the last progress value and any error message from the runner.
That gives operators two answers: “how far along are we?” and “did we stall?”. The internal performance report suggests turning this into metrics like load duration and success/failure rates, but even at the code level, the pattern is useful: poll, track transitions, detect stalls.
Memory and device visibility
VRAMSize, TotalSize, and VRAMByGPU expose how much memory the loaded model consumes, based on the ml.BackendMemory calculated during Load. These methods don’t change state; they provide the information higher-level schedulers or monitoring systems need to:
- decide which models to evict when GPUs are near capacity,
- balance models across devices, and
- set alerts when VRAM usage is consistently high.
This is an important design choice: the microservice abstraction doesn’t just hide implementation details; it also exposes the right knobs and metrics for operational decisions.
Security and privacy in logging
The runner API is bound to 127.0.0.1, leaving exposure and authentication to higher layers. Within this file, the main security concern is logging:
- `logutil.Trace` can log full prompts and embedding inputs, which may contain sensitive data.
- Some error paths log raw response bodies from the runner, which might echo user content.
For production environments, a safer approach is to treat prompts like passwords: log metadata (sizes, model IDs, durations), not contents, except under tightly controlled debug flags.
Structural smells and refactors worth copying
Because llm/server.go has grown over time, it now mixes several concerns in one file: process management, HTTP client behavior, GPU layout, load negotiation, and the public API. The internal report calls out refactors that generalize well:
| Current smell | Impact | Refactor lesson |
|---|---|---|
| Single large file blending unrelated responsibilities | High cognitive load; hard to test layout logic or HTTP client in isolation. | Extract a runnerClient (HTTP + process), a dedicated layout package, and keep LlamaServer as a thin orchestration layer. |
| Conflated errors (token repetition vs. cancellation) | Unclear why completions stopped; hard to build precise alerts. | Define explicit error types for expected failure modes and map them cleanly to logs and metrics. |
| Implicit dependency on `http.DefaultClient` | No central control over timeouts, retries, or connection pools. | Inject a tuned `*http.Client` so behavior is explicit and testable. |
None of these are LLM-specific. They are the same patterns that make any microservice-based system easier to reason about and operate.
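As one example, the injected-client refactor might look like this in practice; the `runnerClient` name and the specific knob values are assumptions for illustration, not Ollama's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// runnerClient bundles the runner's base URL with an injected *http.Client,
// so timeouts and pooling are decided in exactly one place and tests can
// substitute a stub client.
type runnerClient struct {
	base string
	http *http.Client
}

func newRunnerClient(port int, c *http.Client) *runnerClient {
	if c == nil {
		c = &http.Client{
			// Per-request deadlines still come from contexts (important
			// for long completion streams); the transport knobs here
			// only tune connection pooling to the localhost runner.
			Transport: &http.Transport{
				MaxIdleConnsPerHost: 4,
				IdleConnTimeout:     90 * time.Second,
			},
		}
	}
	return &runnerClient{
		base: fmt.Sprintf("http://127.0.0.1:%d", port),
		http: c,
	}
}

func main() {
	c := newRunnerClient(12345, nil)
	fmt.Println(c.base)
}
```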
What to take back to your own systems
Stepping back, the main lesson from llm/server.go is architectural: heavy components behave better when you treat them as local microservices with explicit contracts and resource models, not as opaque libraries. Concretely:
- Isolate heavy dependencies in their own process. Give them a narrow API over localhost so they can crash, restart, and be upgraded without taking down your main service.
- Make resource planning a first-class concern. Compute per-unit costs (layers, shards, tenants), run a heuristic placement, then verify against real system constraints.
- Negotiate with reality. Use probe phases like `Fit` before committing allocations. Assume estimates are wrong and let the system tell you what actually fits.
- Add guardrails around streaming interfaces. Cap buffers, limit output, detect obvious loops, and surface distinct error types for distinct failure modes.
- Expose operational signals via your abstractions. Methods like `VRAMSize` and `HasExited` are how SREs and higher-level schedulers keep the system healthy.
Treating an LLM as a local microservice forces you to confront lifecycle, resources, and observability head-on. If you apply the same discipline to other heavyweight pieces in your architecture—databases, search engines, batch workers—you’ll end up with systems that are not just functional, but predictable and operable under real-world load.