The Local Microservice Behind Every Model

What does “The Local Microservice Behind Every Model” really mean in practice? Think of each model as its own small service running right beside your app.

Code Cracking
30m read
#microservices #softwarearchitecture #machinelearning #backend
We’re examining how Ollama turns a local LLM into a self-contained microservice. Ollama is a system for running large language models locally, with an emphasis on clean APIs and predictable performance. At the core of that experience is llm/server.go, which treats each model as its own isolated runner process with a small HTTP API on 127.0.0.1. I’m Mahmoud Zalt, an AI software engineer, and we’ll use this file as a case study in designing heavyweight components—like LLMs—as robust, resource-aware local services.

The core lesson: treat each model as a local microservice with a clear interface, explicit resource planning, and strong guardrails around behavior and observability. We’ll walk from the public LlamaServer interface, through GPU and memory planning, into the HTTP load protocol, streaming completions, and the operational patterns that make the whole thing manageable in production.

LLM as a Local Microservice

llm/server.go is not just a thin wrapper over a library call. It turns each model into a self-contained runner process, reachable over HTTP on localhost, with its own lifecycle, resource budget, and failure modes.

Project (ollama)
└── llm/
    └── server.go  (this file)
        ├── Interface: LlamaServer
        ├── Implementations:
        │   ├── llamaServer   (legacy llama.cpp runner + ggml)
        │   └── ollamaServer  (new Ollama engine + textProcessor)
        ├── Process management:
        │   └── StartRunner()  --> spawns `ollama runner` subprocess
        ├── HTTP protocol (to runner on 127.0.0.1:port):
        │   ├── GET  /health      (getServerStatus)
        │   ├── POST /load        (initModel)
        │   ├── POST /completion  (Completion)
        │   └── POST /embedding   (Embedding)
        └── GPU layout engine:
            ├── createLayout()
            ├── buildLayout()
            ├── assignLayers()
            ├── findBestFit()
            └── greedyFit()
llm/server.go acts as a client SDK for a per-model runner subprocess.

The central abstraction is the LlamaServer interface:

type LlamaServer interface {
    ModelPath() string
    Load(ctx context.Context, systemInfo ml.SystemInfo, gpus []ml.DeviceInfo, requireFull bool) ([]ml.DeviceID, error)
    Ping(ctx context.Context) error
    WaitUntilRunning(ctx context.Context) error
    Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error
    Embedding(ctx context.Context, input string) ([]float32, int, error)
    Tokenize(ctx context.Context, content string) ([]int, error)
    Detokenize(ctx context.Context, tokens []int) (string, error)
    Close() error
    VRAMSize() uint64
    TotalSize() uint64
    VRAMByGPU(id ml.DeviceID) uint64
    Pid() int
    GetPort() int
    GetDeviceInfos(ctx context.Context) []ml.DeviceInfo
    HasExited() bool
}

LlamaServer is a façade over an entire mini-system: process management, GPU planning, HTTP RPC, and memory accounting. Callers see simple operations—Load, Completion, Embedding, Tokenize, Detokenize, Close—while the messy details stay behind the interface.

A concrete llmServer struct backs this interface. It owns:

  • the runner subprocess (*exec.Cmd and a done channel),
  • port allocation and health checks,
  • a semaphore to cap concurrent requests per runner, and
  • an *ml.BackendMemory structure that tracks VRAM and CPU usage for the loaded model.

Two implementations embed this base behavior:

  • llamaServer for the legacy llama.cpp + GGML backend.
  • ollamaServer for the newer engine with a TextProcessor.

This is a straightforward Strategy pattern: a single interface, multiple concrete strategies that can be swapped at runtime. That pattern is what lets Ollama evolve the engine without changing the rest of the codebase.
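A minimal sketch of that Strategy setup, with invented names (`modelRunner`, `pickRunner`, and the stub behaviors are illustrative; the real types are `llamaServer` and `ollamaServer` behind `LlamaServer`):

```go
package main

import "fmt"

// modelRunner is a cut-down stand-in for the LlamaServer interface:
// one contract, multiple backends.
type modelRunner interface {
	Engine() string
	Complete(prompt string) string
}

// legacyRunner plays the role of llamaServer (llama.cpp + GGML).
type legacyRunner struct{}

func (legacyRunner) Engine() string           { return "llama.cpp" }
func (legacyRunner) Complete(p string) string { return "[llama.cpp] " + p }

// newEngineRunner plays the role of ollamaServer (new engine).
type newEngineRunner struct{}

func (newEngineRunner) Engine() string           { return "ollama" }
func (newEngineRunner) Complete(p string) string { return "[ollama] " + p }

// pickRunner swaps strategies at runtime, mirroring how Ollama
// chooses a backend per model without callers noticing.
func pickRunner(useNewEngine bool) modelRunner {
	if useNewEngine {
		return newEngineRunner{}
	}
	return legacyRunner{}
}

func main() {
	for _, useNew := range []bool{false, true} {
		r := pickRunner(useNew)
		fmt.Println(r.Engine(), "->", r.Complete("hello"))
	}
}
```

Callers depend only on the interface, so swapping engines is a one-line change at the construction site.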

Planning GPUs and Memory Explicitly

Once you treat the model as a local microservice, the next problem is resource planning: how to map model layers onto GPUs and CPU so the runner fits within the machine’s budget.

Computing layer costs

The GPU layout engine lives in functions like buildLayout, assignLayers, findBestFit, and greedyFit. It starts by computing how big each layer is in bytes:

func (s *llmServer) buildLayout(systemGPUs []ml.DeviceInfo, memory *ml.BackendMemory,
    requireFull bool, backoff float32) (ml.GPULayersList, []uint64) {

    gpus := append(make([]ml.DeviceInfo, 0, len(systemGPUs)), systemGPUs...)
    sort.Sort(sort.Reverse(ml.ByFreeMemory(gpus)))

    layers := make([]uint64, len(memory.CPU.Weights))
    for i := range layers {
        for j := range memory.GPUs {
            layers[i] += memory.GPUs[j].Weights[i]
            layers[i] += memory.GPUs[j].Cache[i]
        }
        layers[i] += memory.CPU.Weights[i]
        layers[i] += memory.CPU.Cache[i]
        logutil.Trace("layer to assign", "layer", i, "size", format.HumanBytes2(layers[i]))
    }

    // ... then calls assignLayers(...)
}

This builds a slice layers where layers[i] is the total bytes required for layer i across CPU and GPUs, including weights and cache. That gives a stable, backend-agnostic view of costs before making placement decisions.

Packing layers onto GPUs

With layer sizes and per-GPU free memory, the planner decides where to put each layer:

  • assignLayers chooses how many GPUs to involve and whether some layers (like the output layer) must stay on CPU when VRAM is tight.
  • findBestFit binary-searches a “capacity factor” to balance utilization across GPUs instead of overfilling one device.
  • greedyFit implements the actual packing, iterating layers (typically from the end) and dropping them onto GPUs until their free space is exhausted.

The algorithm is purposely heuristic: roughly O(L * G) for L layers and G GPUs, which is fine because model loads are rare relative to inference. The tradeoff favors predictable, debuggable behavior over optimality.

Verifying the plan against reality

After computing a candidate layout, verifyLayout checks whether the plan is actually safe for the machine:

  • accumulate VRAM usage for graphs and offloaded layers per device,
  • compute total CPU memory requirements,
  • compare CPU usage to systemInfo.FreeMemory and FreeSwap (with a macOS-specific swap exception), and
  • when requireFull is true, enforce that all layers must fit, otherwise return ErrLoadRequiredFull.

This is the city planner sanity check: even if the trunks (GPUs) can technically hold all suitcases (layers), the total load must still respect system-level constraints like RAM and swap.

From Layout to Load Protocol

Planning where layers should live is not enough. The server has to negotiate with the runner process to allocate memory, load weights, and react when reality doesn’t match estimates. That negotiation is encoded as a simple load state machine over HTTP.

A state machine for loading

The load lifecycle is expressed by a LoadOperation enum:

type LoadOperation int

const (
    LoadOperationFit    LoadOperation = iota // Return memory requirements but do not allocate
    LoadOperationAlloc                       // Allocate memory but do not load the weights
    LoadOperationCommit                      // Load weights - further changes cannot be made
    LoadOperationClose                       // Close model and free memory
)

The protocol follows a Fit → Alloc → Commit → Close flow. Fit lets the runner report memory requirements without committing. Alloc reserves memory. Commit actually loads the model, and Close tears it down. This gives the server a safe way to probe and refine its layout before it locks in.
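One useful property of the flow is that Commit is a point of no return: after it, further probing is invalid. A toy client-side state machine makes that rule explicit (the `loader` type and `Apply` method are invented; only the `LoadOperation` values come from the file):

```go
package main

import "fmt"

type LoadOperation int

const (
	LoadOperationFit LoadOperation = iota
	LoadOperationAlloc
	LoadOperationCommit
	LoadOperationClose
)

// loader enforces the ordering described above: Fit and Alloc are
// probe phases, and once Commit happens, no further probing is allowed
// until Close. An illustrative sketch, not Ollama's implementation.
type loader struct{ committed bool }

func (l *loader) Apply(op LoadOperation) error {
	switch op {
	case LoadOperationFit, LoadOperationAlloc:
		if l.committed {
			return fmt.Errorf("cannot probe after commit")
		}
	case LoadOperationCommit:
		l.committed = true
	case LoadOperationClose:
		l.committed = false
	}
	return nil
}

func main() {
	var l loader
	for _, op := range []LoadOperation{LoadOperationFit, LoadOperationAlloc, LoadOperationCommit} {
		fmt.Println(op, l.Apply(op))
	}
	fmt.Println("probe after commit:", l.Apply(LoadOperationFit))
}
```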

Two loading strategies, one abstraction

Both backends implement Load via this protocol, but differ in sophistication:

  • llamaServer.Load performs a single-pass layout based on GGML estimates, chooses GPU graph sizes, derives options like UseMmap from OS/backend, then sends a LoadOperationCommit and waits for readiness.
  • ollamaServer.Load implements an iterative negotiation loop. It sends Fit and Alloc requests, reads back actual usage, adjusts the layout with a backoff factor when allocations fail, and only then commits the final plan.
How the iterative negotiation behaves

The new engine tracks past allocations keyed by a layout hash, and uses a backoff factor to gradually shrink its assumptions about free VRAM when allocations fail. When it detects oscillation between layouts (for example, 39 vs. 41 layers offloaded), it explores intermediate options to break the cycle. The pattern is a feedback loop: measure, adapt, avoid retrying known-bad states.
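The loop's essential moves — shrink the assumed budget on failure, never retry a known-bad state — can be sketched like this. The `negotiate` function, the percentage backoff schedule, and `tryAlloc` are all invented stand-ins for the real Fit/Alloc round-trips and layout hashing:

```go
package main

import "fmt"

// negotiate probes with a shrinking VRAM budget, remembers budgets
// that already failed, and stops at the first one the (simulated)
// runner accepts. An illustrative feedback loop, not Ollama's code.
func negotiate(freeVRAM uint64, tryAlloc func(budget uint64) bool) (uint64, int) {
	failed := map[uint64]bool{}
	attempts := 0
	for pct := uint64(100); pct >= 20; pct -= 20 { // backoff: 100%, 80%, 60%, ...
		budget := freeVRAM * pct / 100
		if failed[budget] {
			continue // never retry a known-bad state
		}
		attempts++
		if tryAlloc(budget) {
			return budget, attempts
		}
		failed[budget] = true
	}
	return 0, attempts
}

func main() {
	// Pretend allocations only succeed at or below 6 GiB of VRAM.
	accepts := func(b uint64) bool { return b <= 6<<30 }
	budget, attempts := negotiate(10<<30, accepts)
	fmt.Printf("settled on %d GiB after %d attempts\n", budget>>30, attempts)
}
```

The real engine keys its memo on a hash of the whole layout rather than a single number, which is what lets it detect and break oscillation between two near-equivalent plans.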

The runner’s HTTP surface

The wire protocol between llmServer and the runner is intentionally boring REST on 127.0.0.1:

  • POST /load with a LoadRequest and a LoadOperation, returning a LoadResponse.
  • GET /health for ServerStatus and load progress.
  • POST /completion for token streaming.
  • POST /embedding for embeddings.

initModel wraps the /load call:

func (s *llmServer) initModel(ctx context.Context, req LoadRequest,
    operation LoadOperation) (*LoadResponse, error) {

    req.Operation = operation
    data, err := json.Marshal(req)
    if err != nil {
        return nil, fmt.Errorf("error marshaling load data: %w", err)
    }

    r, err := http.NewRequestWithContext(ctx, http.MethodPost,
        fmt.Sprintf("http://127.0.0.1:%d/load", s.port), bytes.NewBuffer(data))
    if err != nil {
        return nil, fmt.Errorf("error creating load request: %w", err)
    }
    r.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(r)
    // ... read body, handle status >= 400, unmarshal LoadResponse
}

All the interesting logic lives in layout and state management, not in the HTTP details. That separation is deliberate: it keeps the protocol simple and moves complexity into testable, in-process functions.

Bootstrapping the runner process

To turn a model into a microservice, StartRunner spawns a subprocess of the current binary with a runner subcommand on a chosen port:

func StartRunner(ollamaEngine bool, modelPath string, gpuLibs []string,
    out io.Writer, extraEnvs map[string]string) (cmd *exec.Cmd, port int, err error) {

    exe, err := os.Executable()
    // ... find a free localhost port

    params := []string{"runner"}
    if ollamaEngine {
        params = append(params, "--ollama-engine")
    }
    if modelPath != "" {
        params = append(params, "--model", modelPath)
    }
    params = append(params, "--port", strconv.Itoa(port))

    cmd = exec.Command(exe, params...)
    // configure environment, GPU library paths, IO pipes
    // start process and return (cmd, port)
}

The rest of the system only sees PIDs and ports behind the LlamaServer interface. That boundary—“runner as separate process on localhost with a tiny API”—is what makes the model feel like a true microservice, not just a linked library.

Streaming Completions with Guardrails

Once a model is loaded and healthy, completions dominate the hot path. The design here is to keep the API simple while surrounding it with cheap guardrails: concurrency caps, format validation, bounded output, and basic protection against pathological token streams.

From request struct to streaming loop

A completion request captures the prompt and configuration:

type CompletionRequest struct {
    Prompt  string
    Format  json.RawMessage
    Images  []ImageData
    Options *api.Options

    Grammar     string
    Shift       bool
    Truncate    bool
    Logprobs    bool
    TopLogprobs int
}

Format has special handling for JSON. If it is the string "json", the server injects a built-in JSON grammar. If it’s a JSON object, it is treated as JSON Schema and converted to a grammar via llama.SchemaToGrammar. Callers get structured outputs using a single parameter, without learning grammar internals.

The main Completion method does a small but important sequence of steps:

  1. Interpret Format and set Grammar accordingly.
  2. Acquire a semaphore slot (s.sem) to limit per-runner concurrency.
  3. Clamp NumPredict to a multiple of the context window (for example 10 * NumCtx) to avoid unbounded runs.
  4. Wait for the runner to be Ready via getServerStatusRetry.
  5. Send POST /completion and read a streaming response line by line.
  6. On each chunk, unmarshal JSON and forward content to the user callback.
  7. Abort on context cancellation or when a token repetition heuristic fires.

The streaming loop looks like this (simplified):

scanner := bufio.NewScanner(res.Body)
buf := make([]byte, 0, maxBufferSize)
scanner.Buffer(buf, maxBufferSize)

var lastToken string
var tokenRepeat int

for scanner.Scan() {
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
        line := scanner.Bytes()
        if len(line) == 0 {
            continue
        }

        evt, ok := bytes.CutPrefix(line, []byte("data: "))
        if !ok {
            evt = line
        }

        var c CompletionResponse
        if err := json.Unmarshal(evt, &c); err != nil {
            return fmt.Errorf("error unmarshalling llm prediction response: %v", err)
        }

        switch {
        case strings.TrimSpace(c.Content) == lastToken:
            tokenRepeat++
        default:
            lastToken = strings.TrimSpace(c.Content)
            tokenRepeat = 0
        }

        if tokenRepeat > 30 {
            slog.Debug("prediction aborted, token repeat limit reached")
            return ctx.Err()
        }

        if c.Content != "" {
            fn(CompletionResponse{Content: c.Content, Logprobs: c.Logprobs})
        }
        if c.Done {
            fn(c)
            return nil
        }
    }
}

One weakness the internal report surfaces: when the repetition limit triggers, the method returns ctx.Err(), making it indistinguishable from a client-side cancellation. A more precise design would return a dedicated error (for example ErrTokenRepeatLimit), so logs and callers can tell heuristic aborts from user-initiated cancellations.

Concurrency and resource control

Both Completion and Embedding share the same semaphore. That per-runner concurrency limit, set at construction time via numParallel, is a simple but effective control:

  • it bounds the load each runner can generate on GPUs and CPU,
  • backpressure shows up naturally as calls blocking on the semaphore, and
  • higher layers can observe saturation via metrics and adjust numParallel.

This fits the general theme of the file: keep the public API simple, but make resource usage and guardrails explicit inside the implementation.
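A stdlib-only sketch of that per-runner cap, using a buffered channel as a counting semaphore (Ollama uses a semaphore for the same purpose; the `runnerSem` type here is an illustrative stand-in, not its implementation):

```go
package main

import "fmt"

// runnerSem caps in-flight requests at numParallel, the way the
// per-runner semaphore bounds Completion and Embedding calls.
type runnerSem struct{ slots chan struct{} }

func newRunnerSem(numParallel int) *runnerSem {
	return &runnerSem{slots: make(chan struct{}, numParallel)}
}

// Acquire blocks when numParallel requests are already active --
// backpressure shows up naturally as callers waiting here.
func (s *runnerSem) Acquire() { s.slots <- struct{}{} }
func (s *runnerSem) Release() { <-s.slots }

// TryAcquire reports saturation without blocking, which is what a
// higher-level scheduler would observe via metrics.
func (s *runnerSem) TryAcquire() bool {
	select {
	case s.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	sem := newRunnerSem(2) // numParallel = 2
	fmt.Println(sem.TryAcquire(), sem.TryAcquire()) // two slots: true true
	fmt.Println(sem.TryAcquire())                   // saturated: false
	sem.Release()
	fmt.Println(sem.TryAcquire()) // a slot freed up: true
}
```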

Operational Lessons Beyond LLMs

The final piece of treating a model as a local microservice is operability: health checks, progress reporting, memory visibility, and a code structure that remains understandable as the system evolves.

Health and load progress

WaitUntilRunning is a good pattern for supervising a long startup:

  • poll /health with a short per-request timeout,
  • track ServerStatus and only log when it changes to avoid noise,
  • monitor loadProgress (0–100%) and reset a timer whenever it increases, and
  • fail when a configurable LoadTimeout elapses without progress, including the last progress value and any error message from the runner.

That gives operators two answers: “how far along are we?” and “did we stall?”. The internal performance report suggests turning this into metrics like load duration and success/failure rates, but even at the code level, the pattern is useful: poll, track transitions, detect stalls.

Memory and device visibility

VRAMSize, TotalSize, and VRAMByGPU expose how much memory the loaded model consumes, based on the ml.BackendMemory calculated during Load. These methods don’t change state; they provide the information higher-level schedulers or monitoring systems need to:

  • decide which models to evict when GPUs are near capacity,
  • balance models across devices, and
  • set alerts when VRAM usage is consistently high.

This is an important design choice: the microservice abstraction doesn’t just hide implementation details; it also exposes the right knobs and metrics for operational decisions.

Security and privacy in logging

The runner API is bound to 127.0.0.1, leaving exposure and authentication to higher layers. Within this file, the main security concern is logging:

  • logutil.Trace can log full prompts and embedding inputs, which may contain sensitive data.
  • Some error paths log raw response bodies from the runner, which might echo user content.

For production environments, a safer approach is to treat prompts like passwords: log metadata (sizes, model IDs, durations), not contents, except under tightly controlled debug flags.

Structural smells and refactors worth copying

Because llm/server.go has grown over time, it now mixes several concerns in one file: process management, HTTP client behavior, GPU layout, load negotiation, and the public API. The internal report calls out refactors that generalize well:

  • Smell: a single large file blending unrelated responsibilities. Impact: high cognitive load; hard to test layout logic or the HTTP client in isolation. Lesson: extract a runnerClient (HTTP + process), a dedicated layout package, and keep LlamaServer as a thin orchestration layer.
  • Smell: conflated errors (token repetition vs. cancellation). Impact: unclear why completions stopped; hard to build precise alerts. Lesson: define explicit error types for expected failure modes and map them cleanly to logs and metrics.
  • Smell: implicit dependency on http.DefaultClient. Impact: no central control over timeouts, retries, or connection pools. Lesson: inject a tuned *http.Client so behavior is explicit and testable.

None of these are LLM-specific. They are the same patterns that make any microservice-based system easier to reason about and operate.

What to take back to your own systems

Stepping back, the main lesson from llm/server.go is architectural: heavy components behave better when you treat them as local microservices with explicit contracts and resource models, not as opaque libraries. Concretely:

  • Isolate heavy dependencies in their own process. Give them a narrow API over localhost so they can crash, restart, and be upgraded without taking down your main service.
  • Make resource planning a first-class concern. Compute per-unit costs (layers, shards, tenants), run a heuristic placement, then verify against real system constraints.
  • Negotiate with reality. Use probe phases like Fit before committing allocations. Assume estimates are wrong and let the system tell you what actually fits.
  • Add guardrails around streaming interfaces. Cap buffers, limit output, detect obvious loops, and surface distinct error types for distinct failure modes.
  • Expose operational signals via your abstractions. Methods like VRAMSize and HasExited are how SREs and higher-level schedulers keep the system healthy.

Treating an LLM as a local microservice forces you to confront lifecycle, resources, and observability head-on. If you apply the same discipline to other heavyweight pieces in your architecture—databases, search engines, batch workers—you’ll end up with systems that are not just functional, but predictable and operable under real-world load.

Full Source Code

Here's the full source code of the file that inspired this article.
Read on GitHub

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
