
Zalt Blog

Deep Dives into Code & Architecture

AT SCALE

When One File Becomes Your AI Gateway

By Mahmoud Zalt
Code Cracking
30m read

When one file becomes your AI gateway, you’re not just organizing code—you’re defining how every request touches your models. How close is your setup to that?



We’re examining how Ollama turns a single Go file, server/routes.go, into the main gateway for local and remote AI models. Ollama is a local AI runtime that lets you run, manage, and interact with LLMs through a simple HTTP API, while hiding most of the GPU and model-runtime complexity. I’m Mahmoud Zalt, an AI solutions architect, and we’ll look at how this “god file” orchestrates models, streaming, and advanced behaviors like thinking and tools — and how to design your own gateway so it scales without collapsing under its own complexity.

The Gateway: From HTTP to Model Runner

The file server/routes.go looks like a pile of handlers at first, but it’s really an entrance hall. Every request comes in, gets classified, and is forwarded to the right “room” – text generation, chat, embeddings, model management, or remote delegation – all funneled through a shared gateway to the model pool.

server/
  routes.go   <-- HTTP API layer & entrypoint
  scheduler.go (not shown) -- manages model runners
  model/
    ...       -- model configs, manifests
  llm/
    ...       -- low-level model runtime

Request Flow (simplified):

[HTTP Client]
      |
      v
[net/http.Server] --(Serve)--> [Gin Router]
      |                           |
      |                +----------+----------+
      |                |                     |
      v                v                     v
  /api/generate   /api/chat           /api/embed, /api/tags, ...
      |                |                     |
      v                v                     v
[GenerateHandler] [ChatHandler]        [Other Handlers]
      |                |
      +-------+--------+
              v
        scheduleRunner
              |
              v
        [Scheduler]
              |
              v
        [llm.LlamaServer]
              |
              v
     Streamed Completion/Embedding
              |
              v
        streamResponse / JSON
              |
              v
         [HTTP Client]
Ollama’s HTTP layer as a gateway: routing, scheduling, and orchestration live here.

The high-level pattern is consistent:

  • Serve bootstraps everything: logging, manifest pruning, GPU discovery, scheduler initialization, and net/http startup.
  • (*Server) GenerateRoutes wires all HTTP paths (native, OpenAI-compatible, Anthropic-compatible) to handlers via Gin.
  • Each handler translates HTTP JSON into internal API structs, then asks the scheduler for a suitable runner via scheduleRunner.
  • The runner is an llm.LlamaServer instance that performs the actual token generation, chat, or embeddings work.

The central design idea is to hide the “model pool” behind a small, explicit gateway. The HTTP layer can grow large, but it talks to models through one narrow interface, which is what keeps the complexity survivable.

The heart of that gateway is scheduleRunner. It validates the model name, checks capabilities (completion, tools, images, thinking, etc.), merges model defaults with request options, and then consults the scheduler for a runner:

// scheduleRunner schedules a runner after validating inputs.
func (s *Server) scheduleRunner(
    ctx context.Context,
    name string,
    caps []model.Capability,
    requestOpts map[string]any,
    keepAlive *api.Duration,
) (llm.LlamaServer, *Model, *api.Options, error) {
    if name == "" {
        return nil, nil, nil, fmt.Errorf("model %w", errRequired)
    }

    model, err := GetModel(name)
    if err != nil {
        return nil, nil, nil, err
    }

    if slices.Contains(model.Config.ModelFamilies, "mllama") && len(model.ProjectorPaths) > 0 {
        return nil, nil, nil, fmt.Errorf("'llama3.2-vision' is no longer compatible ...")
    }

    if err := model.CheckCapabilities(caps...); err != nil {
        return nil, nil, nil, fmt.Errorf("%s %w", name, err)
    }

    opts, err := s.modelOptions(model, requestOpts)
    if err != nil {
        return nil, nil, nil, err
    }

    runnerCh, errCh := s.sched.GetRunner(ctx, model, opts, keepAlive)

    var runner *runnerRef
    select {
    case runner = <-runnerCh:
    case err = <-errCh:
        return nil, nil, nil, err
    }

    return runner.llama, model, &opts, nil
}
scheduleRunner decouples HTTP concerns from GPU and model-pool concerns.

This is a classic facade: handlers like GenerateHandler, ChatHandler, and EmbedHandler all say “give me a runner that can do X” and never think about GPU counts, cached models, or queueing policies.

One Streaming Primitive for Everything

Once a runner starts emitting tokens or events, the gateway’s job is to move them to clients efficiently and consistently. Ollama uses NDJSON streaming (newline-delimited JSON) as the single primitive for partial results.

Across generation, chat, and model pull/push, the pattern is the same:

  1. A runner or background job sends values into a chan any.
  2. The handler either aggregates them (non-streaming) or hands the channel to streamResponse for streaming.

func streamResponse(c *gin.Context, ch chan any) {
    c.Header("Content-Type", "application/x-ndjson")

    c.Stream(func(w io.Writer) bool {
        val, ok := <-ch
        if !ok {
            return false
        }

        // Special case: error objects
        if h, ok := val.(gin.H); ok {
            if e, ok := h["error"].(string); ok {
                status, ok := h["status"].(int)
                if !ok {
                    status = http.StatusInternalServerError
                }

                if !c.Writer.Written() {
                    c.Header("Content-Type", "application/json")
                    c.JSON(status, gin.H{"error": e})
                } else {
                    _ = json.NewEncoder(c.Writer).
                        Encode(gin.H{"error": e})
                }

                return false
            }
        }

        bts, err := json.Marshal(val)
        if err != nil {
            slog.Info("streamResponse: json.Marshal failed", "error", err)
            return false
        }

        bts = append(bts, '\n')
        if _, err := w.Write(bts); err != nil {
            slog.Info("streamResponse: w.Write failed", "error", err)
            return false
        }

        return true
    })
}
streamResponse centralizes NDJSON streaming and error semantics.

Errors are handled in two phases:

  • If an error arrives before anything is written, the helper switches to a normal JSON error body with an appropriate status code.
  • If content has already been streamed, it cannot change the HTTP status line, so it emits a final JSON object with an error field as the last NDJSON line and ends the stream.

This cleanly separates transport-level failure (HTTP status + headers) from stream-level failure (an error event at the end of the stream). Clients can adopt a simple rule: read lines until EOF, and if the last line carries error, treat the whole operation as failed.

Scenario                            What the client sees                        How it's signaled
Validation error (e.g., bad JSON)   Single JSON object with error               400/422 with JSON body
Model error before first token      Single JSON object with error               Status set by streamResponse
Error mid-stream                    Several normal chunks, then {"error": ...}  Last NDJSON item, HTTP 200

Layering Thinking, Tools, and Structure

Up to this point the gateway looks like a conventional controller layer: handlers in, scheduler out. It gets more interesting in ChatHandler, where the gateway orchestrates thinking, tools, and structured outputs on top of raw model completions.

You can think of the LLM as an actor on stage. The handler assembles the script (prompt), the scheduler picks which actor performs, and clients watch via the stream. On top of that, the gateway plays director by attaching parsers that interpret lines as thoughts, tool calls, or JSON output.

The chat pipeline roughly does this:

  • Merge model-level messages and system prompt with request messages.
  • Optionally enable “thinking” mode for models that emit internal thoughts inside special tags.
  • Attach tools and a tool parser if the request includes tool definitions.
  • Optionally enforce structured outputs, so the final answer must match JSON or a schema.

Thinking and structured outputs conflict by default: thinking is free-form text between tags; structured outputs want strict, machine-parseable shapes. The file resolves this with a two-phase interaction:

  1. First completion: let the model think freely without format constraints.
  2. Second completion: once thinking is captured, restart with structured outputs enabled, using the previous thinking as part of the conversation history.

type structuredOutputsState int
const (
    structuredOutputsState_None structuredOutputsState = iota
    structuredOutputsState_ReadyToApply
    structuredOutputsState_Applying
)

ch := make(chan any)
go func() {
    defer close(ch)

    structuredOutputsState := structuredOutputsState_None

    for {
        var tb strings.Builder

        currentFormat := req.Format
        // First pass: disable structured outputs when thinking is active.
        if req.Format != nil && structuredOutputsState == structuredOutputsState_None &&
           ((builtinParser != nil || thinkingState != nil) &&
            slices.Contains(m.Capabilities(), model.CapabilityThinking)) {
            currentFormat = nil
        }

        ctx, cancel := context.WithCancel(c.Request.Context())
        err := r.Completion(ctx, llm.CompletionRequest{/* ... */}, func(r llm.CompletionResponse) {
            res := api.ChatResponse{/* ... */}

            if builtinParser != nil {
                content, thinking, toolCalls, err := builtinParser.Add(r.Content, r.Done)
                if err != nil {
                    ch <- gin.H{"error": err.Error()}
                    return
                }

                res.Message.Content = content
                res.Message.Thinking = thinking
                // ... tool handling omitted

                tb.WriteString(thinking)
                if structuredOutputsState == structuredOutputsState_None &&
                   req.Format != nil && tb.String() != "" && res.Message.Content != "" {
                    structuredOutputsState = structuredOutputsState_ReadyToApply
                    cancel() // stop first pass, move to structured output pass
                    return
                }

                ch <- res
                return
            }

            if thinkingState != nil {
                thinkingContent, remainingContent :=
                    thinkingState.AddContent(res.Message.Content)
                // ... similar transition logic ...
                _ = remainingContent
                _ = thinkingContent
            }

            ch <- res
        })

        if err != nil {
            if structuredOutputsState == structuredOutputsState_ReadyToApply &&
               strings.Contains(err.Error(), "context canceled") &&
               c.Request.Context().Err() == nil {
                // Expected cancellation when switching passes.
            } else {
                ch <- gin.H{"error": err.Error()}
                return
            }
        }

        if structuredOutputsState == structuredOutputsState_ReadyToApply {
            structuredOutputsState = structuredOutputsState_Applying
            msg := api.Message{
                Role:     "assistant",
                Thinking: tb.String(),
            }

            msgs = append(msgs, msg)
            prompt, _, err = chatPrompt(/* now with thinking baked in */)
            if err != nil {
                ch <- gin.H{"error": err.Error()}
                return
            }

            if shouldUseHarmony(m) || (builtinParser != nil && m.Config.Parser == "harmony") {
                prompt += "<|end|><|start|>assistant<|channel|>final<|message|>"
            }

            continue // run second pass with structured outputs
        }

        break
    }
}()
Two-pass chat: first gather thinking, then produce structured output, all inside the handler.

This logic forces ChatHandler to understand several deep concerns:

  • Model capabilities such as CapabilityThinking and CapabilityTools.
  • Template tokens like harmony’s <|start|> / <|end|>.
  • Parser state machines (built-in parser vs generic thinking parser vs tools parser).
  • The difference between “intentional” cancellation (to switch passes) and real errors.

Embeddings and Where Coupling Leaks

Embeddings look straightforward compared to chat: text in, vector out. But the embedding path in routes.go hides an important lesson about cross-layer coupling.

EmbedHandler accepts flexible input (string or array), schedules a runner, and runs embeddings in parallel via errgroup. The interesting part is the retry logic when the model rejects input for exceeding the context window:

embedWithRetry := func(text string) ([]float32, int, error) {
    emb, tokCount, err := r.Embedding(ctx, text)
    if err == nil {
        return emb, tokCount, nil
    }

    var serr api.StatusError
    if !errors.As(err, &serr) || serr.StatusCode != http.StatusBadRequest {
        return nil, 0, err
    }
    if req.Truncate != nil && !*req.Truncate {
        return nil, 0, err
    }

    tokens, err := r.Tokenize(ctx, text)
    if err != nil {
        return nil, 0, err
    }

    ctxLen := min(opts.NumCtx, int(kvData.ContextLength()))
    if bos := kvData.Uint("tokenizer.ggml.bos_token_id"); len(tokens) > 0 &&
       tokens[0] != int(bos) && kvData.Bool("add_bos_token", true) {
        ctxLen--
    }
    if eos := kvData.Uint("tokenizer.ggml.eos_token_id"); len(tokens) > 0 &&
       tokens[len(tokens)-1] != int(eos) && kvData.Bool("add_eos_token", true) {
        ctxLen--
    }

    if len(tokens) <= ctxLen {
        return nil, 0, fmt.Errorf("input exceeds maximum context length and cannot be truncated further")
    }
    if ctxLen <= 0 {
        return nil, 0, fmt.Errorf("input after truncation exceeds maximum context length")
    }

    truncatedTokens := tokens[:ctxLen]
    truncated, err := r.Detokenize(ctx, truncatedTokens)
    if err != nil {
        return nil, 0, err
    }
    return r.Embedding(ctx, truncated)
}
Embedding retry logic reaches into tokenizer metadata to decide how to truncate.

Behavior-wise, this is friendly: if the first embedding call fails with a 400 and truncation is allowed, the server tokenizes the text, computes a safe context length (accounting for BOS/EOS), truncates tokens, detokenizes, and retries. Clients don’t need to understand context windows to get a working embedding.

The tradeoff is where this logic lives. To compute ctxLen, the handler reaches into kvData using raw keys such as "tokenizer.ggml.bos_token_id" and flags like "add_bos_token". That’s tight coupling between the HTTP layer and the tokenizer’s low-level storage format.

The consequences are predictable:

  • If tokenizer metadata changes shape, EmbedHandler must change too.
  • Any other component that wants “safe truncation” has to either copy this logic or also depend on ggml.KV details.

After computing embeddings, the handler normalizes each vector (L2 norm) and optionally reduces its dimension, then normalizes again. That’s a good example of appropriate responsibility: post-processing stays at the gateway, while the LLM runtime focuses on producing raw embeddings.

Running the Gateway in Production

Beyond request flows, the same file encodes several operational policies: GPU-aware defaults, overload handling, metrics hooks, and remote model delegation. All of these are wired through the gateway abstraction, not bolted on afterward.

GPU-aware defaults

During Serve, the server discovers GPUs, sums their effective VRAM (subtracting configurable overhead), and chooses a default context-length tier:

  • >= 47 GiB → defaultNumCtx = 262144
  • >= 23 GiB → defaultNumCtx = 32768
  • else → defaultNumCtx = 4096

That default flows into modelOptions, then into scheduleRunner, so every request starts from a hardware-aware baseline unless explicitly overridden. The decision is made once at startup and reused everywhere.
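The tiering rule itself is a few lines; this sketch uses the thresholds quoted above, though the function name is mine:

```go
package main

import "fmt"

// defaultNumCtx maps total effective VRAM to a default context length tier.
func defaultNumCtx(vramBytes uint64) int {
	const gib = uint64(1) << 30
	switch {
	case vramBytes >= 47*gib:
		return 262144
	case vramBytes >= 23*gib:
		return 32768
	default:
		return 4096
	}
}

func main() {
	fmt.Println(defaultNumCtx(24 << 30)) // 32768 for a 24 GiB GPU
}
```

Note the thresholds sit just below the marketed card sizes (48 GiB, 24 GiB), which leaves room for the configurable VRAM overhead subtracted earlier.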

Scheduler and overload

Overload is surfaced via scheduler errors like ErrMaxQueue, which handleScheduleError maps into a 503 response. The scheduler owns the opinion about “too many queued requests”; the gateway just turns it into HTTP.

The surrounding comments emphasize the need for metrics such as queue depth and endpoint latency to understand performance under load, for example:

  • Per-endpoint request duration to see which routes degrade first.
  • Per-model token throughput to correlate GPU pressure with slow responses.

Without these, it’s easy to blame “the model” when the real problem is an overloaded queue or insufficient GPU tier for the requested context size.

Local and remote models through one gateway

The gateway also acts as a reverse proxy for remote models. If a model has RemoteHost and RemoteModel set, GenerateHandler and ChatHandler follow a delegation path instead of using the local scheduler:

  • Check global remote-inference status through internalcloud.Status().
  • Parse the remote URL, and enforce that its host is in envconfig.Remotes() to avoid proxying arbitrary destinations.
  • Apply model-level defaults (templates, system prompts, options), rewrite the model name, and stream responses back, patching Model/RemoteModel/RemoteHost fields so clients see consistent metadata.

From the client’s point of view, local and remote models are indistinguishable: they always hit /api/generate or /api/chat and get the same JSON shapes and streaming behavior. From the server’s point of view, it’s one more routing branch inside the gateway.
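The allowlist check in that delegation path is worth sketching, since it's what prevents the gateway from becoming an open proxy. The helper name and exact matching rules below are mine, not the real semantics of envconfig.Remotes():

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostAllowed reports whether a remote URL's host is on the allowlist.
func hostAllowed(rawURL string, allowed []string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	for _, h := range allowed {
		if strings.EqualFold(u.Hostname(), h) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, _ := hostAllowed("https://models.example.com/api/chat", []string{"models.example.com"})
	fmt.Println(ok) // true
}
```

The key point is that the check runs against the parsed hostname, not the raw string, so tricks like userinfo or path segments that merely contain an allowed name don't slip through.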

Specialized error types such as AuthorizationError and StatusError keep HTTP status codes and messages precise, and can optionally carry fields like signin_url to drive client UX.

What to Reuse in Your Own Stack

All of this lives in one big file, which can feel overwhelming, but the core pattern is straightforward: treat your HTTP layer as an AI gateway that orchestrates a model pool, streaming, and advanced interaction modes through a narrow abstraction.

1. Build a model gateway, not a bag of endpoints

  • Hide model loading, capability checks, and queueing behind a facade like scheduleRunner.
  • Keep the scheduler as a separate concern: handlers declare capabilities; the scheduler chooses a worker.

2. Make streaming a shared primitive

  • Centralize NDJSON or SSE handling in helpers like streamResponse.
  • Define once how errors surface in streams versus regular JSON, and reuse that everywhere.

3. Watch for cross-layer leakage

  • If a handler depends on low-level tokenizer keys, introduce a higher-level API around it.
  • Let the gateway orchestrate behavior (like retry-with-truncation), but keep file formats and storage details deeper in the stack.

4. Treat “thinking”, tools, and structure as orchestration

  • Use multi-pass interactions when you need both hidden reasoning and constrained output.
  • Encapsulate that orchestration into reusable components as it grows, instead of expanding a single mega-handler.

5. Encode operational policy into the gateway

  • Derive sane defaults (like context length tiers) from hardware at startup and feed them into all requests.
  • Surface scheduler overload as clear HTTP errors and back it with queue and latency metrics.
  • Unify local and remote model behavior behind one API so clients get a single mental model.

You don’t have to copy Ollama’s architecture, but you do want its core move: a single, opinionated gateway that owns how models are scheduled, how outputs are streamed, and how advanced behaviors are composed. If you get that gateway abstraction right, you can evolve your model pool, templates, and infrastructure without rewriting your entire API surface each time your AI stack grows.

Full Source Code

Here's the full source code of the file that inspired this article.
Read on GitHub

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
