We’re examining how Ollama turns a single Go file, server/routes.go, into the main gateway for local and remote AI models. Ollama is a local AI runtime that lets you run, manage, and interact with LLMs through a simple HTTP API, while hiding most of the GPU and model-runtime complexity. I’m Mahmoud Zalt, an AI solutions architect, and we’ll look at how this “god file” orchestrates models, streaming, and advanced behaviors like thinking and tools — and how to design your own gateway so it scales without collapsing under its own complexity.
The Gateway: From HTTP to Model Runner
The file server/routes.go looks like a pile of handlers at first, but it’s really an entrance hall. Every request comes in, gets classified, and is forwarded to the right “room” – text generation, chat, embeddings, model management, or remote delegation – all funneled through a shared gateway to the model pool.
server/
  routes.go      <-- HTTP API layer & entrypoint
  scheduler.go   (not shown) -- manages model runners
model/
  ...            -- model configs, manifests
llm/
  ...            -- low-level model runtime
Request Flow (simplified):
[HTTP Client]
|
v
[net/http.Server] --(Serve)--> [Gin Router]
| |
| +----------+----------+
| | |
v v v
/api/generate /api/chat /api/embed, /api/tags, ...
| | |
v v v
[GenerateHandler] [ChatHandler] [Other Handlers]
| |
+-------+--------+
v
scheduleRunner
|
v
[Scheduler]
|
v
[llm.LlamaServer]
|
v
Streamed Completion/Embedding
|
v
streamResponse / JSON
|
v
[HTTP Client]
The high-level pattern is consistent:
- Serve bootstraps everything: logging, manifest pruning, GPU discovery, scheduler initialization, and net/http startup.
- (*Server) GenerateRoutes wires all HTTP paths (native, OpenAI-compatible, Anthropic-compatible) to handlers via Gin.
- Each handler translates HTTP JSON into internal API structs, then asks the scheduler for a suitable runner via scheduleRunner.
- The runner is an llm.LlamaServer instance that performs the actual token generation, chat, or embeddings work.
The central design idea is to hide the “model pool” behind a small, explicit gateway. The HTTP layer can grow large, but it talks to models through one narrow interface, which is what keeps the complexity survivable.
The heart of that gateway is scheduleRunner. It validates the model name, checks capabilities (completion, tools, images, thinking, etc.), merges model defaults with request options, and then consults the scheduler for a runner:
// scheduleRunner schedules a runner after validating inputs.
func (s *Server) scheduleRunner(
ctx context.Context,
name string,
caps []model.Capability,
requestOpts map[string]any,
keepAlive *api.Duration,
) (llm.LlamaServer, *Model, *api.Options, error) {
if name == "" {
return nil, nil, nil, fmt.Errorf("model %w", errRequired)
}
model, err := GetModel(name)
if err != nil {
return nil, nil, nil, err
}
if slices.Contains(model.Config.ModelFamilies, "mllama") && len(model.ProjectorPaths) > 0 {
return nil, nil, nil, fmt.Errorf("'llama3.2-vision' is no longer compatible ...")
}
if err := model.CheckCapabilities(caps...); err != nil {
return nil, nil, nil, fmt.Errorf("%s %w", name, err)
}
opts, err := s.modelOptions(model, requestOpts)
if err != nil {
return nil, nil, nil, err
}
runnerCh, errCh := s.sched.GetRunner(ctx, model, opts, keepAlive)
var runner *runnerRef
select {
case runner = <-runnerCh:
case err = <-errCh:
return nil, nil, nil, err
}
return runner.llama, model, &opts, nil
}
scheduleRunner decouples HTTP concerns from GPU and model-pool concerns.
This is a classic facade: handlers like GenerateHandler, ChatHandler, and EmbedHandler all say “give me a runner that can do X” and never think about GPU counts, cached models, or queueing policies.
One Streaming Primitive for Everything
Once a runner starts emitting tokens or events, the gateway’s job is to move them to clients efficiently and consistently. Ollama uses NDJSON streaming (newline-delimited JSON) as the single primitive for partial results.
Across generation, chat, and model pull/push, the pattern is the same:
- A runner or background job sends values into a chan any.
- The handler either aggregates them (non-streaming) or hands the channel to streamResponse for streaming.
func streamResponse(c *gin.Context, ch chan any) {
c.Header("Content-Type", "application/x-ndjson")
c.Stream(func(w io.Writer) bool {
val, ok := <-ch
if !ok {
return false
}
// Special case: error objects
if h, ok := val.(gin.H); ok {
if e, ok := h["error"].(string); ok {
status, ok := h["status"].(int)
if !ok {
status = http.StatusInternalServerError
}
if !c.Writer.Written() {
c.Header("Content-Type", "application/json")
c.JSON(status, gin.H{"error": e})
} else {
_ = json.NewEncoder(c.Writer).
Encode(gin.H{"error": e})
}
return false
}
}
bts, err := json.Marshal(val)
if err != nil {
slog.Info("streamResponse: json.Marshal failed", "error", err)
return false
}
bts = append(bts, '\n')
if _, err := w.Write(bts); err != nil {
slog.Info("streamResponse: w.Write failed", "error", err)
return false
}
return true
})
}
streamResponse centralizes NDJSON streaming and error semantics.
Errors are handled in two phases:
- If an error arrives before anything is written, the helper switches to a normal JSON error body with an appropriate status code.
- If content has already been streamed, it cannot change the HTTP status line, so it emits a final JSON object with an error field as the last NDJSON line and ends the stream.
This cleanly separates transport-level failure (HTTP status + headers) from stream-level failure (an error event at the end of the stream). Clients can adopt a simple rule: read lines until EOF, and if the last line carries an error field, treat the whole operation as failed.
| Scenario | What client sees | How it's signaled |
|---|---|---|
| Validation error (e.g., bad JSON) | Single JSON object with error | 400/422 with JSON body |
| Model error before first token | Single JSON object with error | Status set by streamResponse |
| Error mid-stream | Several normal chunks, then {"error": ...} | Last NDJSON item, HTTP 200 |
Layering Thinking, Tools, and Structure
Up to this point the gateway looks like a conventional controller layer: handlers in, scheduler out. It gets more interesting in ChatHandler, where the gateway orchestrates thinking, tools, and structured outputs on top of raw model completions.
You can think of the LLM as an actor on stage. The handler assembles the script (prompt), the scheduler picks which actor performs, and clients watch via the stream. On top of that, the gateway plays director by attaching parsers that interpret lines as thoughts, tool calls, or JSON output.
The chat pipeline roughly does this:
- Merge model-level messages and system prompt with request messages.
- Optionally enable “thinking” mode for models that emit internal thoughts inside special tags.
- Attach tools and a tool parser if the request includes tool definitions.
- Optionally enforce structured outputs, so the final answer must match JSON or a schema.
Thinking and structured outputs conflict by default: thinking is free-form text between tags; structured outputs want strict, machine-parseable shapes. The file resolves this with a two-phase interaction:
- First completion: let the model think freely without format constraints.
- Second completion: once thinking is captured, restart with structured outputs enabled, using the previous thinking as part of the conversation history.
type structuredOutputsState int
const (
structuredOutputsState_None structuredOutputsState = iota
structuredOutputsState_ReadyToApply
structuredOutputsState_Applying
)
ch := make(chan any)
go func() {
defer close(ch)
structuredOutputsState := structuredOutputsState_None
for {
var tb strings.Builder
currentFormat := req.Format
// First pass: disable structured outputs when thinking is active.
if req.Format != nil && structuredOutputsState == structuredOutputsState_None &&
((builtinParser != nil || thinkingState != nil) &&
slices.Contains(m.Capabilities(), model.CapabilityThinking)) {
currentFormat = nil
}
ctx, cancel := context.WithCancel(c.Request.Context())
err := r.Completion(ctx, llm.CompletionRequest{/* ... */}, func(r llm.CompletionResponse) {
res := api.ChatResponse{/* ... */}
if builtinParser != nil {
content, thinking, toolCalls, err := builtinParser.Add(r.Content, r.Done)
if err != nil {
ch <- gin.H{"error": err.Error()}
return
}
res.Message.Content = content
res.Message.Thinking = thinking
// ... tool handling omitted
tb.WriteString(thinking)
if structuredOutputsState == structuredOutputsState_None &&
req.Format != nil && tb.String() != "" && res.Message.Content != "" {
structuredOutputsState = structuredOutputsState_ReadyToApply
cancel() // stop first pass, move to structured output pass
return
}
ch <- res
return
}
if thinkingState != nil {
thinkingContent, remainingContent :=
thinkingState.AddContent(res.Message.Content)
// ... similar transition logic ...
_ = remainingContent
_ = thinkingContent
}
ch <- res
})
if err != nil {
if structuredOutputsState == structuredOutputsState_ReadyToApply &&
strings.Contains(err.Error(), "context canceled") &&
c.Request.Context().Err() == nil {
// Expected cancellation when switching passes.
} else {
ch <- gin.H{"error": err.Error()}
return
}
}
if structuredOutputsState == structuredOutputsState_ReadyToApply {
structuredOutputsState = structuredOutputsState_Applying
msg := api.Message{
Role: "assistant",
Thinking: tb.String(),
}
msgs = append(msgs, msg)
prompt, _, err = chatPrompt(/* now with thinking baked in */)
if err != nil {
ch <- gin.H{"error": err.Error()}
return
}
if shouldUseHarmony(m) || (builtinParser != nil && m.Config.Parser == "harmony") {
prompt += "<|end|><|start|>assistant<|channel|>final<|message|>"
}
continue // run second pass with structured outputs
}
break
}
}()
This logic forces ChatHandler to understand several deep concerns:
- Model capabilities such as CapabilityThinking and CapabilityTools.
- Template tokens like harmony's <|start|> / <|end|>.
- Parser state machines (built-in parser vs. generic thinking parser vs. tools parser).
- The difference between “intentional” cancellation (to switch passes) and real errors.
Embeddings and Where Coupling Leaks
Embeddings look straightforward compared to chat: text in, vector out. But the embedding path in routes.go hides an important lesson about cross-layer coupling.
EmbedHandler accepts flexible input (string or array), schedules a runner, and runs embeddings in parallel via errgroup. The interesting part is the retry logic when the model rejects input for exceeding the context window:
embedWithRetry := func(text string) ([]float32, int, error) {
emb, tokCount, err := r.Embedding(ctx, text)
if err == nil {
return emb, tokCount, nil
}
var serr api.StatusError
if !errors.As(err, &serr) || serr.StatusCode != http.StatusBadRequest {
return nil, 0, err
}
if req.Truncate != nil && !*req.Truncate {
return nil, 0, err
}
tokens, err := r.Tokenize(ctx, text)
if err != nil {
return nil, 0, err
}
ctxLen := min(opts.NumCtx, int(kvData.ContextLength()))
if bos := kvData.Uint("tokenizer.ggml.bos_token_id"); len(tokens) > 0 &&
tokens[0] != int(bos) && kvData.Bool("add_bos_token", true) {
ctxLen--
}
if eos := kvData.Uint("tokenizer.ggml.eos_token_id"); len(tokens) > 0 &&
tokens[len(tokens)-1] != int(eos) && kvData.Bool("add_eos_token", true) {
ctxLen--
}
if len(tokens) <= ctxLen {
return nil, 0, fmt.Errorf("input exceeds maximum context length and cannot be truncated further")
}
if ctxLen <= 0 {
return nil, 0, fmt.Errorf("input after truncation exceeds maximum context length")
}
truncatedTokens := tokens[:ctxLen]
truncated, err := r.Detokenize(ctx, truncatedTokens)
if err != nil {
return nil, 0, err
}
return r.Embedding(ctx, truncated)
}
Behavior-wise, this is friendly: if the first embedding call fails with a 400 and truncation is allowed, the server tokenizes the text, computes a safe context length (accounting for BOS/EOS), truncates tokens, detokenizes, and retries. Clients don’t need to understand context windows to get a working embedding.
The tradeoff is where this logic lives. To compute ctxLen, the handler reaches into kvData using raw keys such as "tokenizer.ggml.bos_token_id" and flags like "add_bos_token". That’s tight coupling between the HTTP layer and the tokenizer’s low-level storage format.
The consequences are predictable:
- If tokenizer metadata changes shape, EmbedHandler must change too.
- Any other component that wants “safe truncation” has to either copy this logic or also depend on ggml.KV details.
After computing embeddings, the handler normalizes each vector (L2 norm) and optionally reduces its dimension, then normalizes again. That’s a good example of appropriate responsibility: post-processing stays at the gateway, while the LLM runtime focuses on producing raw embeddings.
Running the Gateway in Production
Beyond request flows, the same file encodes several operational policies: GPU-aware defaults, overload handling, metrics hooks, and remote model delegation. All of these are wired through the gateway abstraction, not bolted on afterward.
GPU-aware defaults
During Serve, the server discovers GPUs, sums their effective VRAM (subtracting configurable overhead), and chooses a default context-length tier:
- >= 47 GiB → defaultNumCtx = 262144
- >= 23 GiB → defaultNumCtx = 32768
- else → defaultNumCtx = 4096
That default flows into modelOptions, then into scheduleRunner, so every request starts from a hardware-aware baseline unless explicitly overridden. The decision is made once at startup and reused everywhere.
Scheduler and overload
Overload is surfaced via scheduler errors like ErrMaxQueue, which handleScheduleError maps into a 503 response. The scheduler owns the opinion about “too many queued requests”; the gateway just turns it into HTTP.
The surrounding comments emphasize the need for metrics such as queue depth and endpoint latency to understand performance under load, for example:
- Per-endpoint request duration to see which routes degrade first.
- Per-model token throughput to correlate GPU pressure with slow responses.
Without these, it’s easy to blame “the model” when the real problem is an overloaded queue or insufficient GPU tier for the requested context size.
Local and remote models through one gateway
The gateway also acts as a reverse proxy for remote models. If a model has RemoteHost and RemoteModel set, GenerateHandler and ChatHandler follow a delegation path instead of using the local scheduler:
- Check global remote-inference status through the internal cloud.Status() call.
- Parse the remote URL, and enforce that its host is in envconfig.Remotes() to avoid proxying arbitrary destinations.
- Apply model-level defaults (templates, system prompts, options), rewrite the model name, and stream responses back, patching Model / RemoteModel / RemoteHost fields so clients see consistent metadata.
From the client’s point of view, local and remote models are indistinguishable: they always hit /api/generate or /api/chat and get the same JSON shapes and streaming behavior. From the server’s point of view, it’s one more routing branch inside the gateway.
Specialized error types such as AuthorizationError and StatusError keep HTTP status codes and messages precise, and can optionally carry fields like signin_url to drive client UX.
What to Reuse in Your Own Stack
All of this lives in one big file, which can feel overwhelming, but the core pattern is straightforward: treat your HTTP layer as an AI gateway that orchestrates a model pool, streaming, and advanced interaction modes through a narrow abstraction.
1. Build a model gateway, not a bag of endpoints
- Hide model loading, capability checks, and queueing behind a facade like scheduleRunner.
- Keep the scheduler as a separate concern: handlers declare capabilities; the scheduler chooses a worker.
2. Make streaming a shared primitive
- Centralize NDJSON or SSE handling in helpers like streamResponse.
- Define once how errors surface in streams versus regular JSON, and reuse that everywhere.
3. Watch for cross-layer leakage
- If a handler depends on low-level tokenizer keys, introduce a higher-level API around it.
- Let the gateway orchestrate behavior (like retry-with-truncation), but keep file formats and storage details deeper in the stack.
4. Treat “thinking”, tools, and structure as orchestration
- Use multi-pass interactions when you need both hidden reasoning and constrained output.
- Encapsulate that orchestration into reusable components as it grows, instead of expanding a single mega-handler.
5. Encode operational policy into the gateway
- Derive sane defaults (like context length tiers) from hardware at startup and feed them into all requests.
- Surface scheduler overload as clear HTTP errors and back it with queue and latency metrics.
- Unify local and remote model behavior behind one API so clients get a single mental model.
You don’t have to copy Ollama’s architecture, but you do want its core move: a single, opinionated gateway that owns how models are scheduled, how outputs are streamed, and how advanced behaviors are composed. If you get that gateway abstraction right, you can evolve your model pool, templates, and infrastructure without rewriting your entire API surface each time your AI stack grows.