
Zalt Blog

Deep Dives into Code & Architecture


Daemon Orchestration at Container Scale

By Mahmoud Zalt
Code Cracking
25m read

Most teams focus on container runtimes, not the control plane. Daemon Orchestration at Container Scale digs into how the daemon actually keeps fleets in line.



We’re examining how Docker Engine coordinates startup, restore, networking, and shutdown through its central control point: daemon/daemon.go. Docker Engine runs and manages containers on a host; this file is where container metadata, storage, networking, plugins, and the runtime all converge. I’m Mahmoud Zalt, an AI solutions architect, and we’ll unpack how this daemon “control tower” keeps a stateful system reliable at container scale—and where its design starts to strain.

By the end, you’ll see one core lesson: treat lifecycle orchestration—boot, restore, and shutdown—as a first‑class design problem, with bounded concurrency, clear phases, and disciplined tear‑down. We’ll use Docker’s daemon as a concrete case study of patterns you can reuse in your own systems.

The Daemon as a Control Tower

A useful mental model for Docker’s Daemon is an airport control tower. It doesn’t run containers itself, but it knows about every runway (networks), gate (volumes), airplane (containers), warehouse (images), and fuel truck (plugins and runtimes). This file coordinates who can start, stop, connect, and how to bring the whole airport up and down safely.

moby/moby
└── daemon/
    ├── daemon.go              # Orchestrates daemon lifecycle, containers, images, networking
    ├── config/                # Daemon configuration types and validation
    ├── container/             # Container metadata and runtime abstractions
    ├── containerd/            # Containerd image service integration
    ├── internal/
    │   ├── image/             # Internal image model and storage
    │   ├── layer/             # Layer store and graphdriver integration
    │   ├── libcontainerd/     # Containerd client wrapper for containers
    │   ├── metrics/           # Metrics registration utilities
    │   └── distribution/      # Distribution metadata store
    ├── libnetwork/            # Networking and IPAM controller
    ├── volume/                # Volume service and drivers
    ├── internal/nri/          # NRI integration
    └── server/
        └── backend/           # HTTP API server backends using Daemon
Figure 1: Where daemon.go sits in the Docker Engine.

At the center is a Daemon struct that acts as a facade over many subsystems:

type Daemon struct {
    id                string
    repository        string
    containers        container.Store
    containersReplica *container.ViewDB
    execCommands      *container.ExecStore
    imageService      ImageService
    configStore       atomic.Pointer[configStore]
    statsCollector    *stats.Collector
    registryService   *registry.Service
    EventsService     *events.Events
    netController     *libnetwork.Controller
    volumes           *volumesservice.VolumesService
    // ... many more fields ...
    usesSnapshotter bool
}
Figure 2: The daemon as a facade over containers, images, networking, and more.

This facade framing is important. daemon.go is mostly orchestration: it wires and orders subsystems rather than implementing low‑level logic. That’s exactly what makes lifecycle code here both powerful and easy to break.

Bounded Startup and Restore

With the control‑tower role in mind, the next question is: how does the daemon wake up on a host with hundreds or thousands of containers without overwhelming the machine? The answer is a bounded, phase‑based startup path: NewDaemon → loadContainers → restore.

Bounded parallelism when loading containers

On startup, the daemon must scan all containers on disk. Sequential loading would be too slow; full parallelism risks exhausting OS limits (file descriptors, CPU, IO). Docker uses a worker pool controlled by a weighted semaphore and a dynamic parallelism cap:

func (daemon *Daemon) loadContainers(ctx context.Context) (map[string]map[string]*container.Container, error) {
    var mapLock sync.Mutex
    driverContainers := make(map[string]map[string]*container.Container)

    dir, err := os.ReadDir(daemon.repository)
    if err != nil {
        return nil, err
    }

    parallelLimit := adjustParallelLimit(len(dir), 128*runtime.NumCPU())
    var group sync.WaitGroup
    sem := semaphore.NewWeighted(int64(parallelLimit))

    for _, v := range dir {
        id := v.Name()
        group.Go(func() {
            _ = sem.Acquire(context.WithoutCancel(ctx), 1)
            defer sem.Release(1)

            c, err := daemon.load(id)
            if err != nil {
                // log and skip
                return
            }

            mapLock.Lock()
            if containers, ok := driverContainers[c.Driver]; !ok {
                driverContainers[c.Driver] = map[string]*container.Container{c.ID: c}
            } else {
                containers[c.ID] = c
            }
            mapLock.Unlock()
        })
    }
    group.Wait()

    return driverContainers, nil
}
Figure 3: Bounded parallelism when loading containers from disk.

The semaphore ensures at most parallelLimit loads are in flight. adjustParallelLimit tunes that number based on container count and CPU cores, while respecting OS constraints to avoid EMFILE and similar failures. The core pattern is: parallelize aggressively but under explicit back‑pressure, especially during bootstrap.
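The same pattern can be reproduced with nothing beyond the standard library. Here is a minimal sketch that uses a buffered channel as a counting semaphore in place of x/sync's weighted semaphore; boundedLoad and loadOne are illustrative names, not moby's API:

```go
package main

import (
	"fmt"
	"sync"
)

// loadOne is a hypothetical stand-in for a per-item load such as daemon.load.
func loadOne(id int) int { return id * id }

// boundedLoad runs loadOne for every id with at most limit goroutines in
// flight, mirroring the semaphore-gated loop in loadContainers.
func boundedLoad(ids []int, limit int) map[int]int {
	var (
		mu      sync.Mutex
		results = make(map[int]int, len(ids))
		wg      sync.WaitGroup
		sem     = make(chan struct{}, limit) // buffered channel as counting semaphore
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire: blocks once limit workers are busy
			defer func() { <-sem }() // release

			v := loadOne(id)

			mu.Lock()
			results[id] = v
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results
}

func main() {
	res := boundedLoad([]int{1, 2, 3, 4, 5}, 2)
	fmt.Println(len(res), res[4]) // 5 16
}
```

The channel capacity plays the role of parallelLimit: once limit sends have succeeded, further acquires block until a worker releases, which is exactly the back-pressure the daemon relies on during bootstrap.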

Restore as a phased city restart

Loading metadata is only half the story. The restore function takes the containers discovered on disk and brings the system back to a coherent, running state. It does this in ordered phases, more like restoring a city district by district than flipping every switch at once.

Phase 1: Attach and register containers

The first pass over containers attaches runtime state and registers everything in in‑memory stores, again under bounded parallelism. Key responsibilities include:

  • Reattaching read‑write layers so containers can be mounted.
  • Reconstructing basic state (running, paused) for observability.
  • Registering names and container objects in the daemon’s stores.
  • Dropping or quarantining containers that cannot be registered cleanly, while keeping them removable.
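The register-or-quarantine decision in that first pass can be sketched independently of Docker's types. Everything below (Registry, Container, the name-conflict rule) is illustrative, not moby's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Container is a minimal illustrative stand-in for Docker's container object.
type Container struct {
	ID   string
	Name string
}

// Registry mimics the phase-1 behavior: containers that register cleanly
// become fully managed; the rest are quarantined so they remain visible and
// removable but are never started or wired into networking.
type Registry struct {
	managed     map[string]*Container
	quarantined map[string]*Container
	names       map[string]string // name -> ID, to detect conflicts
}

func NewRegistry() *Registry {
	return &Registry{
		managed:     map[string]*Container{},
		quarantined: map[string]*Container{},
		names:       map[string]string{},
	}
}

func (r *Registry) Register(c *Container) error {
	if _, taken := r.names[c.Name]; taken {
		// Cannot register cleanly: quarantine instead of dropping on the floor.
		r.quarantined[c.ID] = c
		return errors.New("name conflict: " + c.Name)
	}
	r.names[c.Name] = c.ID
	r.managed[c.ID] = c
	return nil
}

func main() {
	r := NewRegistry()
	_ = r.Register(&Container{ID: "c1", Name: "web"})
	_ = r.Register(&Container{ID: "c2", Name: "web"}) // conflict -> quarantined
	fmt.Println(len(r.managed), len(r.quarantined))   // 1 1
}
```

The important property is that registration failure is not data loss: a quarantined container can still be listed and deleted by the operator.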

Phase 2: Reconcile daemon state with containerd

The second pass is where restore becomes subtle. For each container, the daemon queries containerd, reconciles health and task status, and corrects mismatches between its own c.State and what is actually running.

Two views of “alive” must be reconciled:

  • Daemon state: what the Daemon remembers from disk (c.State).
  • Runtime state: what containerd reports about tasks and processes.

When they disagree, restore tears down orphaned tasks, fixes container state on disk, and updates health checks and restart managers. This reconciliation is why a daemon restart typically feels seamless from the outside.
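A stripped-down version of that reconciliation loop, with plain maps standing in for the daemon's saved state and containerd's reported state (the function and its return values are illustrative, not moby code):

```go
package main

import "fmt"

// reconcile compares what the daemon remembers (saved) with what the runtime
// reports (running) and returns the corrective actions, in the spirit of the
// daemon/containerd reconciliation during restore.
func reconcile(saved, running map[string]bool) (markStopped, tearDown []string) {
	for id, wasRunning := range saved {
		if wasRunning && !running[id] {
			// Daemon thinks it is alive, runtime disagrees: fix state on disk.
			markStopped = append(markStopped, id)
		}
	}
	for id := range running {
		if _, known := saved[id]; !known {
			// Runtime has a task the daemon does not know about: orphan, tear down.
			tearDown = append(tearDown, id)
		}
	}
	return markStopped, tearDown
}

func main() {
	saved := map[string]bool{"a": true, "b": false, "c": true}
	running := map[string]bool{"c": true, "x": true}
	stopped, orphans := reconcile(saved, running)
	fmt.Println(stopped, orphans) // [a] [x]
}
```

The real implementation does far more per container (health checks, restart managers, on-disk state), but the shape is the same: diff two views of "alive" and emit corrections in both directions.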

Phase 3: Rebuild networking and restart policies

After state is reconciled and BaseFS paths are validated via temporary Mount/Unmount, restore determines:

  • Which containers are eligible for auto‑restart, respecting restart policies and excluding Swarm containers until the cluster is ready.
  • Which AutoRemove containers are safe to clean up.
  • Which sandboxes are active so the network controller can account for existing namespaces.

Only then does the daemon initialize networking with knowledge of active sandboxes, repair port mappings, restore legacy links, and finally restart containers that should come back online.

The order of these phases is doing real work: attach and register → reconcile runtime state → rebuild networking and restarts. If you start containers before reconciling or before networking is stable, you get subtle bugs, flapping health checks, and hard‑to‑diagnose race conditions.

Shutdown Discipline and Timeouts

A control tower that starts well but shuts down unpredictably is still a liability. Docker’s daemon is explicit about shutdown semantics: it computes honest timeouts based on container configuration and tears down subsystems in a specific, dependency‑aware order. It also supports a “live restore” mode, where the daemon exits but containers keep running.

Computing a truthful shutdown timeout

The daemon exposes ShutdownTimeout(), which delegates to a helper that walks all containers and derives a safe bound from their individual stop timeouts:

func (daemon *Daemon) ShutdownTimeout() int {
    return daemon.shutdownTimeout(&daemon.config().Config)
}

func (daemon *Daemon) shutdownTimeout(cfg *config.Config) int {
    shutdownTimeout := cfg.ShutdownTimeout
    if shutdownTimeout < 0 {
        return -1
    }
    if daemon.containers == nil {
        return shutdownTimeout
    }

    graceTimeout := 5
    for _, c := range daemon.containers.List() {
        stopTimeout := c.StopTimeout()
        if stopTimeout < 0 {
            return -1
        }
        if stopTimeout+graceTimeout > shutdownTimeout {
            shutdownTimeout = stopTimeout + graceTimeout
        }
    }
    return shutdownTimeout
}
Figure 4: Deriving the daemon shutdown timeout from container stop timeouts.

Two rules fall out of this:

  1. If any container is configured with an infinite stop timeout (-1), the daemon’s shutdown timeout becomes infinite.
  2. Otherwise, the daemon uses the maximum per‑container timeout plus a small grace period.

That keeps behavior aligned with operator intent: if a critical container must never be killed forcefully, the daemon waits as long as needed. If all containers have finite timeouts, the daemon chooses a bound that is actually sufficient to stop them cleanly.
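The rule is easy to check in isolation. A standalone distillation of the max-plus-grace computation, keeping the 5-second grace period and the -1-as-infinite convention from the code above (the function name is mine, not moby's):

```go
package main

import "fmt"

// effectiveShutdownTimeout distills the rule from shutdownTimeout:
// -1 anywhere means wait forever; otherwise take the largest per-container
// stop timeout plus a fixed grace period, floored at the configured value.
func effectiveShutdownTimeout(configured int, stopTimeouts []int) int {
	if configured < 0 {
		return -1
	}
	const grace = 5
	timeout := configured
	for _, st := range stopTimeouts {
		if st < 0 {
			return -1 // one infinite container makes the whole shutdown unbounded
		}
		if st+grace > timeout {
			timeout = st + grace
		}
	}
	return timeout
}

func main() {
	fmt.Println(effectiveShutdownTimeout(15, []int{10, 30, 45})) // 50 (45 + 5 grace)
	fmt.Println(effectiveShutdownTimeout(15, []int{10, -1}))     // -1: wait forever
}
```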

Orderly shutdown and live restore

The Shutdown method applies those rules and encodes a strict shutdown order. A key decision point is whether live restore is enabled and whether there are running containers.

func (daemon *Daemon) Shutdown(ctx context.Context) error {
    daemon.shutdown = true

    cfg := &daemon.config().Config
    if cfg.LiveRestoreEnabled && daemon.containers != nil {
        if ls, err := daemon.Containers(ctx, &backend.ContainerListOptions{}); len(ls) != 0 || err != nil {
            metrics.CleanupPlugin(daemon.PluginStore)
            return err
        }
    }

    if daemon.containers != nil {
        daemon.containers.ApplyAll(func(c *container.Container) {
            if !c.State.IsRunning() {
                return
            }
            if err := daemon.shutdownContainer(c); err != nil {
                return
            }
            if mountid, err := daemon.imageService.GetLayerMountID(c.ID); err == nil {
                daemon.cleanupMountsByID(mountid)
            }
        })
    }

    if daemon.volumes != nil { _ = daemon.volumes.Shutdown() }
    if daemon.imageService != nil { _ = daemon.imageService.Cleanup() }
    if daemon.clusterProvider != nil { daemon.DaemonLeavesCluster() }
    metrics.CleanupPlugin(daemon.PluginStore)
    daemon.pluginShutdown()
    if daemon.nri != nil { daemon.nri.Shutdown(ctx) }
    if daemon.netController != nil { daemon.netController.Stop() }
    if daemon.containerdClient != nil { daemon.containerdClient.Close() }
    if daemon.mdDB != nil { daemon.mdDB.Close() }
    if daemon.EventsService != nil { daemon.EventsService.Close() }

    return daemon.cleanupMounts(cfg)
}
Figure 5: High‑level shutdown flow and ordering.

When live restore is on and containers are running, the daemon mostly backs away, leaving containers alive with mounts and networking intact. Otherwise, shutdown proceeds as follows:

  • Stop running containers, then clean up their mounts.
  • Shut down volumes and image services.
  • Leave the cluster, then shut down plugins and NRI.
  • Stop networking, then close containerd and metadata DB.
  • Close the events service and finally clean up any remaining mounts.

This mostly mirrors initialization in reverse. That pattern isn’t cosmetic—it avoids resource leaks (e.g., open namespaces), broken plugins, and user‑visible errors from tearing down dependencies out of order.
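One way to guarantee "initialization in reverse" without hand-maintaining a mirror-image Shutdown method is to record a cleanup as each subsystem comes up and run them LIFO. This is a general pattern sketch, not how daemon.go is structured today:

```go
package main

import "fmt"

// teardownStack collects cleanup functions during startup and runs them in
// reverse order on shutdown, so dependents always come down before their
// dependencies.
type teardownStack struct {
	fns []func()
}

// Defer registers a named cleanup; call it right after the matching
// subsystem initializes successfully.
func (t *teardownStack) Defer(name string, fn func()) {
	t.fns = append(t.fns, func() {
		fmt.Println("shutting down:", name)
		fn()
	})
}

// Shutdown runs cleanups LIFO: the reverse of initialization order.
func (t *teardownStack) Shutdown() {
	for i := len(t.fns) - 1; i >= 0; i-- {
		t.fns[i]()
	}
}

func main() {
	var td teardownStack
	td.Defer("network controller", func() {})
	td.Defer("volume service", func() {})
	td.Defer("container runtime", func() {})
	td.Shutdown()
	// Prints: container runtime, then volume service, then network controller.
}
```

The payoff is that adding a new subsystem cannot silently break shutdown ordering: the order falls out of initialization rather than being duplicated by hand.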

Networking Defaults That Scale

Lifecycle orchestration isn’t only about processes; it also includes how defaults behave under scale. The daemon’s approach to networking configuration is a quiet but important example: it aims to “just work” even when operators provide no explicit IPAM settings, while remaining safe in large deployments.

Deriving stable IPv6 ULA pools

When there are no user‑supplied IPv6 address pools, the daemon derives a private IPv6 ULA (Unique Local Address) prefix from a host identifier and uses that as a default address pool. It combines general network options with this derived pool:

func (daemon *Daemon) networkOptions(conf *config.Config, pg plugingetter.PluginGetter, hostID string, activeSandboxes map[string]any) ([]nwconfig.Option, error) {
    options := []nwconfig.Option{
        nwconfig.OptionDataDir(filepath.Join(conf.Root, config.LibnetDataPath)),
        nwconfig.OptionExecRoot(conf.GetExecRoot()),
        nwconfig.OptionDefaultDriver(network.DefaultNetwork),
        nwconfig.OptionDefaultNetwork(network.DefaultNetwork),
        nwconfig.OptionNetworkControlPlaneMTU(conf.NetworkControlPlaneMTU),
        nwconfig.OptionFirewallBackend(conf.FirewallBackend),
    }

    options = append(options, networkPlatformOptions(conf)...)

    defaultAddressPools := ipamutils.GetLocalScopeDefaultNetworks()
    if len(conf.NetworkConfig.DefaultAddressPools.Value()) > 0 {
        defaultAddressPools = conf.NetworkConfig.DefaultAddressPools.Value()
    }

    if !slices.ContainsFunc(defaultAddressPools, func(nw *ipamutils.NetworkToSplit) bool {
        return nw.Base.Addr().Is6() && !nw.Base.Addr().Is4In6()
    }) {
        defaultAddressPools = append(defaultAddressPools, deriveULABaseNetwork(hostID))
    }
    options = append(options, nwconfig.OptionDefaultAddressPoolConfig(defaultAddressPools))

    if conf.LiveRestoreEnabled && len(activeSandboxes) != 0 {
        options = append(options, nwconfig.OptionActiveSandboxes(activeSandboxes))
    }
    if pg != nil {
        options = append(options, nwconfig.OptionPluginGetter(pg))
    }

    return options, nil
}
Figure 6: Building network options with a derived IPv6 default pool.

The helper that derives the IPv6 base network is compact but deliberate:

func deriveULABaseNetwork(hostID string) *ipamutils.NetworkToSplit {
    sha := sha256.Sum256([]byte(hostID))
    gid := binary.BigEndian.Uint64(sha[:]) & (1<<40 - 1)
    addr := ipbits.Add(netip.MustParseAddr("fd00::"), gid, 80)

    return &ipamutils.NetworkToSplit{
        Base: netip.PrefixFrom(addr, 48),
        Size: 64,
    }
}
Figure 7: Host‑specific, deterministic IPv6 ULA derivation.

It hashes a host‑specific ID, keeps 40 bits, and adds that to fd00:: to get a /48 prefix. Each host gets a deterministic, private IPv6 block without extra config. From a lifecycle perspective, this means networking “just works” during startup and restore without coordination, and it behaves predictably as fleets grow.
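The derivation can be reproduced with only the standard library, since adding a 40-bit value shifted left by 80 bits simply fills bytes 1 through 5 of the 16-byte address. This sketch replaces the ipbits helper with manual byte copying; ulaPrefix is my name for it:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"net/netip"
)

// ulaPrefix derives a deterministic /48 inside fd00::/8 from a host ID,
// mirroring deriveULABaseNetwork without the ipbits dependency.
func ulaPrefix(hostID string) netip.Prefix {
	sum := sha256.Sum256([]byte(hostID))
	gid := binary.BigEndian.Uint64(sum[:8]) & (1<<40 - 1) // keep low 40 bits

	addr := netip.MustParseAddr("fd00::").As16()
	// gid << 80 occupies bits 80..119 of the 128-bit address,
	// which is bytes 1..5 counting from the most significant byte.
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], gid)
	copy(addr[1:6], buf[3:8]) // low 5 bytes of gid into address bytes 1..5

	return netip.PrefixFrom(netip.AddrFrom16(addr), 48)
}

func main() {
	p := ulaPrefix("host-1234")
	fmt.Println(p.Bits(), p.Addr().Is6()) // 48 true
	// The same host ID always yields the same prefix.
	fmt.Println(p == ulaPrefix("host-1234")) // true
}
```

Running it twice on the same host ID returns the same /48, which is the whole point: no coordination service, no state file, just a stable function of identity.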

Hard Lessons from a Giant Constructor

The same file that shows strong lifecycle patterns also demonstrates what happens when a system grows organically for years. The NewDaemon constructor has become a large, multi‑responsibility method that tries to do everything at once: validate config, manage filesystem state, connect to containerd, choose between graphdriver and snapshotter, migrate images, initialize plugins, volumes, networking, metrics, NRI, and finally restore containers.

  • Size: roughly 260 SLoC with cyclomatic complexity around 35, which makes the function hard to understand as a whole and risky to modify.
  • Responsibilities: config, filesystem, security, containerd, images, migration, plugins, volumes, networking, restore, and metrics, which violates the single-responsibility principle.
  • Testing: heavy external dependencies (containerd, disk, network) mean behavior can only be covered by integration tests; unit testing is difficult.

The code review explicitly flags this as a “large, multi‑responsibility constructor” smell. The suggested direction is to extract distinct phases into helpers such as initImageService or restoreSingleContainer. That would turn NewDaemon into a clearer orchestration shell instead of a monolith of interleaved concerns.

For example, image service initialization and migration logic could be pulled into one function that hides graphdriver vs snapshotter decisions and migration thresholds behind a clean interface. Today, those details are tangled with container loading and containerd client setup, which makes failures during startup harder to reason about.

A small but telling security wart

One specific issue reinforces how easy it is for lifecycle code to leak too much information. When snapshotter migration is enabled with a zero threshold, the daemon logs all environment variables via os.Environ(). That’s useful for debugging, but an obvious risk for secrets.

The recommended change is minimal: log only the specific variable and its parsed value instead of the entire environment. It’s a good reminder that lifecycle and migration paths often touch configuration and environment, and you need to be deliberate about what you expose to logs.

Practical Takeaways

Stepping back from the details, daemon/daemon.go is a worked example of how to orchestrate a complex, stateful system at scale. The primary lesson is to treat lifecycle orchestration—startup, restore, shutdown, and defaults—as a first‑class design problem, not “just wiring”. Docker’s daemon shows both the benefits of taking this seriously and the costs when complexity accumulates.

Patterns to apply in your own systems

  • Use a facade for orchestration, not for logic. Let your main service struct coordinate subsystems (storage, networking, runtime, plugins), but keep substantial logic in those subsystems. When it grows unwieldy, extract dedicated managers.
  • Bound concurrency during bootstrap and restore. Use semaphores or equivalent to cap parallel work, and derive limits from both workload size and platform constraints. It’s the difference between a fast startup and bringing a machine to its knees.
  • Restore state in explicit phases. Separate “read and register”, “reconcile with reality”, and “rebuild dependents like networking and restart policies”. Avoid starting anything user‑visible before reconciliation is complete.
  • Make shutdown behavior explicit and dependency‑aware. Compute effective timeouts from per‑unit configuration and shut things down in reverse initialization order. Offer modes like live restore only when you can clearly define their semantics.
  • Choose smart, scalable defaults. The derived IPv6 ULA pool is a good model: remove configuration friction while staying safe and predictable at scale.
  • Keep constructors as orchestration scripts. When a constructor starts handling migrations, environment parsing, and multiple backend choices inline, factor those into testable phases and helpers.

If you design your service’s lifecycle with the same care Docker’s daemon applies to containers—bounded startup, phased restore, disciplined shutdown, and thoughtful defaults—you’ll get a system that can grow with your workloads without becoming opaque. The control tower may be complex, but its behavior will stay understandable and reliable over years, not just releases.

Full Source Code

The full source code of the file that inspired this article is available on GitHub.
Read on GitHub

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
