
When Orchestration Becomes the Product

When does coordination logic stop being just glue and start being what users actually feel? “When Orchestration Becomes the Product” digs into that shift.

Code Cracking
20m read
#orchestration #engineering #devtools


We’re examining how Ansible turns playbooks, inventory, and plugins into a single, coherent automation run. The core of that behavior lives in PlaybookExecutor, the class behind the ansible-playbook command. I'm Mahmoud Zalt, an AI solutions architect, and we’ll walk through how this one orchestrator file shapes safety, performance, and operator experience—often more than the individual modules ever do.

Our focus is one lesson: treat orchestration as a first-class product. We’ll see how batching (serial), failure handling, retries, and callbacks work together, where subtle algorithmic choices start to hurt at scale, and which patterns you can reuse in your own automation systems.

Where PlaybookExecutor Sits in Ansible

To understand why orchestration design matters, it helps to see where PlaybookExecutor lives in the Ansible codebase and what it actually owns.

ansible/
  lib/
    ansible/
      executor/
        playbook_executor.py  <-- PlaybookExecutor orchestrates playbooks
        task_queue_manager.py <-- TaskQueueManager executes tasks per host
      playbook/
        __init__.py           <-- Playbook.load provides Play objects
      utils/
        display.py            <-- Display for user interaction
        helpers.py            <-- pct_to_int for serial batching
        path.py               <-- makedirs_safe for retry files
      plugins/
        loader.py             <-- connection_loader, shell_loader, become_loader
      _internal/_templating/
        _engine.py            <-- TemplateEngine for vars and prompts
Where PlaybookExecutor sits in the Ansible architecture.

Think of PlaybookExecutor as a dispatcher: each playbook is a train, each play is a carriage, and each batch of hosts is a compartment. The dispatcher decides which compartments move when (via serial), and records which ones had issues so you can send a "repair train" later (retry files).

The constructor wires together the collaborators it needs—inventory, variable manager, loader, passwords—and chooses between "planning" modes (list hosts, list tasks, list tags, syntax check) and actual execution:

class PlaybookExecutor:
    """Primary class for executing playbooks behind ansible-playbook."""

    def __init__(self, playbooks, inventory, variable_manager, loader, passwords):
        self._playbooks = playbooks
        self._inventory = inventory
        self._variable_manager = variable_manager
        self._loader = loader
        self.passwords = passwords
        self._unreachable_hosts = dict()

        if (context.CLIARGS.get('listhosts') or
                context.CLIARGS.get('listtasks') or
                context.CLIARGS.get('listtags') or
                context.CLIARGS.get('syntax')):
            self._tqm = None
        else:
            self._tqm = TaskQueueManager(
                inventory=inventory,
                variable_manager=variable_manager,
                loader=loader,
                passwords=self.passwords,
                forks=context.CLIARGS.get('forks'),
            )

TaskQueueManager is the assembly line that actually runs tasks on hosts. PlaybookExecutor decides whether to spin it up and, if so, in what shape: how many forks, which hosts per batch, when to stop, and how to surface results.

Serial Batching: Safety vs. Scale

One of the most important policies in any orchestrator is: How many things do we touch at once? In Ansible, that policy is expressed by the serial keyword in a play and implemented by PlaybookExecutor._get_serialized_batches().

Serial as a blast-radius control

serial lets you say "only work on N hosts at a time" (or a percentage). That’s a classic blast-radius control: if a deployment goes bad, it only breaks the current batch, not the entire fleet.

In code, the executor turns the host list into batches like this:

def _get_serialized_batches(self, play):
    """Return hosts subdivided into batches based on play.serial."""

    all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
    all_hosts_len = len(all_hosts)

    serial_batch_list = play.serial
    if len(serial_batch_list) == 0:
        serial_batch_list = [-1]

    cur_item = 0
    serialized_batches = []

    while len(all_hosts) > 0:
        serial = pct_to_int(serial_batch_list[cur_item], all_hosts_len)

        if serial <= 0:
            serialized_batches.append(all_hosts)
            break
        else:
            play_hosts = []
            for x in range(serial):
                if len(all_hosts) > 0:
                    play_hosts.append(all_hosts.pop(0))

            serialized_batches.append(play_hosts)

        cur_item += 1
        if cur_item > len(serial_batch_list) - 1:
            cur_item = len(serial_batch_list) - 1

    return serialized_batches

A few details matter for behavior:

  • play.serial can be a list (e.g. [10, 20, "50%"]), not just a scalar.
  • pct_to_int converts percentage strings like "50%" relative to the total host count.
  • serial <= 0 means "take all remaining hosts in one last batch".
  • Once the list of serial values is exhausted, the last value is reused for all remaining batches.

This gives operators a simple, predictable language for rollout patterns while keeping the implementation confined to a single helper.
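To make the percentage handling concrete, here is a simplified stand-in for pct_to_int (an illustrative sketch of the behavior described above, not the exact upstream implementation):

```python
def pct_to_int(value, num_items, min_value=1):
    """Convert '30%'-style values to a host count; pass integers through."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value.rstrip("%"))
        # A percentage never rounds down to zero hosts: fall back to min_value.
        return int(pct / 100.0 * num_items) or min_value
    return int(value)

print(pct_to_int("50%", 10))  # 5
print(pct_to_int("1%", 10))   # rounds to 0, so min_value (1) is used
print(pct_to_int(3, 10))      # 3
```

With serial: [10, 20, "50%"] on 40 hosts, this yields batches of 10, 20, and then 20 hosts, with the final value reused until the inventory is exhausted.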

The subtle performance trap

The interesting part is not the semantics but the algorithmic cost. The batching loop repeatedly does all_hosts.pop(0). Popping from the front of a Python list is O(n), so doing it for every host turns the whole batching step into O(H²) for H hosts.

On a few hundred hosts, this is fine. On tens of thousands, startup time becomes noticeably dominated by "just preparing work" before any tasks run. That’s easy to miss because the orchestration layer is rarely where people look first for performance issues.

Aspect                | Current behavior                                | Impact
Batch semantics       | Integers, lists, and percentages via pct_to_int | Rich rollout control (staged, canary-like patterns)
Implementation detail | Repeated pop(0) from a list                     | O(H²) batching time for large inventories
Refactor direction    | Index-based slicing (or deque)                  | Same semantics in O(H) time
Illustrative linear-time batching refactor

The fix is to stop mutating the list from the front. Conceptually, you switch to index-based slicing while preserving the user-visible behavior:

def _get_serialized_batches(self, play):
    all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
    all_hosts_len = len(all_hosts)

    serial_batch_list = play.serial or [-1]

    cur_item = 0
    serialized_batches = []
    index = 0

    while index < all_hosts_len:
        serial = pct_to_int(serial_batch_list[cur_item], all_hosts_len)

        if serial <= 0:
            serialized_batches.append(all_hosts[index:])
            break
        else:
            next_index = index + serial
            batch = all_hosts[index:next_index]
            if not batch:
                break
            serialized_batches.append(batch)
            index = next_index

        cur_item += 1
        if cur_item > len(serial_batch_list) - 1:
            cur_item = len(serial_batch_list) - 1

    return serialized_batches

Nothing about the orchestration contract changes—only the cost of getting there.
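An alternative, if you prefer to keep the pop-from-front style, is collections.deque, whose popleft() is O(1). A minimal sketch (batches_deque is a hypothetical helper; it ignores the percentage and reuse-last-value semantics for brevity):

```python
from collections import deque

def batches_deque(hosts, sizes):
    """Linear-time batching that keeps the pop-from-front style via deque."""
    remaining = deque(hosts)  # popleft() is O(1), unlike list.pop(0)
    batches = []
    for size in sizes:
        if not remaining:
            break
        batches.append([remaining.popleft()
                        for _ in range(min(size, len(remaining)))])
    return batches

print(batches_deque(list(range(5)), [2, 2, 2]))  # [[0, 1], [2, 3], [4]]
```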

Failures, Early Exit, and Retries

Batching defines how we roll out; failure handling defines when we stop and how we recover. PlaybookExecutor encodes these policies in a tight loop over batches plus a small helper for retry files.

Batch-level failure policies

Once batches are computed, the executor restricts the inventory to each batch and calls TaskQueueManager.run(). During that loop, it watches for flags and host counts that tell it to stop early:

self._tqm._unreachable_hosts.update(self._unreachable_hosts)

previously_failed = len(self._tqm._failed_hosts)
previously_unreachable = len(self._tqm._unreachable_hosts)

break_play = False
batches = self._get_serialized_batches(play)
if len(batches) == 0:
    self._tqm.send_callback('v2_playbook_on_play_start', play)
    self._tqm.send_callback('v2_playbook_on_no_hosts_matched')

for batch in batches:
    self._inventory.restrict_to_hosts(batch)
    try:
        result = self._tqm.run(play=play)
    except AnsibleEndPlay as e:
        result = e.result
        break

    if result & self._tqm.RUN_FAILED_BREAK_PLAY != 0:
        result = self._tqm.RUN_FAILED_HOSTS
        break_play = True

    failed_hosts_count = (
        len(self._tqm._failed_hosts) + len(self._tqm._unreachable_hosts)
        - (previously_failed + previously_unreachable)
    )

    if len(batch) == failed_hosts_count:
        break_play = True
        break

    previously_failed += len(self._tqm._failed_hosts) - previously_failed
    previously_unreachable += len(self._tqm._unreachable_hosts) - previously_unreachable
    self._unreachable_hosts.update(self._tqm._unreachable_hosts)

if break_play:
    break

The orchestration patterns here are reusable:

  • Failure as protocol, not exceptions: TaskQueueManager.run() returns bit flags like RUN_FAILED_BREAK_PLAY. The executor interprets those into higher-level actions (normalize to RUN_FAILED_HOSTS, then stop the play). That keeps decision logic in the orchestrator while letting the worker signal intent.
  • Batch-level circuit breaker: If every host in a batch failed or was unreachable, the executor stops iterating batches. There’s no point in continuing the rollout on a pattern that is clearly broken.
  • Cross-play state: self._unreachable_hosts accumulates unreachable hosts across plays. That state feeds later decisions like retry generation.
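The flag-based protocol is easy to lift into your own orchestrator. A minimal sketch, with illustrative flag names loosely modeled on TaskQueueManager's constants (the values and the run_batches helper are assumptions for the example, not Ansible's API):

```python
# Illustrative status flags; the worker signals, the orchestrator decides.
RUN_OK = 0
RUN_FAILED_HOSTS = 2
RUN_FAILED_BREAK_PLAY = 8

def run_batches(batches, run_batch):
    """run_batch(batch) returns bit flags; returns the overall play result."""
    for batch in batches:
        result = run_batch(batch)
        if result & RUN_FAILED_BREAK_PLAY:
            # Normalize the worker's signal into a play-level outcome and stop.
            return RUN_FAILED_HOSTS
    return RUN_OK
```

Because the worker only sets flags, you can change the stopping policy (continue, break the batch, break the run) entirely inside the orchestrator.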

Retry files: a tiny feature with big UX impact

Ansible’s retry files are a deceptively small feature: after a run, you get a .retry file listing failed and unreachable hosts, which you can feed back via --limit @file.retry. In PlaybookExecutor, this is handled by a focused helper:

def _generate_retry_inventory(self, retry_path, replay_hosts):
    """Generate an inventory containing only failed/unreachable hosts."""
    try:
        makedirs_safe(os.path.dirname(retry_path))
        with open(retry_path, 'w') as fd:
            for x in replay_hosts:
                fd.write("%s\n" % x)
    except Exception as e:
        display.warning(
            "Could not create retry file '%s'.\n\t%s" % (retry_path, to_text(e))
        )
        return False

    return True

The orchestration logic around it lives in run(), once TaskQueueManager has reported its final host states:

if self._tqm is not None:
    if C.RETRY_FILES_ENABLED:
        retries = set(self._tqm._failed_hosts.keys())
        retries.update(self._tqm._unreachable_hosts.keys())
        retries = sorted(retries)
        if len(retries) > 0:
            if C.RETRY_FILES_SAVE_PATH:
                basedir = C.RETRY_FILES_SAVE_PATH
            elif playbook_path:
                basedir = os.path.dirname(os.path.abspath(playbook_path))
            else:
                basedir = '~/'

            (retry_name, ext) = os.path.splitext(os.path.basename(playbook_path))
            filename = os.path.join(basedir, "%s.retry" % retry_name)
            if self._generate_retry_inventory(filename, retries):
                display.display("\tto retry, use: --limit @%s\n" % filename)

A few design choices stand out:

  • A feature flag (C.RETRY_FILES_ENABLED) and configurable save path keep the core behavior opt-in and environment-aware.
  • Failed and unreachable hosts are treated the same for retry purposes—both are "try again later" candidates.
  • The orchestrator finishes with a concrete hint: to retry, use: --limit @file.retry, turning failure into a guided next step.

Conservative error handling at the edges

The retry helper catches Exception broadly and logs a warning instead of failing the run. For a CLI-oriented tool, that’s a pragmatic tradeoff: a filesystem glitch doesn’t get to break the entire playbook.

In an automation or API setting, you might tighten that up—distinguish PermissionError from other I/O issues, or expose a non-zero status when retry generation is considered part of the contract. The important part is that orchestration code is where those policy decisions live.
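As a sketch of that tightening, here is a hypothetical variant (not Ansible's code) that distinguishes permission problems from other I/O failures and can escalate them when retry generation is part of the contract:

```python
import logging
import os

def generate_retry_inventory(retry_path, replay_hosts, strict=False):
    """Write one host per line; escalate permission errors when strict."""
    try:
        os.makedirs(os.path.dirname(retry_path) or ".", exist_ok=True)
        with open(retry_path, "w") as fd:
            fd.writelines("%s\n" % host for host in replay_hosts)
    except PermissionError:
        if strict:
            raise  # retry generation is part of the contract: fail loudly
        logging.warning("No permission to write retry file %r", retry_path)
        return False
    except OSError as e:
        logging.warning("Could not create retry file %r: %s", retry_path, e)
        return False
    return True
```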

Callbacks and Observability

Beyond control flow, PlaybookExecutor also defines how runs are made observable. It doesn’t log or print for every event directly; instead it emits callback events that other components can subscribe to.

Observer pattern in practice

Throughout execution, the executor sends events like:

  • v2_playbook_on_start
  • v2_playbook_on_play_start
  • v2_playbook_on_no_hosts_matched
  • v2_playbook_on_vars_prompt
  • v2_playbook_on_stats

Different callback plugins can then render these as human-readable output, JSON logs, or metrics. The executor itself stays focused on sequencing and policy, not on presentation.
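The same observer pattern is easy to reproduce outside Ansible. A minimal sketch of a pluggable callback hub (CallbackHub and StdoutPlugin are illustrative names, loosely modeled on how send_callback dispatches to plugins):

```python
class CallbackHub:
    """Dispatches named events to every registered plugin."""
    def __init__(self):
        self._plugins = []

    def register(self, plugin):
        self._plugins.append(plugin)

    def send(self, event, *args, **kwargs):
        # Each plugin opts into events by defining a matching method.
        for plugin in self._plugins:
            handler = getattr(plugin, event, None)
            if callable(handler):
                handler(*args, **kwargs)

class StdoutPlugin:
    """Example subscriber that records play-start events."""
    def __init__(self):
        self.events = []

    def v2_playbook_on_play_start(self, play_name):
        self.events.append("PLAY [%s]" % play_name)

hub = CallbackHub()
plugin = StdoutPlugin()
hub.register(plugin)
hub.send("v2_playbook_on_play_start", "webservers")
hub.send("v2_playbook_on_stats", {})  # ignored: no handler defined
```

The orchestrator only knows event names; presentation lives entirely in the plugins.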

What to measure in an orchestrator

A handful of metrics make this behavior visible in real deployments. Three are especially useful when you treat orchestration as a product:

  • Playbook duration: a gauge like playbook_executor.play_duration_seconds for each run, which includes orchestration overhead as well as remote execution. Tracking p95 against an SLO gives you a clear sense of when runs become too slow for teams.
  • Batches per play: a counter such as playbook_executor.batches_per_play. This shows whether serial is tuned sensibly (few huge batches versus many tiny ones) and how rollout patterns change over time.
  • Retry pressure: a metric like playbook_executor.retry_file_hosts_count, counting hosts that end up in retry files. Persistent high ratios indicate systemic problems rather than random flakiness.
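A sketch of wiring such metrics into a batch loop, with an in-memory recorder standing in for a real metrics client (all names here are illustrative, matching the metric names suggested above):

```python
import time

class Metrics:
    """In-memory stand-in for a metrics client (statsd, Prometheus, ...)."""
    def __init__(self):
        self.values = {}

    def record(self, name, value):
        self.values.setdefault(name, []).append(value)

def run_play_with_metrics(batches, run_batch, metrics):
    """run_batch(batch) returns the hosts that failed in that batch."""
    start = time.monotonic()
    retry_hosts = []
    for batch in batches:
        retry_hosts.extend(run_batch(batch))
    metrics.record("playbook_executor.play_duration_seconds",
                   time.monotonic() - start)
    metrics.record("playbook_executor.batches_per_play", len(batches))
    metrics.record("playbook_executor.retry_file_hosts_count", len(retry_hosts))
    return retry_hosts
```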

Practical Patterns to Reuse

Stepping back from Ansible specifics, PlaybookExecutor is a compact example of why orchestration deserves deliberate design. The class doesn’t execute modules itself; it encodes policies that define how safe, observable, and usable the whole system feels.

1. Treat orchestration as a first-class product

Design and review the orchestrator with the same care you’d give any user-facing service. Decisions about batching, stopping conditions, retries, and prompts directly shape the operator’s experience and failure modes.

2. Use simple semantics backed by focused helpers

Features like serial and retry files have simple, predictable semantics at the playbook level and are implemented by small helpers such as _get_serialized_batches() and _generate_retry_inventory(). That keeps policies easy to reason about and localizes complexity.

3. Watch the cost of "preparing work"

The quadratic batching behavior is a reminder that orchestration code can become a bottleneck at scale. Anywhere you transform large host lists, queues, or shards, treat performance as a first-class concern and prefer linear-time algorithms when behavior allows.

4. Separate worker results from orchestration policy

Let your worker layer return a small set of status flags. Let your orchestrator decide what those mean: continue, break the batch, break the run, or generate retries. That separation makes it easier to evolve policies without rewriting low-level execution code.

5. Make observability pluggable via callbacks

By emitting callback events instead of hard-coding logs, PlaybookExecutor allows different environments to attach their own UX and monitoring behavior. Adopting the same observer-style pattern in your orchestrator keeps it adaptable as your tooling evolves.

If you approach your own automation systems with the mindset that "orchestration is the product", you naturally start to ask better questions: How do we limit blast radius? How do we know when to stop? How do we help people recover? PlaybookExecutor offers concrete answers to all three—and a set of patterns you can carry into your next executor design.

Full Source Code

Direct source from the upstream repository:

lib/ansible/executor/playbook_executor.py

ansible/ansible • devel


Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.

