We’re examining how Ansible turns playbooks, inventory, and plugins into a single, coherent automation run. The core of that behavior lives in PlaybookExecutor, the class behind the ansible-playbook command. I'm Mahmoud Zalt, an AI solutions architect, and we’ll walk through how this one orchestrator file shapes safety, performance, and operator experience—often more than the individual modules ever do.
Our focus is one lesson: treat orchestration as a first-class product. We’ll see how batching (serial), failure handling, retries, and callbacks work together, where subtle algorithmic choices start to hurt at scale, and which patterns you can reuse in your own automation systems.
Where PlaybookExecutor Sits in Ansible
To understand why orchestration design matters, it helps to see where PlaybookExecutor lives in the Ansible codebase and what it actually owns.
```
ansible/
  lib/
    ansible/
      executor/
        playbook_executor.py    <-- PlaybookExecutor orchestrates playbooks
        task_queue_manager.py   <-- TaskQueueManager executes tasks per host
      playbook/
        __init__.py             <-- Playbook.load provides Play objects
      utils/
        display.py              <-- Display for user interaction
        helpers.py              <-- pct_to_int for serial batching
        path.py                 <-- makedirs_safe for retry files
      plugins/
        loader.py               <-- connection_loader, shell_loader, become_loader
      _internal/_templating/
        _engine.py              <-- TemplateEngine for vars and prompts
```
Where PlaybookExecutor sits in the Ansible architecture.
Think of PlaybookExecutor as a dispatcher: each playbook is a train, each play is a carriage, and each batch of hosts is a compartment. The dispatcher decides which compartments move when (via serial), and records which ones had issues so you can send a "repair train" later (retry files).
The constructor wires together the collaborators it needs—inventory, variable manager, loader, passwords—and chooses between "planning" modes (list hosts, list tasks, list tags, syntax check) and actual execution:
```python
class PlaybookExecutor:
    """Primary class for executing playbooks behind ansible-playbook."""

    def __init__(self, playbooks, inventory, variable_manager, loader, passwords):
        self._playbooks = playbooks
        self._inventory = inventory
        self._variable_manager = variable_manager
        self._loader = loader
        self.passwords = passwords
        self._unreachable_hosts = dict()

        if (context.CLIARGS.get('listhosts') or
                context.CLIARGS.get('listtasks') or
                context.CLIARGS.get('listtags') or
                context.CLIARGS.get('syntax')):
            self._tqm = None
        else:
            self._tqm = TaskQueueManager(
                inventory=inventory,
                variable_manager=variable_manager,
                loader=loader,
                passwords=self.passwords,
                forks=context.CLIARGS.get('forks'),
            )
```
TaskQueueManager is the assembly line that actually runs tasks on hosts. PlaybookExecutor decides whether to spin it up and, if so, in what shape: how many forks, which hosts per batch, when to stop, and how to surface results.
Serial Batching: Safety vs. Scale
One of the most important policies in any orchestrator is: how many things do we touch at once? In Ansible, that policy is expressed by the `serial` keyword in a play and implemented by `PlaybookExecutor._get_serialized_batches()`.
Serial as a blast-radius control
`serial` lets you say "only work on N hosts at a time" (or a percentage). That’s a classic blast-radius control: if a deployment goes bad, it only breaks the current batch, not the entire fleet.
In code, the executor turns the host list into batches like this:
```python
def _get_serialized_batches(self, play):
    """Return hosts subdivided into batches based on play.serial."""
    all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
    all_hosts_len = len(all_hosts)

    serial_batch_list = play.serial
    if len(serial_batch_list) == 0:
        serial_batch_list = [-1]

    cur_item = 0
    serialized_batches = []

    while len(all_hosts) > 0:
        serial = pct_to_int(serial_batch_list[cur_item], all_hosts_len)

        if serial <= 0:
            serialized_batches.append(all_hosts)
            break
        else:
            play_hosts = []
            for x in range(serial):
                if len(all_hosts) > 0:
                    play_hosts.append(all_hosts.pop(0))
            serialized_batches.append(play_hosts)

        cur_item += 1
        if cur_item > len(serial_batch_list) - 1:
            cur_item = len(serial_batch_list) - 1

    return serialized_batches
```
A few details matter for behavior:
- `play.serial` can be a list (e.g. `[10, 20, "50%"]`), not just a scalar.
- `pct_to_int` converts percentage strings like `"50%"` relative to the total host count.
- `serial <= 0` means "take all remaining hosts in one last batch".
- Once the list of serial values is exhausted, the last value is reused for all remaining batches.
This gives operators a simple, predictable language for rollout patterns while keeping the implementation confined to a single helper.
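To see those rules in action, here is a standalone sketch of the batching semantics (not the real helper — `pct_to_int` is simplified, and hosts are synthetic names) applied to a ten-host inventory with `serial: [1, "30%", -1]`:

```python
def pct_to_int(value, num_items):
    # Simplified stand-in for Ansible's pct_to_int helper.
    if isinstance(value, str) and value.endswith("%"):
        return int(num_items * (int(value[:-1]) / 100.0))
    return int(value)

def serialized_batches(all_hosts, serial_list):
    total = len(all_hosts)
    hosts = list(all_hosts)
    batches, cur = [], 0
    while hosts:
        size = pct_to_int(serial_list[cur], total)
        if size <= 0:                 # -1 means "all remaining hosts"
            batches.append(hosts)
            break
        batches.append(hosts[:size])
        hosts = hosts[size:]
        cur = min(cur + 1, len(serial_list) - 1)  # reuse the last value
    return batches

hosts = ["web%02d" % i for i in range(10)]
print([len(b) for b in serialized_batches(hosts, [1, "30%", -1])])
# → [1, 3, 6]: one canary host, then 30% of the fleet, then everything left
```

This is exactly the canary-style rollout pattern the `serial` list syntax is designed for: probe with one host, widen to a percentage, then finish the fleet.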
The subtle performance trap
The interesting part is not the semantics but the algorithmic cost. The batching loop repeatedly does all_hosts.pop(0). Popping from the front of a Python list is O(n), so doing it for every host turns the whole batching step into O(H²) for H hosts.
On a few hundred hosts, this is fine. On tens of thousands, startup time becomes noticeably dominated by "just preparing work" before any tasks run. That’s easy to miss because the orchestration layer is rarely where people look first for performance issues.
| Aspect | Current behavior | Impact |
|---|---|---|
| Batch semantics | Integers, lists, and percentages via `pct_to_int` | Rich rollout control (staged, canary-like patterns) |
| Implementation detail | Repeated `pop(0)` from a list | O(H²) batching time for large inventories |
| Refactor direction | Index-based slicing (or `deque`) | Same semantics in O(H) time |
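The cost difference is easy to demonstrate in isolation. This standalone sketch (not Ansible code) batches a synthetic host list both ways; the `pop(0)` version shifts every remaining element on each pop, while slicing touches each host once:

```python
import timeit

def batch_pop(hosts, size):
    hosts = list(hosts)
    batches = []
    while hosts:
        batch = []
        for _ in range(size):
            if hosts:
                batch.append(hosts.pop(0))  # O(n) per pop: shifts the whole tail
        batches.append(batch)
    return batches

def batch_slice(hosts, size):
    # Index-based slicing: O(n) overall.
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]

hosts = ["host%d" % i for i in range(20000)]
t_pop = timeit.timeit(lambda: batch_pop(hosts, 10), number=1)
t_slice = timeit.timeit(lambda: batch_slice(hosts, 10), number=1)
print("pop(0): %.3fs  slice: %.3fs" % (t_pop, t_slice))

# Both strategies produce identical batches; only the cost differs.
assert batch_pop(hosts, 10) == batch_slice(hosts, 10)
```

At 20,000 hosts the gap is already visible; the quadratic version only gets worse from there.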
Illustrative linear-time batching refactor
The report suggests refactoring to avoid mutating the list from the front. Conceptually, you switch to index-based slicing while preserving the user-visible behavior:
```python
def _get_serialized_batches(self, play):
    all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
    all_hosts_len = len(all_hosts)

    serial_batch_list = play.serial or [-1]
    cur_item = 0
    serialized_batches = []

    index = 0
    while index < all_hosts_len:
        serial = pct_to_int(serial_batch_list[cur_item], all_hosts_len)

        if serial <= 0:
            serialized_batches.append(all_hosts[index:])
            break
        else:
            next_index = index + serial
            batch = all_hosts[index:next_index]
            if not batch:
                break
            serialized_batches.append(batch)
            index = next_index

        cur_item += 1
        if cur_item > len(serial_batch_list) - 1:
            cur_item = len(serial_batch_list) - 1

    return serialized_batches
```
Nothing about the orchestration contract changes—only the cost of getting there.
Failures, Early Exit, and Retries
Batching defines how we roll out; failure handling defines when we stop and how we recover. PlaybookExecutor encodes these policies in a tight loop over batches plus a small helper for retry files.
Batch-level failure policies
Once batches are computed, the executor restricts the inventory to each batch and calls TaskQueueManager.run(). During that loop, it watches for flags and host counts that tell it to stop early:
```python
self._tqm._unreachable_hosts.update(self._unreachable_hosts)

previously_failed = len(self._tqm._failed_hosts)
previously_unreachable = len(self._tqm._unreachable_hosts)

break_play = False
batches = self._get_serialized_batches(play)
if len(batches) == 0:
    self._tqm.send_callback('v2_playbook_on_play_start', play)
    self._tqm.send_callback('v2_playbook_on_no_hosts_matched')

for batch in batches:
    self._inventory.restrict_to_hosts(batch)
    try:
        result = self._tqm.run(play=play)
    except AnsibleEndPlay as e:
        result = e.result
        break

    if result & self._tqm.RUN_FAILED_BREAK_PLAY != 0:
        result = self._tqm.RUN_FAILED_HOSTS
        break_play = True

    failed_hosts_count = (
        len(self._tqm._failed_hosts) + len(self._tqm._unreachable_hosts)
        - (previously_failed + previously_unreachable)
    )
    if len(batch) == failed_hosts_count:
        break_play = True
        break

    previously_failed += len(self._tqm._failed_hosts) - previously_failed
    previously_unreachable += len(self._tqm._unreachable_hosts) - previously_unreachable

    self._unreachable_hosts.update(self._tqm._unreachable_hosts)

if break_play:
    break
```
The orchestration patterns here are reusable:
- **Failure as protocol, not exceptions:** `TaskQueueManager.run()` returns bit flags like `RUN_FAILED_BREAK_PLAY`. The executor interprets those into higher-level actions (normalize to `RUN_FAILED_HOSTS`, then stop the play). That keeps decision logic in the orchestrator while letting the worker signal intent.
- **Batch-level circuit breaker:** If every host in a batch failed or was unreachable, the executor stops iterating batches. There’s no point in continuing the rollout on a pattern that is clearly broken.
- **Cross-play state:** `self._unreachable_hosts` accumulates unreachable hosts across plays. That state feeds later decisions like retry generation.
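The "failure as protocol" idea is easy to reproduce outside Ansible. In this standalone sketch, the worker only reports status via bit flags and the orchestrator alone turns them into a continue/stop decision (the flag names echo Ansible's, but the values and helpers here are illustrative):

```python
# Worker-level status flags (illustrative values, not Ansible's constants).
RUN_OK = 0
RUN_FAILED_HOSTS = 1 << 1
RUN_FAILED_BREAK_PLAY = 1 << 3

def run_batch(batch):
    """Worker: reports what happened via flags, decides nothing."""
    failed = [h for h in batch if h.endswith("-bad")]
    if failed and len(failed) == len(batch):
        # Every host failed: signal that the play should stop.
        return RUN_FAILED_HOSTS | RUN_FAILED_BREAK_PLAY
    return RUN_FAILED_HOSTS if failed else RUN_OK

def orchestrate(batches):
    """Orchestrator: interprets worker flags into rollout policy."""
    for batch in batches:
        result = run_batch(batch)
        if result & RUN_FAILED_BREAK_PLAY:
            return "aborted"    # policy: whole batch failed, stop the rollout
    return "completed"

print(orchestrate([["a", "b"], ["c-bad", "d"]]))   # → completed (partial failure)
print(orchestrate([["a"], ["x-bad", "y-bad"]]))    # → aborted (whole batch failed)
```

Because the worker never raises for ordinary failures, the policy ("when do we stop?") can evolve in one place without touching execution code.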
Retry files: a tiny feature with big UX impact
Ansible’s retry files are a deceptively small feature: after a run, you get a `.retry` file listing failed and unreachable hosts, which you can feed back via `--limit @file.retry`. In PlaybookExecutor, this is handled by a focused helper:
```python
def _generate_retry_inventory(self, retry_path, replay_hosts):
    """Generate an inventory containing only failed/unreachable hosts."""
    try:
        makedirs_safe(os.path.dirname(retry_path))
        with open(retry_path, 'w') as fd:
            for x in replay_hosts:
                fd.write("%s\n" % x)
    except Exception as e:
        display.warning(
            "Could not create retry file '%s'.\n\t%s" % (retry_path, to_text(e))
        )
        return False

    return True
```
The orchestration logic around it lives in run(), once TaskQueueManager has reported its final host states:
```python
if self._tqm is not None:
    if C.RETRY_FILES_ENABLED:
        retries = set(self._tqm._failed_hosts.keys())
        retries.update(self._tqm._unreachable_hosts.keys())
        retries = sorted(retries)

        if len(retries) > 0:
            if C.RETRY_FILES_SAVE_PATH:
                basedir = C.RETRY_FILES_SAVE_PATH
            elif playbook_path:
                basedir = os.path.dirname(os.path.abspath(playbook_path))
            else:
                basedir = '~/'

            (retry_name, ext) = os.path.splitext(os.path.basename(playbook_path))
            filename = os.path.join(basedir, "%s.retry" % retry_name)
            if self._generate_retry_inventory(filename, retries):
                display.display("\tto retry, use: --limit @%s\n" % filename)
```
A few design choices stand out:
- A feature flag (`C.RETRY_FILES_ENABLED`) and configurable save path keep the core behavior opt-in and environment-aware.
- Failed and unreachable hosts are treated the same for retry purposes—both are "try again later" candidates.
- The orchestrator finishes with a concrete hint: `to retry, use: --limit @file.retry`, turning failure into a guided next step.
Conservative error handling at the edges
The retry helper catches Exception broadly and logs a warning instead of failing the run. For a CLI-oriented tool, that’s a pragmatic tradeoff: a filesystem glitch doesn’t get to break the entire playbook.
In an automation or API setting, you might tighten that up—distinguish PermissionError from other I/O issues, or expose a non-zero status when retry generation is considered part of the contract. The important part is that orchestration code is where those policy decisions live.
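A hedged sketch of what that tightening could look like. The function name mirrors the real helper, but this variant and its policy are hypothetical, not Ansible's actual behavior:

```python
import os
import tempfile

def generate_retry_inventory(retry_path, replay_hosts):
    """Stricter sketch of a retry-file writer: permission problems surface
    to the caller, other I/O errors stay warnings (hypothetical policy)."""
    try:
        os.makedirs(os.path.dirname(retry_path), exist_ok=True)
        with open(retry_path, "w") as fd:
            for host in replay_hosts:
                fd.write("%s\n" % host)
    except PermissionError:
        # Policy choice: in an API/automation setting, retry generation is
        # part of the contract, so re-raise instead of downgrading.
        raise
    except OSError as e:
        print("warning: could not create retry file '%s': %s" % (retry_path, e))
        return False
    return True

path = os.path.join(tempfile.mkdtemp(), "retries", "site.retry")
generate_retry_inventory(path, ["db1.example.com", "web2.example.com"])
print(open(path).read())
```

The point is not this particular policy, but that the orchestrator is the right place to make it explicit.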
Callbacks and Observability
Beyond control flow, PlaybookExecutor also defines how runs are made observable. It doesn’t log or print for every event directly; instead it emits callback events that other components can subscribe to.
Observer pattern in practice
Throughout execution, the executor sends events like:
- `v2_playbook_on_start`
- `v2_playbook_on_play_start`
- `v2_playbook_on_no_hosts_matched`
- `v2_playbook_on_vars_prompt`
- `v2_playbook_on_stats`
Different callback plugins can then render these as human-readable output, JSON logs, or metrics. The executor itself stays focused on sequencing and policy, not on presentation.
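The underlying observer pattern is straightforward to reproduce. In this standalone sketch (event names are modeled on Ansible's; the classes themselves are illustrative), the orchestrator only emits events and each subscriber decides how to render them:

```python
class CallbackBase:
    """Default no-op handlers, so plugins override only what they need."""
    def v2_playbook_on_play_start(self, play): pass
    def v2_playbook_on_stats(self, stats): pass

class StdoutCallback(CallbackBase):
    def v2_playbook_on_play_start(self, play):
        print("PLAY [%s]" % play)

class MetricsCallback(CallbackBase):
    def __init__(self):
        self.plays_started = 0
    def v2_playbook_on_play_start(self, play):
        self.plays_started += 1

class Orchestrator:
    def __init__(self, callbacks):
        self._callbacks = callbacks

    def send_callback(self, method_name, *args):
        # Fan each event out to every subscriber that implements it.
        for cb in self._callbacks:
            getattr(cb, method_name, lambda *a: None)(*args)

    def run(self, plays):
        for play in plays:
            self.send_callback("v2_playbook_on_play_start", play)

metrics = MetricsCallback()
Orchestrator([StdoutCallback(), metrics]).run(["web tier", "db tier"])
print(metrics.plays_started)  # → 2
```

Adding a JSON logger or a metrics exporter means writing one more subscriber, never touching the orchestrator.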
What to measure in an orchestrator
The report suggests a set of metrics that make this behavior visible in real deployments. Three are especially useful when you treat orchestration as a product:
- **Playbook duration:** a gauge like `playbook_executor.play_duration_seconds` for each run, which includes orchestration overhead as well as remote execution. Tracking p95 against an SLO gives you a clear sense of when runs become too slow for teams.
- **Batches per play:** a counter such as `playbook_executor.batches_per_play`. This shows whether `serial` is tuned sensibly (few huge batches versus many tiny ones) and how rollout patterns change over time.
- **Retry pressure:** a metric like `playbook_executor.retry_file_hosts_count`, counting hosts that end up in retry files. Persistent high ratios indicate systemic problems rather than random flakiness.
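A minimal sketch of wiring those three metrics around a play. The metric names follow the report's suggestions; the `Metrics` class is a stand-in for whatever statsd/Prometheus client you actually use, and `run_play_with_metrics` is a hypothetical wrapper, not Ansible code:

```python
import time

class Metrics:
    """Stand-in for a real statsd/Prometheus client."""
    def __init__(self):
        self.values = {}
    def gauge(self, name, value):
        self.values[name] = value
    def incr(self, name, amount=1):
        self.values[name] = self.values.get(name, 0) + amount

def run_play_with_metrics(metrics, batches, failed_hosts):
    start = time.monotonic()
    for batch in batches:
        metrics.incr("playbook_executor.batches_per_play")
        # ... run the batch here ...
    metrics.gauge("playbook_executor.play_duration_seconds",
                  time.monotonic() - start)
    metrics.gauge("playbook_executor.retry_file_hosts_count",
                  len(failed_hosts))

m = Metrics()
run_play_with_metrics(m, batches=[["a"], ["b", "c"]], failed_hosts=["b"])
print(m.values["playbook_executor.batches_per_play"])  # → 2
```

Because the measurement points sit in the orchestrator, the metrics capture the whole run, including batching and retry overhead, not just per-task execution.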
Practical Patterns to Reuse
Stepping back from Ansible specifics, PlaybookExecutor is a compact example of why orchestration deserves deliberate design. The class doesn’t execute modules itself; it encodes policies that define how safe, observable, and usable the whole system feels.
1. Treat orchestration as a first-class product
Design and review the orchestrator with the same care you’d give any user-facing service. Decisions about batching, stopping conditions, retries, and prompts directly shape the operator’s experience and failure modes.
2. Use simple semantics backed by focused helpers
Features like `serial` and retry files have simple, predictable semantics at the playbook level and are implemented by small helpers such as `_get_serialized_batches()` and `_generate_retry_inventory()`. That keeps policies easy to reason about and localizes complexity.
3. Watch the cost of "preparing work"
The quadratic batching behavior is a reminder that orchestration code can become a bottleneck at scale. Anywhere you transform large host lists, queues, or shards, treat performance as a first-class concern and prefer linear-time algorithms when behavior allows.
4. Separate worker results from orchestration policy
Let your worker layer return a small set of status flags. Let your orchestrator decide what those mean: continue, break the batch, break the run, or generate retries. That separation makes it easier to evolve policies without rewriting low-level execution code.
5. Make observability pluggable via callbacks
By emitting callback events instead of hard-coding logs, PlaybookExecutor allows different environments to attach their own UX and monitoring behavior. Adopting the same observer-style pattern in your orchestrator keeps it adaptable as your tooling evolves.
If you approach your own automation systems with the mindset that "orchestration is the product", you naturally start to ask better questions: How do we limit blast radius? How do we know when to stop? How do we help people recover? PlaybookExecutor offers concrete answers to all three—and a set of patterns you can carry into your next executor design.