We’re examining how Ansible turns playbooks, inventory, and plugins into a single, coherent automation run. The core of that behavior lives in PlaybookExecutor, the class behind the ansible-playbook command. I'm Mahmoud Zalt, an AI solutions architect, and we’ll walk through how this one orchestrator file shapes safety, performance, and operator experience—often more than the individual modules ever do.
Our focus is one lesson: treat orchestration as a first-class product. We’ll see how batching (serial), failure handling, retries, and callbacks work together, where subtle algorithmic choices start to hurt at scale, and which patterns you can reuse in your own automation systems.
Where PlaybookExecutor Sits in Ansible
To understand why orchestration design matters, it helps to see where PlaybookExecutor lives in the Ansible codebase and what it actually owns.
```
ansible/
  lib/
    ansible/
      executor/
        playbook_executor.py    <-- PlaybookExecutor orchestrates playbooks
        task_queue_manager.py   <-- TaskQueueManager executes tasks per host
      playbook/
        __init__.py             <-- Playbook.load provides Play objects
      utils/
        display.py              <-- Display for user interaction
        helpers.py              <-- pct_to_int for serial batching
        path.py                 <-- makedirs_safe for retry files
      plugins/
        loader.py               <-- connection_loader, shell_loader, become_loader
      _internal/_templating/
        _engine.py              <-- TemplateEngine for vars and prompts
```
Where PlaybookExecutor sits in the Ansible architecture.
Think of PlaybookExecutor as a dispatcher: each playbook is a train, each play is a carriage, and each batch of hosts is a compartment. The dispatcher decides which compartments move when (via serial), and records which ones had issues so you can send a "repair train" later (retry files).
The constructor wires together the collaborators it needs—inventory, variable manager, loader, passwords—and chooses between "planning" modes (list hosts, list tasks, list tags, syntax check) and actual execution:
```python
class PlaybookExecutor:
    """Primary class for executing playbooks behind ansible-playbook."""

    def __init__(self, playbooks, inventory, variable_manager, loader, passwords):
        self._playbooks = playbooks
        self._inventory = inventory
        self._variable_manager = variable_manager
        self._loader = loader
        self.passwords = passwords
        self._unreachable_hosts = dict()

        if (context.CLIARGS.get('listhosts') or
                context.CLIARGS.get('listtasks') or
                context.CLIARGS.get('listtags') or
                context.CLIARGS.get('syntax')):
            self._tqm = None
        else:
            self._tqm = TaskQueueManager(
                inventory=inventory,
                variable_manager=variable_manager,
                loader=loader,
                passwords=self.passwords,
                forks=context.CLIARGS.get('forks'),
            )
```
TaskQueueManager is the assembly line that actually runs tasks on hosts. PlaybookExecutor decides whether to spin it up and, if so, in what shape: how many forks, which hosts per batch, when to stop, and how to surface results.
Serial Batching: Safety vs. Scale
One of the most important policies in any orchestrator is: how many things do we touch at once? In Ansible, that policy is expressed by the `serial` keyword in a play and implemented by `PlaybookExecutor._get_serialized_batches()`.
Serial as a blast-radius control
`serial` lets you say "only work on N hosts at a time" (or a percentage). That’s a classic blast-radius control: if a deployment goes bad, it only breaks the current batch, not the entire fleet.
In code, the executor turns the host list into batches like this:
```python
def _get_serialized_batches(self, play):
    """Return hosts subdivided into batches based on play.serial."""
    all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
    all_hosts_len = len(all_hosts)

    serial_batch_list = play.serial
    if len(serial_batch_list) == 0:
        serial_batch_list = [-1]

    cur_item = 0
    serialized_batches = []

    while len(all_hosts) > 0:
        serial = pct_to_int(serial_batch_list[cur_item], all_hosts_len)

        if serial <= 0:
            serialized_batches.append(all_hosts)
            break
        else:
            play_hosts = []
            for x in range(serial):
                if len(all_hosts) > 0:
                    play_hosts.append(all_hosts.pop(0))
            serialized_batches.append(play_hosts)

        cur_item += 1
        if cur_item > len(serial_batch_list) - 1:
            cur_item = len(serial_batch_list) - 1

    return serialized_batches
```
A few details matter for behavior:
- `play.serial` can be a list (e.g. `[10, 20, "50%"]`), not just a scalar.
- `pct_to_int` converts percentage strings like `"50%"` relative to the total host count.
- `serial <= 0` means "take all remaining hosts in one last batch".
- Once the list of serial values is exhausted, the last value is reused for all remaining batches.
This gives operators a simple, predictable language for rollout patterns while keeping the implementation confined to a single helper.
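To see those rules in action, here is a standalone sketch of the batching semantics (not the real helper — `pct_to_int` is simplified, and hosts are synthetic names) applied to a ten-host inventory with `serial: [1, "30%", -1]`:

```python
def pct_to_int(value, num_items):
    # Simplified stand-in for Ansible's pct_to_int helper.
    if isinstance(value, str) and value.endswith("%"):
        return int(num_items * (int(value[:-1]) / 100.0))
    return int(value)

def serialized_batches(all_hosts, serial_list):
    total = len(all_hosts)
    hosts = list(all_hosts)
    batches, cur = [], 0
    while hosts:
        size = pct_to_int(serial_list[cur], total)
        if size <= 0:                 # -1 means "all remaining hosts"
            batches.append(hosts)
            break
        batches.append(hosts[:size])
        hosts = hosts[size:]
        cur = min(cur + 1, len(serial_list) - 1)  # reuse the last value
    return batches

hosts = ["web%02d" % i for i in range(10)]
print([len(b) for b in serialized_batches(hosts, [1, "30%", -1])])
# → [1, 3, 6]: one canary host, then 30% of the fleet, then everything left
```

This is exactly the canary-style rollout pattern the `serial` list syntax is designed for: probe with one host, widen to a percentage, then finish the fleet.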
The subtle performance trap
The interesting part is not the semantics but the algorithmic cost. The batching loop repeatedly does all_hosts.pop(0). Popping from the front of a Python list is O(n), so doing it for every host turns the whole batching step into O(H²) for H hosts.
On a few hundred hosts, this is fine. On tens of thousands, startup time becomes noticeably dominated by "just preparing work" before any tasks run. That’s easy to miss because the orchestration layer is rarely where people look first for performance issues.
| Aspect | Current behavior | Impact |
|---|---|---|
| Batch semantics | Integers, lists, and percentages via `pct_to_int` | Rich rollout control (staged, canary-like patterns) |
| Implementation detail | Repeated `pop(0)` from a list | O(H²) batching time for large inventories |
| Refactor direction | Index-based slicing (or `deque`) | Same semantics in O(H) time |
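The cost difference is easy to demonstrate in isolation. This standalone sketch (not Ansible code) batches a synthetic host list both ways; the `pop(0)` version shifts every remaining element on each pop, while slicing touches each host once:

```python
import timeit

def batch_pop(hosts, size):
    hosts = list(hosts)
    batches = []
    while hosts:
        batch = []
        for _ in range(size):
            if hosts:
                batch.append(hosts.pop(0))  # O(n) per pop: shifts the whole tail
        batches.append(batch)
    return batches

def batch_slice(hosts, size):
    # Index-based slicing: O(n) overall.
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]

hosts = ["host%d" % i for i in range(20000)]
t_pop = timeit.timeit(lambda: batch_pop(hosts, 10), number=1)
t_slice = timeit.timeit(lambda: batch_slice(hosts, 10), number=1)
print("pop(0): %.3fs  slice: %.3fs" % (t_pop, t_slice))

# Both strategies produce identical batches; only the cost differs.
assert batch_pop(hosts, 10) == batch_slice(hosts, 10)
```

At 20,000 hosts the gap is already visible; the quadratic version only gets worse from there.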
Illustrative linear-time batching refactor
The report suggests refactoring to avoid mutating the list from the front. Conceptually, you switch to index-based slicing while preserving the user-visible behavior:
```python
def _get_serialized_batches(self, play):
    all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
    all_hosts_len = len(all_hosts)

    serial_batch_list = play.serial or [-1]
    cur_item = 0
    serialized_batches = []

    index = 0
    while index < all_hosts_len:
        serial = pct_to_int(serial_batch_list[cur_item], all_hosts_len)

        if serial <= 0:
            serialized_batches.append(all_hosts[index:])
            break
        else:
            next_index = index + serial
            batch = all_hosts[index:next_index]
            if not batch:
                break
            serialized_batches.append(batch)
            index = next_index

        cur_item += 1
        if cur_item > len(serial_batch_list) - 1:
            cur_item = len(serial_batch_list) - 1

    return serialized_batches
```
Nothing about the orchestration contract changes—only the cost of getting there.
Failures, Early Exit, and Retries
Batching defines how we roll out; failure handling defines when we stop and how we recover. PlaybookExecutor encodes these policies in a tight loop over batches plus a small helper for retry files.
Batch-level failure policies
Once batches are computed, the executor restricts the inventory to each batch and calls TaskQueueManager.run(). During that loop, it watches for flags and host counts that tell it to stop early:
```python
self._tqm._unreachable_hosts.update(self._unreachable_hosts)

previously_failed = len(self._tqm._failed_hosts)
previously_unreachable = len(self._tqm._unreachable_hosts)

break_play = False
batches = self._get_serialized_batches(play)
if len(batches) == 0:
    self._tqm.send_callback('v2_playbook_on_play_start', play)
    self._tqm.send_callback('v2_playbook_on_no_hosts_matched')

for batch in batches:
    self._inventory.restrict_to_hosts(batch)
    try:
        result = self._tqm.run(play=play)
    except AnsibleEndPlay as e:
        result = e.result
        break

    if result & self._tqm.RUN_FAILED_BREAK_PLAY != 0:
        result = self._tqm.RUN_FAILED_HOSTS
        break_play = True

    failed_hosts_count = (
        len(self._tqm._failed_hosts) + len(self._tqm._unreachable_hosts)
        - (previously_failed + previously_unreachable)
    )
    if len(batch) == failed_hosts_count:
        break_play = True
        break

    previously_failed += len(self._tqm._failed_hosts) - previously_failed
    previously_unreachable += len(self._tqm._unreachable_hosts) - previously_unreachable

    self._unreachable_hosts.update(self._tqm._unreachable_hosts)

if break_play:
    break
```
The orchestration patterns here are reusable:
- **Failure as protocol, not exceptions:** `TaskQueueManager.run()` returns bit flags like `RUN_FAILED_BREAK_PLAY`. The executor interprets those into higher-level actions (normalize to `RUN_FAILED_HOSTS`, then stop the play). That keeps decision logic in the orchestrator while letting the worker signal intent.
- **Batch-level circuit breaker:** If every host in a batch failed or was unreachable, the executor stops iterating batches. There’s no point in continuing the rollout on a pattern that is clearly broken.
- **Cross-play state:** `self._unreachable_hosts` accumulates unreachable hosts across plays. That state feeds later decisions like retry generation.
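The "failure as protocol" idea is easy to reproduce outside Ansible. In this standalone sketch, the worker only reports status via bit flags and the orchestrator alone turns them into a continue/stop decision (the flag names echo Ansible's, but the values and helpers here are illustrative):

```python
# Worker-level status flags (illustrative values, not Ansible's constants).
RUN_OK = 0
RUN_FAILED_HOSTS = 1 << 1
RUN_FAILED_BREAK_PLAY = 1 << 3

def run_batch(batch):
    """Worker: reports what happened via flags, decides nothing."""
    failed = [h for h in batch if h.endswith("-bad")]
    if failed and len(failed) == len(batch):
        # Every host failed: signal that the play should stop.
        return RUN_FAILED_HOSTS | RUN_FAILED_BREAK_PLAY
    return RUN_FAILED_HOSTS if failed else RUN_OK

def orchestrate(batches):
    """Orchestrator: interprets worker flags into rollout policy."""
    for batch in batches:
        result = run_batch(batch)
        if result & RUN_FAILED_BREAK_PLAY:
            return "aborted"    # policy: whole batch failed, stop the rollout
    return "completed"

print(orchestrate([["a", "b"], ["c-bad", "d"]]))   # → completed (partial failure)
print(orchestrate([["a"], ["x-bad", "y-bad"]]))    # → aborted (whole batch failed)
```

Because the worker never raises for ordinary failures, the policy ("when do we stop?") can evolve in one place without touching execution code.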
Retry files: a tiny feature with big UX impact
Ansible’s retry files are a deceptively small feature: after a run, you get a `.retry` file listing failed and unreachable hosts, which you can feed back via `--limit @file.retry`. In PlaybookExecutor, this is handled by a focused helper:
```python
def _generate_retry_inventory(self, retry_path, replay_hosts):
    """Generate an inventory containing only failed/unreachable hosts."""
    try:
        makedirs_safe(os.path.dirname(retry_path))
        with open(retry_path, 'w') as fd:
            for x in replay_hosts:
                fd.write("%s\n" % x)
    except Exception as e:
        display.warning(
            "Could not create retry file '%s'.\n\t%s" % (retry_path, to_text(e))
        )
        return False

    return True
```
The orchestration logic around it lives in run(), once TaskQueueManager has reported its final host states:
```python
if self._tqm is not None:
    if C.RETRY_FILES_ENABLED:
        retries = set(self._tqm._failed_hosts.keys())
        retries.update(self._tqm._unreachable_hosts.keys())
        retries = sorted(retries)

        if len(retries) > 0:
            if C.RETRY_FILES_SAVE_PATH:
                basedir = C.RETRY_FILES_SAVE_PATH
            elif playbook_path:
                basedir = os.path.dirname(os.path.abspath(playbook_path))
            else:
                basedir = '~/'

            (retry_name, ext) = os.path.splitext(os.path.basename(playbook_path))
            filename = os.path.join(basedir, "%s.retry" % retry_name)
            if self._generate_retry_inventory(filename, retries):
                display.display("\tto retry, use: --limit @%s\n" % filename)
```
A few design choices stand out:
- A feature flag (`C.RETRY_FILES_ENABLED`) and configurable save path keep the core behavior opt-in and environment-aware.
- Failed and unreachable hosts are treated the same for retry purposes—both are "try again later" candidates.
- The orchestrator finishes with a concrete hint: `to retry, use: --limit @file.retry`, turning failure into a guided next step.
Conservative error handling at the edges
The retry helper catches Exception broadly and logs a warning instead of failing the run. For a CLI-oriented tool, that’s a pragmatic tradeoff: a filesystem glitch doesn’t get to break the entire playbook.
In an automation or API setting, you might tighten that up—distinguish PermissionError from other I/O issues, or expose a non-zero status when retry generation is considered part of the contract. The important part is that orchestration code is where those policy decisions live.
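A hedged sketch of what that tightening could look like. The function name mirrors the real helper, but this variant and its policy are hypothetical, not Ansible's actual behavior:

```python
import os
import tempfile

def generate_retry_inventory(retry_path, replay_hosts):
    """Stricter sketch of a retry-file writer: permission problems surface
    to the caller, other I/O errors stay warnings (hypothetical policy)."""
    try:
        os.makedirs(os.path.dirname(retry_path), exist_ok=True)
        with open(retry_path, "w") as fd:
            for host in replay_hosts:
                fd.write("%s\n" % host)
    except PermissionError:
        # Policy choice: in an API/automation setting, retry generation is
        # part of the contract, so re-raise instead of downgrading.
        raise
    except OSError as e:
        print("warning: could not create retry file '%s': %s" % (retry_path, e))
        return False
    return True

path = os.path.join(tempfile.mkdtemp(), "retries", "site.retry")
generate_retry_inventory(path, ["db1.example.com", "web2.example.com"])
print(open(path).read())
```

The point is not this particular policy, but that the orchestrator is the right place to make it explicit.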
Callbacks and Observability
Beyond control flow, PlaybookExecutor also defines how runs are made observable. It doesn’t log or print for every event directly; instead it emits callback events that other components can subscribe to.
Observer pattern in practice
Throughout execution, the executor sends events like:
- `v2_playbook_on_start`
- `v2_playbook_on_play_start`
- `v2_playbook_on_no_hosts_matched`
- `v2_playbook_on_vars_prompt`
- `v2_playbook_on_stats`
Different callback plugins can then render these as human-readable output, JSON logs, or metrics. The executor itself stays focused on sequencing and policy, not on presentation.
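The underlying observer pattern is straightforward to reproduce. In this standalone sketch (event names are modeled on Ansible's; the classes themselves are illustrative), the orchestrator only emits events and each subscriber decides how to render them:

```python
class CallbackBase:
    """Default no-op handlers, so plugins override only what they need."""
    def v2_playbook_on_play_start(self, play): pass
    def v2_playbook_on_stats(self, stats): pass

class StdoutCallback(CallbackBase):
    def v2_playbook_on_play_start(self, play):
        print("PLAY [%s]" % play)

class MetricsCallback(CallbackBase):
    def __init__(self):
        self.plays_started = 0
    def v2_playbook_on_play_start(self, play):
        self.plays_started += 1

class Orchestrator:
    def __init__(self, callbacks):
        self._callbacks = callbacks

    def send_callback(self, method_name, *args):
        # Fan each event out to every subscriber that implements it.
        for cb in self._callbacks:
            getattr(cb, method_name, lambda *a: None)(*args)

    def run(self, plays):
        for play in plays:
            self.send_callback("v2_playbook_on_play_start", play)

metrics = MetricsCallback()
Orchestrator([StdoutCallback(), metrics]).run(["web tier", "db tier"])
print(metrics.plays_started)  # → 2
```

Adding a JSON logger or a metrics exporter means writing one more subscriber, never touching the orchestrator.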
What to measure in an orchestrator
The report suggests a set of metrics that make this behavior visible in real deployments. Three are especially useful when you treat orchestration as a product:
- **Playbook duration:** a gauge like `playbook_executor.play_duration_seconds` for each run, which includes orchestration overhead as well as remote execution. Tracking p95 against an SLO gives you a clear sense of when runs become too slow for teams.
- **Batches per play:** a counter such as `playbook_executor.batches_per_play`. This shows whether `serial` is tuned sensibly (few huge batches versus many tiny ones) and how rollout patterns change over time.
- **Retry pressure:** a metric like `playbook_executor.retry_file_hosts_count`, counting hosts that end up in retry files. Persistent high ratios indicate systemic problems rather than random flakiness.
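A minimal sketch of wiring those three metrics around a play. The metric names follow the report's suggestions; the `Metrics` class is a stand-in for whatever statsd/Prometheus client you actually use, and `run_play_with_metrics` is a hypothetical wrapper, not Ansible code:

```python
import time

class Metrics:
    """Stand-in for a real statsd/Prometheus client."""
    def __init__(self):
        self.values = {}
    def gauge(self, name, value):
        self.values[name] = value
    def incr(self, name, amount=1):
        self.values[name] = self.values.get(name, 0) + amount

def run_play_with_metrics(metrics, batches, failed_hosts):
    start = time.monotonic()
    for batch in batches:
        metrics.incr("playbook_executor.batches_per_play")
        # ... run the batch here ...
    metrics.gauge("playbook_executor.play_duration_seconds",
                  time.monotonic() - start)
    metrics.gauge("playbook_executor.retry_file_hosts_count",
                  len(failed_hosts))

m = Metrics()
run_play_with_metrics(m, batches=[["a"], ["b", "c"]], failed_hosts=["b"])
print(m.values["playbook_executor.batches_per_play"])  # → 2
```

Because the measurement points sit in the orchestrator, the metrics capture the whole run, including batching and retry overhead, not just per-task execution.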
Practical Patterns to Reuse
Stepping back from Ansible specifics, PlaybookExecutor is a compact example of why orchestration deserves deliberate design. The class doesn’t execute modules itself; it encodes policies that define how safe, observable, and usable the whole system feels.
1. Treat orchestration as a first-class product
Design and review the orchestrator with the same care you’d give any user-facing service. Decisions about batching, stopping conditions, retries, and prompts directly shape the operator’s experience and failure modes.
2. Use simple semantics backed by focused helpers
Features like `serial` and retry files have simple, predictable semantics at the playbook level and are implemented by small helpers such as `_get_serialized_batches()` and `_generate_retry_inventory()`. That keeps policies easy to reason about and localizes complexity.
3. Watch the cost of "preparing work"
The quadratic batching behavior is a reminder that orchestration code can become a bottleneck at scale. Anywhere you transform large host lists, queues, or shards, treat performance as a first-class concern and prefer linear-time algorithms when behavior allows.
4. Separate worker results from orchestration policy
Let your worker layer return a small set of status flags. Let your orchestrator decide what those mean: continue, break the batch, break the run, or generate retries. That separation makes it easier to evolve policies without rewriting low-level execution code.
5. Make observability pluggable via callbacks
By emitting callback events instead of hard-coding logs, PlaybookExecutor allows different environments to attach their own UX and monitoring behavior. Adopting the same observer-style pattern in your orchestrator keeps it adaptable as your tooling evolves.
If you approach your own automation systems with the mindset that "orchestration is the product", you naturally start to ask better questions: How do we limit blast radius? How do we know when to stop? How do we help people recover? PlaybookExecutor offers concrete answers to all three—and a set of patterns you can carry into your next executor design.