Complex AI agents rarely fail because of a single prompt or a single tool. They fail in the space between those pieces: the loops, the decisions, and the orchestration that glues everything together. In crewAI, that glue lives inside CrewAgentExecutor, a surprisingly rich class that turns raw LLMs and tools into reliable agents. I'm Mahmoud Zalt, an AI solutions architect, and we’ll walk through how this executor behaves like a control tower for your agents — and what we can reuse from its design when building our own orchestration code.
Setting the scene
We’re examining how crewAI runs a single agent to completion. crewAI is an orchestration framework for LLM‑powered agents; it doesn’t try to be an LLM or a tool library itself. At the center of its agents layer is CrewAgentExecutor, a class whose job is to decide when to call the LLM, when to call tools, how to handle errors, and when to stop.
project-root/
lib/
crewai/
src/
crewai/
agents/
base_agent_executor.py # Base lifecycle and shared logic
crew_agent_executor.py # This file: orchestrates agent + tools + LLM
core/
providers/
human_input.py # Human feedback provider used here
events/
event_bus.py # crewai_event_bus observed by executor
types/
logging_events.py # AgentLogsStartedEvent, AgentLogsExecutionEvent
tool_usage_events.py # ToolUsage* events from tool execution
utilities/
agent_utils.py # LLM response helpers, context handling
file_store.py # get_all_files/aget_all_files for multimodal
training_handler.py # CrewTrainingHandler for TRAINING_DATA_FILE
tool_utils.py # execute_tool_and_check_finality, async variant
i18n.py # I18N_DEFAULT for prompts and tool names
CrewAgentExecutor sits in the middle of the agents layer, orchestrating many utilities.
At a high level, a run looks like this:
- invoke/ainvoke is called with a dict of inputs.
- Prompts are formatted, multimodal files attached, and an initial message history is built.
- A main loop runs: call the LLM, interpret the result as either an AgentAction (use a tool) or AgentFinish (we’re done).
- Tool calls are executed, results logged and appended to messages.
- Human feedback and training data are optionally captured.
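Stripped of crewAI's specifics, that lifecycle can be sketched as a short skeleton. All names below (run_agent, the call_llm callable, the tools dict) are invented for illustration, not the real API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class AgentFinish:
    output: str

@dataclass
class AgentAction:
    tool: str
    args: dict[str, Any]

def run_agent(inputs, call_llm, tools, max_iter=10):
    # 1. Build the initial message history from the formatted inputs.
    messages = [{"role": "user", "content": str(inputs.get("input", ""))}]
    # 2. Main loop: ask the LLM, then either run a tool or finish.
    for _ in range(max_iter):
        answer = call_llm(messages)
        if isinstance(answer, AgentFinish):
            return answer  # the run "lands"
        # 3. AgentAction: execute the tool, append the result, loop again.
        result = tools[answer.tool](**answer.args)
        messages.append({"role": "tool", "name": answer.tool, "content": result})
    # 4. Never spin forever: fail loudly if the iteration budget runs out.
    raise RuntimeError("Agent loop exhausted max_iter without finishing.")
```

The real executor layers error handling, events, and training capture onto this shape, but the skeleton is the same: one loop, one history, one explicit exit.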
This is not a thin wrapper around an LLM. It’s the control tower for a single agent: it decides who talks when, tracks shared history, enforces limits, and tells everyone when the flight is over.
The agent loop as a control tower
Once the executor is wired up, the core question becomes: how does this control tower make sure a conversation actually lands? That logic lives in the agent loop.
The first decision in each run is whether to use native function calling or a ReAct‑style text protocol. The executor chooses a strategy up front:
def _invoke_loop(self) -> AgentFinish:
"""Execute agent loop until completion."""
use_native_tools = (
hasattr(self.llm, "supports_function_calling")
and callable(getattr(self.llm, "supports_function_calling", None))
and self.llm.supports_function_calling()
and self.original_tools
)
if use_native_tools:
return self._invoke_loop_native_tools()
return self._invoke_loop_react()
This is a straightforward Strategy pattern: the goal (“run the agent to completion”) is fixed, but the algorithm depends on LLM capabilities. The rest of the class is structured around this switch.
The ReAct path exposes the full machinery of the control tower:
def _invoke_loop_react(self) -> AgentFinish:
formatted_answer = None
while not isinstance(formatted_answer, AgentFinish):
try:
if has_reached_max_iterations(self.iterations, self.max_iter):
formatted_answer = handle_max_iterations_exceeded(
formatted_answer,
printer=PRINTER,
messages=self.messages,
llm=cast("BaseLLM", self.llm),
callbacks=self.callbacks,
verbose=self.agent.verbose,
)
break
enforce_rpm_limit(self.request_within_rpm_limit)
answer = get_llm_response(
llm=cast("BaseLLM", self.llm),
messages=self.messages,
callbacks=self.callbacks,
printer=PRINTER,
from_task=self.task,
from_agent=self.agent,
response_model=self.response_model,
executor_context=self,
verbose=self.agent.verbose,
)
# ... parse into AgentAction or AgentFinish ...
if isinstance(formatted_answer, AgentAction):
tool_result = execute_tool_and_check_finality(...)
formatted_answer = self._handle_agent_action(
formatted_answer, tool_result
)
self._invoke_step_callback(formatted_answer)
self._append_message(formatted_answer.text)
except OutputParserError:
formatted_answer = handle_output_parser_exception(...)
except Exception as e:
if e.__class__.__module__.startswith("litellm"):
raise e
if is_context_length_exceeded(e):
handle_context_length(...)
continue
handle_unknown_error(PRINTER, e, verbose=self.agent.verbose)
raise e
finally:
self.iterations += 1
if not isinstance(formatted_answer, AgentFinish):
raise RuntimeError("Agent execution ended without reaching a final answer.")
self._show_logs(formatted_answer)
return formatted_answer
A few orchestration choices stand out:
- Termination is explicit. has_reached_max_iterations and handle_max_iterations_exceeded guarantee the loop ends. You never silently spin as the LLM keeps requesting tools.
- Rate limiting is at the loop boundary. enforce_rpm_limit runs once per iteration, so request budgets are enforced where you can see them, not buried in a client wrapper.
- Context length is a handled failure mode. is_context_length_exceeded and handle_context_length are integrated into the loop. Instead of letting providers throw and crash the run, the executor trims or adjusts history and retries.
- Parser failures are treated as normal. OutputParserError is caught and normalized via handle_output_parser_exception, acknowledging that ReAct parsing is probabilistic and must be retried.
The result is simple but critical: the loop either finishes with a valid AgentFinish or fails loudly with a clear error. For production agents, that boring predictability is the difference between “works in a notebook” and “survives real users.”
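That "finish or fail loudly" contract is worth copying even without the rest of the machinery. A stripped-down version, with parser failures treated as retries, might look like this (the step callable and both names are invented stand-ins for one LLM-call-plus-parse iteration):

```python
class OutputParserError(Exception):
    """Raised when the LLM's free-text output can't be parsed into an action."""

def run_with_retries(step, max_iter=5):
    """Run `step` until it yields a final answer: either return that answer
    or raise -- the loop never ends silently."""
    for _ in range(max_iter):
        try:
            return step()
        except OutputParserError:
            # Parser failures are expected with ReAct-style text protocols;
            # treat them as a normal retry, not a crash.
            continue
    raise RuntimeError("Agent execution ended without reaching a final answer.")
```

The invariant is the point: every exit path is either a valid answer or an exception a caller can observe.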
Tool calls as a disciplined kitchen
Once the loop decides a tool should run, the executor shifts from control tower to restaurant kitchen. The LLM places orders (tool calls), the executor dispatches them to functions, and then plates the result back into the shared conversation.
Native tools are where this kitchen is most structured. The central worker is _execute_single_native_tool_call, which concentrates argument handling, limits, caching, hooks, and events in one place:
def _execute_single_native_tool_call(
self,
*,
call_id: str,
func_name: str,
func_args: str | dict[str, Any],
available_functions: dict[str, Callable[..., Any]],
original_tool: Any | None = None,
should_execute: bool = True,
) -> dict[str, Any]:
args_dict, parse_error = parse_tool_call_args(
func_args, func_name, call_id, original_tool
)
if parse_error is not None:
return parse_error
max_usage_reached = False
if not should_execute and original_tool:
max_usage_reached = True
elif (
should_execute
and original_tool
and (max_count := getattr(original_tool, "max_usage_count", None)) is not None
and getattr(original_tool, "current_usage_count", 0) >= max_count
):
max_usage_reached = True
from_cache = False
result: str = "Tool not found"
input_str = json.dumps(args_dict) if args_dict else ""
if self.tools_handler and self.tools_handler.cache:
cached_result = self.tools_handler.cache.read(tool=func_name, input=input_str)
if cached_result is not None:
result = str(cached_result) if not isinstance(cached_result, str) else cached_result
from_cache = True
# Emit start event, run hooks, execute or skip, emit finished/error events,
# and return a structured result dict.
This function encapsulates several cross‑cutting concerns:
- Argument parsing is centralized via parse_tool_call_args, so provider‑specific quirks don’t leak into the loop.
- Usage limits (max_usage_count) live next to the tool, not in the control flow.
- Caching is delegated to ToolsHandler.cache, but controlled here, with an optional cache_function policy on the tool.
- Hooks around execution use ToolCallHookContext, enabling policy or tracing without touching core logic.
- Events (ToolUsageStartedEvent, ToolUsageFinishedEvent, ToolUsageErrorEvent) are emitted predictably, baking observability into each call.
Conceptually, each tool call is a Command: an executable unit with metadata that can be logged, cached, and decorated. The executor is the command dispatcher.
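The Command idea can be reduced to a small sketch: a ToolCall value object plus one dispatcher method that owns caching, unknown-tool fallback, and logging. Names and structure here are invented for illustration; the real executor handles far more concerns at this choke point:

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolCall:
    """One command: a call id, the function name, and parsed arguments."""
    call_id: str
    func_name: str
    args: dict[str, Any]

class ToolDispatcher:
    """Funnels every ToolCall through one choke point, so caching,
    unknown-tool handling, and logging all live in a single place."""

    def __init__(self, functions: dict[str, Callable[..., str]]):
        self.functions = functions
        self.cache: dict[tuple[str, str], str] = {}
        self.log: list[str] = []

    def execute(self, call: ToolCall) -> str:
        # Canonical cache key: tool name plus sorted JSON of its arguments.
        key = (call.func_name, json.dumps(call.args, sort_keys=True))
        if key in self.cache:
            self.log.append(f"cache-hit {call.func_name}")
            return self.cache[key]
        fn = self.functions.get(call.func_name)
        if fn is None:
            return "Tool not found"  # mirrors the executor's fallback string
        self.log.append(f"run {call.func_name}")
        result = fn(**call.args)
        self.cache[key] = result
        return result
```

Because every call passes through execute, policies like limits or audit trails bolt on in one place instead of being scattered through the loop.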
After execution, the result is stitched back into the conversation and may even terminate the run:
def _append_tool_result_and_check_finality(
self, execution_result: dict[str, Any]
) -> AgentFinish | None:
call_id = cast(str, execution_result["call_id"])
func_name = cast(str, execution_result["func_name"])
result = cast(str, execution_result["result"])
original_tool = execution_result["original_tool"]
tool_message: LLMMessage = {
"role": "tool",
"tool_call_id": call_id,
"name": func_name,
"content": result,
}
self.messages.append(tool_message)
if (
original_tool
and hasattr(original_tool, "result_as_answer")
and original_tool.result_as_answer
):
return AgentFinish(
thought="Tool result is the final answer",
output=result,
text=result,
)
return None
This ties into an important metaphor: the message history is a shared notebook. User, assistant, and tools all write into it. The executor keeps the notebook coherent and respects tools that declare, via result_as_answer, “this output is the final answer.”
ReAct vs native tools: one brain, two strategies
ReAct and native tools look different, but the executor treats them as two strategies for the same mental loop: repeatedly “think → maybe act → think again” until you reach AgentFinish.
With native tools, the loop leans on provider‑level structured calling. It converts internal tools into a provider schema, then interprets responses as either tool calls or final text:
openai_tools, available_functions, self._tool_name_mapping = (
convert_tools_to_openai_schema(self.original_tools)
)
while True:
# ... max_iter, rpm ...
answer = get_llm_response(
llm=cast("BaseLLM", self.llm),
messages=self.messages,
callbacks=self.callbacks,
printer=PRINTER,
tools=openai_tools,
available_functions=None,
...,
)
if isinstance(answer, list) and answer and self._is_tool_call_list(answer):
tool_finish = self._handle_native_tool_calls(answer, available_functions)
if tool_finish is not None:
return tool_finish
continue
if isinstance(answer, str):
formatted_answer = AgentFinish(thought="", output=answer, text=answer)
# ... log, append, return ...
Under the hood, helpers like _is_tool_call_list and _parse_native_tool_call recognize provider‑specific shapes (OpenAI, Anthropic, Bedrock, Gemini) and normalize them to simple tuples like (call_id, func_name, func_args). That’s a clean Adapter pattern: external protocol diversity, internal uniformity.
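As a sketch, such an adapter just pattern-matches on shape and emits one internal tuple. The two payload shapes below are simplified assumptions about OpenAI- and Anthropic-style tool calls, not the exact structures the executor handles:

```python
import json
from typing import Any

def parse_tool_call(raw: dict[str, Any]) -> tuple[str, str, dict[str, Any]]:
    """Adapt one provider-specific tool call to (call_id, func_name, func_args)."""
    if raw.get("type") == "tool_use":
        # Anthropic-style block: {"type": "tool_use", "id", "name", "input"}
        return raw["id"], raw["name"], raw["input"]
    # OpenAI-style call: {"id", "function": {"name", "arguments"}}, where
    # arguments arrive as a JSON string that still needs decoding.
    fn = raw["function"]
    args = fn["arguments"]
    if isinstance(args, str):
        args = json.loads(args)
    return raw["id"], fn["name"], args
```

Everything downstream of this function sees one uniform tuple, so adding a new provider means touching the adapter, never the loop.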
A subtle part of this design is how it treats multiple tool calls in one response. Should they run in parallel? The executor encodes the answer as a simple policy:
if len(parsed_calls) > 1:
has_result_as_answer_in_batch = any(
bool(
original_tools_by_name.get(func_name)
and getattr(original_tools_by_name.get(func_name), "result_as_answer", False)
)
for _, func_name, _ in parsed_calls
)
has_max_usage_count_in_batch = any(
bool(
original_tools_by_name.get(func_name)
and getattr(original_tools_by_name.get(func_name), "max_usage_count", None)
is not None
)
for _, func_name, _ in parsed_calls
)
# Preserve sequential behavior when semantics demand it.
if has_result_as_answer_in_batch or has_max_usage_count_in_batch:
logger.debug("Skipping parallel native execution...")
else:
# Build execution_plan and submit to ThreadPoolExecutor(...)
Parallel execution is gated on result_as_answer and usage limits. The trade‑offs are explicit:
- Correctness. Tools that cap their usage or directly answer the user should not run concurrently; casual threading would race on their shared counters or finality semantics.
- Performance. Clearly independent tools can be executed in parallel (up to a fixed worker limit) to cut tail latency.
- Simplicity. Instead of a general DAG, the executor uses simple booleans on tools to decide whether parallelism is even allowed.
This is a reusable pattern: encode constraints as properties on tools, and let the orchestrator decide if and how to parallelize. You keep orchestration logic generic while still respecting domain semantics.
Hard‑earned lessons you can reuse
Stepping back, CrewAgentExecutor is a large class. Sync and async loops are duplicated, and inputs depend on specific dict keys like "input", "tool_names", and "tools" without strong validation. You could extract helpers like a dedicated ToolCallExecutor or TrainingRecorder to slim it down.
But the more important story is what this file teaches about building agent executors in general: how to design the loop as a control tower rather than a ball of glue. Here are the core lessons worth carrying into your own systems.
1. Treat the executor as a control tower, not a Swiss army knife
The executor already coordinates many concerns: LLM orchestration, tools, hooks, training data capture, human feedback, and logging. It works, but you can see the pressure on class size and complexity.
In your own designs, keep the control‑tower role but give it collaborators from day zero: one object responsible for the loop and messaging; separate components for tool execution, training recording, and human‑in‑the‑loop prompts. The orchestrator should coordinate flights, not repair engines.
2. Make the agent loop boringly predictable
The main loops here are not fancy, but they are deliberate:
- Bounded iterations via max_iter and an explicit iteration counter.
- Dedicated handling of OutputParserError and context‑length errors, with clear retry behavior.
- A strong invariant: runs either end in AgentFinish or raise a RuntimeError rather than silently stopping.
For LLM systems, that kind of predictable loop is a feature. You want the non‑determinism in the model’s answers, not in your control flow.
3. Centralize tool semantics and policy
Tool semantics in this executor are funneled through a small set of functions and properties:
- Caching decisions through ToolsHandler.cache and optional cache_function hooks.
- Usage constraints via max_usage_count and current_usage_count.
- Answer semantics through result_as_answer.
- Hooks and events around every call for policy, tracing, and logging.
That centralization makes it possible to reason about performance, safety, and correctness in one place. If your tools have side effects, this is also the right layer to add idempotency guards or audit logging without touching the loop itself.
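For instance, an idempotency guard could wrap whatever function executes tools, without the loop ever knowing. The execute(name, args) signature below is invented for this sketch:

```python
import json

def with_idempotency(execute, seen=None):
    """Wrap a tool-execution function so identical calls run at most once;
    repeats return the recorded result instead of re-firing side effects."""
    seen = {} if seen is None else seen

    def guarded(name, args):
        # Key on tool name plus canonicalized arguments.
        key = (name, json.dumps(args, sort_keys=True))
        if key not in seen:
            seen[key] = execute(name, args)
        return seen[key]

    return guarded
```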
4. Hide provider quirks behind adapters
The native tools implementation has to deal with OpenAI’s function calls, Anthropic’s tool_use, Bedrock’s toolUseId, and Gemini’s function_call formats. The executor acknowledges these differences only in narrowly scoped helpers like _is_tool_call_list and _parse_native_tool_call, then moves on with a simple internal representation.
That’s textbook Adapter pattern. If you plan to support multiple providers, pick a small, clean internal schema for tool calls early, and treat every provider response as an input format to be adapted. Don’t let provider quirks leak into your main loop.
5. Design for observability from day one
Finally, CrewAgentExecutor shows what it looks like when observability is part of the orchestration contract:
- Every agent run emits start and execution events on crewai_event_bus (AgentLogsStartedEvent, AgentLogsExecutionEvent).
- Every tool emits start, finish, and error events, which can feed logs, metrics, or tracing systems.
- Callbacks and hooks are first‑class, so external systems can attach behavior without patching core code.
The same concerns you see in the code — iterations, LLM calls, tool execution, context truncation, and errors — are the ones you should expose as metrics and alerts in your own executor. That alignment between control flow and telemetry is what makes production debugging tractable.
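A sketch of that alignment: wrap each tool call so it emits start, finish, and error events to whatever sink you already use for metrics. ToolEvent and traced are invented names loosely mirroring the ToolUsage* events, not crewAI's API:

```python
import time
from dataclasses import dataclass

@dataclass
class ToolEvent:
    kind: str          # "started" | "finished" | "error"
    tool: str
    elapsed: float = 0.0

def traced(fn, tool_name, emit):
    """Emit lifecycle events around every call to `fn`."""
    def wrapper(*args, **kwargs):
        emit(ToolEvent("started", tool_name))
        t0 = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            # Failures get an event too, then propagate unchanged.
            emit(ToolEvent("error", tool_name, time.perf_counter() - t0))
            raise
        emit(ToolEvent("finished", tool_name, time.perf_counter() - t0))
        return result
    return wrapper
```

Because the events track the control flow one-to-one, a dashboard built on them shows exactly where a run stalled: in the model, in a tool, or in retries.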
CrewAgentExecutor may look like “just another big class”, but read as a story, it’s about how to turn a raw LLM and a pile of tools into a dependable agent: a single control loop, two tool strategies, and a disciplined approach to limits, errors, and observability. The primary lesson is to design your agent loop as a control tower — a focused orchestrator that keeps everyone talking in the right order until the plane lands safely.
If you’re designing your own executors, a few concrete takeaways:
- Give the loop clear termination rules and explicit error‑recovery paths, especially for parser and context‑length failures.
- Centralize tool execution behind a small API that owns semantics, limits, caching, hooks, and events.
- Hide provider quirks behind adapters and line up your telemetry with the control flow you actually care about.
As agents grow more complex, this control‑tower mindset becomes the difference between orchestrators that can be trusted in production and ones that remain fragile prototypes.
