How to Choose an AI Agent Orchestration Framework (or Skip One)

Which Agent Framework Should You Use?

Use no framework at all until you have a working proof-of-concept without one. The orchestration design (how agents hand off state, how failures are caught, how tools are scoped) matters far more than which library manages the graph. Once you have that design, a framework is just a runtime detail.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years of building production software (since 2010). Years ago I designed Apiato, an open-source framework that thousands of teams build APIs on, so I have opinions about what a framework should and should not decide for you. I now apply that lens at Sista AI, the company I founded, where autonomous agents have run in production for the past year. I advise engineering teams on production AI architecture through my AI architecture advisory service. What follows are the framework selection criteria I actually apply, including the cases where I tell clients to skip every named framework and write 200 lines of plain code instead.

The Real Question Is Not Which Framework

Most teams reach for LangGraph or CrewAI before they have answered three prior questions:

Is the graph static or dynamic? If the set of agents and their connections is known at design time, you do not need a graph runtime. A plain function pipeline is faster to debug and cheaper to run.
Who owns state between steps? Passing state through a framework channel you do not control makes observability and rollback painful. If you cannot explain exactly what is in state at every node transition, the framework is hiding complexity, not reducing it.
What is the failure surface? Agent systems fail at tool calls, at context window limits, at malformed LLM output, and at retry storms. A framework that does not give you explicit hooks at each of those failure points is a liability.

Answer these first. Then picking or skipping a framework becomes obvious.

LangGraph vs CrewAI vs Custom: An Honest Comparison

Here is how the three options break down across the criteria that actually matter in production:

Criterion	LangGraph	CrewAI	Custom / plain code
Graph shape	Explicit directed graph, cycles allowed	Role-based crews, implicit routing	Whatever you design
State management	Typed state schema per node	Crew-level shared context	Explicit, you own it
Observability	LangSmith integration, trace per run	Built-in callbacks, less granular	You wire OpenTelemetry yourself
Tool / MCP support	First-class, schema-validated	First-class, less strict typing	You implement the contract
Human-in-the-loop	Interrupt / resume built in	Manual step override only	Design it exactly as needed
Streaming	Token and event streaming	Limited native streaming	Trivial with the SDK directly
Lock-in risk	Medium: LangChain dependency chain	Low-medium: cleaner abstractions	None
Good fit	Complex stateful graphs, retry logic, HITL	Role-task decomposition, simpler flows	Simple pipelines, cost-sensitive, or unusual topology

My default recommendation: if the workflow has more than 5 conditional branches or requires human approval at runtime, LangGraph earns its weight. If it is a straightforward planner-executor pattern with 2-4 agents and no cycles, CrewAI or plain code is faster to ship and cheaper to maintain.

The Architecture That Survives a Framework Swap

The teams that get burned by framework lock-in all made the same mistake: they let the framework own the domain logic. The fix is a three-layer separation that I enforce on every engagement:

Layer 1: Orchestration contract

Define a thin interface for what an 'agent step' means in your system: it receives a typed input context, it produces a typed output context, and it declares the tools it may call. This interface lives in your domain code, not in LangGraph or CrewAI types.
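
A minimal sketch of what such a contract could look like in plain Python, using only the standard library. The names `AgentStep`, `ExtractInput`, `ExtractOutput`, and `ExtractTitleStep` are hypothetical, and the step body is a toy stand-in for a real model call. The point is only that the interface imports nothing from LangGraph or CrewAI.

```python
from dataclasses import dataclass
from typing import Protocol, TypeVar

InT = TypeVar("InT", contravariant=True)
OutT = TypeVar("OutT", covariant=True)


class AgentStep(Protocol[InT, OutT]):
    """Domain-level contract: typed input, typed output, declared tools."""

    name: str
    allowed_tools: frozenset[str]

    def run(self, ctx: InT) -> OutT: ...


@dataclass(frozen=True)
class ExtractInput:
    document_text: str


@dataclass(frozen=True)
class ExtractOutput:
    title: str
    word_count: int


class ExtractTitleStep:
    """Hypothetical step; a real one would call a model instead of slicing text."""

    name = "extract_title"
    allowed_tools: frozenset[str] = frozenset()  # this step may call no tools

    def run(self, ctx: ExtractInput) -> ExtractOutput:
        lines = ctx.document_text.strip().splitlines()
        title = lines[0] if lines else ""
        return ExtractOutput(title=title, word_count=len(ctx.document_text.split()))
```

Because `AgentStep` is a structural `Protocol`, `ExtractTitleStep` satisfies it without inheriting from anything, which keeps the domain code free of framework base classes.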

Layer 2: Framework adapter

The framework adapter wraps your domain agents in whatever node or crew the framework expects. It handles retries, timeout, and serialization. If LangGraph ships a breaking change, you rewrite the adapter, not the domain agents.
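
As a rough illustration, the adapter can be a single function that turns a domain step into whatever callable the runtime expects. The sketch below assumes the runtime wants an async dict-in, dict-out node; that shape, along with the names `as_framework_node`, `TransientError`, `SummaryInput`, `SummaryOutput`, and `summarize`, is hypothetical and not taken from any specific framework.

```python
import asyncio
from dataclasses import asdict, dataclass
from typing import Any, Awaitable, Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


@dataclass(frozen=True)
class SummaryInput:
    text: str


@dataclass(frozen=True)
class SummaryOutput:
    summary: str


async def summarize(ctx: SummaryInput) -> SummaryOutput:
    """Domain agent: knows nothing about any framework (toy body, no model call)."""
    return SummaryOutput(summary=ctx.text[:20])


def as_framework_node(
    step: Callable[[Any], Awaitable[Any]],
    input_type: type,
    *,
    retries: int = 2,
    timeout_s: float = 5.0,
) -> Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]:
    """Wrap a domain step as an async dict-in, dict-out node."""

    async def node(state: dict[str, Any]) -> dict[str, Any]:
        ctx = input_type(**state)  # deserialize framework state into the typed input
        last_error: Exception | None = None
        for _ in range(retries + 1):
            try:
                result = await asyncio.wait_for(step(ctx), timeout=timeout_s)
                return asdict(result)  # serialize the typed output back to a dict
            except (asyncio.TimeoutError, TransientError) as exc:
                last_error = exc  # retry on timeout or a transient failure
        raise RuntimeError(f"step failed after {retries + 1} attempts") from last_error

    return node
```

Usage would look like `node = as_framework_node(summarize, SummaryInput)` followed by `await node({"text": "..."})`. Swapping runtimes then means rewriting `as_framework_node`, while `summarize` stays untouched.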

Layer 3: Infrastructure

Tracing (OpenTelemetry or LangSmith), checkpointing (Redis or Postgres for resumable runs), and secret injection (never pass API keys through agent state). These are wired at the infrastructure layer, invisible to agents.

A team I worked with migrated a 6-agent document-processing pipeline from CrewAI to a custom async Python scheduler in under a week because their domain agents were already isolated. The migration cost was one afternoon. If the domain logic had been CrewAI-specific, it would have been a rewrite.

When to Skip a Framework Entirely

You probably do not need a framework if:

The flow is a DAG with no runtime branching. A linear pipeline of async calls with error handling is 150 lines of Python or TypeScript and zero new dependencies.
You are calling one LLM with tools. The Anthropic SDK and OpenAI SDK both support tool-use natively. Wrapping them in a framework adds a dependency and a debugging layer for no gain.
Latency is a hard constraint. Every framework adds overhead: serialization, state checkpointing, graph traversal. For sub-500ms response requirements, hand-roll the pipeline.
The team is small and the flow is stable. Framework abstractions pay off when the graph is complex or evolving. For a stable 3-step flow that two engineers will maintain, the framework is overhead, not leverage.

I have seen startups ship a 'multi-agent system' that was genuinely just three sequential LLM calls with a switch statement. They were right to keep it that way. The switch statement is readable, testable, and has no GitHub issue tracker.
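
For a sense of scale, the framework-free pipeline described in this section can be sketched in a few dozen lines. Everything here is illustrative: `call_llm` is a hypothetical offline stand-in for a real SDK call, the keyword rule in `classify` replaces a model call, and the step names are made up.

```python
import asyncio


class PipelineError(Exception):
    """Carries the name of the step that failed, so errors are easy to locate."""


async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an SDK call; echoes so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"[model output for: {prompt}]"


async def classify(ticket: str) -> str:
    # Toy keyword rule in place of a model call.
    return "billing" if "invoice" in ticket.lower() else "general"


async def draft_reply(ticket: str, category: str) -> str:
    # The "switch statement": plain branching on the classifier result.
    if category == "billing":
        prompt = f"Reply to a billing question: {ticket}"
    else:
        prompt = f"Reply to a general question: {ticket}"
    return await call_llm(prompt)


async def review(draft: str) -> str:
    return await call_llm(f"Tighten this reply: {draft}")


async def run_pipeline(ticket: str) -> str:
    step = "classify"
    try:
        category = await classify(ticket)
        step = "draft"
        draft = await draft_reply(ticket, category)
        step = "review"
        return await review(draft)
    except Exception as exc:
        raise PipelineError(f"step '{step}' failed") from exc
```

Calling `asyncio.run(run_pipeline("Where is my invoice?"))` walks the three steps in order, and any failure surfaces as a `PipelineError` that names the step.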

Evals, Guardrails, and Observability in Any Framework

The framework decision is independent of these three production requirements. You need all three regardless of what you pick.

Evals

Run evals on individual agent steps, not just end-to-end. A failing pipeline is almost impossible to debug without step-level golden datasets. Use a 30-50 example eval set per agent, score on task success and output schema validity, and run it in CI on every model or prompt change. LangSmith, Braintrust, and PromptFoo all work here; pick the one your team will actually run.
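
A step-level eval does not need a platform to get started. The sketch below scores one agent step on the two axes mentioned above, schema validity and task success, assuming the step returns a JSON object with a `label` field. `EvalCase`, `score_step`, and `toy_classifier` are hypothetical names, and the classifier is a keyword rule standing in for a model.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected_label: str


def score_step(step: Callable[[str], str], cases: Sequence[EvalCase]) -> dict[str, float]:
    """Return the fraction of cases with valid output and with the right label."""
    if not cases:
        raise ValueError("eval set is empty")
    schema_ok = 0
    task_ok = 0
    for case in cases:
        raw = step(case.input_text)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output counts against both scores
        if not isinstance(parsed, dict) or "label" not in parsed:
            continue
        schema_ok += 1
        if parsed["label"] == case.expected_label:
            task_ok += 1
    return {
        "schema_validity": schema_ok / len(cases),
        "task_success": task_ok / len(cases),
    }


def toy_classifier(text: str) -> str:
    """Stand-in for an agent step; returns broken output on empty input."""
    if not text:
        return "not json"
    label = "refund" if "refund" in text.lower() else "other"
    return json.dumps({"label": label})
```

In CI, the returned scores could be compared against thresholds the team picks, failing the build when a prompt or model change drops either number.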

Guardrails

Validate LLM output schemas before passing them to the next agent. A malformed JSON blob from step 3 crashing step 7 is a common failure mode that no framework prevents by default. Use Pydantic or Zod to validate at every boundary. For tool calls, validate both the input the LLM constructs and the output the tool returns.
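
To show the shape of a boundary check without adding a dependency, here is a standard-library sketch; in practice a Pydantic model would replace the hand-written checks. The `InvoiceFields` schema, the `parse_invoice_fields` function, and the `BoundaryValidationError` type are hypothetical.

```python
import json
from dataclasses import dataclass


class BoundaryValidationError(ValueError):
    """Raised when one step's output is not fit to hand to the next step."""


@dataclass(frozen=True)
class InvoiceFields:
    vendor: str
    total_cents: int


def parse_invoice_fields(raw: str) -> InvoiceFields:
    """Validate raw model output before any downstream agent sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BoundaryValidationError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise BoundaryValidationError("expected a JSON object")

    vendor = data.get("vendor")
    total = data.get("total_cents")
    if not isinstance(vendor, str) or not vendor:
        raise BoundaryValidationError("vendor must be a non-empty string")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(total, bool) or not isinstance(total, int) or total < 0:
        raise BoundaryValidationError("total_cents must be a non-negative integer")
    return InvoiceFields(vendor=vendor, total_cents=total)
```

Failing at the boundary means the error names the step that produced the bad output, instead of surfacing several steps later as an unrelated crash.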

Observability

Emit a trace span for every agent invocation with: model used, prompt token count, completion token count, latency, tool calls made, and whether a retry occurred. Aggregate these per workflow run so you can answer 'which step is costing the most?' and 'where are we hitting rate limits?' without grepping logs.
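
One way to picture the span record and the per-run aggregation, independent of any tracing backend, is the sketch below. `AgentSpan`, `tokens_by_step`, and `most_expensive_step` are hypothetical names, and total tokens are used as a rough proxy for cost, which is an assumption; real pricing usually differs between prompt and completion tokens.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AgentSpan:
    """One record per agent invocation, carrying the fields listed above."""

    run_id: str
    step: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    tool_calls: tuple[str, ...] = ()
    retried: bool = False


def tokens_by_step(spans: Iterable[AgentSpan], run_id: str) -> dict[str, int]:
    """Sum prompt and completion tokens per step for a single workflow run."""
    totals: dict[str, int] = defaultdict(int)
    for span in spans:
        if span.run_id == run_id:
            totals[span.step] += span.prompt_tokens + span.completion_tokens
    return dict(totals)


def most_expensive_step(spans: Iterable[AgentSpan], run_id: str) -> str:
    """Answer 'which step is costing the most?' using tokens as the cost proxy."""
    totals = tokens_by_step(spans, run_id)
    if not totals:
        raise ValueError(f"no spans recorded for run {run_id!r}")
    return max(totals, key=lambda step: totals[step])
```

The same records could be exported as OpenTelemetry span attributes; the aggregation logic stays the same wherever the data ends up.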

Tool-Calling, MCP, and Security

Model Context Protocol (MCP) is the right abstraction for external tool integration in 2025. It gives you a typed, discoverable contract between the agent and the tool, and both LangGraph and CrewAI support it. If you are building custom tools, implement them as MCP servers from the start rather than as ad-hoc function schemas. The migration cost later is real.

Security rules that apply regardless of framework:

Scope tools tightly. An agent that can only read from a specific S3 prefix is safer than one with broad read access. Express scope in the tool schema, not in a prompt instruction the LLM can ignore.
Never pass secrets through agent state. Inject credentials at the infrastructure layer. Agent state is logged, serialized, and sometimes stored. A Postgres connection string has no business being in a LangGraph channel.
Human-in-the-loop before destructive actions. Any tool call that writes, deletes, sends, or charges should have an approval gate in non-automated contexts. LangGraph has first-class interrupt/resume for this; in custom code, add a single approval function at the infrastructure layer.
Rate limit and cap costs per run. Set a maximum token budget per workflow run and hard-stop when exceeded. An LLM that retries in a loop can run up hundreds of dollars in minutes. This is an infrastructure concern, not a framework concern.
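
The per-run budget in the last rule can be as small as a counter that raises once the cap is crossed. This is a sketch under assumptions: `TokenBudget`, `BudgetExceeded`, and `run_with_budget` are hypothetical names, and usage is charged after each call returns, so a run can overshoot the cap by at most one call.

```python
from typing import Iterable


class BudgetExceeded(RuntimeError):
    """Raised when a workflow run spends more tokens than it was allotted."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record usage reported by a finished call; raise once over budget."""
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")


def run_with_budget(call_sizes: Iterable[tuple[int, int]], budget: TokenBudget) -> int:
    """Simulate a retry loop; return how many calls stayed within the budget."""
    completed = 0
    for prompt_tokens, completion_tokens in call_sizes:
        try:
            budget.charge(prompt_tokens, completion_tokens)
        except BudgetExceeded:
            break  # hard stop: no further calls are issued for this run
        completed += 1
    return completed
```

Creating one `TokenBudget` per workflow run at the infrastructure layer keeps the cap out of agent code and out of prompts.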

Retrieval and Memory: Where Teams Overcomplicate It

Most agent systems do not need a sophisticated memory architecture. They need three things:

Short-term context: the current run's state, kept in the framework channel or a plain dict, discarded at end of run.
Retrieval-augmented context: a vector store or keyword search queried at relevant steps. Pinecone, pgvector, or Elasticsearch depending on your scale. This is a tool call, not a special framework feature.
Long-term user/entity memory: structured rows in Postgres keyed to a user or session ID. Query them explicitly; do not stuff them into the system prompt unconditionally.

What teams get wrong: they reach for a 'memory module' in the framework before deciding what memory is for. Memory is a tool with a retrieval contract. Design the contract first. The storage backend is a second-order decision.
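
Treating memory as a tool with a retrieval contract might look like the following sketch. The `MemoryStore` protocol and `MemoryRecord` type are hypothetical, and `InMemoryStore` is a dict-backed stand-in for what would be Postgres rows keyed by user or session ID.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class MemoryRecord:
    entity_id: str  # user or session ID
    key: str
    value: str


class MemoryStore(Protocol):
    """The contract agents depend on; the storage backend is decided later."""

    def remember(self, record: MemoryRecord) -> None: ...

    def recall(self, entity_id: str, keys: list[str]) -> list[MemoryRecord]: ...


class InMemoryStore:
    """Dict-backed stand-in; a Postgres-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], MemoryRecord] = {}

    def remember(self, record: MemoryRecord) -> None:
        # Later writes for the same entity and key overwrite earlier ones.
        self._rows[(record.entity_id, record.key)] = record

    def recall(self, entity_id: str, keys: list[str]) -> list[MemoryRecord]:
        # Explicit query: only the requested keys for the requested entity.
        return [
            self._rows[(entity_id, key)]
            for key in keys
            if (entity_id, key) in self._rows
        ]
```

Because `recall` takes explicit keys, each step asks for the memory it needs rather than receiving everything in the system prompt.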

For retrieval, hybrid search (dense vector plus sparse BM25) consistently outperforms pure vector search on domain-specific corpora. If your RAG accuracy is below 70% on your eval set, try hybrid search before tuning prompts or switching models.
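
The text above does not say how the dense and sparse results are combined. One common option is reciprocal rank fusion (RRF), sketched below as an illustration rather than as the method used in any particular system. The function name is hypothetical, and `k = 60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from a vector index, one from BM25.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both lists ("a" and "b" here) rise above documents found by only one retriever, which is the behavior hybrid search is after.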

Frequently Asked Questions

Is LangGraph production-ready in 2025?

Yes, with caveats. LangGraph Cloud adds managed checkpointing and deployment, but the open-source version requires you to wire your own persistence backend (Redis or Postgres) for resumable runs. The framework itself is stable; the operational complexity is in the infrastructure around it, not the library.

Is CrewAI good for enterprise use?

CrewAI is well-suited for straightforward role-task pipelines and has a lower learning curve than LangGraph. For complex state machines with conditional routing, retry budgets, and human approvals, LangGraph gives you more control. Enterprise use also demands audit logging and fine-grained IAM on tool calls, which both frameworks leave to you.

When should I build a custom agent framework instead of using an existing one?

Almost never build a full framework. Build a thin orchestration layer over direct SDK calls when: the workflow is a simple DAG, your latency requirements rule out framework overhead, or you have unusual execution semantics (streaming to a UI mid-run, per-step cost accounting, or multi-tenant isolation). Keep it under 500 lines or you are reimplementing the frameworks you avoided.

How do I switch frameworks later without rewriting everything?

Isolate domain agents behind a typed interface that your application code depends on. The framework adapter implements that interface. When you swap frameworks, you rewrite the adapter, not the business logic. This takes a day, not a sprint, if you enforce the separation from the start.

What is the biggest cost driver in a multi-agent system?

Prompt bloat at each agent hop is usually the largest cost driver, not the number of agents. Each agent that receives the full history of all previous agents multiplies your input token count. Instead, pass only the structured output of the previous step, not the raw transcript. For a 5-agent pipeline this alone can cut token costs by 60-80%.
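
To make the arithmetic concrete, here is a small calculation with made-up numbers: a 5-agent pipeline, a 500-token base prompt per agent, 1,000 tokens of raw transcript produced per step, and a 300-token structured output. All four figures and both function names are assumptions chosen for illustration, and only input tokens are counted.

```python
def input_tokens_full_history(agents: int, base_prompt: int, transcript_per_step: int) -> int:
    """Each agent receives its base prompt plus the transcripts of all earlier agents."""
    return sum(base_prompt + i * transcript_per_step for i in range(agents))


def input_tokens_structured(agents: int, base_prompt: int, structured_output: int) -> int:
    """Each agent after the first receives only the previous step's structured output."""
    return agents * base_prompt + max(agents - 1, 0) * structured_output


full = input_tokens_full_history(agents=5, base_prompt=500, transcript_per_step=1000)
lean = input_tokens_structured(agents=5, base_prompt=500, structured_output=300)
savings = 1 - lean / full
print(f"full history: {full}, structured: {lean}, savings: {savings:.0%}")
```

With these inputs the full-history pipeline sends 12,500 input tokens and the structured one sends 3,700, a reduction of about 70%. The exact figure depends heavily on how long each transcript is relative to the structured output.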

Do I need a vector database for a production AI agent?

Only if retrieval is part of the workflow. Many production agent systems have no vector store at all. Start with pgvector on your existing Postgres instance. Migrate to a dedicated vector DB only when query latency or index size makes that necessary, which for most applications means more than 10 million vectors combined with a sub-100ms latency SLA.

Get Architecture Clarity Before You Pick a Framework

The teams I work with who spend weeks evaluating LangGraph versus CrewAI almost always have the same underlying problem: they have not yet defined the orchestration contract, the failure model, or the observability requirements. Those decisions take a day with the right guidance and they make the framework choice obvious or irrelevant.

If you are building an AI agent system and want to make the right structural decisions before committing to a framework, a codebase, or a cloud vendor, that is exactly what my AI architecture advisory service covers. One focused engagement saves months of painful refactoring. You can also read more about my background or see what I have shipped before reaching out.

Book an AI architecture advisory session and get the framework decision right the first time.
