Agent vs Workflow vs Chatbot: When You Actually Need Autonomy

The One-Sentence Decision Rule

If you can fully flowchart the path from input to output before running it, build a workflow, not an agent. Autonomy is the right tool only when the path itself cannot be known in advance.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software since 2010. Through Sista AI, the company I founded, I have spent a year deciding agent-versus-workflow questions for real, running autonomous agents in production where the wrong call shows up on the invoice. I spend the bulk of my client work on exactly this question, helping companies decide what shape their AI system actually needs before they burn months on the wrong architecture. You can read more about me or explore my AI agent development and technology deep-dive services.

Three Terms, Three Distinct Architectures

Before the decision framework, the vocabulary has to be precise. These three things are not on a spectrum from simple to complex. They are different architectural patterns with different cost, reliability, and maintenance profiles.

Chatbot

A stateless or session-stateful interface over an LLM. The model generates a reply. There is no tool-calling, no loop, no plan. A support FAQ bot, a documentation assistant, a lead-capture conversationalist. Cheapest to build and most predictable.

Workflow (Deterministic Automation with LLM steps)

A pre-defined sequence of steps where some steps call an LLM and some call APIs, databases, or compute. The graph is fixed. You control every branch. The LLM is used for tasks it is genuinely good at: classification, extraction, summarization, generation. Tools like n8n, Zapier, Temporal, and AWS Step Functions are workflow engines. You wire the intelligence into a known path.

Agent (Autonomous LLM-driven orchestration)

The LLM itself decides the next step at runtime. It selects tools, sequences calls, and determines when the task is done. The path is not pre-defined. The model is the orchestrator. Frameworks like LangGraph, AutoGen, CrewAI, and raw tool-calling loops implement this pattern. It is the most powerful, the most expensive per-task, and the most likely to fail in ways that are hard to predict.

The Decision Framework: Is the Path Knowable?

The single question that drives architecture: Can you draw a complete flowchart of every decision point and branch before you run the system?

Question	Yes	No
Can you enumerate every branch the system might take?	Workflow	Agent candidate
Does the system need to decide what tool to use based on prior output?	Workflow if the choice set is small and enumerable	Agent
Can you write explicit error handling for every failure mode?	Workflow	Agent with guardrails
Is the task repeatable with identical inputs producing identical outputs?	Workflow	Agent (non-determinism accepted)
Does the scope of the problem expand or contract during execution?	Workflow with conditional branches	Agent

The practical version: sit down and try to draw the flowchart. If you can finish it, ship a workflow. If you keep hitting boxes that say 'LLM decides', you have an agent. The depth and frequency of those boxes tells you how much autonomy you actually need.

When a Workflow Is the Right Answer (Most of the Time)

Workflows are underused and undervalued. They are not the 'boring' option, they are the option that works reliably at 2 a.m. when nobody is watching the system. Here are the cases where I always push teams toward a workflow.

Document processing pipelines

Extract fields from a PDF, validate them against a schema, write to a database, send a webhook. Every step is known. The LLM handles extraction and normalization, but the sequence is fixed. This runs reliably at scale with deterministic retry logic and near-zero prompt engineering drift over time.

Content moderation and classification

Inbound text arrives, the LLM classifies it into one of N categories, a downstream branch handles each category. The categories are pre-defined by the business. The LLM fills one node in a fixed graph.

Scheduled enrichment jobs

Pull records, enrich each with an LLM call (summaries, tags, embeddings), write back. Batch size is known. Failure handling is explicit. Cost is predictable because you control every LLM call.

Multi-step form or intake flows

A user provides information across multiple steps. Each step validates, transforms, or generates content. The form logic is deterministic. The LLM generates draft text or validates user input at specific nodes.

In all of these, the value of using a workflow is: you know exactly what happens, you can test each node independently, you can replace any node (swap the LLM, change the model, add a cache), and you have a clear cost model per document or per event.

When You Actually Need an Agent

Agents are the right tool in a narrow but important set of scenarios. The common thread: the task requires making decisions about its own execution that cannot be pre-programmed because the decision depends on information that only exists at runtime.

Open-ended research and synthesis

A user asks: 'Find me the five most relevant academic papers on X, extract the key claims, and flag any contradictions.' The number of searches, the relevance threshold, the decision to dig deeper on one source, and the synthesis step all depend on what the model finds. You cannot flowchart this in advance because the path depends on the retrieved content.

Code generation with verification loops

Generate code, run it, observe the output, decide whether to fix or proceed. The loop count is not known in advance. The tool calls (write file, run test, read stderr, patch file) are chosen based on intermediate results. This is genuinely agentic work.

Multi-tool orchestration over ambiguous inputs

A user uploads a spreadsheet and asks a free-form question. The agent must decide: does this need SQL, a chart, a narrative summary, or all three? Which columns are relevant? Does a follow-up query change what was already computed? The decision graph is not enumerable beforehand.

The honest cost of autonomy

Agents introduce non-determinism, higher latency (multiple LLM calls per task), higher cost (often 5x to 20x more tokens than a well-designed workflow for the same business outcome), harder observability, and novel failure modes like tool-call loops, hallucinated tool arguments, and runaway subtasks. Budget for evals, tracing (LangSmith, Langfuse, Phoenix), and human-in-the-loop checkpoints on any action that cannot be undone.

What Teams Get Wrong in Production

These are the patterns I see repeatedly when I come into an engagement where an AI system is underperforming or costing too much.

Building an agent when the path was always knowable

A team builds a LangChain agent to process invoices. After three months, the agent sometimes skips validation steps, sometimes calls the wrong tool, and costs $0.40 per invoice. The actual business logic had six deterministic steps. A Temporal workflow with LLM nodes at step 2 and step 5 would have been $0.04 per invoice, fully auditable, and testable. The autonomy added no value because the path was always fixed.

No evals, no baselines

Teams ship agents without a golden test set. Within weeks, a prompt tweak for one use case breaks another. Production is the eval suite. An eval harness (even 50 representative examples with expected outputs and LLM-as-judge scoring) catches regressions before they reach users. This applies to workflows too, but the surface area for agents is far larger.

Missing observability on tool calls

Tool calls in an agent loop are the most dangerous failure point. A tool that deletes records, sends emails, or charges a card needs a trace of every invocation: input, output, timestamp, latency, model version, and the conversation turn that triggered it. Without this, debugging a production incident is guesswork. Use OpenTelemetry spans or a dedicated LLM observability platform from day one.

No human-in-the-loop on irreversible actions

Autonomy should not extend to irreversible side effects without a confirmation gate. Any agent action that cannot be undone (send email, charge card, delete record, call external API with write access) needs either a human approval step or a strict pre-condition check built into the tool itself, not left to the model's judgment.

Prompt drift eroding reliability over months

A workflow prompt is a fixed transformation. An agent prompt is an instruction set for a planner that will be called many times across many contexts. Agent system prompts accumulate informal edits that shift behavior unpredictably. Version-control your prompts, tag them with the model they were tuned for, and re-run your eval suite on every change.

The Hybrid Pattern: Deterministic Skeleton, Agentic Cores

Most production systems that work well are not pure agents and not pure workflows. They are deterministic orchestration with agentic sub-processes at specific nodes where the path genuinely cannot be pre-defined.

A worked example: an AI-assisted hiring pipeline.

Step 1 (workflow): ingest resume PDF, extract structured fields via LLM call, validate schema, write to DB.
Step 2 (workflow): rule-based filter: does the candidate meet minimum years of experience?
Step 3 (agentic core): given the structured resume and the job description, the model calls a search tool to check public work samples, reasons about fit, and writes a structured assessment. The number of searches and the depth of reasoning are not pre-defined.
Step 4 (workflow with human-in-the-loop): assessment is staged for recruiter review before any email is sent. Human approves or edits.
Step 5 (workflow): send approved communication, log outcome, update CRM.

Steps 1, 2, 4, and 5 are deterministic. Step 3 is agentic. The outer system is auditable, cost-predictable, and testable. The autonomy is scoped to the one node where it earns its complexity.

Cost, Security, and the MCP Layer

Cost modeling

Budget by the call, not by the month. A well-designed workflow has a fixed token cost per unit of work. An agent has a distribution of costs, and the tail of that distribution (a loop that does 30 tool calls before giving up) can be very long. Set hard limits: max tool calls per task (8 to 12 is a reasonable default), max retries per tool, and a timeout that kills the run and notifies a human rather than burning tokens forever.

Tool surface and the principle of least privilege

Every tool exposed to an agent is an attack surface. Model Context Protocol (MCP) is becoming the standard for connecting agents to external systems. Whether you use MCP servers or raw function-calling, the rule is the same: give the model the narrowest possible tool set for the task at hand. A read-only database query tool and a write tool should never be the same tool. An agent that only needs to search the web should not have a tool that can send emails. Scope the tool set per agent role, not per deployment.

Prompt injection in agentic systems

Agents that read external content (web pages, documents, emails) are vulnerable to prompt injection: malicious instructions embedded in that content that redirect the agent's behavior. Defensive measures: sanitize retrieved content before feeding it into the context window, never trust external content with tool-call authority, and add a detection layer (a second LLM call that checks whether the agent's planned next action is consistent with the original user intent) on any high-stakes workflow.

Frequently Asked Questions

What is the difference between an AI agent and a workflow automation tool?

A workflow tool executes a pre-defined sequence of steps you designed. An AI agent uses an LLM to decide the sequence at runtime. The distinction matters for cost, reliability, and auditability. Workflows are deterministic; agents are probabilistic planners. Use workflows when you can enumerate the path; use agents when the path depends on intermediate results.

Do I need LangChain or AutoGen for my AI project?

Probably not. Most business automation tasks that reach me can be solved with direct LLM API calls inside a conventional application framework (a job queue, a state machine, a simple API). LangChain and AutoGen add value when you genuinely need an agent loop with dynamic tool selection. If you are using them as a convenient way to call the OpenAI API, you are adding abstraction layers that will hurt you when debugging production failures.

How much does it cost to run an AI agent vs a workflow?

A production workflow with LLM nodes typically costs $0.01 to $0.10 per unit of work, depending on model tier and token volume. An agent doing equivalent work often costs 5x to 20x more because it makes multiple planning calls, multiple tool calls, and sometimes loops. The cost difference is acceptable when the agent is solving a problem the workflow cannot. It is not acceptable when the agent is solving a problem a workflow would have handled fine.

When should I use a chatbot instead of an agent?

Use a chatbot when the user interaction is conversational and the system does not need to take actions in external systems. A chatbot answers questions, drafts text, explains concepts. Once the system needs to read from a database, call an API, write a file, or execute code, you have crossed into tool-calling territory and the architecture decision between workflow and agent applies.

What is the best way to evaluate an AI agent before shipping?

Build a golden test set of 30 to 100 representative tasks with expected tool call sequences and expected outputs. Run the agent against this set before every prompt change and every model version change. Score with a combination of exact match (for structured outputs), LLM-as-judge (for prose quality), and tool call trace comparison (for reasoning faithfulness). Ship nothing to production that regresses more than 5% on this set without deliberate sign-off.

Can I build an AI agent without a framework like LangChain?

Yes, and for many production systems it is the better choice. A tool-calling loop is fewer than 50 lines of Python: call the model with a tool list, check if the response includes a tool call, execute the tool, append the result to the conversation, call the model again, repeat until the model returns a final answer. This loop is easy to instrument, easy to debug, and has no hidden abstractions. Add a framework when the framework's features (pre-built integrations, agent memory, multi-agent routing) are features you will actually use.

Get an Expert Second Opinion Before You Commit to an Architecture

The most expensive AI project mistake I see is not choosing the wrong model or the wrong framework. It is choosing the wrong architectural pattern and building three months of infrastructure around it before discovering the mismatch. A two-hour technology deep dive can give you a concrete architectural recommendation, a cost model, and a scope-of-work definition before your team writes a line of production code. If you are already in production and the system is underperforming, I can diagnose what went wrong and give you a concrete migration path.

Browse my projects to see how I build, read more about my background, or reach out directly via the contact page. I work with a small number of clients at a time so the engagement is substantive.

Schedule a Technology Deep Dive to Get Your Architecture Right