MCP and Tool Calling Explained: How AI Agents Take Real Actions

What MCP and Tool Calling Actually Are

Tool calling is the mechanism that lets an LLM pause its text generation, declare 'I need to run a function,' pass structured arguments to your code, receive a result, and then continue reasoning with that result. The Model Context Protocol (MCP) is Anthropic's open standard that wraps that mechanism in a consistent JSON-RPC wire format so any model can talk to any tool server without custom glue code. Short version: tool calling is the capability; MCP is the USB-C connector that standardises it.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years building production software since 2010. I founded Sista AI, where for the past year tool calling has been the wiring that lets a production workforce of autonomous agents actually do things rather than just talk. I design and ship production AI agent systems for engineering teams, including tool-calling pipelines, MCP server implementations, retrieval layers, and the guardrails that keep them safe. Full background at /about. If you need this built, see my AI agent development service.

How Tool Calling Actually Works: Step by Step

The flow is simpler than most diagrams make it look. Here is the literal exchange for a calendar-booking agent:

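You send the LLM a request that lists the available tools as JSON schemas: each tool has a name, a description, and a parameters object written in JSON Schema. Most APIs take these in a dedicated tools field of the request rather than in the system prompt text.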
The model responds with a tool_use content block (Anthropic) or a tool_calls array (OpenAI-compatible). It does NOT call anything itself. It outputs a structured declaration of intent.
Your application code intercepts that declaration, validates the arguments, executes the real function (a DB query, an API call, a shell command), and appends the result as a tool_result message.
You send the full conversation back to the model. The model reads the result and continues.

Two things teams consistently get wrong here. First, they treat the model output as trusted. It is not. The model may hallucinate argument values or call a tool you listed but did not intend to expose in this context. Always validate every argument against your schema before execution. Second, they forget that every tool call is a round-trip to the model API. A chain of five sequential tool calls is five inference requests plus five function executions, plus one more request for the final answer. Latency and cost stack fast.
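
To make the round-trip concrete, here is a minimal sketch of the loop, assuming simplified block shapes loosely modelled on tool_use and tool_result. The names runAgent, Model, Handler, and scriptedModel are hypothetical, and the scripted function stands in for a real model API call so the example runs without network access.

```typescript
// Hypothetical block and message shapes, loosely modelled on the tool_use / tool_result flow.
type TextBlock = { type: 'text'; text: string };
type ToolUse = { type: 'tool_use'; id: string; name: string; input: unknown };
type ToolResult = { type: 'tool_result'; tool_use_id: string; content: string; is_error?: boolean };
type Block = TextBlock | ToolUse | ToolResult;
type Message = { role: 'user' | 'assistant'; content: Block[] };

// Stand-in for a real model API call: takes the conversation, returns content blocks.
type Model = (history: Message[]) => Block[];
// A handler validates its own input and throws if the arguments are bad.
type Handler = (input: unknown) => string;

function runAgent(model: Model, handlers: Record<string, Handler>, userText: string, maxRequests = 8) {
  const history: Message[] = [{ role: 'user', content: [{ type: 'text', text: userText }] }];
  let inferenceRequests = 0;

  while (inferenceRequests < maxRequests) {
    const reply = model(history); // one inference request
    inferenceRequests++;
    history.push({ role: 'assistant', content: reply });

    const calls = reply.filter((b): b is ToolUse => b.type === 'tool_use');
    if (calls.length === 0) break; // plain text answer: the loop is done

    // The application, not the model, executes each declared call.
    const results = calls.map((call): ToolResult => {
      const handler = Object.prototype.hasOwnProperty.call(handlers, call.name)
        ? handlers[call.name]
        : undefined;
      if (!handler) {
        return { type: 'tool_result', tool_use_id: call.id, content: `Unknown tool: ${call.name}`, is_error: true };
      }
      try {
        return { type: 'tool_result', tool_use_id: call.id, content: handler(call.input) };
      } catch (err) {
        return { type: 'tool_result', tool_use_id: call.id, content: String(err), is_error: true };
      }
    });

    history.push({ role: 'user', content: results }); // results go back to the model
  }
  return { history, inferenceRequests };
}

// Scripted fake model: asks for one tool call, then answers in text.
const scriptedModel: Model = (history) =>
  history.length === 1
    ? [{ type: 'tool_use', id: 'call_1', name: 'get_calendar_slots', input: { date: '2025-01-15', timezone: 'UTC' } }]
    : [{ type: 'text', text: 'You have a free slot at 09:00.' }];

const run = runAgent(scriptedModel, { get_calendar_slots: () => '09:00,09:30' }, 'Find me a slot');
console.log(run.inferenceRequests); // one tool call cost two inference requests
```

The maxRequests cap is a design choice rather than part of any API: it puts a hard ceiling on how many inference requests a single user turn can trigger.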

A Minimal TypeScript Example

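```typescript
// Tool definition sent in the request's tools array (Anthropic-style input_schema).
const tools = [{
  name: 'get_calendar_slots',
  description: 'Return available 30-min slots for a given date.',
  input_schema: {
    type: 'object',
    properties: {
      date: { type: 'string', format: 'date' },
      timezone: { type: 'string' }
    },
    required: ['date', 'timezone']
  }
}];

// After the model returns tool_use, you route it.
// slotSchema (a Zod schema mirroring input_schema), errorResult, and calendarService
// are application code defined elsewhere; this fragment sits inside an async handler.
if (block.type === 'tool_use' && block.name === 'get_calendar_slots') {
  const parsed = slotSchema.safeParse(block.input); // Zod validation
  // On invalid arguments, report the failure back to the model and do not execute.
  if (!parsed.success) return errorResult(block.id, parsed.error);
  const slots = await calendarService.getSlots(parsed.data);
  return { type: 'tool_result', tool_use_id: block.id, content: JSON.stringify(slots) };
}
```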

That validation step before calendarService.getSlots is non-negotiable in production.

What MCP Adds on Top of Raw Function Calling

Raw function calling works fine when you control the model, the orchestrator, and the tool implementations all in one codebase. The moment those three things live in different codebases, teams, or vendors, you have a protocol problem. Every new tool needs a custom integration. Every new model needs its own adapter. MCP solves this with three primitives:

Primitive	What It Is	Example
Tools	Functions the model can call	`search_docs`, `create_ticket`
Resources	Data the model can read (like GET endpoints)	file contents, DB rows, config objects
Prompts	Reusable prompt templates the server exposes	'summarise this resource in 3 bullets'

An MCP server is a lightweight process (Node, Python, Go, anything) that speaks JSON-RPC 2.0 over stdio or HTTP (originally HTTP with SSE; newer revisions of the spec replace that with Streamable HTTP). The client (your orchestrator or Claude Desktop) calls tools/list to discover what is available, then tools/call to invoke one. The server handles auth, rate limits, schema validation, and result formatting. The model never sees your database credentials; it only sees tool schemas and sanitised results.
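
As a rough illustration of the two calls, here is an in-memory dispatcher that answers tools/list and tools/call with JSON-RPC 2.0 envelopes. It is a sketch, not an MCP SDK: handleRpc, ServerTool, and the search_docs implementation are hypothetical, the transport is left out entirely, and argument validation against inputSchema is omitted for brevity.

```typescript
type RpcId = number | string;
type RpcRequest = { jsonrpc: '2.0'; id: RpcId; method: string; params?: Record<string, unknown> };
type RpcResponse =
  | { jsonrpc: '2.0'; id: RpcId; result: unknown }
  | { jsonrpc: '2.0'; id: RpcId; error: { code: number; message: string } };

type ServerTool = {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  run: (args: Record<string, unknown>) => string;
};

// Hypothetical registry with a single tool; run() stands in for a real search backend.
const serverTools: ServerTool[] = [{
  name: 'search_docs',
  description: 'Search internal documentation.',
  inputSchema: { type: 'object', properties: { query: { type: 'string' } }, required: ['query'] },
  run: (args) => `results for ${String(args.query)}`,
}];

function handleRpc(req: RpcRequest): RpcResponse {
  const base = { jsonrpc: '2.0' as const, id: req.id };

  if (req.method === 'tools/list') {
    // Discovery: expose names and schemas only, never implementations or credentials.
    const tools = serverTools.map(({ name, description, inputSchema }) => ({ name, description, inputSchema }));
    return { ...base, result: { tools } };
  }

  if (req.method === 'tools/call') {
    const params = req.params ?? {};
    const tool = serverTools.filter((t) => t.name === params.name)[0];
    if (!tool) {
      return { ...base, error: { code: -32602, message: `Unknown tool: ${String(params.name)}` } };
    }
    const args = (params.arguments ?? {}) as Record<string, unknown>;
    try {
      return { ...base, result: { content: [{ type: 'text', text: tool.run(args) }], isError: false } };
    } catch (err) {
      // Tool failures are reported inside the result so the model can see and react to them.
      return { ...base, result: { content: [{ type: 'text', text: String(err) }], isError: true } };
    }
  }

  return { ...base, error: { code: -32601, message: `Method not found: ${req.method}` } };
}

console.log(JSON.stringify(handleRpc({ jsonrpc: '2.0', id: 1, method: 'tools/list' })));
```

In a real server you would normally use an MCP SDK for the transport and lifecycle messages; the point here is only that discovery and invocation are two small, uniform request shapes.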

This is the 'USB-C' analogy made concrete: write one MCP server for your internal Jira, and every MCP-capable client (Claude Desktop, Cursor, Zed, or your own orchestrator in front of an OpenAI-compatible model) can use it without per-integration adapter code.

When MCP Is Worth It (and When It Is Overkill)

I get asked this on almost every engagement. The honest answer: MCP earns its complexity at a specific threshold. Here is how I frame the decision.

Use a plain function-calling endpoint when:

You have one model, one orchestrator, and fewer than five tools, all in the same repo.
The tools are tightly coupled to your business logic and will never be reused outside this agent.
You are in early prototype phase and iteration speed matters more than standardisation.
Your team has no existing MCP tooling and the learning curve would slow a 2-week sprint.

Adopt MCP when:

Multiple models or agent frameworks need the same tools (Claude + OpenAI fallback, Cursor + your internal orchestrator).
You are building an internal platform where different teams publish tools and different teams consume them.
You want discovery at runtime: the orchestrator calls tools/list and adapts its behaviour based on what the server exposes, without a redeploy.
Security isolation matters: the MCP server process boundary means the orchestrator cannot accidentally access credentials it was not meant to see.
You need to version and deprecate tool APIs independently of your model prompt.

A real example from a project: a team had Claude booking meetings, querying a CRM, and filing support tickets. All three lived in separate micro-services owned by separate teams. Building one MCP server per micro-service (three small Node processes) and a single MCP-aware orchestrator was cleaner than the alternative: three bespoke function-call handlers in the orchestrator, each with its own auth wiring. The MCP route added a week of setup and saved months of maintenance.

Security and Guardrails: What You Cannot Skip

Tool calling is where AI systems go from 'interesting demo' to 'production liability.' The attack surface is real. Here is what I enforce on every system I ship:

Input validation at the tool boundary

Validate every argument the model passes before it touches your infrastructure. Use a schema library (Zod, Pydantic, etc.). Reject and return an error result, never silently coerce. A model that has been injected with adversarial content in a retrieved document can try to pass ../../../etc/passwd as a file path argument. Your tool handler must catch that before the filesystem call.
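
A dependency-free sketch of that check for a file-path argument follows. In practice a schema library does the shape validation; validateReadFileArgs and the Validation type are hypothetical names, and the rules here (relative paths only, no '..' segments, no backslashes or null bytes) are an assumed policy, not a complete one.

```typescript
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

// Validates the arguments of a hypothetical read_file tool before any filesystem access.
function validateReadFileArgs(input: unknown): Validation<{ path: string }> {
  if (typeof input !== 'object' || input === null) {
    return { ok: false, error: 'arguments must be an object' };
  }
  const path = (input as Record<string, unknown>).path;
  if (typeof path !== 'string' || path.length === 0) {
    return { ok: false, error: 'path must be a non-empty string' };
  }
  if (path.indexOf('\0') !== -1 || path.indexOf('\\') !== -1) {
    return { ok: false, error: 'path contains forbidden characters' };
  }
  if (path.charAt(0) === '/') {
    return { ok: false, error: 'absolute paths are not allowed' };
  }
  // Reject any parent-directory segment, wherever it appears in the path.
  if (path.split('/').some((segment) => segment === '..')) {
    return { ok: false, error: 'path traversal is not allowed' };
  }
  return { ok: true, value: { path } };
}

const attempt = validateReadFileArgs({ path: '../../../etc/passwd' });
console.log(attempt.ok); // false: rejected before the filesystem call
```

The function rejects rather than rewrites the path, so the model receives an explicit error result instead of a silently corrected argument.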

Tool-level permission scoping

Never give an agent access to tools it does not need for the current task. Build a context-aware tool registry that exposes only the relevant subset. An agent answering customer questions should not have access to delete_user_account even if that tool exists in your system.
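
One way to sketch such a registry is to tag each tool with the contexts allowed to see it and filter before building the request. The scope names and the toolsForScope helper below are hypothetical.

```typescript
type ToolDef = { name: string; description: string; scopes: string[] };

// Hypothetical full registry; scope names are illustrative.
const allTools: ToolDef[] = [
  { name: 'search_docs', description: 'Search the help centre.', scopes: ['support', 'admin'] },
  { name: 'create_ticket', description: 'File a support ticket.', scopes: ['support', 'admin'] },
  { name: 'delete_user_account', description: 'Permanently delete an account.', scopes: ['admin'] },
];

// Returns only the tools the given scope may see; unknown scopes get nothing.
function toolsForScope(scope: string): ToolDef[] {
  return allTools.filter((tool) => tool.scopes.indexOf(scope) !== -1);
}

// Build the tool list for a customer-support request.
const supportTools = toolsForScope('support').map((tool) => tool.name);
console.log(supportTools); // [ 'search_docs', 'create_ticket' ]
```

Defaulting to an empty list for unknown scopes keeps the failure mode safe: a misconfigured agent gets no tools rather than all of them.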

Human-in-the-loop checkpoints

For any irreversible action (sending an email, charging a card, deleting a record), require an explicit confirmation step before the tool executes. Do not let the agent chain through it autonomously. I implement this as a confirm_action tool that surfaces a structured payload to the UI and waits for a human approval event before proceeding.
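
The gating logic can be sketched as below. This is an illustrative shape, not the exact implementation described above: ApprovalGate, IRREVERSIBLE_TOOLS, and the id format are hypothetical, and the UI and event transport are left out.

```typescript
type Args = Record<string, unknown>;
type PendingAction = { id: string; tool: string; args: Args };
type Outcome =
  | { status: 'executed'; result: string }
  | { status: 'awaiting_approval'; pending: PendingAction };

// Hypothetical list of tools that must never run without a human decision.
const IRREVERSIBLE_TOOLS = ['send_email', 'charge_card', 'delete_record'];

class ApprovalGate {
  // Prototype-free object so ids such as "toString" cannot match inherited keys.
  private pending: Record<string, PendingAction> = Object.create(null);
  private nextId = 1;
  private execute: (tool: string, args: Args) => string;

  constructor(execute: (tool: string, args: Args) => string) {
    this.execute = execute;
  }

  // Called when the model requests a tool. Irreversible tools are parked, not run.
  request(tool: string, args: Args): Outcome {
    if (IRREVERSIBLE_TOOLS.indexOf(tool) === -1) {
      return { status: 'executed', result: this.execute(tool, args) };
    }
    const action: PendingAction = { id: `approval_${this.nextId++}`, tool, args };
    this.pending[action.id] = action;
    return { status: 'awaiting_approval', pending: action }; // surface this payload to the UI
  }

  // Called when a human approves. Each approval can be used exactly once.
  approve(id: string): string {
    const action = this.pending[id];
    if (!action) throw new Error(`No pending action with id ${id}`);
    delete this.pending[id];
    return this.execute(action.tool, action.args);
  }
}
```

The arguments are frozen at request time, so the human approves the exact payload that will run, not whatever the model proposes later.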

Rate limiting and circuit breakers

Wrap every external API tool call in a circuit breaker. A runaway agent loop (model keeps calling the same tool expecting a different result) can exhaust your API quota or trigger downstream rate limits within minutes.
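
Below is a minimal circuit-breaker sketch, kept synchronous for brevity even though real API calls are asynchronous. The class name, thresholds, and injectable clock are assumptions made for illustration.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private maxFailures: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(maxFailures: number, cooldownMs: number, now: () => number = Date.now) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  call<T>(fn: () => T): T {
    // Open: reject immediately until the cooldown has elapsed.
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open: call rejected without reaching the API');
    }
    const trial = this.openedAt !== null; // half-open: allow one trial call
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      // Trip on a failed trial call or after too many consecutive failures.
      if (trial || this.failures >= this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}

// Usage: wrap the external call inside the tool handler.
const crmBreaker = new CircuitBreaker(3, 30000);
console.log(crmBreaker.call(() => 'crm response'));
```

A breaker like this protects the downstream API from repeated failures; a loop of successful but pointless calls still needs a separate per-run call budget.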

Prompt injection via tool results

Tool results re-enter the context window. If an attacker can control the content returned by a tool (a search result, a fetched webpage, a DB record), they can inject instructions. Sanitise tool results before returning them to the model. Strip HTML, truncate to reasonable lengths, and never concatenate raw user-controlled strings directly into a result object.
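
A naive sketch of that sanitisation step is shown below. The function names and the 2,000-character default are assumptions, and regex-based tag stripping is a simplification: it reduces the attack surface but does not neutralise instructions written in plain text.

```typescript
// Drops script/style bodies, strips remaining tags, collapses whitespace, then truncates.
function sanitiseToolResult(raw: string, maxChars = 2000): string {
  const text = raw
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return text.length > maxChars ? text.slice(0, maxChars) + ' [truncated]' : text;
}

// Wraps the cleaned text in a structured payload so it is labelled as data from a named source.
function toToolResult(toolUseId: string, source: string, raw: string) {
  return {
    type: 'tool_result' as const,
    tool_use_id: toolUseId,
    content: JSON.stringify({ source, text: sanitiseToolResult(raw) }),
  };
}

console.log(sanitiseToolResult('<p>Hello <b>world</b></p><script>alert(1)</script>')); // "Hello world"
```

Serialising through JSON.stringify, instead of concatenating strings, means quotes and braces in the fetched content cannot break out of the text field.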

Observability: You Cannot Debug What You Cannot See

An agent that fails silently or returns wrong answers is worse than one that throws a clear error. These are the three observability layers I add to every tool-calling pipeline in production:

Structured tool call logging

Log every tool invocation as a structured event: timestamp, tool name, input arguments (after PII scrubbing), result summary, latency, and the trace ID that links it back to the parent conversation. This lets you replay any agent run and see exactly where it went wrong.
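
The event described above could be sketched like this. The field names, the PII_KEYS list, and the loggedCall wrapper are hypothetical; the sink would be your logger or tracing backend, and the call is synchronous for brevity.

```typescript
type ToolCallEvent = {
  timestamp: string;
  traceId: string;
  tool: string;
  args: Record<string, unknown>;
  resultSummary: string;
  latencyMs: number;
  ok: boolean;
};

// Hypothetical list of argument keys treated as PII.
const PII_KEYS = ['email', 'phone', 'name'];

function scrubArgs(args: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    clean[key] = PII_KEYS.indexOf(key) !== -1 ? '[redacted]' : args[key];
  }
  return clean;
}

// Runs a tool and emits exactly one structured event, whether it succeeds or throws.
function loggedCall(
  traceId: string,
  tool: string,
  args: Record<string, unknown>,
  run: (args: Record<string, unknown>) => string,
  sink: (event: ToolCallEvent) => void,
  now: () => number = Date.now,
): string {
  const started = now();
  let ok = true;
  let summary = '';
  try {
    const result = run(args);
    summary = result.slice(0, 200); // log a summary, not the full payload
    return result;
  } catch (err) {
    ok = false;
    summary = String(err);
    throw err;
  } finally {
    sink({
      timestamp: new Date(started).toISOString(),
      traceId,
      tool,
      args: scrubArgs(args),
      resultSummary: summary,
      latencyMs: now() - started,
      ok,
    });
  }
}
```

Emitting from the finally block guarantees a log line for failed calls too, which are usually the runs you most need to replay.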

Evals on tool routing

Build a small eval set (30 to 50 golden examples) that tests whether the model routes to the correct tool given a user intent. Run this on every prompt change. I have seen prompt tweaks improve answer quality while silently breaking tool selection. Without evals you will not catch that for weeks.
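
A routing eval can be very small. In the sketch below, Router stands in for "send the prompt to the model and read back the tool it chose"; the keyword router, the golden cases, and the tool names are hypothetical and exist only to make the example runnable.

```typescript
type RoutingCase = { prompt: string; expectedTool: string };
// Stand-in for a real model call that returns the name of the chosen tool, or null for none.
type Router = (prompt: string) => string | null;

function runRoutingEval(cases: RoutingCase[], route: Router) {
  const failures: Array<RoutingCase & { actual: string | null }> = [];
  for (const testCase of cases) {
    const actual = route(testCase.prompt);
    if (actual !== testCase.expectedTool) failures.push({ ...testCase, actual });
  }
  const accuracy = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { accuracy, failures };
}

const goldenCases: RoutingCase[] = [
  { prompt: 'Book a meeting with Dana tomorrow', expectedTool: 'get_calendar_slots' },
  { prompt: 'File a ticket for the login bug', expectedTool: 'create_ticket' },
  { prompt: 'What plan is the Acme account on?', expectedTool: 'query_crm' },
  { prompt: 'Is Dana free Friday afternoon?', expectedTool: 'get_calendar_slots' },
];

// Toy keyword router used in place of a model.
const keywordRouter: Router = (prompt) => {
  if (/meeting|slot/i.test(prompt)) return 'get_calendar_slots';
  if (/ticket|bug/i.test(prompt)) return 'create_ticket';
  if (/customer|account/i.test(prompt)) return 'query_crm';
  return null;
};

const report = runRoutingEval(goldenCases, keywordRouter);
console.log(report.accuracy, report.failures.map((f) => f.prompt));
```

Returning the failing cases alongside the accuracy number matters more than the number itself: the failures tell you which intents a prompt change broke.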

Cost tracking per tool

Each tool call adds tokens (the result goes back into the context) and may have its own API cost. Track token usage per tool type. You will almost always find one tool that is called far more than expected and whose results are unusually long, driving 40-60% of your inference cost. Truncation or summarisation on that one tool often cuts costs significantly without hurting quality.
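
Finding that one expensive tool is a simple aggregation over the call log. The sketch below assumes each logged call records how many tokens its result added to the context; the ToolUsage shape and tokenShareByTool are hypothetical names.

```typescript
type ToolUsage = { tool: string; resultTokens: number };

// Sums result tokens per tool and reports each tool's share of the total, largest first.
function tokenShareByTool(events: ToolUsage[]) {
  // Prototype-free object so tool names cannot collide with inherited keys.
  const totals: Record<string, number> = Object.create(null);
  let grandTotal = 0;
  for (const event of events) {
    totals[event.tool] = (totals[event.tool] ?? 0) + event.resultTokens;
    grandTotal += event.resultTokens;
  }
  return Object.keys(totals)
    .map((tool) => {
      const tokens = totals[tool] ?? 0;
      return { tool, tokens, share: grandTotal === 0 ? 0 : tokens / grandTotal };
    })
    .sort((a, b) => b.tokens - a.tokens);
}

const shares = tokenShareByTool([
  { tool: 'search_knowledge_base', resultTokens: 3000 },
  { tool: 'get_calendar_slots', resultTokens: 2000 },
  { tool: 'search_knowledge_base', resultTokens: 3000 },
  { tool: 'create_ticket', resultTokens: 2000 },
]);
console.log(shares);
```

With this sample data the retrieval tool accounts for 60% of result tokens, which is the kind of skew that makes it the first candidate for truncation or summarisation.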

Retrieval as a Tool: RAG Inside an MCP Server

One of the most common patterns I build is a retrieval tool backed by a vector database, wrapped as an MCP server. The model calls search_knowledge_base with a query string, the MCP server embeds the query, hits Pinecone or pgvector, and returns the top-k chunks as a structured result. The model then synthesises an answer from those chunks.

Why MCP here specifically? Because the same retrieval server gets reused across a customer-facing agent, an internal Slack bot, and a Cursor extension inside the engineering IDE. One server, three consumers, zero custom glue per consumer.

Three things that matter in production retrieval tools:

Return source metadata with every chunk: document ID, title, section, last-modified date. The model will cite sources if you give it the data.
Cap result token length: a top-5 retrieval that returns 10,000 tokens per chunk adds 50,000 tokens to the context on every call. Chunk at 400-600 tokens at index time, not at query time.
Re-rank before returning: cosine similarity is a good first pass, but running a cross-encoder re-ranker over the top 20 and returning the top 5 consistently improves answer quality in evals.
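
The three points above can be combined in one selection step, sketched here under some assumptions: rerankScore stands in for a real cross-encoder, CHARS_PER_TOKEN is a rough estimate rather than a tokenizer, and selectChunks and the Chunk shape are hypothetical.

```typescript
type Chunk = {
  documentId: string;
  title: string;
  section: string;
  lastModified: string;
  text: string;
  similarity: number; // score from the vector search
};

// Rough assumption of about 4 characters per token; not a real tokenizer.
const CHARS_PER_TOKEN = 4;

function selectChunks(
  candidates: Chunk[],
  rerankScore: (query: string, chunk: Chunk) => number, // stand-in for a cross-encoder
  query: string,
  opts = { poolSize: 20, topK: 5, maxTokensPerChunk: 600 },
) {
  return candidates
    .slice()
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, opts.poolSize) // first pass: top pool by vector similarity
    .map((chunk) => ({ chunk, score: rerankScore(query, chunk) }))
    .sort((a, b) => b.score - a.score) // second pass: re-rank the pool
    .slice(0, opts.topK)
    .map(({ chunk, score }) => ({
      // Source metadata travels with every chunk so the model can cite it.
      documentId: chunk.documentId,
      title: chunk.title,
      section: chunk.section,
      lastModified: chunk.lastModified,
      // Safety cap only; proper chunking belongs at index time.
      text: chunk.text.slice(0, opts.maxTokensPerChunk * CHARS_PER_TOKEN),
      score,
    }));
}
```

The character cap is a backstop for oversized chunks that slipped through indexing, not a substitute for chunking at index time.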

Frequently Asked Questions

What is the difference between tool calling and function calling?

They are the same capability with different names. OpenAI introduced the term 'function calling' in 2023. Anthropic uses 'tool use.' The broader industry is converging on 'tool calling.' MCP standardises the protocol layer on top of whichever term your API uses.

Does MCP work with OpenAI models or only Claude?

MCP is an open protocol, not a Claude-only feature. There are MCP clients for OpenAI-compatible APIs, and several orchestration frameworks (LangChain, AutoGen, Rivet) have MCP adapters. Claude Desktop and the Claude Code CLI have native MCP support. For raw OpenAI API calls you can translate MCP tool definitions to OpenAI's function schema format with a thin adapter layer.

How do I prevent an AI agent from calling a tool it should not?

Three mechanisms, strongest first: (1) Do not include the tool in the tools array for that request. The model cannot call what it cannot see. (2) Check permissions in the tool handler and return an error result if the caller context is not allowed to use the tool. (3) Add a system-prompt instruction like 'only call tools that are directly required to answer the current user request.' Mechanisms 1 and 2 are enforced in code, so they are the ones to rely on for security: use 1 as the primary control and 2 as defence in depth. Mechanism 3 is only a hint to the model and should never be the sole safeguard.

What is the token cost of tool calling?

Tool schemas are injected into your input tokens on every request. A typical tool schema runs 100 to 300 tokens. If you have 20 tools and send them on every call, that is 2,000 to 6,000 extra input tokens per request. At roughly $3 per million input tokens (Claude Sonnet list pricing at the time of writing), that is about 0.6 to 1.8 cents per request, or $6 to $18 per 1,000 calls, before any prompt caching, and it adds up at scale. The fix: dynamically select only relevant tools per request rather than sending the full registry every time.
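
As a quick check on that arithmetic, here is a small calculator. The $3 per million input tokens is an assumed price for illustration; substitute your model's current rate. The function name is hypothetical.

```typescript
// Extra input cost, in USD, of resending tool schemas on every request.
function schemaOverheadUsd(
  toolCount: number,
  tokensPerSchema: number,
  usdPerMillionInputTokens: number,
  requests: number,
): number {
  const overheadTokens = toolCount * tokensPerSchema * requests;
  return (overheadTokens / 1e6) * usdPerMillionInputTokens;
}

// 20 tools at 100 to 300 tokens each, over 1,000 requests, at an assumed $3 per million input tokens.
const lowEstimate = schemaOverheadUsd(20, 100, 3, 1000);
const highEstimate = schemaOverheadUsd(20, 300, 3, 1000);
console.log(lowEstimate, highEstimate); // 6 18

// Sending only 4 relevant tools per request instead of 20.
console.log(schemaOverheadUsd(4, 300, 3, 1000)); // 3.6
```

The last line shows why per-request tool selection pays off: the overhead scales linearly with the number of schemas you send.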

When should I use MCP versus building a custom REST endpoint?

Use a custom REST endpoint when you have one model consumer and tight coupling is fine. Use MCP when you have multiple model consumers, need runtime tool discovery, want the security of process isolation, or are building a platform where other teams will publish and consume tools independently.

What is a good way to test tool-calling pipelines?

Three layers: unit tests that mock the model response and verify your tool handler validates and executes correctly; integration tests that use a real model call with a deterministic input and assert the correct tool was selected; and an eval suite of 30 to 50 golden examples that you run on every prompt or schema change. The eval suite is the most valuable and the most skipped.

Build Tool-Calling Systems That Are Production-Ready From Day One

Tool calling and MCP are not hard to prototype. They are hard to get right at production quality, with proper validation, security boundaries, cost control, and observability. Most teams underestimate the guardrail layer and ship something that works in demos but fails in production on edge-case inputs or adversarial content. If you are building an AI agent that needs to take real actions in your systems and you want it done correctly, that is exactly what I do. See my AI agent development service for how I approach these engagements, or contact me directly to talk through your specific system. Work with me on your AI agent architecture.
