LLM Observability and Tracing: What to Log Before Production

How to Monitor, Trace, and Debug an LLM Application in Production

The short answer: instrument every step of the pipeline as a named span, capture inputs and outputs at each boundary, attach cost and latency to every model call, and route anomalies into an eval loop you can query later. That combination turns 'the AI said something weird on Thursday' from a five-day archaeology project into a five-minute lookup.

I am Mahmoud Zalt, an independent senior AI systems architect with 16+ years of production software experience since 2010. At Sista AI, which I co-founded, a year of running autonomous agents in production taught me that you cannot fix what you cannot trace, which is why observability is non-negotiable. Most of the LLM systems I help teams through as an AI architecture advisor have the same blind spot: they ship with logging that would be acceptable for a REST API but is completely inadequate for a non-deterministic, multi-step, retrieval-augmented pipeline. This article gives you the observability blueprint I apply on day one of any engagement.

Why LLM Observability Is Fundamentally Different

A conventional API call has one input, one output, one latency number, one status code. An LLM pipeline has a chain of non-deterministic steps: query rewriting, vector retrieval, prompt assembly, model inference, tool calls, output parsing, and sometimes recursive agent loops. A failure anywhere looks the same to the end user: a bad response. Without traces you cannot tell whether the retriever surfaced the wrong chunks, the prompt exceeded the context window, the model hallucinated a tool argument, or the output parser silently swallowed an error.

There is also the cost dimension. GPT-4o at $5 per million input tokens means a single misconfigured retriever that stuffs 20 unnecessary chunks into every prompt can cost you $3,000 a month in wasted context. You will not see that without per-call token accounting.
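To make that figure concrete, here is a back-of-envelope calculation. The chunk size and traffic volume are assumptions chosen for illustration, not measurements; only the $5 per million input tokens and the 20 extra chunks come from the paragraph above.

```python
# Back-of-envelope cost of wasted context.
# TOKENS_PER_CHUNK and REQUESTS_PER_DAY are illustrative assumptions.
PRICE_PER_MILLION_INPUT_USD = 5.0
EXTRA_CHUNKS_PER_PROMPT = 20
TOKENS_PER_CHUNK = 500      # assumed average chunk size
REQUESTS_PER_DAY = 2_000    # assumed traffic
DAYS_PER_MONTH = 30

wasted_tokens_per_month = (
    EXTRA_CHUNKS_PER_PROMPT * TOKENS_PER_CHUNK * REQUESTS_PER_DAY * DAYS_PER_MONTH
)
wasted_usd_per_month = (
    wasted_tokens_per_month * PRICE_PER_MILLION_INPUT_USD / 1_000_000
)

print(f"{wasted_tokens_per_month:,} wasted tokens -> ${wasted_usd_per_month:,.0f} per month")
```

Under these assumptions the waste is 600 million input tokens a month. Swap in your own chunk size and request volume to see where your pipeline lands.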

The Three Classes of LLM Failure

Retrieval failures: wrong chunks, stale embeddings, poor reranking, missing metadata filters. These produce confidently wrong answers that pass all static tests.
Model failures: hallucination, instruction drift, context-length truncation, temperature instability. Often intermittent and hard to reproduce without the exact prompt that triggered them.
Integration failures: tool-call argument errors, MCP server timeouts, output-parser mismatches, downstream API failures silently absorbed by catch blocks.

Each class needs different signals. A single 'LLM call succeeded' log line catches none of them.

The Trace Schema: What to Capture Per Span

Structure every pipeline run as a root trace with child spans. One root trace per user request. Each meaningful operation is its own span with a start timestamp, end timestamp, and typed payload. Here is the exact field set I require on every project.

Root Trace Fields

trace_id: UUID, propagated through every span and into every downstream service call.
session_id: groups traces for a single conversation or workflow run.
user_id: pseudonymized. Never log raw PII; hash or tokenize at ingest.
pipeline_version: the deployed git SHA or semantic version of your pipeline code. Critical for before-and-after regression analysis.
environment: prod, staging, shadow.
total_latency_ms: wall-clock time from request received to response sent.
total_cost_usd: sum of all model-call costs in the trace, calculated at ingest from token counts times current price schedule.
outcome: success, error, partial, or fallback.
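As a minimal sketch, the root trace fields above could be modeled like this. The class name, the `pseudonymize` helper, and the `PSEUDONYM_KEY` placeholder are hypothetical names introduced for illustration, and the keyed hash is just one possible way to pseudonymize a user ID.

```python
import hashlib
import hmac
import uuid
from dataclasses import asdict, dataclass, field

# Hypothetical key for illustration; load a real one from your secret store.
PSEUDONYM_KEY = b"replace-me"


def pseudonymize(user_id: str) -> str:
    """Keyed hash so the raw user ID never reaches the trace store."""
    return hmac.new(PSEUDONYM_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class RootTrace:
    session_id: str
    user_id: str              # already pseudonymized
    pipeline_version: str     # git SHA or semantic version
    environment: str          # "prod", "staging", or "shadow"
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    total_latency_ms: float = 0.0
    total_cost_usd: float = 0.0
    outcome: str = "success"  # "success", "error", "partial", or "fallback"

    def __post_init__(self) -> None:
        # Reject values outside the enumerations listed in the schema.
        if self.environment not in {"prod", "staging", "shadow"}:
            raise ValueError(f"unknown environment: {self.environment}")
        if self.outcome not in {"success", "error", "partial", "fallback"}:
            raise ValueError(f"unknown outcome: {self.outcome}")


trace = RootTrace(
    session_id="sess-123",
    user_id=pseudonymize("alice@example.com"),
    pipeline_version="1.4.2",
    environment="prod",
)
record = asdict(trace)  # plain dict, ready to serialize as a structured log line
```

Validating the enumerated fields at construction time keeps typos like "production" from fragmenting your dashboards later.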

Retrieval Span Fields

query_text: the exact string sent to the vector store, after any rewriting.
query_embedding_model: model name and version used to embed the query.
top_k_requested and top_k_returned: if these diverge, your index has gaps.
chunks: array of objects, each with chunk_id, source_doc_id, score, and text_length. Log scores, not just IDs. You need score distributions to tune thresholds.
reranker_applied: boolean plus reranker model name if true.
latency_ms.

Model Call Span Fields

model: full model ID, e.g. gpt-4o-2024-08-06, not just 'gpt-4o'. Model point-releases change behavior.
prompt_tokens, completion_tokens, cached_tokens: taken from the usage data in the provider's API response, not estimated.
cost_usd: computed at ingest.
latency_ms and time_to_first_token_ms: TTFT matters for streaming UX.
temperature, max_tokens, top_p: the parameters actually sent, not the defaults you think you set.
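finish_reason: stop, length, tool_calls, content_filter. A spike in length means completions are being cut off at max_tokens or at the context-window limit. A spike in content_filter means inputs or outputs are tripping the provider's safety filter, which usually calls for a prompt rewrite.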
system_prompt_hash: SHA-256 of the system prompt. Log the hash, store the full text in a versioned prompt registry. This keeps spans small while letting you reconstruct any prompt exactly.
user_message: the assembled user turn, truncated to 2,000 chars for the span record. Store full text in cold storage keyed by trace ID.
assistant_message: same truncation strategy.
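The hashing and truncation rules above might look like the sketch below. The function names and the shape of the `usage` and `params` dicts are assumptions for illustration; map them to whatever your provider SDK actually returns.

```python
import hashlib

SPAN_TEXT_LIMIT = 2_000  # characters kept on the span record


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_model_span(*, model, system_prompt, user_message, assistant_message,
                     usage, finish_reason, params, latency_ms):
    """Build a model-call span record.

    `usage` and `params` are plain dicts copied from the response and the
    request; their exact keys are an assumption of this sketch.
    """
    return {
        "model": model,
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "cached_tokens": usage.get("cached_tokens", 0),
        "latency_ms": latency_ms,
        "temperature": params.get("temperature"),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
        "finish_reason": finish_reason,
        # Hash only; the full prompt text lives in the prompt registry.
        "system_prompt_hash": sha256_hex(system_prompt),
        # Truncated copies; full text goes to cold storage keyed by trace ID.
        "user_message": user_message[:SPAN_TEXT_LIMIT],
        "assistant_message": assistant_message[:SPAN_TEXT_LIMIT],
    }


span = build_model_span(
    model="gpt-4o-2024-08-06",
    system_prompt="You are a support bot.",
    user_message="x" * 5_000,
    assistant_message="Sure, here is how to reset your password.",
    usage={"prompt_tokens": 1_300, "completion_tokens": 40},
    finish_reason="stop",
    params={"temperature": 0.2, "max_tokens": 512},
    latency_ms=910,
)
```

Passing `params` from the request object you actually sent, rather than from configuration, is what makes the logged temperature and max_tokens trustworthy.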

Tool Call Span Fields

Each tool call (including MCP tool calls) gets its own child span inside the model call span that triggered it.

tool_name and tool_version.
arguments: the exact JSON the model generated. This is where hallucinated arguments show up.
result_summary: truncated result or error message.
latency_ms.
success: boolean. Track tool failure rates per tool per pipeline version.
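Tracking failure rates per tool per pipeline version can be sketched as a simple aggregation over span records. This assumes `pipeline_version` has been copied from the root trace onto each tool span, which is a convenience of this example and not a requirement of the schema.

```python
from collections import defaultdict


def tool_failure_rates(spans):
    """Return {(tool_name, pipeline_version): failure_rate} for tool-call spans."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for span in spans:
        key = (span["tool_name"], span["pipeline_version"])
        counts[key][1] += 1
        if not span["success"]:
            counts[key][0] += 1
    return {key: failures / total for key, (failures, total) in counts.items()}


spans = [
    {"tool_name": "search_orders", "pipeline_version": "1.4.2", "success": True},
    {"tool_name": "search_orders", "pipeline_version": "1.4.2", "success": False},
    {"tool_name": "search_orders", "pipeline_version": "1.5.0", "success": True},
    {"tool_name": "send_email", "pipeline_version": "1.5.0", "success": False},
]
rates = tool_failure_rates(spans)
```

Comparing the same tool across two versions is the quickest way to spot a prompt change that started producing malformed arguments.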

A Worked Example: From Mystery to Root Cause in 5 Minutes

A team ships a RAG-based customer support bot. After a week in production, users report it sometimes answers questions about the wrong product. Without tracing, the debugging process is: read complaint tickets, reproduce manually, add print statements, redeploy, wait for recurrence. Three to five days, minimum.

With proper tracing, the query looks like this:

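-- Sketch in Postgres syntax. The retrieval_spans and model_call_spans tables
-- are hypothetical names; chunks is assumed to be a JSONB array ordered by rank.
SELECT t.trace_id,
       r.query_text,
       (r.chunks->0->>'score')::float AS top_score,
       (r.chunks->1->>'score')::float AS second_score,
       m.finish_reason
FROM traces t
JOIN retrieval_spans r ON r.trace_id = t.trace_id
JOIN model_call_spans m ON m.trace_id = t.trace_id
WHERE t.outcome = 'success'
  AND m.finish_reason = 'stop'
  AND (r.chunks->0->>'score')::float < 0.72
  AND t.created_at >= DATE '2026-06-13'
ORDER BY top_score ASC
LIMIT 50;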

The result: 340 traces in total had a top retrieval score below 0.72 (the query returns the 50 lowest-scoring for inspection). In 80% of those, the second chunk came from a different product line. Root cause: the query rewriter was stripping product-name tokens from ambiguous short queries. Fix: add product context to the rewriter prompt. Deploys in two hours. The traces prove it worked: average top-chunk score rose from 0.68 to 0.81 the following day.

None of that is possible if you only log 'retrieval completed' with a count.

Cost and Latency Budgets: Numbers That Matter

Observability without thresholds is just a data lake. Define budgets per pipeline tier and alert when you breach them.

Pipeline Type	P50 Latency	P99 Latency	Cost Per Call Target
Simple Q&A (RAG, single retrieval)	800ms	2,500ms	$0.003
Multi-step agent (2-4 tool calls)	3,000ms	8,000ms	$0.02
Document analysis (long context)	5,000ms	15,000ms	$0.15
Autonomous agent loop (>4 steps)	10,000ms	30,000ms	$0.50

Track cost at the per-user, per-pipeline, and per-deployment-version level. A new prompt version that raises average cost by 15% without a measurable quality improvement is a regression, even if it 'feels better' in manual testing.
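A budget check over a window of traces could be sketched as follows. The numbers are copied from the table above; the dictionary keys, the nearest-rank percentile, and the use of mean cost per call are choices made for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    p50_latency_ms: float
    p99_latency_ms: float
    cost_per_call_usd: float


# Values from the budget table; the key names are illustrative.
BUDGETS = {
    "simple_qa": Budget(800, 2_500, 0.003),
    "multi_step_agent": Budget(3_000, 8_000, 0.02),
    "document_analysis": Budget(5_000, 15_000, 0.15),
    "autonomous_agent_loop": Budget(10_000, 30_000, 0.50),
}


def percentile(values, pct):
    """Nearest-rank percentile; adequate for a sketch, coarse on tiny samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def budget_breaches(pipeline, latencies_ms, costs_usd):
    """Return the names of the budget dimensions this window of traces exceeds."""
    budget = BUDGETS[pipeline]
    breaches = []
    if percentile(latencies_ms, 50) > budget.p50_latency_ms:
        breaches.append("p50_latency")
    if percentile(latencies_ms, 99) > budget.p99_latency_ms:
        breaches.append("p99_latency")
    if sum(costs_usd) / len(costs_usd) > budget.cost_per_call_usd:
        breaches.append("cost_per_call")
    return breaches


breaches = budget_breaches(
    "simple_qa",
    latencies_ms=[500, 600, 700, 900, 3_000],
    costs_usd=[0.002] * 5,
)
```

In this sample window the median latency and cost are within budget, but the single 3,000 ms request pushes P99 over the 2,500 ms limit.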

What Teams Get Wrong About Cost Tracking

They calculate costs in application code using estimated token counts. The model provider's reported token counts are the authoritative numbers. Always capture prompt_tokens and completion_tokens from the response object. For providers that support prompt caching (Anthropic Claude, OpenAI with cached input), separately track cached_tokens because the price is 50-90% lower. Teams that lump cached and uncached tokens together systematically overestimate cost and miss the signal that their caching is broken.
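The difference cached tokens make can be sketched like this. The prices are placeholders, not a current price list, and the code assumes `cached_tokens` is reported as a subset of `prompt_tokens`; if your provider reports cached reads as a separate count, adjust the subtraction accordingly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSchedule:
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float


# Placeholder prices for illustration only; load real ones from your provider.
EXAMPLE_PRICES = PriceSchedule(
    input_per_million=5.0,
    cached_input_per_million=2.5,
    output_per_million=15.0,
)


def call_cost_usd(prompt_tokens, completion_tokens, cached_tokens, prices):
    """Cost of one call, pricing cached input tokens separately.

    Assumes cached_tokens is included in prompt_tokens.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    total = (
        uncached * prices.input_per_million
        + cached_tokens * prices.cached_input_per_million
        + completion_tokens * prices.output_per_million
    )
    return total / 1_000_000


def cache_hit_ratio(prompt_tokens, cached_tokens):
    """Share of input tokens served from cache; a drop signals broken caching."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


actual = call_cost_usd(10_000, 1_000, 8_000, EXAMPLE_PRICES)
lumped = call_cost_usd(10_000, 1_000, 0, EXAMPLE_PRICES)  # ignores caching
```

Logging the cache hit ratio next to the cost gives you the signal the paragraph above describes: if the ratio falls to zero after a deploy, something in the prompt prefix changed and caching stopped working.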

Wiring Observability Into Your Eval Loop

Traces are only valuable if they feed a continuous eval loop. Here is the minimal architecture I wire on every project.

Online Eval (Real-Time Scoring)

Run a lightweight eval on every trace immediately after completion. This does not need to be a powerful model. A fast, cheap model (Claude Haiku, GPT-4o-mini) scoring 4 or 5 dimensions is sufficient for real-time alerting. Score: answer groundedness (is the claim supported by the retrieved chunks?), answer completeness, instruction adherence, format correctness, and safety/refusal correctness. Log scores as fields on the root trace. Alert when the rolling 5-minute average groundedness score drops below 0.75.
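The rolling-average alert can be sketched with a time-windowed buffer. The class and function names are hypothetical, timestamps are plain seconds, and the scores are assumed to arrive from whatever eval model you run.

```python
from collections import deque


class RollingAverage:
    """Average of scores observed within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp_seconds, score), oldest first

    def add(self, timestamp: float, score: float) -> None:
        self._samples.append((timestamp, score))
        # Drop samples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def average(self):
        if not self._samples:
            return None
        return sum(score for _, score in self._samples) / len(self._samples)


def should_alert(average, threshold: float = 0.75) -> bool:
    return average is not None and average < threshold


groundedness = RollingAverage(window_seconds=300)
groundedness.add(0, 0.9)
groundedness.add(200, 0.9)
healthy = should_alert(groundedness.average())   # both samples in window
groundedness.add(450, 0.5)                        # the sample at t=0 ages out
degraded = should_alert(groundedness.average())
```

In practice you would also require a minimum sample count before alerting, so that a single bad trace during a quiet period does not page anyone.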

Offline Eval (Batch Regression)

Before every deployment: run a fixed golden dataset of 200-500 representative queries through both the current production pipeline and the candidate pipeline. Compare score distributions, cost per query, and P99 latency. Block the deploy if any dimension regresses by more than 5% relative. This is the single change that most dramatically improves pipeline reliability for teams I work with. They ship prompt changes like config changes, without regression testing, then wonder why quality degrades over weeks.
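A deploy gate along these lines might look like the sketch below. The metric names and the aggregated inputs are assumptions; the 5% relative threshold is the one stated above.

```python
def relative_regression(baseline: float, candidate: float, higher_is_better: bool) -> float:
    """Return how much worse the candidate is, as a fraction of the baseline.

    Positive means worse, negative means better.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    change = (candidate - baseline) / baseline
    return -change if higher_is_better else change


# Hypothetical metric names mapped to whether higher values are better.
METRIC_DIRECTIONS = {
    "groundedness": True,
    "cost_per_query_usd": False,
    "p99_latency_ms": False,
}


def deploy_blockers(baseline: dict, candidate: dict, max_regression: float = 0.05) -> list:
    """List the metrics on which the candidate regresses beyond the threshold."""
    return [
        metric
        for metric, higher_is_better in METRIC_DIRECTIONS.items()
        if relative_regression(baseline[metric], candidate[metric], higher_is_better)
        > max_regression
    ]


production = {"groundedness": 0.80, "cost_per_query_usd": 0.010, "p99_latency_ms": 2_000}
candidate = {"groundedness": 0.79, "cost_per_query_usd": 0.012, "p99_latency_ms": 2_050}
blockers = deploy_blockers(production, candidate)
```

Here the candidate is slightly less grounded and slightly slower, both within tolerance, but it costs 20% more per query, so the deploy is blocked on cost alone.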

Human Review Sampling

Route 2-5% of production traces to a human review queue, stratified by: low eval scores, high cost outliers, user negative feedback signals (thumbs down, retry, re-phrasing), and new user segments. Human labels feed back into your golden dataset and recalibrate your online eval model. Without this loop your eval model drifts out of alignment with real user expectations over months.

Guardrails, Security, and Human-in-the-Loop

Observability and guardrails are the same concern viewed from different angles. Observability tells you what happened. Guardrails enforce what is allowed to happen. They share the same instrumentation layer.

Input Guardrails to Log

Prompt injection detection score: even if you pass the check, log the score. Distributions tell you when attackers are probing.
PII detection before sending user input to the model. Log a boolean flag and the detected entity types (not the values), never the raw PII.
Input length and language. Unusually long inputs are often adversarial or accidental abuse.

Output Guardrails to Log

Policy violation score from your content classifier.
Hallucination risk flag from your groundedness check.
Whether the response was modified, rejected, or passed through unaltered. Log all three states separately.

Human-in-the-Loop Triggers

Define explicit conditions under which the system pauses and routes to a human: any tool call that writes to a database or sends an email; any agent that has accumulated more than 6 steps in a single run; any response with a groundedness score below 0.6. Log every HITL trigger with the reason code. If a specific reason code fires more than 1% of the time, that is a pipeline problem to fix, not a workload for human reviewers.
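Those trigger conditions can be expressed as a small pure function that returns reason codes. The tool names and reason-code strings are hypothetical; the thresholds are the ones given above.

```python
# Hypothetical names of tools that write to a database or send email.
SIDE_EFFECTING_TOOLS = {"db_write", "send_email"}
MAX_AGENT_STEPS = 6
MIN_GROUNDEDNESS = 0.6


def hitl_reasons(tool_names, step_count, groundedness):
    """Return the reason codes that require pausing for human review."""
    reasons = []
    if any(name in SIDE_EFFECTING_TOOLS for name in tool_names):
        reasons.append("side_effecting_tool")
    if step_count > MAX_AGENT_STEPS:
        reasons.append("step_limit_exceeded")
    if groundedness is not None and groundedness < MIN_GROUNDEDNESS:
        reasons.append("low_groundedness")
    return reasons


def noisy_reason_codes(reason_counts, total_runs, max_rate=0.01):
    """Reason codes firing on more than max_rate of runs: fix the pipeline."""
    return [
        reason
        for reason, count in reason_counts.items()
        if count / total_runs > max_rate
    ]


reasons = hitl_reasons(["search_orders", "send_email"], step_count=7, groundedness=0.82)
noisy = noisy_reason_codes(
    {"side_effecting_tool": 40, "low_groundedness": 350}, total_runs=10_000
)
```

Logging the returned reason codes on the trace, even when the list is empty, makes the 1% check a one-line aggregation later.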

Tooling Choices: What to Use and What to Skip

You need less tooling than vendors want you to buy. Here is the honest breakdown.

What Actually Works in Production

LangSmith (if using LangChain): native integration, good UI for trace inspection, eval harness included. Expensive at scale but saves setup time early.
Langfuse: open-source, self-hostable, excellent trace UI, native prompt versioning, growing eval support. My default recommendation for teams that want data sovereignty or are past the LangChain ecosystem.
OpenTelemetry spans into your existing APM (Datadog, Honeycomb, Grafana): works well if your team already has APM discipline. Requires more manual instrumentation but avoids a new tool dependency.
Custom Postgres + pgvector + a Grafana dashboard: entirely viable for teams under 100k traces per day who want full control. More setup, no ongoing SaaS cost.

What You Do Not Need

You do not need a purpose-built LLM observability platform on day one. You need a structured log schema, a queryable store, and a dashboard showing cost, latency, and eval scores over time. Start with Langfuse or structured JSON logs into your existing stack. Add specialized tooling only when you hit a specific gap. Every team I have seen buy an enterprise LLM observability platform before they have a working eval loop is solving the wrong problem in the wrong order.

Frequently Asked Questions

What is the minimum viable LLM logging setup for a new production deployment?

At minimum: a trace ID on every request, the full assembled prompt and completion stored in cold storage keyed by trace ID, token counts and cost from the API response, retrieval chunk scores if you are doing RAG, and finish reason on every model call. That set alone covers 80% of the debugging surface. Add eval scores in the second iteration, not the first.

How do I trace an agent that makes recursive or parallel tool calls?

Use a parent-child span model where each tool call is a child of the model call span that generated it, and recursive model calls are children of the tool call span that triggered them. The trace ID is constant across all spans. The span ID and parent span ID together reconstruct the call tree. OpenTelemetry's span context propagation handles this natively. Langfuse supports nested observations with the same model. The key mistake is flattening everything into a single list of log lines, which makes it impossible to reconstruct the execution order of parallel branches.
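To illustrate why the parent span ID matters, here is a library-free sketch that rebuilds the call tree from a flat list of spans. The `Span` class and span names are hypothetical; a real system would get the same three identifiers from OpenTelemetry or Langfuse.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    name: str
    start_ms: int


def render_tree(spans):
    """Return indented span names, children ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans arrive in arbitrary order, as they would from parallel branches.
spans = [
    Span("t1", "s4", "s2", "model_call_2", 30),
    Span("t1", "s3", "s1", "tool:lookup", 21),
    Span("t1", "s0", None, "request", 0),
    Span("t1", "s2", "s1", "tool:search", 20),
    Span("t1", "s1", "s0", "model_call_1", 10),
]
tree = render_tree(spans)
```

The two tool calls run in parallel under the first model call, and the recursive second model call hangs off the tool call that triggered it. A flat list sorted by timestamp would interleave those branches and lose that relationship.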

How do I monitor LLM costs without tracking every single token?

You do need to track every token, but you do not need to do it in your application hot path. Write token counts from the API response into your trace store asynchronously. Run a nightly job that multiplies token counts by the current price schedule per model. Alert on daily cost anomalies, not per-request anomalies, to reduce noise. The one exception: if a single request type has a known cost ceiling (e.g., document analysis bounded at $0.50), add a synchronous guard that rejects inputs that would predictably exceed that ceiling before calling the model.

What is the difference between LLM tracing and LLM evaluation, and how do they connect?

Tracing captures what happened (inputs, outputs, latency, cost, intermediate steps). Evaluation scores the quality of what happened (correctness, groundedness, safety, adherence). They connect through the trace record: evaluation runs against trace data, and evaluation scores are written back as fields on the trace. The trace is the unit of both debugging and quality measurement. You cannot run a credible offline eval without production traces to draw your golden dataset from.

How should I handle PII in LLM traces?

Never log raw PII into your trace store. Apply a PII detection and masking step before writing to the trace: replace detected entities with typed placeholders like [EMAIL] or [NAME], log only the entity types detected (not values), and store a flag indicating whether masking was applied. If you need the original input for debugging a specific incident, use a separate audit log with stricter access controls and a short retention window (30 days max). The trace store should be safe to query by any engineer on the team without PII exposure risk.
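A masking step of this shape is sketched below. The two regular expressions are deliberately simple stand-ins; a production pipeline would normally use a dedicated PII detector, and names in particular cannot be caught with a regex.

```python
import re

# Simplified patterns for illustration only; they will miss many real cases.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_pii(text: str) -> dict:
    """Replace detected entities with typed placeholders before tracing."""
    detected = []
    for entity_type, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{entity_type}]", text)
        if count:
            detected.append(entity_type)
    return {
        "masked_text": text,
        "pii_entity_types": detected,  # types only, never the values
        "pii_masked": bool(detected),
    }


result = mask_pii("Mail jane.doe@example.com or call +1 415 555 0100 today")
```

Only the masked text, the entity types, and the flag are written to the trace; the original string is never part of the returned record.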

How do I know if my retrieval quality is degrading without manually reviewing every query?

Track the distribution of top-chunk retrieval scores over time as a rolling metric. A healthy RAG system has a stable median top-chunk score and a low tail rate, meaning the share of queries whose top-chunk score falls below 0.65. When the tail rate rises above your baseline by more than 10% relative over a 24-hour window, that is a retrieval degradation signal. Common causes: embedding model change, document corpus update that introduced formatting inconsistencies, or a query pattern shift in your user base. Each of these has a different fix. The score distribution tells you something changed; the trace records tell you which queries are affected and what chunks they retrieved.
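The tail-rate check can be sketched in a few lines. The function names are hypothetical, and the inputs are assumed to be the top-chunk scores pulled from retrieval spans for a baseline window and for the most recent 24 hours.

```python
def tail_rate(top_scores, threshold=0.65):
    """Share of queries whose top-chunk score falls below the threshold."""
    if not top_scores:
        return 0.0
    return sum(1 for score in top_scores if score < threshold) / len(top_scores)


def retrieval_degraded(baseline_rate, current_rate, max_relative_increase=0.10):
    """True when the tail rate rose more than 10% relative to the baseline."""
    if baseline_rate == 0:
        return current_rate > 0
    return (current_rate - baseline_rate) / baseline_rate > max_relative_increase


baseline_scores = [0.81, 0.78, 0.90, 0.74, 0.60, 0.83, 0.79, 0.88, 0.77, 0.85]
current_scores = [0.80, 0.62, 0.91, 0.73, 0.58, 0.84, 0.79, 0.87, 0.76, 0.86]

baseline = tail_rate(baseline_scores)   # one of ten below 0.65
current = tail_rate(current_scores)     # two of ten below 0.65
degraded = retrieval_degraded(baseline, current)
```

With samples this small the signal is noisy; the same functions applied to a full day of traces give a far more stable rate.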

Start Tracing Before You Need It

Every team I consult with that skipped proper observability has the same regret: they spent weeks debugging in production what would have been a 10-minute trace lookup. Instrumentation is not overhead; it is the prerequisite for operating a non-deterministic system responsibly. The schema described here takes one to two days to implement correctly. The alternative is shipping blind and paying with engineering time, user trust, and uncontrolled costs.

If you are building or scaling an LLM system and want an experienced set of eyes on your architecture before you ship, that is exactly what I do as an independent AI architecture advisor. Or reach out directly through the contact page and describe what you are building.
