Monitoring Long-Running Agents: Observability Patterns
Standard APM tools weren't built for processes that run for hours and make hundreds of non-deterministic decisions. Here's how to build observability into your agent stack.
You can’t debug what you can’t observe. For traditional web services, observability is largely solved: structured logs, distributed traces, and metrics cover most failure modes. For long-running agents, the standard toolkit falls short in meaningful ways.
Why Standard APM Isn’t Enough
A typical distributed trace for a web request has a clear structure: one root span, a predictable set of child spans (database queries, cache lookups, downstream calls), and a completion time measured in milliseconds. The trace is complete, finite, and interpretable.
An agent trace for a four-hour research task has thousands of spans representing LLM calls, tool executions, memory reads and writes, and reasoning steps. It’s non-deterministic — the same task will produce a different trace every time. And “correctness” isn’t captured by latency or error rate alone; the agent can complete successfully (no errors, acceptable latency) while producing a wrong or useless result.
Standard APM metrics answer “did it run?” Agent observability needs to answer “did it work?”
The Agent Observability Stack
Structured Step Logging
Log every agent step as a structured event with a consistent schema:
```json
{
  "task_id": "task_abc123",
  "step": 14,
  "step_type": "tool_call",
  "tool": "web_search",
  "input": {"query": "LLM inference optimization 2026"},
  "output_tokens": 1240,
  "latency_ms": 340,
  "success": true,
  "timestamp": "2026-01-14T09:23:41Z",
  "cumulative_tokens": 28400,
  "cumulative_tool_calls": 9
}
```
This schema enables SQL-like queries over agent behavior: “show me all tasks where web_search was called more than 20 times” or “find tasks where step count exceeded 100 without completion.”
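As a minimal sketch of the first query, here is how heavy tool use could be detected over an in-memory list of step events following the schema above (the events and the `tasks_with_heavy_tool_use` helper are illustrative, not part of any particular logging library):

```python
from collections import Counter

# Synthetic step events following the schema above (values are illustrative).
steps = [
    {"task_id": "task_abc123", "step": 14, "step_type": "tool_call", "tool": "web_search"},
    {"task_id": "task_abc123", "step": 15, "step_type": "tool_call", "tool": "web_search"},
    {"task_id": "task_def456", "step": 3, "step_type": "tool_call", "tool": "read_file"},
]

def tasks_with_heavy_tool_use(steps, tool, threshold):
    """Return task_ids that called `tool` more than `threshold` times."""
    counts = Counter(
        s["task_id"] for s in steps
        if s["step_type"] == "tool_call" and s["tool"] == tool
    )
    return [task_id for task_id, n in counts.items() if n > threshold]

print(tasks_with_heavy_tool_use(steps, "web_search", 1))  # ['task_abc123']
```

In production the same query would typically run against a log store (SQL, a columnar warehouse, or a log search engine) rather than in application code; the consistent schema is what makes that possible.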
Reasoning Trace Capture
Beyond step logs, capture the agent’s actual reasoning — the content of its thinking steps, the plans it makes, the decisions it evaluates. This is the data that makes debugging possible.
Reasoning traces are large and expensive to store indefinitely. A practical strategy: store full reasoning traces for a rolling 14-day window for all tasks, and indefinitely for tasks that were marked as failures or flagged for review.
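The retention policy above can be expressed as a small predicate. This is a sketch under the stated policy; the function name and parameters are illustrative:

```python
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=14)

def should_retain_trace(completed_at, outcome, flagged_for_review, now=None):
    """Rolling 14-day retention for all tasks; indefinite retention for
    failed or flagged tasks (policy from the text above)."""
    now = now or datetime.now(timezone.utc)
    if outcome.startswith("failed") or flagged_for_review:
        return True
    return now - completed_at <= RETENTION_WINDOW
```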
Health Metrics
Track agent-specific metrics that don’t exist in traditional monitoring:
- Step rate (steps/minute) — a sharp drop indicates the agent is stuck or waiting
- Token burn rate (tokens/minute) — a spike indicates unexpected looping
- Tool error rate per tool type — identifies brittle integrations
- Context utilization (% of max context used) — predicts context overflow failures
- Planning vs. execution ratio — too much planning relative to action may indicate the agent is confused
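The first two metrics fall out of the step logs directly. A minimal sketch, assuming each event carries a timestamp expressed as minutes since task start (`ts`) and an `output_tokens` count, both illustrative field names:

```python
def window_rates(events, window_minutes):
    """Compute step rate (steps/min) and token burn rate (tokens/min)
    over the trailing window. Events are assumed sorted by `ts`
    (minutes since task start)."""
    cutoff = events[-1]["ts"] - window_minutes
    recent = [e for e in events if e["ts"] > cutoff]
    span = max(window_minutes, 1)
    step_rate = len(recent) / span
    token_rate = sum(e["output_tokens"] for e in recent) / span
    return step_rate, token_rate
```

A sharp drop in `step_rate` or a spike in `token_rate` between successive windows is the signal to alert on.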
Task Outcome Tracking
Track task outcomes explicitly, separate from technical metrics:
```python
from enum import Enum

class TaskOutcome(Enum):
    COMPLETED_SUCCESS = "completed_success"
    COMPLETED_PARTIAL = "completed_partial"
    FAILED_TIMEOUT = "failed_timeout"
    FAILED_BUDGET = "failed_budget"
    FAILED_ERROR = "failed_error"
    FAILED_STUCK = "failed_stuck"
    CANCELLED = "cancelled"
```
Aggregate these by task type, user, and time period. Completion rate per task type is often the single most useful metric for tracking agent reliability improvements over time.
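The per-task-type completion rate can be computed straight from recorded outcomes. A sketch, treating outcomes as `(task_type, outcome_value)` pairs and counting only full successes (whether to count partial completions is a policy choice):

```python
from collections import defaultdict

def completion_rate_by_task_type(outcomes):
    """outcomes: iterable of (task_type, outcome_value) pairs, where
    outcome_value is a TaskOutcome-style string such as "completed_success".
    Returns {task_type: fraction of tasks that fully succeeded}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for task_type, outcome in outcomes:
        totals[task_type] += 1
        if outcome == "completed_success":
            successes[task_type] += 1
    return {t: successes[t] / totals[t] for t in totals}
```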
Detecting the “Stuck Agent” Pattern
One of the most common agent failure modes is the stuck loop: the agent is taking steps, the steps are technically succeeding, but the agent isn’t making progress toward the goal. This is invisible to error rate monitoring.
Detect stuck agents by tracking semantic progress, not just step count. A simple heuristic: if the last 5 steps all called the same tool with similar inputs and produced similar outputs, the agent is likely looping. Pause it and alert.
More sophisticated: use a lightweight model to score each step’s contribution to the stated goal. If progress scores are flat or declining for N consecutive steps, intervene.
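The simple heuristic can be sketched directly. Here "similar" is reduced to exact-match on tool inputs for brevity; a real detector might use fuzzy or embedding similarity instead, and the step-event field names are the illustrative ones from the logging schema above:

```python
def looks_stuck(recent_steps, window=5):
    """Heuristic from the text: if the last `window` steps all called the
    same tool with identical inputs, the agent is likely looping."""
    if len(recent_steps) < window:
        return False
    last = recent_steps[-window:]
    tools = {s["tool"] for s in last}
    inputs = {repr(sorted(s["input"].items())) for s in last}
    return len(tools) == 1 and len(inputs) == 1
```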
Alerting Strategy
Alert on deviations from baseline, not absolute thresholds. Agent workloads are variable by nature — a Tuesday research spike will look like an anomaly if your alert threshold is based on Monday traffic.
Use rolling baselines:
- Alert if step count is 3× the rolling 7-day p95 for this task type
- Alert if a task has been running 2× longer than the p90 completion time
- Alert if tool error rate exceeds 2× the rolling 24-hour average
Relative alerts like these produce more signal and less noise than static thresholds.
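The first rule can be sketched with the standard library's percentile support; `exceeds_baseline` and its parameters are illustrative names, and `history` would be the rolling 7-day window of step counts for the task type:

```python
import statistics

def exceeds_baseline(value, history, multiplier=3.0, pct=95):
    """Alert when `value` exceeds `multiplier` x the rolling p{pct}
    of `history` (e.g. step counts for this task type over 7 days)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = statistics.quantiles(history, n=100)[pct - 1]
    return value > multiplier * baseline
```

The same shape covers the other two rules: swap in completion times with `pct=90, multiplier=2.0`, or a 24-hour window of tool error rates with `multiplier=2.0`.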
Human-in-the-Loop Checkpoints
Not all monitoring needs to be automated. For high-stakes agent tasks, build in explicit checkpoints where a human reviews the agent’s plan before execution proceeds. The checkpoint log becomes part of your observability record: who approved what, at what step, and why.
Good observability makes these reviews efficient. A reviewer who can see a clean summary of the agent’s reasoning, its current plan, and its resource usage can make an informed go/no-go decision in under a minute. Bad observability means digging through hundreds of log lines — and most reviewers will just approve to move on.
Observability is an investment that compounds. The time you spend building good agent observability today is the time you don’t spend debugging mysterious failures in production next month.