Infrastructure January 20, 2026

Rate Limiting and Cost Control for LLM-Powered Agents

Without explicit cost controls, a single runaway agent task can generate hundreds of dollars in API costs. Here's how to build budget management into your agent infrastructure.

AgentHost Team

LLM API costs scale with usage in a way that server costs don’t. A bug in your web server might make it use more CPU — that’s bounded by your instance size. A bug in your agent that causes it to loop on an LLM call will keep accumulating costs until you notice and intervene. At $15/million tokens for frontier models, an agent in a loop can spend $100 in ten minutes.

Cost control is not a nice-to-have. It’s a production requirement.

Layer Your Controls

Effective cost control requires multiple independent layers. No single control is reliable on its own.

Layer 1: Provider-Level Spend Limits

Every major LLM API provider offers account-level spend limits and alerts. Set these first — they’re your backstop against catastrophic runaway. At minimum, configure a hard monthly spend cap and alert thresholds well below it, so you hear about a spike before the cap trips.

This is a blunt instrument — it affects all usage, not just runaway tasks — but it’s the safest backstop. If your application’s per-task controls fail, the provider limit catches it.

Layer 2: Per-Task Token Budgets

Each agent task should have an explicit token budget. The task manager enforces this budget and terminates (or pauses) the task when exceeded.

from dataclasses import dataclass

@dataclass
class TaskBudget:
    max_input_tokens: int = 500_000
    max_output_tokens: int = 100_000
    max_tool_calls: int = 100
    max_wall_time_seconds: int = 3600

    def check(self, usage: "TaskUsage") -> "BudgetStatus":
        if usage.input_tokens > self.max_input_tokens:
            return BudgetStatus.EXCEEDED_INPUT
        if usage.tool_calls > self.max_tool_calls:
            return BudgetStatus.EXCEEDED_TOOL_CALLS
        # ... analogous checks for output tokens and wall time
        return BudgetStatus.OK

When a budget is exceeded, don’t just terminate — checkpoint the task state first. A task that ran for 45 minutes and then hit a token limit has produced partial results that may be valuable.
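The checkpoint-then-terminate flow might look like the sketch below. TaskState, checkpoints, and enforce_budget are illustrative names, not part of any real task manager, and the in-memory dict stands in for durable storage:

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    task_id: str
    input_tokens: int = 0
    partial_output: str = ""
    stopped: bool = False

# In production this would be durable storage; a dict keeps the sketch self-contained.
checkpoints: dict[str, dict] = {}

def enforce_budget(task: TaskState, max_input_tokens: int) -> bool:
    """Return True if the task may continue; checkpoint and stop it otherwise."""
    if task.input_tokens > max_input_tokens:
        # Persist partial results before terminating so the run isn't wasted.
        checkpoints[task.task_id] = {
            "reason": "exceeded_input",
            "partial_output": task.partial_output,
            "input_tokens": task.input_tokens,
        }
        task.stopped = True
        return False
    return True
```

The key design point is ordering: the checkpoint write happens before the stop, so a 45-minute run that hits its limit still leaves reviewable output behind.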

Layer 3: Per-User and Per-Tenant Limits

If multiple users or tenants share your agent infrastructure, enforce per-user limits. A single user should not be able to exhaust resources at the expense of others.

Track usage at the user/tenant level with rolling windows:

user_id:123:tokens:2026-01-20  →  current usage
user_id:123:tokens:daily_limit  →  500,000

When a user approaches their daily limit, return a clear error (or warning) rather than silently failing mid-task. Users can make rational decisions about usage when they have visibility; they can’t when limits are opaque.
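A minimal sketch of that counter, assuming the key scheme shown above. In production the counters would live in Redis with a TTL on each daily key; a plain dict keeps the example self-contained, and the names and thresholds are illustrative:

```python
from datetime import date

usage: dict[str, int] = {}   # stand-in for Redis daily counters
DAILY_LIMIT = 500_000
WARN_FRACTION = 0.8

def record_tokens(user_id: str, tokens: int, today: date) -> str:
    """Add to the user's daily counter and report ok / warn / denied."""
    key = f"user_id:{user_id}:tokens:{today.isoformat()}"
    usage[key] = usage.get(key, 0) + tokens
    if usage[key] > DAILY_LIMIT:
        return "denied"   # reject new work with a clear error, not a silent mid-task failure
    if usage[key] > WARN_FRACTION * DAILY_LIMIT:
        return "warn"     # surface the approaching limit so the user can decide what to do
    return "ok"
```

The "warn" state is what gives users the visibility the paragraph above calls for: they see the limit coming instead of discovering it when a task dies.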

Layer 4: Anomaly Detection

Establish baseline token usage per task type and alert when a task deviates significantly. A research task that normally uses 50,000 tokens should trigger a review if it’s using 500,000. This catches novel failure modes that your static limits didn’t anticipate.

A simple heuristic: if a task exceeds 3× the p95 token usage for its type, pause it and require human review before continuing.
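That heuristic fits in a few lines using the standard library; the function names and the 3× factor are just the rule stated above, not a library API:

```python
import statistics

def p95(samples: list[int]) -> float:
    """95th percentile of historical token usage for a task type."""
    return statistics.quantiles(samples, n=100)[94]

def should_pause(history: list[int], current_tokens: int, factor: float = 3.0) -> bool:
    """Pause for human review if a task exceeds factor x p95 of its type's baseline."""
    return current_tokens > factor * p95(history)
```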

Managing Tool Call Costs

LLM token costs are the obvious cost driver, but tool calls accumulate cost too: web search APIs, code execution compute, external service calls. Track and limit these independently.

tool_budgets = {
    "web_search": {"calls_per_task": 20, "cost_per_call": 0.01},
    "code_exec": {"calls_per_task": 30, "cpu_seconds_per_task": 60},
    "external_api": {"calls_per_task": 50},
}

Some tools are expensive in ways that aren’t immediately obvious. A code execution environment that spins up a container per call has cold-start overhead. Web search APIs charge per query. Build cost awareness into your tool registry so task managers can make informed decisions about which tools to allow for which task types.
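One way to wire the budget table above into a registry check, as a sketch (authorize_tool_call is a hypothetical gatekeeper the task manager would consult before dispatching each tool call):

```python
tool_budgets = {
    "web_search": {"calls_per_task": 20, "cost_per_call": 0.01},
    "code_exec": {"calls_per_task": 30, "cpu_seconds_per_task": 60},
    "external_api": {"calls_per_task": 50},
}

def authorize_tool_call(tool: str, calls_so_far: dict[str, int]) -> bool:
    """Allow a tool call only while the task is under that tool's per-task cap."""
    budget = tool_budgets.get(tool)
    if budget is None:
        return False  # unknown tools are denied by default
    return calls_so_far.get(tool, 0) < budget["calls_per_task"]
```

Denying unknown tools by default means a new tool can't accrue unbounded cost before someone remembers to give it a budget entry.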

Real-Time Cost Visibility

Build a cost dashboard that shows spending in real time, broken down by task type, user or tenant, model, and tool.

When costs spike, you want to identify the cause in seconds, not by auditing logs. A dashboard that shows “task type ‘deep research’ is 5× over expected cost today, driven by user_id:456” gives you everything you need to act.
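The aggregation behind such a dashboard is straightforward; this sketch assumes usage events carry task_type, user_id, and cost_usd fields (illustrative names, not a real schema):

```python
from collections import defaultdict

def cost_breakdown(events: list[dict]) -> dict[tuple[str, str], float]:
    """Aggregate spend by (task_type, user_id) from raw usage events."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for e in events:
        totals[(e["task_type"], e["user_id"])] += e["cost_usd"]
    return dict(totals)
```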

The Fail-Open vs. Fail-Closed Decision

When your cost control infrastructure itself fails (the token counter service is down, the budget check times out), what do you do?

Fail-open (allow the request): better user experience, but catastrophic cost exposure if your controls are down for an extended period.

Fail-closed (deny the request): worse user experience, but bounded cost exposure.

For internal tools and developer-facing products, fail-open with aggressive monitoring is often acceptable. For consumer-facing products or high-volume automated pipelines, fail-closed is the safer default.

Make this decision explicitly for each component in your cost control stack, and document it. “We fail open on the per-user Redis counter but fail closed on the task budget enforcer” is a policy decision that should be written down.
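The policy can be made explicit in code rather than buried in exception handlers. A sketch, with checked_allow as a hypothetical wrapper around any budget check:

```python
FAIL_OPEN = True     # allow requests when this component's check is unavailable
FAIL_CLOSED = False  # deny requests when this component's check is unavailable

def checked_allow(check, fail_policy: bool) -> bool:
    """Run a budget check; fall back to the component's declared policy on error."""
    try:
        return check()
    except Exception:
        # The control itself failed (timeout, dependency down): apply the policy.
        return fail_policy

def broken_check() -> bool:
    raise TimeoutError("budget service unreachable")
```

Passing the policy per component, rather than hard-coding it in one place, is what lets you say "fail open on the per-user counter, fail closed on the task budget enforcer" and have the code match the written policy.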
