Rate Limiting and Cost Control for LLM-Powered Agents
Without explicit cost controls, a single runaway agent task can generate hundreds of dollars in API costs. Here's how to build budget management into your agent infrastructure.
LLM API costs scale with usage in a way that server costs don’t. A bug in your web server might make it use more CPU — that’s bounded by your instance size. A bug in your agent that causes it to loop on an LLM call will keep accumulating costs until you notice and intervene. At $15/million tokens for frontier models, an agent in a loop can spend $100 in ten minutes.
Cost control is not a nice-to-have. It’s a production requirement.
Layer Your Controls
Effective cost control requires multiple independent layers. No single control is reliable on its own.
Layer 1: Provider-Level Spend Limits
Every major LLM API provider offers account-level spend limits and alerts. Set these first — they’re your backstop against catastrophic runaway. Configure:
- Hard monthly spend limit (causes requests to fail once exceeded)
- Alert at 50%, 75%, and 90% of expected monthly spend
This is a blunt instrument — it affects all usage, not just runaway tasks — but it’s the safest backstop. If your application’s per-task controls fail, the provider limit catches it.
Layer 2: Per-Task Token Budgets
Each agent task should have an explicit token budget. The task manager enforces this budget and terminates (or pauses) the task when exceeded.
    from dataclasses import dataclass
    from enum import Enum

    BudgetStatus = Enum("BudgetStatus", "OK EXCEEDED_INPUT EXCEEDED_OUTPUT EXCEEDED_TOOL_CALLS")

    @dataclass
    class TaskBudget:
        max_input_tokens: int = 500_000
        max_output_tokens: int = 100_000
        max_tool_calls: int = 100
        max_wall_time_seconds: int = 3600  # wall time is checked against the task manager's clock

        def check(self, usage: "TaskUsage") -> BudgetStatus:
            if usage.input_tokens > self.max_input_tokens:
                return BudgetStatus.EXCEEDED_INPUT
            if usage.output_tokens > self.max_output_tokens:
                return BudgetStatus.EXCEEDED_OUTPUT
            if usage.tool_calls > self.max_tool_calls:
                return BudgetStatus.EXCEEDED_TOOL_CALLS
            return BudgetStatus.OK
When a budget is exceeded, don’t just terminate — checkpoint the task state first. A task that ran for 45 minutes and then hit a token limit has produced partial results that may be valuable.
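A minimal sketch of that termination path, assuming a simple dict-shaped task record; `handle_budget_exceeded` and the checkpoint layout are illustrative stand-ins for your task manager's own persistence layer:

```python
import json
import time

def handle_budget_exceeded(task, status, checkpoint_dir="/tmp/checkpoints"):
    """Persist partial results before stopping the task, so a 45-minute
    run that hits a limit isn't thrown away."""
    checkpoint = {
        "task_id": task["id"],
        "stopped_at": time.time(),
        "reason": str(status),
        "partial_results": task.get("results", []),
        "messages": task.get("messages", []),  # conversation state for resuming
    }
    path = f"{checkpoint_dir}/{task['id']}.json"
    with open(path, "w") as f:
        json.dump(checkpoint, f)
    return path  # surface this so the task can be resumed or its results reviewed
```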
Layer 3: Per-User and Per-Tenant Limits
If multiple users or tenants share your agent infrastructure, enforce per-user limits. A single user should not be able to exhaust resources at the expense of others.
Track usage at the user/tenant level with rolling windows:
    user_id:123:tokens:2026-01-20 → current usage
    user_id:123:tokens:daily_limit → 500,000
When a user approaches their daily limit, return a clear error (or warning) rather than silently failing mid-task. Users can make rational decisions about usage when they have visibility; they can’t when limits are opaque.
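An in-memory sketch of that key scheme (a production version would keep the day-scoped counter in Redis via `INCRBY` plus a TTL; the class and method names here are illustrative):

```python
from datetime import datetime, timezone

class UserTokenLimiter:
    """Per-user daily token accounting against a daily limit.
    In-memory stand-in for the Redis key scheme sketched above."""

    def __init__(self, daily_limit: int = 500_000):
        self.daily_limit = daily_limit
        self._usage = {}  # key: "user_id:<id>:tokens:<YYYY-MM-DD>"

    def _key(self, user_id: int) -> str:
        today = datetime.now(timezone.utc).date().isoformat()
        return f"user_id:{user_id}:tokens:{today}"

    def record(self, user_id: int, tokens: int) -> int:
        key = self._key(user_id)
        self._usage[key] = self._usage.get(key, 0) + tokens
        return self._usage[key]

    def remaining(self, user_id: int) -> int:
        return max(0, self.daily_limit - self._usage.get(self._key(user_id), 0))

    def check(self, user_id: int, tokens: int) -> None:
        """Raise a clear error *before* starting work that would breach the limit."""
        if self._usage.get(self._key(user_id), 0) + tokens > self.daily_limit:
            raise RuntimeError(
                f"user {user_id} would exceed daily limit of {self.daily_limit} tokens"
            )
```

Raising before the task starts, rather than cutting it off mid-run, is what gives users the visibility to make rational decisions about their remaining budget.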
Layer 4: Anomaly Detection
Establish baseline token usage per task type and alert when a task deviates significantly. A research task that normally uses 50,000 tokens should trigger a review if it’s using 500,000. This catches novel failure modes that your static limits didn’t anticipate.
A simple heuristic: if a task exceeds 3× the p95 token usage for its type, pause it and require human review before continuing.
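That heuristic can be sketched with the standard library's percentile support; `should_pause` and its inputs are illustrative:

```python
import statistics

def p95(samples):
    """95th percentile: statistics.quantiles with n=20 yields 5% steps,
    so the last cut point is the 95th percentile."""
    return statistics.quantiles(samples, n=20)[-1]

def should_pause(task_tokens, historical_tokens, multiplier=3.0):
    """True if this task's usage exceeds multiplier x the p95 baseline
    for its task type, meaning it should pause for human review."""
    return task_tokens > multiplier * p95(historical_tokens)
```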
Managing Tool Call Costs
LLM token costs are the obvious cost driver, but tool calls accumulate cost too: web search APIs, code execution compute, external service calls. Track and limit these independently.
    tool_budgets = {
        "web_search": {"calls_per_task": 20, "cost_per_call": 0.01},
        "code_exec": {"calls_per_task": 30, "cpu_seconds_per_task": 60},
        "external_api": {"calls_per_task": 50},
    }
Some tools are expensive in ways that aren’t immediately obvious. A code execution environment that spins up a container per call has cold-start overhead. Web search APIs charge per query. Build cost awareness into your tool registry so task managers can make informed decisions about which tools to allow for which task types.
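One way to sketch that cost-aware registry, reusing the per-tool budgets above (tool names and rates are illustrative, not real API prices):

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    calls_per_task: int
    cost_per_call: float = 0.0  # direct $ per call; 0.0 when cost is compute-based

REGISTRY = {
    "web_search": ToolSpec("web_search", calls_per_task=20, cost_per_call=0.01),
    "code_exec": ToolSpec("code_exec", calls_per_task=30),
    "external_api": ToolSpec("external_api", calls_per_task=50),
}

def estimated_tool_cost(task_plan: dict) -> float:
    """Worst-case dollar cost of a task's planned tool usage, capped at
    each tool's per-task call budget. task_plan maps tool name -> calls."""
    total = 0.0
    for tool, calls in task_plan.items():
        spec = REGISTRY[tool]
        total += min(calls, spec.calls_per_task) * spec.cost_per_call
    return total
```

With an estimate like this available up front, a task manager can decline to grant an expensive tool to a task type that doesn't need it.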
Real-Time Cost Visibility
Build a cost dashboard that shows spending in real time, broken down by:
- Task type
- User/tenant
- LLM model
- Tool type
When costs spike, you want to identify the cause in seconds, not by auditing logs. A dashboard that shows “task type ‘deep research’ is 5× over expected cost today, driven by user_id:456” gives you everything you need to act.
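Underneath a dashboard like that is usually just an aggregation over per-request usage events; a minimal sketch, with illustrative field names:

```python
from collections import defaultdict

def cost_breakdown(events, dimension):
    """Sum dollar cost of usage events grouped by one dimension
    (e.g. 'task_type', 'user_id', 'model', 'tool'), highest spend first."""
    totals = defaultdict(float)
    for event in events:
        totals[event[dimension]] += event["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```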
The Fail-Open vs. Fail-Closed Decision
When your cost control infrastructure itself fails (the token counter service is down, the budget check times out), what do you do?
Fail-open (allow the request): better user experience, but catastrophic cost exposure if your controls are down for an extended period.
Fail-closed (deny the request): worse user experience, but bounded cost exposure.
For internal tools and developer-facing products, fail-open with aggressive monitoring is often acceptable. For consumer-facing products or high-volume automated pipelines, fail-closed is the safer default.
Make this decision explicitly for each component in your cost control stack, and document it. “We fail open on the per-user Redis counter but fail closed on the task budget enforcer” is a policy decision that should be written down.
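One way to make the decision explicit in code is to wrap each check so its failure mode is a named, per-component parameter rather than an accident; `guarded_check` is an illustrative sketch:

```python
import logging

def guarded_check(check_fn, *, fail_open: bool, component: str) -> bool:
    """Run a budget/limit check; return True if the request may proceed.
    If the check itself fails (service down, timeout), fall back to the
    component's documented fail-open or fail-closed policy."""
    try:
        return bool(check_fn())
    except Exception:
        logging.exception("cost control component %r unavailable", component)
        return fail_open  # the policy decision, written into the code
```

For example, the policy from the paragraph above becomes `guarded_check(user_counter_ok, fail_open=True, component="per-user counter")` for the Redis counter and `fail_open=False` for the task budget enforcer.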