Infrastructure · February 5, 2026

GPU vs CPU for LLM Inference: When to Upgrade

Not every agent workload needs a GPU. Understanding the cost/performance tradeoffs helps you provision the right hardware and avoid overspending on compute you don't need.

AgentHost Team

GPU clusters are expensive. H100 time costs real money, and it’s easy to provision GPU capacity out of habit rather than necessity. But the right answer isn’t always CPU — some workloads genuinely benefit from local GPU inference, and running them on CPU creates a latency ceiling that limits what your agents can do.

Here’s a framework for making the right call.

When CPU Is Sufficient

For the majority of agent deployments, inference happens via API call to a hosted provider (OpenAI, Anthropic, Google). In this model, the agent itself is CPU-only code: it orchestrates tool calls, manages context, and calls the inference API over HTTP. The GPU work happens at the provider’s data center.

CPU is the right choice when:

- Inference happens through a hosted API, and the agent's local work is pure orchestration: building prompts, dispatching tools, managing context.
- Call frequency is moderate, so API round-trip latency doesn't dominate the agent's step time.
- No regulatory or contractual requirement forces data to stay on your own hardware.

A well-optimized CPU-only agent deployment can handle sophisticated tasks with excellent economics. Don't upgrade to GPU until you've identified a specific bottleneck that GPU solves.
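A minimal sketch of what a CPU-only agent step looks like in practice: all local work is orchestration, and the model runs at the provider. `call_hosted_llm` and the tool registry here are hypothetical stand-ins (a canned offline response, not a real provider SDK).

```python
# CPU-only agent step: the heavy lifting (inference) happens remotely;
# locally we only build messages and dispatch tools.

TOOLS = {
    "get_time": lambda args: "2026-02-05T12:00:00Z",
}

def call_hosted_llm(messages):
    # Stand-in for an HTTPS call to a hosted provider (OpenAI, Anthropic, ...).
    # Returns a canned tool request so this sketch runs offline.
    return {"tool": "get_time", "args": {}}

def agent_step(messages):
    reply = call_hosted_llm(messages)           # network-bound, not CPU-bound
    if "tool" in reply:                         # CPU-only orchestration
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "tool", "content": result})
    return messages

history = agent_step([{"role": "user", "content": "What time is it?"}])
```

Nothing in this loop benefits from a GPU; the only way to speed it up is to reduce API round trips.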

When GPU Becomes Necessary

High-Frequency Inference

If your agent makes many inference calls per task step — running multiple small models for routing, classification, or summarization alongside a primary reasoning model — API latency compounds. At 20+ inference calls per minute per agent, local inference starts to win on both latency and cost.

A local 7B or 13B model on an A10G runs inference at ~2-5ms per call once the first call has absorbed the CUDA kernel warmup cost. The same call via API costs ~200ms in network round trips alone. At high frequency, local inference is an order of magnitude faster.
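A back-of-envelope budget using the figures above (~200ms per hosted round trip, ~5ms per local call at the pessimistic end) shows how the gap compounds at 20 calls per minute:

```python
# Per-minute latency budget: hosted API round trips vs local inference.
# Figures are the article's estimates, not measured benchmarks.

def total_latency_ms(calls_per_minute, latency_ms_per_call):
    return calls_per_minute * latency_ms_per_call

api_ms = total_latency_ms(20, 200)    # 4000 ms/min spent waiting on the API
local_ms = total_latency_ms(20, 5)    # 100 ms/min with local inference
speedup = api_ms / local_ms           # 40x even at the slow local estimate
```

At 20 calls per minute the agent spends 4 of every 60 seconds waiting on the network; locally that shrinks to a tenth of a second.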

Embedding Generation at Scale

Vector store workflows require embedding text. For small collections (< 1M documents), a hosted embedding API is fine. At tens of millions of embeddings, or when you’re embedding continuously (streaming new content into memory), local embedding models on GPU become substantially cheaper.

The math: at $0.02 per 1M tokens, embedding a 10B-token corpus via a hosted API costs ~$200. A single A10G instance running a text-embedding-3-small equivalent locally costs ~$1/hour, so if the embedding job finishes in under 200 hours, local wins.
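The break-even point generalizes to any corpus size and GPU rate. A small sketch using the article's example rates ($0.02 per 1M tokens hosted, ~$1/hour for an A10G):

```python
# Break-even calculator for hosted vs local embedding.
# Rates are the article's worked-example figures, not current price quotes.

def hosted_cost_usd(tokens, usd_per_million_tokens=0.02):
    return tokens / 1_000_000 * usd_per_million_tokens

def breakeven_hours(tokens, gpu_usd_per_hour=1.0):
    # Hours of GPU time you can buy for the hosted price of the same job.
    return hosted_cost_usd(tokens) / gpu_usd_per_hour

corpus = 10_000_000_000               # 10B-token corpus
print(hosted_cost_usd(corpus))        # ~$200 via the hosted API
print(breakeven_hours(corpus))        # ~200 hours: finish sooner and local wins
```

The same function also shows why small collections don't justify a GPU: a 100M-token corpus costs ~$2 hosted, which buys only two hours of A10G time.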

Low-Latency Multi-Turn Pipelines

Some agent architectures run multiple model calls sequentially, where the output of each call is the input to the next. Chain-of-thought verification, debate-style reasoning, multi-agent collaboration — these patterns amplify API latency. Five sequential 400ms API calls add up to 2 seconds of waiting before the agent does anything useful.

With local inference, five sequential calls at 10ms each is 50ms. For interactive agent experiences where the user is watching, this difference is perceptible.
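The sequential-chain effect can be sketched directly: because each call's output feeds the next, per-call latency adds up linearly, and only local-scale latency keeps the whole chain interactive. `verify_step` below is a hypothetical stand-in for one model call.

```python
import time

def verify_step(claim, latency_s):
    time.sleep(latency_s)               # stand-in for one model call's latency
    return claim + " [checked]"

def run_chain(claim, steps=5, per_call_s=0.010):
    start = time.perf_counter()
    for _ in range(steps):
        claim = verify_step(claim, per_call_s)   # output becomes next input
    elapsed_ms = (time.perf_counter() - start) * 1000
    return claim, elapsed_ms

# Five local calls at ~10 ms each finish in ~50 ms; the same chain at
# ~400 ms per hosted call would take ~2000 ms.
result, ms = run_chain("the sky is blue")
```

Note that batching can't help here: the calls are data-dependent, so latency, not throughput, is the constraint.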

Air-Gapped and Compliance Requirements

Some deployments cannot send data to external APIs due to regulatory requirements, data residency restrictions, or customer contractual obligations. Local inference on dedicated GPU is the only option. This is a non-negotiable driver regardless of the cost analysis.

Choosing the Right GPU

GPU         VRAM    Best For
A10G        24GB    7B-13B models, embedding workloads
A100 40GB   40GB    30B-70B models, mixed workloads
A100 80GB   80GB    70B models at batch size, fine-tuning
H100 80GB   80GB    Largest models, training, peak throughput

For most agent inference workloads, an A10G running a quantized 13B model hits the sweet spot. The H100 is overkill unless you’re fine-tuning or running 70B+ models at scale.
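A rough sizing heuristic (an assumption for illustration, not a vendor formula) makes the table concrete: weight memory is roughly parameters × bits ÷ 8, plus headroom for KV cache and activations.

```python
# Rough VRAM estimate: weights (params_billion * bits / 8 GB) plus ~20%
# headroom for KV cache and activations. A heuristic, not a vendor spec.

def vram_gb(params_billion, bits=16, overhead=1.2):
    return params_billion * bits / 8 * overhead

print(vram_gb(13, bits=4))    # ~7.8 GB: a 4-bit 13B model fits a 24GB A10G
print(vram_gb(70, bits=16))   # ~168 GB: fp16 70B needs multiple 80GB cards
```

This is why the quantized 13B model on an A10G is the sweet spot: it leaves most of the card's 24GB free for KV cache and batching.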

The Hybrid Pattern

The most cost-effective production pattern combines both:

- A hosted frontier model, via API, for complex multi-step reasoning.
- A local quantized model on a modest GPU for high-volume, low-complexity calls: routing, classification, summarization.

This hybrid approach uses the expensive frontier model only where it’s genuinely needed — complex multi-step reasoning — and handles the high-volume low-complexity calls locally. In practice, 60-70% of LLM calls in a complex agent pipeline can be handled by a local model, reducing API costs significantly while maintaining quality on the tasks that matter.
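A minimal sketch of the routing decision at the heart of this pattern. The task categories and backend names are hypothetical placeholders; a production router would classify tasks with a small local model rather than a lookup table.

```python
# Hybrid routing: cheap, high-volume call types go to a local model;
# anything requiring multi-step reasoning goes to the hosted frontier model.

LOCAL_TASKS = {"route", "classify", "summarize"}

def pick_backend(task_kind):
    return "local-13b" if task_kind in LOCAL_TASKS else "frontier-api"

calls = ["classify", "route", "summarize", "plan", "classify"]
backends = [pick_backend(c) for c in calls]
local_share = backends.count("local-13b") / len(backends)
```

Measuring `local_share` on your own traffic is the first step: if it lands in the 60-70% range the article cites, the hybrid setup pays for its GPU quickly.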

Start with CPU. Measure. Upgrade only when you have a specific, measured reason.
