Prompt Caching for LLM APIs

Prompt caching is a provider-side feature that stores the intermediate attention state (the KV cache) for a prefix of your prompt. On subsequent calls that share the same prefix, the model skips re-processing those tokens and charges a reduced rate.

Why it matters

LLM pricing is dominated by input tokens for most real apps. A coding agent with a 30k-token system prompt, a 10k-token tool schema, and 5k-token conversation history pays full rate on 45k tokens every turn — unless you cache.

With caching, the stable prefix is billed at roughly 10% (Anthropic) or 50% (OpenAI) of the input rate on cache hits. For high-traffic apps the savings are an order of magnitude.

How it works

Anthropic — you mark up to 4 cache_control breakpoints in your messages. Cached prefixes live for 5 minutes by default; a 1-hour tier is available at higher storage cost.
OpenAI — automatic on prompts >=1024 tokens. You cannot control cache boundaries, but you can improve hit rate by keeping the prefix stable.

The cache key is a hash of the prefix, the model, and (for Anthropic) the exact tools and system block. Any byte change invalidates it.

Keeping your hit rate high

Put static content (system prompt, tool definitions, fixed retrieval) first.
Put the volatile user turn last.
Never shuffle the order of tool definitions between calls.
Avoid timestamps or per-request IDs inside cached regions.

When it doesn't help

Short prompts under 1k tokens usually aren't cached at all.
One-shot calls to unique prompts — nothing to reuse.
Fine-tuned models sometimes do not support caching; check provider docs.
If your prefix changes every call, caching is a net loss on Anthropic (writes cost 25% more than normal input).

Failure modes

A silent cache miss looks like a 10x cost spike overnight. Watch cache_read_input_tokens in your billing logs — when the ratio drops below 80%, something in your prefix is mutating.