Prompt Caching for LLM APIs
Prompt caching is a provider-side feature that stores the intermediate attention state (the KV cache) for a prefix of your prompt. On subsequent calls that share the same prefix, the model skips re-processing those tokens and charges a reduced rate.
Why it matters
LLM pricing is dominated by input tokens for most real apps. A coding agent with a 30k-token system prompt, a 10k-token tool schema, and 5k-token conversation history pays full rate on 45k tokens every turn — unless you cache.
With caching, the stable prefix is billed at roughly 10% (Anthropic) or 50% (OpenAI) of the input rate on cache hits. For high-traffic apps the savings are an order of magnitude.
How it works
- Anthropic — you mark up to 4
cache_controlbreakpoints in your messages. Cached prefixes live for 5 minutes by default; a 1-hour tier is available at higher storage cost. - OpenAI — automatic on prompts >=1024 tokens. You cannot control cache boundaries, but you can improve hit rate by keeping the prefix stable.
The cache key is a hash of the prefix, the model, and (for Anthropic) the exact tools and system block. Any byte change invalidates it.
Keeping your hit rate high
- Put static content (system prompt, tool definitions, fixed retrieval) first.
- Put the volatile user turn last.
- Never shuffle the order of tool definitions between calls.
- Avoid timestamps or per-request IDs inside cached regions.
When it doesn't help
- Short prompts under 1k tokens usually aren't cached at all.
- One-shot calls to unique prompts — nothing to reuse.
- Fine-tuned models sometimes do not support caching; check provider docs.
- If your prefix changes every call, caching is a net loss on Anthropic (writes cost 25% more than normal input).
Failure modes
A silent cache miss looks like a 10x cost spike overnight. Watch cache_read_input_tokens in your billing logs — when the ratio drops below 80%, something in your prefix is mutating.