AIStackWatch
Back to wiki

Prompt Caching for LLM APIs

Prompt caching is a provider-side feature that stores the intermediate attention state (the KV cache) for a prefix of your prompt. On subsequent calls that share the same prefix, the model skips re-processing those tokens and charges a reduced rate.

Why it matters

LLM pricing is dominated by input tokens for most real apps. A coding agent with a 30k-token system prompt, a 10k-token tool schema, and 5k-token conversation history pays full rate on 45k tokens every turn — unless you cache.

With caching, the stable prefix is billed at roughly 10% (Anthropic) or 50% (OpenAI) of the input rate on cache hits. For high-traffic apps the savings are an order of magnitude.

How it works

  • Anthropic — you mark up to 4 cache_control breakpoints in your messages. Cached prefixes live for 5 minutes by default; a 1-hour tier is available at higher storage cost.
  • OpenAI — automatic on prompts >=1024 tokens. You cannot control cache boundaries, but you can improve hit rate by keeping the prefix stable.

The cache key is a hash of the prefix, the model, and (for Anthropic) the exact tools and system block. Any byte change invalidates it.

Keeping your hit rate high

  • Put static content (system prompt, tool definitions, fixed retrieval) first.
  • Put the volatile user turn last.
  • Never shuffle the order of tool definitions between calls.
  • Avoid timestamps or per-request IDs inside cached regions.

When it doesn't help

  • Short prompts under 1k tokens usually aren't cached at all.
  • One-shot calls to unique prompts — nothing to reuse.
  • Fine-tuned models sometimes do not support caching; check provider docs.
  • If your prefix changes every call, caching is a net loss on Anthropic (writes cost 25% more than normal input).

Failure modes

A silent cache miss looks like a 10x cost spike overnight. Watch cache_read_input_tokens in your billing logs — when the ratio drops below 80%, something in your prefix is mutating.