AIStackWatch

Tokens, Context, and Cost

LLM pricing is almost always quoted per million tokens, with separate rates for input and output. A token is roughly 0.75 English words or 3-4 characters of code. What looks like a small per-token price multiplies fast in agentic workflows.
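The rules of thumb above can be turned into a quick estimator. This is a minimal sketch, not a tokenizer: the function name `estimate_tokens` and the 3.5-characters midpoint are assumptions for illustration, and real counts vary by model and language.

```python
def estimate_tokens(text: str, is_code: bool = False) -> int:
    """Rough token estimate from the rules of thumb (not a real tokenizer)."""
    if is_code:
        # 3-4 characters of code per token; 3.5 is an assumed midpoint.
        return round(len(text) / 3.5)
    # One token is roughly 0.75 English words, so tokens ~= words / 0.75.
    return round(len(text.split()) / 0.75)


if __name__ == "__main__":
    print(estimate_tokens("What looks like a small per-token price multiplies fast"))
    print(estimate_tokens("for i in range(10): print(i)", is_code=True))
```

For anything that feeds a budget, count with the provider's own tokenizer or read the usage figures returned with each response.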

The shape of a bill

For a typical coding agent turn:

  • Input tokens: system prompt (5k) + tool schemas (3k) + conversation history (20k) + retrieved code (30k) = ~58k.
  • Output tokens: a few thousand for an edit plan plus patches.
  • Cached input tokens: ideally 90%+ of the stable prefix, billed at 10-50% of the full input rate.

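Input tokens dominate. Output is usually only 5-10% of the token count; because output is priced several times higher per token, its share of the bill is somewhat larger than that, but input still accounts for most of the spend. That surprises people used to chat apps, where output dominates because history is short.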
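To make the breakdown concrete, here is a sketch of the per-turn arithmetic. Several things are assumptions for illustration: the `Rates` class and `turn_cost` function are hypothetical names, the stable prefix is taken to be the system prompt plus tool schemas (8k), "a few thousand" output tokens is set to 3,000, and the rates are example figures that you should replace with a current quote.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Dollars per million tokens."""
    input: float
    output: float
    cached: float


def turn_cost(stable_prefix, other_input, output_tokens, cache_hit, rates):
    """Return (input_cost, output_cost) in dollars for one agent turn.

    cache_hit is the fraction of the stable prefix served from cache.
    """
    cached = stable_prefix * cache_hit
    fresh = (stable_prefix - cached) + other_input
    input_cost = (fresh * rates.input + cached * rates.cached) / 1_000_000
    output_cost = output_tokens * rates.output / 1_000_000
    return input_cost, output_cost


if __name__ == "__main__":
    rates = Rates(input=3.0, output=15.0, cached=0.30)  # illustrative figures
    stable = 5_000 + 3_000    # system prompt + tool schemas (assumed stable)
    other = 20_000 + 30_000   # conversation history + retrieved code
    for hit in (0.0, 0.9):
        inp, out = turn_cost(stable, other, 3_000, hit, rates)
        print(f"cache hit {hit:.0%}: input ${inp:.4f}, output ${out:.4f}")
```

Splitting the stable prefix from the rest of the input makes it easy to see how much of the bill caching can actually reach: only the part of the prompt that repeats byte-for-byte across turns.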

Current rates (per million tokens, April 2026)

  • Claude Opus 4.7 — input $15 / output $75 / cached $1.50.
  • Claude Sonnet 4.6 — input $3 / output $15 / cached $0.30.
  • GPT-5 — input $10 / output $30, automatic cache at $1.
  • DeepSeek V3 — input $0.27 / output $1.10. An order of magnitude cheaper; quality trails Sonnet on agentic tasks.

Prices move every quarter. Always re-quote from the provider before a capacity decision.
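One way to enforce the re-quote habit is to store the date alongside the rate table and refuse to estimate from stale numbers. The sketch below makes some assumptions: the dictionary keys are hypothetical labels rather than API model IDs, the quote date is set to the first of the month because the list only says April 2026, and 90 days stands in for "a quarter".

```python
from datetime import date, timedelta

# Snapshot of the figures listed above, in dollars per million tokens.
RATES = {
    "claude-opus-4.7": {"input": 15.00, "output": 75.00},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gpt-5": {"input": 10.00, "output": 30.00},
    "deepseek-v3": {"input": 0.27, "output": 1.10},
}
QUOTED_ON = date(2026, 4, 1)   # assumed day within April 2026
MAX_AGE = timedelta(days=90)   # roughly one quarter


def estimate(model, input_tokens, output_tokens, today):
    """Uncached cost in dollars; raises if the rate table is too old."""
    if today - QUOTED_ON > MAX_AGE:
        raise ValueError("rates are stale; re-quote from the provider")
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000


if __name__ == "__main__":
    for name in RATES:
        print(name, round(estimate(name, 58_000, 3_000, date(2026, 4, 15)), 4))
```

Passing `today` explicitly keeps the check easy to test; in a real script it would typically be `date.today()`.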

How to cut costs without losing quality

  • Cache the system prompt. See the prompt-caching article. Usually the single biggest win.
  • Prune tool schemas. Tools you added in month one that nobody calls still bill on every turn.
  • Down-route easy queries. Route classification and summarization to Haiku / DeepSeek; keep Opus for hard reasoning.
  • Batch async work. Batch API endpoints offer 50% off for 24-hour-turnaround jobs.
  • Stop early. If the agent has answered, end the turn. Many loops keep producing empty or filler text that still bills as output tokens.
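The down-routing item can be sketched as a small dispatch function plus a blended-rate calculation. The tier labels, the `EASY_TASKS` set, and both function names are hypothetical; a production router would more likely classify requests with a cheap model or heuristics than rely on a fixed task label.

```python
# Hypothetical tier labels; map them to whatever model IDs you actually deploy.
CHEAP = "cheap-tier"
STRONG = "strong-tier"

# Task types the list above suggests down-routing.
EASY_TASKS = {"classification", "summarization"}


def pick_model(task_type: str) -> str:
    """Send easy task types to the cheap tier, everything else to the strong one."""
    return CHEAP if task_type in EASY_TASKS else STRONG


def blended_input_rate(easy_share: float, cheap_rate: float, strong_rate: float) -> float:
    """Average input rate ($/M tokens) when easy_share of tokens go to the cheap tier."""
    return easy_share * cheap_rate + (1 - easy_share) * strong_rate


if __name__ == "__main__":
    print(pick_model("summarization"), pick_model("refactor-plan"))
    # Example: 60% of input tokens at $0.27/M, the rest at $15/M.
    print(round(blended_input_rate(0.6, 0.27, 15.0), 3))
```

With those example numbers the blended input rate comes to roughly $6.16 per million tokens, against $15 if everything went to the strong tier. The saving depends entirely on what share of traffic is genuinely easy, so measure that before committing to a router.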

Failure modes

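  • Unbounded output. max_tokens is not set, so a runaway response can burn $5 of Opus output (about 67k tokens at $75 per million) on a single turn.
  • Context reinjection. Every turn accidentally resends the entire conversation plus full tool outputs. Per-turn input then grows linearly with the number of turns, so cumulative cost grows quadratically.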
  • Debug loops. A retry-on-error path retries tool-call errors forever while you're asleep.
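Two cheap guards cover the unbounded-output and debug-loop cases: cap the worst-case output cost per call, and bound retries by both attempt count and spend. This is a minimal sketch under stated assumptions: `step` is a hypothetical zero-argument callable returning `(ok, result, cost_usd)`, and the default limits are placeholders, not recommendations.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend reaches the configured budget."""


def max_output_cost(max_tokens: int, output_rate_per_million: float) -> float:
    """Worst-case output cost in dollars for one call capped at max_tokens."""
    return max_tokens * output_rate_per_million / 1_000_000


def run_with_guards(step, max_attempts=3, budget_usd=1.00):
    """Retry step() until it succeeds, attempts run out, or the budget is hit.

    Returns (result, total_spent) on success.
    """
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        ok, result, cost = step()
        spent += cost  # failed attempts still cost money
        if ok:
            return result, spent
        if spent >= budget_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {attempt} attempts")
    raise RuntimeError(f"gave up after {max_attempts} attempts (${spent:.2f} spent)")


if __name__ == "__main__":
    # A 4,096-token cap at $75 per million bounds one call's output at about $0.31.
    print(round(max_output_cost(4_096, 75.0), 4))
```

The point of the design is that every loop has two independent exits. An attempt limit alone does not help when each attempt is expensive, and a budget alone does not help when failures are cheap but endless.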

When NOT to optimize

Before product-market fit, do not spend a week on prompt caching to save $3 a day. Build the product. Once your monthly spend exceeds your rent, start optimizing; the numbers compound fast.