AIStackWatch

TTFT and p95 Latency for LLMs

LLM latency has two parts. Time to first token (TTFT) is how long the API takes to emit the first token, and it is determined by queueing, the network round trip, and prompt processing (prefill). Inter-token latency (ITL), or its inverse, tokens per second, determines how long the rest of the response takes.
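As a rough illustration of the split, the sketch below times any iterable of streamed chunks and reports TTFT and decode rate separately. The names `time_stream` and `StreamTiming` are hypothetical, and it assumes one chunk per token, which real streaming APIs do not guarantee; for exact rates, count tokens with the provider's tokenizer.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float        # request start -> first chunk
    tokens: int          # chunks received (assumed: one chunk == one token)
    tokens_per_s: float  # decode rate measured after the first chunk


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a stream and split its latency into TTFT and decode rate."""
    start = clock()
    first = last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk defines TTFT
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    decode_time = last - first
    # The first chunk is covered by TTFT, so only count - 1 intervals remain.
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return StreamTiming(first - start, count, rate)
```

Create the stream lazily (for example, a generator that sends the request on first iteration) so that `start` is taken before the request goes out. The injectable `clock` is there only to make the function testable without real delays.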

Why splitting them matters

For a chat UI, TTFT is what users perceive as speed. A 300ms TTFT with 60 tok/s feels instant. A 2s TTFT with 200 tok/s feels broken even though its total time is shorter for any response longer than about 150 tokens. Always measure both.
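The comparison above can be checked with a little arithmetic. This sketch assumes a simple model in which total time is TTFT plus output tokens divided by a constant decode rate; the function names are illustrative.

```python
def total_time_s(ttft_s: float, tok_per_s: float, n_tokens: int) -> float:
    """Total response time under a constant-rate decode model."""
    return ttft_s + n_tokens / tok_per_s


def break_even_tokens(ttft_a: float, rate_a: float,
                      ttft_b: float, rate_b: float) -> float:
    """Response length at which configurations A and B finish together.

    Solves ttft_a + n / rate_a == ttft_b + n / rate_b for n.
    """
    if rate_a == rate_b:
        raise ValueError("equal decode rates never cross")
    return (ttft_b - ttft_a) / (1 / rate_a - 1 / rate_b)


# A: 300ms TTFT at 60 tok/s. B: 2s TTFT at 200 tok/s.
crossover = break_even_tokens(0.3, 60, 2.0, 200)  # roughly 146 tokens
```

Below the crossover the low-TTFT configuration wins on both perceived and total time; above it, the two metrics disagree, which is why they need to be tracked separately.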

For async jobs (batch summarization, embedding pipelines), total time is what matters; optimize throughput instead.

Typical numbers (April 2026, measured end-user)

  • Claude Sonnet 4.6 — p50 TTFT ~500ms, 85 tok/s output.
  • GPT-5 — p50 TTFT ~700ms, 70 tok/s output.
  • DeepSeek V3 via Fireworks — p50 TTFT ~250ms, 140 tok/s.
  • Llama 3.3 70B via Together — p50 TTFT ~200ms, 120 tok/s.
  • Groq/Cerebras — sub-100ms TTFT, 500+ tok/s on supported models.

What inflates TTFT

  • Large uncached prompts. Every 100k tokens of fresh input adds a second or more of prompt processing.
  • Cold model on the provider side. Rarely used models fall out of residency, and the first request after that pays for the warm-up.
  • Tool-use loops. Each tool-call round trip adds another full TTFT for the follow-up model request, plus the time your server takes to run the tool.
  • Geographic distance. A US-East user hitting a US-West endpoint pays ~80ms of network on every round trip.
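To see how tool-use loops stack up, here is a back-of-the-envelope estimate of the time until the user sees the first token of the final answer. It assumes a purely additive model with hypothetical parameters: every round pays one TTFT, the time to generate the tool call itself, and the tool's execution time. Real traces will vary round to round.

```python
def time_to_visible_answer_s(ttft_s: float,
                             tool_rounds: int,
                             tool_call_tokens: int,
                             tok_per_s: float,
                             tool_exec_s: float) -> float:
    """Estimate the delay before the final answer starts streaming."""
    # Each tool round: wait for the model, let it emit the tool call,
    # then run the tool on your own server.
    per_round = ttft_s + tool_call_tokens / tok_per_s + tool_exec_s
    # After the last tool result, the final answer still pays one more TTFT.
    return tool_rounds * per_round + ttft_s


# Two tool rounds at 500ms TTFT, 50-token calls at 100 tok/s, 200ms tools.
estimate = time_to_visible_answer_s(0.5, 2, 50, 100.0, 0.2)  # about 2.9 seconds
```

Even with a fast model, two tool rounds turn a 500ms TTFT into nearly three seconds of silence in this model, so cutting rounds often helps more than shaving per-request latency.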

What cuts TTFT

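  • Prompt caching. On a hit, prompt processing for the cached prefix drops to near zero.
  • Streaming. It does not change the model's TTFT, but without it the user waits for the whole response, so the perceived time to first token becomes the total response time.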
  • Route by region. Most providers have regional endpoints; use the closest.
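  • Prefer speculative-decoding providers (Fireworks, Together) for open models. This mainly raises tokens per second rather than cutting TTFT, but it shortens total response time.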
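Routing by region can be as simple as probing each endpoint and picking the one with the lowest median round-trip time. The sketch below assumes you already collect RTT samples per region; the region names and the `closest_region` helper are hypothetical, not part of any provider's API.

```python
from statistics import median
from typing import Mapping, Sequence


def closest_region(rtt_samples_ms: Mapping[str, Sequence[float]]) -> str:
    """Return the region with the lowest median RTT, ignoring unprobed ones."""
    medians = {
        region: median(samples)
        for region, samples in rtt_samples_ms.items()
        if samples  # skip regions with no successful probes
    }
    if not medians:
        raise ValueError("no RTT samples for any region")
    return min(medians, key=medians.get)


probes = {
    "us-east": [12.0, 15.0, 11.0],
    "us-west": [80.0, 78.0, 95.0],
    "eu-west": [],  # probe failed, so this region is not considered
}
best = closest_region(probes)
```

The median is used instead of the mean so that a single slow probe does not flip the routing decision.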

Failure modes

  • p95 blindness. Your p50 is 400ms; your p95 is 8 seconds because a little over 5% of requests hit a rate-limit retry path. Users feel p95, not p50.
  • Streaming broken by proxies. A corporate proxy buffers the response and TTFT jumps to total-response time. Test from real client networks.
  • Silent model downgrade. The provider rolls out a new serving stack and tokens per second drops 40%. Without alerting, you miss it for weeks.
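Two of these failure modes can be caught with very little code: report tail percentiles alongside the median, and compare the current decode rate against a baseline. The sketch uses the nearest-rank percentile definition; the helper names and the 20% alert threshold are assumptions to tune for your own traffic.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rate_regressed(baseline_tok_s: float, current_tok_s: float,
                   max_drop: float = 0.2) -> bool:
    """True when the decode rate fell by more than max_drop (a fraction)."""
    return current_tok_s < baseline_tok_s * (1 - max_drop)


# 94 fast requests and 6 that hit a slow retry path (TTFT in seconds).
ttfts = [0.4] * 94 + [8.0] * 6
p50 = percentile(ttfts, 50)  # 0.4
p95 = percentile(ttfts, 95)  # 8.0
```

A dashboard that showed only `p50` here would report a healthy 400ms while about one request in seventeen takes 8 seconds.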

When NOT to optimize

Background jobs and nightly analytical pipelines do not need sub-second TTFT. Trade latency for cost by using batch APIs and larger prompt sizes.