TTFT and p95 Latency for LLMs
LLM latency has two parts. Time to first token (TTFT) is how long it takes from sending the request to receiving the first output token; it is determined by the network round trip, queueing, and prompt processing (prefill). Inter-token latency (ITL), or its inverse, tokens per second, determines how long the rest of the response takes.
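Both numbers can be read off any streaming response by timestamping each chunk as it arrives. The following is a minimal sketch, not a definitive harness: measure_stream and fake_stream are hypothetical names, fake_stream stands in for a real streaming API call, and it assumes one chunk is roughly one token, which real APIs do not guarantee.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a chunk stream: TTFT plus mean inter-token latency (ITL)."""
    start = clock()          # assumption: the request is sent at this moment
    arrivals = []            # arrival time of each chunk, seconds since start
    for _ in chunks:
        arrivals.append(clock() - start)
    if not arrivals:
        raise ValueError("stream produced no chunks")
    # Gaps between consecutive chunks; a one-chunk stream has none.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": arrivals[0],
        "mean_itl_s": mean_itl,
        "tokens_per_s": 1.0 / mean_itl if mean_itl > 0 else float("inf"),
        "total_s": arrivals[-1],
    }


def fake_stream(n: int, first_delay: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API response."""
    time.sleep(first_delay)  # queueing + prompt processing
    yield "tok0"
    for i in range(1, n):
        time.sleep(gap)      # decode time per token
        yield f"tok{i}"


stats = measure_stream(fake_stream(n=5, first_delay=0.05, gap=0.01))
```

Passing the clock as a parameter keeps the function testable with a scripted clock; in production the default time.perf_counter is usually what you want.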
Why splitting them matters
For a chat UI, TTFT is what users perceive as speed. A 300ms TTFT with 60 tok/s feels instant. A 2s TTFT with 200 tok/s feels broken, even though its total time is shorter for any response longer than about 150 tokens. Always measure both.
For async jobs (batch summarization, embedding pipelines), total time is what matters; optimize throughput instead.
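The chat-UI comparison above is simple arithmetic: total time is TTFT plus output tokens divided by tokens per second. A small sketch of that model follows; the function names are illustrative, and it assumes a constant generation rate, which real streams only approximate.

```python
def total_time_s(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Total response time under a constant-rate model."""
    return ttft_s + output_tokens / tokens_per_s


def break_even_tokens(ttft_a: float, tps_a: float,
                      ttft_b: float, tps_b: float) -> float:
    """Output length at which configurations A and B finish together.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    Undefined (division by zero) when both rates are equal.
    """
    return (ttft_b - ttft_a) / (1.0 / tps_a - 1.0 / tps_b)


fast_start = (0.3, 60.0)     # 300ms TTFT, 60 tok/s
fast_decode = (2.0, 200.0)   # 2s TTFT, 200 tok/s

n_star = break_even_tokens(*fast_start, *fast_decode)

for n in (50, 500):
    a = total_time_s(*fast_start, n)
    b = total_time_s(*fast_decode, n)
    print(f"{n} tokens: fast start {a:.2f}s, fast decode {b:.2f}s")
```

With the two configurations from the text, the break-even point lands at roughly 146 output tokens: shorter replies finish first on the 300ms setup, longer ones finish first on the 200 tok/s setup, yet the latter still makes the user wait 2 seconds before anything appears.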
Typical numbers (April 2026, measured from the end user's side)
- Claude Sonnet 4.6 — p50 TTFT ~500ms, 85 tok/s output.
- GPT-5 — p50 TTFT ~700ms, 70 tok/s output.
- DeepSeek V3 via Fireworks — p50 TTFT ~250ms, 140 tok/s.
- Llama 3.3 70B via Together — p50 TTFT ~200ms, 120 tok/s.
- Groq/Cerebras — sub-100ms TTFT, 500+ tok/s on supported models.
What inflates TTFT
- Large uncached prompts. Every 100k tokens of fresh input adds a second or more of prompt processing.
- Cold model on the provider side. Rarely used models fall out of residency, and the first request pays the cost of loading them again.
- Tool-use loops. Each tool-call round trip adds another full model TTFT, plus the time your server takes to run the tool.
- Geographic distance. A US-East user hitting a US-West endpoint pays ~80ms of network on every round trip.
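How these factors stack up in a tool-use loop can be estimated with a rough additive model. This is a sketch under stated assumptions, not a measured result: the function name and all parameters are hypothetical, and it treats every model call as costing one network round trip plus one TTFT.

```python
def wait_before_final_answer_s(model_ttft_s: float,
                               tool_rounds: int,
                               tool_call_gen_s: float = 0.0,
                               tool_exec_s: float = 0.0,
                               network_rtt_s: float = 0.0) -> float:
    """Rough wait before the user sees the first token of the final answer.

    Assumes each tool round is: one model call (network + TTFT), time for
    the model to finish emitting the tool call, then tool execution on
    your server. The final answer costs one more model call.
    """
    per_model_call = network_rtt_s + model_ttft_s
    per_round = per_model_call + tool_call_gen_s + tool_exec_s
    return tool_rounds * per_round + per_model_call


# Illustrative inputs: 500ms TTFT, 3 tool rounds, 200ms to emit each
# tool call, 300ms to run each tool, 80ms cross-country round trip.
wait = wait_before_final_answer_s(
    model_ttft_s=0.5,
    tool_rounds=3,
    tool_call_gen_s=0.2,
    tool_exec_s=0.3,
    network_rtt_s=0.08,
)
```

With these illustrative inputs the wait comes to about 3.8 seconds, even though a single call's TTFT is only half a second.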
What cuts TTFT
- Prompt caching. Drops prompt-processing to near zero on hits.
- Streaming. It does not reduce TTFT itself, but without it the user sees nothing until the whole response is done, so the effective wait becomes the total response time.
- Route by region. Most providers have regional endpoints; use the closest.
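- Prefer speculative-decoding providers (Fireworks, Together) for open models. Speculative decoding mainly speeds up token generation, so expect the gain in tok/s more than in TTFT.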
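Routing by region can start from measurements rather than a map. A minimal sketch, assuming you have already collected round-trip samples per endpoint from the client's network; the region names and sample values below are made up for illustration.

```python
import statistics


def pick_region(rtt_samples_ms: dict[str, list[float]]) -> str:
    """Return the region whose measured round trips have the lowest median."""
    if not rtt_samples_ms:
        raise ValueError("no regions to choose from")
    # Median rather than mean, so one slow outlier does not decide the route.
    return min(rtt_samples_ms,
               key=lambda region: statistics.median(rtt_samples_ms[region]))


# Hypothetical measurements, in milliseconds.
samples = {
    "us-east": [12.0, 15.0, 11.0],
    "us-west": [82.0, 79.0, 85.0],
    "eu-west": [95.0, 90.0, 400.0],
}
closest = pick_region(samples)
```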
Failure modes
- p95 blindness. Your p50 is 400ms; your p95 is 8 seconds because just over 5% of requests hit a rate-limit retry path. Users feel p95, not p50.
- Streaming broken by proxies. A corporate proxy buffers the response and TTFT jumps to total-response time. Test from real client networks.
- Silent model downgrade. Provider rolls out a new serving stack and output speed drops 40% (ITL rises by about two thirds). Without alerting, you miss it for weeks.
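Two of these failure modes are cheap to catch in code: compute tail percentiles alongside the median, and compare current output speed against a stored baseline. The sketch below uses a nearest-rank percentile on synthetic data; the 20% alert threshold and the function names are assumptions, not recommendations from any provider.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def throughput_regressed(baseline_tps: float, current_tps: float,
                         max_drop: float = 0.2) -> bool:
    """True when output speed fell by more than max_drop versus the baseline."""
    return current_tps < baseline_tps * (1.0 - max_drop)


# Synthetic TTFT samples in seconds: 94 normal requests, 6 on a retry path.
latencies = [0.4] * 94 + [8.0] * 6
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)

# A 40% drop in output speed, e.g. from 85 tok/s to 51 tok/s.
alert = throughput_regressed(baseline_tps=85.0, current_tps=51.0)
```

On this synthetic data the median stays at 0.4 seconds while p95 is 8 seconds, which is why a dashboard that only tracks p50 looks healthy. If exactly 5 of the 100 requests were slow, nearest-rank p95 would still read 0.4 seconds, so it is worth tracking p99 as well.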
When NOT to optimize
Background jobs and nightly analytical pipelines do not need sub-second TTFT. Trade latency for cost by using batch APIs and larger prompt sizes.