TTFT and p95 Latency for LLMs

LLM latency has two parts. Time to first token (TTFT) is how long the API takes to emit the first output token, and is determined mainly by queueing, network round trips, and prompt processing (prefill). Inter-token latency (ITL), whose inverse is output tokens per second, determines how long the rest of the response takes.
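As a rough illustration of the two metrics, the sketch below times any iterable of streamed chunks. It assumes the request is sent when iteration starts and that each chunk is roughly one token, which is not true of every provider; measure_stream and its return keys are hypothetical names, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a streamed response and report TTFT, mean ITL, and tokens/second.

    Assumption: iterating `chunks` is what triggers the request, so the clock
    started here covers queueing, network, and prompt processing.
    """
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk defines TTFT
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")

    ttft = first - start
    # ITL is the average gap between chunks, so it needs at least two chunks.
    mean_itl = (last - first) / (count - 1) if count > 1 else None
    tok_per_s = 1.0 / mean_itl if mean_itl else None
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean_itl,
        "tok_per_s": tok_per_s,
        "chunks": count,
    }
```

Passing the clock in as a parameter keeps the function testable with fake timestamps. In production the default time.perf_counter is a reasonable choice because it is monotonic.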

Why splitting them matters

For a chat UI, TTFT is what users perceive as speed. A 300ms TTFT with 60 tok/s feels instant. A 2s TTFT with 200 tok/s feels broken, even though its total time is shorter for any response longer than about 150 tokens. Always measure both.
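The arithmetic behind the two chat configurations above can be checked with a simple model, total time = TTFT + output tokens / tokens per second. This is a sketch that assumes a constant generation rate; the function names are illustrative.

```python
def total_time_s(ttft_s: float, tok_per_s: float, output_tokens: int) -> float:
    """Estimated wall-clock time for a full response at a constant rate."""
    return ttft_s + output_tokens / tok_per_s


def break_even_tokens(ttft_a: float, rate_a: float,
                      ttft_b: float, rate_b: float) -> float:
    """Output length at which configurations A and B finish at the same time."""
    if rate_a == rate_b:
        raise ValueError("equal rates: the totals never cross")
    return (ttft_b - ttft_a) / (1.0 / rate_a - 1.0 / rate_b)


# A: 300ms TTFT at 60 tok/s.  B: 2s TTFT at 200 tok/s.
crossover = break_even_tokens(0.3, 60.0, 2.0, 200.0)   # about 146 tokens
fast_start = total_time_s(0.3, 60.0, 500)              # about 8.6 s
fast_finish = total_time_s(2.0, 200.0, 500)            # 4.5 s
```

For a 500-token reply the slow-starting configuration finishes roughly four seconds earlier, yet the user stares at a blank screen for 2s first. That is why the two numbers have to be tracked separately.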

For async jobs (batch summarization, embedding pipelines), total time is what matters; optimize throughput instead.

Typical numbers (April 2026, measured from the end user's side)

Claude Sonnet 4.6 — p50 TTFT ~500ms, 85 tok/s output.
GPT-5 — p50 TTFT ~700ms, 70 tok/s output.
DeepSeek V3 via Fireworks — p50 TTFT ~250ms, 140 tok/s.
Llama 3.3 70B via Together — p50 TTFT ~200ms, 120 tok/s.
Groq/Cerebras — sub-100ms TTFT, 500+ tok/s on supported models.

What inflates TTFT

Large uncached prompts. Every 100k tokens of fresh input adds a second or more of prompt processing.
Cold model on the provider side. Rarely used models fall out of residency, and the first request pays to load them again.
Tool-use loops. Each tool-call round trip adds another full model TTFT, plus the time your server takes to run the tool.
Geographic distance. A US-East user hitting a US-West endpoint pays ~80ms of network on every round trip.
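The factors above can be combined into a back-of-the-envelope budget. The sketch below is a simplified additive model, not a measurement: the prefill rate of 100k tokens per second is only the upper bound implied by "a second or more per 100k tokens", and every field and function name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TtftBudget:
    """Additive estimate of TTFT for a single model call (all values assumed)."""
    queue_s: float                 # time waiting for a serving slot
    uncached_prompt_tokens: int    # input tokens not served from a prompt cache
    prefill_tok_per_s: float       # assumed prompt-processing rate
    network_rtt_s: float           # client <-> endpoint round trip

    def ttft_s(self) -> float:
        prefill_s = self.uncached_prompt_tokens / self.prefill_tok_per_s
        return self.queue_s + prefill_s + self.network_rtt_s


def time_to_visible_answer_s(ttft_s: float, tool_rounds: int,
                             tool_call_gen_s: float, tool_exec_s: float) -> float:
    """Time until the first token of the final answer in a tool-use loop.

    Each round costs one model TTFT, the time to generate the tool call,
    and the tool's execution time, before the final call pays TTFT again.
    """
    per_round_s = ttft_s + tool_call_gen_s + tool_exec_s
    return tool_rounds * per_round_s + ttft_s


budget = TtftBudget(queue_s=0.05, uncached_prompt_tokens=100_000,
                    prefill_tok_per_s=100_000.0, network_rtt_s=0.08)
single_call = budget.ttft_s()                              # about 1.13 s
with_tools = time_to_visible_answer_s(0.5, 2, 0.2, 0.3)    # 2.5 s
```

Even with a modest 500ms TTFT, two tool rounds push the first visible answer token to 2.5s in this model, so tool loops often dominate the other factors.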

What cuts TTFT

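Prompt caching. Drops prompt processing for the cached prefix to near zero on hits.
Streaming. It does not reduce server-side TTFT, but without it the user sees nothing until the whole response is done, so the perceived wait becomes the total response time.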
Route by region. Most providers have regional endpoints; use the closest.
Prefer speculative-decoding providers (Fireworks, Together) for open models. The gain shows up mostly in ITL rather than TTFT, but it shortens the total response time.
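Routing by region can be as simple as probing each endpoint a few times and picking the lowest median round trip. The sketch below covers only the selection step and takes the measurements as input; the region names are placeholders, not any provider's actual endpoint list.

```python
from statistics import median


def pick_region(rtt_samples_ms: dict[str, list[float]]) -> str:
    """Return the region with the lowest median round-trip time.

    Regions with no samples (for example, probes that all failed) are skipped.
    The median is used instead of the mean so one slow probe does not
    disqualify an otherwise close region.
    """
    medians = {region: median(samples)
               for region, samples in rtt_samples_ms.items() if samples}
    if not medians:
        raise ValueError("no region has any RTT samples")
    return min(medians, key=medians.get)


# Hypothetical probe results in milliseconds.
probes = {
    "us-east": [12.0, 15.0, 11.0],
    "us-west": [80.0, 78.0, 85.0],
    "eu-west": [],  # probe failed, so this region is ignored
}
closest = pick_region(probes)
```

Re-probing periodically is a sensible addition, since the closest region for a mobile or VPN user can change during a session.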

Failure modes

p95 blindness. Your p50 is 400ms; your p95 is 8 seconds because just over 5% of requests hit a rate-limit retry path. Users feel p95, not p50.
Streaming broken by proxies. A corporate proxy buffers the response and TTFT jumps to total-response time. Test from real client networks.
Silent serving regression. Provider rolls out a new serving stack and output throughput drops 40%, which means ITL rises by about two thirds. Without alerting you miss it for weeks.
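Two of these failure modes can be caught with very little code: track p95 alongside p50, and compare recent throughput against a stored baseline. The sketch below uses the nearest-rank percentile definition and an arbitrary 20% alert threshold; both choices, and the function names, are assumptions to tune for your own traffic.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_regressed(baseline_tok_per_s: float, recent_tok_per_s: float,
                         max_drop: float = 0.2) -> bool:
    """True if recent throughput fell more than max_drop below the baseline."""
    return recent_tok_per_s < baseline_tok_per_s * (1.0 - max_drop)


# 93 fast requests and 7 that hit a slow retry path (TTFT in seconds).
ttfts = [0.4] * 93 + [8.0] * 7
p50 = percentile(ttfts, 50)   # 0.4: the median looks healthy
p95 = percentile(ttfts, 95)   # 8.0: the tail does not

# A 40% throughput drop against an 85 tok/s baseline trips the alert.
alert = throughput_regressed(85.0, 51.0)
```

Computing the recent figure over a rolling window, for example the last hour of production requests, keeps a single slow response from paging anyone while still catching a sustained regression within the day.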

When NOT to optimize

Background jobs and nightly analytical pipelines do not need sub-second TTFT. Trade latency for cost by using batch APIs and larger prompt sizes.