LLM Observability Stack
LLM observability is the practice of logging every step of an LLM pipeline — inputs, retrieved context, tool calls, raw outputs, token counts, latency, cost — so that you can diagnose problems and improve quality over time. Standard APM (Datadog, New Relic) isn't enough; traces need to preserve prompts and token-level detail.
What you should capture
- Full prompt — system, user, assistant turns, tool schemas, retrieved documents.
- Full response — final text plus any intermediate
tool_usecalls. - Token counts — input, output, cached. Used for both cost and ceiling analysis.
- Latency breakdown — time to first token, total, tool-call wait time.
- Metadata — user ID, session ID, feature flag, model name, prompt version.
- Feedback — thumbs up/down, edit distance between suggestion and accepted text.
The tools
- Helicone — sits in front of your API calls as a proxy. Easy install, great for cost and cache-hit visibility.
- Arize — stronger on ML-ops patterns; evaluation datasets and drift detection in one place.
- LangSmith — tight integration with LangChain/LangGraph; traces show graph structure natively.
- Braintrust — pairs evals with online logging; good when you want one system for both.
What observability unlocks
- Regression detection — last week's p95 latency was 4s, today it's 11s. A tool-call loop is stuck. Trace view shows it in seconds.
- Prompt version diffing — deploy v3 of your classifier prompt, compare win-rate against v2 on the same traffic.
- Real-traffic eval sets — sample production sessions with a low score into your eval dataset. Free labeled data.
- Cost attribution — per-user, per-feature, per-prompt-version. Essential before you cap any freemium plan.
Privacy and cost concerns
- Logging full prompts captures PII by default. Redact on the way in.
- Retention is expensive at scale. 30 days hot, archive older.
- Sampling at 10-20% for high-volume endpoints is usually fine once evals are solid.
When NOT to add it yet
Pre-traffic prototypes don't need observability — you're watching stdout. Add it the week you onboard the first external user.