LLM Observability Stack

LLM observability is the practice of logging every step of an LLM pipeline — inputs, retrieved context, tool calls, raw outputs, token counts, latency, cost — so that you can diagnose problems and improve quality over time. Standard APM (Datadog, New Relic) isn't enough; traces need to preserve prompts and token-level detail.

What you should capture

Full prompt — system, user, assistant turns, tool schemas, retrieved documents.
Full response — final text plus any intermediate tool_use calls.
Token counts — input, output, cached. Used for both cost and ceiling analysis.
Latency breakdown — time to first token, total, tool-call wait time.
Metadata — user ID, session ID, feature flag, model name, prompt version.
Feedback — thumbs up/down, edit distance between suggestion and accepted text.

The tools

Helicone — sits in front of your API calls as a proxy. Easy install, great for cost and cache-hit visibility.
Arize — stronger on ML-ops patterns; evaluation datasets and drift detection in one place.
LangSmith — tight integration with LangChain/LangGraph; traces show graph structure natively.
Braintrust — pairs evals with online logging; good when you want one system for both.

What observability unlocks

Regression detection — last week's p95 latency was 4s, today it's 11s. A tool-call loop is stuck. Trace view shows it in seconds.
Prompt version diffing — deploy v3 of your classifier prompt, compare win-rate against v2 on the same traffic.
Real-traffic eval sets — sample production sessions with a low score into your eval dataset. Free labeled data.
Cost attribution — per-user, per-feature, per-prompt-version. Essential before you cap any freemium plan.

Privacy and cost concerns

Logging full prompts captures PII by default. Redact on the way in.
Retention is expensive at scale. 30 days hot, archive older.
Sampling at 10-20% for high-volume endpoints is usually fine once evals are solid.

When NOT to add it yet

Pre-traffic prototypes don't need observability — you're watching stdout. Add it the week you onboard the first external user.