AIStackWatch
Back to wiki

LLM Observability Stack

LLM observability is the practice of logging every step of an LLM pipeline — inputs, retrieved context, tool calls, raw outputs, token counts, latency, cost — so that you can diagnose problems and improve quality over time. Standard APM (Datadog, New Relic) isn't enough; traces need to preserve prompts and token-level detail.

What you should capture

  • Full prompt — system, user, assistant turns, tool schemas, retrieved documents.
  • Full response — final text plus any intermediate tool_use calls.
  • Token counts — input, output, cached. Used for both cost and ceiling analysis.
  • Latency breakdown — time to first token, total, tool-call wait time.
  • Metadata — user ID, session ID, feature flag, model name, prompt version.
  • Feedback — thumbs up/down, edit distance between suggestion and accepted text.

The tools

  • Helicone — sits in front of your API calls as a proxy. Easy install, great for cost and cache-hit visibility.
  • Arize — stronger on ML-ops patterns; evaluation datasets and drift detection in one place.
  • LangSmith — tight integration with LangChain/LangGraph; traces show graph structure natively.
  • Braintrust — pairs evals with online logging; good when you want one system for both.

What observability unlocks

  • Regression detection — last week's p95 latency was 4s, today it's 11s. A tool-call loop is stuck. Trace view shows it in seconds.
  • Prompt version diffing — deploy v3 of your classifier prompt, compare win-rate against v2 on the same traffic.
  • Real-traffic eval sets — sample production sessions with a low score into your eval dataset. Free labeled data.
  • Cost attribution — per-user, per-feature, per-prompt-version. Essential before you cap any freemium plan.

Privacy and cost concerns

  • Logging full prompts captures PII by default. Redact on the way in.
  • Retention is expensive at scale. 30 days hot, archive older.
  • Sampling at 10-20% for high-volume endpoints is usually fine once evals are solid.

When NOT to add it yet

Pre-traffic prototypes don't need observability — you're watching stdout. Add it the week you onboard the first external user.