Link
Reddit Discussion: How do you deal with LLM observability? What tools do you guys use?
Ari’s Comment: Ari’s post on LLM Observability and what it does
What the comment covers
LLM monitoring is about collecting, visualizing, and setting up alerts based on general metrics (latency, tokens, cost…) or custom KPIs (evaluations).
LLM observability tools provide an SDK to log LLM calls from your code or an LLM Proxy to intercept requests. You can use the SDK to manually log inputs/outputs of LLMs and other steps like preprocessing or retrieval from a vector database.
My Thoughts
Overall takeaway
Traditional software observability grew up around deterministic systems, where the same input reliably produces the same output.
LLMs, by contrast, are probabilistic systems that introduce variability by design.
This variability means we need new ways to think about system health and reliability.
This makes observability both more challenging and more crucial.
Monitoring what your AI/LLM system is doing, by collecting data and generating metrics from it, is a good way to track its health.
As models change, and new models are tested, it’s helpful to know exactly what’s happening and where.
Observability tools provide a total view into the health and behavior of the application, which is crucial when using a non-deterministic system like LLMs.
Ari’s comment suggests the following observability scenario:
You can use an LLM with a specific prompt to evaluate LLM generations and ask the LLM to judge the generation on hallucinations, toxicity, context relevancy for RAG applications, or any criteria you think are useful.
While this seems like a great end goal, starting small and building up is suggested.
Practical Implementation
Teams adopting LLM observability generally follow a progression of increasing sophistication:
- Basic logging of inputs and outputs using the observability platform’s SDK
- Add tracing of preprocessing steps and external service calls
- Implement automated evaluations
Step 1 gives visibility into which prompts are being sent and which responses come back.
Step 2 helps you understand the full context of each LLM interaction (things like vector database retrievals).
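To make steps 1 and 2 concrete, here is a minimal sketch of manual logging and tracing. It assumes the OpenAI Python SDK for the LLM call; the `vector_store.search` retrieval helper, its document objects, and the JSONL file sink are illustrative placeholders, and a real observability platform's SDK would replace the hand-rolled `log_event`.

```python
import json
import time
import uuid

from openai import OpenAI  # assumes the OpenAI Python SDK; any client works

client = OpenAI()

def log_event(trace_id: str, step: str, payload: dict) -> None:
    """Append one structured record per step; a real setup would ship this to an observability backend."""
    record = {"trace_id": trace_id, "step": step, "ts": time.time(), **payload}
    with open("llm_trace.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

def answer_question(question: str, vector_store) -> str:
    trace_id = str(uuid.uuid4())

    # Step 2: trace the retrieval call that feeds the prompt
    docs = vector_store.search(question, top_k=3)  # hypothetical retrieval helper
    context = "\n".join(d.text for d in docs)      # assumes each doc exposes a .text field
    log_event(trace_id, "retrieval", {"query": question, "num_docs": len(docs)})

    # Step 1: log the exact prompt sent and the response received
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    start = time.time()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    log_event(trace_id, "llm_call", {
        "prompt": prompt,
        "response": answer,
        "latency_s": round(time.time() - start, 3),
        "total_tokens": resp.usage.total_tokens,
    })
    return answer
```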
Step 3 might have an evaluation LLM, sketched after the list below, that checks each response for:
- Factual accuracy by comparing against retrieved context
- Tone and appropriateness for your use case
- Consistency with previous responses
- Relevance to the original query
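As a sketch of what that judge could look like, the snippet below asks a judge model to grade a response against the retrieved context and return structured scores. The judge prompt, the criteria keys, and the model name are assumptions rather than a prescribed rubric; checking consistency with previous responses would additionally require passing conversation history, which is omitted here.

```python
import json

from openai import OpenAI  # assumes the OpenAI Python SDK

client = OpenAI()

JUDGE_PROMPT = """You are grading an LLM response. Score each criterion from 1 to 5
and return JSON with keys: factual_accuracy, tone, relevance, notes.

Retrieved context:
{context}

Original query:
{query}

Response to grade:
{response}
"""

def evaluate_response(query: str, context: str, response: str) -> dict:
    """Ask a judge model to score one response against the retrieved context."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, query=query, response=response)}],
        response_format={"type": "json_object"},  # request machine-readable scores
    )
    scores = json.loads(resp.choices[0].message.content)
    # Scores can be logged next to the original trace and alerted on, e.g.
    # flag anything with factual_accuracy below 3 for human review.
    return scores
```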
As you can imagine, with more sophistication, scale and performance considerations come into play.
Scale & Performance Considerations
First off, you need to decide whether to log every LLM interaction.
Some teams choose a sampling strategy, logging every nth request or using a more intelligent sampling technique based on request characteristics.
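A sampling decision can be as small as the function below. It is a sketch: the `user_tier` and `had_error` characteristics and the 1-in-10 rate are made-up examples of the kinds of signals a team might key off.

```python
import hashlib

SAMPLE_ONE_IN_N = 10  # illustrative "every nth request" rate

def should_log_full_trace(request_id: str, user_tier: str = "free", had_error: bool = False) -> bool:
    """Decide whether to keep the full trace for this request.

    Errors and high-value traffic are always kept; everything else is sampled
    deterministically, so the same request_id always gets the same decision.
    """
    if had_error or user_tier == "enterprise":  # characteristic-based sampling
        return True
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % SAMPLE_ONE_IN_N
    return bucket == 0
```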
Second, you need to consider the storage implications of keeping detailed traces of things like large context windows.
This storage challenge becomes increasingly important as context windows grow.
For example, OpenAI has hinted at moving toward infinite context windows, which would dramatically increase storage requirements for comprehensive logging.
Some teams choose retention policies that keep detailed logs for recent interactions while summarizing older data.
Other teams are developing more creative solutions to this storage challenge.
Some implement smart compression for older logs, keeping only key metrics and representative samples.
Other teams keep comprehensive logs of recent interactions in high-speed storage and then move older logs to slower, cheaper storage tiers with selective retention of key insights.
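A tiering policy along those lines might look like the sketch below; the cutoffs and tier names are assumptions chosen to illustrate the idea, not recommendations.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=7)    # full traces in fast storage (assumed cutoff)
WARM_WINDOW = timedelta(days=90)  # compressed traces in cheaper storage (assumed cutoff)

def retention_tier(logged_at: datetime) -> str:
    """Map a log record's age to a storage tier."""
    age = datetime.now(timezone.utc) - logged_at
    if age <= HOT_WINDOW:
        return "hot"   # keep prompt, response, and the full trace
    if age <= WARM_WINDOW:
        return "warm"  # keep metrics plus a truncated or compressed trace
    return "cold"      # keep only summary metrics and representative samples
```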
Lastly, because LLMs have not yet reached the “intelligence for free” stage, providers still charge based on the number of tokens used per request.
Granted, LLM costs have generally been dropping by around 10x per year for constant quality, so the pressure eases each year, but if a large evaluation runs on every response, it can still become a cost issue.
Some teams choose to evaluate representative transactions or focus on high-risk interactions.
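A quick back-of-the-envelope calculation shows why sampling evaluations matters; every number below (pricing, token counts, traffic) is a placeholder to be swapped for your provider's current rates and your own usage.

```python
PRICE_PER_1K_INPUT = 0.0005    # USD per 1K input tokens for the judge model (assumed)
PRICE_PER_1K_OUTPUT = 0.0015   # USD per 1K output tokens for the judge model (assumed)

def eval_cost_per_month(requests_per_day: int,
                        avg_input_tokens: int = 1500,   # context + query + response being judged
                        avg_output_tokens: int = 150,   # the judge's scores
                        sample_rate: float = 1.0) -> float:
    """Estimate the monthly cost of running an LLM judge over (a sample of) responses."""
    per_request = (
        (avg_input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (avg_output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )
    return per_request * requests_per_day * 30 * sample_rate

# At 50,000 requests/day with these assumed prices:
#   eval_cost_per_month(50_000)                   -> ~$1,462/month (evaluate everything)
#   eval_cost_per_month(50_000, sample_rate=0.1)  -> ~$146/month  (evaluate a 10% sample)
```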
Trade-offs & Limitations
Per the above and Ari’s example, the biggest trade-off in LLM observability revolves around depth versus performance.
Having an LLM evaluate another LLM’s output adds latency and cost to each interaction.
Applications with strict latency requirements may end up having to compromise on observability.
Additionally, using a non-deterministic LLM evaluator means that the evaluator itself might make a mistake or be inconsistent.
Some teams choose to require human validation of the evaluations themselves.
For technical teams and leaders, robust LLM observability unlocks several strategic capabilities.
Understanding these capabilities helps teams make informed decisions about which trade-offs are worth making for their specific use case.
Strategic Implications
- Quality Assurance: Teams can measure response quality and track it over time, which is especially important when switching models or updating prompts.
- Cost Management: Detailed token usage tracking can help optimize prompts and identify opportunities for cost reduction; a small tracking sketch follows this list.
- Risk Management: Automated evaluation of responses helps catch potential issues before they impact users, particularly important for customer-facing applications.
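For the cost-management point above, per-template token accounting might look like this sketch; the `usage` object mirrors the usage block most chat completion APIs return, and the field names may differ by provider.

```python
from collections import defaultdict

# Aggregate token usage by prompt template so cost hot spots are visible.
usage_by_template = defaultdict(lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0})

def record_usage(template_name: str, usage) -> None:
    """Record the token counts from one LLM response under its prompt template."""
    bucket = usage_by_template[template_name]
    bucket["calls"] += 1
    bucket["prompt_tokens"] += usage.prompt_tokens
    bucket["completion_tokens"] += usage.completion_tokens

def top_templates_by_tokens(n: int = 5):
    """Return the n templates consuming the most total tokens."""
    return sorted(
        usage_by_template.items(),
        key=lambda kv: kv[1]["prompt_tokens"] + kv[1]["completion_tokens"],
        reverse=True,
    )[:n]
```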
Team Adoption Patterns
Successful teams typically implement observability in phases:
- They start with basic metrics everyone understands (latency, token counts, and error rates), which builds team familiarity with the tooling and concepts.
- Then they add custom evaluations for their specific use cases, often starting with high-risk or high-value interactions.
- Finally, they integrate observability data into their development workflow, using it to guide prompt engineering and model selection decisions.
The phased approach to observability mirrors how teams naturally build trust in any new system.
Start with what’s easily understood and measurable, then gradually expand as confidence grows.
This patience pays off in better long-term adoption and more sophisticated usage patterns.
Key Takeaways for AI Agent Development
Observability in AI agent development is fundamentally different from traditional software observability because we’re dealing with non-deterministic systems (and potentially non-deterministic evaluators of said systems).
The key is not just monitoring what happened, but understanding why it happened and whether it was appropriate.
This requires a combination of quantitative metrics and qualitative evaluation.
Looking forward, as AI agents become more autonomous and handle more complex tasks, observability will become even more critical.
Teams will need to develop increasingly sophisticated evaluation frameworks that can assess not just individual responses, but entire chains of reasoning and action.
The goal isn’t perfect prediction or control - that’s impossible with LLMs - but rather building systems we can understand and trust through comprehensive visibility into their operation.
This comprehensive visibility then becomes the foundation for building reliable, trustworthy AI Agents that teams can confidently deploy and maintain.