How do you measure the reliability of LLM outputs in production?
Run the task multiple times, track the dominant result, and compute a dynamic confidence score with statistical methods (LLN, CLT). Practical guidelines and code in this post.
As developers, we often use Large Language Models (LLMs) in real-world systems where not only the answer to a task matters, but also how confident we are in that answer. In an experiment, I ran a simple task—"How many R's are in 'strawberry'?"—103 times sequentially and independently on the DeepSeek-R1-Distill-Llama-8B model. This experiment illustrates two key aspects of a production system:
Task Result: What is the output of the task?
Probabilistic Confidence: How much trust can we put in that output?
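To make the setup concrete, here is a minimal sketch of that execution loop. It assumes the model is served behind an OpenAI-compatible endpoint (for example via vLLM or Ollama); the base_url, api_key, and model identifier are placeholders for whatever your deployment actually uses.

```python
from collections import Counter
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint serving the model locally.
# Adjust base_url, api_key, and MODEL to match your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "deepseek-r1-distill-llama-8b"  # placeholder model identifier

def run_task(prompt: str) -> str:
    """Execute the task once and return the raw answer text."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def repeat_task(prompt: str, n_runs: int) -> Counter:
    """Run the same task n_runs times, independently, and tally each distinct answer."""
    return Counter(run_task(prompt) for _ in range(n_runs))

if __name__ == "__main__":
    counts = repeat_task("How many R's are in 'strawberry'?", n_runs=103)
    answer, hits = counts.most_common(1)[0]
    print(f"Dominant answer: {answer!r} ({hits}/{sum(counts.values())} runs)")
```

The Counter of answers is the raw material for both pieces of information: the dominant result and the confidence we can attach to it.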
When we observe successive executions, both pieces of information evolve. The outcome of the task can be inferred using principles like the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT); if the results do not converge to a stable answer, it may mean the task is inherently ambiguous for the model. The confidence, meanwhile, is evaluated dynamically from the number of executions and the distribution of outputs.
In other words, you want to ask, "Now that I've done X executions, and I see that Y of them point to one particular result while the rest differ, how confident am I that this result is indeed the correct one?"
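One simple way to put a number on that question is a binomial confidence interval around the proportion of runs that agree with the dominant answer. The sketch below uses the normal approximation that the CLT justifies once the run count is reasonably large; the 95% level (z ≈ 1.96) and the example counts are illustrative choices, not the actual distribution from the experiment above.

```python
import math
from collections import Counter

def dominant_confidence(counts: Counter, z: float = 1.96):
    """Estimate how much trust to put in the dominant answer.

    Returns the dominant answer, its observed proportion p_hat, and a
    CLT-based (normal-approximation) confidence interval [low, high]
    at the level implied by z (1.96 ~= 95%).
    """
    n = sum(counts.values())
    answer, k = counts.most_common(1)[0]
    p_hat = k / n
    # Standard error of a proportion; shrinks as 1/sqrt(n), per the CLT.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    low, high = max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)
    return answer, p_hat, low, high

# Illustrative counts for 103 runs (not the experiment's actual distribution).
counts = Counter({"3": 96, "2": 5, "4": 2})
answer, p_hat, low, high = dominant_confidence(counts)
print(f"'{answer}' observed in {p_hat:.1%} of runs; 95% CI [{low:.1%}, {high:.1%}]")
```

As more executions accumulate, the interval narrows; if it stays wide or the dominant answer keeps changing, that is a signal the task is ambiguous for the model rather than a matter of insufficient sampling.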
This post is part of a series of reflections on the upcoming transformation of digital interactions:
Use Cases
Technical (coming soon)
Responsible Design (coming soon)
In early November 2024, I spent a day with the OpenAI team in Paris. A few weeks earlier, the announcement of an API letting developers create multimodal assistants—voice + text—in real time had made waves. I was fortunate enough to have the time, and firsthand advice, to explore this new Realtime API with the team that designed it.
This technology is what powers the ChatGPT app in voice mode, and it's now available to developers for creating new applications.
I was blown away.
In six hours, I prototyped an interaction for a cooking coach: 100% voice-based, with beautiful sound quality, expressive intent and emotion in the voice. User-driven interactions, assistant-driven responses, references to contextual information, all orchestrated with very simple code, for a highly encouraging result.