Dynamic Confidence Estimation for LLM Task Execution
How do you measure the reliability of LLM outputs in production?
Run tasks multiple times, track dominant results, and calculate dynamic confidence with statistical methods (LLN, CLT). Practical guidelines and code in this post.
As developers, we often use Large Language Models (LLMs) in real-world systems where not only the answer to a task matters, but also how confident we are in that answer. In an experiment, I ran a simple task—"How many R's are in 'strawberry'?"—103 times sequentially and independently on the DeepSeek-R1-Distill-Llama-8B model. This experiment illustrates two key aspects of a production system:
- Task Result: What is the output of the task?
- Probabilistic Confidence: How much trust can we put in that output?
When we observe successive executions, both pieces of information evolve. The outcome of the task can be inferred using principles like the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). But if the results do not converge to a stable answer, it might mean that the task is inherently ambiguous for the model. On the other hand, the confidence is evaluated dynamically based on the number of executions and the distribution of outputs.
In other words, you want to ask, "Now that I've done X executions, and I see that Y of them point to one particular result while the rest differ, how confident am I that this result is indeed the correct one?"
Below is a proposed method to integrate these insights into your system.
The Underlying Idea
Imagine you have a system that executes a task repeatedly. For each execution, you obtain a result (e.g., a number encapsulated in `<result>...</result>`). Over time, you build up a set of results which you can analyze to determine:
- Dominant Outcome: By clustering the results, you can identify which answer appears most frequently.
- Dynamic Confidence: As you accumulate more outcomes, you can calculate a confidence score for the dominant answer. This score quantifies your certainty that the dominant result truly reflects the correct answer.
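For instance, once each raw output has been parsed down to its result value, identifying the dominant outcome is a one-liner with Python's `collections.Counter`; a toy sketch with made-up data:

```python
from collections import Counter

# Parsed result values from six hypothetical executions:
results = ["3", "3", "2", "3", "2", "3"]

# most_common(1) returns the single most frequent value and its count.
dominant, count = Counter(results).most_common(1)[0]
print(f"Dominant result: {dominant} ({count} of {len(results)} runs)")
```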
In statistical terms, if we denote by $N$ the total number of independent task executions and by $k$ the number of times a particular answer (say, the dominant cluster) occurs, then the empirical proportion is given by:

$$\hat{p} = \frac{k}{N}$$

By the Law of Large Numbers (LLN), as $N$ increases, $\hat{p}$ will converge to the true probability of that answer being correct. Moreover, using the Central Limit Theorem (CLT), we can approximate the variability of our estimate. The standard error (SE) of the proportion is:

$$SE = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{N}}$$

A confidence interval (CI) for the true proportion can be constructed as:

$$\hat{p} \pm Z \cdot SE$$

where $Z$ is the Z-score associated with the desired confidence level (e.g., $Z \approx 1.28$ for 80% confidence).
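As a side note, if you need the Z-score for an arbitrary confidence level rather than looking it up in a table, it can be derived from the inverse normal CDF. A minimal sketch using SciPy (an assumption; any statistics library with a normal quantile function works):

```python
from scipy.stats import norm

def z_score(confidence: float) -> float:
    """Two-sided Z-score for a given confidence level (e.g., 0.80)."""
    # An 80% two-sided interval leaves 10% in each tail, so we take the
    # normal quantile at 1 - (1 - 0.80) / 2 = 0.90.
    return norm.ppf(1 - (1 - confidence) / 2)

print(z_score(0.80))  # ~1.2816
print(z_score(0.95))  # ~1.9600
```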
Proposed Methodology
1. Run Multiple Executions:
Execute the task repeatedly, each time independently. Store each output along with metadata (e.g., timestamp, token count, etc.). A code sketch of steps 1 and 2 follows this list.
2. Cluster the Results:
Group the outputs by the result value. Let’s denote the dominant cluster by $C$ and its occurrence count by $k$.
3. Calculate the Empirical Proportion and Confidence:
- Empirical Proportion: $\hat{p} = \frac{k}{N}$
- Standard Error: $SE = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{N}}$
- Confidence Interval: $\hat{p} \pm Z \cdot SE$
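Here is a minimal sketch of steps 1 and 2, under the assumption that your model wraps its answer in `<result>...</result>` tags as described above. The `run_task` callable stands in for whatever inference code you use; the lambda in the example is a toy stand-in, not the actual DeepSeek setup:

```python
import re
from collections import Counter
from typing import Callable

def collect_results(run_task: Callable[[str], str], prompt: str, n_runs: int) -> Counter:
    """Run the task n_runs times and cluster outputs by their extracted result."""
    clusters = Counter()
    for _ in range(n_runs):
        raw = run_task(prompt)
        match = re.search(r"<result>(.*?)</result>", raw, re.DOTALL)
        # Outputs with no <result> tag go into a sentinel bucket, mirroring
        # the "malformed" category in the experiment. (Metadata such as
        # timestamps is omitted here for brevity.)
        key = match.group(1).strip() if match else "<malformed>"
        clusters[key] += 1
    return clusters

# Toy usage with a stand-in model that always answers 3:
clusters = collect_results(
    lambda prompt: "I count three R's. <result>3</result>",
    "How many R's are in 'strawberry'?",
    n_runs=5,
)
print(clusters.most_common(1))  # [('3', 5)]
```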
This framework allows you to state:
"After executions, the dominant cluster constitutes an empirical proportion of the responses. Thus, we are approximately confident (according to the confidence interval) that this is the correct result."
Empirical Analysis of the DeepSeek Experiment
Let’s apply the methodology to the experimental results obtained from DeepSeek-R1-Distill-Llama-8B on the task "How many R's are in 'strawberry'?".
The summarized outcomes from the experiment are:
- Total responses analyzed: 103
- Correct answers (i.e., responses with `<result>3</result>`): 85
- Incorrect answers (i.e., responses with `<result>2</result>`): 16
- Malformed outputs: 2
Since the two malformed outputs do not provide a valid result, we consider only the well-structured responses. This gives us an effective total of:

$$N = 103 - 2 = 101$$

with the dominant correct result (3 R's) appearing $k = 85$ times.
Step-by-Step Calculation
- Empirical Proportion: $\hat{p} = \frac{85}{101} \approx 0.8416$ (about 84.16%)
- Standard Error: $SE = \sqrt{\frac{0.8416 \times (1 - 0.8416)}{101}} \approx 0.0363$
- 80% Confidence Interval (using $Z \approx 1.28$): $0.8416 \pm 1.28 \times 0.0363 \approx 0.8416 \pm 0.0465$

This yields: $CI \approx [0.795,\ 0.888]$, i.e., roughly 79.5% to 88.8%.
Interpreting the Result
After 101 valid executions, the dominant cluster corresponding to the correct answer (3 R's) constitutes approximately 84.16% of the responses. With an 80% confidence level, we can say that the true proportion of correct responses likely lies between roughly 79.5% and 88.8%. In practical terms, this means:
"After 101 executions, the dominant cluster (i.e., responses with <result>3</result>
) constitutes about 84.16% of the responses. Thus, we are approximately 80% confident that the true proportion of correct answers is between 79.5% and 88.8%."
Practical Example in Python
Below is an example snippet that demonstrates how to compute the dynamic confidence score as more executions are collected:
```python
import math

def compute_confidence(k, N, Z=1.28):
    """
    k: The count of the dominant result.
    N: Total number of executions.
    Z: Z-score for the desired confidence (default 80% -> Z ~ 1.28).
    """
    p_hat = k / N
    SE = math.sqrt(p_hat * (1 - p_hat) / N)
    CI_lower = p_hat - Z * SE
    CI_upper = p_hat + Z * SE
    return p_hat, CI_lower, CI_upper

# Based on the experiment:
N = 101  # valid responses (103 total minus 2 malformed)
k = 85   # correct responses (<result>3</result>)

p_hat, CI_lower, CI_upper = compute_confidence(k, N)
print(f"Empirical proportion: {p_hat:.3f}")
print(f"80% Confidence Interval: [{CI_lower:.3f}, {CI_upper:.3f}]")
```
Dynamic Evolution of Cluster Probabilities
Running the computation above over successive batches of responses yields valuable insight into how the dominant clusters evolve as more responses are collected. Early in the process, both clusters "2" and "3" may exhibit volatile proportions due to the low number of executions. However, as the sample size increases, the empirical probabilities begin to stabilize. For instance, while cluster "3" (representing the correct answer) generally trends upward as more valid responses accumulate, its 80% confidence interval narrows over time. This shrinkage reflects increasing certainty in the estimate, driven by the law of large numbers and the diminishing effect of initial random fluctuations.
Conversely, cluster "2" displays a different evolution, typically plateauing at a lower probability. Tracking and visualizing the confidence intervals dynamically lets us see not only which result is dominant, but also how confident we can be in that dominance at any given point. Plotted as shaded bands, the intervals reveal tentative boundaries for the true proportion as more data is integrated. In practice, these cues let developers make real-time decisions about whether to keep collecting samples or to commit to a particular outcome; a sketch of one such stopping rule follows.
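As a concrete (and entirely hypothetical) version of that decision logic: commit to the dominant cluster once the lower bound of its confidence interval clears a target threshold, and keep sampling otherwise. The threshold, minimum sample size, and toy data below are illustrative assumptions, not part of the original experiment:

```python
import math
from collections import Counter

def should_commit(clusters: Counter, z: float = 1.28, threshold: float = 0.6,
                  min_samples: int = 10) -> bool:
    """Commit once the CI lower bound of the dominant cluster clears `threshold`."""
    N = sum(clusters.values())
    if N < min_samples:
        # Too few samples: the normal approximation behind the CI is unreliable.
        return False
    k = clusters.most_common(1)[0][1]
    p_hat = k / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - z * se >= threshold

# Simulated stream of parsed <result> values from successive executions (toy data):
clusters = Counter()
for result in ["3", "2", "3", "3", "3", "2", "3", "3", "3", "3", "3", "3"]:
    clusters[result] += 1
    if should_commit(clusters):
        dominant = clusters.most_common(1)[0][0]
        print(f"Committing to '{dominant}' after {sum(clusters.values())} runs")
        break
else:
    print("Still uncertain; keep collecting samples")
```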
Conclusion
By integrating these statistical principles into your task execution pipeline, you can dynamically assess both the result and the confidence level of that result. In our DeepSeek experiment, after 101 valid executions, the dominant result (3 R's) appears with an empirical proportion of about 84.16%. With an 80% confidence interval, we are reasonably assured that the true proportion of correct responses lies between roughly 79.5% and 88.8%. From numbers like these, you can build concrete decision logic: accept the answer, keep sampling, or escalate.
Happy coding!