Which Method? — Stats for LLM Evals

Recommendations and Defaults

We have run the most comprehensive simulation study to date on confidence interval methods for LLM eval data, and with few exceptions, we have validated these results on real eval data as well. Based on this research, we recommend the following methods for calculating confidence intervals in LLM eval contexts. We also performed a smaller simulation study on p-value methods, and report recommendations as well. These recommendations are implemented as the method="auto" defaults in evalstats, which adapt to data type and sample size based on which methods performed best in our simulations. Note that the contents of this section are subject to change as we gather more evidence for different situations.

Recommended Confidence Intervals Methods (By Data Type and Type of Comparison)

Point estimates

Binary (0 / 1)

Wilson score interval

Use Wilson for binary point-estimate CIs. It is fast, closed-form, and lands on-target in both synthetic and real eval data simulations.

Continuous scores in [0, 1] (e.g. accuracy, ROUGE, similarity)

Normal-Inverse-Gamma (NIG) or Logit-transformed t-interval

Both methods perform nearly identically at the top. NIG is slightly conservative in practice and very fast; logit-t is also near-exact and analytic. Use NIG if you'd like to be safe.

Unbounded numeric (Likert, grades, arbitrary floats)

Student's t-interval

For unbounded numeric point estimates, use Student's t-interval. It is the preferred method across unbounded numeric settings.

Pairwise comparisons

Binary (0 / 1)

Bayesian paired (N<50) / Tango score interval (N≥50)

Real eval data shows Tango under-covers at small N in dominated or jointly-sparse pairs. Bayesian paired handles these edge cases better below N=50. At N≥50, Tango is on-target, nearly 100× faster, and produces tighter CIs.

Continuous scores in [0, 1] (e.g. accuracy, ROUGE, similarity)

Student's t-interval

Use Student's t-interval for pairwise comparisons on continuous data. It is the preferred method across numeric/continuous pairwise settings.

Unbounded numeric (Likert, grades, arbitrary floats)

Student's t-interval

Use Student's t-interval for pairwise comparisons on unbounded numeric data.

Point estimates

Binary (0 / 1)

Normal-Inverse-Gamma (NIG)-nested

A nested variant of the Normal-Inverse-Gamma Bayesian method has the best coverage across synthetic (0.966) and real eval data. Decomposes variance into between-item and within-item run components — no resampling required and very fast (0.14 ms). Though it seems theoretically strange, this method performed better than more "reasonable" Bayesian methods for a binary regime.

Continuous scores in [0, 1]

NIG-nested or Logit-t on per-input means

NIG-nested (0.966) is the safest default: it accounts for run variance explicitly. Logit-t on per-input means (0.949) is also on-target and instant if you pre-average runs.

Unbounded numeric (Likert, grades, arbitrary floats)

Student's t-interval on per-input means

Average runs per input, then apply the standard t-interval over those means. Near-target coverage and instant to compute.

Pairwise comparisons

Binary (0 / 1)

Tango (multi-run variant with moments)

tango_multirun_mmnt is the fastest and best-calibrated multi-run binary pairwise method (0.957), validated in both our synthetic and real-data simulations. As far as we know, this is a novel multi-run extension of the Tango score interval which was proposed by GPT-5.

Continuous scores in [0, 1]

Student's t-interval on per-input means

Average runs per input, then apply the paired t-interval over those means. Near-target (0.947) and instant.

Unbounded numeric (Likert, grades, arbitrary floats)

Student's t-interval on per-input means

Average runs per input, then apply the paired t-interval over those means.

Recommended p-value methods by analysis type

Task	Data Type	Recommended Method	Notes
Pairwise test (A vs B)	Any paired eval scores	Wilcoxon Signed Ranks	Slightly conservative, consistent across data types and sample sizes, and no Type I inflation above the alpha in our simulation.
Pairwise test (A vs B), but in a safety-critical setting	Any paired eval scores	Sign Test	More conservative than Wilcoxon; useful when false positives are very costly.
Multiple pairwise tests (k variants)	Any	Benjamini-Hochberg (`fdr_bh`)	Default correction in `evalstats`; controls expected false discovery rate with better power than stricter family-wise procedures.

Reporting principle

Do not rely on NHST alone for eval decisions. If you report p-values, report them alongside effect sizes and confidence intervals. Prefer effect sizes and CIs as primary evidence, and treat p-values as secondary.

Sample Size Guidance

N < 15: Do not report statistics. The CI width will be so large as to be meaningless, and the sample is extremely unlikely to be representative.

15 ≤ N < 30: Report statistics with a prominent warning. Results should be treated as exploratory only.

N ≥ 30: CIs and point estimates of true performance begin to become more reliable.

Simulation Study: Which CI method should I use in X situation?

We ran comprehensive simulation studies over eval-like distributions to compare methods for calculating CIs head-to-head— to the best of our knowledge, the most comprehensive test of CI methods ever conducted for small-sample AI evals data. We compared almost all known CI methods (and even a few ones invented with the help of AI) to see which ones are best for different data types, sample sizes, and estimands. We also measured the computational cost of each method, since some of the more complex methods can be very expensive to run at scale. Finally, we validated the results on real eval data by subsampling from real benchmark outputs, to double-check how well the simulation findings hold up in practice.

How to measure CI "performance"?

A confidence interval (CI) is only useful if it's trustworthy, tight, and fast to compute. We measure CI quality along three dimensions:

Coverage: If you ask a method to build a 95% CI, it should actually contain the true value 95% of the time across many repeated experiments. A method with under-coverage (say, 88%) gives intervals that are too narrow and misleadingly confident. A method with over-coverage (say, 99%) is overly conservative—it works, but it may be too cautious. We target nominal coverage of 95% for all simulations.
Width: Once a method achieves good coverage, narrower intervals are better. Width measures how "confident" the CI is: all else equal, the method that produces the tightest interval while still hitting 95% coverage wins.
Cost: how long the method takes to run. Simple analytic formulas (like the Wilson interval for binary scores) compute in microseconds. Resampling methods like the bootstrap require thousands of resampled datasets and can take seconds or more, especially as the sample size increases. Cost matters when you're running evaluations over many models, prompts, or benchmarks simultaneously.

In practice, coverage comes first—a cheap, narrow interval that's wrong 20% of the time is useless. Among methods that achieve adequate coverage, we prefer narrower intervals, and among those, faster ones.

Methodology

We rigged realistic distributions, randomly sampled from them, then measured how close each CI method gets to the true value across repeated trials. Two estimands were tested: single-sample CIs (does the interval contain the true mean?) and pairwise difference CIs (does the interval contain the true difference?). Sample sizes: N = 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100. Bootstrap and Bayesian draws: 10,000. Monte-Carlo trials: 2,000 per condition. We also validated on real LLM eval data by subsampling from 1) OpenEvals outputs across 15 unique model × benchmark pairs, for the single-run case, and 2) outputs of six models across nine benchmarks run 5 times per benchmark, for the multi-run case. Simulations are implemented in simulations/ folder in the project repository.

Distributions Tested

Across our synthetic simulations, we tested the following four data types (our simulations using real data test only the first two, binary and continuous):

Binary scores ({0, 1}): Bernoulli with p ∈ {0.02, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.98} (9 scenarios)
Continuous scores (0–1 floats): Beta families — uniform Beta(1,1), U-shaped Beta(0.5,0.5), right-skewed Beta(2,8), left-skewed Beta(8,2), moderate-skew Beta(2,5), extreme-right Beta(0.35,6), extreme-left Beta(6,0.35), near-boundaries Beta(0.3,0.3), near-center Beta(6,6), mixture of Beta(0.5,4) and Beta(5.5,1.2); plus a logit-normal (μ=−0.35, σ=1.35), zero-inflated (70% spike at 0, Beta(2,4) body), and one-inflated (70% spike at 1, Beta(4,2) body) (13 scenarios)
Likert scores (integers 1–5): latent-normal model rounded and clipped to 1–5, covering uniform (high variance), skewed-low, skewed-high, bimodal (peaks at extremes), center-peaked, near-floor, near-ceiling, polarized, and flat-middle (9 scenarios)
Grades (0–100 floats): clipped-normal scenarios — symmetric N(50,20), high-scoring N(75,15), low-scoring N(35,20), ceiling-heavy N(88,10), floor-heavy N(12,10), very-high N(92,7), very-low N(8,7), high-variance N(50,34); plus a 3-component truncated-normal mixture, a heavy-tailed t(df=3) distribution, a zero-spiked (40% spike at 0, N(45,20) body), and a hundred-spiked (40% spike at 100, N(65,18) body) (12 scenarios)

Single-Sample Coverage (target: 0.95)

The following image and table presents the results of the Monte Carlo simulation. The table breaks down the final average coverage rates at each sample size, with notes on performance. (Coverage rate = fraction of trials where the CI contains the true population mean. Higher is better; under-coverage means the method is over-confident.)

Single-sample CI coverage vs. computation cost for all methods, averaged across eval types and sample sizes

Table 1 — Single-sample method summary, synthetic data (averaged across all eval types, scenarios, and sample sizes)

Method	Coverage	Time (ms)	Width	Dev	Data types
Percentile bootstrap	0.911▼	1.346±0.001	3.3176	-0.039	Any
BCa	0.913▼	1.635±0.001	3.3392	-0.037	Any
Bayes bootstrap	0.908▼	2.376±0.001	3.2552	-0.042	Any
Smooth bootstrap (Gaussian KDE)	0.929	3.273±0.002	3.7557	-0.021	Any
Bootstrap-t	0.910▼	1.981±0.001	3.7062	-0.040	Numeric (continuous/discrete)
Student's t-interval	0.918	0.042±0.000	3.6109	-0.032	Numeric (continuous/discrete)
Wilson score interval ★	0.951	0.029±0.000	0.2033	+0.001	Binary
Wald interval	0.806▼	0.026±0.000	0.1716	-0.144	Binary
Clopper-Pearson	0.977	0.049±0.000	0.2236	+0.027	Binary
Bayesian independent samples	0.946	0.243±0.000	0.2037	-0.004	Binary
Beta interval	0.934	2.442±0.003	0.1361	-0.016	Continuous [0, 1] (non-binary)
Logit-t ★	0.951	0.042±0.000	0.1443	+0.001	Continuous [0, 1]
Normal-Inverse-Gamma (NIG) ★	0.958	0.031±0.000	0.1430	+0.008	Continuous [0, 1]
Empirical likelihood (EL)	0.937	1.069±0.001	0.1327	-0.013	Numeric (continuous/discrete)

Coverage target is 0.95. ▼ marks under-target coverage. ★ marks preferred methods. Methods with a scope note (binary only / continuous [0, 1] only) are not applied to other data types; their numbers are therefore not directly comparable to general-purpose methods. Bootstrap-family methods are averaged across all data types. MCSE values are negligible (around ~0.005 or lower), so they are omitted from this display.

Single-Sample Coverage on Real LLM Eval Data (target: 0.95)

To complement the synthetic simulation, we also evaluated CI methods on real LLM evaluation data by randomly drawing samples from evaluation score outputs in the OpenEvals dataset, across 15 unique model × benchmark pairs, for variety. Coverage was measured by repeatedly subsampling from each benchmark's full score distribution and checking whether the CI contained the true (full-population) mean. This tests how methods behave on actual eval data shapes, which may differ from the idealized distributions in the synthetic study.

Single-sample CI coverage vs. computation cost on real LLM eval data (OpenEvals), averaged across benchmarks and sample sizes

Table 2 — Single-sample method summary, real LLM eval data via OpenEvals (mean across all benchmarks, models, and sample sizes)

Method	Coverage	Time (ms)	Width	Dev	Data types
Percentile bootstrap	0.899▼	1.359±0.001	0.2082	-0.051	Any
BCa	0.901▼	1.656±0.001	0.2184	-0.049	Any
Bayes bootstrap	0.903▼	2.395±0.002	0.2042	-0.047	Any
Smooth bootstrap (Gaussian KDE)	0.918	3.269±0.003	0.2381	-0.032	Any
Bootstrap-t	0.884▼	1.887±0.001	0.2250	-0.066	Numeric (continuous/discrete)
Student's t-interval	0.907▼	0.043±0.000	0.2293	-0.043	Numeric (continuous/discrete)
Wilson score interval ★	0.955	0.029±0.000	0.2521	+0.005	Binary
Wald interval	0.879▼	0.025±0.000	0.2476	-0.071	Binary
Clopper-Pearson	0.975	0.051±0.000	0.2793	+0.025	Binary
Bayesian independent samples	0.953	0.246±0.001	0.2531	+0.003	Binary
Beta interval	0.939	2.073±0.004	0.1334	-0.011	Continuous [0, 1] (non-binary)
Logit-t	0.943	0.042±0.000	0.1434	-0.007	Continuous [0, 1]
Normal-Inverse-Gamma (NIG) ★	0.964	0.032±0.000	0.1423	+0.014	Continuous [0, 1]
Empirical likelihood (EL)	0.940	1.109±0.002	0.1307	-0.010	Numeric (continuous/discrete)

Pairwise Comparison Coverage (target: 0.95)

For paired comparisons, coverage measures whether the CI on the difference contains the true difference. This is the more practically important statistic for comparing two prompts or models.

Pairwise scenarios are parameterised by two factors that are swept jointly: the intraclass correlation (ICC), which captures how much variance is attributable to the input (prompt/example) rather than the model, and the standardised effect size (Cohen's d), which sets the true difference between the two candidates. The official test sweeps ICC ∈ {0.05, 0.15, 0.30, 0.50} and d ∈ {0.0, 0.2, 0.4} (the d = 0 condition measures Type I error rate; the non-null conditions measure coverage under a real effect). ICC is implemented differently per data type: a Beta-hierarchical model for binary scores, a Gaussian-noise perturbation of a Beta base for continuous scores, and a latent-normal model for Likert and grades. All pairwise scenarios also use the full expanded scenario suite, so the distributions tested are the same rich set described above.

Pairwise CI coverage vs. computation cost for all methods, averaged across eval types, ICC levels, and sample sizes

Table 3 — Pairwise method summary, synthetic data (averaged across all eval types, scenarios, and sample sizes)

Method	Coverage	Time (ms)	Width	Dev	Data types
Percentile bootstrap	0.922	1.458±0.000	2.7531	-0.028	Any paired
BCa	0.908▼	1.791±0.000	2.7734	-0.042	Any paired
Bayes bootstrap	0.917	3.473±0.001	2.7108	-0.033	Any paired
Smooth bootstrap (Gaussian KDE)	0.950	3.549±0.001	3.1162	+0.000	Any paired
Bootstrap-t	0.930	2.026±0.000	3.0566	-0.020	Numeric/continuous paired
Student's t-interval ★	0.950	0.055±0.000	3.8937	+0.000	Numeric/continuous paired
Newcombe score interval	0.854▼	0.051±0.000	0.2007	-0.096	Binary paired
Tango score interval ★	0.955	0.042±0.000	0.2682	+0.005	Binary paired
Bayesian independent comparison	0.978	0.546±0.000	0.3313	+0.028	Binary paired
Bayesian paired comparison	0.958	4.071±0.000	0.3010	+0.008	Binary paired
NIG (adapted for arbitrary numeric range)	0.937	0.037±0.000	3.4274	-0.013	Numeric/continuous paired
Empirical likelihood (EL)	0.937	1.317±0.000	3.6481	-0.013	Numeric/continuous paired

Coverage target is 0.95. ▼ marks under-target coverage. ★ marks preferred methods. Results are averaged across all eval types, scenarios, and sample sizes. Binary-only methods (Newcombe, Tango, Bayesian paired/independent) are shown for diagnostic reference and are not directly comparable to general-purpose methods.

Key findings from pairwise simulation

For pairwise comparisons on general numeric data, smooth bootstrap (0.950) and the Student's t-interval (0.950) both hit the target exactly. BCa remains the worst performer (0.908▼) and is explicitly not recommended. Among binary-specific methods, Tango score interval (0.955) is the new preferred method, replacing Newcombe (0.854▼ — severe under-coverage). The Bayesian paired method looks good in aggregate (0.958), but because it is expensive and has a catastrophic failure mode around N≈200+, it is retained only as a reference benchmark.

So Did Tango Score Interval Just Outperform the Bayesian Paired Method?

Based on the synthetic simulation alone, it certainly looks that way. In that study, Tango (0.955) edges out the Bayesian paired method (0.958) in coverage while being nearly 100× faster and without the catastrophic N≈200+ failure mode. The synthetic distributions, however, are drawn from idealized parametric families. The natural question is: does this hold on real LLM eval data, where item correlations, difficulty clustering, and lopsided score distributions are facts of life rather than simulation choices?

Pairwise Coverage on Real LLM Eval Data (Binary)

To check, we ran a pairwise coverage simulation on OpenEvals data, subsampling from pairs of models that share items for the same benchmark, across multiple benchmarks, and measuring whether each method's CI on the paired difference contained the true (full-population) difference. The results reveal a discrepancy that the synthetic study masked:

Pairwise binary CI coverage vs. computation cost on real LLM eval data (OpenEvals), comparing Tango and Bayesian paired methods across sample sizes

Table 6 — Pairwise binary method summary, real LLM eval data via OpenEvals (mean across all benchmarks, model pairs, and sample sizes)

Method	Coverage	Time (ms)	Width	Dev	Data types
Newcombe score interval	0.816▼	0.037±0.000	0.2382	-0.134	Binary paired
BCa	0.892▼	1.649±0.002	0.3272	-0.058	Any paired
Bayes bootstrap	0.899▼	2.368±0.003	0.3214	-0.051	Any paired
Percentile bootstrap	0.910▼	1.338±0.001	0.3242	-0.040	Any paired
Bootstrap-t	0.918	1.892±0.002	0.3455	-0.032	Any paired
t_interval	0.942	0.057±0.000	0.3593	-0.008	Any paired
Tango score interval ★	0.946	0.048±0.000	0.3111	-0.004	Binary paired
Smooth bootstrap (Gaussian KDE)	0.948	3.242±0.004	0.3727	-0.002	Any paired
Bayesian paired comparison ★	0.958	4.076±0.000	0.3439	+0.008	Binary paired
Bayesian independent comparison	0.975	0.551±0.000	0.3844	+0.025	Binary paired

Coverage target is 0.95. ▼ marks under-target coverage. ★ marks preferred methods. Results are averaged across all benchmarks, model pairs, and sample sizes. Methods shown for all types (bootstrap, t-interval, etc.) are included for reference; in practice only binary-specific methods are applied to binary data.

Tango is still good, but it slightly under-covers at small N<50—so we can't entirely reproduce the strong result. Why?

Why does Tango under-cover at small N on real data?

When we looked more closely at which model pairs drive the under-coverage, two patterns emerge:

Dominated pairs: one model scores near 0% on a benchmark while the other scores, say, 80%. The discordant pairs (inputs where the two models disagree) are too lopsided for Tango's score statistic to accumulate enough signal, and the CI ends up slightly too narrow.
Jointly sparse pairs: both models score very low, e.g. both around 2%. Again, discordant pairs are too rare, and the CI ends up slightly too narrow.

These extreme situations are very common in our sample of OpenEval's benchmark, since we sample a diverse set of models at different parameter scales (e.g., a 1B model versus a 70B one), where the strong model dominates. Our synthetic simulation represented these extreme situations, but in the end, their results are averaged with more "well-behaved" pairs across a wide range of distributions.

In fact, we can see how it might be an artifact of the OpenEval sampling by running the same simulation on different LLM evals data—taking a single-run of the multi-run benchmark data we collected (explained below). When we do that, we see Tango again matches Bayesian paired:

CI Coverage vs. Computation Cost — Pairwise Binary (Real LLM Evals, Single-Run Subsample)

Pairwise binary CI coverage vs. computation cost on real LLM eval data (multi-run benchmark, single-run subsample), comparing Tango and Bayesian paired methods across sample sizes

Figure: Coverage vs. computation cost for pairwise binary CI methods on real LLM eval data (single-run subsample of multi-run benchmarks), comparing Tango and Bayesian paired across sample sizes.

This pattern could also be an artifact of how the real-data simulation is constructed. Because we subsample from large benchmarks — and large benchmarks contain correlated items and difficulty clusters — a random draw of N = 20 items can easily land in a sparse or lopsided regime that would be unusual in a carefully curated, representative eval set. In a smaller, purpose-built evaluation that attempts reasonable coverage of difficulty, the dominance and sparsity problems would largely disappear.

Finally, the situations where Tango under-performs are also extreme cases where CIs don't tell us much —if one model completely dominates in performance (0% vs 80%), merely eyeballing mean differences may be enough. Similarly, if both models perform very poorly (2% vs 2%), statistics are similarly not that informative. So if a CI method is "slightly off", it may not be a big deal in practice.

That said, evalstats cannot base recommendations on "it's usually fine." Since the failure mode exists and is observable on real benchmark data at small N, our recommendation changes:

N < 50: use Bayesian paired with 10,000 resamples for binary pairwise CIs. It provides better small-sample coverage in the real-data regime, including the sparse/dominated cases where Tango struggles. We sacrifice speed considerably, but it's safer for extreme cases that may very well appear in practice on poorly constructed evaluation sets or "David vs. Goliath" scenarios.
N ≥ 50: switch to Tango score interval. At this point Tango reaches its target coverage, is dramatically faster, produces tighter CIs than Bayesian paired, and sits closer to the nominal α level rather than over-covering. Critically, it also does not deteriorate at large N the way the Bayesian paired method does.

This threshold-based switch is what evalstats implements under method="auto" for binary pairwise comparisons.

Multi-Run Coverage (Hierarchical Data)

Increasingly, it has become good practice to collect multiple "runs" per input, to account for the inherent stochasticity in LLM outputs. These multi-run scenarios introduce a second level of variance — run stochasticity within each input — that single-run methods ignore or may not even be able to handle. We ran a separate Monte Carlo simulation (200 MC reps; 10% of the single-run study) over the same distribution suite, sweeping both N (number of inputs) and f_run (runs per input), to compare "flat" methods that just use the mean of runs or take only the first run, against nested resampling methods. For a harder challenge that better matches real LLM eval data (where each "item" is really of different difficulty), run-level noise in these simulations is heteroscedastic: run noise variance differs across inputs.

The key finding is that taking the per-input mean across runs and running standard methods over those means performs reasonably well — Wilson, Tango, and NIG all achieve near-target coverage this way. Input-level clustering is automatically absorbed when the mean is treated as a single observation, so these methods are not as badly hurt as truly flat approaches that treat every run as an independent data point. However, tighter CI widths can be achieved with methods that explicitly account for within-item run variance: by decomposing total variance into a between-item term and a within-item Monte Carlo noise term (as in tango_multirun_mmnt and the nested resampling variants), the interval can be narrowed without sacrificing coverage. The practical upshot is that running on means is a safe baseline, but if you have enough runs per input the multi-run-aware methods will give you meaningfully narrower intervals at no extra compute cost.

CI Coverage vs. Computation Cost — Point Estimates

$Multi-run point-estimate CI coverage vs. computation cost for all methods, averaged across eval types, run-noise fractions, and sample sizes$

Figure: Coverage vs. computation cost for point-estimate (mean) CI methods in the multi-run setting, averaged across eval types, run-noise fractions, and sample sizes.

Table 4 — Multi-run point-estimate method summary, synthetic data (averaged across all eval types, scenarios, N, and f_run)

Method	Coverage	Time (ms)	Width	Dev	Data types / Notes
bootstrap	0.915	1.805±0.002	3.1739	-0.035	Any
bca	0.921	2.272±0.002	3.1981	-0.029	Any
bayes_bootstrap	0.912	3.230±0.004	3.1036	-0.038	Any
smooth_bootstrap	0.935	4.375±0.005	3.6200	-0.015	Any
bootstrap_t	0.924	2.552±0.003	3.6410	-0.026	Any
t_interval ★	0.924	0.080±0.000	3.5029	-0.026	Numeric only — near-target 0.95 for arbitrary-ranged data; appears here to under-cover due to subpar performance on binary 0/1 data
bootstrap_nested	0.928	27.472±0.033	3.3464	-0.022	Any
bayes_bootstrap_nested	0.935	21.073±0.026	3.4204	-0.015	Any
bca_nested	0.936	28.137±0.034	3.3720	-0.014	Any
smooth_bootstrap_nested	0.944	30.182±0.036	3.7742	-0.006	Any — best general-purpose nested method; near-exact coverage
bootstrap_t_nested	0.928	28.214±0.034	3.6337	-0.022	Any
wilson_flat ★	0.974	0.066±0.000	0.2191	+0.024	Binary — simple, still performs well; takes mean of runs per item
wald_flat	0.828▼	0.042±0.000	0.1838	-0.122	Binary only
clopper_pearson_flat	0.989	0.083±0.000	0.2432	+0.039	Binary only
bayes_indep_flat	0.920	0.454±0.002	0.2077	-0.030	Binary only
wilson_de	0.881▼	0.069±0.000	0.1335	-0.069	Binary only
wilson_od	0.918	0.051±0.000	0.1611	-0.032	Binary only
beta_binomial	0.919	0.049±0.000	0.1606	-0.031	Binary only
nig_nested ★	0.966	0.144±0.000	0.1691	+0.016	Binary and continuous [0,1] — safer alternative to nig; most on-target method across synthetic and real data simulations
beta	0.932	3.377±0.009	0.1443	-0.018	Continuous [0,1]
logit_t ★	0.949	0.079±0.000	0.1548	-0.001	Continuous [0,1] only — on-target and instant
nig ★	0.957	0.048±0.000	0.1606	+0.007	Binary and continuous [0,1] — for binary, takes mean of runs per item; slightly conservative, very fast
el	0.933	1.729±0.003	0.1403	-0.017	Continuous [0,1] only

Coverage target is 0.95. ▼ marks severe under-coverage. ★ marks preferred methods. 200 MC reps per condition. "Flat" methods treat multiple runs as independent observations and ignore run-level variance entirely. Nested methods resample inputs, then runs within inputs.

Why does nig_nested work for binary data?

nig_nested is a hierarchical extension of the Normal-Inverse-Gamma (NIG) Bayesian credible interval designed for multi-run data. The standard nig method places a NIG conjugate prior on (μ, σ²) and returns an equal-tailed credible interval derived from the marginal Student-t posterior for μ. The nested variant adapts this to an (N, R) score matrix by explicitly decomposing variance into two levels: a between-item component (variance of per-item means across inputs) and a within-item component (average run-level variance divided by R). It then inflates the sum-of-squares fed into the NIG update as σ²_eff = σ²_item + σ²_run / R, so the posterior uncertainty automatically widens to reflect run noise — without needing any resampling.

What makes its binary performance surprising is that NIG is a Gaussian model, so there is no principled reason it should fit 0/1 data well. Yet it achieves 0.966 coverage, crucially remaining above 0.95 for all N, while remaining fast (0.14 ms). The likely explanation is that the hierarchical variance decomposition happens to produce a conservatively wide interval for binary data under the default weakly-informative prior, compensating for the model mismatch. This is somewhat accidental good behavior, as theoretically, a binomial model would be a more justified choice. It nevertheless earns the ★ as the most on-target method across both synthetic AND real-data simulations, outperforming or matching the other "more theoretically reasonable" Bayesian methods (beta-binomial, bayes_bootstrap_nested, and bayes_indep_flat) at lower cost. We can't argue with those results.

Multi-Run CI Method Performance for Pairwise Comparisons

CI Coverage vs. Computation Cost — Pairwise Comparisons

$Multi-run pairwise CI coverage vs. computation cost for all methods, averaged across eval types, run-noise fractions, and sample sizes$

Figure: Coverage vs. computation cost for pairwise CI methods in the multi-run setting, averaged across eval types, run-noise fractions, and sample sizes.

Table 5 — Multi-run pairwise method summary, synthetic data (averaged across all eval types, scenarios, N, and f_run)

Method	Coverage	Time (ms)	Width	Dev	Data types / Notes
bootstrap	0.927	1.932±0.002	1.9882	-0.023	Any paired
bca	0.919	2.430±0.002	2.0047	-0.031	Any paired
bayes_bootstrap	0.913	3.505±0.003	1.9516	-0.037	Any paired
smooth_bootstrap	0.954	4.686±0.004	2.2674	+0.004	Any paired - under-covers for small N on binary data
bootstrap_t	0.936	2.718±0.002	2.2593	-0.014	Any paired
t_interval ★	0.947	0.089±0.000	2.1927	-0.003	Numeric paired only — near-target and instant; aggregate under-coverage due to binary subpar performance
bootstrap_diff_nested	0.955	51.815±0.049	2.2261	+0.005	Any paired — near-target but expensive
bayes_diff_nested	0.956	35.782±0.033	2.3823	+0.006	Any paired — near-target; cheaper than bootstrap nested
smooth_diff_nested	0.972	54.841±0.051	2.4805	+0.022	Any paired — safe and conservative even for small N on binary data; most expensive
tango_flat	0.944	0.088±0.000	0.2775	-0.006	Binary only — fast flat binary pairwise method
newcombe_flat	0.757▼	0.050±0.000	0.1874	-0.193	Binary only — severe under-coverage
bayes_indep_comp	0.980	0.701±0.001	0.3486	+0.030	Binary only — over-covers; wide CIs
bayes_paired_comp	0.962	5.333±0.003	0.3173	+0.012	Binary only — slightly over-covers
tango_multirun_cluster	0.957	0.092±0.000	0.2174	+0.007	Binary only, multi-run — near-target and fast
tango_multirun_effective	0.964	0.074±0.000	0.2256	+0.014	Binary only, multi-run — near-target and fast
tango_multirun_mmnt ★	0.957	0.068±0.000	0.2178	+0.007	Binary only, multi-run — best and cheapest; performs consistently in real data simulations

Coverage target is 0.95. ▼ marks under-target coverage. ★ marks preferred methods. 200 MC reps per condition. Nested methods resample inputs then runs within inputs for each resample step, greatly increasing compute cost but properly accounting for run-level variance.

On tango_multirun_mmnt: a GPT-5 invention?

tango_multirun_mmnt is the best-performing method—but, to the best of our knowledge, it is an approach that did not exist in the literature prior to this project. It emerged from a conversation with GPT-5, where we asked it to extend the Tango paired-binary CI to the multi-run setting. GPT-5 proposed decomposing the total variance of the per-item paired difference — computed as delta_i = d10_i − d01_i (mean discordance rates across runs) — into a between-item term and a within-item Monte Carlo noise term. Crucially, it identified that naively using the observed item-level variance Var(delta_i) would double-count the within-item run noise, and proposed subtracting E[u_i − delta_i^2] / R (where u_i = d10_i + d01_i is the discordance mass and R is the number of runs) to recover the latent between-item variance. The resulting interval retains Tango's score-shrinkage denominator and reduces exactly to the original Tango CI when R = 1. Under simulation it achieves near-target coverage (0.957) while being the fastest of the three multi-run Tango variants, making it the recommended default. It is notable as an example of a novel statistical procedure generated by a language model that holds up under both synthetic and real-data simulation.

Multi-Run Coverage with Real LLM Evals Data

Since we had run the other cases on real LLM eval data, we wanted to check if the multi-run results held up there as well. Unfortunately, we could not use OpenEvals (or any other public dataset) because they do not have multi-run data. (Only DOVE does, but they sample only 100 items per benchmark, which is too small for a coverage simulation.) Thus, we had to run benchmarks ourselves on a sample of models. For this test, we only ran benchmarks that produce binary scores, for two reasons: the Tango under-coverage problem was more pronounced there and we wanted to see if it held up on real data; and it is harder to find numeric score benchmarks.

We ran the following benchmarks and models using the Inspect AI benchmark harness and the OpenRouter API. This range of models was chosen as a good sample of model diversity in terms of provider and size:

Benchmarks: MMLU, ARC, HellaSwag, GSM8K, WinoGrande, TruthfulQA, BoolQ, PIQA, BBQ
Models: meta-llama/llama-3.1-8b-instruct, mistralai/ministral-8b-2512, google/gemma-3n-e4b-it, ibm-granite/granite-4.1-8b, qwen/qwen3.5-35b-a3b, openai/gpt-4o-mini

As you can see, the above results largely hold up on real data for the binary case. The one difference is that many methods are brought down a little, which is why we recommend NIG-nested here instead of NIG on means.

CI Coverage vs. Computation Cost — Point Estimates (Real LLM Evals)

Multi-run real-evals point-estimate CI coverage vs. computation cost

Figure: Coverage vs. computation cost for point-estimate CI methods in the multi-run setting, measured on real LLM eval data.

CI Coverage vs. Computation Cost — Pairwise Comparisons (Real LLM Evals)

Multi-run real-evals pairwise CI coverage vs. computation cost

Figure: Coverage vs. computation cost for pairwise CI methods in the multi-run setting, measured on real LLM eval data.

Comparison with Bayesian Methods for Binary Scores

To address the ICML 2025 position paper arguing that practitioners should avoid CLT-style / frequentist intervals for LLM evals in favor of Bayesian beta-posterior methods ("Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints"), we ensured the simulation included the exact Bayesian methods the authors advocate, implemented using their bayes_evals library. Our simulation results put some of their recommendations into question. For the binary scores 0/1, we find:

Wilson score interval lands on target coverage (0.951 synthetic; 0.955 real data) at much cheaper computational cost, so the paper's preferred Bayesian alternative is not needed here.
For the pairwise comparisons, the story is nuanced. In the synthetic simulation, Tango (0.955) matches Bayesian paired (0.958) in coverage while being nearly 100× faster. But on real LLM eval data — where dominated pairs and jointly-sparse model comparisons are common — Tango under-covers at N < 50, caused by extreme situations of sparse data regimes. Bayesian paired remains better in these small-sample setups. At N ≥ 50, Tango reaches target coverage, is dramatically faster, produces tighter intervals, and does not deteriorate at large N.
In the new multi-run simulation, the NIG-nested variant achieves the highest point-estimate coverage overall (0.966), well above the authors' Bayesian independent samples method (0.920). NIG-nested is itself a Bayesian method — placing a Normal-Inverse-Gamma prior on (μ, σ²) and returning a marginal Student-t credible interval — but uses a Gaussian likelihood rather than a Binomial/Beta model. The result shows that the authors' specific Beta-posterior model is not uniquely suited to the hierarchical binary case; a different conjugate Gaussian model substantially outperforms it.

In the course of this analysis, we also uncovered a severe problem in the authors' library that makes it unsuitable for practical usage on benchmark data: around N≈200+, the bayes_evals Bayesian paired method begins to fail catastrophically with overconfident CIs that continue to decline as N increases. (We ran additional simulations up to N=1000 confirming this pattern. The reason why seems to be that the importance sampling procedure becomes unstable at larger N.) Because of this failure mode, Bayesian paired is not recommended for N ≥ 100 — and it should never be used beyond N ≈ 200. For small samples (N < 50), it retains a narrow use-case: real-data simulations show it handles dominated and jointly-sparse pairs better than Tango in that regime. For larger samples, and as the default once N reaches 50, practitioners could switch to the Tango and Wilson intervals (and their multi-run analogues), which are fast, well-calibrated, and stable. For multi-run data, which was not considered in the paper, practitioners should use NIG-nested for point estimates and Tango's multi-run variant for pairwise comparisons.

Is BCa bootstrap a good method for estimating CIs?

If you chat with an LLM, it might start to go off about how BCa is the gold standard. Don't listen to it. BCa is actually one of the worst methods in most of our simulations. We aren't the only skeptics, either. One simulation study found:

"Our simulation results revealed an outperformance of the percentile method over both BCa and Student's t with respect to the desired coverage of the CIs. The BCa method tended to produce slightly better-balanced CIs than the other two methods, but overall, it was undermined due to its narrower CIs than the desired level in coverage." — Simulation study on bootstrap CI methods for factor analysis loadings

This is consistent with our own simulation results on both synthetic and real LLM eval data.

Simultaneous Confidence Intervals for Multiple Comparisons

When comparing multiple prompts or models, it is common to run multiple pairwise comparisons. In these cases, controlling the family-wise error rate (FWER) is essential to avoid spurious findings. The max-T resampling procedure is a well-established method in genomics and neuroscience for constructing simultaneous confidence intervals (CIs) that jointly cover all comparisons, rather than treating each CI in isolation (basically, it widens the CIs to ensure the alpha level is maintained across all comparisons). It also has very few assumptions, allowing for robust inference even when candidates are correlated. Despite its power and flexibility, max-T is virtually unknown in LLM evaluation and ML contexts.

What is max-T?

The max-T procedure works by resampling the data (e.g., via bootstrap), computing the test statistic (such as the difference in means) for all pairs in each resample, and then recording the maximum absolute value of the test statistics across all comparisons for each resample. The distribution of these maxima is then used to set CI bounds that guarantee the desired coverage simultaneously for all comparisons, not just marginally. This approach is less conservative than Bonferroni correction, yet provides strong control over false positives.

evalstats uses the max-T procedure by default whenever you output multiple pairwise comparisons with bootstrapped CIs. This ensures that the reported intervals account for the increased uncertainty from making many comparisons at once. For consistency, evalstats also applies the max-T correction to p-values when users request them, so that both CIs and p-values are corrected in the same principled way.

The max-T method is recommended by Rink & Brannath, 2025, a recent ML paper on multiple comparisons correction when comparing many candidate models that may be correlated. For more foundational background, see the Methods Used in evalstats and Statistical Pitfalls in ML sections of the Resources page.

Which p-value Method Works Best?

We also ran another Monte Carlo simulation study to compare p-value methods for pairwise testing across LLM eval data types and sample sizes, implemented in simulations/sim_compare_pvalues.py. The criterion here is Type I error control under the null (target: α = 0.05) plus consistency across data distributions and N.

Wilcoxon Signed Ranks is our recommended default for pairwise comparisons for those adopting a NHST framework. In this simulation, it is slightly conservative, consistently performant across sample sizes, data types, and distributions, and does not show pathological behavior where Type I error is inflated above the nominal threshold (α = 0.05). Wilcoxon is a non-parametric test that makes fewer assumptions about the data distribution, which likely contributes to its robustness in this context. It is also widely implemented, well-understood, and computationally efficient, making it a practical choice for developers.

That said, the Sign Test is also a reasonable choice for safety-critical settings where false positives are especially costly, because it is much more conservative than Wilcoxon.

For comparing 3+ models/prompts, we recommend the Friedman test as the default, with Wilcoxon signed ranks as the post-hoc pairwise test. Friedman is a non-parametric test for comparing more than two related samples, it doesn't assume normality, and it's robust to homoscedasticity (variance of residuals). Friedman tests are already used in some ML contexts for comparing models, as advocated by Demsar (2006). Note that evalstats does not output Friedman by default, since it does not foreground a NHST-style analysis pipeline, but it is available as an option for users who want to use it.

For multiple comparisons correction, we recommend the Benjamini-Hochberg procedure (correction="fdr_bh") to those in lower-stakes contexts, as it which performed comparably to other methods in our simulation, but had the greatest power among methods that maintained Type I error control. In extremely safety-critical contexts, the Bonferroni-Holm correction (correction="holm") is more appropriate for its strong control over false positives, albeit at the cost of increased false negatives.

Do not rely on NHST alone

We ran simulations to provide guidance on which p-value method is most reliable for LLM evals, for those who want to use them, but we do not recommend relying on p-values. Null-hypothesis significance testing (NHST) should not be the primary basis for eval decisions. A p-value is only a compatibility check against a null model; it does not tell you the magnitude, direction, or practical importance of an effect.

If you report p-values at all, always report them alongside effect sizes and confidence intervals. In practice, prefer effect sizes and CIs as the main evidence, and use p-values as secondary context. Prefer alpha thresholds that are more conservative than 0.05, if you use them at all (e.g., 0.001).