May 28, 2026 · Skelf-Research

Stopping rules: when have you compared enough?

There is no universal answer, but there are three honest tests. We walk through each, with the queries you can run against the compere API.


Every team that runs a comparison study eventually hits the same question. We have collected a few hundred votes. The leaderboard looks pretty stable. Do we stop, or do we collect more?

There is no universal answer. There are, however, three honest tests, none of which require statistics you cannot run from the compere API. This post walks through all three. We use the live HTTP endpoints — /ratings, /comparisons/, /mab/next_comparison — throughout, so you can copy-paste against your own deployment.

The short version: stop when the decisions you would make on this ranking would not change with more data. Everything below operationalizes that rule.

Test 1: top-k stability

This is the test that maps most directly to product decisions. You usually do not need the whole ranking right. You need to know which three headlines to ship, or which five candidate responses to keep in the RLHF dataset, or which top item to promote. The rest of the leaderboard can wobble freely.

The check is straightforward. Snapshot the top-k entities every time you collect another m comparisons. If the set of entity IDs in the top-k has not changed in the last several snapshots, you are done for that decision.

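# Sketch against a running compere instance. Assumes /ratings returns a JSON
# array sorted best-first, with an "id" field on each entry.
top_k_now=$(curl -s localhost:8090/ratings | jq -r '.[:5] | .[].id' | sort)

# Compare set membership (not order) with an earlier snapshot, here the one
# saved at 200 comparisons, one ID per line.
diff <(echo "$top_k_now") <(sort snapshot_at_n200.txt) && echo "top-5 set unchanged"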

Operationally, we treat top-k as stable when its membership has not changed across the most recent 10% of total comparisons. If your study has 300 votes, the last 30 should not move the top-5 set. Note this is about set membership, not order. If headlines A and B keep swapping the #1 and #2 slots but no other headline has entered the top-5, you are stable for any decision that selects “the top five” rather than “the single best one”.
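The 10% rule can be sketched in a few lines of Python. This is a minimal illustration, not compere code: the function name, the snapshot format (a comparison count paired with a list of top-k IDs), and the requirement of at least two snapshots inside the window are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def top_k_is_stable(
    snapshots: Sequence[Tuple[int, Sequence[str]]],
    total_comparisons: int,
    window_fraction: float = 0.10,
) -> bool:
    """Return True if top-k set membership is unchanged across the recent window.

    snapshots: (comparison_count_at_snapshot, top_k_ids) pairs, oldest first.
    """
    window = max(1, round(total_comparisons * window_fraction))
    cutoff = total_comparisons - window
    # Compare as sets so that swaps inside the top-k do not count as a change.
    recent: List[frozenset] = [
        frozenset(ids) for n, ids in snapshots if n >= cutoff
    ]
    # Require at least two snapshots in the window; one alone proves nothing.
    return len(recent) >= 2 and all(s == recent[0] for s in recent)


# Hypothetical study with 300 votes, snapshotted every 30 comparisons.
snapshots = [
    (240, ["h1", "h2", "h3", "h4", "h9"]),
    (270, ["h1", "h2", "h3", "h4", "h5"]),
    (300, ["h2", "h1", "h3", "h5", "h4"]),  # order changed, membership did not
]
print(top_k_is_stable(snapshots, total_comparisons=300))
```

For a rank-sensitive decision, comparing tuples instead of frozensets would turn this into the stricter top-k order test.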

If your decision is rank-sensitive (you genuinely need to know which of A or B is #1), tighten the test to top-k order, and expect to collect roughly 2-3x more votes.

Test 2: UCB exhaustion

The compere MAB module exposes a useful signal almost by accident. The UCB1 score for an entity is win_rate + c * sqrt(ln(N) / n_i), where n_i is the number of comparisons entity i has appeared in and, in the standard UCB1 formulation, N is the total number of comparisons so far. The right-hand term, the exploration bonus, shrinks as n_i grows. When every entity’s UCB score is dominated by its empirical win rate rather than its exploration bonus, the system is telling you it has no more high-information pairs to ask about.
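To make the two terms concrete, here is a small sketch that computes them separately. It assumes the standard UCB1 reading of the formula (N is the total number of comparisons, n_i the number involving entity i); the function name and the sample numbers are illustrative, and compere's internal implementation may differ.

```python
import math
from typing import Tuple


def ucb1_terms(wins: int, n_i: int, total_n: int, c: float = 1.414) -> Tuple[float, float]:
    """Return (win_rate, exploration_bonus) for one entity.

    Assumes n_i >= 1 and total_n >= 1; the UCB1 score is the sum of the two.
    """
    win_rate = wins / n_i
    bonus = c * math.sqrt(math.log(total_n) / n_i)
    return win_rate, bonus


# Same total N, increasing n_i: the bonus shrinks while the win rate stays put.
for n_i in (5, 10, 20, 40):
    win_rate, bonus = ucb1_terms(wins=n_i // 2, n_i=n_i, total_n=100)
    print(f"n_i={n_i:3d}  win_rate={win_rate:.2f}  bonus={bonus:.2f}")
```

At a fixed N, quadrupling n_i halves the bonus, which is why the signal only appears once every entity has been compared a reasonable number of times.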

Concretely, query /mab/next_comparison and inspect the two entity IDs it returns. Track which pair is being recommended over a sliding window. The signal you want is:

Early in the study, recommended pairs change frequently and span many entity IDs (the system is exploring).
Late in the study, recommended pairs cluster — the system is asking you the same handful of questions, because they are the only ones still uncertain.

When that cluster stabilizes, with the same two or three pairs coming back repeatedly, you have reached a point where the UCB algorithm has little left to learn about the rest of the board. Asking more questions about those pairs will refine their relative position, but it will not change anything else. That is a reasonable signal to stop.
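One way to track this is a fixed-size window of recent recommendations. The sketch below is an assumption-laden illustration: the class name, the window size of 50, and the threshold of three pairs are hypothetical, and the caller is expected to extract the two entity IDs from each /mab/next_comparison response (the response shape is not shown here) and record one recommendation per submitted vote.

```python
from collections import deque


class PairWindow:
    """Sliding window over the most recently recommended pairs."""

    def __init__(self, size: int = 50) -> None:
        self.size = size
        self.recent = deque(maxlen=size)  # oldest entries fall off automatically

    def record(self, entity_a: str, entity_b: str) -> None:
        # frozenset makes (a, b) and (b, a) count as the same pair.
        self.recent.append(frozenset((entity_a, entity_b)))

    def distinct_pairs(self) -> int:
        return len(set(self.recent))

    def has_clustered(self, max_pairs: int = 3) -> bool:
        # Only judge once the window is full, so early queries cannot trigger it.
        return len(self.recent) == self.size and self.distinct_pairs() <= max_pairs


window = PairWindow(size=4)
for a, b in [("e1", "e2"), ("e2", "e1"), ("e3", "e4"), ("e1", "e2")]:
    window.record(a, b)
print(window.distinct_pairs(), window.has_clustered())
```

Plotting distinct_pairs() over the life of the study should show the exploring-then-clustering shape described above.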

This test has one important failure mode: if you set UCB_EXPLORATION_CONSTANT very high, the exploration bonus stays large for longer and the test triggers later than it should. The default c = 1.414 is calibrated for this stopping test to be useful around 5-10 comparisons per entity.
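To see why a larger constant delays the signal, solve for the number of comparisons at which the exploration bonus drops below some tolerance. Here ε is a hypothetical threshold chosen for illustration, not a compere setting:

```latex
c\,\sqrt{\frac{\ln N}{n_i}} < \varepsilon
\quad\Longleftrightarrow\quad
n_i > \frac{c^{2}\,\ln N}{\varepsilon^{2}}
```

Holding N fixed, the required n_i grows with the square of c, so doubling the constant roughly quadruples the comparisons each entity needs before the bonus stops dominating.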

Test 3: bootstrap re-runs

The most rigorous test is also the slowest: rerun the ranking on a random subsample of your comparison history and see if the leaderboard agrees with itself.

import random
from compere.modules.rating import update_elo_ratings

def replay(comparisons, fraction=0.8):
    sample = random.sample(comparisons, int(len(comparisons) * fraction))
    # reset ratings, replay in random order, return final leaderboard
    ...

# run replay() a few dozen times; compare top-k stability across runs

If 80% of the data, in arbitrary order, reproduces roughly the same top-k, your ranking is stable to the sampling noise that actually generated it. If it does not, you have a fragile leaderboard and you need more votes. We have observed cases where top-3 was stable on the full data but moved in 40% of bootstrap replays; those rankings were one bad week of voting away from being wrong.

This is the test that catches non-transitive structure (rock-paper-scissors loops) and biased reviewer pools. Both produce leaderboards that look stable until you resample, at which point the rotation becomes obvious.
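The replay can be sketched end to end without touching compere internals. Everything here is an assumption for illustration: comparisons are taken to be (winner_id, loser_id) tuples, the Elo update is the textbook one on a 400-point scale with K = 32 rather than compere's update_elo_ratings, and the function names are hypothetical. Like replay() above, it samples without replacement.

```python
import random
from typing import List, Sequence, Tuple

Comparison = Tuple[str, str]  # (winner_id, loser_id), an assumed format


def elo_leaderboard(comparisons: Sequence[Comparison],
                    k_factor: float = 32.0, start: float = 1500.0) -> List[str]:
    """Replay comparisons in the given order; return IDs sorted best-first."""
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, start)
        rl = ratings.setdefault(loser, start)
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k_factor * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return sorted(ratings, key=ratings.get, reverse=True)


def top_k_agreement(comparisons: Sequence[Comparison], k: int = 3,
                    fraction: float = 0.8, runs: int = 50, seed: int = 0) -> float:
    """Fraction of subsampled replays whose top-k set matches the full-data top-k."""
    rng = random.Random(seed)
    reference = frozenset(elo_leaderboard(comparisons)[:k])
    size = int(len(comparisons) * fraction)
    matches = 0
    for _ in range(runs):
        # random.sample returns a random subset in random order.
        sample = rng.sample(list(comparisons), size)
        if frozenset(elo_leaderboard(sample)[:k]) == reference:
            matches += 1
    return matches / runs


# Toy history in which "a" never loses, so the top-1 should survive resampling.
history = [("a", "b"), ("a", "c"), ("b", "c")] * 40
print(top_k_agreement(history, k=1))
```

An agreement well below 1.0 is the fragile-leaderboard case: the full-data top-k depends on which votes happened to be collected.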

A combined recipe

In practice we run all three. The decision tree we use:

1. Have we collected at least 5 * n comparisons, where n is the number of entities? If no, keep going. Below that ratio the Elo numbers are mostly initialization noise and the stopping tests are not yet meaningful.
2. Is top-k set membership stable across the most recent 10% of votes? If no, keep going.
3. Has /mab/next_comparison been returning the same small cluster of pairs for the last several queries? If no, keep going, because the system thinks it can still learn.
4. Does a bootstrap replay over 80% of the data reproduce the top-k? If no, keep going.
5. All four pass: stop.
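The decision tree above can be written as a single function that reports the first failing check. The dataclass, field names, and return strings are hypothetical; the three boolean inputs are assumed to come from whatever implementation of the individual tests you use.

```python
from dataclasses import dataclass


@dataclass
class StudyState:
    n_entities: int
    n_comparisons: int
    top_k_stable: bool                # Test 1 result
    pairs_clustered: bool             # Test 2 result
    bootstrap_reproduces_top_k: bool  # Test 3 result


def stopping_decision(state: StudyState, min_ratio: int = 5) -> str:
    """Walk the checks in order and return the first reason to keep going."""
    if state.n_comparisons < min_ratio * state.n_entities:
        return "continue: fewer than 5 * n comparisons"
    if not state.top_k_stable:
        return "continue: top-k membership still changing"
    if not state.pairs_clustered:
        return "continue: recommended pairs still spread out"
    if not state.bootstrap_reproduces_top_k:
        return "continue: bootstrap replay disagrees"
    return "stop"


state = StudyState(n_entities=20, n_comparisons=180,
                   top_k_stable=True, pairs_clustered=True,
                   bootstrap_reproduces_top_k=False)
print(stopping_decision(state))
```

Ordering the checks from cheapest to most expensive means the bootstrap replay only needs to run once the other three have passed.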

For most catalogs we work with — 10 to 30 entities — this combined test fires somewhere between 8 * n and 15 * n comparisons. Smaller catalogs converge faster in absolute terms but waste a higher fraction of votes on initialization. Larger catalogs need disproportionately more data.

What stopping does not mean

Stopping a study does not freeze the truth. If the underlying preferences move — new reviewers join, the news cycle shifts, model outputs change — the old ranking goes stale. The mechanically correct answer is to either restart the study periodically or run a continuously-updated deployment with a smaller K-factor (the K controls how much weight to give recent comparisons relative to history). We have customers on both patterns. Neither is wrong; they answer different questions.
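For reference, this is the textbook Elo update, which shows where K enters. It is the conventional form on a 400-point logistic scale; compere's exact update rule is not reproduced here and may differ in its constants.

```latex
E_a = \frac{1}{1 + 10^{(R_b - R_a)/400}},
\qquad
R_a' = R_a + K\,(S_a - E_a),
\qquad
R_b' = R_b - K\,(S_a - E_a)
```

Here S_a is 1 if a wins and 0 if a loses. Between two equally rated entities E_a = 0.5, so a single win moves each rating by K/2: 16 points at K = 32, but only 4 points at K = 8.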

Stopping also does not produce confidence intervals. Compere does not report uncertainty intervals around each rating, because in online Elo the most defensible uncertainty proxy is n_i (the number of comparisons that fed into the rating). Two entities with similar ratings but very different n_i should be treated very differently when making a decision. If you need formal credible intervals, you should export your comparison history and fit a Bayesian Bradley-Terry model. We do not do this in-product, on purpose: the moment we promise calibrated intervals, we owe the user a defensible model of reviewer noise, and reviewer noise is not stationary.
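For readers who take the export route, the Bradley-Terry model itself is short. The sketch below uses a zero-mean Gaussian prior on each strength parameter, which is one common choice and an assumption here, not a recommendation from compere; w_ij denotes the number of times i beat j in the exported history.

```latex
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},
\qquad
\theta_i \sim \mathcal{N}(0, \sigma^{2}),
\qquad
p(\theta \mid \text{data}) \;\propto\;
\prod_i \mathcal{N}(\theta_i \mid 0, \sigma^{2})
\prod_{i \neq j} P(i \succ j)^{\,w_{ij}}
```

Credible intervals for each θ_i then come from the posterior, typically via MCMC or a Laplace approximation. Note that this model treats reviewer noise as stationary, which is the assumption the paragraph above declines to make in-product.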

So the honest stopping rule is small: collect until your decision will not change, then audit the decision once more, then ship. The leaderboard will not be the truth. It will be an honest summary of the votes you collected, which is a much more achievable goal.
