Why 50 pairwise votes beat 500 ratings
Rating scales drift, anchor, and lie. Pairwise comparisons survive all three. Here is the math we use, and where it actually breaks down.
If you have ever asked a panel of human reviewers to score model outputs on a 1-to-7 Likert scale, you know how the data turns to mush. Reviewer A is generous on Tuesdays. Reviewer B reserves “7” for divine intervention. Reviewer C is anchored by whatever they saw last. The arithmetic mean over all of that is, charitably, a vibe.
The unreliability of absolute rating scales is one of the oldest results in psychometrics, and it is the reason compere does not implement rating sliders. We collect comparisons (“A or B?”) and let an Elo update handle the bookkeeping. After fifty well-chosen comparisons, you typically learn more about a catalog of twenty items than you would from five hundred independent ratings of the same items. Here is why, and where the claim breaks.
What rating scales actually measure
A reviewer who hands you a “5 out of 7” has performed three operations in their head, mostly unconsciously. They have evaluated the item. They have located it on their personal scale. And they have applied whatever calibration they currently believe is in force — “are we grading on a curve today?” Only the first step is about the item. The other two are noise, and they compound across reviewers.
The standard fixes — z-scoring per reviewer, mean-centering, mixed-effects models with a reviewer random intercept — help with location bias. They do not help with the harder problem, which is that the scale itself is not stable within one reviewer’s session. Anchoring effects on Likert data are well-documented and they are large. If item #3 in the queue was a disaster, the reviewer’s “5” between item #4 and item #20 means two different things.
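To make the first of those fixes concrete, here is a minimal sketch of per-reviewer z-scoring. The function name and the nested-dict layout are assumptions made for illustration; nothing here is compere code.

```python
from statistics import mean, pstdev


def zscore_by_reviewer(ratings: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Rescale each reviewer's scores to mean 0 and unit spread.

    ratings maps reviewer -> {item: raw score}. (Hypothetical layout.)
    """
    out = {}
    for reviewer, scores in ratings.items():
        mu = mean(scores.values())
        sigma = pstdev(scores.values())
        # A reviewer who gave every item the same score carries no ranking signal.
        out[reviewer] = {
            item: (s - mu) / sigma if sigma else 0.0
            for item, s in scores.items()
        }
    return out


# A generous and a strict reviewer who agree on the order of the items.
raw = {
    "generous": {"x": 5, "y": 6, "z": 7},
    "strict": {"x": 2, "y": 3, "z": 4},
}
normalized = zscore_by_reviewer(raw)
```

After normalization the two reviewers produce the same scores, so the constant offset between them is gone. What this cannot repair is a scale that shifts partway through one reviewer's session, which is the harder problem described above.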
Pairwise comparisons sidestep all of this. The reviewer is not asked to locate either item on a scale. They are asked which is better. The output is a single bit per question, which is much less information per response than a Likert rating — but it is a cleaner bit, and we can ask many more of them.
What Elo does with those bits
Compere stores each entity’s strength as a single floating-point rating (default initial value: 1500). When a comparison resolves, we update both ratings with the standard Elo formula:
expected_a = 1 / (1 + 10**((rating_b - rating_a) / 400))
new_rating_a = rating_a + K * (actual_a - expected_a)
K is the rating sensitivity (default 32). actual_a is 1 if entity A won, 0 if it lost, 0.5 for a draw. Entity B gets the mirror-image update, with actual_b = 1 - actual_a and expected_b = 1 - expected_a. The formula is the same one used by FIDE, and it has the same properties: ratings move quickly at first, then stabilize; a win against a weaker opponent is worth less than a win against a stronger one; the system is conservative about overturning established ratings.
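The two formulas above can be wrapped into a runnable sketch. The function names are illustrative and are not taken from rating.py; the only assumption beyond the text is that B's rating moves by the same amount as A's in the opposite direction, which is what the standard Elo update implies.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, actual_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    actual_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    """
    delta = k * (actual_a - expected_score(rating_a, rating_b))
    # B's change mirrors A's, so the total rating in the pool is conserved.
    return rating_a + delta, rating_b - delta


# Two fresh entities at 1500: the expected score is 0.5, so the winner gains K/2.
a, b = elo_update(1500.0, 1500.0, actual_a=1.0)
print(a, b)  # 1516.0 1484.0
```

Running the same update with a 1400-rated entity beating a 1600-rated one yields a gain well above 16 points, which is the "upsets are worth more" property in action.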
There is no maximum-likelihood estimator under the hood. Compere is not running Bradley-Terry; it is running online Elo. This is a deliberate trade. An MLE would extract slightly more information from the same data, but it would require a refit after every batch, and it would lose the property that you can hand a stakeholder a single number per entity and they can explain its movement after a single match.
Why “well-chosen” matters
The big lie in the headline is that it drops the phrase “well-chosen”. Fifty random pairwise comparisons over a catalog of twenty items will not give you a good ranking. They will give you a bunch of redundant evidence about pairs you already had data on, and zero evidence about the items that landed in nobody’s queue.
This is what the Multi-Armed Bandit module is for. Compere picks the next pair using the UCB1 score:
UCB(i) = win_rate(i) + c * sqrt(ln(N) / n_i)
where c defaults to 1.414, N is the total number of comparisons so far, and n_i is the number of comparisons involving entity i. Entities that have not been compared get a large prior weight (UCB_UNEXPLORED_WEIGHT, default 1000.0) so they enter the queue early. The two highest-UCB entities are paired.
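A minimal sketch of that selection rule follows. The stats layout, the function names, and the tie-breaking (first entity in insertion order wins) are assumptions for illustration; compere's actual module may organize this differently.

```python
import math

UCB_UNEXPLORED_WEIGHT = 1000.0  # default quoted in the text


def ucb_score(wins: float, n_i: int, total: int, c: float = 1.414) -> float:
    """UCB1 score for one entity: win rate plus an exploration bonus."""
    if n_i == 0:
        # Never-compared entities get the large fixed weight so they are picked first.
        return UCB_UNEXPLORED_WEIGHT
    return wins / n_i + c * math.sqrt(math.log(max(total, 1)) / n_i)


def pick_pair(stats: dict[str, tuple[float, int]], total: int) -> tuple[str, str]:
    """Return the two entity ids with the highest UCB scores.

    stats maps entity id -> (wins, comparisons involving that entity);
    total is the number of comparisons recorded so far.
    """
    ranked = sorted(stats, key=lambda e: ucb_score(*stats[e], total), reverse=True)
    return ranked[0], ranked[1]


# Five comparisons so far; "c" has never been shown to a reviewer.
stats = {"a": (3, 4), "b": (1, 4), "c": (0, 0), "d": (2, 2)}
print(pick_pair(stats, total=5))  # ('c', 'd')
```

In this example "c" is chosen because it is unexplored, and "d" beats "a" for the second slot because its two comparisons leave it with a larger exploration bonus.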
The practical effect is that the system spends its first few rounds making sure every entity has been seen, then concentrates the remaining budget on pairs whose outcome is genuinely uncertain. A comparison between the obvious top of the catalog and the obvious bottom does not move ratings much and is mostly wasted. UCB will avoid that pair and ask you about the middle of the pack, where one extra vote actually shifts the ranking.
When the claim breaks
The “50 votes” number is not a guarantee. It is what we typically see in catalogs of 10-30 items with non-pathological underlying strengths. Three conditions can blow it up:
- Catalog explosion. UCB’s information-per-comparison decays roughly with the square root of the number of items. Each comparison touches only two items, so a catalog of 500 items needs at least 250 comparisons, and in practice a few hundred, before every rating has even moved off the initial 1500.
- Non-transitive structure. If A beats B beats C beats A in your domain, Elo will oscillate. The rating system assumes a single latent strength per entity. Pairwise voting on rock-paper-scissors does not converge, and no amount of UCB will save it.
- Drift in the underlying truth. If you are ranking news headlines for engagement and the news cycle moves underneath you, “the right ranking” is itself moving. The K-factor controls how fast you can chase the truth; pushing it up gets you there faster, at the cost of noisier ratings.
We have shipped UCB_EXPLORATION_CONSTANT and ELO_K_FACTOR as environment variables specifically so that you can tune for these. Higher exploration when your catalog is large and quiet; higher K when the truth itself is moving.
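The non-transitive failure mode is easy to reproduce in a toy simulation. The sketch below is not compere code: it feeds a strict rock-paper-scissors cycle through the Elo update described earlier and records who leads after each vote. The function names are made up for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def simulate_cycle(rounds: int, k: float = 32.0):
    """Apply a strict cycle of results and track the leader after every vote."""
    ratings = {"rock": 1500.0, "paper": 1500.0, "scissors": 1500.0}
    # (winner, loser) pairs: every entity wins once and loses once per round.
    matches = [("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")]
    leaders = []
    for _ in range(rounds):
        for winner, loser in matches:
            delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
            leaders.append(max(ratings, key=ratings.get))
    return ratings, leaders


ratings, leaders = simulate_cycle(rounds=200)
print(sorted(set(leaders)))  # more than one entity takes a turn at the top
print({name: round(r, 1) for name, r in ratings.items()})
```

However many rounds you run, the ratings stay bunched near 1500 and the lead keeps changing hands, because there is no single latent strength for the updates to converge on. Raising or lowering k changes the size of the swings, not the fact that they happen.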
A practical recipe
If you are starting a new comparison study, the recipe we recommend is:
- Start with the defaults: c = 1.414, K = 32, initial rating 1500.
- Aim for roughly 5 * n comparisons, where n is the number of entities. With twenty items that is a hundred votes; in our experience the top-five rank order is stable by then.
- Watch the histogram of n_i (comparisons per entity) during collection. If it is wildly uneven and UCB_UNEXPLORED_WEIGHT is at the default, you have a bug somewhere; UCB should be flattening it.
- Stop when the leaderboard’s top-k positions are stable across the last 10% of votes. The next post in this series digs into this stopping rule in detail.
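The budget and the stopping check from the recipe can be sketched in a few lines. This reads “stable” as “the top-k order is identical in every snapshot within the last 10% of votes”, which is one possible interpretation and may not match the rule in the follow-up post. The function names and the snapshot format are assumptions.

```python
import math


def comparison_budget(n_entities: int, per_entity: int = 5) -> int:
    """Rule-of-thumb vote budget: roughly 5 * n comparisons."""
    return per_entity * n_entities


def top_k_stable(snapshots: list[list[str]], k: int = 5, tail_fraction: float = 0.10) -> bool:
    """Check whether the top-k order was unchanged over the last slice of votes.

    snapshots[i] is the leaderboard (best first) recorded after vote i + 1.
    """
    if not snapshots:
        return False
    tail_len = max(1, math.ceil(len(snapshots) * tail_fraction))
    tail = snapshots[-tail_len:]
    reference = tail[-1][:k]
    return all(board[:k] == reference for board in tail)


print(comparison_budget(20))  # 100
```

Keeping one leaderboard snapshot per vote is cheap at these catalog sizes, and it lets you re-run the check with a different k or tail fraction after the fact.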
Fifty good pairwise comparisons will not give you a definitive ranking of five hundred items. But for the catalogs most teams actually have — tens, occasionally low hundreds — they will get you a more honest answer than five hundred ratings on a seven-point scale ever will. And if you want the math, the rating update is fourteen lines of Python and it is in rating.py. No prestige machinery; just an old algorithm being asked the right question.
One more thing worth saying explicitly. The “50 versus 500” framing is a slogan, and slogans hide their assumptions. The real claim is narrower: for the specific job of inducing a consistent ranking from human preferences on a modest catalog, pairwise verdicts plus UCB pair selection plus Elo updates extract more signal per minute of reviewer time than independent rating sliders. The actual numbers depend on your reviewers, your items, and your tolerance for being wrong. But the direction is consistent enough across the studies we have run that we have stopped treating it as a hypothesis and started treating it as the default. If you have a counter-example, please file an issue; we genuinely want to know where this breaks in practice.