Reading the Elo output without lying

Compere ships Elo, not Bradley-Terry. The numbers it produces are easy to misread. A field guide to what an Elo gap actually means.


Compere outputs Elo ratings. We picked Elo deliberately, over Bradley-Terry, over Thurstone, over TrueSkill, because Elo has a single attractive property no other estimator quite matches: a stakeholder can read a rating change and explain it correctly in one sentence. That property is also Elo’s most common failure mode. The number is easy to read and easy to misread, and the misreads have shapes we have learned to recognize.

This post is a field guide. If you are about to put an Elo leaderboard in front of a product manager, this is the briefing.

What an Elo rating is

A Compere Elo rating is a single floating-point number, starting at 1500 (configurable via ELO_INITIAL_RATING), updated after each comparison. The update is:

expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
new_rating = rating + K * (actual - expected)

K is the K-factor (default 32). actual is 1, 0, or 0.5 for win, loss, or draw. The 400 in the denominator is a calibration constant inherited from chess; it sets the rating scale so that a 200-point gap corresponds to a ~76% win expectation.

That last sentence is the only quantitative reading of an Elo rating that is actually defensible.
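As a minimal sketch, the update rule above can be written as a pair of Python functions. The names expected_score and update_rating are illustrative, not Compere's API; the defaults mirror the K-factor and the 400-point scale described above.

```python
def expected_score(rating, opponent, scale=400):
    """Probability the model assigns to `rating` beating `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / scale))


def update_rating(rating, opponent, actual, k=32):
    """Return the new rating after one comparison.

    actual is 1 for a win, 0 for a loss, 0.5 for a draw.
    """
    return rating + k * (actual - expected_score(rating, opponent))


# A 200-point favourite is expected to win roughly three times out of four.
print(round(expected_score(1700, 1500), 2))  # 0.76
```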

What an Elo gap means

When entity A has rating 1620 and entity B has rating 1500, the model expects A to beat B with probability 1 / (1 + 10**((1500 - 1620) / 400)), which is approximately 0.666. That is a number. You can put it in a report.

The temptation is to do more arithmetic. People will, in our experience, look at a leaderboard like this:

Headline A    1720
Headline B    1612
Headline C    1500
Headline D    1390

and start saying things like “Headline A is 330 points better than Headline D” or “Headline A is 14% above the average”. Both of those sentences are wrong. The gap between A and D corresponds to a win probability (87% for A) and to nothing else. There is no linear scale of “goodness” here. There is no meaningful “average” to be 14% above either; the mean of the ratings is pinned near the initial rating by the way updates are constructed, so it reflects configuration, not the quality of a typical item.

A useful reframe is to translate every rating gap to the expected-win-probability before sharing it. Compere does not currently render this column in the API response, but it is one line:

def win_prob(a, b):
    return 1 / (1 + 10 ** ((b - a) / 400))

The leaderboard above, expressed as pairwise expected win probabilities for the top entity, is:

A vs B:  65%
A vs C:  78%
A vs D:  87%

Those are the numbers a product manager should be looking at.
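A small helper along these lines can produce that column for any leaderboard. This is a sketch under assumptions: pairwise_vs_top is a hypothetical name, and the plain dict of ratings stands in for whatever shape your API response actually has.

```python
def win_prob(a, b):
    """Expected probability that an entity rated `a` beats one rated `b`."""
    return 1 / (1 + 10 ** ((b - a) / 400))


def pairwise_vs_top(ratings):
    """Map 'top vs other' labels to the top entity's expected win probability.

    ratings: dict of entity name -> Elo rating (assumed non-empty).
    """
    ranked = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top_rating), rest = ranked[0], ranked[1:]
    return {f"{top_name} vs {name}": win_prob(top_rating, r) for name, r in rest}


leaderboard = {
    "Headline A": 1720,
    "Headline B": 1612,
    "Headline C": 1500,
    "Headline D": 1390,
}
for label, p in pairwise_vs_top(leaderboard).items():
    print(f"{label}: {p:.0%}")
```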

What rating changes mean

Single-comparison rating changes are even easier to misread. After a 1500-rated entity beats a 1500-rated entity, both ratings move by K * (1 - 0.5) = 16 points (default K). If a 1700 entity beats a 1500 entity, the result was expected, so little changes hands: the loser drops by about K * (0 - 0.24) = -7.7 points and the winner gains the same amount. Had the 1500 entity won instead, about 24.3 points would have moved. The asymmetry is between outcomes, not between players: upsets move the board, expected results barely do.

This is desirable. It is also confusing the first time a reviewer sees their top-ranked entity stay flat after winning ten comparisons in a row. The flatness is the system saying “yes, we knew.” If you want flatness to be alarming instead of reassuring, you have configured the wrong tool.
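To make that concrete, the sketch below computes how many points change hands for an even match, an expected win, and an upset, assuming the default K of 32. The function name points_exchanged is illustrative only.

```python
def points_exchanged(winner, loser, k=32):
    """Points the winner gains (and the loser gives up) in one comparison."""
    expected = 1 / (1 + 10 ** ((loser - winner) / 400))
    return k * (1 - expected)


# Even match, expected win for the favourite, upset by the underdog.
for winner, loser in [(1500, 1500), (1700, 1500), (1500, 1700)]:
    delta = points_exchanged(winner, loser)
    print(f"{winner} beats {loser}: {delta:.1f} points move")
```

The wider the gap in the favourite's favor, the closer the expected-win case gets to zero, which is the flatness described above.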

Why we did not pick Bradley-Terry

Compere does not implement Bradley-Terry. A Bradley-Terry MLE would, in principle, extract a little more information from the same comparison data — it considers the full pattern of wins and losses jointly, rather than updating one match at a time. For static catalogs with fixed item strengths it is a strictly better estimator.

We chose Elo anyway, for three reasons:

  1. Online updates. Elo is constant-time per comparison. Bradley-Terry is a fit over the entire matchup history; it does not have a meaningful “this match’s contribution” and it must be refit after every batch. For systems serving live UCB pair selection, that becomes the bottleneck.
  2. Interpretability of deltas. “Your rating went up by 8” is a concept Elo gives you. Bradley-Terry only gives you the new MLE; what changed is a function of the whole dataset.
  3. Stakeholder familiarity. Half the people reading the leaderboard already know what an Elo is from chess. None of them know what a Bradley-Terry log-strength parameter is.

If your study is a one-shot offline ranking and you want a tighter estimator, you should export comparisons from Compere and fit Bradley-Terry in scipy or choix. We do not do this for you, and the documentation says so.
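For a sense of what that offline fit involves, here is a dependency-free sketch that uses the classic MM (Zermelo) iteration in place of scipy or choix. It assumes the export is a list of (winner, loser) pairs, which is a hypothetical shape and not a documented Compere format, and fit_bradley_terry is an illustrative name. The maximum-likelihood fit only behaves well when every item has at least one win and one loss.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict of item -> strength, normalised to sum to 1.
    Uses a fixed number of MM iterations rather than a convergence test.
    """
    wins = defaultdict(int)   # total wins per item
    games = defaultdict(int)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over opponents j that i actually met: n_ij / (p_i + p_j).
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength


# A beat B three times out of four, so the fitted strengths sit in a 3:1 ratio.
strengths = fit_bradley_terry([("A", "B")] * 3 + [("B", "A")])
print({name: round(s, 3) for name, s in sorted(strengths.items())})
```

Under this model the probability that i beats j is strength[i] / (strength[i] + strength[j]), which is the same quantity an Elo gap encodes on a logarithmic scale.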

The most common misreads, ranked

In approximate order of how often we have caught them in the wild:

“Entity X is twice as good as Entity Y, look at the ratings.” No. Elo is not on a ratio scale. There is no defined origin. A 3000-rated entity is not twice as good as a 1500-rated one in any meaningful sense; they are both arbitrary points on a relative scale.

“The average rating went up.” It cannot, as long as both sides of a comparison are updated with the same K: whatever the winner gains, the loser gives up, so Elo conserves total rating across each match. If a leaderboard’s mean appears to have moved, you have either added or removed entities, or your initial-rating configuration changed between snapshots.

“This entity is the best because it has the most wins.” No: it might just have had the most comparisons. The leaderboard is by rating, not by win count, for this reason. Always pair a rating with its n_i (comparison count) when sharing results.

“The rating converged after 30 votes, so we are done.” Maybe. The next post in this series is entirely about how to know when you are really done, and the answer is rarely “30 votes”.
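The conservation claim behind the second misread is easy to check by simulation. The sketch below is not Compere code: it applies the update rule from earlier in this post to random comparisons among four entities, with the same K on both sides, and prints the mean rating at the end.

```python
import random


def play(ratings, a, b, a_wins, k=32):
    """Apply one Elo update in place; both sides use the same K."""
    expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    ratings[a] += delta
    # b's update is the mirror image, because expected_b = 1 - expected_a.
    ratings[b] -= delta


rng = random.Random(0)  # fixed seed so the run is repeatable
ratings = {name: 1500.0 for name in "ABCD"}
for _ in range(1000):
    a, b = rng.sample(sorted(ratings), 2)
    play(ratings, a, b, a_wins=rng.random() < 0.5)

mean = sum(ratings.values()) / len(ratings)
print(round(mean, 6))
```

Individual ratings wander, but the mean stays at the initial rating up to floating-point error.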

A reporting template that holds up

When we ship a Compere-derived ranking to a stakeholder, we include four columns:

Entity        Rating    Comparisons (n)    Implied win-rate vs. mid-pack
Headline A    1720      23                 78%
Headline B    1612      19                 66%
Headline C    1500      21                 50%
Headline D    1390      18                 35%

The rating column is for sorting. The n column is for honesty — it stops anyone from over-trusting an item with three comparisons sitting above an item with thirty. The implied-win-rate column converts the rating into the only number that has a defensible real-world meaning: “if I show this and a typical item side by side, this one wins about 78% of the time.”
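The template can be generated from ratings and comparison counts with a few lines. This is a sketch, with two assumptions: “mid-pack” is taken to be the default initial rating of 1500, and the input is a list of (name, rating, n) tuples. The function names are illustrative.

```python
def implied_win_rate(rating, baseline=1500):
    """Expected win probability against an item rated at `baseline`."""
    return 1 / (1 + 10 ** ((baseline - rating) / 400))


def report_rows(entities, baseline=1500):
    """Return (name, rating, n, win-rate string) rows sorted by rating, highest first."""
    ranked = sorted(entities, key=lambda e: e[1], reverse=True)
    return [
        (name, rating, n, f"{implied_win_rate(rating, baseline):.0%}")
        for name, rating, n in ranked
    ]


entities = [
    ("Headline C", 1500, 21),
    ("Headline A", 1720, 23),
    ("Headline D", 1390, 18),
    ("Headline B", 1612, 19),
]
print(f"{'Entity':<12}{'Rating':>8}{'n':>5}{'Win-rate':>10}")
for name, rating, n, rate in report_rows(entities):
    print(f"{name:<12}{rating:>8}{n:>5}{rate:>10}")
```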

If you cannot defend the rating, n, and implied-win-rate columns for every row, you are reading the Elo output as something it is not. The numbers are useful; they are also opinionated. Pretending otherwise is how rankings end up in a deck that gets quoted six months later by someone who has never heard of K-factors.