Intelligent pairwise comparisons. Better rankings with fewer votes.
Compere is a Python library and FastAPI service that picks which pair to compare next using a Multi-Armed Bandit, then turns the verdicts into Elo ratings you can read. Built for RLHF data collection, A/B testing, leaderboards, and eval ranking.
Two algorithms, one honest answer.
In the UCB score, c = 1.414 (default) is the exploration constant, N is the total number of comparisons, and n_i is the number of comparisons involving entity i. New entities receive a large initial weight so they get surveyed before they get scored.
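The symbols above fit the standard UCB1 exploration bonus, c * sqrt(ln N / n_i). The following is a minimal sketch under that assumption, not compere's actual implementation: the function names `ucb_bonus` and `pick_pair` are hypothetical, an infinite score stands in for the "large initial weight", and pairing the two highest-scoring entities is an illustrative rule only.

```python
import math

def ucb_bonus(total_comparisons, entity_comparisons, c=1.414):
    """Exploration bonus for one entity (standard UCB1 form; an assumption)."""
    if entity_comparisons == 0:
        return math.inf  # unseen entities get surveyed first
    # max(..., 1) guards against log(0) before any comparison exists
    return c * math.sqrt(math.log(max(total_comparisons, 1)) / entity_comparisons)

def pick_pair(counts, total_comparisons, c=1.414):
    """Return the ids of the two entities with the largest bonus (illustrative rule)."""
    ranked = sorted(
        counts,
        key=lambda eid: ucb_bonus(total_comparisons, counts[eid], c),
        reverse=True,
    )
    return ranked[0], ranked[1]

# n_i per entity; each comparison touches two entities, so N = 12 / 2 = 6
counts = {"a": 10, "b": 2, "c": 0}
print(pick_pair(counts, total_comparisons=6))  # ('c', 'b')
```

The never-compared entity "c" is picked first, then "b", whose small n_i gives it a larger bonus than the heavily compared "a".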
expected = 1 / (1 + 10^((opp − rating) / 400))
K = 32 by default. Initial rating 1500. The same Elo formulation you know from chess; nothing fancier is claimed.
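As a worked example of these numbers, here is a small sketch that assumes the standard chess-style update, new = rating + K * (score − expected), with a score of 1 for the winner and 0 for the loser. The function names `expected_score` and `update` are illustrative and not part of compere's API.

```python
def expected_score(rating, opp):
    """Probability that `rating` beats `opp` under the Elo expected-score formula."""
    return 1 / (1 + 10 ** ((opp - rating) / 400))

def update(winner, loser, k=32):
    """Return the new (winner, loser) ratings after one verdict."""
    e_win = expected_score(winner, loser)
    delta = k * (1 - e_win)  # winner scored 1 but was expected to score e_win
    return winner + delta, loser - delta

print(update(1500, 1500))                    # (1516.0, 1484.0)
print(round(expected_score(1600, 1500), 2))  # 0.64
```

Two fresh entities at 1500 move 16 points apart in each direction after one vote, and a 100-point gap corresponds to roughly a 64% expected win rate for the higher-rated entity.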
Note on the field: compere does not implement Bradley-Terry, Thurstone, or TrueSkill. UCB plus Elo was chosen because both are interpretable end-to-end — you can explain a rating change to a stakeholder in two sentences.
Eval & RLHF teams
Collect preference data over model outputs without showing annotators every pair. UCB concentrates votes on uncertain comparisons; you ship a reward signal sooner.
A/B and content ranking ops
Rank designs, headlines, or product photos against each other. The Elo board is sortable, replayable, and an honest function of the votes it received.
Taste-graph builders
Turn “A or B?” clicks into a ranked catalog. SQLite by default; PostgreSQL when you outgrow it; same code on either.
Researchers and tinkerers
Import compere as a library and call create_entity / create_comparison / get_ratings directly. No server needed for offline studies.
Install, run, vote.
pip install compere
compere --port 8090
# get the next pair to compare (UCB picks it)
curl localhost:8090/mab/next_comparison
# record a verdict
curl -X POST localhost:8090/comparisons/ \
-H "Content-Type: application/json" \
-d '{"entity1_id":1,"entity2_id":2,"selected_entity_id":1}'
# read the leaderboard
curl localhost:8090/ratings
Interactive API docs are served at /docs by FastAPI. The full HTTP surface is the eleven endpoints listed in the API reference.
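The same three calls can be made from Python with only the standard library. This is a sketch, assuming the server from the quickstart is listening on localhost:8090; the helper names `vote_request`, `fetch_json`, and `demo` are hypothetical, and the response payloads are not documented here, so the code prints whatever the server returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8090"  # matches `compere --port 8090`

def vote_request(entity1_id, entity2_id, selected_entity_id, base_url=BASE_URL):
    """Build the POST /comparisons/ request without sending it."""
    body = json.dumps({
        "entity1_id": entity1_id,
        "entity2_id": entity2_id,
        "selected_entity_id": selected_entity_id,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/comparisons/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_json(url_or_request):
    """Send a request and decode the JSON body (response shape is an unknown here)."""
    with urllib.request.urlopen(url_or_request) as resp:
        return json.load(resp)

def demo():
    """Run the quickstart flow; needs a running compere server."""
    print(fetch_json(f"{BASE_URL}/mab/next_comparison"))  # next pair to compare
    print(fetch_json(vote_request(1, 2, 1)))              # record a verdict
    print(fetch_json(f"{BASE_URL}/ratings"))              # read the leaderboard
```

Call `demo()` once the server is up. Building the request separately from sending it keeps the payload easy to inspect or log before any vote is recorded.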
-
Stopping rules: when have you compared enough?
There is no universal answer, but there are three honest tests. We walk through each, with the queries you can run against the compere API.
-
Reading the Elo output without lying
Compere ships Elo, not Bradley-Terry. The numbers it produces are easy to misread. A field guide to what an Elo gap actually means.
-
Why 50 pairwise votes beat 500 ratings
Rating scales drift, anchor, and lie. Pairwise comparisons survive all three. Here is the math we use, and where it actually breaks down.
Honest, narrow comparisons against tools that overlap on one axis or another.
-
compere vs. plain Elo libraries
You can get Elo from a 30-line gist. The interesting question is which pair to ask about next.
-
compere vs. AHP-style ranking SaaS
AHP gives weighted criteria; compere gives a single ranked list from raw pairwise votes.