AI Benchmarks Catalog
This catalog lists the third-party AI benchmarks we aggregate and cross-reference against our first-party meo scores: Artificial Analysis indices, GPQA, MMLU-Pro, AIME, SWE-style coding, LMArena Elo, and more. Each entry shows what the benchmark measures, its license, and a per-benchmark model leaderboard.
We treat these as attributed secondary signals and cross-check them against our unleaked first-party meo benchmark. Public benchmarks are prone to contamination, and crowd arenas can be gamed; see the methodology for why.
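One way to picture the cross-check is a rank comparison between a public benchmark and a first-party score. The sketch below is illustrative only and is not the methodology used here: it assumes a Spearman rank correlation as the agreement measure, and the function names (`average_ranks`, `spearman`, `rank_gaps`) and the model scores are all hypothetical placeholders.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (highest score) downward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all scores are tied")
    return cov / (var_x * var_y) ** 0.5


def rank_gaps(public, first_party):
    """Public rank minus first-party rank, for models present in both score maps."""
    models = sorted(set(public) & set(first_party))
    pr = average_ranks([public[m] for m in models])
    fr = average_ranks([first_party[m] for m in models])
    return {m: p - f for m, p, f in zip(models, pr, fr)}


if __name__ == "__main__":
    # Hypothetical scores, invented purely for illustration.
    public = {"model-a": 90.0, "model-b": 80.0, "model-c": 70.0}
    first_party = {"model-a": 50.0, "model-b": 70.0, "model-c": 60.0}
    models = sorted(public)
    rho = spearman([public[m] for m in models], [first_party[m] for m in models])
    print("rank agreement:", rho)
    print("rank gaps:", rank_gaps(public, first_party))
```

A strongly negative gap means a model ranks much higher on the public benchmark than on the first-party one. That is a prompt for closer inspection, not proof of contamination.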
Artificial Analysis
| Benchmark | Type | License | Leaderboard |
|---|---|---|---|
| AA Coding Index | index | proprietary-attribution | View ranking → |
| AA Intelligence Index | index | proprietary-attribution | View ranking → |
| AA Math Index | index | proprietary-attribution | View ranking → |
| AIME | accuracy | proprietary-attribution | View ranking → |
| AIME 2025 | accuracy | proprietary-attribution | View ranking → |
| GPQA Diamond | accuracy | proprietary-attribution | View ranking → |
| Humanity's Last Exam | accuracy | proprietary-attribution | View ranking → |
| IFBench | accuracy | proprietary-attribution | View ranking → |
| LCR (long-context reasoning) | accuracy | proprietary-attribution | View ranking → |
| LiveCodeBench | accuracy | proprietary-attribution | View ranking → |
| MATH-500 | accuracy | proprietary-attribution | View ranking → |
| MMLU-Pro | accuracy | proprietary-attribution | View ranking → |
| SciCode | accuracy | proprietary-attribution | View ranking → |
| τ²-bench | accuracy | proprietary-attribution | View ranking → |
| Terminal-Bench Hard | accuracy | proprietary-attribution | View ranking → |
| Median output throughput (tokens/s) | tok_s | proprietary-attribution | View ranking → |
| Median time to first token (s) | seconds | proprietary-attribution | View ranking → |
LMArena (Chatbot Arena)
| Benchmark | Type | License | Leaderboard |
|---|---|---|---|
| LMArena Elo (Chatbot Arena) | elo | apache-2.0 | View ranking → |
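To read an Elo gap, it helps to convert it into an expected head-to-head win rate. The sketch below assumes the standard logistic Elo model with a 400-point scale. How LMArena fits its published ratings is not described in this catalog, so treat this as an interpretation aid and not as LMArena's method. The function name is hypothetical.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score (win probability, counting a draw as half a win) of A against B
    under the standard logistic Elo model. The 400-point scale is the conventional
    default and is an assumption here."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to show how a gap maps to a win rate.
    print(elo_expected_score(1300, 1300))  # equal ratings give 0.5
    print(elo_expected_score(1350, 1300))  # a 50-point lead
```

Under this model, only the rating difference matters: adding the same constant to every rating leaves all expected scores unchanged.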
Attribution: Artificial Analysis (artificialanalysis.ai). Redistribution requires an AA commercial license.
Benchmark concepts
Plain-language explainers of key benchmarking terms.