AI Model Rankings

Every way to rank AI models: best Effective Value, cheapest per correct answer, fastest, most robust, best for agentic work, open-weights, per-domain, and per-benchmark leaderboards — each with an explainer of what the metric means in production.

By value

Best AI models by Effective Value (𝕍)

Ranked by Effective Value (𝕍), the single metric that fuses accuracy, speed, cost, and the exponential error cascade of deep agentic work. 𝕍 rewards models that complete long multi-step chains without human rescue. For production agents (not one-shot chat), 𝕍 is the number that matters most. Scores are indexed so that the top model = 100 (raw 𝕍 spans a range of roughly 2.7M×).
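The indexing step can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the function name `index_to_top`, the model names, and the raw values are all hypothetical, chosen only so that the spread between best and worst is about 2.7M×.

```python
def index_to_top(raw_scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores so that the best model reads exactly 100."""
    if not raw_scores:
        return {}
    top = max(raw_scores.values())
    if top <= 0:
        raise ValueError("top raw score must be positive to index against it")
    return {name: 100.0 * score / top for name, score in raw_scores.items()}


# Hypothetical raw values (assumption): best / worst = 2.7 million.
raw = {"model-a": 5.4e6, "model-b": 2.7e6, "model-c": 2.0}
indexed = index_to_top(raw)

for name, value in sorted(indexed.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.6f}")
```

With a range this wide, models near the bottom land very close to zero on the indexed scale, so the ordering matters more than the indexed gap between low-ranked entries.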

Best AI models for deep agentic chains

Ranked by Effective Value at task depth N=40, that is, long agentic chains where one error compounds. As depth grows, the ranking shifts decisively toward accuracy: fast-but-flawed models collapse while accurate models pull ahead. This is the leaderboard for autonomous, multi-step work. Scores are indexed so that the top model at N=40 = 100.
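Why depth favors accuracy can be seen in a small sketch. It assumes every step succeeds independently with the same probability and that any single failure ends the chain. That is a simplification for illustration, not the leaderboard's published formula, and `chain_success` is a hypothetical helper.

```python
def chain_success(step_accuracy: float, depth: int) -> float:
    """Probability of finishing `depth` steps with no error.

    Assumes independent steps with identical accuracy and no recovery.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_accuracy ** depth


# Illustrative per-step accuracies (assumption, not measured values).
for acc in (0.99, 0.95, 0.90):
    print(f"per-step {acc:.2f} -> full chain at N=40: {chain_success(acc, 40):.3f}")
```

Under these assumptions, 0.99 per step gives roughly a 67% chance of finishing 40 steps, while 0.95 per step gives roughly 13%. A four-point gap in per-step accuracy becomes a fivefold gap in completed chains.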

Most robust AI models under pressure

Ranked by how little a model’s accuracy moves under threatening or emotionally loaded prompts (susceptibility index closest to zero = most robust). A model that caves to pressure is a liability in adversarial or high-stakes settings. See the sycophancy research for the full epistemic-integrity picture.
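One way to read "closest to zero" is sketched below. The exact definition of the susceptibility index is not given here, so this assumes it is the signed difference between accuracy under pressure and baseline accuracy. The function names and the sample numbers are hypothetical.

```python
def susceptibility(baseline_acc: float, pressured_acc: float) -> float:
    """Signed accuracy shift under pressure (assumed definition)."""
    return pressured_acc - baseline_acc


def rank_by_robustness(results: dict[str, tuple[float, float]]) -> list[str]:
    """Order models by |susceptibility|, smallest shift first."""
    return sorted(results, key=lambda name: abs(susceptibility(*results[name])))


# Hypothetical (baseline accuracy, accuracy under pressure) pairs.
results = {
    "model-a": (0.80, 0.70),  # drops 10 points
    "model-b": (0.75, 0.74),  # barely moves
    "model-c": (0.60, 0.65),  # rises 5 points, which still counts as movement
}
ranking = rank_by_robustness(results)
print(ranking)
```

Ranking on the absolute value means a model whose accuracy rises under pressure is also treated as less robust than one that does not move at all, which matches the "closest to zero" wording.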

By efficiency

Most cost-efficient AI models ($ per correct answer)

Ranked by USD per correct answer — true intelligence-per-dollar, not just sticker price. A cheap model that needs many retries is not cheap. Cost-per-correct already amortizes the error rate, so it is the honest cost metric for budget-sensitive workloads.
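The arithmetic behind "a cheap model that needs many retries is not cheap" can be sketched as total spend divided by the number of correct answers. The helper name and the dollar figures are hypothetical, and the leaderboard's exact accounting is not specified here.

```python
import math


def cost_per_correct(total_cost_usd: float, num_correct: int) -> float:
    """USD spent per correct answer; infinite if nothing was correct."""
    if num_correct <= 0:
        return math.inf
    return total_cost_usd / num_correct


# Hypothetical run of 1,000 questions on two models.
cheap = cost_per_correct(total_cost_usd=1.00, num_correct=200)   # low price, 20% accuracy
pricey = cost_per_correct(total_cost_usd=3.00, num_correct=900)  # 3x price, 90% accuracy

print(f"cheap model:  ${cheap:.5f} per correct answer")
print(f"pricey model: ${pricey:.5f} per correct answer")
```

In this made-up case the model with the lower sticker price ends up costing more per correct answer, because its spend is spread over far fewer correct results.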

Fastest AI models (seconds per correct answer)

Ranked by wall-clock seconds per correct answer. For interactive and high-throughput workloads, latency per useful result (not raw tokens/sec) is what users feel. Models that generate a lot of reasoning before answering pay for it here.
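A rough sketch of why raw tokens/sec can mislead: it assumes the time per answer is simply output tokens divided by throughput (ignoring time to first token and queueing) and divides by accuracy to get the time per correct answer. All names and numbers are hypothetical.

```python
def seconds_per_correct(tokens_per_answer: float,
                        tokens_per_second: float,
                        accuracy: float) -> float:
    """Expected wall-clock seconds per correct answer (simplified model)."""
    if tokens_per_second <= 0 or accuracy <= 0:
        return float("inf")
    seconds_per_answer = tokens_per_answer / tokens_per_second
    return seconds_per_answer / accuracy


# Hypothetical models: a fast generator that reasons at length,
# and a slower generator that answers briefly.
heavy_thinker = seconds_per_correct(4000, 200.0, 0.80)
brief_answerer = seconds_per_correct(400, 80.0, 0.70)

print(f"heavy thinker:  {heavy_thinker:.2f} s per correct answer")
print(f"brief answerer: {brief_answerer:.2f} s per correct answer")
```

Here the model with 2.5 times the throughput is still the slower one per correct answer, because it emits ten times as many tokens to get there.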

By access

Best open-weights AI models

Open-weight models you can self-host, fine-tune, and run on-prem — ranked by meo accuracy where available. For teams with data-residency, privacy, or cost-control requirements, open weights change the build-vs-buy calculus entirely.

AI models with the longest context windows

Ranked by maximum context window — how much you can put in one prompt (long documents, whole codebases, large retrieval sets). Bigger context unlocks workloads smaller windows cannot attempt, though effective use of long context varies by model.

By reasoning domain

Best AI models for base-rate bias

Ranked by accuracy on the meo “base-rate bias” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for unsatisfiable constraints

Ranked by accuracy on the meo “unsatisfiable constraints” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for perceptual illusions

Ranked by accuracy on the meo “perceptual illusions” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for logic, math & CS

Ranked by accuracy on the meo “logic, math & CS” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for long arithmetic

Ranked by accuracy on the meo “long arithmetic” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for regex automata

Ranked by accuracy on the meo “regex automata” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for tape-machine simulation

Ranked by accuracy on the meo “tape-machine simulation” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for critical thinking (Watson-Glaser)

Ranked by accuracy on the meo “critical thinking (Watson-Glaser)” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for theory of mind

Ranked by accuracy on the meo “theory of mind” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for framework-application bias

Ranked by accuracy on the meo “framework-application bias” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

Best AI models for multi-step state tracking

Ranked by accuracy on the meo “multi-step state tracking” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.

By third-party benchmark

Per-benchmark leaderboards over the full field. Benchmark catalog →

AA Coding Index
AA Intelligence Index
AA Math Index
AIME
AIME 2025
GPQA Diamond
Humanity's Last Exam
IFBench
LCR (long-context reasoning)
LiveCodeBench
MATH-500
MMLU-Pro
SciCode
τ²-bench
Terminal-Bench Hard
Median output throughput (tokens/s)
Median time to first token (s)
LMArena Elo (Chatbot Arena)