AI Model Rankings
Every way to rank AI models: best Effective Value, cheapest per correct answer, fastest, most robust, best for agentic work, open-weights, per-domain, and per-benchmark leaderboards — each with an explainer of what the metric means in production.
By value
Best AI models by Effective Value (𝕍)
Ranked by Effective Value (𝕍), a single metric that fuses accuracy, speed, cost, and the exponential error cascade of deep agentic work. 𝕍 rewards models that complete long multi-step chains without human rescue. For production agents (not one-shot chat), 𝕍 is the number that matters most. Scores are shown indexed so that the top model = 100; raw 𝕍 spans a range of roughly 2.7M×.
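The indexing step can be sketched as follows. This is a minimal illustration, assuming the index is a simple linear rescale against the top raw score; the function name and the raw values are hypothetical, and the formula for 𝕍 itself is not given here.

```python
def index_to_top(raw_scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores linearly so the highest-scoring model reads 100.

    Assumption: the published index is a plain ratio to the top raw score.
    """
    top = max(raw_scores.values())
    if top <= 0:
        raise ValueError("top score must be positive to index against it")
    return {name: 100.0 * score / top for name, score in raw_scores.items()}


# Hypothetical raw values, chosen only to illustrate a very wide raw range.
raw = {"model-a": 2_700_000.0, "model-b": 54_000.0, "model-c": 1.0}
indexed = index_to_top(raw)

for name, value in sorted(indexed.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.6f}")
```

With a raw range this wide, models far from the top land very close to zero on a linear index, so small indexed values should be read as ratios rather than as absolute quality scores.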
Best AI models for deep agentic chains
Ranked by Effective Value at task depth N=40: long agentic chains where a single error compounds through every later step. As depth grows the ranking shifts decisively toward accuracy: fast-but-flawed models collapse while accurate models pull ahead. This is the leaderboard for autonomous, multi-step work. Scores are shown indexed so that the top model at N=40 = 100.
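The compounding effect on its own can be sketched with a simple model. This is not the 𝕍 formula, which is not given here; it assumes every step succeeds independently with the same per-step accuracy and that one failed step fails the whole chain. The function name and the accuracy values are hypothetical.

```python
def chain_success_probability(step_accuracy: float, depth: int) -> float:
    """Probability that all `depth` steps succeed.

    Assumptions: steps are independent, share one per-step accuracy,
    and there is no recovery after a failed step.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_accuracy ** depth


# Hypothetical per-step accuracies for a fast-but-flawed and an accurate model.
for depth in (1, 10, 40):
    flawed = chain_success_probability(0.95, depth)
    accurate = chain_success_probability(0.99, depth)
    print(f"N={depth:>2}  flawed={flawed:.3f}  accurate={accurate:.3f}")
```

Under these assumptions, a four-point gap in per-step accuracy (0.95 versus 0.99) becomes roughly 13% versus 67% chain completion at N=40, which is why depth pushes the ranking toward accuracy.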
Most robust AI models under pressure
Ranked by how little a model’s accuracy moves under threatening or emotionally loaded prompts (susceptibility index closest to zero = most robust). A model that caves to pressure is a liability in adversarial or high-stakes settings. See the sycophancy research for the full epistemic-integrity picture.
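A ranking of this kind might be computed as below. The exact definition of the susceptibility index is not given here, so this sketch assumes it is the signed difference between accuracy under pressure and baseline accuracy; the function names and the numbers are hypothetical.

```python
def susceptibility_index(baseline_acc: float, pressured_acc: float) -> float:
    """Signed accuracy shift under pressure (an assumed definition)."""
    return pressured_acc - baseline_acc


def rank_by_robustness(
    results: dict[str, tuple[float, float]],
) -> list[tuple[str, float]]:
    """Sort models so the index closest to zero comes first."""
    scored = [
        (name, susceptibility_index(baseline, pressured))
        for name, (baseline, pressured) in results.items()
    ]
    return sorted(scored, key=lambda item: abs(item[1]))


# Hypothetical (baseline accuracy, accuracy under pressure) pairs.
results = {
    "model-a": (0.80, 0.78),
    "model-b": (0.90, 0.70),
    "model-c": (0.60, 0.65),
}
ranking = rank_by_robustness(results)

for name, index in ranking:
    print(f"{name}: {index:+.2f}")
```

Sorting by absolute value treats a model that improves under pressure as less robust than one that does not move, which matches "closest to zero" but is a design choice of this sketch.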
By efficiency
Most cost-efficient AI models ($ per correct answer)
Ranked by USD per correct answer — true intelligence-per-dollar, not just sticker price. A cheap model that needs many retries is not cheap. Cost-per-correct already amortizes the error rate, so it is the honest cost metric for budget-sensitive workloads.
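The retry argument can be made concrete with a small sketch. It assumes every attempt is billed at the same price and that retries are independent, so the expected number of attempts per correct answer is 1 / accuracy. The function name and the prices are hypothetical.

```python
def cost_per_correct(cost_per_attempt_usd: float, accuracy: float) -> float:
    """Expected USD spent per correct answer.

    Assumptions: each attempt costs the same, and attempts are independent,
    so on average 1 / accuracy attempts are needed per correct answer.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    if cost_per_attempt_usd < 0:
        raise ValueError("cost must be non-negative")
    return cost_per_attempt_usd / accuracy


# Hypothetical models: a low sticker price with low accuracy,
# and a higher sticker price with high accuracy.
cheap = cost_per_correct(cost_per_attempt_usd=0.01, accuracy=0.2)
premium = cost_per_correct(cost_per_attempt_usd=0.03, accuracy=0.9)

print(f"cheap:   ${cheap:.4f} per correct answer")
print(f"premium: ${premium:.4f} per correct answer")
```

In this example the model with the lower sticker price ends up more expensive per correct answer, because four out of five of its attempts are paid for and discarded.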
Fastest AI models (seconds per correct answer)
Ranked by wall-clock seconds per correct answer. For interactive and high-throughput workloads, latency per useful result, not raw tokens/sec, is what users feel. Models that spend many reasoning tokens to reach a correct answer pay for it here.
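The gap between raw throughput and latency per useful result can be sketched as follows. This assumes latency is dominated by generation time (network and queueing delays are ignored) and that failed attempts take as long as successful ones; the function name and all figures are hypothetical.

```python
def seconds_per_correct(
    output_tokens: int, tokens_per_second: float, accuracy: float
) -> float:
    """Expected wall-clock seconds spent per correct answer.

    Assumptions: latency is output_tokens / tokens_per_second, and every
    attempt (correct or not) takes the same time.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    latency_seconds = output_tokens / tokens_per_second
    return latency_seconds / accuracy


# Hypothetical models: one streams quickly but emits a long reasoning trace,
# the other streams slowly but answers briefly.
heavy_thinker = seconds_per_correct(output_tokens=8000, tokens_per_second=200.0, accuracy=0.9)
direct_answerer = seconds_per_correct(output_tokens=500, tokens_per_second=50.0, accuracy=0.8)

print(f"heavy thinker:   {heavy_thinker:.1f} s per correct answer")
print(f"direct answerer: {direct_answerer:.1f} s per correct answer")
```

Here the model with four times the tokens/sec is still slower per correct answer, because its longer output outweighs both its throughput and its accuracy advantage.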
By access
Best open-weights AI models
Open-weights models you can self-host, fine-tune, and run on-prem, ranked by meo accuracy where available. For teams with data-residency, privacy, or cost-control requirements, open weights change the build-vs-buy calculus entirely.
AI models with the longest context windows
Ranked by maximum context window — how much you can put in one prompt (long documents, whole codebases, large retrieval sets). Bigger context unlocks workloads smaller windows cannot attempt, though effective use of long context varies by model.
By reasoning domain
Best AI models for base-rate bias
Ranked by accuracy on the meo “base-rate bias” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for unsatisfiable constraints
Ranked by accuracy on the meo “unsatisfiable constraints” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for perceptual illusions
Ranked by accuracy on the meo “perceptual illusions” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for logic, math & CS
Ranked by accuracy on the meo “logic, math & CS” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for long arithmetic
Ranked by accuracy on the meo “long arithmetic” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for regex automata
Ranked by accuracy on the meo “regex automata” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for tape-machine simulation
Ranked by accuracy on the meo “tape-machine simulation” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for critical thinking (Watson-Glaser)
Ranked by accuracy on the meo “critical thinking (Watson-Glaser)” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for theory of mind
Ranked by accuracy on the meo “theory of mind” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for framework-application bias
Ranked by accuracy on the meo “framework-application bias” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
Best AI models for multi-step state tracking
Ranked by accuracy on the meo “multi-step state tracking” domain — one of eleven un-leaked, contamination-resistant domains. Per-domain strength reveals where a model genuinely reasons versus where it pattern-matches; the leaders differ sharply by domain.
By third-party benchmark
Per-benchmark leaderboards over the full field. See the benchmark catalog.