Methodology & Effective Value (π)
How meoadvisors scores AI models: an un-leaked private holdout, objective-first grading with a bias-controlled multi-lab jury, generator-as-oracle domains, and the Effective Value (π) metric whose ranking inverts with task depth. Full transparency on N/τ/δ, domains, and novelty vs public leaderboards.
as of 2026-06-08 · methodology v4
The Effective Value (π) metric
Accuracy answers “how often is the model right?” Deployment hinges on a harder question: how much value does a model deliver per unit of money and time, given that a long autonomous task fails entirely if any step fails? Effective Value is our answer:
π = v·(1−E)^N / (C_f + τ·t_base·δ^(E·N))
- Numerator: velocity × chain success. (1−E)^N is the probability of completing an N-step chain without human rescue, where E is the per-step error rate. At E=0.1 a single step succeeds 90% of the time, but a 10-step chain succeeds only ≈35% of the time, and a 40-step chain ≈1.5%. Velocity v (tokens/sec) rewards throughput, but only throughput that reaches the finish line.
- Denominator: money plus friction-amplified time. Cost C_f plus time, where time is amplified by a compounding friction term δ^(E·N) (δ>1): deeper tasks with higher error rates lose more time to debugging and hallucination loops. A flawed model is penalized twice: chain success falls and debugging friction rises.
- The τ ≫ C_f thesis. For autonomous workflows the cost of time far exceeds the cost of money: an hour of a stalled agent dwarfs a few cents of tokens. With per-item costs in cents and times in tens of seconds, the time term dominates C_f even at τ=1 (our default).
- Defaults. N=10 (a moderate agentic chain), τ=1, δ=1.5 (≈50% compounding friction per error-step). These are scenario knobs, not constants; the robust finding is how the ranking moves as N sweeps.
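
The arithmetic in the bullets above can be checked with a minimal Python sketch. The function and parameter names (`chain_success`, `effective_value`, `tau`, `delta`, `t_base`) are illustrative choices, not part of any published meo code, and the units (tokens/sec for `v`, dollars for `cost`, seconds for `t_base`) are assumptions.

```python
def chain_success(error_rate: float, depth: int) -> float:
    """Probability that all `depth` steps succeed: (1 - E)^N."""
    return (1.0 - error_rate) ** depth


def effective_value(v: float, error_rate: float, depth: int,
                    cost: float, t_base: float,
                    tau: float = 1.0, delta: float = 1.5) -> float:
    """pi = v * (1 - E)^N / (C_f + tau * t_base * delta^(E * N)).

    Defaults for tau and delta follow the stated scenario defaults;
    every other input is supplied by the caller.
    """
    friction = delta ** (error_rate * depth)  # compounding friction term
    return v * chain_success(error_rate, depth) / (cost + tau * t_base * friction)


# Chain success at E = 0.1 for the three depths quoted in the text.
for n in (1, 10, 40):
    print(n, round(chain_success(0.1, n), 3))
# 1 0.9
# 10 0.349
# 40 0.015
```

With `error_rate=0` both the chain term and the friction term equal 1, so the expression reduces to `v / (cost + tau * t_base)`, which is a quick sanity check on any implementation.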
Sensitivity: the rank inversion with depth
The decisive consequence of the formula is a rank inversion with task depth, computed live from the current board. At shallow depth raw speed wins; as the chain deepens the error-cascade makes accuracy decisive:
| Model | Accuracy | π-rank (N=1) | π-rank (N=10) | π-rank (N=40) |
|---|---|---|---|---|
| OpenAI: GPT-5.5 | 73.2% | 6 | 2 | 1 |
| Anthropic: Claude Opus 4.8 | 70.8% | 3 | 1 | 2 |
| xAI: Grok 4.3 | 56.6% | 1 | 3 | 5 |
| Google: Gemini 3.5 Flash | 60.0% | 2 | 4 | 3 |
| DeepSeek: DeepSeek V4 Flash | 56.2% | 22 | 11 | 8 |
| Google: Gemini 3.1 Flash Lite | 35.5% | 4 | 18 | 18 |
Fast models top the shallow ranking; accurate flagships dominate deep chains. A cheap-but-accurate model rises sharply with depth while a fast-but-flawed one collapses. Explore it interactively with the N/τ/δ sliders on the dashboard.
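
The inversion can be reproduced with a small depth sweep. The sketch below uses two hypothetical models whose speeds, error rates, costs, and times are invented for illustration; none of the numbers come from the board, and the helper names are assumptions.

```python
def effective_value(v, error_rate, depth, cost, t_base, tau=1.0, delta=1.5):
    """pi = v * (1 - E)^N / (C_f + tau * t_base * delta^(E * N))."""
    chain = (1.0 - error_rate) ** depth
    friction = delta ** (error_rate * depth)
    return v * chain / (cost + tau * t_base * friction)


# Hypothetical models: invented numbers, NOT taken from the leaderboard.
MODELS = {
    "fast_flawed": dict(v=200.0, error_rate=0.45, cost=0.01, t_base=10.0),
    "slow_accurate": dict(v=60.0, error_rate=0.27, cost=0.05, t_base=30.0),
}


def ranking(depth):
    """Model names sorted by pi at the given depth, best first."""
    scores = {name: effective_value(depth=depth, **m) for name, m in MODELS.items()}
    return sorted(scores, key=scores.get, reverse=True)


for depth in (1, 10, 40):
    print(depth, ranking(depth))
```

Under these made-up inputs the fast model leads at N=1 and the accurate model leads at N=10 and N=40, which is the same qualitative pattern as the table above.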
What the cross-model statistics show
- More “thinking” predicts less accuracy, not more (ρ=−0.54, p=0.004). Token volume is a struggle signal, not a capability signal: the strongest models are the most concise.
- Price buys accuracy weakly, with steep diminishing returns (ρ=0.58): a 12× cheaper model lands within a few points of mid-priced flagships. Cost-per-correct, not token price, is the decision-relevant quantity.
- π is accuracy-anchored but efficiency-adjusted (ρ=0.91 with accuracy, r=0.61): neither a relabeling of accuracy nor independent of it.
- At moderate depth, pricier models score higher π: the chain-success term dominates, so “expensive implies low value” is false for deep agentic tasks and true only for shallow ones.
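
One way to see why a rank statistic and a linear statistic can disagree, as in the π-versus-accuracy bullet, is to compute both on the same data. The sketch below assumes the two reported statistics are a Spearman rank correlation and a Pearson linear correlation; that reading, the function names, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: perfectly monotone but not linear.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(spearman(x, y), round(pearson(x, y), 3))
```

On this toy data the rank correlation is 1 while the linear correlation is below 1: a monotone but non-linear relationship keeps the ordering intact while weakening the linear fit.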
How we keep the benchmark un-leaked
- Private holdout. The holdout is never served; a tiny public sample is published only for illustration and labelled “assume contaminated.” No grading is ever done against the public sample.
- Server-side answers. Ground truth and rubrics live in a physically separate store, never serialized into any export or page.
- Canary tagging + leak tripwire. Every item carries a namespaced canary; a log-probability tripwire flags contamination (best-effort, since hosted models rarely expose logprobs).
- Rotation. Each cycle retires the oldest/easiest items. For parametric and generator-as-oracle domains, a new seed yields a fresh un-leakable item with the same exact ground-truth formula.
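
The rotation idea for generator-as-oracle domains can be sketched as a seeded item generator whose ground truth is computed rather than stored. This is an illustrative toy for a long-arithmetic style item, not the benchmark's actual generator; the function name, prompt wording, and default sizes are assumptions.

```python
import random


def make_long_arithmetic_item(seed: int, operands: int = 3, digits: int = 30):
    """Return (prompt, answer) for a seeded exact-addition item.

    The answer is computed by Python's arbitrary-precision integers,
    so no answer key has to be stored alongside the prompt.
    """
    rng = random.Random(seed)  # private RNG: the same seed gives the same item
    nums = [rng.randrange(10 ** (digits - 1), 10 ** digits) for _ in range(operands)]
    prompt = "Compute exactly: " + " + ".join(str(n) for n in nums)
    answer = str(sum(nums))
    return prompt, answer


prompt, answer = make_long_arithmetic_item(seed=2026)
print(prompt)
print(answer)
```

Rotating the seed produces a new item with the same ground-truth formula, while re-running a retired seed reproduces the old item exactly for audit.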
Objective-first grading + a bias-controlled jury
Wherever a ground truth exists we grade objectively (an atomic exact/structured check plus an LLM-equivalence fallback to avoid prose false-negatives). Only for genuinely open-ended prose do we use a multi-lab jury, with four load-bearing rules: one model per lab (disjoint families); independent multi-call scoring aggregated by median/majority with randomized criterion order; never judge your own lab; and per-modality filtering (vision items judged only by vision-capable jurors). We deliberately do not use an answer-fusion router as the verdict mechanism: bias-controlled scoring is not the same as synthesizing one answer.
The eleven domains
Three modalities; seven authored-with-verification, four generator-as-oracle (an embedded solver computes the ground truth, so there is no answer key to leak and the supply of parametric items is unlimited).
| Domain | Modality | What it probes |
|---|---|---|
| illusions | visual | Perceptual illusions as deterministic SVG with exact measurements: does the model report the percept or the measured truth? |
| logic_math_cs | text | Reworded lateral / CS / math reasoning stumpers with a single defensible answer. |
| framework_bias | text | Does the model apply a SPECIFIED framework rather than its data-driven prior (instruction-following under bias)? |
| base_bias | text | Five-category round-robin probe of base-rate / prior bias, with an inverted suspect-truth guard. |
| watson_glaser | text | Five-way critical-thinking inference (True / Probably True / Insufficient Data / Probably False / False). |
| theory_of_mind | text | Nested false-belief reasoning. |
| state_tracking | text | Clue / scheduling / spatial puzzles with an internal validation key and "no consistent solution" impossible items. |
| long_arithmetic (oracle) | text | Multi-operand / large-digit exact arithmetic (BigInt); probes carry-chain drift in the middle digits. |
| regex_automaton (oracle) | text | Regular-language membership over {a,b,c,d}, simulated by an embedded NFA engine. |
| tape_machine (oracle) | text | Step-by-step simulation of a tiny register VM with per-item randomized opcode mnemonics (nothing memorizable). |
| csp_unsat (oracle) | text | Zebra-style constraint puzzles; half are deliberately unsatisfiable to probe the failure to detect UNSAT. |
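
As an illustration of how an embedded solver can act as the oracle for a regular-language membership item, here is a minimal NFA simulation. The automaton below (strings over {a,b,c,d} that end in "ab") and all names are hypothetical, chosen only to show the technique; epsilon transitions are not supported in this sketch.

```python
def nfa_accepts(transitions, start, accepting, word):
    """Simulate an NFA by tracking the set of currently reachable states."""
    current = {start}
    for symbol in word:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), ())
        }
        if not current:  # no live state left: reject early
            return False
    return bool(current & accepting)


# Hypothetical automaton: accepts words over {a,b,c,d} ending in "ab".
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (0, "c"): {0},
    (0, "d"): {0},
    (1, "b"): {2},
}
ACCEPTING = {2}

for word in ("cab", "abd", "dabab", ""):
    print(repr(word), nfa_accepts(TRANSITIONS, 0, ACCEPTING, word))
```

Because the label comes from running the automaton, any randomly generated word has a computed ground truth with no stored answer key.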
Every model runs under an identical protocol: maximum reasoning effort, temperature 1, no web search, one isolated context per item, on a roster pinned by exact provider slug.
Novelty vs existing leaderboards
Crowd arenas (LMArena) are contamination-resistant but reward style and are gameable. Independent aggregators (Artificial Analysis, Epoch) run carefully but inherit the contamination of public benchmarks. Contamination-resistant suites (LiveBench, LiveCodeBench) use rotation + objective scoring; we push further with generator-as-oracle items that are un-leakable by construction and a fully private holdout. To our knowledge the combination of a private multi-modal holdout, generator-as-oracle infinite items, a bias-controlled multi-lab jury, and a real-world-efficacy metric with a demonstrated depth-driven rank inversion is novel.
Resources
- Paper (preprint): Zenodo 10.5281/zenodo.20586608
- Dataset (CC BY 4.0): Zenodo 10.5281/zenodo.20586610 · Hugging Face · Kaggle
- Research deep-dives: Effective Value · Sycophancy · Multi-LLM-as-judge · Cheap-ensemble fusion
FAQ
Why is the meo benchmark "un-leaked"?
The holdout is never served. Ground-truth answers live server-side and are never serialized into any export or page. Four of the eleven domains are generator-as-oracle: a solver computes the answer, so there is no fixed answer key to leak and a fresh seed yields unlimited new items with provably-correct labels.
What is Effective Value (π)?
π = v·(1−E)^N / (C_f + τ·t_base·δ^(E·N)) is a single metric fusing speed, accuracy, cost, and the exponential error-cascade of deep agentic work. It encodes that error is penalized twice (chain success falls and debugging friction rises) and that for autonomous workflows the cost of time dominates the cost of money.
Why does the ranking change with task depth?
At shallow depth (N=1) π rewards raw speed; as the chain deepens the (1−E)^N term makes accuracy decisive. Fast-but-flawed models collapse while accurate models pull ahead: the right model for a one-shot autocomplete is often the wrong model for a long autonomous job.
How is meo different from Artificial Analysis or LMArena?
Crowd arenas (LMArena) reward style and are gameable; aggregators (Artificial Analysis, Epoch) run public benchmarks that inherit contamination. meo grades an un-leaked private holdout objective-first, judges open-ended prose with a bias-controlled multi-lab jury, and adds the real-world-efficacy π metric. We aggregate AA/LMArena as attributed secondary signals; we complement, not replace, them.