Methodology & Effective Value (𝕍)

How meoadvisors scores AI models: an un-leaked private holdout, objective-first grading with a bias-controlled multi-lab jury, generator-as-oracle domains, and the Effective Value (𝕍) metric whose ranking inverts with task depth. Full transparency on N/ω/δ, domains, and novelty vs public leaderboards.

as of 2026-06-08 · methodology v4

The Effective Value (𝕍) metric

Accuracy answers “how often is the model right?” Deployment hinges on a harder question: how much value does a model deliver per unit of money and time, given that a long autonomous task fails entirely if any step fails? Effective Value is our answer:

𝕍 = v · (1 − E)^N / ( C_f + ω · t_base · δ^(E·N) )

Numerator: velocity × chain success. With E the per-step error rate, (1−E)^N is the probability of completing an N-step chain without human rescue. At E=0.1 a single step succeeds 90% of the time, but a 10-step chain succeeds only ≈35% of the time and a 40-step chain ≈1.5%. Velocity v (tokens/sec) rewards throughput, but only throughput that reaches the finish line.
Denominator: money plus friction-amplified time. The denominator adds cost C_f to a time term ω·t_base, and that time term is amplified by a compounding friction factor δ^(E·N) (δ>1): deeper tasks with higher error rates lose more time to debugging and hallucination loops. A flawed model is therefore penalized twice, once through lower chain success in the numerator and again through higher friction in the denominator.
The ω ≫ C_f thesis. For autonomous workflows the cost of time far exceeds the cost of money — an hour of a stalled agent dwarfs a few cents of tokens. With per-item costs in cents and times in tens of seconds, the time term dominates C_f even at ω=1 (our default).
Defaults. N=10 (a moderate agentic chain), ω=1, δ=1.5 (≈50% compounding friction per error-step). These are scenario knobs, not constants — the robust finding is how the ranking moves as N sweeps.
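
The formula can be evaluated directly. The sketch below is a minimal Python rendering under stated assumptions: the function and parameter names are illustrative, C_f is taken to be in dollars per item and t_base in seconds (the page gives magnitudes, not units), and the example model is hypothetical rather than taken from the board.

```python
def chain_success(E: float, N: int) -> float:
    """Probability that all N steps succeed when each step fails with rate E."""
    return (1.0 - E) ** N


def effective_value(v: float, E: float, C_f: float, t_base: float,
                    N: int = 10, omega: float = 1.0, delta: float = 1.5) -> float:
    """V = v * (1 - E)^N / (C_f + omega * t_base * delta^(E * N))."""
    numerator = v * chain_success(E, N)
    friction = delta ** (E * N)            # compounding friction, delta > 1
    denominator = C_f + omega * t_base * friction
    return numerator / denominator


# Chain success at E = 0.1, matching the figures quoted above.
print(round(chain_success(0.1, 10), 3))    # 0.349
print(round(chain_success(0.1, 40), 3))    # 0.015

# Hypothetical model (assumed units): 100 tokens/sec, 10% step error,
# $0.02 and 30 s per item, evaluated at the default N, omega, and delta.
print(effective_value(v=100.0, E=0.1, C_f=0.02, t_base=30.0))
```

With E=0 both the chain-success and friction terms equal 1, so the expression reduces to v / (C_f + ω·t_base), which is a quick sanity check on any implementation.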

Sensitivity: the rank inversion with depth

The decisive consequence of the formula is a rank inversion with task depth, computed live from the current board. At shallow depth raw speed wins; as the chain deepens the error-cascade makes accuracy decisive:

Model	Accuracy	𝕍-rank (N=1)	𝕍-rank (N=10)	𝕍-rank (N=40)
OpenAI: GPT-5.5	73.2%	6	2	1
Anthropic: Claude Opus 4.8	70.8%	3	1	2
xAI: Grok 4.3	56.6%	1	3	5
Google: Gemini 3.5 Flash	60.0%	2	4	3
DeepSeek: DeepSeek V4 Flash	56.2%	22	11	8
Google: Gemini 3.1 Flash Lite	35.5%	4	18	18

Fast models top the shallow ranking; accurate flagships dominate deep chains. A cheap-but-accurate model rises sharply with depth while a fast-but-flawed one collapses. Explore it interactively with the N/ω/δ sliders on the dashboard.
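
The inversion can be reproduced with two made-up models. This is a sketch only: the names `fast_flawed` and `slow_accurate` and all of their parameters are hypothetical, chosen to make the effect visible, and are not measurements from the board.

```python
def effective_value(v, E, C_f, t_base, N=10, omega=1.0, delta=1.5):
    """V = v * (1 - E)^N / (C_f + omega * t_base * delta^(E * N))."""
    return v * (1.0 - E) ** N / (C_f + omega * t_base * delta ** (E * N))


# Hypothetical inputs: a fast, cheap, error-prone model and a slower,
# pricier, more accurate one (v in tokens/sec, t_base in seconds assumed).
MODELS = {
    "fast_flawed":   dict(v=200.0, E=0.40, C_f=0.01, t_base=10.0),
    "slow_accurate": dict(v=80.0,  E=0.10, C_f=0.05, t_base=30.0),
}


def rank_at_depth(models, N):
    """Return model names ordered from highest to lowest V at chain depth N."""
    scores = {name: effective_value(N=N, **params) for name, params in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


for N in (1, 10, 40):
    print(N, rank_at_depth(MODELS, N))
# 1 ['fast_flawed', 'slow_accurate']
# 10 ['slow_accurate', 'fast_flawed']
# 40 ['slow_accurate', 'fast_flawed']
```

At N=1 the faster model leads because chain success barely matters; by N=10 its 0.6^10 chain success has fallen below 1% and the ordering flips.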

What the cross-model statistics show

More “thinking” predicts less accuracy, not more (ρ=−0.54, p=0.004). Token volume is a struggle signal, not a capability signal — the strongest models are the most concise.
Price buys accuracy weakly, with steep diminishing returns (ρ=0.58): a 12× cheaper model lands within a few points of mid-priced flagships. Cost-per-correct, not token price, is the decision-relevant quantity.
𝕍 is accuracy-anchored but efficiency-adjusted (ρ=0.91 with accuracy, r=0.61) — neither a relabeling of accuracy nor independent of it.
At moderate depth, pricier models score higher 𝕍 — the chain-success term dominates, so “expensive implies low value” is false for deep agentic tasks and true only for shallow ones.
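
For readers who want to run the same kind of check on their own results, here is a minimal sketch. It assumes that ρ denotes Spearman's rank correlation (the page does not say so explicitly) and that cost-per-correct means total spend divided by the number of correct items; the helper names and the sample data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cost_per_correct(total_cost, n_correct):
    """Total spend divided by the number of correctly answered items."""
    return total_cost / n_correct


# Hypothetical per-model data: mean output tokens per item and accuracy.
tokens = [1200, 3400, 800, 5600, 2100]
accuracy = [0.71, 0.58, 0.73, 0.41, 0.60]
print(spearman(tokens, accuracy))
print(cost_per_correct(total_cost=2.0, n_correct=50))
```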

How we keep the benchmark un-leaked

Private holdout. The holdout is never served; a tiny public sample is published only for illustration and labelled “assume contaminated.” No grading is ever done against the public sample.
Server-side answers. Ground truth and rubrics live in a physically separate store, never serialized into any export or page.
Canary tagging + leak tripwire. Every item carries a namespaced canary; a log-probability tripwire flags contamination (best-effort, since hosted models rarely expose logprobs).
Rotation. Each cycle retires the oldest/easiest items. For parametric and generator-as-oracle domains, a new seed yields a fresh un-leakable item with the same exact ground-truth formula.
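
Seed-based rotation with a separate answer store can be sketched as follows. This is an illustration, not the benchmark's generator: the namespace string, field names, and item format are assumptions, and Python's arbitrary-precision integers stand in for the embedded solver.

```python
import hashlib
import random

NAMESPACE = "example-bench"  # hypothetical canary namespace


def make_item(seed: int, operands: int = 4, digits: int = 30) -> dict:
    """Build one long-arithmetic item from a seed; the answer is computed, not stored."""
    rng = random.Random(seed)  # same seed always yields the same item
    nums = [rng.randrange(10 ** (digits - 1), 10 ** digits) for _ in range(operands)]
    tag = hashlib.sha256(f"{NAMESPACE}:{seed}".encode()).hexdigest()[:16]
    public = {
        "id": f"long_arithmetic-{seed}",
        "canary": f"{NAMESPACE}:{tag}",
        "prompt": "Compute exactly: " + " + ".join(str(n) for n in nums),
    }
    # Ground truth goes into a separate record that is never exported.
    private = {"id": public["id"], "answer": str(sum(nums))}
    return {"public": public, "private": private}


item = make_item(seed=7)
print(item["public"]["prompt"])
print(item["public"]["canary"])
```

Rotating to a new seed produces a new prompt with a freshly computed answer, while only the `public` record would ever be shown to a model.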

Objective-first grading + a bias-controlled jury

Wherever a ground truth exists we grade objectively: an atomic exact/structured check, plus an LLM-equivalence fallback so that a correct answer phrased in prose is not marked wrong. Only genuinely open-ended prose goes to a multi-lab jury, which follows four load-bearing rules: one model per lab (disjoint families); independent multi-call scoring, aggregated by median/majority with randomized criterion order; no juror scores a model from its own lab; and per-modality filtering (vision items are judged only by vision-capable jurors). We deliberately do not use an answer-fusion router as the verdict mechanism, because bias-controlled scoring is not the same as synthesizing one answer.
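
The four jury rules can be expressed compactly. The sketch below is one possible reading, not the production pipeline: the juror records, the `score_fn` callable (a stand-in for a real model call), the number of calls, and the choice of a median-of-medians aggregate are all assumptions made for illustration.

```python
import random
from statistics import median


def eligible_jurors(jurors, candidate_lab, modality):
    """Apply the filters: not the candidate's lab, right modality, one model per lab."""
    chosen, seen_labs = [], set()
    for juror in jurors:
        if juror["lab"] == candidate_lab:
            continue                      # never judge your own lab
        if modality not in juror["modalities"]:
            continue                      # per-modality filtering
        if juror["lab"] in seen_labs:
            continue                      # one model per lab
        seen_labs.add(juror["lab"])
        chosen.append(juror)
    return chosen


def jury_score(jurors, candidate_lab, modality, criteria, score_fn, calls=3, seed=0):
    """Median over jurors of each juror's median over independent calls."""
    rng = random.Random(seed)
    panel = eligible_jurors(jurors, candidate_lab, modality)
    if not panel:
        raise ValueError("no eligible jurors for this item")
    per_juror = []
    for juror in panel:
        call_scores = []
        for _ in range(calls):
            order = list(criteria)
            rng.shuffle(order)            # randomized criterion order per call
            call_scores.append(score_fn(juror, order))
        per_juror.append(median(call_scores))
    return median(per_juror)


# Hypothetical roster and a stub scorer that returns a fixed score per juror.
JURORS = [
    {"name": "a1", "lab": "labA", "modalities": {"text"}},
    {"name": "a2", "lab": "labA", "modalities": {"text", "visual"}},
    {"name": "b1", "lab": "labB", "modalities": {"text", "visual"}},
    {"name": "c1", "lab": "labC", "modalities": {"text"}},
    {"name": "d1", "lab": "labD", "modalities": {"text", "visual"}},
]
STUB = {"a1": 0.6, "a2": 0.7, "b1": 0.9, "c1": 0.8, "d1": 1.0}


def stub_score(juror, criteria_order):
    return STUB[juror["name"]]


print(jury_score(JURORS, "labB", "text", ["accuracy", "clarity"], stub_score))  # 0.8
```

In the example, a text answer from `labB` is scored by `a1`, `c1`, and `d1`; `b1` is excluded as the candidate's own lab and `a2` as a second model from `labA`.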

The eleven domains

Three modalities; seven authored-with-verification, four generator-as-oracle (an embedded solver computes the ground truth — no answer key to leak, infinite parametric items).

Domain	Modality	What it probes
illusions	visual	Perceptual illusions as deterministic SVG with exact measurements — does the model report the percept or the measured truth?
logic_math_cs	text	Reworded lateral / CS / math reasoning stumpers with a single defensible answer.
framework_bias	text	Does the model apply a SPECIFIED framework rather than its data-driven prior (instruction-following under bias)?
base_bias	text	Five-category round-robin probe of base-rate / prior bias, with an inverted suspect-truth guard.
watson_glaser	text	Five-way critical-thinking inference (True / Probably True / Insufficient Data / Probably False / False).
theory_of_mind	text	Nested false-belief reasoning.
state_tracking	text	Clue / scheduling / spatial puzzles with an internal validation key and "no consistent solution" impossible items.
long_arithmetic (oracle)	text	Multi-operand / large-digit exact arithmetic (BigInt); targets carry-chain drift in the middle digits.
regex_automaton (oracle)	text	Regular-language membership over {a,b,c,d}, simulated by an embedded NFA engine.
tape_machine (oracle)	text	Step-by-step simulation of a tiny register VM with per-item randomized opcode mnemonics (nothing memorizable).
csp_unsat (oracle)	text	Zebra-style constraint puzzles; half are deliberately unsatisfiable to probe the failure to detect UNSAT.
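
To make "an embedded solver computes the ground truth" concrete, here is a minimal sketch of an NFA membership oracle of the kind the regular-language domain describes. The transition-table encoding, the function name, and the example language (strings ending in "ab") are assumptions for illustration; epsilon transitions are omitted for brevity.

```python
ALPHABET = set("abcd")


def nfa_accepts(transitions, start, accepting, word):
    """Simulate an NFA by tracking the set of states reachable after each symbol."""
    current = {start}
    for symbol in word:
        if symbol not in ALPHABET:
            raise ValueError(f"symbol {symbol!r} is outside the alphabet")
        nxt = set()
        for state in current:
            nxt |= transitions.get((state, symbol), set())
        current = nxt
        if not current:
            return False                  # no live states left: reject early
    return bool(current & accepting)


# Hypothetical example language: all strings over {a,b,c,d} that end in "ab".
ENDS_WITH_AB = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (0, "c"): {0},
    (0, "d"): {0},
    (1, "b"): {2},
}

for word in ("cab", "abc", "dabab", ""):
    print(repr(word), nfa_accepts(ENDS_WITH_AB, 0, {2}, word))
# 'cab' True
# 'abc' False
# 'dabab' True
# '' False
```

Because the label comes from running the automaton, any randomly generated automaton and string pair has a correct answer without a stored key.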

Every model runs under an identical protocol: maximum reasoning effort, temperature 1, no web search, one isolated context per item, on a roster pinned by exact provider slug.

Novelty vs existing leaderboards

Crowd arenas (LMArena) are contamination-resistant but reward style and are gameable. Independent aggregators (Artificial Analysis, Epoch) run carefully but inherit the contamination of public benchmarks. Contamination-resistant suites (LiveBench, LiveCodeBench) use rotation + objective scoring; we push further with generator-as-oracle items that are un-leakable by construction and a fully private holdout. To our knowledge the combination of a private multi-modal holdout, generator-as-oracle infinite items, a bias-controlled multi-lab jury, and a real-world-efficacy metric with a demonstrated depth-driven rank inversion is novel.

Resources

Paper (preprint): Zenodo 10.5281/zenodo.20586608
Dataset (CC BY 4.0): Zenodo 10.5281/zenodo.20586610 · Hugging Face · Kaggle
Research deep-dives: Effective Value · Sycophancy · Multi-LLM-as-judge · Cheap-ensemble fusion

FAQ

Why is the meo benchmark "un-leaked"?

The holdout is never served. Ground-truth answers live server-side and are never serialized into any export or page. Four of the eleven domains are generator-as-oracle: a solver computes the answer, so there is no fixed answer key to leak and a fresh seed yields unlimited new items with provably-correct labels.

What is Effective Value (𝕍)?

𝕍 = v·(1−E)^N / (C_f + ω·t_base·δ^(E·N)) — a single metric fusing speed, accuracy, cost, and the exponential error-cascade of deep agentic work. It encodes that error is penalized twice (chain success falls and debugging friction rises) and that for autonomous workflows the cost of time dominates the cost of money.

Why does the ranking change with task depth?

At shallow depth (N=1) 𝕍 rewards raw speed; as the chain deepens the (1−E)^N term makes accuracy decisive. Fast-but-flawed models collapse while accurate models pull ahead — the right model for a one-shot autocomplete is often the wrong model for a long autonomous job.

How is meo different from Artificial Analysis or LMArena?

Crowd arenas (LMArena) reward style and are gameable; aggregators (Artificial Analysis, Epoch) run public benchmarks that inherit contamination. meo grades an un-leaked private holdout objective-first, judges open-ended prose with a bias-controlled multi-lab jury, and adds the real-world-efficacy 𝕍 metric. We aggregate AA/LMArena as attributed secondary signals — we complement, not replace, them.