Methodology & Effective Value (𝕍)

How meoadvisors scores AI models: an un-leaked private holdout, objective-first grading with a bias-controlled multi-lab jury, generator-as-oracle domains, and the Effective Value (𝕍) metric whose ranking inverts with task depth. Full transparency on N/ω/δ, domains, and novelty vs public leaderboards.

as of 2026-06-08 · methodology v4

The Effective Value (𝕍) metric

Accuracy answers “how often is the model right?” Deployment hinges on a harder question: how much value does a model deliver per unit of money and time, given that a long autonomous task fails entirely if any step fails? Effective Value is our answer:

𝕍 = v · (1 − E)^N / ( C_f + ω · t_base · δ^(E·N) )
  • Numerator: velocity × chain success. With E the per-step error rate, (1−E)^N is the probability of completing an N-step chain without human rescue. At E=0.1 a single step succeeds 90% of the time, but a 10-step chain only ≈35%, and a 40-step chain ≈1.5%. Velocity v (tokens/sec) rewards throughput, but only throughput that reaches the finish line.
  • Denominator: money plus friction-amplified time. Cost C_f plus time, where the base time t_base is weighted by ω and amplified by a compounding friction term δ^(E·N) (δ>1): deeper tasks with higher error rates lose more time to debugging and hallucination loops. A flawed model is penalized twice: chain success falls and debugging friction rises.
  • The ω ≫ C_f thesis. For autonomous workflows the cost of time far exceeds the cost of money: an hour of a stalled agent dwarfs a few cents of tokens. With per-item costs in cents and times in tens of seconds, the time term dominates C_f even at ω=1 (our default).
  • Defaults. N=10 (a moderate agentic chain), ω=1, δ=1.5 (≈50% compounding friction per error-step). These are scenario knobs, not constants; the robust finding is how the ranking moves as N sweeps.
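As a rough illustration of the formula and defaults above, the following Python sketch evaluates 𝕍 for a single model. The function and argument names are assumptions made for this example, not the site's implementation; only the defaults N=10, ω=1, δ=1.5 come from the text.

```python
def chain_success(error_rate: float, depth: int) -> float:
    """Probability that every one of `depth` steps succeeds: (1 - E)^N."""
    return (1.0 - error_rate) ** depth


def effective_value(
    velocity: float,     # v, tokens/sec
    error_rate: float,   # E, per-step error rate
    cost: float,         # C_f, money per item (illustrative units)
    t_base: float,       # base time per item, in seconds
    depth: int = 10,     # N, chain depth
    omega: float = 1.0,  # weight on time relative to money
    delta: float = 1.5,  # friction base, must be > 1
) -> float:
    """V = v * (1 - E)^N / (C_f + omega * t_base * delta^(E * N))."""
    friction = delta ** (error_rate * depth)
    return velocity * chain_success(error_rate, depth) / (cost + omega * t_base * friction)


# Chain success at E = 0.1 for the depths quoted in the text.
for n in (1, 10, 40):
    print(n, round(chain_success(0.1, n), 4))
```

The loop prints roughly 0.9, 0.3487 and 0.0148, which matches the 90%, ≈35% and ≈1.5% figures quoted for E=0.1.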

Sensitivity: the rank inversion with depth

The decisive consequence of the formula is a rank inversion with task depth, computed live from the current board. At shallow depth raw speed wins; as the chain deepens the error-cascade makes accuracy decisive:

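| Model | Accuracy | 𝕍-rank at N=1 | 𝕍-rank at N=10 | 𝕍-rank at N=40 |
| --- | --- | --- | --- | --- |
| OpenAI: GPT-5.5 | 73.2% | 6 | 2 | 1 |
| Anthropic: Claude Opus 4.8 | 70.8% | 3 | 1 | 2 |
| xAI: Grok 4.3 | 56.6% | 1 | 3 | 5 |
| Google: Gemini 3.5 Flash | 60.0% | 2 | 4 | 3 |
| DeepSeek: DeepSeek V4 Flash | 56.2% | 22 | 11 | 8 |
| Google: Gemini 3.1 Flash Lite | 35.5% | 4 | 18 | 18 |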

Fast models top the shallow ranking; accurate flagships dominate deep chains. A cheap-but-accurate model rises sharply with depth while a fast-but-flawed one collapses. Explore it interactively with the N/ω/δ sliders on the dashboard.
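The inversion can be reproduced with a small depth sweep. The sketch below uses two hypothetical models whose speed, error rate, cost and base time are invented for illustration (they are not rows from the board), with ω=1 and δ=1.5 as in the stated defaults.

```python
def effective_value(v, e, n, cost, t_base, omega=1.0, delta=1.5):
    """V = v * (1 - e)^n / (cost + omega * t_base * delta^(e * n))."""
    return v * (1.0 - e) ** n / (cost + omega * t_base * delta ** (e * n))


# Hypothetical models: (v in tokens/sec, E, C_f, t_base in seconds).
MODELS = {
    "fast_flawed": (200.0, 0.40, 0.005, 10.0),
    "accurate_slow": (60.0, 0.05, 0.05, 30.0),
}


def ranking(n):
    """Model names sorted from highest to lowest V at chain depth n."""
    scores = {
        name: effective_value(v, e, n, cost, t_base)
        for name, (v, e, cost, t_base) in MODELS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


for n in (1, 10, 40):
    print(f"N={n}: {ranking(n)}")
```

With these made-up inputs the fast model leads at N=1 and the accurate model leads at N=10 and N=40. Where the crossover falls depends entirely on the chosen parameters.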

What the cross-model statistics show

  • More “thinking” predicts less accuracy, not more (ρ=−0.54, p=0.004). Token volume is a struggle signal, not a capability signal: the strongest models are the most concise.
  • Price buys accuracy weakly, with steep diminishing returns (ρ=0.58): a 12× cheaper model lands within a few points of mid-priced flagships. Cost-per-correct, not token price, is the decision-relevant quantity.
  • 𝕍 is accuracy-anchored but efficiency-adjusted (ρ=0.91 with accuracy, r=0.61): neither a relabeling of accuracy nor independent of it.
  • At moderate depth, pricier models score higher 𝕍: the chain-success term dominates, so “expensive implies low value” is false for deep agentic tasks and true only for shallow ones.
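To make the cost-per-correct point concrete, here is a minimal sketch. The definition (per-item cost divided by accuracy) is one natural reading of the term and is an assumption, and the prices and accuracies are hypothetical, chosen only to mirror the "12× cheaper, a few points behind" pattern.

```python
def cost_per_correct(cost_per_item: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer: cost per item / accuracy."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_item / accuracy


# Hypothetical numbers: a flagship at 0.12 per item and a model that is 12x cheaper.
flagship = cost_per_correct(0.12, 0.70)
budget = cost_per_correct(0.01, 0.66)
print(f"flagship: {flagship:.4f} per correct answer")
print(f"budget:   {budget:.4f} per correct answer")
```

Under these assumed figures the cheaper model costs far less per correct answer despite its slightly lower accuracy. This view ignores chain depth, which the 𝕍 metric adds.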

How we keep the benchmark un-leaked

  • Private holdout. The holdout is never served; a tiny public sample is published only for illustration and labelled “assume contaminated.” No grading is ever done against the public sample.
  • Server-side answers. Ground truth and rubrics live in a physically separate store, never serialized into any export or page.
  • Canary tagging + leak tripwire. Every item carries a namespaced canary; a log-probability tripwire flags contamination (best-effort, since hosted models rarely expose logprobs).
  • Rotation. Each cycle retires the oldest/easiest items. For parametric and generator-as-oracle domains, a new seed yields a fresh un-leakable item with the same exact ground-truth formula.
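The seed-rotation idea can be sketched for an exact-arithmetic item. Everything here is hypothetical (function names, item fields, operand counts): it shows only how a seeded generator can produce the ground truth itself, and how an export could drop the answer field before anything is published.

```python
import random


def make_arithmetic_item(seed: int, operands: int = 4, digits: int = 30) -> dict:
    """Generate one addition item; the generator itself computes the answer."""
    rng = random.Random(seed)  # seeded, so the same seed reproduces the same item
    numbers = [rng.randrange(10 ** (digits - 1), 10 ** digits) for _ in range(operands)]
    return {
        "seed": seed,
        "prompt": "Compute exactly: " + " + ".join(map(str, numbers)),
        "answer": str(sum(numbers)),  # Python ints are arbitrary precision
    }


def public_view(item: dict) -> dict:
    """Copy of the item with the ground truth removed, for anything served."""
    return {key: value for key, value in item.items() if key != "answer"}


item = make_arithmetic_item(seed=7)
print(public_view(item)["prompt"])
```

Rotating to a new seed yields a different item with the same grading rule, so no stored answer key has to exist outside the generator.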

Objective-first grading + a bias-controlled jury

Wherever a ground truth exists we grade objectively (an atomic exact/structured check plus an LLM-equivalence fallback to avoid prose false-negatives). Only for genuinely open-ended prose do we use a multi-lab jury, with four load-bearing rules: one model per lab (disjoint families); independent multi-call scoring aggregated by median/majority with randomized criterion order; never judge your own lab; and per-modality filtering (vision items judged only by vision-capable jurors). We deliberately do not use an answer-fusion router as the verdict mechanism: bias-controlled scoring is not the same as synthesizing one answer.
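The four jury rules can be sketched as follows. The `Juror` type, the function names and the median-of-medians aggregation are assumptions for illustration; the text specifies "median/majority" without giving the exact procedure.

```python
import random
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Juror:
    lab: str
    model: str
    vision: bool


def eligible_jurors(pool, candidate_lab, needs_vision):
    """One juror per lab, never the candidate's own lab, vision-capable if required."""
    chosen, seen_labs = [], set()
    for juror in pool:
        if juror.lab == candidate_lab or juror.lab in seen_labs:
            continue
        if needs_vision and not juror.vision:
            continue
        seen_labs.add(juror.lab)
        chosen.append(juror)
    return chosen


def shuffled_criteria(criteria, rng):
    """Return the rubric criteria in a random order for one scoring call."""
    return rng.sample(criteria, len(criteria))


def jury_score(calls_by_juror):
    """Median over jurors of each juror's median across independent calls."""
    return median(median(calls) for calls in calls_by_juror.values())


pool = [
    Juror("lab_a", "a-1", True),
    Juror("lab_a", "a-2", True),
    Juror("lab_b", "b-1", False),
    Juror("lab_c", "c-1", True),
]
print([j.model for j in eligible_jurors(pool, candidate_lab="lab_c", needs_vision=False)])
print(shuffled_criteria(["accuracy", "clarity", "completeness"], random.Random(0)))
print(jury_score({"a-1": [1, 2, 3], "b-1": [4, 4, 5]}))
```

Taking a median at both levels keeps a single outlying call or juror from moving the verdict, which is the point of aggregating rather than fusing answers.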

The eleven domains

Three modalities; seven authored-with-verification, four generator-as-oracle (an embedded solver computes the ground truth, so there is no answer key to leak and parametric items are unlimited).

| Domain | Modality | What it probes |
| --- | --- | --- |
| illusions | visual | Perceptual illusions as deterministic SVG with exact measurements: does the model report the percept or the measured truth? |
| logic_math_cs | text | Reworded lateral / CS / math reasoning stumpers with a single defensible answer. |
| framework_bias | text | Does the model apply a SPECIFIED framework rather than its data-driven prior (instruction-following under bias)? |
| base_bias | text | Five-category round-robin probe of base-rate / prior bias, with an inverted suspect-truth guard. |
| watson_glaser | text | Five-way critical-thinking inference (True / Probably True / Insufficient Data / Probably False / False). |
| theory_of_mind | text | Nested false-belief reasoning. |
| state_tracking | text | Clue / scheduling / spatial puzzles with an internal validation key and "no consistent solution" impossible items. |
| long_arithmetic (oracle) | text | Multi-operand / large-digit exact arithmetic (BigInt): carry-chain drift in the middle digits. |
| regex_automaton (oracle) | text | Regular-language membership over {a,b,c,d}, simulated by an embedded NFA engine. |
| tape_machine (oracle) | text | Step-by-step simulation of a tiny register VM with per-item randomized opcode mnemonics (nothing memorizable). |
| csp_unsat (oracle) | text | Zebra-style constraint puzzles; half are deliberately unsatisfiable to probe the failure to detect UNSAT. |
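To show what an embedded oracle of the regex_automaton kind might look like, here is a minimal NFA membership check over {a,b,c,d}. The transition-table encoding, the function name and the example automaton are assumptions, and epsilon transitions are left out for brevity; this is not the benchmark's engine.

```python
ALPHABET = "abcd"


def accepts(transitions, start, accepting, word):
    """Simulate an NFA by tracking the set of states reachable after each symbol."""
    current = {start}
    for symbol in word:
        if symbol not in ALPHABET:
            raise ValueError(f"symbol {symbol!r} is outside the alphabet")
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), ())
        }
    return bool(current & accepting)


# Hypothetical example automaton: accepts words that contain "ab" as a substring.
CONTAINS_AB = {(0, s): {0} for s in ALPHABET}
CONTAINS_AB[(0, "a")] = {0, 1}  # guess that this "a" starts the match
CONTAINS_AB[(1, "b")] = {2}     # the match completes
CONTAINS_AB.update({(2, s): {2} for s in ALPHABET})

for word in ("cdabd", "ba", "acb"):
    print(word, accepts(CONTAINS_AB, 0, {2}, word))
```

Because the simulator decides membership directly, any randomly generated automaton and word pair comes with a computed label and needs no stored answer.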

Every model runs under an identical protocol: maximum reasoning effort, temperature 1, no web search, one isolated context per item, on a roster pinned by exact provider slug.

Novelty vs existing leaderboards

Crowd arenas (LMArena) are contamination-resistant but reward style and are gameable. Independent aggregators (Artificial Analysis, Epoch) run carefully but inherit the contamination of public benchmarks. Contamination-resistant suites (LiveBench, LiveCodeBench) use rotation + objective scoring; we push further with generator-as-oracle items that are un-leakable by construction and a fully private holdout. To our knowledge the combination of a private multi-modal holdout, generator-as-oracle infinite items, a bias-controlled multi-lab jury, and a real-world-efficacy metric with a demonstrated depth-driven rank inversion is novel.

FAQ

Why is the meo benchmark "un-leaked"?

The holdout is never served. Ground-truth answers live server-side and are never serialized into any export or page. Four of the eleven domains are generator-as-oracle: a solver computes the answer, so there is no fixed answer key to leak and a fresh seed yields unlimited new items with provably-correct labels.

What is Effective Value (𝕍)?

𝕍 = v·(1−E)^N / (C_f + ω·t_base·δ^(E·N)) is a single metric fusing speed, accuracy, cost, and the exponential error-cascade of deep agentic work. It encodes that error is penalized twice (chain success falls and debugging friction rises) and that for autonomous workflows the cost of time dominates the cost of money.

Why does the ranking change with task depth?

At shallow depth (N=1) 𝕍 rewards raw speed; as the chain deepens the (1−E)^N term makes accuracy decisive. Fast-but-flawed models collapse while accurate models pull ahead: the right model for a one-shot autocomplete is often the wrong model for a long autonomous job.

How is meo different from Artificial Analysis or LMArena?

Crowd arenas (LMArena) reward style and are gameable; aggregators (Artificial Analysis, Epoch) run public benchmarks that inherit contamination. meo grades an un-leaked private holdout objective-first, judges open-ended prose with a bias-controlled multi-lab jury, and adds the real-world-efficacy 𝕍 metric. We aggregate AA/LMArena as attributed secondary signals; we complement, not replace, them.