Effective Value (𝕍)

Effective Value (𝕍) is a single metric that fuses speed, accuracy, cost, and the exponential error cascade of deep agentic work. Its headline finding is that the best model depends on task depth: fast models win one-shot tasks, while accurate models dominate deep chains. 𝕍 quantifies an intuition that accuracy-only leaderboards cannot express.

as of 2026-06-08

What is Effective Value?

𝕍 = v · (1 − E)^N / ( C_f + ω · t_base · δ^(E·N) )

The numerator multiplies velocity v (tokens/sec) by the probability (1−E)^N of completing an N-step chain without human rescue, where E is the per-step error rate. The denominator adds the financial cost C_f to a time cost ω · t_base · δ^(E·N), in which the time term t_base is amplified by a compounding friction factor δ^(E·N). A flawed model is therefore penalized twice: its chain-success probability falls and its debugging friction rises. The weight ω encodes that, for autonomous workflows, the cost of time dwarfs the cost of money. The full derivation is on the methodology page.
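As a minimal sketch, the formula above can be transcribed directly into Python. The function name `effective_value` and every numeric input below are hypothetical, chosen only to make the arithmetic visible; the source does not state units for C_f or t_base, so none are assumed here.

```python
def effective_value(v, E, N, C_f, omega, t_base, delta):
    """V = v * (1 - E)^N / (C_f + omega * t_base * delta^(E * N))."""
    chain_success = (1.0 - E) ** N   # probability of finishing all N steps
    friction = delta ** (E * N)      # compounding debugging-friction factor
    return v * chain_success / (C_f + omega * t_base * friction)


# Hypothetical settings, shared by both models so that only E differs.
shared = dict(C_f=0.01, omega=1.0, t_base=1.0, delta=2.0)

perfect = effective_value(v=100.0, E=0.0, N=10, **shared)
flawed = effective_value(v=100.0, E=0.1, N=10, **shared)

print(f"E=0.0: V = {perfect:.2f}")
print(f"E=0.1: V = {flawed:.2f}")
```

With these made-up inputs the flawed model loses more than its chain-success factor alone would explain, because the friction factor also inflates the denominator. That is the double penalty described above.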

Why does the ranking invert with task depth?

At E=0.1, one step succeeds 90% of the time, but a 10-step chain succeeds only ≈35% of the time (0.9^10) and a 40-step chain ≈1.5% (0.9^40). So as depth grows, accuracy overwhelms speed. The ranks below are computed live from the current board:

Model	Accuracy	𝕍-rank (N=1)	𝕍-rank (N=10)	𝕍-rank (N=40)
OpenAI: GPT-5.5	73.2%	6	2	1
Anthropic: Claude Opus 4.8	70.8%	3	1	2
xAI: Grok 4.3	56.6%	1	3	5
Google: Gemini 3.5 Flash	60.0%	2	4	3
DeepSeek: DeepSeek V4 Flash	56.2%	22	11	8
Google: Gemini 3.1 Flash Lite	35.5%	4	18	18

Try it interactively with the N/ω/δ sliders on the dashboard, or browse the best-for-agentic ranking.
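The compounding behind the table can be reproduced in a few lines. The sketch below first checks the E=0.1 figures, then compares two made-up models (a fast one with E=0.3 and a slower one with E=0.1) under the 𝕍 formula to find the depth at which their order flips. The model parameters, the cost settings, and the helper names are illustrative assumptions, not board data.

```python
def chain_success(E, N):
    """Probability that all N steps succeed when each step fails at rate E."""
    return (1.0 - E) ** N


def effective_value(v, E, N, C_f=0.01, omega=1.0, t_base=1.0, delta=2.0):
    """V formula from the text; the default cost settings are hypothetical."""
    return v * chain_success(E, N) / (C_f + omega * t_base * delta ** (E * N))


# Chain-success figures for E = 0.1 at the three depths used in the table.
for N in (1, 10, 40):
    print(f"N={N:>2}: chain success = {chain_success(0.1, N):.3f}")

# Two made-up models: one fast but error-prone, one slower but accurate.
fast = dict(v=200.0, E=0.3)
accurate = dict(v=80.0, E=0.1)


def first_depth_where_accurate_wins(max_depth=100):
    """Smallest N at which the accurate model has the higher V, else None."""
    for N in range(1, max_depth + 1):
        if effective_value(N=N, **accurate) > effective_value(N=N, **fast):
            return N
    return None


crossover = first_depth_where_accurate_wins()
print("accurate model overtakes the fast one at N =", crossover)
```

Changing `delta`, `omega`, or the two error rates moves the crossover depth, which mirrors what the dashboard sliders expose.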

Does more “thinking” mean more accuracy?

Counter-intuitively, no. Across the 22-model roster, accuracy anti-correlates with reasoning+output tokens per response (Spearman ρ=−0.54, p=0.004). Token volume is a struggle signal: the strongest model is among the most concise (~1,380 tokens/correct), while the weakest models thrash (tens of thousands of tokens/correct). For 𝕍 this means velocity and accuracy are not in tension: the accurate models are also the lean ones.
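For readers who want to run the same kind of check on their own logs, here is a minimal pure-Python sketch of a Spearman rank correlation (Pearson correlation applied to average ranks). The six data points are invented for illustration and are not the benchmark's measurements; the helper names are likewise hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example data: accuracy vs. tokens emitted per response.
accuracy = [0.80, 0.70, 0.65, 0.50, 0.45, 0.30]
tokens_per_response = [1500, 2500, 2000, 6000, 9000, 30000]

rho = spearman_rho(accuracy, tokens_per_response)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative value on real data would match the pattern reported above; this sketch does not compute a p-value.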

What this means for choosing a model

One-shot / interactive: optimize for speed and cost-per-correct — fast models win shallow tasks.
Deep agentic / autonomous: optimize for accuracy. At N≥10 the error cascade pushes the most accurate models to the top of the 𝕍 ranking, even if they are pricier. “Expensive implies low value” is false for deep tasks.
Budget-sensitive: rank by cost-per-correct, not sticker price — a cheap model that needs retries is not cheap.
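The retry point in the last item can be made concrete with a small sketch. It assumes that cost-per-correct means expected spend until the first correct answer, with independent retries at a fixed price per attempt; that definition, the function name, and both price/accuracy pairs are assumptions for illustration, not figures from the benchmark.

```python
def cost_per_correct(cost_per_attempt, accuracy):
    """Expected spend per correct answer under independent retries.

    With success probability `accuracy` per attempt, the expected number
    of attempts until the first success is 1 / accuracy.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


# Hypothetical models: one with a low sticker price, one with a higher one.
cheap = cost_per_correct(cost_per_attempt=0.002, accuracy=0.25)
pricey = cost_per_correct(cost_per_attempt=0.005, accuracy=0.80)

print(f"cheap model:  {cheap:.5f} per correct answer")
print(f"pricey model: {pricey:.5f} per correct answer")
```

In this made-up case the model with the lower sticker price ends up costing more per correct answer, which is the situation the budget-sensitive rule guards against.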

FAQ

What is Effective Value (𝕍)?

𝕍 = v·(1−E)^N / (C_f + ω·t_base·δ^(E·N)) — a single metric fusing velocity, chain-success probability, financial cost, and time amplified by a compounding debugging-friction term. It measures real-world efficacy per unit of money and time, not just how often a model is right.

Why does the best model depend on task depth?

The chain-success term (1−E)^N means small error rates compound catastrophically over many steps. At shallow depth 𝕍 rewards speed; at deep depth accuracy dominates. The right model for a one-shot autocomplete is often the wrong model for a long autonomous job.

Do models that emit more reasoning tokens score higher?

No. Across models, accuracy anti-correlates with reasoning+output tokens (Spearman ρ=−0.54, p=0.004). Token volume is a struggle signal, not a capability signal: the strongest models are among the most concise.

Does paying more for a model buy more accuracy?

Only weakly, with steep diminishing returns: price and accuracy tend to rise together in rank order (ρ=0.58), but the linear relationship is weak (r=0.29). A 12× cheaper model can land within a few points of mid-priced flagships, so cost-per-correct, not token price, is the decision-relevant quantity.

Source: meo-benchmark preprint (Zenodo). Related: sycophancy · multi-LLM-as-judge · cheap-ensemble fusion.