AI Model Research

Original research from the meo-benchmark project: the Effective Value metric and its depth-driven rank inversion, an epistemic-integrity (sycophancy) study, a negative result on cheap-ensemble fusion, and bias-controlled multi-LLM-as-judge methodology.

Effective Value (𝕍): the metric whose ranking inverts with task depth

Why intelligence-per-dollar and the exponential error cascade of deep agentic work change which model you should pick, and why more reasoning tokens predict LOWER accuracy.
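To make the depth effect concrete, here is a minimal sketch. It is not the project's definition of 𝕍. It assumes that every step of a task succeeds independently with a fixed per-step accuracy, so a chain of `depth` steps succeeds with probability `step_accuracy ** depth`, and it divides that by a relative cost. The two models, their accuracies, and their costs are hypothetical numbers chosen only for illustration.

```python
def chain_success(step_accuracy: float, depth: int) -> float:
    """Probability that all `depth` steps succeed, assuming independent steps."""
    return step_accuracy ** depth


def value_per_dollar(step_accuracy: float, cost: float, depth: int) -> float:
    """Illustrative ratio only (an assumption, not the published metric)."""
    return chain_success(step_accuracy, depth) / cost


# Hypothetical models: (per-step accuracy, relative cost per task).
models = {"cheap": (0.95, 1.0), "strong": (0.99, 5.0)}


def best_model(depth: int) -> str:
    """Name of the model with the highest illustrative value at this depth."""
    return max(models, key=lambda name: value_per_dollar(*models[name], depth))


for depth in (1, 10, 50):
    print(depth, best_model(depth))
```

Under these assumed numbers the cheap model ranks first on shallow tasks, while the stronger model overtakes it once the chain is deep enough for the per-step accuracy gap to compound.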

LLM sycophancy: separating principled resistance from stubbornness

A subtle-statistics epistemic-integrity test (held × corrigibility) that cleanly discriminates between models and shows that resistance is a property of training, not of scale or price.

Multi-LLM-as-judge: a bias-controlled jury beats a single judge

How a panel of models from disjoint families, a never-judge-your-own-lab rule, and median/majority scoring together remove the biases that make a single large LLM judge unreliable.
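The jury idea can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the project's implementation: the function name `jury_score`, the judge and lab names, and the scores are all hypothetical. It drops any judge that comes from the same lab as the model being scored, then takes the median of the remaining scores so that one outlying judge cannot move the result much.

```python
from statistics import median


def jury_score(candidate_lab: str,
               scores_by_judge: dict[str, float],
               judge_lab: dict[str, str]) -> float:
    """Median score over judges that are not from the candidate's own lab."""
    eligible = [
        score
        for judge, score in scores_by_judge.items()
        if judge_lab[judge] != candidate_lab  # never judge your own lab
    ]
    if not eligible:
        raise ValueError("no eligible judges for this candidate")
    return median(eligible)


# Hypothetical panel: one judge per lab, so the families are disjoint.
judge_lab = {"judge_a": "lab_a", "judge_b": "lab_b",
             "judge_c": "lab_c", "judge_d": "lab_d"}
# Hypothetical scores for one answer written by a model from lab_a.
scores = {"judge_a": 9.5, "judge_b": 6.0, "judge_c": 7.0, "judge_d": 6.5}

print(jury_score("lab_a", scores, judge_lab))
```

In this made-up example the same-lab judge gave the highest score. Excluding it and taking the median keeps that score out of the final number.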

Cheap-ensemble fusion: a negative result (and a denominator-artifact lesson)

Why majority voting over cheap models did NOT beat the best single model on reasoning domains (the cheap models' errors are correlated), plus a measurement trap that manufactures false gains.
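A toy example shows why correlated errors limit majority voting. The answers below are invented for illustration and are not results from the study. The sketch covers only the correlated-error point, not the denominator artifact. When the cheap models are wrong on the same items, the vote simply reproduces their shared mistake.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Most common answer among the voters."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of items answered correctly, over ALL items."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


gold = ["A", "B", "C", "D", "A", "B"]

# Hypothetical cheap models that fail on the same items (correlated errors).
cheap_models = [
    ["A", "B", "C", "X", "X", "B"],
    ["A", "B", "C", "X", "X", "B"],
    ["A", "B", "X", "X", "X", "B"],
]
# Hypothetical stronger single model.
best_single = ["A", "B", "C", "D", "X", "B"]

# Vote item by item across the three cheap models.
ensemble = [majority_vote(list(votes)) for votes in zip(*cheap_models)]

print("ensemble:", accuracy(ensemble, gold))
print("best single:", accuracy(best_single, gold))
```

Voting repairs the one item where a single cheap model slipped on its own, but it cannot repair the items all three get wrong, so the ensemble stays below the stronger single model in this example.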

Based on the meo-benchmark preprint (Zenodo 10.5281/zenodo.20586608). See the methodology for how scores are produced.