AI Model Research
Original research from the meo-benchmark project: the Effective Value metric and its depth-driven rank inversion, an epistemic-integrity (sycophancy) study, a negative result on cheap-ensemble fusion, and a bias-controlled multi-LLM-as-judge methodology.
Why intelligence-per-dollar and the exponential error cascade of deep agentic work change which model you should pick, and why more reasoning tokens predict LOWER accuracy.
A subtle-statistics epistemic-integrity test (held × corrigibility) that cleanly discriminates between models, and shows that resistance to sycophancy is a property of training, not of scale or price.
How a panel of disjoint-family models, a never-judge-own-lab rule, and median/majority scoring remove the biases that make a single large LLM judge unreliable.
Why majority voting over cheap models did NOT beat the best single model on reasoning domains (correlated errors), plus a measurement trap that manufactures false gains.
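To make the correlated-errors point concrete, here is a minimal toy sketch. It is not the preprint's analysis: the mixture model, the `shared` parameter, and all accuracy figures are hypothetical and chosen only for illustration. It assumes that on a fraction `shared` of questions every cheap model gives the same answer (all right or all wrong together), and that the models err independently on the rest.

```python
from math import comb


def independent_majority(p: float, n: int) -> float:
    """Accuracy of a majority vote over n independent voters, each correct
    with probability p. n must be odd so that ties cannot occur."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def majority_accuracy(p: float, n: int, shared: float) -> float:
    """Toy mixture (an assumption, not a fitted model): with probability
    `shared` all voters answer identically, so the vote is only as good as
    one voter; otherwise the voters are independent."""
    return shared * p + (1 - shared) * independent_majority(p, n)


# Hypothetical numbers: five cheap models at 70% vs. one strong model at 80%.
cheap_p, strong_p, n_models = 0.70, 0.80, 5
for shared in (0.0, 0.5, 1.0):
    ensemble = majority_accuracy(cheap_p, n_models, shared)
    verdict = "beats" if ensemble > strong_p else "does not beat"
    print(f"shared={shared:.1f}: ensemble={ensemble:.4f} {verdict} single={strong_p:.2f}")
```

Under these assumptions the ensemble only wins when the models' errors are mostly independent. Once half of the errors are shared, five 70% voters fall below a single 80% model, and with fully shared errors voting adds nothing.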
Based on the meo-benchmark preprint (Zenodo 10.5281/zenodo.20586608). See the methodology for how scores are produced.
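As a rough illustration of the judge-panel idea summarized above, the sketch below excludes any judge from the same lab as the model being scored and takes the median of the remaining scores. The function name, the data layout, and the example labs and scores are all hypothetical, and "lab" stands in for "model family" here. The project's actual scoring pipeline may differ.

```python
from statistics import median


def panel_score(candidate_lab: str, judges: dict[str, tuple[str, float]]) -> float:
    """Median score from a judge panel, skipping judges from the candidate's lab.

    `judges` maps a judge name to (judge_lab, score). This layout is an
    assumption made for the sketch, not the project's real schema.
    """
    labs = [lab for lab, _ in judges.values()]
    # Disjoint-family panel: at most one judge per lab.
    if len(set(labs)) != len(labs):
        raise ValueError("panel must contain at most one judge per lab")

    # Never-judge-own-lab: drop any judge that shares a lab with the candidate.
    eligible = [score for lab, score in judges.values() if lab != candidate_lab]
    if not eligible:
        raise ValueError("no eligible judges for this candidate")

    # The median damps a single outlier judge more than the mean would.
    return median(eligible)


# Hypothetical panel: judge_a comes from the same lab as the first candidate.
panel = {
    "judge_a": ("lab_a", 9.0),
    "judge_b": ("lab_b", 6.0),
    "judge_c": ("lab_c", 7.0),
    "judge_d": ("lab_d", 5.0),
}
print(panel_score("lab_a", panel))  # judge_a excluded
print(panel_score("lab_x", panel))  # all four judges eligible
```

In this made-up example the same-lab judge gives the highest score, so excluding it lowers the candidate's panel score from what a naive all-judges median would report.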