Multi-LLM-as-judge

When you must grade open-ended LLM output, a bias-controlled multi-lab jury beats a single large LLM judge — at lower cost and with less bias. The meo benchmark grades objectively wherever a ground truth exists, and reserves the jury only for genuinely open-ended prose.

Why not just use one big LLM as the judge?

A single LLM judge introduces bias — favoring its own family, style, or verbosity — and can break down on hard items. The strongest lesson from contamination-resistant benchmarks is to avoid an LLM judge wherever a ground truth exists. meo therefore grades objective-first: an atomic exact/structured check, plus an LLM-equivalence fallback that asks a separate model only whether the candidate is semantically equivalent to the stored ground truth. That fallback avoids prose false-negatives, where a correct answer is marked wrong merely for differing surface form. Only the genuinely open-ended slices — illusion-mechanism explanations and framework-justification prose — use the jury.
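The objective-first order described above can be sketched in a few lines. This is a minimal illustration and not meo's actual implementation: the names `normalize`, `grade_objective_first`, and `equivalence_judge` are hypothetical, and the whitespace/case normalization is an assumed stand-in for whatever exact or structured check a given item uses.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(text.lower().split())


def grade_objective_first(
    candidate: str,
    ground_truth: str,
    equivalence_judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if the candidate matches the stored ground truth.

    Step 1 is a cheap exact check. Step 2, reached only when step 1 fails,
    asks a separate model a single narrow question: is the candidate
    semantically equivalent to the ground truth? The judge is passed in as
    a callable so this sketch stays independent of any particular model API.
    """
    # Step 1: atomic exact check on normalized strings.
    if normalize(candidate) == normalize(ground_truth):
        return True

    # Step 2: equivalence fallback, to avoid false negatives on surface form.
    if equivalence_judge is not None:
        return equivalence_judge(candidate, ground_truth)

    # No fallback configured: a failed exact check is a failed item.
    return False
```

Because the fallback answers only an equivalence question against a fixed reference, it never has to judge quality, which is the part of judging where family and verbosity bias enter.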

The four load-bearing jury rules

One model per lab, disjoint families: no single training lineage can dominate the verdict. This is the panel’s main lever against bias.
Independent multi-call scoring: each juror scores blind to the others, and rubric-criterion order is randomized to mitigate position/verbosity bias. Results are aggregated by median for the numeric score and by majority for the categorical verdict.
Never judge your own lab — when lab X is the taker, lab X’s juror is swapped for an alternate, so a lab never scores its own output.
Per-modality filtering — visual items are judged only by vision-capable jurors.
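The four rules above can be sketched together as panel selection plus aggregation. This is an illustrative sketch under stated assumptions, not the benchmark's code: the `Juror` type, the function names, the lab and model names used with them, and the choice to return `None` when no verdict reaches a strict majority are all hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Juror:
    lab: str      # training lineage; at most one primary juror per lab
    model: str    # hypothetical model name
    vision: bool  # whether the juror can read visual items


def select_panel(primary: Sequence[Juror], alternates: Sequence[Juror],
                 taker_lab: str, needs_vision: bool) -> List[Juror]:
    """Apply the own-lab swap and the per-modality filter."""
    def eligible(j: Juror) -> bool:
        return j.lab != taker_lab and (j.vision or not needs_vision)

    panel = [j for j in primary if eligible(j)]

    # If the taker's own lab had a seat, fill it from the alternates,
    # keeping families disjoint (no lab appears twice).
    if any(j.lab == taker_lab for j in primary):
        used_labs = {j.lab for j in panel}
        for alt in alternates:
            if eligible(alt) and alt.lab not in used_labs:
                panel.append(alt)
                break
    return panel


def shuffled_rubric(criteria: Sequence[str], rng: random.Random) -> List[str]:
    """Return the rubric criteria in a fresh random order for one juror call."""
    return rng.sample(list(criteria), k=len(criteria))


def aggregate(scores: Sequence[float],
              verdicts: Sequence[str]) -> Tuple[float, Optional[str]]:
    """Median for the numeric score, majority for the categorical verdict.

    Assumption: a verdict needs a strict majority; otherwise None is returned.
    """
    top, count = Counter(verdicts).most_common(1)[0]
    majority = top if count > len(verdicts) / 2 else None
    return median(scores), majority
```

The median is used rather than the mean so that one outlying juror cannot move the numeric score, which matches the intent of the disjoint-family rule.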

Why a panel beats a single judge

Following the panel-of-LLMs (PoLL) finding, a panel of smaller, disjoint-family models beats a single large judge at far lower cost and with less intra-model bias. meo applies the known LLM-judge bias mitigations: independent scoring, randomized criterion order, score-don’t-rank, and never-judge-own-lab.

Judging is not fusion

A jury that scores is not the same as an answer-fusion router that synthesizes one answer from a panel. meo deliberately does not use fusion as the verdict mechanism — bias-controlled multi-judge scoring and answer fusion are different tools (see the cheap-ensemble fusion negative result).

Calibrating difficulty with a cold panel

The same disjoint-family principle calibrates item difficulty. Items are authored by an automated generate → novelty-gate → difficulty-gate → self-check loop. A cold frontier panel (five vision-capable flagships from five disjoint labs, given no rubric or hints) confirms at authoring time that an item stumps current models. Multi-sample calibration (multiple samples at temperature 0.7) then places the item in a “hard-but-solvable” band. Finally, for the most failure-prone domains, an independent verifier model re-derives the answer to confirm well-posedness before the item is admitted.
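As a rough sketch, the cold-panel check and the multi-sample band can be combined into one admission decision. The source does not state numeric thresholds, so `min_cold_failures` and `band` below are hypothetical placeholders, as are the names `CalibrationResult` and `calibrate`; the verifier step for failure-prone domains is left out.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CalibrationResult:
    cold_panel_failures: int  # how many cold-panel models got the item wrong
    solve_rate: float         # fraction of sampled attempts that were correct
    admitted: bool


def calibrate(
    cold_panel_correct: Sequence[bool],
    sample_correct: Sequence[bool],
    min_cold_failures: int = 3,              # hypothetical threshold
    band: Tuple[float, float] = (0.1, 0.6),  # hypothetical "hard-but-solvable" band
) -> CalibrationResult:
    """Admit an item only if it stumps the cold panel yet stays solvable.

    cold_panel_correct: one outcome per cold-panel model (no rubric, no hints).
    sample_correct: outcomes of repeated sampling, e.g. at temperature 0.7.
    """
    failures = sum(1 for ok in cold_panel_correct if not ok)
    solve_rate = sum(1 for ok in sample_correct if ok) / len(sample_correct)

    low, high = band
    hard_enough = failures >= min_cold_failures
    solvable = low <= solve_rate <= high  # a rate of 0.0 suggests an ill-posed item
    return CalibrationResult(failures, solve_rate, hard_enough and solvable)
```

The lower bound of the band matters as much as the upper one: an item that no sample ever solves may be broken rather than hard, which is the same concern the independent verifier addresses.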

FAQ

Is LLM-as-a-judge reliable?

A single judge is biased and brittle — it favors its own family, style, and verbosity, and breaks down on hard items. A bias-controlled multi-lab jury is far more reliable, and objective grading against a stored ground truth is preferred wherever a ground truth exists.

How do you reduce LLM judge bias?

Use a disjoint-family panel, never let a lab judge its own output, score independently and aggregate by median (numeric) and majority (categorical), randomize rubric-criterion order, and score rather than rank.

Jury vs single judge — which is better?

A panel of smaller, disjoint-family judges beats a single large judge at far lower cost and with less intra-model bias — the panel-of-LLMs (PoLL) finding.

Is multi-judge the same as model fusion?

No. Scoring with a jury is a measurement step; fusion synthesizes one answer from a panel. meo deliberately does not use fusion as its verdict mechanism — they are different tools.

Source: meo-benchmark preprint (Zenodo). Related: methodology · sycophancy (which uses the jury) · Effective Value · cheap-ensemble fusion.