Cheap-ensemble fusion: a negative result

We tested whether a majority vote over a cheap ensemble of LLMs could beat the best single cheap model on objective reasoning domains. The result was negative. No 2- or 3-model combination beat the best single model on accuracy, and none beat the cheapest single on cost-per-correct. The more valuable contribution is a measurement lesson: a “denominator artifact” that can make an ensemble look better than it is.

The experiment

Five cheap models were run over 48 objective-text items spanning logic, generator-as-oracle, and base-rate-bias tasks, at temperature 1 with maximum reasoning and no web search. We then scored every 2- and 3-model majority-vote combination. To keep the comparison fair, every single model and every combination is evaluated on the same 36-item subset: the items on which no member returned an error. That way no configuration is quietly credited for an easier set of questions.

The result was negative

On the fair 36-item subset, no fusion combo improved on the best single model, and every combo cost more than the cheapest single per correct answer:

Configuration	Accuracy	Cost / correct	Note
Best single model (qwen3.7-plus / deepseek-v4-pro)	66.7%	~$0.012–0.014	baseline to beat
Cheapest single (ring-2.6-1t)	63.9%	$0.0020	cost-per-correct floor
Best 2–3-model fusion combo	66.7%	$0.014–0.038	ties on accuracy, costs more

Per the pre-registered decision gate (stop unless a 2-model pair beats the best single on accuracy or on accuracy-per-dollar), the factorial expansion was halted.
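The gate can be written down as a small check. This is a minimal sketch, not the benchmark's actual code: it assumes cost-per-correct means total spend divided by correct answers and accuracy-per-dollar means accuracy divided by total spend, and the `Result` class, the function names, and the dollar totals are hypothetical values chosen to sit inside the ranges in the table.

```python
from dataclasses import dataclass

N_ITEMS = 36  # size of the fair subset described in the text


@dataclass
class Result:
    name: str
    correct: int       # items answered correctly out of N_ITEMS
    total_cost: float  # total spend in dollars over the item set


def accuracy(r: Result) -> float:
    return r.correct / N_ITEMS


def cost_per_correct(r: Result) -> float:
    # Assumed definition: total spend divided by correct answers.
    return r.total_cost / r.correct


def accuracy_per_dollar(r: Result) -> float:
    # Assumed definition: accuracy divided by total spend.
    return accuracy(r) / r.total_cost


def gate_passes(pair: Result, best_single: Result) -> bool:
    """Continue only if the pair strictly beats the best single on either metric."""
    return (accuracy(pair) > accuracy(best_single)
            or accuracy_per_dollar(pair) > accuracy_per_dollar(best_single))


# Hypothetical totals: 24 of 36 correct is 66.7%; the spend figures are illustrative.
best_single = Result("best-single", correct=24, total_cost=24 * 0.012)
pair = Result("pair", correct=24, total_cost=24 * 0.014)

print(gate_passes(pair, best_single))  # a tie on accuracy at higher cost does not pass
```

Under these definitions, on a fixed item set accuracy-per-dollar equals 1 / (N_ITEMS × cost-per-correct), so a combination that ties on accuracy and costs more per correct answer cannot pass the gate.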

Why fusion failed: correlated errors

Majority vote only helps when members err independently. On deterministic reasoning and puzzle domains, the hard items are hard for all cheap models — a correlated failure — so the ensemble misses the same items and merely sums the cost of every member, with no recovery. A 2-model “majority” is additionally degenerate: ties break to the first member, so it is just that member with extra spend bolted on.
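A short sketch makes both points concrete: the tie-break rule that reduces a pair to its first member, and the way shared misses leave a trio no better than its best member. The function names and the toy answers below are hypothetical and serve only to illustrate the mechanism; they are not data from the experiment.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the earliest member."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    # Scan in member order so the first answer with the top count wins ties.
    return next(a for a in answers if counts[a] == top)


def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data: four items, gold answer is always "A".
gold = ["A", "A", "A", "A"]
m1 = ["A", "A", "B", "B"]
m2 = ["A", "B", "B", "B"]
m3 = ["A", "A", "B", "C"]  # all three members miss the last two items

pair = [majority_vote([a, b]) for a, b in zip(m1, m2)]
trio = [majority_vote(list(votes)) for votes in zip(m1, m2, m3)]

print(pair == m1)            # the pair reproduces its first member exactly
print(accuracy(trio, gold))  # the trio recovers nothing on the shared misses
```

With two members, any disagreement is a 1-1 tie, so the vote can never differ from the first member. With three members that fail on the same items, the vote inherits those failures while the spend is the sum of all three.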

The denominator-artifact lesson (the real contribution)

A naive first export showed combos “winning” by +13 points (69% vs 56%). This was a denominator artifact. The scorer counted a single model’s error as a zero over all 48 items, but a combo skipped items where fewer than two members produced an answer (the “≥2 valid” rule). Combos were therefore silently scored on an easier no-error subset while singles still carried their error-zeros. Restricting every model to the common no-error 36-item subset removed the bias, and the apparent gain vanished.
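The artifact is easy to reproduce on toy data. The sketch below is an illustration under stated assumptions, not the original scorer: `ERROR`, the function names, and the six-item answer lists are hypothetical. It scores singles with errors counted as wrong, scores the combo with a "≥2 valid" skip rule, and then repeats both on the common no-error subset.

```python
from collections import Counter

ERROR = None  # hypothetical marker: the member returned no usable answer


def vote(answers):
    """Majority vote; ties break to the earliest answer in member order."""
    counts = Counter(answers)
    top = max(counts.values())
    return next(a for a in answers if counts[a] == top)


def single_accuracy(answers, gold, items):
    """Accuracy of one member over `items`; an error counts as wrong."""
    return sum(answers[i] == gold[i] for i in items) / len(items)


def combo_accuracy(members, gold, items, min_valid=2):
    """Vote accuracy over `items`, skipping items with fewer than min_valid answers."""
    scored = [i for i in items
              if sum(m[i] is not ERROR for m in members) >= min_valid]
    correct = sum(
        vote([m[i] for m in members if m[i] is not ERROR]) == gold[i]
        for i in scored
    )
    return correct / len(scored), len(scored)


# Hypothetical toy data: six items, gold answer is always "A".
gold = ["A"] * 6
m1 = ["A", "A", "B", ERROR, ERROR, "A"]
m2 = ["A", "A", "B", ERROR, "B", "A"]
m3 = ["A", "B", "B", "A", ERROR, "A"]
members = [m1, m2, m3]
all_items = range(len(gold))

# Naive comparison: singles over all items, combo over whatever it did not skip.
naive_best_single = max(single_accuracy(m, gold, all_items) for m in members)
naive_combo, naive_n = combo_accuracy(members, gold, all_items)

# Fair comparison: everyone over the items where no member errored.
common = [i for i in all_items if all(m[i] is not ERROR for m in members)]
fair_best_single = max(single_accuracy(m, gold, common) for m in members)
fair_combo, fair_n = combo_accuracy(members, gold, common)

print(naive_best_single, naive_combo, naive_n)  # 0.5 0.75 4
print(fair_best_single, fair_combo, fair_n)     # 0.75 0.75 4
```

In the naive comparison the combo appears 25 points ahead, but only because it was scored on four items while each single was scored on six. On the common four items the combo merely ties the best single.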

The general lesson: never compare accuracy across models with different effective denominators. An abstention or refusal policy that quietly shrinks the denominator can manufacture a false improvement. This caveat applies to any leaderboard that mixes models with different coverage or refusal behavior.

Caveats and open questions

The fair subset is small (36 items) and the singles clustered tightly at 61–67%, so this is a pilot, not a verdict for all settings. What remains untested is fusion on noisy or open-ended domains, where member errors may be more independent and an ensemble could still help — this pilot only covered objective reasoning. The honest negative result is itself a credibility signal: we report what did not work, not just what did.
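To see why 36 items supports only a pilot, note that 61–67% on 36 items corresponds to 22 to 24 correct answers, a spread of two items. The sketch below computes a 95% Wilson score interval for 24 of 36; the choice of the Wilson interval is an assumption made here for illustration, not a method taken from the source.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


lo, hi = wilson_interval(24, 36)
print(f"24/36 correct: interval roughly {lo:.2f} to {hi:.2f}")
print(lo < 22 / 36 < hi)  # the lowest single (22/36) falls inside the interval
```

The interval for the best single spans roughly 0.50 to 0.80, which comfortably contains every other single's score, so differences of one or two items cannot be read as a ranking.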

FAQ

Does ensembling LLMs improve accuracy?

Not in this pilot, which covered objective reasoning domains with cheap models. The hard items are hard for every cheap member at once (correlated errors), so a majority vote misses the same items and just adds cost. Ensembling may still help where members err independently, for example on noisier or more open-ended tasks, but that remains untested here.

Why didn’t cheap-ensemble fusion beat the best single model?

Majority vote only recovers an item when members fail independently. On deterministic reasoning and puzzle domains the hard items are hard for all cheap members, so the ensemble misses the same ones and merely sums the cost of every member — no accuracy is recovered.

What is a denominator artifact?

It is the distortion that arises when accuracy is compared across models that were effectively scored on different numbers of items, for example when one model abstains or skips items it cannot answer, which shrinks its denominator. The model with the smaller, easier denominator looks better, which can manufacture a false gain that vanishes once every model is scored on the same items.

Is a negative result useful?

Yes. It prevents wasted spend on an approach that does not pay off here, and in this case it surfaced a measurement trap — the denominator artifact — that applies to any leaderboard mixing models with different coverage or refusal behavior.

Source: meo-benchmark preprint (Zenodo).