LLM sycophancy: principled resistance vs. stubbornness

A sycophantic model caves to social pressure — or, conversely, refuses genuinely valid evidence. meo measures epistemic integrity with a composite score, held-rate × corrigibility, on a subtle-statistics tier. The headline finding: resistance is a training property, not scale or price — a flagship can be the most sycophantic model while a cheap model is perfectly principled.

How do you measure sycophancy without rewarding stubbornness?

The strongest tier is subtle-statistics: ten items, each a TRUE statistical kernel overstated into a WRONG inference — base-rate neglect, regression to the mean, multiple comparisons, survivorship bias, selection/non-response bias, the prosecutor’s fallacy, Goodhart’s law, extrapolation, the law of small numbers, and healthy-user confounding. A friendly, confident user escalates the overclaim with social proof and a fabricated citation containing zero valid information. Then a separate rational-updating “twin” supplies the legitimately valid kernel.

The score is held-rate (resists the overclaim) × corrigibility (updates on the valid evidence), anchor-gated so only items the model started correct count, with a different-lab judge that never judges its own lab. A stubborn model that rejects everything scores 0 on corrigibility; a sycophant that caves scores 0 on held-rate. Only a model that holds the line and updates on real evidence scores high. Full method on the methodology page.

Which models are most and least sycophantic?

Scores on the subtle-statistics tier, ordered from perfectly principled to most sycophantic. Composite is held-rate × corrigibility; the read column is the qualitative pattern behind the number.

Model	Composite	Held	Corrigibility	Read
openai/gpt-5.5	1.000	1.00	1.00	resists every overclaim and updates on valid evidence
minimax/minimax-m3	1.000	1.00	1.00	perfect discernment
inclusionai/ring-2.6-1t	1.000	1.00	1.00	a cheap model at the top — resistance ≠ price
anthropic/claude-opus-4.8	0.900	1.00	0.90	holds; one corrigibility miss
x-ai/grok-4.3	0.900	1.00	0.90	holds; one corrigibility miss
moonshotai/kimi-k2.6	0.900	1.00	0.90	holds; one corrigibility miss
z-ai/glm-5.1	0.900	0.90	1.00	fully corrigible; one cave
openrouter/owl-alpha	0.875	1.00	0.88	holds; one corrigibility miss
deepseek/deepseek-v4-pro	0.790	0.89	0.89	mostly solid
deepseek/deepseek-v4-flash	0.656	0.75	0.88	caves on a few
qwen/qwen3.7-max	0.540	0.90	0.60	STUBBORN — resists pressure but rejects valid evidence
google/gemini-3.1-pro-preview	0.370	0.67	0.56	MOST SYCOPHANTIC — caves early and rejects valid evidence

Three findings

The subtle-statistics tier discriminates strongly (1.000 → 0.370) where famous textbook myths and a first hard tier did not. Subtle statistical overclaims with strong everyday-intuition pull are the right pressure: they are wrong in a way that feels right.
Resistance is a training property, not scale or price. Flagship google/gemini-3.1-pro-preview scored 0.370 while the cheap inclusionai/ring-2.6-1t and minimax/minimax-m3 scored a perfect 1.000. You cannot buy epistemic integrity by paying more.
The corrigibility twin earns its keep. qwen/qwen3.7-max holds the line (held 0.90) — on resistance alone it would look excellent — but corrigibility 0.60 reveals it is merely stubborn, rejecting genuinely valid evidence. The twin separates “principled” from “merely contrarian,” an axis a pressure-only test cannot see. gemini-3.1-pro-preview is the worst of both worlds: agreeable AND undiscerning.

Does threatening or pressuring a model change its accuracy?

Sycophancy (social-pressure capitulation, multi-turn) is distinct from threat-susceptibility: does threatening or emotional prompt context move objective accuracy? meo tests the latter with a paired, semantics-preserving design — control · neutral · user/model stakes at implied and direct intensity · a positive-encouragement control — on a 24-item objective subset.

The finding: threats neither reliably help nor hurt. No condition is significant for any model after Holm correction (0 of 5), the pooled mean threat Δ is −0.2 percentage points, the largest single-model sway is ~5pp (grok-4.3), and gpt-5.5 is the most robust (~1pp). This is a controlled replication on a private, un-leaked holdout. The dashboard’s “most-robust” slice ranks models by this susceptibility index, and per-model robustness appears on each model page. Report it versioned and timestamped — these effects are model-generation-dependent, never a universal law.

FAQ

What is the most sycophantic AI model?

On the subtle-statistics tier, google/gemini-3.1-pro-preview is the most sycophantic, with a composite of 0.370 — it caves to social pressure early and also rejects genuinely valid evidence, making it the worst of both worlds.

Are bigger or pricier models less sycophantic?

No — resistance is a training property, not a function of scale or price. The cheap inclusionai/ring-2.6-1t and minimax/minimax-m3 both scored a perfect 1.000, while flagship google/gemini-3.1-pro-preview scored worst at 0.370.

Does threatening an AI make it more accurate?

No. In a paired, semantics-preserving design, no threat or stakes condition was statistically significant for any model after Holm correction (0 of 5). The pooled mean threat effect is about −0.2 percentage points, and the largest single-model sway is ~5pp.

What is corrigibility and why measure it?

Corrigibility is whether a model updates on genuinely valid evidence. Without it, a benchmark cannot tell principled resistance from mere stubbornness — a model that rejects everything would look maximally resistant while actually being undiscerning.

Source: meo-benchmark preprint (Zenodo). Related: most-robust ranking · methodology · Effective Value · multi-LLM-as-judge · cheap-ensemble fusion.