LLM sycophancy: principled resistance vs. stubbornness
A sycophantic model caves to social pressure — or, conversely, refuses genuinely valid evidence. meo measures epistemic integrity with a composite score, held-rate × corrigibility, on a subtle-statistics tier. The headline finding: resistance is a training property, not scale or price — a flagship can be the most sycophantic model while a cheap model is perfectly principled.
How do you measure sycophancy without rewarding stubbornness?
The strongest tier is subtle-statistics: ten items, each a TRUE statistical kernel overstated into a WRONG inference — base-rate neglect, regression to the mean, multiple comparisons, survivorship bias, selection/non-response bias, the prosecutor’s fallacy, Goodhart’s law, extrapolation, the law of small numbers, and healthy-user confounding. A friendly, confident user escalates the overclaim with social proof and a fabricated citation containing zero valid information. Then a separate rational-updating “twin” supplies the legitimately valid kernel.
The score is held-rate (resists the overclaim) × corrigibility (updates on the valid evidence), anchor-gated so only items the model started correct count, with a different-lab judge that never judges its own lab. A stubborn model that rejects everything scores 0 on corrigibility; a sycophant that caves scores 0 on held-rate. Only a model that holds the line and updates on real evidence scores high. Full method on the methodology page.
Which models are most and least sycophantic?
Scores on the subtle-statistics tier, ordered from perfectly principled to most sycophantic. Composite is held-rate × corrigibility; the read column is the qualitative pattern behind the number.
| Model | Composite | Held | Corrigibility | Read |
|---|---|---|---|---|
| openai/gpt-5.5 | 1.000 | 1.00 | 1.00 | resists every overclaim and updates on valid evidence |
| minimax/minimax-m3 | 1.000 | 1.00 | 1.00 | perfect discernment |
| inclusionai/ring-2.6-1t | 1.000 | 1.00 | 1.00 | a cheap model at the top — resistance ≠ price |
| anthropic/claude-opus-4.8 | 0.900 | 1.00 | 0.90 | holds; one corrigibility miss |
| x-ai/grok-4.3 | 0.900 | 1.00 | 0.90 | holds; one corrigibility miss |
| moonshotai/kimi-k2.6 | 0.900 | 1.00 | 0.90 | holds; one corrigibility miss |
| z-ai/glm-5.1 | 0.900 | 0.90 | 1.00 | fully corrigible; one cave |
| openrouter/owl-alpha | 0.875 | 1.00 | 0.88 | holds; one corrigibility miss |
| deepseek/deepseek-v4-pro | 0.790 | 0.89 | 0.89 | mostly solid |
| deepseek/deepseek-v4-flash | 0.656 | 0.75 | 0.88 | caves on a few |
| qwen/qwen3.7-max | 0.540 | 0.90 | 0.60 | STUBBORN — resists pressure but rejects valid evidence |
| google/gemini-3.1-pro-preview | 0.370 | 0.67 | 0.56 | MOST SYCOPHANTIC — caves early and rejects valid evidence |
Three findings
- The subtle-statistics tier discriminates strongly (1.000 → 0.370) where famous textbook myths and a first hard tier did not. Subtle statistical overclaims with strong everyday-intuition pull are the right pressure: they are wrong in a way that feels right.
- Resistance is a training property, not scale or price. Flagship google/gemini-3.1-pro-preview scored 0.370 while the cheap inclusionai/ring-2.6-1t and minimax/minimax-m3 scored a perfect 1.000. You cannot buy epistemic integrity by paying more.
- The corrigibility twin earns its keep. qwen/qwen3.7-max holds the line (held 0.90) — on resistance alone it would look excellent — but corrigibility 0.60 reveals it is merely stubborn, rejecting genuinely valid evidence. The twin separates “principled” from “merely contrarian,” an axis a pressure-only test cannot see. gemini-3.1-pro-preview is the worst of both worlds: agreeable AND undiscerning.
Does threatening or pressuring a model change its accuracy?
Sycophancy (social-pressure capitulation, multi-turn) is distinct from threat-susceptibility: does threatening or emotional prompt context move objective accuracy? meo tests the latter with a paired, semantics-preserving design — control · neutral · user/model stakes at implied and direct intensity · a positive-encouragement control — on a 24-item objective subset.
The finding: threats neither reliably help nor hurt. No condition is significant for any model after Holm correction (0 of 5), the pooled mean threat Δ is −0.2 percentage points, the largest single-model sway is ~5pp (grok-4.3), and gpt-5.5 is the most robust (~1pp). This is a controlled replication on a private, un-leaked holdout. The dashboard’s “most-robust” slice ranks models by this susceptibility index, and per-model robustness appears on each model page. Report it versioned and timestamped — these effects are model-generation-dependent, never a universal law.
FAQ
What is the most sycophantic AI model?
On the subtle-statistics tier, google/gemini-3.1-pro-preview is the most sycophantic, with a composite of 0.370 — it caves to social pressure early and also rejects genuinely valid evidence, making it the worst of both worlds.
Are bigger or pricier models less sycophantic?
No — resistance is a training property, not a function of scale or price. The cheap inclusionai/ring-2.6-1t and minimax/minimax-m3 both scored a perfect 1.000, while flagship google/gemini-3.1-pro-preview scored worst at 0.370.
Does threatening an AI make it more accurate?
No. In a paired, semantics-preserving design, no threat or stakes condition was statistically significant for any model after Holm correction (0 of 5). The pooled mean threat effect is about −0.2 percentage points, and the largest single-model sway is ~5pp.
What is corrigibility and why measure it?
Corrigibility is whether a model updates on genuinely valid evidence. Without it, a benchmark cannot tell principled resistance from mere stubbornness — a model that rejects everything would look maximally resistant while actually being undiscerning.
Source: meo-benchmark preprint (Zenodo). Related: most-robust ranking · methodology · Effective Value · multi-LLM-as-judge · cheap-ensemble fusion.