What is GPQA Diamond? Definition, How It Works & Examples…

GPQA Diamond is a curated, high-difficulty subset of the Graduate-Level Google-Proof Q&A (GPQA) benchmark. It consists of multiple-choice questions in physics, chemistry, and biology that demand deep domain expertise and resist simple lookup strategies, which makes it one of the most stringent tests of scientific reasoning for large language models (LLMs). As of early 2026, GPQA Diamond continues to serve as a widely used discriminator for frontier AI systems. Top models now exceed 85% accuracy, up from the roughly 40% reported in early evaluations, yet questions requiring nuanced, multi-step reasoning still leave room for improvement.

What Exactly Is GPQA Diamond?

GPQA Diamond originated from the broader GPQA (Graduate-Level Google-Proof Q&A) benchmark, introduced in 2023 by researchers from New York University, Anthropic, and other institutions [1]. The main GPQA dataset contains 448 multiple-choice questions spanning physics, chemistry, and biology at a graduate-school (often PhD-competency) level. The “Diamond” subset is a high-confidence core of 198 questions on which both expert validators answered correctly and the majority of non-expert validators answered incorrectly, which gives it high label reliability. Crucially, these questions are Google-proof: skilled non-experts with unrestricted internet access, who spent more than 30 minutes per question on average, struggled to find answers via web search and achieved only 22.1% accuracy on the Diamond set [1]. This property forces LLMs to rely on internal knowledge and reasoning rather than simple retrieval.

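Every GPQA Diamond question is written by a domain expert and then checked by other experts in the same field. In the original study, experts who hold or are pursuing PhDs in the relevant domain reached 65% accuracy on the full question pool, or 74% after discounting clear mistakes they identified in retrospect [1], which is why a human-expert baseline of roughly 70% is commonly quoted. This baseline reflects genuine difficulty: experts themselves sometimes disagree. The benchmark’s deliberate construction makes it a powerful probe of deep scientific understanding, far beyond surface-level pattern recognition.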

How Does GPQA Diamond Work?

GPQA Diamond evaluates LLMs through zero-shot or few-shot prompting, often combined with chain-of-thought (CoT) reasoning to elicit step-by-step solutions. Each question is a multiple-choice problem with four options, covering topics such as quantum mechanics, organic chemistry reaction mechanisms, or molecular biology pathways. Model performance is measured by accuracy and sometimes also by Brier score or calibration error.
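As a rough illustration of the two most common metrics, the sketch below computes accuracy from predicted letters and a multiclass Brier score from per-option probabilities. The function names and the toy data are assumptions made for this example; they are not taken from the official evaluation scripts.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions where the predicted letter equals the gold letter."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def brier_score(probs: Sequence[dict[str, float]], gold: Sequence[str]) -> float:
    """Mean multiclass Brier score (0 is best).

    For each question, sums the squared error between the predicted
    probability of every option and the one-hot gold answer.
    """
    total = 0.0
    for dist, answer in zip(probs, gold):
        total += sum(
            (p - (1.0 if letter == answer else 0.0)) ** 2
            for letter, p in dist.items()
        )
    return total / len(gold)


# Toy data (hypothetical, not real benchmark results).
gold = ["A", "C", "B", "D"]
predicted = ["A", "C", "D", "D"]
print(accuracy(predicted, gold))  # 0.75

# A model that spreads probability evenly over four options scores 0.75.
uniform = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
print(brier_score([uniform], ["A"]))  # 0.75
```

A perfectly confident and correct prediction yields a Brier score of 0, so the metric rewards models that are both right and well calibrated.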

The underlying mechanism of the benchmark is rooted in its adversarial data pipeline:

Question Writing: Domain experts (e.g., PhD students or postdocs in physics, chemistry, biology) craft questions and answer choices. They are instructed to create problems that require substantial disciplinary knowledge and are not easily answerable by simply copying facts from Wikipedia or textbooks.
Google-Proofing: Each question is also given to skilled non-expert validators (three per question in the original study), who are experts in other fields and have unrestricted web access [1]. Questions that these non-experts can answer correctly are filtered out of the strictest subset, so the benchmark tests genuine understanding rather than search-engine savvy.
Expert Validation: Each question is answered by two independent expert validators, who also give feedback on its difficulty and clarity; the question writer can revise the question between the two rounds. Only questions that both experts answer correctly, and that most non-experts get wrong, become part of the Diamond set.
Difficulty Stratification: The resulting Diamond subset is thus filtered for both high difficulty and high label quality.
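The filtering idea behind the pipeline above can be sketched in a few lines. This is a simplification under stated assumptions: the `ValidatedQuestion` record and its field names are hypothetical, and the rule shown (every expert validator correct, most non-expert validators incorrect) omits the finer-grained exceptions described in the original paper.

```python
from dataclasses import dataclass


@dataclass
class ValidatedQuestion:
    # Hypothetical record; the real dataset uses different field names.
    question_id: str
    expert_correct: list[bool]      # one flag per expert validator
    non_expert_correct: list[bool]  # one flag per non-expert validator


def is_diamond_candidate(q: ValidatedQuestion) -> bool:
    """Keep a question only if every expert was right and most non-experts were wrong."""
    experts_agree = len(q.expert_correct) > 0 and all(q.expert_correct)
    non_expert_wrong = sum(not c for c in q.non_expert_correct)
    majority_wrong = non_expert_wrong > len(q.non_expert_correct) / 2
    return experts_agree and majority_wrong


pool = [
    ValidatedQuestion("q1", [True, True], [False, False, True]),
    ValidatedQuestion("q2", [True, False], [False, False, False]),
    ValidatedQuestion("q3", [True, True], [True, True, False]),
]
diamond_ids = [q.question_id for q in pool if is_diamond_candidate(q)]
print(diamond_ids)  # ['q1']
```

Here `q2` is dropped because one expert answered incorrectly, and `q3` is dropped because most non-experts found the answer, which suggests it is not Google-proof.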

When an LLM is tested on GPQA Diamond, it typically receives only the question text and answer options, with no external context. The option it selects is compared with the answer key, so a response counts as correct only on an exact match. Some evaluations average performance across multiple runs to account for sampling stochasticity. The benchmark can be run via the official GitHub repository, which provides the dataset and evaluation scripts [2].
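A minimal sketch of that loop follows: format the question, pull the final answer letter out of the model's response, and average accuracy across runs. The prompt wording, the "Answer: <letter>" convention, and all function names are assumptions for illustration; the official scripts may differ.

```python
import re
from statistics import mean
from typing import Optional

LETTERS = "ABCD"


def build_prompt(question: str, options: list[str]) -> str:
    """Format a four-option question as a zero-shot chain-of-thought prompt."""
    lines = [question, ""]
    lines += [f"({letter}) {text}" for letter, text in zip(LETTERS, options)]
    lines += ["", 'Think step by step, then finish with "Answer: <letter>".']
    return "\n".join(lines)


def extract_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None if absent."""
    matches = re.findall(r"Answer:\s*\(?([ABCD])\b", response)
    return matches[-1] if matches else None


def run_accuracy(responses: list[str], gold: list[str]) -> float:
    """Accuracy of one run; unparseable responses count as wrong."""
    hits = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


# Toy responses from two hypothetical runs over the same three questions.
gold = ["B", "C", "A"]
run_1 = ["Reasoning... Answer: B", "Reasoning... Answer: (C)", "No final line"]
run_2 = ["Answer: B", "Answer: D", "Answer: A"]
per_run = [run_accuracy(r, gold) for r in (run_1, run_2)]
print(per_run, mean(per_run))  # each run gets 2 of 3 questions right
```

Counting unparseable responses as wrong is a design choice; some harnesses instead re-prompt or fall back to a random guess, which changes the reported score slightly.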

How Does GPQA Diamond Differ from Other Benchmarks?

GPQA Diamond stands out from other popular AI benchmarks due to its triple focus on expert-level content, Google-proofing, and stringent validation. The table below contrasts it with several widely used counterparts.

Benchmark	Domain	Difficulty	Google-Proof?	Expert Validation?	Approx. Questions	Top LLM Acc. (2026)
GPQA Diamond	Science (physics, chemistry, biology)	Graduate/PhD	Yes	Yes (2 experts per question)	198	~85%
MMLU-Pro	14 broad disciplines	Undergraduate to professional	No	No (crowd-sourced)	12,032	~80%
HellaSwag	Commonsense reasoning	Everyday situations	No	No	10,003	near 100%
ARC-Challenge	Science (grade-school)	Grade-school	No	No	1,172	>95%
BIG-bench Hard	Diverse reasoning	Mixed, many hard	Partially	No	250 (subset)	~90%

MMLU-Pro, often considered a hard benchmark, tests a broader range of topics but lacks systematic Google-proofing and expert-level depth per question. HellaSwag and ARC-Challenge are now mostly saturated, with ceilings above 95%, limiting their discriminative power for frontier models. In contrast, GPQA Diamond remains a challenging discriminator because it requires genuine disciplinary expertise and is resistant to shallow heuristics.

What Are the Key Features and Variants of GPQA?

GPQA comprises several subsets and configuration options:

GPQA Main Set: All 448 questions across physics (187), chemistry (183), and biology (78). This set includes some questions with lower inter-expert agreement.
GPQA Diamond: The high-quality 198-question core, recommended for rigorous evaluations due to its reliability.
GPQA Extended: The full collection of 546 questions, a superset of both the Main Set and the Diamond subset. It offers more data but lower confidence, because it keeps questions that the stricter subsets filter out.
Domain-specific subsets: Evaluators can isolate performance in physics, chemistry, or biology to identify LLM strengths and weaknesses.
GPQA Self-consistency: An evaluation setting rather than a separate dataset, in which multiple CoT responses to the same question are aggregated via majority voting, often boosting accuracy by several percentage points.
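Majority voting over sampled answers is simple to sketch. The helper below is an illustrative assumption, not code from the benchmark repository; it takes the answer letters extracted from several sampled responses to one question and returns the most frequent one.

```python
from collections import Counter
from typing import Optional


def majority_vote(sampled_answers: list[Optional[str]]) -> Optional[str]:
    """Return the most frequent answer letter, ignoring unparseable samples (None)."""
    votes = Counter(a for a in sampled_answers if a is not None)
    if not votes:
        return None  # no sample produced a usable answer
    return votes.most_common(1)[0][0]


# Five hypothetical samples for one question; one could not be parsed.
print(majority_vote(["B", "C", "B", None, "B"]))  # B
```

Tie handling is left to `Counter.most_common` here; a production harness would want an explicit tie-break rule so that results are reproducible.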

These features allow granular analysis. For example, a model might perform brilliantly on biology but struggle with physics, revealing imbalances in its training data.
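A per-domain breakdown of this kind needs only the domain label and a correctness flag for each question. The sketch below uses made-up results and a hypothetical `accuracy_by_domain` helper to show the idea.

```python
from collections import defaultdict


def accuracy_by_domain(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each domain to the fraction of its questions answered correctly."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for domain, is_correct in results:
        total[domain] += 1
        correct[domain] += int(is_correct)
    return {domain: correct[domain] / total[domain] for domain in total}


# Made-up per-question outcomes, not real benchmark results.
results = [
    ("physics", True), ("physics", False), ("physics", False), ("physics", True),
    ("chemistry", True), ("chemistry", True), ("chemistry", False), ("chemistry", True),
    ("biology", True), ("biology", True),
]
print(accuracy_by_domain(results))
# {'physics': 0.5, 'chemistry': 0.75, 'biology': 1.0}
```

Because the domains differ in size, a per-domain score built on only a handful of questions is far noisier than the overall score and should be read with that in mind.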

What Are Real-World Examples of GPQA Diamond in Use?

Since its release, GPQA Diamond has been adopted as a key metric by major AI labs and independent research groups:

OpenAI: In the original GPQA paper, the strongest GPT-4 baseline reached about 39% accuracy [1]. OpenAI’s later reasoning models highlighted GPQA Diamond in their release materials, with o1 reported at roughly 78% in 2024, driving the narrative of rapid reasoning improvement.
Anthropic: Claude 2 scored 38% on Diamond, while Claude 3.5 Sonnet improved to roughly 60–65%, and subsequent versions inched higher.
Google DeepMind: Gemini Ultra and later models have been evaluated on GPQA Diamond, with some configurations passing 70% by early 2025.
Independent benchmarks: Public leaderboards such as Hugging Face’s Open LLM Leaderboard have included GPQA as a high-difficulty measure, often reporting scores for open-weight models like Llama 3.1 405B (around 50%) and Mixtral 8x22B (around 55%).

These examples underscore the benchmark’s role as a progress tracker for advanced scientific reasoning. As of 2026, scores above 85% are common only among the most capable proprietary systems, while open-weight models lag 10–15 points behind.

What Are the Practical Use Cases of GPQA Diamond?

GPQA Diamond is not merely an academic exercise—it has practical implications for AI development and deployment:

Model Selection for Scientific Tasks: Organizations building AI assistants for research scientists can use GPQA Diamond to filter candidate models. A high score is one signal, though not a guarantee, of reliable performance in literature analysis, experiment design, and hypothesis generation.
Benchmarking Progress: AI safety researchers monitor GPQA Diamond to detect whether models are approaching dangerous capabilities in science, such as the ability to synthesize novel compounds or pathogens.
Training Data Curation: Poor performance on a specific domain (e.g., quantum chemistry) can signal gaps in pretraining or fine-tuning data, guiding data acquisition.
Prompting and Fine-tuning Optimization: The benchmark helps compare strategies like chain-of-thought, self-consistency decoding, and retrieval-augmented generation to see which truly moves the needle on hard problems.
Educational Tools: GPQA Diamond questions are used in competitions and training programs to challenge graduate students and test the boundaries of human-AI collaboration.

What Are the Benefits and Limitations of GPQA Diamond?

Benefits:

High Discriminative Power: Unlike saturated benchmarks, GPQA Diamond reliably distinguishes among frontier models, with accuracy spanning from ~50% (mid-tier) to >85% (best).
Resistance to Shallow Retrieval: The Google-proof design means that even a model with web access cannot trivially search for the answer. It does not by itself protect against memorization if the questions leak into training data.
Expert-Centric Validity: With multiple expert validators per question, the benchmark avoids the noise of crowd-sourced labels common in other datasets.
Domain Depth: Questions require multi-step reasoning and deep conceptual understanding, making it a genuine test of scientific reasoning rather than pattern matching.
Actionable Feedback: Breaking down scores by domain reveals specific knowledge gaps.

Limitations:

Scale: At 198 questions, GPQA Diamond is relatively small, leading to higher variance in evaluation. A 5-percentage-point difference between two models may not be statistically significant without careful bootstrap analysis.
Domain Narrowness: Covering only physics, chemistry, and biology, it ignores expert-level reasoning in fields like mathematics, computer science, or engineering.
Potential for Overfitting: As with any fixed dataset, models may eventually be fine-tuned specifically on GPQA Diamond, inflating scores without genuine improvement. New, secret variants (e.g., “GPQA-2”) may be needed.
Human Expert Ceiling: The 70% human-expert accuracy implies that even domain experts find many questions ambiguous or too hard; an LLM exceeding 70% might be interpreted as “superhuman” when it may instead be exploiting subtle biases. This ambiguity complicates absolute performance claims.
Cost and Accessibility: Closed-source models require API calls to evaluate, making reproducible comparisons expensive and sometimes throttled. Open-weight models need computational resources for multiple runs.
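To put the sample-size limitation in numbers, the sketch below computes the binomial standard error of an accuracy estimate on 198 questions and a percentile bootstrap confidence interval. The helper names, the resample count, and the example of 168 correct answers are assumptions chosen for illustration.

```python
import math
import random


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def bootstrap_ci(correct: list[bool], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


n = 198
se = binomial_standard_error(0.85, n)
print(f"standard error at 85%: {se:.3f}")             # 0.025
print(f"95% interval half-width: {1.96 * se:.3f}")    # 0.050

# Hypothetical run: 168 of 198 questions correct (about 84.8%).
outcomes = [True] * 168 + [False] * 30
print(bootstrap_ci(outcomes))
```

With a 95% interval of roughly plus or minus 5 percentage points around a score near 85%, two models that differ by a few points on a single run cannot be reliably ranked.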

Despite these drawbacks, GPQA Diamond remains a cornerstone of AI benchmarking in 2026, complemented by newer expert-level benchmarks to round out its blind spots.

Frequently Asked Questions

1. Why is it called “GPQA Diamond”?

The name GPQA stands for Graduate-Level Google-Proof Q&A. The “Diamond” suffix marks the subset with the highest quality and reliability: both expert validators answered each question correctly, while most non-expert validators did not. The name evokes a diamond’s clarity.

2. Can an AI just search the web to answer GPQA Diamond questions?

Not easily. The benchmark was explicitly designed to thwart web search. In the original study, non-expert humans with unrestricted internet access scored only 22.1% on the Diamond set, below the 25% expected from random guessing among four options. The original paper also found that giving GPT-4 search access produced little improvement over closed-book prompting [1], which supports the view that the questions require reasoning and knowledge rather than simple retrieval [3].

3. Is GPQA Diamond the hardest AI benchmark in 2026?

It is among the hardest widely used benchmarks for scientific reasoning, but several specialized benchmarks in mathematics (e.g., FrontierMath) or coding (e.g., SWE-bench Verified) pose comparable challenges. The “hardest” depends on the domain; GPQA Diamond is arguably the toughest for multi-disciplinary graduate science.

4. Does a high GPQA Diamond score mean an AI is as smart as a PhD?

Not necessarily. A high score (e.g., 85%) indicates strong knowledge and reasoning on well-posed multiple-choice science questions, but it does not guarantee the creative problem-solving, experimental design skills, or research intuition of a PhD. The benchmark evaluates a narrow slice of expert competency.

5. How can I test my own model on GPQA Diamond?

The dataset and evaluation code are publicly available on GitHub [2], and the dataset is also hosted on Hugging Face. To reduce the risk of the questions leaking into training corpora, the data is distributed in a password-protected archive on GitHub and behind an access agreement on Hugging Face, and the authors ask that examples not be posted in plain text online. The answer key ships with the dataset, so evaluation is run locally: download the questions, prompt your model, and compare its chosen option with the key.
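A local evaluation loop might look like the sketch below. It assumes each dataset row exposes the columns "Question", "Correct Answer", and "Incorrect Answer 1" through "Incorrect Answer 3"; verify these names against the copy you download. The `answer_fn` callable stands in for your own model call and is hypothetical, as is the placeholder row.

```python
import random
from typing import Callable

LETTERS = "ABCD"


def row_to_item(row: dict[str, str], rng: random.Random) -> dict:
    """Shuffle the four answers so the correct one is not always in the same slot."""
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    rng.shuffle(options)
    gold = LETTERS[options.index(row["Correct Answer"])]
    return {"question": row["Question"], "options": options, "gold": gold}


def evaluate(rows: list[dict[str, str]],
             answer_fn: Callable[[str, list[str]], str],
             seed: int = 0) -> float:
    """Accuracy of answer_fn, which maps (question, options) to a letter A-D."""
    rng = random.Random(seed)
    items = [row_to_item(row, rng) for row in rows]
    hits = sum(answer_fn(it["question"], it["options"]) == it["gold"] for it in items)
    return hits / len(items)


# Placeholder row (not a real benchmark question) and a stub "model".
rows = [{
    "Question": "Placeholder question text",
    "Correct Answer": "right",
    "Incorrect Answer 1": "wrong 1",
    "Incorrect Answer 2": "wrong 2",
    "Incorrect Answer 3": "wrong 3",
}]


def oracle(question: str, options: list[str]) -> str:
    """Stub that always picks the option labelled 'right'; replace with a model call."""
    return LETTERS[options.index("right")]


print(evaluate(rows, oracle))  # 1.0
```

Fixing the shuffle seed keeps the option order identical across models, which makes comparisons fairer than reshuffling on every run.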

6. What is the future of GPQA Diamond after 2026?

As frontier models asymptotically approach the human-expert ceiling, the community is developing new iterations, such as GPQA-2, with even harder questions and broader domain coverage. These successors aim to maintain the benchmark’s discriminative power for the next generation of LLMs [4].
