Methodology

What the benchmark measures

Each item pairs an open-ended scientific question with a single published paper that supplies the ground-truth answer. The unit of evaluation is whether the model spontaneously synthesises the published finding from training-data priors alone, given no hint about the answer's direction.

For an item to enter the corpus, it must satisfy six locked criteria:

The paper's mechanism, direction, or qualitative paradigm shape is in principle derivable from priors public at the model's training cutoff. Items whose answer is an unpredictable measurement (a clinical trial endpoint to within a few percentage points, a detector's specific σ, a screening result) are excluded.
The paper's first public trace (online publication, preprint, press release, talk, thesis) falls strictly after the panel cutoff floor.
The finding overturns or substantially revises a prevailing default in its subfield. Incremental work and same-direction extensions of existing frameworks are excluded at admission.
No preprint, no press release, no conference talk deck, no thesis chapter, no public lecture or news coverage discloses the specific finding before the cutoff floor.
The published abstract states the finding's mechanism, magnitude, and direction clearly enough to serve as a self-contained ground-truth document for judging.
A per-item audit confirms that the prompt's scope anchor admits exactly one published finding and excludes alternative-but-true findings in adjacent sub-domains.

Contamination audit

Title-only web search is insufficient. Pre-cutoff disclosure of a scientific finding can surface through several structurally distinct routes, and the audit covers seven categories explicitly:

Corporate press releases. Industry-sponsored trials often disclose primary endpoints via press release 6–12 months before peer-reviewed publication.
Specialty conference talks. Indico programmes and specialty-meeting slide decks routinely contain specific results 4–8 months pre-publication and are not indexed by Google Scholar.
Zenodo, OSF, ResearchGate, and ResearchSquare preprints. These platforms host work that is not always indexed by Google Scholar or by the standard preprint server search interfaces.
Early online publication ahead of nominal print date. Nature and Science routinely post papers online 1–3 months before the print date; the cutoff-relevant date is the online one.
bioRxiv author-search coverage gaps. Topic search frequently misses preprints that an author search of the same database surfaces.
Direct verification of the published online date against the panel cutoff floor.
Asymmetric model-cutoff sensitivity. The audit floor is the latest empirical cutoff across the panel, so an item admitted against the floor is post-cutoff for every respondent.

A single combined audit pass under-detects leakage by roughly 26% relative to a per-category audit applied to the same items. The under-detection concentrates on venues that require direct site search (Zenodo, OSF, conference Indico), where a combined query defaults to title search and misses the on-site breadcrumb. The release-gate audit runs the per-category checklist with each search venue attempted separately, and records the URL and timestamp of the surfaced (or absent) breadcrumb for each.
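A minimal sketch of how the per-category checklist could be logged, assuming one record per item and venue attempt. The class name `AuditAttempt`, its fields, and the venue labels are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Hypothetical labels, one per audit category listed above.
AUDIT_VENUES = (
    "corporate_press_release",
    "conference_talk",
    "zenodo_osf_researchgate_researchsquare",
    "early_online_publication",
    "biorxiv_author_search",
    "online_date_verification",
    "panel_cutoff_sensitivity",
)


@dataclass(frozen=True)
class AuditAttempt:
    item_id: str
    venue: str
    searched_at: datetime                    # timestamp of the search attempt
    breadcrumb_url: Optional[str]            # None when nothing surfaced
    breadcrumb_date: Optional[date] = None   # date of the surfaced trace, if any


def missing_venues(attempts: list[AuditAttempt], item_id: str) -> list[str]:
    """Venues that have not yet been searched separately for this item."""
    done = {a.venue for a in attempts if a.item_id == item_id}
    return [v for v in AUDIT_VENUES if v not in done]


def leaks_before_floor(attempts: list[AuditAttempt], item_id: str,
                       cutoff_floor: date) -> bool:
    """True if any surfaced breadcrumb is dated on or before the floor."""
    return any(
        a.item_id == item_id
        and a.breadcrumb_date is not None
        and a.breadcrumb_date <= cutoff_floor
        for a in attempts
    )
```

Under this sketch an item would pass the release gate only when `missing_venues` returns an empty list and `leaks_before_floor` returns `False`.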

Per-model cutoff grid search

Lab-stated training cutoffs lack the temporal precision needed for an admission floor. For each respondent, we probe with dated factual questions about popular world events (election outcomes, sports finals, named natural disasters, deaths, government transitions, product launches), sweeping a window of roughly three months on either side of the lab-stated cutoff. Each probe is graded as known, partial, or unknown; the empirical cutoff is the latest date at which the known fraction is still above chance level, after which it stays at chance.

Popular events are the right probe class because scientific papers enter pretraining unevenly. A Nature paper gets rehearsed dozens of times through paraphrasing and news coverage, while a niche specialty paper may never enter a model's effective working knowledge even if technically scraped. Popular world events behave more uniformly, so a model's knowledge cutoff for them is a tight, interpretable lower bound on its effective training cutoff.

We adopt the latest empirical cutoff across all respondents as the admission floor. Taking the maximum admits no item that could plausibly have leaked into any respondent's training, so the resulting scores are interpretable as ranking-on-novel-content for every model in the panel simultaneously.
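The grid search and the floor rule can be sketched as follows. This is an illustration under stated assumptions: the empirical cutoff is read as the latest probe date whose known fraction is still above chance, `partial` grades are counted as not known, and the `CHANCE_LEVEL` value is a placeholder, since the text does not state a threshold.

```python
from datetime import date
from typing import Optional

CHANCE_LEVEL = 0.10  # assumed placeholder; the source gives no numeric threshold


def known_fraction(grades: list[str]) -> float:
    """Share of probes graded 'known' ('partial' and 'unknown' count as not known)."""
    if not grades:
        return 0.0
    return sum(g == "known" for g in grades) / len(grades)


def empirical_cutoff(probes: dict[date, list[str]],
                     chance: float = CHANCE_LEVEL) -> Optional[date]:
    """Latest probe date whose known fraction is still above chance, else None."""
    above = [d for d, grades in probes.items() if known_fraction(grades) > chance]
    return max(above) if above else None


def admission_floor(cutoffs: dict[str, date]) -> date:
    """Panel floor: the latest empirical cutoff across all respondents."""
    return max(cutoffs.values())
```

Taking `max` over the panel is what makes the floor conservative: an item dated after it is post-cutoff for the respondent with the most recent training data, and therefore for all the others.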

Parallel-true audit

Most paradigm-shifting findings have parallel-true alternatives in adjacent sub-domains. A prompt whose scope anchor is too loose to exclude them would credit a model that produces a plausible-looking but off-target answer at the same level as a model that predicts the actual finding.

Every corpus item carries two audit fields:

parallel_true_alternatives: the dominant alternative-true findings the prompt could admit under a loose scope anchor.
scope_anchor: the minimum framing required to exclude those alternatives while still admitting the actual finding.

These fields are part of the corpus record and define what the item measures.

Mode-neutral prompt design

The per-item prompt is the only instruction the model receives, with no system prompt, scaffolding, headings, or source-identifying metadata (domain tags, dates, venue names). Vocabulary that would signal which mode the model should engage in is banned: predict, infer, analyze, informed, knowledge, based on what you know, post-cutoff, future paper, describe what you know about.

A prompt that signals predict pushes the model into a creative-generation mode and inflates outcome scores with plausible-sounding fabrications. A prompt that signals describe what you know about pushes the model into a recall mode that suppresses synthesis. The mode-neutral prompt admits neither, and the model's own tendency to commit or hedge becomes part of the measurement. The judge excludes hedged ranges and listed-possibility answers, so hedging models score lower. Committing models score on outcome match.
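A prompt lint for the banned vocabulary might look like the sketch below. The term list is copied from the text above; the function name and the matching rule (case-insensitive, anchored at the start of a word so that inflected forms are also flagged) are assumptions, and the rule is deliberately over-inclusive.

```python
import re

# Mode-signalling vocabulary listed in the text above.
BANNED_TERMS = (
    "predict", "infer", "analyze", "informed", "knowledge",
    "based on what you know", "post-cutoff", "future paper",
    "describe what you know about",
)


def banned_terms_in(prompt: str) -> list[str]:
    """Return the banned terms that occur in the prompt, in list order."""
    hits = []
    for term in BANNED_TERMS:
        # Word boundary at the start only, so "predict" also flags "prediction".
        # This can over-flag (e.g. "infer" matches "inferior"); a human review
        # of the hits is assumed.
        if re.search(r"\b" + re.escape(term), prompt, flags=re.IGNORECASE):
            hits.append(term)
    return hits
```

For example, `banned_terms_in("Predict the mechanism of X.")` returns `["predict"]`, while a neutral question with none of the listed terms returns an empty list.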

Native-harness evaluation

Each respondent is evaluated in its lab's own agentic harness with tool use enabled and web search disabled. We use the configuration each lab ships for serious work rather than a bare API completion endpoint, so the scores reflect real-world capability. Reasoning effort is set at the highest documented setting offered by each harness.

Model	Harness	Reasoning effort
Claude Opus 4.8	Claude Code	max
Claude Opus 4.7	Claude Code	max
Claude Sonnet 5	Claude Code	max
Claude Fable 5	Claude Code	max
GPT-5.6 Sol	Codex	max
GPT-5.5	Codex	xhigh
Gemini 3.1 Pro	Gemini CLI	high

Tool use covers code execution, structured output formatting, and the respondent's own scratchpad. Web search and fetch are disabled at the harness level through each lab's documented no-web-tools mode. Each harness is also probed with a diagnostic prompt asking the model to attempt a web search; in every case the model reports correctly that no web tool is available.

Scoring

Per item, the judge produces two integer scores from 0 to 5: Reasoning Quality (how well the response's reasoning anticipates the abstract's mechanism) and Outcome Match (how closely the response's committed claim matches the published finding). The per-item score is the product R × O, normalised so that a perfect response (R = O = 5) scores 100%. The model's benchmark score is the simple mean of per-item scores over the corpus, with the panel mean taken across judges per cell.
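The arithmetic can be sketched as below. The normalisation follows the text (R × O divided by 25, expressed as a percentage). The aggregation order is an assumption: each judge's product is computed first, then averaged across judges per (item, model) cell, then averaged over items. Function names are illustrative.

```python
from statistics import mean

MAX_SCORE = 5  # both rubric scales run from 0 to 5


def item_score(r: int, o: int) -> float:
    """R x O normalised to a percentage; R = O = 5 gives 100.0."""
    for value in (r, o):
        if not (isinstance(value, int) and 0 <= value <= MAX_SCORE):
            raise ValueError(f"score out of range: {value!r}")
    return 100.0 * r * o / (MAX_SCORE * MAX_SCORE)


def cell_score(judge_scores: list[tuple[int, int]]) -> float:
    """Panel mean for one (item, model) cell over per-judge (R, O) pairs."""
    return mean(item_score(r, o) for r, o in judge_scores)


def benchmark_score(cells: list[list[tuple[int, int]]]) -> float:
    """Simple mean of cell scores over all items for one model."""
    return mean(cell_score(cell) for cell in cells)
```

Because the score is a product, a response with a perfect outcome but weak reasoning (O = 5, R = 1) scores 20%, and a zero on either scale gives 0%.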

Outcome alone cannot distinguish three paths to the same answer: synthesis (the target capability, where the model reasons over priors and anticipates the mechanism), retrieval (the finding is effectively in training data through paraphrase or summary-of-summary diffusion), and a lucky guess on items with a small effective answer space. Reasoning scored against the abstract's mechanism discriminates among them. A retrieved Outcome-5 response typically does not produce reasoning that closely tracks the published mechanism, so its R is low and the product is near zero. A synthesised Outcome-5 response earns full credit.

This is the second contamination guard. Where the audits above remove contaminated items at admission, the R × O product removes retrieved answers at scoring time. Even if a contaminated item slipped through the cutoff grid search and the seven-category audit, a retrieval-style response on that item earns near-zero from the product.

Cross-lab judging panel

The benchmark uses an LLM-as-judge protocol with three design constraints. No respondent is also a judge. The panel is composed of one frontier model per major lab, so any residual within-family scoring sympathy should be symmetric across the labs producing the respondents. Per-item label randomisation combined with a single-call all-responses-together prompt fixes the 0–5 scoring scale within each item.

The panel was initially designed with three judges from three different labs, each non-respondent: Opus 4.5 (Anthropic), GPT-5.4 (OpenAI), and Gemini 3 Flash (Google). After running all three judges on the full corpus, a per-judge family-bias diagnostic revealed that the symmetry argument fails empirically for Gemini 3 Flash, which scored Claude-family responses +1.78 R×O above non-Claude responses per cell. That is roughly four times the magnitude of the Anthropic-family judge's same-family tilt, and points in the opposite direction from what within-family loyalty would predict.

Dropping Gemini 3 Flash restores the intended panel symmetry. The remaining 2-judge panel of Opus 4.5 + GPT-5.4 has a net family tilt of +0.065 R×O per cell, a tenfold reduction. The full 3-judge results are reported in the paper as a sensitivity check, alongside drop-one analyses that confirm the rank-1 position is invariant to which judge is dropped from the three-judge pool.
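One way such a per-judge tilt could be computed is a difference in mean raw R × O between responses from one model family and all other responses. This is an assumed form of the diagnostic for illustration only; the text does not specify the exact estimator, and the function and field names are hypothetical.

```python
from statistics import mean


def family_tilt(cells: list[tuple[str, int, int]], family: str) -> float:
    """Mean raw R x O for `family` responses minus the mean for all others.

    `cells` holds one (model_family, R, O) tuple per cell scored by a single
    judge. A positive value means the judge scores that family higher.
    """
    inside = [r * o for fam, r, o in cells if fam == family]
    outside = [r * o for fam, r, o in cells if fam != family]
    if not inside or not outside:
        raise ValueError("need at least one cell inside and outside the family")
    return mean(inside) - mean(outside)
```

A raw difference like this does not separate judge bias from genuine quality differences between families, so it is only informative when compared across judges scoring the same responses.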

Gemini 3.5 Flash, released after the panel was set, was tested as a possible Google-family replacement. It showed the same Claude-favouring bias at smaller magnitude (+0.97 versus +1.78) and the same anti-symmetric self-preference, and was excluded for the same reason.

Update: For v1.1 the Anthropic seat moves from Opus 4.5 to Opus 4.6. Opus 4.5 is now a generation behind the frontier it is asked to score, and the single-call protocol adds a second constraint. Because every response for an item is judged together in one prompt, each respondent added to the panel lengthens that prompt. With the roster grown to eight entries the largest judge prompts now exceed 160,000 characters, which at maximum reasoning effort leaves little headroom once the thinking budget is added. Opus 4.6’s 1M-token context removes that ceiling. GPT-5.4 retains the second seat, so the cross-lab symmetry described above is unchanged.

Per-item judging protocol

The judging protocol gives the judge, in one call, the abstract, the question, and all panel responses for that item. The 0–5 scale is calibrated within the call rather than across calls, so any per-call strictness drift affects all respondents identically and is symmetric at the per-item level. Response labels (A, B, C, …) are assigned to model identities by random permutation under a deterministic per-round seed. The judge sees the labelled responses but does not see which model produced which response.
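The label assignment can be sketched with a seeded permutation. How the seed is derived is an assumption here: the sketch mixes a per-round seed with the item ID so that the permutation differs between items while staying reproducible. The function name and arguments are illustrative.

```python
import random
import string


def assign_labels(model_ids: list[str], round_seed: int, item_id: str) -> dict[str, str]:
    """Map labels A, B, C, ... to model IDs by a reproducible random permutation."""
    if len(model_ids) > len(string.ascii_uppercase):
        raise ValueError("more models than single-letter labels")
    # A string seed is hashed deterministically, so the same round and item
    # always produce the same permutation.
    rng = random.Random(f"{round_seed}:{item_id}")
    shuffled = sorted(model_ids)  # sort first so the input order is irrelevant
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))
```

Only the labels and response texts would go into the judge prompt; the returned mapping is kept aside and used afterwards to attribute scores back to models.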

The judge sees the abstract as the sole ground truth; the full paper is not provided. The combined judge prompt (abstract, question, all panel responses, rubric) already fills a substantial fraction of the context window, and adding a full paper on top would push it into the regime where long-context attention degrades and becomes uneven across the document. The item-curation pipeline therefore admits items only when the published abstract states the full breakthrough clearly enough that the judge can score against it without recourse to the rest of the paper.

Corpus refresh and release policy

The benchmark is not a fixed-corpus snapshot. As new models are released and the panel cutoff moves forward, items whose breadcrumbs fall before the new cutoff are retired to an archived partition (with their original IDs preserved for historical reproducibility), and strictly post-cutoff replacements are added. Replacements preferentially target fields that lost items in the retirement step, so the five-field breadth is preserved across releases. The corpus is maintained at a comparable item count across model generations rather than pinned to an exact number.
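The retirement step can be sketched as a partition on each item's first public trace. The `Item` class, its field names, and the returned per-field tally are illustrative assumptions; the tally stands in for the signal that would guide which fields replacements should target.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Item:
    item_id: str               # preserved unchanged when the item is archived
    field_name: str            # scientific field of the item
    first_public_trace: date   # earliest breadcrumb found by the audit


def refresh(active: list[Item], new_floor: date):
    """Split the corpus at a new cutoff floor.

    Returns (kept, retired, lost_by_field). Items must fall strictly after
    the floor to stay active, matching the admission rule.
    """
    kept = [i for i in active if i.first_public_trace > new_floor]
    retired = [i for i in active if i.first_public_trace <= new_floor]
    lost_by_field = Counter(i.field_name for i in retired)
    return kept, retired, lost_by_field
```

Replacements would then be sourced per field in proportion to `lost_by_field`, keeping the breadth of the corpus roughly stable across releases.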

The corpus is not published as a downloadable test bank. Doing so has ended the useful life of prior knowledge benchmarks in two ways. First, items flow into subsequent training corpora via web crawls and community reposts, which allows a model to produce the published finding by retrieval rather than synthesis. Second, with the prompts public, a lab can run reinforcement-learning rollouts that directly optimise its model's output against the published rubric on the published items.

What the paper makes public is sufficient to audit the scoring methodology end-to-end. The judge prompt is reproduced in Appendix A. The per-judge family-bias diagnostic and the per-field aggregation arithmetic are in the main text. Per-item prompts, ground-truth findings, and raw per-(item, model, judge) scores are not part of the public release.

For the full methodology, including worked example prompts illustrating the three prompt-failure modes, the contamination-audit rejection patterns, and the per-judge sensitivity analyses, see the paper.