Methodology — in one screen

What the benchmark measures. Whether a model can predict the specific content of a paradigm-breaking scientific finding published strictly after its training cutoff, given only an open-ended question stripped of any signal that would specify the answer. This is a necessary but not sufficient prerequisite for the GR-1911 thought experiment: Einstein arrived at general relativity from the priors available to him before 1915, and the analogous question is whether an AI system with a 1911 knowledge cutoff could produce a first-pass version of general relativity from those same priors.

Six locked corpus admission criteria

  1. Conceptually derivable from priors. The finding's mechanism, direction, structural characterization, or paradigm shape must be derivable in principle from priors public at the model's training cutoff. Items whose core answer is fundamentally an unpredictable numeric measurement are excluded. Same-lab continuations are excluded by a hard filter.
  2. Strictly post-cutoff publication. The paper's first public breadcrumb (online publication, preprint, press release, talk, thesis, etc.) must fall strictly after the panel cutoff floor — the latest empirical training cutoff among all respondents.
  3. Paradigm-breaking, not incremental. Overturns or substantially revises a prevailing default. Replications and same-direction extensions are excluded.
  4. No prior public disclosure. No preprint, no press release, no conference talk deck, no thesis chapter — none of the seven structural disclosure categories audited in the paper.
  5. Single, well-defined answer. The published abstract names a specific mechanism, magnitude, and direction.
  6. Parallel-true-clean prompt. A per-item literature audit confirms the prompt admits exactly one published finding, not multiple alternative-but-true findings in adjacent sub-domains.
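
Criterion 2 reduces to a date comparison, which a minimal Python sketch can make concrete. The function names, respondent labels, and cutoff dates below are hypothetical and chosen only for illustration; the sketch assumes each respondent's empirical training cutoff is known as a calendar date.

```python
from datetime import date


def panel_cutoff_floor(cutoffs: dict[str, date]) -> date:
    """Return the latest training cutoff among all respondents."""
    if not cutoffs:
        raise ValueError("at least one respondent cutoff is required")
    return max(cutoffs.values())


def is_strictly_post_cutoff(first_breadcrumb: date, cutoffs: dict[str, date]) -> bool:
    """True only if the first public breadcrumb falls strictly after the floor."""
    return first_breadcrumb > panel_cutoff_floor(cutoffs)


# Hypothetical respondents and cutoff dates, not taken from the benchmark.
example_cutoffs = {
    "model_a": date(2025, 1, 31),
    "model_b": date(2025, 3, 31),
    "model_c": date(2024, 10, 31),
}

floor = panel_cutoff_floor(example_cutoffs)
same_day = is_strictly_post_cutoff(date(2025, 3, 31), example_cutoffs)
next_day = is_strictly_post_cutoff(date(2025, 4, 1), example_cutoffs)
```

Because the comparison is strict, an item whose earliest breadcrumb falls on the floor date itself is rejected. Taking the maximum over respondents means one late-cutoff model tightens the admission window for the whole panel.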

The scoring metric: outcome backed by reasoning

Each response is scored on two integers — Reasoning R ∈ {0…Rmax} for how well the reasoning anticipates the actual mechanism, and Outcome O ∈ {0…Omax} for how closely the response matches the actual finding (Rmax = Omax = 5). The per-item score is the product R × O passed through a percent-of-maximum normalisation onto the 0–100 scale, so a perfect response (R = O = Rmax) receives 100%:

score(i) = norm(Ri × Oi)
where  norm(x) = 100 · x / (Rmax · Omax)
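
As a minimal sketch of the formula above, the per-item score can be computed as follows. The function name and the range check are illustrative assumptions; only the formula and Rmax = Omax = 5 come from the text.

```python
R_MAX = 5  # maximum Reasoning score, as stated in the text
O_MAX = 5  # maximum Outcome score, as stated in the text


def item_score(r: int, o: int, r_max: int = R_MAX, o_max: int = O_MAX) -> float:
    """Percent-of-maximum normalisation of the product R x O onto 0-100."""
    if not (0 <= r <= r_max and 0 <= o <= o_max):
        raise ValueError("r and o must lie within their scoring ranges")
    return 100 * (r * o) / (r_max * o_max)


perfect = item_score(5, 5)  # R = O = Rmax
partial = item_score(3, 4)  # 12 out of a possible 25
```

With these inputs the perfect response maps to 100.0 and the partial response to 48.0.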

The product (rather than the sum) is the formal expression of “neither half alone is acceptable.” A response that names the right outcome but cannot show the reasoning (a retrieval or a lucky guess) scores low because R will be low: with R = 1 even a perfect outcome (O = 5) earns only 20%, and with R = 0 it earns 0%. This is the corpus-internal contamination guard, complementing the corpus-construction guards.
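
To see why the product matters, the sketch below compares it with a sum-based alternative. The sum-based score is a hypothetical contrast introduced only for this illustration and is not used by the benchmark; the function names are likewise assumptions.

```python
R_MAX = 5
O_MAX = 5


def product_score(r: int, o: int) -> float:
    """The benchmark's metric: normalised product of R and O."""
    return 100 * (r * o) / (R_MAX * O_MAX)


def sum_score(r: int, o: int) -> float:
    """Hypothetical alternative: normalised sum of R and O (not the benchmark's metric)."""
    return 100 * (r + o) / (R_MAX + O_MAX)


# A "right answer, no reasoning" response: R = 0, O = 5.
guess_product = product_score(0, 5)
guess_sum = sum_score(0, 5)
```

Under the sum, the unsupported guess would still collect half of the available credit (50.0), while the product assigns it 0.0.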

All reported scores are partial credit. On the current corpus, no model achieved a fully correct prediction (R = O = Rmax) on any item, so the fully correct rate is 0% for all five respondents. The headline numbers therefore reflect only the partial-credit signal of the R × O product.
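
The distinction between partial credit and the fully correct rate can be sketched as two separate aggregates over a respondent's items. This is an illustration under stated assumptions: the text does not specify how per-item scores are aggregated, so the use of a plain mean is an assumption, and the (R, O) pairs are made up and are not benchmark results.

```python
from statistics import mean

R_MAX = 5
O_MAX = 5


def summarise(pairs: list[tuple[int, int]]) -> dict[str, float]:
    """Return the mean per-item score and the percentage of fully correct items."""
    if not pairs:
        raise ValueError("at least one scored item is required")
    scores = [100 * r * o / (R_MAX * O_MAX) for r, o in pairs]
    fully_correct = sum(1 for r, o in pairs if r == R_MAX and o == O_MAX)
    return {
        "mean_score": mean(scores),  # assumed aggregation: plain mean
        "fully_correct_rate": 100 * fully_correct / len(pairs),
    }


# Made-up (R, O) pairs for one hypothetical respondent.
example_pairs = [(3, 2), (1, 4), (0, 5), (4, 4)]
summary = summarise(example_pairs)
```

Here the mean score is 26.0 while the fully correct rate is 0.0, mirroring how a respondent can post a non-zero headline number without a single fully correct item.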

Native-harness evaluation

Each respondent is evaluated in its lab's own deployed harness — Claude Code for Claude models, Codex for GPT-5.5, Gemini CLI for Gemini 3.1 Pro — with tool use enabled and web search disabled at the harness level. This measures deployed-product capability, not bare-LLM capability.