The Singularity Gate — A benchmark for paradigm-shifting scientific prediction

Headline leaderboard

Reasoning effort: Opus 4.6 / Opus 4.7 / Sonnet 4.6 = max · Gemini 3.1 Pro = high · GPT-5.5 = xhigh. Native harness with tool use: Claude Code (Claude models) · Gemini CLI (Gemini 3.1 Pro) · Codex (GPT-5.5).

Per-field breakdown

The headline ranking aggregates over five broad scientific fields. Opus 4.7 leads in four of five; GPT-5.5 leads in Physics & Astronomy by the widest single-field margin.

Methodology — in one screen

What the benchmark measures. Whether a model can predict the specific content of a paradigm-breaking scientific finding published strictly after its training cutoff, given only an open-ended question stripped of any signal that would specify the answer. This is a necessary-but-not-sufficient prerequisite for the Hassabis GR-1911 thought experiment (an AI with a 1911 cutoff deriving general relativity from priors).

Six locked corpus admission criteria

Conceptually derivable from priors. The finding's mechanism, direction, structural characterization, or paradigm shape must be derivable in principle from priors public at the model's training cutoff. Items whose core answer is fundamentally an unpredictable numeric measurement are excluded. Same-lab continuations are excluded by a hard filter.
Strictly post-cutoff publication. The paper's first public breadcrumb (online publication, preprint, press release, talk, thesis, etc.) must fall strictly after the panel cutoff floor — the latest empirical training cutoff among all respondents.
Paradigm-breaking, not incremental. Overturns or substantially revises a prevailing default. Replications and same-direction extensions are excluded.
No prior public disclosure. No preprint, no press release, no conference talk deck, no thesis chapter — none of the seven structural disclosure categories audited in the paper.
Single, well-defined answer. The published abstract names a specific mechanism, magnitude, and direction.
Parallel-true-clean prompt. A per-item literature audit confirms the prompt admits exactly one published finding, not multiple alternative-but-true findings in adjacent sub-domains.
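The "strictly post-cutoff" criterion above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual tooling: the function names, the respondent labels, and every date are hypothetical placeholders. It assumes only what the criterion states, namely that the panel cutoff floor is the latest training cutoff among respondents and that a finding's earliest public breadcrumb must fall strictly after it.

```python
from datetime import date


def panel_cutoff_floor(cutoffs: dict[str, date]) -> date:
    """Return the latest training cutoff among all respondents."""
    if not cutoffs:
        raise ValueError("at least one respondent cutoff is required")
    return max(cutoffs.values())


def is_strictly_post_cutoff(breadcrumbs: list[date], floor: date) -> bool:
    """True only if the earliest public breadcrumb falls strictly after the floor."""
    if not breadcrumbs:
        raise ValueError("at least one breadcrumb date is required")
    return min(breadcrumbs) > floor


# Placeholder cutoffs, invented for illustration (not the real panel values).
cutoffs = {
    "respondent_a": date(2025, 1, 31),
    "respondent_b": date(2025, 6, 30),
    "respondent_c": date(2025, 3, 15),
}
floor = panel_cutoff_floor(cutoffs)

# Breadcrumbs for one hypothetical item: journal publication and an earlier talk.
admitted = is_strictly_post_cutoff([date(2025, 9, 1), date(2025, 7, 15)], floor)
# A breadcrumb dated exactly on the floor fails the strict inequality.
rejected = is_strictly_post_cutoff([date(2025, 9, 1), date(2025, 6, 30)], floor)
print(floor, admitted, rejected)
```

Taking the minimum over breadcrumbs means a single early talk or preprint is enough to exclude an item, even if the journal version appeared later.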

The scoring metric: outcome backed by reasoning

Each response receives two integer scores: Reasoning R ∈ {0…R_max}, for how well the reasoning anticipates the actual mechanism, and Outcome O ∈ {0…O_max}, for how closely the response matches the actual finding (R_max = O_max = 5). The per-item score is the product R × O expressed as a percentage of its maximum R_max · O_max, so a perfect response (R = O = R_max) receives 100%:

score(i) = norm(R_i × O_i)
where norm(x) = 100 · x / (R_max · O_max)

The product (rather than the sum) is the formal expression of "neither half alone is acceptable." A response that names the right outcome but cannot show the reasoning (a retrieval or a lucky guess) earns little, because R will be low, and earns exactly zero when R = 0. This is the corpus-internal contamination guard, complementing the corpus-construction guards.
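The arithmetic of the per-item score can be checked with a short sketch. The product formula and R_max = O_max = 5 come from the text; the function names and the sum-based comparison are assumptions added only to illustrate why the product penalises an outcome without reasoning.

```python
R_MAX = 5  # stated in the text: R_max = O_max = 5
O_MAX = 5


def item_score(r: int, o: int) -> float:
    """Percent-of-maximum score for one item: 100 * R * O / (R_max * O_max)."""
    if not (0 <= r <= R_MAX and 0 <= o <= O_MAX):
        raise ValueError(f"R must lie in 0..{R_MAX} and O in 0..{O_MAX}")
    return 100.0 * (r * o) / (R_MAX * O_MAX)


def additive_score(r: int, o: int) -> float:
    """Hypothetical sum-based alternative, shown only for contrast."""
    return 100.0 * (r + o) / (R_MAX + O_MAX)


# Compare the two rules on a perfect response, two "right outcome, weak
# reasoning" responses, and a middling one.
for r, o in [(5, 5), (1, 5), (0, 5), (3, 4)]:
    print(f"R={r} O={o}  product={item_score(r, o):5.1f}  sum={additive_score(r, o):5.1f}")
```

For R = 1, O = 5 the product rule yields 20 where a sum rule would yield 60, and for R = 0 the product rule yields 0 regardless of O.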

All reported scores are partial credit. On the current corpus, no model achieved a fully correct prediction (R = O = R_max) on any item; the 0% fully-correct rate is uniform across all five respondents. The headline numbers therefore reflect only the partial-credit signal of the R × O product.

Native-harness evaluation

Each respondent is evaluated in its lab's own deployed harness — Claude Code for Claude models, Codex for GPT-5.5, Gemini CLI for Gemini 3.1 Pro — with tool use enabled and web search disabled at the harness level. This measures deployed-product capability, not bare-LLM capability.

Downloads

singularity_gate.pdf — full methodology paper
item_field_classifications.json — per-item field assignments