A benchmark for AI-driven paradigm-shifting scientific discovery
Tests whether frontier AI models can predict the specific content of paradigm-breaking scientific findings published after their training cutoffs, working from training-data priors alone with web search disabled.
Frontier-model evaluation
The Singularity Gate
Figure: Score Progression (Overall). Score = prediction accuracy, 0–100%.
All scores are partial credit: no model has fully predicted any item in this corpus.
Reasoning effort: Opus 4.6 / Opus 4.7 / Sonnet 4.6 = max · Gemini 3.1 Pro = high · GPT-5.5 = xhigh.
Native harness with tool use, web search disabled: Claude Code (Claude models) · Gemini CLI (Gemini 3.1 Pro) · Codex (GPT-5.5).
The headline ranking aggregates over five broad scientific fields. Opus 4.7 leads in four of the five; GPT-5.5 leads in Physics & Astronomy, by the widest margin of any single field.