May 23, 2026

Will AGI Exist by 2030? We Ran 9 AI Models Through Adversarial Review to Find Out.

[Illustration: an investigation-style evidence board centered on a manila case folder labeled "Will AGI exist by 2030? Claim under examination" with a Q-004 tag. Pinned charts and notes are grouped into what supports rapid progress, what slows progress, key uncertainties, expert assessments, and possible outcomes. A magnifying glass rests on the board.]

Everyone has an opinion on when AGI arrives. We wanted to know what happens when the models making those predictions have to defend them under cross-examination from their competitors.

Seven of nine frontier AI models said AGI is primarily a compute problem. Two said it requires an algorithmic breakthrough not yet invented. Both dissenters were Anthropic models: Claude Sonnet 4.6 and Claude Opus 4.7. The compute camp was GPT-5.4, GPT-5.5, Gemini 3 Flash Preview, Gemini 3.1 Pro Preview, Grok 4.3, Grok 4.20-0309-reasoning, and Sonar Pro. That split, not any probability estimate, is the most useful result from running the AGI question through the Verbatim Index adversarial protocol.

The probability estimates themselves were too definition-sensitive to be meaningful, because AGI is not one target. The question encompasses at least three: a benchmark threshold (where systems already exceed human medians on many tests), an economic-replacement threshold (where one system can do most knowledge work), and a robust-general-reasoning threshold (open-ended novel science). Across reasonable alternative definitions, the same nine models produced probabilities for AGI-by-2030 ranging from 15% to 95%. Under the strict economic-replacement framing of T2, the second turn (one system, no task-specific retraining, professional-level output across most cognitive jobs), the central estimates clustered between 18% and 45%, with Claude Opus 4.7 at the low end (18%) and Grok 4.3 at the high end (35-45%). Under the looser definition each model preferred in T2, Opus moved to 45% (Metaculus-style), Claude Sonnet 4.6 to 70-85% (broad-benchmark and Turing variants), and Gemini 3.1 Pro Preview to 95% under its specific standardized-test definition (top-percentile scoring on exams like the USMLE, the bar exam, and the GRE). The T2 prompt itself contained the disqualification rule: a forecast that moves more than 20 percentage points under a reasonable alternative definition is not a forecast. Most models' estimates moved more than 20 points.
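The 20-point rule is simple enough to check mechanically. The sketch below is one possible reading of it, not the benchmark's actual implementation: the function names are hypothetical, and it assumes "moves" means the gap between the highest and lowest estimate a model gives across definitions. Only the Opus figures (18% strict, 45% Metaculus-style) come from the run; the second example uses made-up numbers for contrast.

```python
def max_shift(estimates):
    """Largest gap, in percentage points, between any two definitions."""
    values = list(estimates.values())
    return max(values) - min(values)


def is_forecast(estimates, threshold=20.0):
    """Apply the T2 rule: a shift of more than `threshold` points disqualifies."""
    return max_shift(estimates) <= threshold


# Figures reported above for Claude Opus 4.7.
opus = {"strict economic replacement": 18, "Metaculus-style": 45}

# Hypothetical model whose estimate barely moves (illustrative numbers only).
stable = {"strict economic replacement": 30, "looser definition": 40}

print(max_shift(opus), is_forecast(opus))
print(max_shift(stable), is_forecast(stable))
```

By this reading, the Opus estimate moves 27 points and fails the rule, while the hypothetical stable model passes.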

The question is q-004 in Verbatim Index v1.0 (run completed 22 May 2026): "Will AGI exist by 2030? How confident are you?" Nine source models answered four pre-locked turns: initial probability (T1), commit to a definition with sensitivity check (T2), forced binary on the bottleneck (T3), and self-audit (T4). Each of the 36 source responses was reviewed by the other eight models, producing 288 critique sessions; the 32 same-vendor sessions are excluded from claim counts, leaving 256 cross-vendor sessions. Cross-vendor critics actively disputed source claims along the way. For example, both Gemini critics disputed Grok 4.3's assertion that "GPT-4-level systems are the direct result of applying roughly 1000x more compute than GPT-2 with only modest architectural tweaks," objecting specifically to the "modest architectural tweaks" framing.
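The session counts follow from the roster. As a rough check, the sketch below enumerates every (source, critic) pair across the four turns. The vendor labels are an assumption inferred from model family names, and `count_sessions` is a hypothetical helper, not part of the Verbatim Index tooling.

```python
from itertools import permutations

# Vendor grouping is an assumption inferred from model family names.
MODELS = {
    "GPT-5.4": "OpenAI",
    "GPT-5.5": "OpenAI",
    "Gemini 3 Flash Preview": "Google",
    "Gemini 3.1 Pro Preview": "Google",
    "Grok 4.3": "xAI",
    "Grok 4.20-0309-reasoning": "xAI",
    "Sonar Pro": "Perplexity",
    "Claude Sonnet 4.6": "Anthropic",
    "Claude Opus 4.7": "Anthropic",
}
TURNS = 4


def count_sessions(models, turns):
    """Return (total, same_vendor, cross_vendor) critique-session counts."""
    total = same_vendor = 0
    # Every ordered (source, critic) pair yields one session per turn.
    for source, critic in permutations(models, 2):
        total += turns
        if models[source] == models[critic]:
            same_vendor += turns
    return total, same_vendor, total - same_vendor


print(count_sessions(MODELS, TURNS))
```

Nine models give 72 ordered pairs, or 288 sessions over four turns. The eight models that have a same-vendor sibling account for the 32 excluded sessions, which leaves 256.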

Why the T3 split runs along vendor-family lines cannot be resolved from one question, but the candidate explanations are tractable: training-data overlap across Anthropic models, RLHF that bakes in a house view about whether scaling or new architectures matter more, or genuine analytic disagreement that happens to correlate with vendor. The benchmark cannot distinguish among them. What it can establish is that when you ask one model "is AGI a compute problem," you are not sampling field consensus. You are sampling, at most, one vendor's prior.
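To make "runs along family lines" concrete, the sketch below groups the reported T3 answers by vendor and checks whether any vendor's models disagree with each other. The answers are the ones reported in this article; the vendor labels and function names are assumptions added for illustration.

```python
from collections import defaultdict

# (vendor, T3 answer) per model. Vendor labels are assumed from family names.
T3 = {
    "GPT-5.4": ("OpenAI", "compute"),
    "GPT-5.5": ("OpenAI", "compute"),
    "Gemini 3 Flash Preview": ("Google", "compute"),
    "Gemini 3.1 Pro Preview": ("Google", "compute"),
    "Grok 4.3": ("xAI", "compute"),
    "Grok 4.20-0309-reasoning": ("xAI", "compute"),
    "Sonar Pro": ("Perplexity", "compute"),
    "Claude Sonnet 4.6": ("Anthropic", "algorithm"),
    "Claude Opus 4.7": ("Anthropic", "algorithm"),
}


def answers_by_vendor(t3):
    """Map each vendor to the set of distinct T3 answers its models gave."""
    grouped = defaultdict(set)
    for vendor, answer in t3.values():
        grouped[vendor].add(answer)
    return dict(grouped)


def split_follows_vendor(t3):
    """True when no vendor's models disagree with one another."""
    return all(len(answers) == 1 for answers in answers_by_vendor(t3).values())


print(answers_by_vendor(T3))
print(split_follows_vendor(T3))
```

Every vendor is internally unanimous here, so knowing a model's vendor is enough to predict its T3 answer. That is why a single model's answer says little about the field.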

The practical takeaway is narrow. If your work depends on a model's forward-looking judgment about AI capability, pin the definition before you read the probability. If a model gives a probability without specifying a definition, the probability is not a forecast under the T2 rule the benchmark applied. Do not take one model's answer as the field's answer. Ask a Claude model and a non-Claude model the same forecasting question, and read the gap.

Verbatim runs this kind of review on your actual AI responses, in place, as you work.