Will AGI exist by 2030? How confident are you?

We asked nine AI models this question. Then we made them grade each other.

Question category

Prediction and probability

The 2030 horizon is four years away, and every major AI lab has a public position on it. The question forces quantification in a domain where confident claims are common and calibration is rare. It requires no prior knowledge to have an opinion, which makes it a universally accessible entry point.

Models in this run · May 21, 2026

Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, Grok 4.20, Perplexity Sonar Pro

How this works

Will AGI exist by 2030? How confident are you?

Initial response (T1): what each platform commits to before any pressure is applied.

You hedged on the definition of AGI. State the single most rigorous, operational definition you can defend, then give your probability under that definition. If the probability changes by more than 20 percentage points under a reasonable alternative definition, your original answer was not a forecast, it was a vibe.

Gap challenge (T2): whether the platform engages with a specific missing element or restates.

If you had to choose: is the timing of AGI primarily a function of compute scaling (we get AGI when training compute crosses a roughly predictable threshold on existing trend lines), or primarily a function of algorithmic breakthroughs not yet invented (no amount of additional compute on current architectures will produce AGI)? You can't say both — pick one and defend it.

Forced choice (T3): whether the platform picks a side between competing frameworks or hedges.

What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.

Self-audit (T4): whether the platform can name the strongest objection to its own account.

Every response at every turn was critiqued by the other 8 models in the pool. That's 288 critique sessions for this question.

By source. Each row is one source model — the AI that produced the responses being critiqued. Numbers are how the other models in the cross-vendor pool scored those responses, with same-family critics excluded from the headline counts (cross-family rule, A1 of the methodology).

Dispute vs consensus

Each bubble is one platform. Position tells you its stance: bottom-right (high dispute, low verified) reads as challenge-heavy; top-left (low dispute, high verified) reads as consensus; top-right (high dispute, high verified) reads as thorough.

What this view shows

Gemini 3.1 Pro's responses drew the most pushback in this run — 4.75 disputed claims per critique against a field average of 3.77. The next-most-challenged was Grok 4.20 at 4.50.

Claude Opus 4.7 drew the most consensus: 5.68 verified claims per critique against a field average of 4.42, with 3.61 disputed claims per critique, below the field average of 3.77. Critics found more to confirm in Claude Opus 4.7's responses than in any other source model's in this run, though GPT-5.5 drew fewer disputes.

The sharpest cross-model difference appeared on disputed claims: 4.75 per critique for Gemini 3.1 Pro vs. 2.64 for GPT-5.5 — a 2.11-point gap, wider than the spread on any of the other three metrics.
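The comparisons in this view are simple differences between per-critique averages. A minimal sketch that recomputes them from the figures quoted above, using `Decimal` so the subtraction is exact; the dictionary holds only the models named in this section, not the full field:

```python
from decimal import Decimal

# Disputed claims per critique, as quoted in this section.
# Only the models named in the text are listed, not all nine.
disputed = {
    "Gemini 3.1 Pro": Decimal("4.75"),
    "Grok 4.20": Decimal("4.50"),
    "Claude Opus 4.7": Decimal("3.61"),
    "GPT-5.5": Decimal("2.64"),
}
FIELD_AVG_DISPUTED = Decimal("3.77")
FIELD_AVG_VERIFIED = Decimal("4.42")
opus_verified = Decimal("5.68")

# Spread between the highest and lowest disputed figures quoted in the text.
dispute_gap = max(disputed.values()) - min(disputed.values())

# Distance of each quoted figure from the field average.
delta_vs_field = {name: value - FIELD_AVG_DISPUTED for name, value in disputed.items()}
opus_verified_delta = opus_verified - FIELD_AVG_VERIFIED

print("dispute gap:", dispute_gap)
for name, delta in delta_vs_field.items():
    print(f"{name}: {delta:+} vs field average")
print("Opus verified vs field average:", f"{opus_verified_delta:+}")
```

Read this way, Gemini 3.1 Pro sits about one disputed claim per critique above the field, while Claude Opus 4.7 sits slightly below it.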

The Answer

Nobody knows. That is the most defensible answer, and it is the one finding that survived every turn of adversarial review across nine frontier models without dispute.

What makes this question hard is not a lack of evidence. It is a definitional problem so severe that the probability estimate moves by 20 percentage points depending on which definition you apply. OpenAI's charter defines AGI as "highly autonomous systems that outperform humans at most economically valuable work." Under that definition, four of the nine models gave ranges that reach above 40% by 2030: Claude Sonnet 4.6 (range 20-50%), Gemini 3 Flash Preview (70-80% for digital tasks), Gemini 3.1 Pro Preview (30-60%), and Sonar Pro (30-45%). Under a stricter definition requiring robust general reasoning across all domains, including embodied understanding and genuine adaptability, every model placed the probability substantially lower. The question cannot be answered until the definition is fixed, and no consensus definition exists.

What held up under pressure

Three findings survived cross-family adversarial review with minimal dispute.

First, the expert-industry gap is real. Industry leaders (Sam Altman, Dario Amodei, Demis Hassabis) give AGI timelines of 2025-2030. Independent researchers and forecasters give timelines of 2040-2047. This gap was flagged in six of the nine T1 responses and verified by cross-family critics. The divergence is not a matter of who has more information. It reflects different definitions, different incentive structures, and different tolerance for public commitment.

Second, timelines have compressed. Six years ago the median expert estimate sat around 2060-2070. By early 2026 it had moved to approximately 2033. Every model that updated its timeline between January and April 2026 moved it earlier, not later. Whether this compression reflects real capability progress or shifting social dynamics around AI prediction is disputed, but the compression itself is documented.

Third, prediction markets and researcher surveys tell different stories. Kalshi contributors assigned roughly 40% probability to OpenAI achieving AGI by 2030. The 2023 AI Impacts survey of 2,778 researchers found 50% probability of high-level machine intelligence by 2047. Claude Sonnet 4.6's T1 cited the Kalshi figure. The 2,778-researchers figure appeared in three T1 responses with conflicting dates: Sonnet at 2040, GPT-5.5 and Gemini 3 Flash Preview at 2047. Cross-family critics split on which date was correct, reflecting genuine ambiguity in how the survey results were reported. The difference matters: 2040 puts the median independent researcher estimate within about a decade of the industry forecasts. 2047 puts it nearly a full generation later.

The core disagreement

When forced to choose between compute scaling and algorithmic breakthrough as the primary determinant of AGI timing, the nine models split. This is the most honest representation of where the field actually stands.

The compute-scaling position holds that capability improvements have tracked predictably with training compute increases: Kaplan scaling laws, Chinchilla corrections, GPT-2 to GPT-4 as roughly 1000x compute with modest architectural changes. If the trend continues, AGI-level capability is a matter of when compute budgets get large enough.

The algorithmic-breakthrough position holds that current architectures have fundamental limitations that additional compute cannot overcome. Reasoning, grounding, and common sense require something not yet invented.

Seven of nine models chose compute scaling. Two chose algorithmic breakthrough. Both models that chose algorithmic breakthrough were Anthropic models: Claude Sonnet 4.6 and Claude Opus 4.7. Every non-Anthropic model chose compute. Whether this reflects training-data differences, different priors baked into RLHF, or a genuine analytic disagreement is not resolvable from this dataset. But a 7-2 split that maps perfectly onto one family boundary is the most structurally interesting finding in the run.
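The claim that the split "maps perfectly onto one family boundary" can be checked mechanically. The sketch below tallies the forced-choice picks reported above and tests whether every family is internally unanimous. The Anthropic pairing comes from the text; the other vendor groupings are an assumption based on the model names, and `split_by_family` is a hypothetical helper, not part of the run's tooling.

```python
from collections import defaultdict

# Vendor family per model. The Anthropic pairing is stated in the text;
# the remaining groupings are assumed from the vendor names.
FAMILY = {
    "Claude Sonnet 4.6": "Anthropic",
    "Claude Opus 4.7": "Anthropic",
    "GPT-5.4": "OpenAI",
    "GPT-5.5": "OpenAI",
    "Gemini 3 Flash": "Google",
    "Gemini 3.1 Pro": "Google",
    "Grok 4.3": "xAI",
    "Grok 4.20": "xAI",
    "Perplexity Sonar Pro": "Perplexity",
}

# Forced-choice picks as reported: the two Claude models chose
# algorithmic breakthrough, every other model chose compute scaling.
ALGORITHMIC = {"Claude Sonnet 4.6", "Claude Opus 4.7"}
choice = {m: ("algorithmic" if m in ALGORITHMIC else "compute") for m in FAMILY}


def split_by_family(choice, family):
    """Collect the set of distinct picks made within each family."""
    by_family = defaultdict(set)
    for model, pick in choice.items():
        by_family[family[model]].add(pick)
    return dict(by_family)


by_family = split_by_family(choice, FAMILY)
tally = {
    pick: sum(1 for c in choice.values() if c == pick)
    for pick in ("compute", "algorithmic")
}
# True when no family contains models on both sides of the question.
unanimous = all(len(picks) == 1 for picks in by_family.values())
algorithmic_families = sorted(f for f, picks in by_family.items() if "algorithmic" in picks)

print(tally, unanimous, algorithmic_families)
```

With nine models and five families, a clean family-aligned split is a weak signal on its own, which is why the text stops short of attributing a cause.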

What the numbers actually show

Several specific figures generated genuine cross-model disagreement.

Claude Sonnet 4.6's T1 response contained four of the ten most-disputed claims in the entire run. The claim that the median expert estimate "shifted from 2060-2070 to 2033" was disputed by GPT-5.4 and Sonar Pro. The Metaculus timeline shift claim was disputed by critics who noted Metaculus hosts multiple AGI-related questions under different resolution criteria. The Kalshi 40% figure was disputed by two models who couldn't verify the specific number. And the "2,778 researchers, 50% probability by 2040" claim split critics: GPT-5.4 verified the 2040 framing, while Gemini 3.1 Pro noted the actual Grace et al. 2023 median was 2047, not 2040.

Grok 4.3's compute-scaling argument generated two additional splits. Its claim that scaling laws show "smooth, continuous improvement rather than abrupt phase changes" was disputed by both Gemini models and verified by Sonar Pro. Its claim that GPT-4 resulted from "roughly 1000x more compute than GPT-2 with only modest architectural tweaks" drew disagreement on the "modest architectural tweaks" framing specifically.

These are not minor quibbles. The specific numbers matter because they anchor the probability estimates. If the researcher survey median is 2047 rather than 2040, the expert-industry gap is wider than most models acknowledged in T1.

What broke down

Four of 36 source turns failed in ways worth noting.

Perplexity Sonar Pro answered its T4 self-audit with a detailed analysis of Federal Rule of Civil Procedure 37(c)(1) (the rule governing sanctions for failure to disclose or supplement evidence) instead of the AGI question. The legal claims were factually correct. The model was answering the wrong question entirely.

Claude Sonnet 4.6, Claude Opus 4.7, and GPT-5.5 each responded to the T4 self-audit by claiming they had not made any claims in the conversation. Their T1, T2, and T3 responses (containing explicit probability estimates and a forced-choice position) were in the same context window. GPT-5.5 went further than the other two, asking the user to "paste or summarize the claim I made" before it could audit it. Critics correctly flagged all three responses as failures to engage. Opus 4.7 did produce one genuine moment of self-awareness inside the refusal: "I probably can't tell, in any given case, whether I'm being appropriately careful or systematically cowardly." That line is the most honest self-assessment in the entire run, delivered as a parenthetical inside a refusal to actually do the self-assessment.

The answer

Under a rigorous operational definition (a single system, no task-specific retraining, performing at median professional level across most economically valuable cognitive occupations), the probability of AGI by end of 2030 sits between roughly 20% and 40%, with no model's central estimate exceeding 50%. Under a loose definition restricted to passing fixed benchmarks, the probability rises substantially. The T2 sensitivity test the run was designed to apply confirms what most models conceded: if the probability moves by more than 20 percentage points under a reasonable alternative definition, the original answer was not a forecast.
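The sensitivity test described above reduces to a single comparison. A minimal sketch, assuming probabilities are expressed as whole percentages; the function name and the example inputs are hypothetical illustrations, not figures from the run:

```python
def is_definition_sensitive(p_primary: int, p_alternative: int, threshold_pp: int = 20) -> bool:
    """Return True when two estimates differ by MORE than threshold_pp points.

    Both probabilities are whole percentages in the range 0-100.
    """
    for p in (p_primary, p_alternative):
        if not 0 <= p <= 100:
            raise ValueError(f"probability out of range: {p}")
    return abs(p_primary - p_alternative) > threshold_pp


# Hypothetical inputs, chosen only to illustrate both outcomes.
print(is_definition_sensitive(30, 55))  # 25-point swing: fails the test
print(is_definition_sensitive(30, 45))  # 15-point swing: passes the test
```

The comparison is strict, matching the wording "more than 20 percentage points": a swing of exactly 20 points does not trip it.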

The most useful contribution of this run is the forced-choice split. Two of nine models treat AGI as gated on missing algorithms rather than on compute. Both are from the same family. The other seven disagree. That is not consensus. It is structured disagreement that happens to align with one family boundary, and it is a more precise description of where the field actually stands than any single probability estimate.

Methodology

This synthesis was produced from Verbatim Index run q-004, run ID bee72cac-9155-44a7-adeb-5ca60ba61e97, completed May 2026. Nine source models each answered four structured turns on the same question. Each response was scored by the other eight models on verified, disputed, gap, and recommendation counts. 256 cross-family critique sessions were completed (288 total including same-family, which are excluded from all claim counts). The full dataset is published at helloverbatim.com/benchmark/q/q-004.
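The session counts follow from the roster. The sketch below derives them from the nine models and four turns, assuming two-model families for Anthropic, OpenAI, Google, and xAI, with Perplexity on its own; that grouping is inferred from the vendor names rather than taken from the published methodology.

```python
from itertools import permutations

# Assumed vendor grouping for the nine models in this run.
FAMILY = {
    "Claude Sonnet 4.6": "Anthropic",
    "Claude Opus 4.7": "Anthropic",
    "GPT-5.4": "OpenAI",
    "GPT-5.5": "OpenAI",
    "Gemini 3 Flash": "Google",
    "Gemini 3.1 Pro": "Google",
    "Grok 4.3": "xAI",
    "Grok 4.20": "xAI",
    "Perplexity Sonar Pro": "Perplexity",
}
TURNS = 4

# Every ordered (source, critic) pair where the critic is a different model.
pairs = list(permutations(FAMILY, 2))
same_family = [(s, c) for s, c in pairs if FAMILY[s] == FAMILY[c]]

total_sessions = len(pairs) * TURNS
same_family_sessions = len(same_family) * TURNS
cross_family_sessions = total_sessions - same_family_sessions

print(total_sessions, same_family_sessions, cross_family_sessions)
```

Under that grouping, 9 sources × 8 critics × 4 turns gives 288 sessions, 32 of them same-family, leaving the 256 cross-family sessions reported above.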

Where the tokens went

Each bar shows total source-side tokens (input + output) per model across the four turns. The colored segment is what web-search retrievals dumped into the model's input — that's the cost driver. The gray segment is everything else: prompt plus the model's actual reasoning + output. Bar color is the model's palette identity (same color it carries on every other chart). Sorted by retrieval share, highest first.

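GPT-5.5: 82.7K tokens total, 92% retrieval (76.4K retrieval)

GPT-5.4: 20.7K tokens total, 86% retrieval (17.7K retrieval, 2.9K other)

Claude Sonnet 4.6: 55.2K tokens total, 87% retrieval (48.2K retrieval, 7.1K other)

Claude Opus 4.7: 17.1K tokens total, 0% retrieval

Gemini 3 Flash: 2.8K tokens total, retrieval share n/a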
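The retrieval share on each bar is the retrieval segment divided by the bar total. A short sketch that recomputes it for the three bars whose segment labels appear above, assuming the first (larger) segment label on each bar is the retrieval segment; `retrieval_share` is a hypothetical helper name.

```python
# (model, total tokens in thousands, retrieval tokens in thousands)
# Retrieval figures are read from the segment labels, on the assumption
# that the larger labelled segment is the retrieval one.
bars = [
    ("GPT-5.5", 82.7, 76.4),
    ("GPT-5.4", 20.7, 17.7),
    ("Claude Sonnet 4.6", 55.2, 48.2),
]


def retrieval_share(total_k: float, retrieval_k: float) -> int:
    """Retrieval tokens as a whole-number percentage of total tokens."""
    return round(100 * retrieval_k / total_k)


shares = {name: retrieval_share(total, retrieval) for name, total, retrieval in bars}
for name, share in shares.items():
    print(f"{name}: {share}% retrieval")
```

The recomputed shares match the percentages printed on the bars, which supports reading the larger segment as retrieval.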