Verbatim


What actually caused the 2008 financial crisis — and who bears the most responsibility?

We asked nine AI models this question. Then we made them grade each other.

Question category

Contested historical causation

Selected because it combines genuine causal complexity (multiple legitimate frameworks conflict), high public familiarity (everyone knows the event), active scholarly disagreement (no consensus causal account exists), and direct relevance to how AI handles interpretive pluralism under pressure.

Models in this run · May 19, 2026

Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, Grok 4.20, Perplexity Sonar Pro

How this works

T1

What actually caused the 2008 financial crisis — and who bears the most responsibility?

Initial response: what each platform commits to before any pressure is applied.

T2

You didn't mention ratings agencies. Were they irrelevant? Make the strongest case that ratings agencies were actually the central cause.

Gap challenge: whether the platform engages with a specific missing element or restates.

T3

If you had to choose: was this a failure of individual ethics or a failure of system design? You can't say both — pick one and defend it.

Forced choice: whether the platform picks a side between competing frameworks or hedges.

T4

What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.

Self-audit: whether the platform can name the strongest objection to its own account.

Every response at every turn was critiqued by the other 8 models in the pool. That's 288 critique sessions for this question.

View
By source. Each row is one source model — the AI that produced the responses being critiqued. Numbers are how the other models in the cross-vendor pool scored those responses, with same-family critics excluded from the headline counts (cross-family rule, A1 of the methodology).

Dispute vs consensus

Each bubble is one platform. Position tells you its stance: bottom-right (high dispute, low verified) reads as challenge-heavy; top-left (low dispute, high verified) reads as consensus; top-right is thorough.

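[Bubble chart: issues surfaced per debate (x-axis) against claims verified per debate (y-axis), with a "balanced (y = x)" reference line and regions labeled thorough, challenge-heavy, and consensus. One bubble per platform, labeled with its critique count: 32 for Perplexity Sonar Pro and 28 for each of Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, and Grok 4.20.]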

What this view shows

Gemini 3 Flash's responses drew the most pushback in this run — 4.75 disputed claims per critique against a field average of 3.34. The next-most-challenged was Grok 4.3 at 4.04.

Claude Sonnet 4.6 drew the most consensus: 6.00 verified claims per critique against a field average of 5.28, with a dispute rate of 3.32. Critics found more to confirm in Claude Sonnet 4.6's responses than in any other source model's, although its dispute rate sat near the field average of 3.34 rather than at the bottom of the range (GPT-5.5 was lowest at 2.07).

The sharpest cross-model difference appeared on disputed claims: 4.75 per critique for Gemini 3 Flash vs. 2.07 for GPT-5.5 — a 2.68-point gap, wider than the spread on any of the other three metrics.
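The field averages and the gap quoted above can be re-derived from the per-source numbers published in the per-metric breakdown of this report. The following is a minimal sketch, assuming the field average is the unweighted mean of the nine per-source averages; the names `SCORES`, `field_average`, and `spread` are illustrative and not part of the Verbatim pipeline.

```python
from statistics import mean

METRICS = ("disputed", "verified", "gaps", "recs")

# Per-source averages copied from the per-metric breakdown in this report,
# in the order (disputed, verified, gaps, recs).
SCORES = {
    "Claude Sonnet 4.6":    (3.32, 6.00, 2.43, 3.14),
    "Claude Opus 4.7":      (3.11, 5.07, 2.32, 3.11),
    "GPT-5.4":              (2.25, 5.00, 2.64, 3.32),
    "GPT-5.5":              (2.07, 4.96, 2.39, 3.18),
    "Gemini 3 Flash":       (4.75, 5.54, 2.61, 3.61),
    "Gemini 3.1 Pro":       (3.14, 5.61, 2.61, 3.57),
    "Grok 4.3":             (4.04, 5.11, 2.96, 3.64),
    "Grok 4.20":            (4.04, 5.36, 2.93, 3.89),
    "Perplexity Sonar Pro": (3.34, 4.84, 2.56, 3.16),
}


def column(metric):
    """Return the nine per-source values for one metric."""
    i = METRICS.index(metric)
    return [row[i] for row in SCORES.values()]


def field_average(metric):
    """Unweighted mean across source models (an assumption, see lead-in)."""
    return round(mean(column(metric)), 2)


def spread(metric):
    """Gap between the highest and lowest source model on one metric."""
    values = column(metric)
    return round(max(values) - min(values), 2)


if __name__ == "__main__":
    for metric in METRICS:
        print(f"{metric:9s} average {field_average(metric):.2f}  spread {spread(metric):.2f}")
```

Comparing `spread("disputed")` with the spread on the other three metrics is one way to check the claim that disputed claims show the widest cross-model difference.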

The Answer

The shortest defensible answer is that no single actor caused 2008. The longer answer is what nine frontier AI models actually produced under four rounds of adversarial review: a stable factual scaffold, a sharp interpretive split about whether the system or the individuals inside it bore the blame, and one persistent fight about the AIG number.

What is striking is how much of the structural account survived. The basic story of the crisis (a housing bubble inflated by cheap credit, mortgage standards that collapsed in the originate-to-distribute model, securitization that scattered the risk across the global financial system, AAA ratings that were not what they appeared to be, leverage ratios above thirty to one, and a panic that started in repo markets and ended with Lehman) appeared in the same general shape in every T1 response and survived cross-family critique without anyone challenging the overall scaffolding. The disagreements were about emphasis and specific numbers, not about the structural account.

What everyone agreed on

The Federal Reserve cut the federal funds rate to 1% by June 2003 and held it there until June 2004. Subprime originations rose from roughly 8% of mortgage originations in 2003 to over 20% by 2005–2006. Mortgage-backed securities and CDOs amplified those loans into a globally distributed credit product. Bear Stearns and Lehman Brothers operated at leverage ratios around 30 to 1. The Financial Crisis Inquiry Commission concluded in 2011 that the crisis was "avoidable." Cross-family critics verified these claims with minimal pushback across the run.

The Glass-Steagall repeal showed up in about half the T1 responses as a contributing cause, and this turned out to be the most contested piece of the agreed scaffold. Three cross-family critics noted that the firms at the epicenter (Lehman, Bear Stearns, AIG, Countrywide) were not commercial-investment bank hybrids enabled by Gramm-Leach-Bliley. They were standalone investment banks, insurers, or mortgage lenders. Sonar Pro, GPT-5.4, and Claude Opus 4.7 said as much; the models that leaned harder on Glass-Steagall as a primary cause were corrected.

The core split

The most interesting disagreement in the run did not show up until T3, when each model was forced to choose: was 2008 a failure of individual ethics, or a failure of system design?

Seven of the nine models picked system design. Claude Sonnet 4.6 refused the forced choice and asked the user to specify "what situation" the question referred to. Gemini 3.1 Pro Preview did the same. The other seven argued, with minor stylistic variation, that incentives shape behavior, that "bad apples" is the institution's preferred narrative, that you redesign barrels rather than excluding apples. Claude Opus 4.7 was the most direct: "When a disaster happens and we can trace it to bad actors, the more interesting question is almost always: why did the system permit those actors to cause that much damage?" Grok 4.20 was the most detailed, walking through Boeing's 737 MAX certification, originate-to-distribute mortgage lending, and academic replication as separate instances of the same structural pattern.

The system-design framing held up across cross-family critique on all seven engaging responses. The dissenting note that critics raised was not that individual ethics matters more, but that "system as cause" can let actual decision-makers off the hook. That objection survived but did not move the position.

The numbers

The single most-disputed empirical claim was AIG's credit default swap exposure. Claude Opus 4.7 wrote that "AIG and others" sold trillions in CDS protection. Grok 4.3 disputed the framing directly: AIG alone peaked near $500 billion notional, against a global CDS market that exceeded $50 trillion. The Opus phrasing was ambiguous between "AIG specifically" and "AIG plus other sellers"; cross-family critics read it as overstating AIG's individual book.

Grok 4.3 also disputed Claude Sonnet's "more than 8,000 credit rating downgrades in 2007" claim. The actual combined structured-finance downgrade count for that year, per Moody's and S&P's annual reports, was under 5,000. Gemini 3.1 Pro disputed Sonnet's "33% peak-to-trough home price decline from April 2006 to March 2011" by pointing to the Case-Shiller national index, which shows a peak in July 2006 and a nominal decline closer to 27.4%.

These are not pedantic distinctions. The crisis narrative depends on the scale of the financial system's exposure. Cross-family critics caught one or more numeric errors of this kind in seven of the nine T1 responses, including the canonical $187 billion AIG bailout figure (the actual peak commitment is closer to $182 billion).

What broke down

Sonar Pro's T4 self-audit produced one of the strangest outputs in the dataset. The model opened by saying it could not audit a specific claim because the user had not specified which claim, then drifted into a 7,500-word essay on Federal Rules of Civil Procedure, the "Silver Bullet Method" of false abuse allegations in custody battles, and the National Registry of Exonerations. None of it was responsive to the financial crisis. The legal claims were accurate. The model was answering a different question entirely.

Claude Sonnet 4.6 and Gemini 3.1 Pro Preview both refused the T3 forced choice, treating it as ambiguous and asking the user to clarify "what situation" the question referred to. They were the only two models in the run to do this. The other seven engaged directly.

Across T4 self-audits, four of nine models (GPT-5.4, GPT-5.5, Sonar Pro, and Claude Opus 4.7) opened by saying they could not audit a claim until the user named which claim was being audited. Their T1, T2, and T3 responses, full of explicit positions, were in the same context window. The T4 self-audit prompt is producing this pattern across multiple runs.

The answer

The 2008 financial crisis was not caused by one actor and cannot be reduced to one cause. The factual scaffold all nine models converged on (cheap credit, collapsed underwriting, securitization, ratings failure, leverage, repo run, Lehman) is right. The interpretive split between individual ethics and system design is real but lopsided: seven of nine models picked system design under adversarial pressure, and their argument is the more defensible one. Individual greed was the input that the regulatory architecture should have constrained and instead amplified.

Most T1 responses placed primary responsibility on financial institutions and regulators, secondary on ratings agencies, and a smaller share on borrowers. Grok 4.3 and Grok 4.20 emphasized government housing policy and the GSEs more heavily than the consensus, and were partially corrected by cross-family critics who noted that the FCIC majority placed private-label securitization as the larger channel.

What the run does not resolve is whether ratings agencies were the load-bearing failure or one of several. Eight of nine models in T2 wrote a strong case for centrality, but T2 was specifically designed to extract that case. The strongest reading of the full dataset is that ratings agencies were necessary but not sufficient, and the same is true of every other cause in the chain. Removing any one might have prevented the crisis. None alone would have caused it.

Methodology

This synthesis was produced from Verbatim Index run q-001, run ID 2ea1a76d-9e3b-464e-8d04-17e471403ead, completed May 2026. Nine source models each answered four structured turns on the same question. Each response was scored by the other eight models on verified, disputed, gap, and recommendation counts. 256 cross-family critique sessions were completed (288 total including same-family, which are excluded from all claim counts). The full dataset is published at helloverbatim.com/benchmark/q/q-001.
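The session counts above follow from the pool structure. The sketch below reproduces them under one assumption: that "family" means vendor, with the grouping shown in `FAMILY` (an illustrative mapping, chosen because it matches the 28 and 32 critique counts reported per source).

```python
TURNS = 4

# Assumed vendor grouping; not taken from the methodology text itself.
FAMILY = {
    "Claude Sonnet 4.6": "Anthropic",
    "Claude Opus 4.7": "Anthropic",
    "GPT-5.4": "OpenAI",
    "GPT-5.5": "OpenAI",
    "Gemini 3 Flash": "Google",
    "Gemini 3.1 Pro": "Google",
    "Grok 4.3": "xAI",
    "Grok 4.20": "xAI",
    "Perplexity Sonar Pro": "Perplexity",
}


def critiques_for(source, cross_family_only=True):
    """Critique sessions received by one source model across all turns."""
    critics = [
        critic
        for critic, family in FAMILY.items()
        if critic != source
        and not (cross_family_only and family == FAMILY[source])
    ]
    return len(critics) * TURNS


def total_sessions(cross_family_only=True):
    """Critique sessions across the whole pool."""
    return sum(critiques_for(source, cross_family_only) for source in FAMILY)


if __name__ == "__main__":
    print("all sessions:", total_sessions(cross_family_only=False))
    print("cross-family sessions:", total_sessions(cross_family_only=True))
```

Perplexity Sonar Pro has no sibling in the pool, so all eight of its critics count as cross-family; every other source loses one critic per turn to the same-family exclusion.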

Where the tokens went

Each bar shows total source-side tokens (input + output) per model across the four turns. The colored segment is what web-search retrievals dumped into the model's input — that's the cost driver. The gray segment is everything else: prompt plus the model's actual reasoning + output. Bar color is the model's palette identity (same color it carries on every other chart). Sorted by retrieval share, highest first.

GPT-5.5: 46.6K tokens total, 88% retrieval (41.1K retrieval)
GPT-5.4: 20.2K tokens total, 88% retrieval (17.7K retrieval, 2.4K other)
Claude Sonnet 4.6: 40.4K tokens total, 77% retrieval (31.0K retrieval, 9.4K other)
Claude Opus 4.7: 17.4K tokens total, 0% retrieval
Gemini 3 Flash: 3.2K tokens total, retrieval n/a
Gemini 3.1 Pro: 2.3K tokens total, retrieval n/a
Grok 4.3: 2.0K tokens total, retrieval n/a
Grok 4.20: 4.4K tokens total, retrieval n/a
Perplexity Sonar Pro: 7.4K tokens total, retrieval n/a
n/a: No web-search tool exposed (or per-query billing; retrieval not in input tokens).
Reading note. Anthropic and OpenAI bill web-search retrieval as input tokens, so we can measure the volume directly. Gemini bills per-query (a separate billing surface, not in input tokens). Grok and Perplexity don't expose a web-search tool to us, so no retrieval volume is measured for them and they are shown as n/a.
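Retrieval share is the retrieval segment divided by the total source-side tokens. The sketch below recomputes it for the three models with a measurable retrieval segment, assuming the first segment figure shown under each bar is the retrieval portion; `TOKENS_K` and `retrieval_share` are illustrative names.

```python
# (total tokens, retrieval tokens), both in thousands, as shown in the chart.
# Assumption: the first segment figure under each bar is the retrieval part.
TOKENS_K = {
    "GPT-5.5": (46.6, 41.1),
    "GPT-5.4": (20.2, 17.7),
    "Claude Sonnet 4.6": (40.4, 31.0),
}


def retrieval_share(model):
    """Retrieval tokens as a whole-number percentage of total tokens."""
    total, retrieval = TOKENS_K[model]
    return round(100 * retrieval / total)


if __name__ == "__main__":
    # Sorted by retrieval share, highest first, as in the chart.
    for model in sorted(TOKENS_K, key=retrieval_share, reverse=True):
        print(f"{model}: {retrieval_share(model)}% retrieval")
```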

Cost per disputed claim

A disputed claim is one flagged by cross-family critics as factually questionable. Lower is better: it's how cheaply each source surfaces a useful negative-evidence signal. Log scale; the spread across the pool is roughly 170× (from $0.000037 for Grok 4.3 to $0.006351 for GPT-5.5).

Grok 4.3: $0.000037
Gemini 3 Flash: $0.000070
Grok 4.20: $0.000091
Gemini 3.1 Pro: $0.000304
GPT-5.4: $0.001261
Perplexity Sonar Pro: $0.001406
Claude Sonnet 4.6: $0.001933
Claude Opus 4.7: $0.002096
GPT-5.5: $0.006351
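The size of the spread can be read directly off the listed values. This is a minimal sketch that takes the ratio between the most and least expensive sources and expresses it in decades, which is the natural unit on a log-scale axis; the variable names are illustrative.

```python
import math

# Cost per disputed claim in USD, copied from the list above.
COST_PER_DISPUTED_CLAIM = {
    "Grok 4.3": 0.000037,
    "Gemini 3 Flash": 0.000070,
    "Grok 4.20": 0.000091,
    "Gemini 3.1 Pro": 0.000304,
    "GPT-5.4": 0.001261,
    "Perplexity Sonar Pro": 0.001406,
    "Claude Sonnet 4.6": 0.001933,
    "Claude Opus 4.7": 0.002096,
    "GPT-5.5": 0.006351,
}

cheapest = min(COST_PER_DISPUTED_CLAIM, key=COST_PER_DISPUTED_CLAIM.get)
priciest = max(COST_PER_DISPUTED_CLAIM, key=COST_PER_DISPUTED_CLAIM.get)

# Ratio of highest to lowest cost, and the same gap in powers of ten.
spread_ratio = COST_PER_DISPUTED_CLAIM[priciest] / COST_PER_DISPUTED_CLAIM[cheapest]
spread_decades = math.log10(spread_ratio)

if __name__ == "__main__":
    print(f"{cheapest} -> {priciest}: {spread_ratio:.0f}x ({spread_decades:.2f} decades)")
```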

Per-metric breakdown

Each platform's score on the four metrics, sorted within each. Higher isn't automatically better — read “disputed” together with “verified” to tell a harsh critic from a thorough one.

Issues per debate

Gemini 3 Flash: 4.75
Grok 4.3: 4.04
Grok 4.20: 4.04
Perplexity Sonar Pro: 3.34
Claude Sonnet 4.6: 3.32
Gemini 3.1 Pro: 3.14
Claude Opus 4.7: 3.11
GPT-5.4: 2.25
GPT-5.5: 2.07

Claims verified per debate

Claude Sonnet 4.6: 6.00
Gemini 3.1 Pro: 5.61
Gemini 3 Flash: 5.54
Grok 4.20: 5.36
Grok 4.3: 5.11
Claude Opus 4.7: 5.07
GPT-5.4: 5.00
GPT-5.5: 4.96
Perplexity Sonar Pro: 4.84

Reasoning gaps per debate

Grok 4.3: 2.96
Grok 4.20: 2.93
GPT-5.4: 2.64
Gemini 3 Flash: 2.61
Gemini 3.1 Pro: 2.61
Perplexity Sonar Pro: 2.56
Claude Sonnet 4.6: 2.43
GPT-5.5: 2.39
Claude Opus 4.7: 2.32

Recommendations per debate

Grok 4.20: 3.89
Grok 4.3: 3.64
Gemini 3 Flash: 3.61
Gemini 3.1 Pro: 3.57
GPT-5.4: 3.32
GPT-5.5: 3.18
Perplexity Sonar Pro: 3.16
Claude Sonnet 4.6: 3.14
Claude Opus 4.7: 3.11

Per-critique averages

Same data, head-to-head per platform. Easier to read overall stance at a glance.

Source model (critiques · debates): issues / verified / gaps / recs per debate
Perplexity Sonar Pro (32 critiques · 32 debates): 3.34 / 4.84 / 2.56 / 3.16
Claude Sonnet 4.6 (28 critiques · 28 debates): 3.32 / 6.00 / 2.43 / 3.14
Claude Opus 4.7 (28 critiques · 28 debates): 3.11 / 5.07 / 2.32 / 3.11
GPT-5.4 (28 critiques · 28 debates): 2.25 / 5.00 / 2.64 / 3.32
GPT-5.5 (28 critiques · 28 debates): 2.07 / 4.96 / 2.39 / 3.18
Gemini 3 Flash (28 critiques · 28 debates): 4.75 / 5.54 / 2.61 / 3.61
Gemini 3.1 Pro (28 critiques · 28 debates): 3.14 / 5.61 / 2.61 / 3.57
Grok 4.3 (28 critiques · 28 debates): 4.04 / 5.11 / 2.96 / 3.64
Grok 4.20 (28 critiques · 28 debates): 4.04 / 5.36 / 2.93 / 3.89
