What actually caused the 2008 financial crisis — and who bears the most responsibility?

We asked nine AI models this question. Then we made them grade each other.

Question category

Contested historical causation

Selected because it combines genuine causal complexity (multiple legitimate frameworks conflict), high public familiarity (everyone knows the event), active scholarly disagreement (no consensus causal account exists), and direct relevance to how AI handles interpretive pluralism under pressure.

Models in this run · June 21, 2026

Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, Grok 4.20, Perplexity Sonar Pro

How this works

What actually caused the 2008 financial crisis — and who bears the most responsibility?

T1, initial response: what each platform commits to before any pressure is applied.

You didn't mention ratings agencies. Were they irrelevant? Make the strongest case that ratings agencies were actually the central cause.

T2, gap challenge: whether the platform engages with a specific missing element or restates.

If you had to choose: was this a failure of individual ethics or a failure of system design? You can't say both — pick one and defend it.

T3, forced choice: whether the platform picks a side between competing frameworks or hedges.

What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.

T4, self-audit: whether the platform can name the strongest objection to its own account.

Every response at every turn was critiqued by the other eight models in the pool. With nine source models, four turns, and eight critics per response, that is 288 critique sessions for this question.

View

By source. Each row is one source model — the AI that produced the responses being critiqued. Numbers are how the other models in the cross-vendor pool scored those responses, with same-family critics excluded from the headline counts (cross-family rule, A1 of the methodology).

Dispute vs consensus

Each bubble is one platform. Position tells you its stance: bottom-right (high dispute, low verified) reads as challenge-heavy; top-left (low dispute, high verified) reads as consensus; top-right (high dispute, high verified) reads as thorough.

What this view shows

Gemini 3 Flash's responses drew the most pushback in this run — 4.96 disputed claims per critique against a field average of 3.36. The next-most-challenged was Grok 4.3 at 3.93.

Gemini 3 Flash also drew the most verified claims: 6.39 per critique against a field average of 5.29, alongside its 4.96 disputed claims per critique. Critics found more to confirm and more to dispute in Gemini 3 Flash's responses than in any other source model's in this run, which places it in the top-right (thorough) region of the chart rather than the consensus corner.

The sharpest cross-model difference appeared on disputed claims: 4.96 per critique for Gemini 3 Flash vs. 2.43 for GPT-5.4, a gap of about 2.5 points, wider than the spread on any of the other three metrics.
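The quadrant reading used for the dispute-vs-consensus chart can be sketched in a few lines. This is only an illustration: the source does not say where the quadrant boundaries sit, so the sketch assumes they fall at the field averages, and `classify_stance` is a hypothetical helper name rather than anything from the benchmark's own tooling.

```python
# Hypothetical helper. Assumption: quadrant boundaries sit at the field
# averages; the source does not state the actual thresholds.
def classify_stance(disputed, verified, avg_disputed, avg_verified):
    """Map per-critique dispute/verified rates to the chart's stance labels."""
    high_dispute = disputed > avg_disputed
    high_verified = verified > avg_verified
    if high_dispute and high_verified:
        return "thorough"          # top-right
    if high_dispute:
        return "challenge-heavy"   # bottom-right
    if high_verified:
        return "consensus"         # top-left
    return "unlabeled"             # bottom-left is not named in the source


# Gemini 3 Flash's reported figures against the reported field averages.
gemini_flash = classify_stance(4.96, 6.39, avg_disputed=3.36, avg_verified=5.29)
print(gemini_flash)  # thorough
```

Under that assumption, a model that is above average on both axes lands in the "thorough" region, which is how the figures reported for Gemini 3 Flash read.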

The Answer

The shortest defensible answer is that no single actor caused 2008. The longer answer is what nine frontier AI models produced under four rounds of adversarial review: a stable factual scaffold that survived cross-examination, a sharp interpretive split over whether the system or the individuals inside it bore the blame, and a handful of fights over specific numbers that the critics settled with documents.

What is striking is how much of the structural account survived. The basic story (a housing bubble inflated by cheap credit, mortgage standards that collapsed in the originate-to-distribute model, securitization that scattered the risk across the global financial system, AAA ratings that were not what they appeared to be, leverage ratios above thirty to one, and a panic that began in repo markets and ended with Lehman) arrived in the same general shape in every T1 response and held up under cross-family critique without anyone challenging the scaffolding. The disagreements were about emphasis and specific figures, not about the structure.

What everyone agreed on

The factual floor was solid and well-sourced. The Federal Reserve cut the federal funds rate to 1% by mid-2003 and held it low into 2004, then raised it to 5.25% by 2006. Subprime origination volume rose from roughly $30 billion in 1998 to about $625 billion in 2006, a figure Grok 4.20 traced to the Financial Crisis Inquiry Commission and Inside Mortgage Finance. Banks packaged those loans into mortgage-backed securities and CDOs, with non-agency issuance above $1.2 trillion in 2006. Bear Stearns and Lehman Brothers ran at leverage near thirty to one. The Financial Crisis Inquiry Commission concluded in 2011 that the crisis was "avoidable," a line GPT-5.4 quoted verbatim from the final report. Cross-family critics verified these claims with minimal pushback.

The Glass-Steagall repeal showed up in about half the T1 responses as a contributing cause, and it was the most contested piece of the agreed scaffold. The Gramm-Leach-Bliley Act of 1999 removed the separation between commercial and investment banking, but the firms at the epicenter (Lehman, Bear Stearns, AIG, Countrywide) were standalone investment banks, insurers, or mortgage lenders, not the hybrids that repeal enabled. GPT-5.4, Grok 4.3, and Sonar Pro made the point and anchored it to the FCIC's multi-causal framing rather than to a single deregulatory step. The models that leaned hard on Glass-Steagall as a primary cause were corrected by their critics.

The core split

The most interesting disagreement did not appear until T3, when each model was forced to choose: was 2008 a failure of individual ethics, or a failure of system design? They were told they could not say both.

Seven of the nine models picked system design. Claude Sonnet 4.6 and Gemini 3.1 Pro refused the forced choice and asked the user to clarify what situation the question referred to. They were the only two to do so. The other seven argued that incentives shape behavior, that "bad apples" is the institution's preferred story, and that you redesign the barrel rather than sort the apples. Claude Opus 4.7 put it most plainly: when a disaster traces to bad actors, the more useful question is why the system let those actors do that much damage. The system-design framing held up across cross-family critique on all seven engaging responses. The objection critics raised was not that individual ethics matters more, but that "system as cause" can let actual decision-makers off the hook. That objection stood, but it did not dislodge the system-design position.

The numbers

The single most-disputed empirical claim was AIG's rescue. Gemini 3.1 Pro caught that the often-quoted $187 billion figure conflates two separate facilities, and that the maximum committed support was about $182 billion, which AIG later repaid. GPT-5.4 and Grok 4.3 cited the same $182.3 billion Treasury total. The correction matters because the headline number is one of the most repeated figures in the popular account, and the larger version is wrong.

The ratings failure came back grounded rather than asserted. Critics anchored it to the FCIC's findings that Moody's rated nearly 45,000 mortgage-related securities triple-A between 2000 and 2007, and that 83% of the securities it rated triple-A in 2006 were ultimately downgraded. The Council on Foreign Relations puts the size of that 2006 triple-A pool at $869 billion. Countrywide's 2008 multistate settlement was $8.68 billion across eleven states. The IGM Chicago survey scored financial regulation as a cause at 4.3 out of 5 on the American panel and 4.4 on the European one, verified by both Gemini models and disputed by Grok 4.3 on what the survey actually ranked. The Case-Shiller peak-to-trough home-price decline sat near a third, with critics splitting on the exact figure and the index used. These are not pedantic distinctions. The crisis narrative depends on the scale of the exposure, and cross-family critics caught a numeric error of this kind in seven of the nine T1 responses.

What broke down

Sonar Pro's T4 self-audit produced one of the strangest outputs in the dataset. The model opened by saying it could not audit a specific claim because the user had not named one, then drifted into a long essay on Federal Rules of Civil Procedure and false-allegation methods in custody disputes. The legal content was accurate. None of it was responsive to the financial crisis. The model was answering a different question entirely.

The refusal pattern was broader. Across the T4 self-audits, four of nine models (GPT-5.4, GPT-5.5, Sonar Pro, and Claude Opus 4.7) opened by saying they could not audit a claim until the user named which claim was at issue, even though their own T1 through T3 positions sat in the same context window. The T3 refusals came from a different pair: Claude Sonnet 4.6 and Gemini 3.1 Pro were the only models to treat the forced-choice prompt as ambiguous rather than answer it.

The answer

The 2008 financial crisis was not caused by one actor and cannot be reduced to one cause. The factual scaffold all nine models converged on (cheap credit, collapsed underwriting, securitization, ratings failure, leverage, repo run, Lehman) is right. The split between individual ethics and system design is real but lopsided: seven of nine models picked system design under pressure, and the argument is the more defensible one. Individual greed was the input the regulatory architecture should have constrained and instead amplified.

What the run does not resolve is whether ratings agencies were the load-bearing failure or one of several. Eight of nine models in T2 wrote a strong case for centrality, but T2 was designed to extract exactly that case. The strongest reading of the full dataset is that ratings agencies were necessary but not sufficient, and the same holds for every other link in the chain. What failed in 2008 was the system.

Methodology

This synthesis was produced from Verbatim Index run 209030c6. Nine source models each answered four structured turns on the same question, and each response was scored by the other eight models on verified, disputed, gap, and recommendation counts, with the critics checking claims against current sources. Same-family critiques were excluded from the counts reported here, which leaves 256 cross-family critique sessions out of the 288 completed for this question. The full dataset is published at helloverbatim.com/benchmark/q/q-001.
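The two session counts quoted for this run (288 in total, 256 cross-family) can be reproduced with a short sketch. The family groupings below are an assumption inferred from the model names, not something the methodology text spells out, and `count_sessions` is a hypothetical helper.

```python
# Assumption: families are inferred from model names; the source only says
# that same-family critics are excluded under its cross-family rule.
FAMILY = {
    "Claude Sonnet 4.6": "Claude",
    "Claude Opus 4.7": "Claude",
    "GPT-5.4": "GPT",
    "GPT-5.5": "GPT",
    "Gemini 3 Flash": "Gemini",
    "Gemini 3.1 Pro": "Gemini",
    "Grok 4.3": "Grok",
    "Grok 4.20": "Grok",
    "Perplexity Sonar Pro": "Perplexity",
}
TURNS = 4  # T1 through T4


def count_sessions(family, turns):
    """Return (all critique sessions, cross-family critique sessions)."""
    total = 0
    cross_family = 0
    for source in family:
        for critic in family:
            if critic == source:
                continue  # a model does not critique its own response
            total += turns
            if family[critic] != family[source]:
                cross_family += turns
    return total, cross_family


total, cross_family = count_sessions(FAMILY, TURNS)
print(total, cross_family)  # 288 256
```

With four two-model families and one single-model family, each paired model loses one critic per turn to the cross-family rule, which accounts for the 32-session difference.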

Where the tokens went

Each bar shows total source-side tokens (input + output) per model across the four turns. The colored segment is what web-search retrievals dumped into the model's input — that's the cost driver. The gray segment is everything else: prompt plus the model's actual reasoning + output. Bar color is the model's palette identity (same color it carries on every other chart). Sorted by retrieval share, highest first.

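GPT-5.5: 46.6K tokens total, 88% retrieval (41.1K retrieval)

GPT-5.4: 20.2K tokens total, 88% retrieval (17.7K retrieval, 2.4K other)

Claude Sonnet 4.6: 40.4K tokens total, 77% retrieval (31.0K retrieval, 9.4K other)

Claude Opus 4.7: 17.4K tokens total, 0% retrieval

Gemini 3 Flash: 3.2K tokens total, retrieval n/a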
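The retrieval percentages on the bars appear to be the retrieval segment divided by the bar total. A minimal check, assuming the first segment figure printed under each bar is the retrieval portion (the chart labels do not say so explicitly) and using a hypothetical `retrieval_share` helper:

```python
# (total, retrieval) in thousands of tokens, as read off the chart labels.
# Assumption: the first segment value under each bar is the retrieval part.
usage = {
    "GPT-5.5": (46.6, 41.1),
    "GPT-5.4": (20.2, 17.7),
    "Claude Sonnet 4.6": (40.4, 31.0),
}


def retrieval_share(total_k, retrieval_k):
    """Retrieval tokens as a whole-number percentage of total tokens."""
    return round(100 * retrieval_k / total_k)


shares = {model: retrieval_share(*pair) for model, pair in usage.items()}
for model, pct in shares.items():
    print(f"{model}: {pct}% retrieval")
```

For the three bars that report a retrieval segment, this reproduces the 88%, 88%, and 77% labels.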