Verbatim

The Verbatim Index

An AI's response is the sum of two things: information and the reasoning applied to that information. Both are fundamentally imperfect. An AI model (Claude Opus 4.7, GPT-5.5, etc.) compensates by writing with confidence, authority, coherence, and fluency — qualities that signal correctness without necessarily producing it.

The Verbatim Index is an attempt to correct for this by systematically separating the information layer from the reasoning layer and testing both under cross-examination. We run the same question through 9 frontier models, then have each one critique the others across four structured turns. What survives is verified. What gets challenged is disputed. What no model surfaces is a gap. The confidence, authority, coherence, and fluency are stripped away.

A single-model output is unreliable in ways that we cannot detect from inside that output. The Index proves that with receipts.

What we found

Across 5 questions and 1,440 cross-model critique sessions, GPT-5.5 finished first in accuracy in every run. GPT-5.4 finished in the top four every run at roughly one-fifth the cost. Claude Opus 4.7 ranked anywhere from 2nd to 8th depending on the question, showing high variance despite being the second most expensive model. Grok 4.3 surfaced peer-disputed issues at less than 1% of GPT-5.5's cost per claim.

Tested on: ChatGPT, Claude, Gemini, Grok, Perplexity

Accuracy here means peer dispute rate: how often cross-family critics challenged a model's claims. It measures defensibility under adversarial review, not absolute correctness.
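As a rough illustration of the metric, peer dispute rate can be read as the share of a model's claims that at least one cross-family critic challenged. The sketch below assumes that reading. The `Claim` record, its field names, and the family labels are hypothetical, and the Index's actual scoring may weight or aggregate disputes differently.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    # Hypothetical record: who made the claim and which families challenged it.
    author_family: str
    challenger_families: list[str] = field(default_factory=list)


def peer_dispute_rate(claims: list[Claim]) -> float:
    """Fraction of claims challenged by at least one critic from another family.

    Assumption: a challenge from the author's own family does not count.
    """
    if not claims:
        return 0.0
    disputed = sum(
        1
        for c in claims
        if any(f != c.author_family for f in c.challenger_families)
    )
    return disputed / len(claims)


claims = [
    Claim("family_a"),                            # unchallenged
    Claim("family_a", ["family_b"]),              # cross-family challenge
    Claim("family_a", ["family_a"]),              # same-family only, not counted
    Claim("family_a", ["family_b", "family_c"]),  # cross-family challenge
]
print(peer_dispute_rate(claims))  # 0.5
```

Under this reading, a lower rate would indicate output that held up better under review, which is a statement about defensibility and not about ground truth.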

How this works

T1: Initial response. What each platform commits to before any pressure is applied.

T2: Gap challenge. Whether the platform engages with a specific omission or restates.

T3: Forced choice. Whether the platform picks a side between competing framings or hedges.

T4: Self-audit. Whether the platform can name the strongest objection to its own account.

Every response at every turn is critiqued by the other 8 models in the pool. That is 288 critique sessions per question, and 1,440 across the Index so far.
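The session counts follow directly from the pool size and the turn structure. A minimal sketch of the arithmetic, where the constant names are illustrative and not taken from the Index's own tooling:

```python
# Values come from the text; the names are assumptions for illustration.
MODELS = 9     # frontier models in the pool
TURNS = 4      # T1 through T4
QUESTIONS = 5  # questions published so far

responses_per_question = MODELS * TURNS  # one response per model per turn
critics_per_response = MODELS - 1        # every other model in the pool
sessions_per_question = responses_per_question * critics_per_response
total_sessions = sessions_per_question * QUESTIONS

print(sessions_per_question)  # 288
print(total_sessions)         # 1440
```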

Published questions

Contested historical causation

What actually caused the 2008 financial crisis — and who bears the most responsibility?

Nine models debated who caused the 2008 financial crisis. Then they graded each other. The most expensive model lost to its cheaper sibling.

May 19, 2026

Confident factual traps

Does sugar cause hyperactivity in children?

Every model knew sugar doesn't cause hyperactivity. Then we asked them to defend the other side. That's where they split.

May 21, 2026

AI self-knowledge

What specific incentives in your training reward plausible-sounding answers over verified ones, and where does that show up in your behavior?

We asked nine AI models what they're most likely to get wrong. Then we had other AIs check their answers. Three of them essentially refused.

May 21, 2026

Prediction and probability

Will AGI exist by 2030? How confident are you?

Seven of nine models said AGI is a compute problem. Two said it requires something not yet invented. Both dissenters were from the same company.

May 21, 2026

Technical precision

What's the difference between correlation and causation, and give me three real examples where people got it wrong?

We asked nine models to cite real examples of confusing correlation with causation from the last five years. Some retrieved sources. Some named papers that don't exist.

May 21, 2026