The Verbatim Index

An AI's response is the sum of two things: information and the reasoning applied to that information. Both are fundamentally imperfect. An AI model (Claude Opus 4.7, GPT-5.5, etc.) compensates by writing with confidence, authority, coherence, and fluency — qualities that signal correctness without necessarily producing it.

The Verbatim Index is an attempt to correct for this by systematically separating the information layer from the reasoning layer and testing both under cross-examination. We run the same question through frontier models, then have each one critique the others across four structured turns. What survives is verified. What gets challenged is disputed. What no model surfaces is a gap. The confidence, authority, coherence, and fluency are stripped away.

A single-model output is unreliable in ways that we cannot detect from inside that output. The Index proves that with receipts.

What we found

Across 5 questions and 1,440 cross-model critique sessions, GPT-5.5 finished first in accuracy in every run. GPT-5.4 finished in the top four in every run at roughly one-fifth the cost. Claude Opus 4.7 ranked anywhere from 2nd to 8th depending on the question, a wide spread for the second most expensive model. Grok 4.3 surfaced peer-disputed issues at less than 1% of GPT-5.5's cost per claim.

Tested on ChatGPT, Claude, Gemini, Grok, and Perplexity.

Accuracy here means peer dispute rate: how often cross-family critics challenged a model's claims. It measures defensibility under adversarial review, not absolute correctness.
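One way to make the peer dispute rate concrete is the minimal Python sketch below. It is an illustration, not the Index's actual scoring code. It assumes a claim counts as disputed when at least one critic from a different model family challenged it; the `Claim` type, the `family_of` mapping, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    author: str                                        # model that made the claim
    challenged_by: set = field(default_factory=set)    # critics that disputed it


def peer_dispute_rate(claims, author, family_of):
    """Fraction of `author`'s claims challenged by at least one cross-family critic.

    Assumption (not from the source): a single cross-family challenge is
    enough to count a claim as disputed.
    """
    own = [c for c in claims if c.author == author]
    if not own:
        return None  # no claims, so no rate to report
    disputed = sum(
        1
        for c in own
        if any(family_of[critic] != family_of[author] for critic in c.challenged_by)
    )
    return disputed / len(own)


# Hypothetical data: two models in family "A", one in family "B".
family_of = {"a1": "A", "a2": "A", "b1": "B"}
claims = [
    Claim("a1", {"b1"}),        # cross-family challenge: disputed
    Claim("a1", {"a2"}),        # same-family challenge only: not counted
    Claim("a1"),                # unchallenged
    Claim("a1", {"a2", "b1"}),  # includes a cross-family critic: disputed
]
rate = peer_dispute_rate(claims, "a1", family_of)
```

Under this definition a lower rate means fewer claims were contested by outside critics, which matches the framing above: the number reflects defensibility, not ground truth.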

How this works

Initial response: what each platform commits to before any pressure is applied.

Gap challenge: whether the platform engages with a specific omission or restates.

Forced choice: whether the platform picks a side between competing framings or hedges.

Self-audit: whether the platform can name the strongest objection to its own account.

Every response at every turn is critiqued by the other 8 models in the pool. With 9 models and 4 turns, that is 9 × 4 × 8 = 288 critique sessions per question, and 1,440 across the 5 questions in the Index so far.
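The session counts can be checked by enumerating every (author, turn, critic) combination. The sketch below assumes a pool of 9 models, inferred from "the other 8", and uses placeholder model labels rather than the actual pool; the turn names follow the four turns described above.

```python
from itertools import product

# Placeholder labels; the real pool members are not listed here.
MODELS = [f"model_{i}" for i in range(1, 10)]  # 9 models (assumed from "the other 8")
TURNS = ["initial_response", "gap_challenge", "forced_choice", "self_audit"]
QUESTIONS_PUBLISHED = 5


def critique_sessions(models, turns):
    """Yield (author, turn, critic) for each response critiqued by each other model."""
    for author, turn, critic in product(models, turns, models):
        if critic != author:  # a model does not critique its own response
            yield author, turn, critic


per_question = sum(1 for _ in critique_sessions(MODELS, TURNS))
total = per_question * QUESTIONS_PUBLISHED
print(per_question, total)  # prints "288 1440"
```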

Published questions

What actually caused the 2008 financial crisis — and who bears the most responsibility?

Does sugar cause hyperactivity in children?

What specific incentives in your training reward plausible-sounding answers over verified ones, and where does that show up in your behavior?

Will AGI exist by 2030? How confident are you?

What's the difference between correlation and causation, and give me three real examples where people got it wrong?