The Verbatim Index
An AI's response is the sum of two things: information and the reasoning applied to that information. Both are fundamentally imperfect. An AI model (Claude Opus 4.7, GPT-5.5, etc.) compensates by writing with confidence, authority, coherence, and fluency — qualities that signal correctness without necessarily producing it.
The Verbatim Index is an attempt to correct for this by systematically separating the information layer from the reasoning layer and testing both under cross-examination. We run the same question through 9 frontier models, then have each one critique the others across four structured turns. What survives is verified. What gets challenged is disputed. What no model surfaces is a gap. The confidence, authority, coherence, and fluency are stripped away.
A single-model output is unreliable in ways that we cannot detect from inside that output. The Index proves that with receipts.
What we found
Across 5 questions and 1,440 cross-model critique sessions, GPT-5.5 finished first in accuracy in every run. GPT-5.4 finished in the top four every run at roughly one-fifth the cost. Claude Opus 4.7 ranked anywhere from 2nd to 8th depending on the question, a high variance for the second most expensive model. Grok 4.3 surfaced peer-disputed issues at less than 1% of GPT-5.5's cost per claim.
Accuracy here means peer dispute rate: how often cross-family critics challenged a model's claims. It measures defensibility under adversarial review, not absolute correctness.
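As a rough illustration of that definition, the sketch below computes a peer dispute rate from a list of critique records. It is not the Index's actual pipeline: the `Critique` record, its field names, the family labels, and the choice to count one record per critique are all assumptions made for the example, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Critique:
    # Hypothetical record layout; the real Index schema is not published here.
    author: str          # model whose claim is under review
    author_family: str   # assumed vendor/family label for the author
    critic: str          # model doing the critique
    critic_family: str   # assumed vendor/family label for the critic
    disputed: bool       # True if the critic challenged the claim


def peer_dispute_rate(critiques: list[Critique], model: str) -> Optional[float]:
    """Share of cross-family critiques of `model` that raised a dispute."""
    relevant = [
        c for c in critiques
        if c.author == model and c.critic_family != c.author_family
    ]
    if not relevant:
        return None  # no cross-family critiques, so the rate is undefined
    return sum(c.disputed for c in relevant) / len(relevant)


# Invented sample data: two cross-family critiques, one same-family critique.
sample = [
    Critique("model_a", "family_x", "model_b", "family_y", True),
    Critique("model_a", "family_x", "model_c", "family_z", False),
    Critique("model_a", "family_x", "model_d", "family_x", True),  # excluded
]

rate = peer_dispute_rate(sample, "model_a")
print(rate)  # 0.5
```

A lower rate means fewer challenges from critics outside the model's own family, which is why it reads as defensibility rather than correctness: a claim that nobody disputes can still be wrong.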
How this works
Initial response: what each platform commits to before any pressure is applied.
Gap challenge: whether the platform engages with a specific omission or restates.
Forced choice: whether the platform picks a side between competing framings or hedges.
Self-audit: whether the platform can name the strongest objection to its own account.
Every response at every turn is critiqued by the other 8 models in the pool. That is 288 critique sessions per question, and 1,440 across the Index so far.
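The session counts follow directly from the pool size and the number of turns. The short sketch below reproduces the arithmetic; the constant and function names are illustrative, while the values 9, 4, and 5 come from the text above.

```python
MODELS = 9      # models in the pool
TURNS = 4       # initial response, gap challenge, forced choice, self-audit
QUESTIONS = 5   # questions run so far


def sessions_per_question(models: int, turns: int) -> int:
    # One response per model per turn, each critiqued by every other model.
    return models * turns * (models - 1)


per_question = sessions_per_question(MODELS, TURNS)
total = per_question * QUESTIONS
print(per_question, total)  # 288 1440
```

Because the count grows with models × (models − 1), adding a tenth model would raise the per-question total to 360 sessions under the same four-turn structure.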