Methodology
The Verbatim Index runs the same question through five AI platforms over four turns, then has each platform critique the others under identical conditions. For each critique, four counts are recorded: claims verified, claims disputed, reasoning gaps, recommendations.
Controlled variables
- Same question set. Every platform answers the exact same prompts.
- Balanced category mix. Questions are stratified across categories so per-category breakdowns are real signal, not artifacts of a skewed sample.
- Same critic council. Every response is debated by the same critic set. Critic harshness becomes a constant across platforms.
- Equal sample size. Every platform answers every question. No platform is over- or under-represented.
- Same conversational shape. Every platform gets the same four turns under the same protocol. Conversational depth is controlled.
Turn protocol
T1. Seed
First-pass quality. What each platform commits to before any pressure is applied.
What actually caused the 2008 financial crisis — and who bears the most responsibility?
T2. Challenge
Rebuttal quality. Whether the platform engages with a specific objection or restates.
You didn't mention ratings agencies. Were they irrelevant? Make the strongest case that ratings agencies were actually the central cause.
T3. Forced choice
Resolve under disambiguation. Whether the platform picks a side or hedges.
If you had to choose: was this a failure of individual ethics or a failure of system design? You can't say both — pick one and defend it.
T4. Self-audit
Epistemic honesty. Whether the platform can name the strongest objection to its own account. Identical across every question.
What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.
Per-category scoring notes
AI self-knowledge
On self-knowledge questions, sources frequently make specific empirical claims about training mechanisms without citable evidence. The critique rubric flags these as disputed by design. Verifiability is a feature of the scoring, not a noise source.
Technical precision
On precision questions where citations are expected, models that ground with verifiable URLs score better on verified-vs-disputed than models that name papers without links. The rubric rewards retrievability, not just plausibility.
What standardization does not fix
Stating the limits up front is what separates a benchmark from marketing. Hide the limits, lose the credibility.
- The question set is still authored by humans. Whoever picks the 100 prompts encodes judgment. Mitigation: publish the selection methodology (category targets, difficulty calibration, sourcing rules), version-lock the set, and disclose changes.
- The critic council is still LLMs critiquing LLMs. The Index measures what other top models think of each platform's answers — not absolute truth. This is a feature (it's the same lens Verbatim users already use) but it must be named, not hidden.
- Models drift. A platform's answer to question 47 today is not necessarily its answer next month. Every run is timestamped and carries a model-version manifest. Re-runs are comparable to themselves; cross-time comparisons need explicit version pinning in the chart.
- Cost-per-debate isn't measured. A platform that gives expensive, careful answers and one that gives cheap, fast answers might score similarly on quality but be very different products. The Index measures answer quality at fixed prompts; value-per-dollar is a separate axis.
- The Index measures critic-perceived quality, not user-perceived utility. A model that produces dry, hedged, well-cited answers will score high on "claims verified" and could still be the wrong tool for someone who wants a fast creative draft.