May 22, 2026

Which AI Is the Harshest Critic?

If you use one AI to check another AI's work, the strictness of your critic matters as much as the accuracy of your answerer. We wanted to know how far apart the strictest and most generous AI critics land when they're all reviewing the same content.

Across five Verbatim Index questions, GPT-5.5 marked 39.78% of its cross-family verdicts as "disputed" and Gemini 3.1 Pro Preview marked 25.28%, a 14.50-percentage-point spread. Dispute rate is each critic's disputed-label count divided by its disputed plus verified plus gap label counts; "gap" is a claim the critic believed should have been made and was not. "Harshest" in the title means highest dispute rate. It is a frequency measure, not a quality measure. Dispute rate tells you how often a critic flags peer claims as wrong, not whether those flags are themselves correct.
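As a minimal sketch of that definition, the dispute rate can be computed from three label counts. The function name and the example counts below are illustrative assumptions, not figures from the Verbatim Index data.

```python
def dispute_rate(disputed: int, verified: int, gap: int) -> float:
    """Percentage of a critic's labels that are "disputed".

    Denominator is disputed + verified + gap, as defined in the text.
    """
    total = disputed + verified + gap
    if total == 0:
        raise ValueError("critic produced no labels")
    return 100.0 * disputed / total


# Hypothetical counts for illustration only (not taken from the dataset).
example_rate = dispute_rate(disputed=40, verified=50, gap=10)

# Spread between the two published endpoint rates, in percentage points.
spread = round(39.78 - 25.28, 2)
```

The rate says nothing about whether any individual "disputed" label is right; it only counts how often the label is used.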

The full ranking across all five questions, highest to lowest dispute rate: GPT-5.5 39.78%, GPT-5.4 39.34%, Grok 4.3 33.53%, Gemini 3-flash-preview 30.06%, Claude Sonnet 4.6 30.03%, Grok 4.20-0309-reasoning 29.98%, Claude Opus 4.7 28.30%, Sonar-pro 28.06%, Gemini 3.1 Pro Preview 25.28%. These rates pool 13,438 cross-family verdict labels from 1,273 critic-source pairings (an average of 10.56 labels per pairing), where each pairing is one critic reviewing one source response. The full schedule is 256 cross-family pairings per question: 9 source models, each reviewed by the 8 other models in every round, gives 72 pairings per round; subtracting the 8 same-family pairings per round leaves 64; and 4 rounds give 256. The 8 same-family pairings come from the four two-model families (OpenAI, Anthropic, Google, xAI), each contributing one pairing in each direction; Sonar-pro is a single-model family and contributes none. Across five questions that is 1,280 scheduled pairings, of which 1,273 completed; the seven failures, all on q-005, were timeouts and parse errors. Dropping "gap" from the denominator changes the percentages, and GPT-5.4 overtakes GPT-5.5 for the top spot, but an OpenAI model still leads and Gemini 3.1 Pro Preview still anchors the bottom: under disputed/(disputed+verified) alone, GPT-5.4 leads at 53.49% with GPT-5.5 at 51.28%, and Gemini 3.1 Pro Preview is lowest at 33.14%.
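The schedule arithmetic can be checked by enumerating the pairings directly. This is a sketch under the assumptions stated in the text (nine models, every model reviews every other model each round, same-family pairings excluded); the family key used for Sonar-pro is a placeholder label, and the variable names are illustrative.

```python
from itertools import permutations

# Family membership as described in the text. The key for Sonar-pro is a
# placeholder: the text only says it is a single-model family.
FAMILIES = {
    "OpenAI": ["GPT-5.5", "GPT-5.4"],
    "Anthropic": ["Claude Sonnet 4.6", "Claude Opus 4.7"],
    "Google": ["Gemini 3-flash-preview", "Gemini 3.1 Pro Preview"],
    "xAI": ["Grok 4.3", "Grok 4.20-0309-reasoning"],
    "sonar (single-model family)": ["Sonar-pro"],
}
family_of = {m: fam for fam, models in FAMILIES.items() for m in models}
models = list(family_of)

# Ordered (critic, source) pairs: a critic never reviews itself.
all_pairs = list(permutations(models, 2))
cross_family = [(c, s) for c, s in all_pairs if family_of[c] != family_of[s]]

ROUNDS = 4
QUESTIONS = 5
FAILED = 7  # failures reported on q-005

per_round = len(cross_family)
per_question = per_round * ROUNDS
scheduled = per_question * QUESTIONS
completed = scheduled - FAILED

# Average number of verdict labels per completed pairing.
labels_per_pairing = 13_438 / completed

print(len(all_pairs), per_round, per_question, scheduled, completed)
```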

OpenAI is the only family whose two models cluster tightly. GPT-5.5 (39.78%) and GPT-5.4 (39.34%) sit 0.44 points apart at the top of the table; GPT-5.4 still leads the highest non-OpenAI critic (Grok 4.3 at 33.53%) by 5.81 points. The other three two-model families do not cluster as tightly: the two Claude models span 1.73 points across mid-table ranks, the two Gemini models span 4.78 points across non-adjacent ranks, the two Grok models span 3.55 points across non-adjacent ranks. The OpenAI lead is consistent across every question individually, not just the aggregate: on q-001 the lower-ranked OpenAI model (GPT-5.5 at 37.31%) still beat the highest non-OpenAI critic (Grok 4.3 at 35.00%); on q-002, q-003, q-004, and q-005 the same pattern holds, with q-004 the widest at 41.32% vs 35.59%. With n=2 per family, this is suggestive rather than established, but it is the cleanest within-family pattern in the data.
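The within-family spreads quoted above follow from the published aggregate rates. The sketch below recomputes them; the dictionary layout and variable names are assumptions made for illustration, while the rates themselves are the ones listed in the ranking.

```python
# Aggregate dispute rates (percent) for the four two-model families.
RATES = {
    "OpenAI": {"GPT-5.5": 39.78, "GPT-5.4": 39.34},
    "Anthropic": {"Claude Sonnet 4.6": 30.03, "Claude Opus 4.7": 28.30},
    "Google": {"Gemini 3-flash-preview": 30.06, "Gemini 3.1 Pro Preview": 25.28},
    "xAI": {"Grok 4.3": 33.53, "Grok 4.20-0309-reasoning": 29.98},
}

# Spread = gap between a family's two models, in percentage points.
spreads = {
    family: round(max(rates.values()) - min(rates.values()), 2)
    for family, rates in RATES.items()
}

# Gap between the lower OpenAI model and the highest non-OpenAI critic
# among these families.
lowest_openai = min(RATES["OpenAI"].values())
highest_other = max(
    rate
    for family, rates in RATES.items()
    if family != "OpenAI"
    for rate in rates.values()
)
openai_lead = round(lowest_openai - highest_other, 2)

tightest_family = min(spreads, key=spreads.get)
```

Sonar-pro is omitted because a single-model family has no spread to compute.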

A reader looking to pick a second-opinion model might object that a higher dispute rate just means more false positives to read, since the rate itself does not tell you whether the disputes are correct. That is the right objection, and it sets the right expectation. Across the five questions, GPT-5.5 produced 821 disputed verdicts over 140 source pairings and Gemini 3.1 Pro Preview produced 229 over 139, so picking GPT-5.5 over Gemini 3.1 Pro Preview means evaluating roughly 3.6 times as many flagged claims on the same source pool, with no information yet on which set contains fewer false alarms. Use the ranking the way you would use a queue length, not the way you would use an accuracy score. It predicts how much you will have to review, not whether the flags are correct.
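The workload comparison is simple arithmetic on the counts just quoted. As a rough sketch (variable names are illustrative), the ratio can be taken either on raw disputed counts or per pairing, and both round to the same figure:

```python
# Disputed verdicts and source pairings reported for the two endpoint critics.
gpt_disputed, gpt_pairings = 821, 140
gemini_disputed, gemini_pairings = 229, 139

# Flags a reviewer would need to read per source response.
gpt_per_pairing = gpt_disputed / gpt_pairings
gemini_per_pairing = gemini_disputed / gemini_pairings

# Workload ratio, computed two ways.
raw_ratio = gpt_disputed / gemini_disputed
per_pairing_ratio = gpt_per_pairing / gemini_per_pairing

print(f"{gpt_per_pairing:.2f} vs {gemini_per_pairing:.2f} flags per pairing")
print(f"ratio: {raw_ratio:.1f}x raw, {per_pairing_ratio:.1f}x per pairing")
```

Neither ratio says which critic's flags are more often correct; that would need a separate check of the flagged claims themselves.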

Verbatim runs this kind of review on your actual AI responses, in place, as you work.