Claude Opus Costs 7x More Than Gemini. It Ranked Lower Too.
Claude Opus 4.7 costs roughly seven times more than Gemini 3.1 Pro Preview per question. On the same benchmark question, Gemini ranked third and Opus ranked fifth.
On q-001 of the Verbatim Index v1.0 run, nine source models each answered the same four-turn prompt sequence about the 2008 financial crisis, then the other eight models scored each answer against a fixed rubric. A "scoreable claim" is one rubric line item, and the cross-family dispute rate is the share of those claims that a critic from a different vendor flagged as disputed. On cost per four-turn answer, the spread ran from Gemini 3.1 Pro Preview at $0.0267 up through Sonnet 4.6 at $0.1797 and Opus 4.7 at $0.1824, with GPT-5.5 at the top at $0.3684. On dispute rate, where lower means better-rated, the order ran GPT-5.5 at 21.97 percent, GPT-5.4 at 22.74, Gemini 3.1 Pro Preview at 27.67, Sonnet 4.6 at 28.27, Opus 4.7 at 29.59, Perplexity sonar-pro at 31.10, Grok 4.20 Reasoning at 32.75, Grok 4.3 at 33.33, and Gemini 3 Flash Preview at 36.84. Gemini 3.1 Pro Preview ranked third on quality at roughly one-seventh of Opus 4.7's price; Opus 4.7 ranked fifth at roughly seven times Gemini's cost. In the middle of the table, the cost order and the quality order ran in opposite directions.
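The arithmetic behind "roughly seven times" and "third versus fifth" can be checked directly from the figures quoted above. The following is a minimal sketch, not part of the Index tooling: the dictionary names and the helper functions `cost_ratio` and `quality_rank` are hypothetical, and only the four costs stated in the text are included.

```python
# Figures copied from the q-001 paragraph; nothing here is measured independently.
# Cost is USD per four-turn answer. Only four of the nine costs are quoted in the text.
cost_usd = {
    "Gemini 3.1 Pro Preview": 0.0267,
    "Sonnet 4.6": 0.1797,
    "Opus 4.7": 0.1824,
    "GPT-5.5": 0.3684,
}

# Cross-family dispute rate in percent; lower means better-rated.
dispute_pct = {
    "GPT-5.5": 21.97,
    "GPT-5.4": 22.74,
    "Gemini 3.1 Pro Preview": 27.67,
    "Sonnet 4.6": 28.27,
    "Opus 4.7": 29.59,
    "Perplexity sonar-pro": 31.10,
    "Grok 4.20 Reasoning": 32.75,
    "Grok 4.3": 33.33,
    "Gemini 3 Flash Preview": 36.84,
}


def cost_ratio(model: str, baseline: str = "Gemini 3.1 Pro Preview") -> float:
    """How many times the baseline's per-answer cost this model costs."""
    return cost_usd[model] / cost_usd[baseline]


def quality_rank(model: str) -> int:
    """1-based rank by dispute rate, with the lowest rate ranked first."""
    ordered = sorted(dispute_pct, key=dispute_pct.get)
    return ordered.index(model) + 1


if __name__ == "__main__":
    for name in ("Gemini 3.1 Pro Preview", "Sonnet 4.6", "Opus 4.7", "GPT-5.5"):
        print(f"{name}: {cost_ratio(name):.2f}x Gemini's cost, "
              f"quality rank {quality_rank(name)} of {len(dispute_pct)}")
```

Run as a script, this prints one line per model with its cost multiple and its rank out of nine, which makes the mismatch between price and rank easy to see.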
The architectural question is why a 7x cost premium does not buy a higher rank, and the cleanest place to look is inside one vendor. Anthropic shipped two models into this run. Claude Opus 4.7 called the web_search tool zero times on q-001 and answered from priors. Claude Sonnet 4.6 pulled 31,023 retrieved tokens and answered with sources in hand. The cheaper sibling ranked higher: 28.27 percent disputed for Sonnet, 29.59 percent for Opus, a 1.32-point gap that is small on one question but runs in the direction of more retrieval. Retrieval tracked rank inside OpenAI too, where GPT-5.4 retrieved 17,738 tokens and ranked second and GPT-5.5 retrieved 41,066 and ranked first, although there the heavier retriever was the more expensive model. In both pairs, one vendor sold two models on this question, one of them did more retrieval work, and the work showed up in the rating.
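The within-vendor comparison can be written out as a small check. This is a sketch under stated assumptions: the names `retrieved_tokens`, `vendor_pairs`, `heavier_retriever_rated_better`, and `dispute_gap` are hypothetical, and Opus 4.7 is entered as 0 retrieved tokens on the assumption that zero web_search calls means nothing was retrieved.

```python
# Figures copied from the text for the four models whose retrieval volume is quoted.
retrieved_tokens = {
    "Opus 4.7": 0,        # assumption: zero web_search calls means zero retrieved tokens
    "Sonnet 4.6": 31_023,
    "GPT-5.4": 17_738,
    "GPT-5.5": 41_066,
}

# Cross-family dispute rate in percent; lower means better-rated.
dispute_pct = {
    "Opus 4.7": 29.59,
    "Sonnet 4.6": 28.27,
    "GPT-5.4": 22.74,
    "GPT-5.5": 21.97,
}

vendor_pairs = {
    "Anthropic": ("Opus 4.7", "Sonnet 4.6"),
    "OpenAI": ("GPT-5.4", "GPT-5.5"),
}


def heavier_retriever_rated_better(a: str, b: str) -> bool:
    """True if the model that retrieved more tokens drew the lower dispute rate."""
    heavier, lighter = (a, b) if retrieved_tokens[a] > retrieved_tokens[b] else (b, a)
    return dispute_pct[heavier] < dispute_pct[lighter]


def dispute_gap(a: str, b: str) -> float:
    """Absolute difference in dispute rate, in percentage points."""
    return round(abs(dispute_pct[a] - dispute_pct[b]), 2)


if __name__ == "__main__":
    for vendor, (a, b) in vendor_pairs.items():
        print(f"{vendor}: heavier retriever rated better = "
              f"{heavier_retriever_rated_better(a, b)}, gap = {dispute_gap(a, b)} points")
```

The check holds inside each vendor but not across them: GPT-5.4 retrieved fewer tokens than Sonnet 4.6 and still drew a lower dispute rate, so retrieval volume alone does not order the table.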
That pattern reads as retrieval beating priors on contested historical questions, but it is a hypothesis, not a result. The clearest counterexample would be a heavy-retrieving model that still drew a high dispute rate, and on q-001 no such case appears. Until it has been tested across the rest of the Index, retrieval is a candidate explanation, not a settled mechanism.
Because it rests on one question, this result supports only a narrow move: test the pattern yourself. When Anthropic's Opus answers a contested empirical question for you without running a web search, run the same question through Sonnet and read where the two diverge. On q-001 those would be the points likeliest to draw outside critic disputes too, because Sonnet had retrieved sources and Opus had not. If a future Index run reverses this on a comparable question, the heuristic flips with it.
Verbatim runs this kind of review on your actual AI responses, in place, as you work. Try it free →