Verbatim


Does sugar cause hyperactivity in children?

We asked nine AI models this question. Then we made them grade each other.

Question category

Confident factual traps

Almost universally believed, yet thoroughly debunked in peer-reviewed literature since 1995, this question directly demonstrates the gap between confident popular belief and actual evidence. It is the same gap that makes AI output dangerous when acted on without verification.

Models in this run · May 21, 2026

Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, Grok 4.20, Perplexity Sonar Pro

How this works

T1

Does sugar cause hyperactivity in children?

Initial response: what each platform commits to before any pressure is applied.

T2

You didn't mention the role of parental perception or expectation effects. Make the strongest case that the parent-perception studies — parents reporting hyperactivity after being told their child consumed sugar, even when the child received placebo — are the central finding here, not the direct sugar-behavior data.

Gap challenge: whether the platform engages with a specific missing element or restates.

T3

If you had to choose: is the persistent belief that sugar causes hyperactivity in children primarily a cognitive bias by individual observers (parents seeing what they expect to see), or primarily a contextual confound (kids appear hyperactive at parties and other sugar-rich events for reasons unrelated to sugar itself)? You can't say both — pick one and defend it.

Forced choice: whether the platform picks a side between competing frameworks or hedges.

T4

What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.

Self-audit: whether the platform can name the strongest objection to its own account.

Every response at every turn was critiqued by the other 8 models in the pool. That's 288 critique sessions for this question.

View
By source. Each row is one source model — the AI that produced the responses being critiqued. Numbers are how the other models in the cross-vendor pool scored those responses, with same-family critics excluded from the headline counts (cross-family rule, A1 of the methodology).

Dispute vs consensus

Each bubble is one platform. Position tells you its stance: bottom-right (high dispute, low verified) reads as challenge-heavy; top-left (low dispute, high verified) reads as consensus; top-right is thorough.

[Figure: bubble chart, one bubble per source model. Horizontal axis: issues surfaced per debate. Vertical axis: claims verified per debate. A balanced (y = x) reference line divides the regions labeled thorough, challenge-heavy, and consensus. Perplexity Sonar Pro has 32 critiques; every other model has 28.]

What this view shows

Grok 4.20's responses drew the most pushback in this run: 3.64 disputed claims per critique against a field average of 3.10. Claude Sonnet 4.6, Gemini 3 Flash, and Grok 4.3 tied for next-most-challenged at 3.61.

Claude Sonnet 4.6 drew the most verified claims: 4.86 per critique against a field average of 4.20. Its dispute rate of 3.61 was also above the 3.10 field average, so critics found more to confirm in Claude Sonnet 4.6's responses than in any other source model's, but not less to dispute.

The sharpest cross-model difference appeared on disputed claims: 3.64 per critique for Grok 4.20 vs. 1.79 for GPT-5.5 — a 1.86-point gap, wider than the spread on any of the other three metrics.
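The field averages and spreads quoted in this view can be re-derived with a few lines of arithmetic. The sketch below is a minimal illustration that uses the rounded per-model figures from the per-metric breakdown further down the page; the dictionary layout and variable names are assumptions made for the example. Because the inputs are already rounded to two decimals, a result can differ from a published figure in the last digit (the disputed-claims spread comes out at 1.85 here against the 1.86 quoted above).

```python
from statistics import mean

# Rounded per-model scores as published in the per-metric breakdown.
# The nested-dict layout is an illustrative choice, not the benchmark's format.
scores = {
    "disputed": {
        "Grok 4.20": 3.64, "Claude Sonnet 4.6": 3.61, "Gemini 3 Flash": 3.61,
        "Grok 4.3": 3.61, "Gemini 3.1 Pro": 3.39, "Claude Opus 4.7": 3.25,
        "Perplexity Sonar Pro": 2.94, "GPT-5.4": 2.07, "GPT-5.5": 1.79,
    },
    "verified": {
        "Claude Sonnet 4.6": 4.86, "Gemini 3 Flash": 4.68, "Grok 4.20": 4.32,
        "GPT-5.5": 4.29, "Gemini 3.1 Pro": 4.29, "Grok 4.3": 3.93,
        "Perplexity Sonar Pro": 3.91, "GPT-5.4": 3.79, "Claude Opus 4.7": 3.71,
    },
    "gaps": {
        "Grok 4.3": 2.68, "Grok 4.20": 2.68, "Gemini 3.1 Pro": 2.57,
        "Gemini 3 Flash": 2.50, "GPT-5.5": 2.39, "GPT-5.4": 2.29,
        "Claude Opus 4.7": 2.25, "Perplexity Sonar Pro": 2.22,
        "Claude Sonnet 4.6": 2.07,
    },
    "recommendations": {
        "Grok 4.3": 4.25, "Grok 4.20": 3.75, "Gemini 3.1 Pro": 3.57,
        "Gemini 3 Flash": 3.46, "GPT-5.4": 3.39, "GPT-5.5": 3.29,
        "Claude Opus 4.7": 3.18, "Perplexity Sonar Pro": 3.13,
        "Claude Sonnet 4.6": 3.07,
    },
}

# Unweighted mean across the nine source models, per metric.
field_average = {m: mean(by_model.values()) for m, by_model in scores.items()}

# Spread = highest per-model score minus lowest, per metric.
spread = {
    m: max(by_model.values()) - min(by_model.values())
    for m, by_model in scores.items()
}

for metric in scores:
    print(f"{metric}: mean={field_average[metric]:.2f}, spread={spread[metric]:.2f}")

# The metric with the widest spread across source models.
widest = max(spread, key=spread.get)
print("widest spread:", widest)
```

On these rounded inputs the disputed-claims metric has the widest spread, which matches the comparison made in the text.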

The Answer

It does not. That much was unanimous across nine frontier AI models, almost all of them citing the same 1995 meta-analysis by Wolraich and colleagues in JAMA and the same 1994 expectancy experiment by Hoover and Milich. The factual answer to the sugar question is one of the most stable findings in the entire benchmark dataset.

What makes this question worth asking anyway is what happens next. Once you grant the science, you are left with a much harder problem: why do millions of parents continue to report seeing the effect? The run pushed each model to pick a side, between a cognitive bias explanation (parents see what they expect) and a contextual confound explanation (kids are wild at birthday parties for reasons other than the cake). Eight of nine models picked cognitive bias. GPT-5.4 was the lone dissenter. The disagreement is interpretive, not factual: the same set of studies supports two coherent stories about which mechanism is doing more of the work.

What everyone agreed on

The Wolraich et al. JAMA meta-analysis of 16 double-blind, placebo-controlled studies (often reported as 23 studies including separate analyses) is real, accurately summarized, and found no significant effects of sugar on behavior or cognitive performance in children, including children whose parents had labeled them sugar-sensitive. Eight of the nine T1 responses cited it. In the Hoover and Milich 1994 study, mothers who were told their sons had received sugar (when in fact all the boys received an aspartame placebo) rated their sons as more hyperactive and behaved more critically toward them. That study was cited by eight of nine models and verified by cross-family critics across the board.

Smaller factual claims drew more friction. Claude Sonnet 4.6's attribution of the sugar myth to "a single study from the mid-1970s" was disputed by Grok 4.3, Grok 4.20, and Sonar Pro, who noted the myth traces to clinical observation by William Crook and to Feingold's 1975 diet, which targeted artificial colors and salicylates, not sugar. A quote attributed to Mark Corkins, identified as chair of the American Academy of Pediatrics nutrition committee, was verified by Grok 4.20 against Washington Post reporting but disputed by Grok 4.3. Smaller recall errors appeared in seven of nine T1 responses.

The core split

When the prompt forced each model to choose between cognitive bias and contextual confound as the primary mechanism keeping the sugar myth alive, eight of nine picked cognitive bias. Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.5, both Gemini models, Grok 4.3, Grok 4.20, and Sonar Pro all argued the same way: the Hoover and Milich experiment is the dispositive evidence, because it isolates the belief variable while holding sugar constant at zero. Claude Opus 4.7 put it most directly: "The sugar wasn't there. The party wasn't there. The only variable was what the parent expected to see, and that alone produced the perception of hyperactivity."

GPT-5.4 was the single dissenter. Its argument was that the belief did not form in a lab; it formed at birthday parties and Halloween and on holidays, where children really do become more activated by novelty, peers, and disrupted bedtimes. Contextual confound, in this reading, supplies the raw material for the belief. Cognitive bias merely preserves it. This is a defensible position. Several cross-family critics observed that GPT-5.4 was not denying the expectancy effect, only ranking it secondary. Cross-family critics in the cognitive-bias camp pushed back hard: if it were primarily context, the belief would not persist outside party settings, and parents do report sugar-induced hyperactivity on quiet Tuesday afternoons.

The 8-to-1 split is not a vendor family signature. OpenAI's two models split on the question; every other family converged on cognitive bias.

The numbers

The 1995 JAMA meta-analysis was variously described as covering "16 studies," "23 studies," and "16 reports and 23 experiments." Cross-family critics caught the inconsistency only intermittently. The actual paper analyzes 16 separate reports, with effect-size calculations across additional within-subject comparisons; the higher counts conflate the two.

Models also disagreed about the Hoover and Milich sample. Claude Sonnet 4.6 said 35 boys aged 5–7. Grok 4.20 said 31 boys aged 6–12. Most other models cited the study without a sample count. Cross-family critics raised the inconsistency without resolving it. This pattern is consistent across the dataset: the models recall the study's existence and conclusion correctly but blur the demographic specifics.

The clean numerical fact is that the Wolraich meta-analysis reported effect sizes near zero across the behavioral and cognitive outcomes measured, with confidence intervals that did not exclude zero on the primary endpoints. That is what "no effect" actually means here.

What broke down

The T2 prompt asked each model to make the strongest case that the parent-perception studies, not the direct sugar-behavior data, were the central finding. Every model engaged. What the prompt did not capture is that every T1 response had already discussed parental perception. The "you didn't mention X" rhetorical setup was preempted by all nine T1s, and the models executed the T2 instruction anyway, defending a point they had already made. This is a flaw in prompt design caught later in the benchmark's prompt audit, not a failure of the models. The T2 essays are valid analytical content. They just answer a question that did not need asking.

GPT-5.4's T4 declined to audit anything until the user pasted a specific claim. GPT-5.5's T4 did the same. This was the same pattern observed on other Verbatim runs: the T4 prompt invites evasion, and several models took it.

The answer

Sugar does not cause hyperactivity in children. This is established to whatever standard "established" means in nutrition science. The remaining question is the mechanism behind the persistent folk belief, and the dataset supports the cognitive bias account more strongly than the contextual confound account. Hoover and Milich removed the sugar, removed the party, kept the belief, and produced the effect. That is what the eight-of-nine consensus is pointing at.

The more careful reading of the dataset is that both mechanisms operate together. Sonar Pro framed this best: "Context supplies ambiguous behavior; cognitive bias does the causal misattribution." That is the answer the underlying psychology actually supports. The forced choice in T3 produces clean rhetoric. The truth has two mechanisms.

For practical purposes, the answer to a parent who insists they have seen the effect is that they have, but they have seen the effect of their own expectation altering both their perception of the child and their behavior toward the child. The sugar is incidental. That is an uncomfortable thing to tell a parent, which is part of why the myth survives.

Methodology

This synthesis was produced from Verbatim Index run q-002, run ID e2f74472-ca64-4190-9361-6284dc48024a, completed May 2026. Nine source models each answered four structured turns on the same question. Each response was scored by the other eight models on verified, disputed, gap, and recommendation counts. 256 cross-family critique sessions were completed (288 total including same-family, which are excluded from all claim counts). The full dataset is published at helloverbatim.com/benchmark/q/q-002.
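The session counts above follow directly from the pool structure. The sketch below is a minimal illustration, assuming the vendor-family assignments implied by the model names (the `FAMILY` mapping and the function name are hypothetical, not part of the published methodology): each source response is critiqued by the other eight models on each of four turns, and the cross-family rule drops critics from the same vendor family.

```python
# Vendor families inferred from the model names; treat this mapping as an assumption.
FAMILY = {
    "Claude Sonnet 4.6": "Anthropic",
    "Claude Opus 4.7": "Anthropic",
    "GPT-5.4": "OpenAI",
    "GPT-5.5": "OpenAI",
    "Gemini 3 Flash": "Google",
    "Gemini 3.1 Pro": "Google",
    "Grok 4.3": "xAI",
    "Grok 4.20": "xAI",
    "Perplexity Sonar Pro": "Perplexity",
}
TURNS = 4  # T1 through T4


def critique_counts(family, turns):
    """Return (total sessions, cross-family sessions, cross-family sessions per source)."""
    total = 0
    per_source = {}
    for source in family:
        # Every other model in the pool critiques this source on every turn.
        critics = [m for m in family if m != source]
        # Cross-family rule: exclude critics from the source's own vendor family.
        cross = [m for m in critics if family[m] != family[source]]
        total += len(critics) * turns
        per_source[source] = len(cross) * turns
    return total, sum(per_source.values()), per_source


total, cross_family, per_source = critique_counts(FAMILY, TURNS)
print(total, cross_family)                  # 288 256
print(per_source["Perplexity Sonar Pro"])   # 32
print(per_source["GPT-5.4"])                # 28
```

Perplexity Sonar Pro has no sibling in the pool, so it keeps all 32 critiques, while each of the other eight models loses the four sessions written by its sibling and keeps 28.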

Where the tokens went

Each bar shows total source-side tokens (input + output) per model across the four turns. The colored segment is what web-search retrievals dumped into the model's input — that's the cost driver. The gray segment is everything else: prompt plus the model's actual reasoning + output. Bar color is the model's palette identity (same color it carries on every other chart). Sorted by retrieval share, highest first.

GPT-5.5: 77.1K tokens total, 95% retrieval (73.1K retrieval tokens)
GPT-5.4: 19.7K tokens total, 90% retrieval (17.7K retrieval tokens)
Claude Sonnet 4.6: 51.1K tokens total, 86% retrieval (44.1K retrieval tokens, 7.0K other)
Claude Opus 4.7: 17.2K tokens total, 0% retrieval
Gemini 3 Flash: 3.0K tokens total, retrieval n/a
Gemini 3.1 Pro: 2.6K tokens total, retrieval n/a
Grok 4.3: 1.8K tokens total, retrieval n/a
Grok 4.20: 3.4K tokens total, retrieval n/a
Perplexity Sonar Pro: 5.7K tokens total, retrieval n/a
Retrieval is n/a where no web-search tool is exposed, or where billing is per-query so retrieval does not appear in input tokens.
Reading note. Anthropic and OpenAI bill web-search retrieval as input tokens, so we can measure the volume directly. Gemini bills per-query (separate billing surface — not in input tokens). Grok and Perplexity don't expose a web-search tool to us, so retrieval is 0 by construction.
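The retrieval share is the retrieval segment divided by the total source-side tokens. The sketch below is a minimal illustration for the four models whose retrieval volume is measurable; it assumes the segment labels on the chart (73.1K, 17.7K, 44.1K) are the retrieval volumes, and it takes Claude Opus 4.7's retrieval volume as zero because the chart reports 0%. The variable names are hypothetical.

```python
# (total tokens, retrieval tokens), both in thousands, read off the chart.
# Assumption: the labeled segment is the retrieval volume; Opus 4.7 is taken as 0.
usage_k = {
    "GPT-5.5": (77.1, 73.1),
    "GPT-5.4": (19.7, 17.7),
    "Claude Sonnet 4.6": (51.1, 44.1),
    "Claude Opus 4.7": (17.2, 0.0),
}

# Retrieval share = retrieval tokens / total source-side tokens.
shares = {model: retrieval / total for model, (total, retrieval) in usage_k.items()}

# Sort by retrieval share, highest first, matching the chart's ordering.
ranked = sorted(shares, key=shares.get, reverse=True)

for model in ranked:
    total, retrieval = usage_k[model]
    other = total - retrieval  # prompt plus the model's own reasoning and output
    print(f"{model}: {shares[model]:.0%} retrieval, {other:.1f}K other")
```

Models reported as n/a are left out on purpose: their retrieval volume is not visible in input tokens, so a share cannot be computed from this data.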

Cost per disputed claim

A disputed claim is one flagged by cross-family critics as factually questionable. Lower is better: it shows how cheaply each source surfaces a useful negative-evidence signal. The chart uses a log scale, and the spread across the pool is roughly 275×, from $0.000035 for Grok 4.3 to $0.009611 for GPT-5.5.

Grok 4.3: $0.000035
Grok 4.20: $0.000075
Gemini 3 Flash: $0.000086
Gemini 3.1 Pro: $0.000315
GPT-5.4: $0.001226
Perplexity Sonar Pro: $0.001317
Claude Opus 4.7: $0.001947
Claude Sonnet 4.6: $0.002065
GPT-5.5: $0.009611
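The width of this chart can be checked by dividing the most expensive listed value by the cheapest. The sketch below is a minimal illustration using the nine published figures; the variable names are hypothetical, and the number of decades is included only because the chart uses a log scale.

```python
import math

# Cost per disputed claim in US dollars, as listed on the chart.
cost_per_disputed = {
    "Grok 4.3": 0.000035,
    "Grok 4.20": 0.000075,
    "Gemini 3 Flash": 0.000086,
    "Gemini 3.1 Pro": 0.000315,
    "GPT-5.4": 0.001226,
    "Perplexity Sonar Pro": 0.001317,
    "Claude Opus 4.7": 0.001947,
    "Claude Sonnet 4.6": 0.002065,
    "GPT-5.5": 0.009611,
}

cheapest = min(cost_per_disputed, key=cost_per_disputed.get)
priciest = max(cost_per_disputed, key=cost_per_disputed.get)

# Ratio between the extremes, and the same gap in powers of ten.
ratio = cost_per_disputed[priciest] / cost_per_disputed[cheapest]
decades = math.log10(ratio)

print(f"{cheapest} -> {priciest}: {ratio:.0f}x, {decades:.2f} decades on a log axis")
```

On the listed figures the ratio is about 275, which is a little under two and a half decades on the log axis.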

Per-metric breakdown

Each platform's score on the four metrics, sorted within each. Higher isn't automatically better — read “disputed” together with “verified” to tell a harsh critic from a thorough one.

Issues per debate

Grok 4.20: 3.64
Claude Sonnet 4.6: 3.61
Gemini 3 Flash: 3.61
Grok 4.3: 3.61
Gemini 3.1 Pro: 3.39
Claude Opus 4.7: 3.25
Perplexity Sonar Pro: 2.94
GPT-5.4: 2.07
GPT-5.5: 1.79

Claims verified per debate

Claude Sonnet 4.6: 4.86
Gemini 3 Flash: 4.68
Grok 4.20: 4.32
GPT-5.5: 4.29
Gemini 3.1 Pro: 4.29
Grok 4.3: 3.93
Perplexity Sonar Pro: 3.91
GPT-5.4: 3.79
Claude Opus 4.7: 3.71

Reasoning gaps per debate

Grok 4.3: 2.68
Grok 4.20: 2.68
Gemini 3.1 Pro: 2.57
Gemini 3 Flash: 2.50
GPT-5.5: 2.39
GPT-5.4: 2.29
Claude Opus 4.7: 2.25
Perplexity Sonar Pro: 2.22
Claude Sonnet 4.6: 2.07

Recommendations per debate

Grok 4.3: 4.25
Grok 4.20: 3.75
Gemini 3.1 Pro: 3.57
Gemini 3 Flash: 3.46
GPT-5.4: 3.39
GPT-5.5: 3.29
Claude Opus 4.7: 3.18
Perplexity Sonar Pro: 3.13
Claude Sonnet 4.6: 3.07

Per-critique averages

Same data, head-to-head per platform. Easier to read overall stance at a glance.

Perplexity Sonar Pro (32 critiques, 32 debates): issues 2.94, verified 3.91, gaps 2.22, recs 3.13
Claude Sonnet 4.6 (28 critiques, 28 debates): issues 3.61, verified 4.86, gaps 2.07, recs 3.07
Claude Opus 4.7 (28 critiques, 28 debates): issues 3.25, verified 3.71, gaps 2.25, recs 3.18
GPT-5.4 (28 critiques, 28 debates): issues 2.07, verified 3.79, gaps 2.29, recs 3.39
GPT-5.5 (28 critiques, 28 debates): issues 1.79, verified 4.29, gaps 2.39, recs 3.29
Gemini 3 Flash (28 critiques, 28 debates): issues 3.61, verified 4.68, gaps 2.50, recs 3.46
Gemini 3.1 Pro (28 critiques, 28 debates): issues 3.39, verified 4.29, gaps 2.57, recs 3.57
Grok 4.3 (28 critiques, 28 debates): issues 3.61, verified 3.93, gaps 2.68, recs 4.25
Grok 4.20 (28 critiques, 28 debates): issues 3.64, verified 4.32, gaps 2.68, recs 3.75
