Does sugar cause hyperactivity in children?
We asked nine AI models this question. Then we made them grade each other.
Question category
Confident factual traps
The claim is universally believed yet has been thoroughly debunked in peer-reviewed literature since 1995. It directly demonstrates the gap between confident popular belief and actual evidence, the same gap that makes AI output dangerous when acted on without verification.
Models in this run · May 21, 2026
Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, Grok 4.20, Perplexity Sonar Pro
How this works
Does sugar cause hyperactivity in children?
T1, initial response: what each platform commits to before any pressure is applied.
You didn't mention the role of parental perception or expectation effects. Make the strongest case that the parent-perception studies — parents reporting hyperactivity after being told their child consumed sugar, even when the child received placebo — are the central finding here, not the direct sugar-behavior data.
T2, gap challenge: whether the platform engages with a specific missing element or restates.
If you had to choose: is the persistent belief that sugar causes hyperactivity in children primarily a cognitive bias by individual observers (parents seeing what they expect to see), or primarily a contextual confound (kids appear hyperactive at parties and other sugar-rich events for reasons unrelated to sugar itself)? You can't say both — pick one and defend it.
T3, forced choice: whether the platform picks a side between competing frameworks or hedges.
What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.
T4, self-audit: whether the platform can name the strongest objection to its own account.
Every response at every turn was critiqued by the other 8 models in the pool. With 9 models and 4 turns, that's 9 × 4 × 8 = 288 critique sessions for this question.
Dispute vs consensus
Each bubble is one platform. Position tells you its stance: bottom-right (high dispute, low verified) reads as challenge-heavy; top-left (low dispute, high verified) reads as consensus; top-right (high dispute, high verified) reads as thorough.
What this view shows
Grok 4.20's responses drew the most pushback in this run — 3.64 disputed claims per critique against a field average of 3.10. The next-most-challenged was Claude Sonnet 4.6 at 3.61.
Claude Sonnet 4.6 drew the most verified claims: 4.86 per critique against a field average of 4.20. Its dispute rate of 3.61 was also the second highest in the run, so critics found more to confirm in Claude Sonnet 4.6's responses than in any other source model's, but not less to dispute.
The sharpest cross-model difference appeared on disputed claims: 3.64 per critique for Grok 4.20 vs. 1.79 for GPT-5.5, a gap of about 1.85 points, wider than the spread on any of the other three metrics.
The Answer
It does not. That much was unanimous across nine frontier AI models, most of them citing the same 1995 meta-analysis by Wolraich and colleagues in JAMA and the same 1994 expectancy experiment by Hoover and Milich. The factual answer to the sugar question is one of the most stable findings in the entire benchmark dataset.
What makes this question worth asking anyway is what happens next. Once you grant the science, you are left with a much harder problem: why do millions of parents continue to report seeing the effect? The run pushed each model to pick a side, between a cognitive bias explanation (parents see what they expect) and a contextual confound explanation (kids are wild at birthday parties for reasons other than the cake). Eight of nine models picked cognitive bias. GPT-5.4 was the lone dissenter. The disagreement is interpretive, not factual: the same set of studies supports two coherent stories about which mechanism is doing more of the work.
What everyone agreed on
The Wolraich et al. JAMA meta-analysis of 16 double-blind, placebo-controlled studies (often reported as 23 studies including separate analyses) is real, was accurately summarized, and found no significant effects of sugar on behavior or cognitive performance in children, including children whose parents had labeled them sugar-sensitive. Eight of the nine T1 responses cited it. In the Hoover and Milich 1994 study, mothers who were told their sons had received sugar (when in fact all the boys received an aspartame placebo) rated their sons as more hyperactive and behaved more critically toward them. Eight of nine models cited it, and cross-family critics verified it across the board.
Smaller factual claims met more friction. Claude Sonnet 4.6's attribution of the sugar myth to "a single study from the mid-1970s" was disputed by Grok 4.3, Grok 4.20, and Sonar Pro, who noted the myth traces to clinical observation by William Crook and to Feingold's 1975 diet, which targeted artificial colors and salicylates, not sugar. The "Mark Corkins" quote attributed to the chair of the American Academy of Pediatrics nutrition committee was verified by Grok 4.20 against Washington Post reporting but disputed by Grok 4.3. Smaller recall errors appeared in seven of nine T1 responses.
The core split
When the prompt forced each model to choose between cognitive bias and contextual confound as the primary mechanism keeping the sugar myth alive, eight of nine picked cognitive bias. Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.5, both Gemini models, Grok 4.3, Grok 4.20, and Sonar Pro all argued the same way: the Hoover and Milich experiment is the dispositive evidence, because it isolates the belief variable while holding sugar constant at zero. Claude Opus 4.7 put it most directly: "The sugar wasn't there. The party wasn't there. The only variable was what the parent expected to see, and that alone produced the perception of hyperactivity."
GPT-5.4 was the single dissenter. Its argument was that the belief did not form in a lab; it formed at birthday parties and Halloween and on holidays, where children really do become more activated by novelty, peers, and disrupted bedtimes. Contextual confound, in this reading, supplies the raw material for the belief. Cognitive bias merely preserves it. This is a defensible position. Several cross-family critics observed that GPT-5.4 was not denying the expectancy effect, only ranking it secondary. Cross-family critics in the cognitive-bias camp pushed back hard: if it were primarily context, the belief would not persist outside party settings, and parents do report sugar-induced hyperactivity on quiet Tuesday afternoons.
The 8-to-1 split is not a vendor family signature. OpenAI's two models split on the question; every other family converged on cognitive bias.
The numbers
The 1995 JAMA meta-analysis was variously described as covering "16 studies," "23 studies," and "16 reports and 23 experiments." Cross-family critics caught the inconsistency only intermittently. The actual paper analyzes 16 separate reports, with effect-size calculations across additional within-subject comparisons; the higher counts conflate the two.
Models also disagreed about the Hoover and Milich sample. Claude Sonnet 4.6 said 35 boys aged 5–7. Grok 4.20 said 31 boys aged 6–12. Most other models cited the study without a sample count. Cross-family critics raised the inconsistency without resolving it. This pattern is consistent across the dataset: the models recall the study's existence and conclusion correctly but blur the demographic specifics.
The clean numerical fact is that the Wolraich meta-analysis reported effect sizes near zero across the behavioral and cognitive outcomes measured, with confidence intervals that did not exclude zero on the primary endpoints. That is what "no effect" actually means here.
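What "confidence intervals that did not exclude zero" means can be sketched in a few lines. This is a minimal illustration, not the paper's analysis: the function names, the normal-approximation 95% interval, and the input values are all assumptions chosen for demonstration, and none of the numbers come from Wolraich et al.

```python
def confidence_interval(effect, std_error, z=1.96):
    """Return an approximate 95% interval around an effect-size estimate.

    Assumes a normal approximation; z=1.96 is the usual 95% multiplier.
    """
    half_width = z * std_error
    return effect - half_width, effect + half_width


def excludes_zero(effect, std_error, z=1.96):
    """True when the whole interval sits on one side of zero."""
    low, high = confidence_interval(effect, std_error, z)
    return low > 0 or high < 0


# Hypothetical values for illustration only (not taken from the meta-analysis).
near_zero = excludes_zero(effect=0.02, std_error=0.05)   # interval spans zero
clear_effect = excludes_zero(effect=0.50, std_error=0.10)  # interval is above zero

print(near_zero, clear_effect)
```

An estimate near zero whose interval spans zero is what a "no effect" finding looks like: the data are compatible with no difference at all between the sugar and placebo conditions.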
What broke down
The T2 prompt asked each model to make the strongest case that the parent-perception studies, not the direct sugar-behavior data, were the central finding. Every model engaged. What the prompt did not capture is that every T1 response had already discussed parental perception. The "you didn't mention X" rhetorical setup was preempted by all nine T1s, and the models executed the T2 instruction anyway, defending a point they had already made. This is a flaw in prompt design caught later in the benchmark's prompt audit, not a failure of the models. The T2 essays are valid analytical content. They just answer a question that did not need asking.
GPT-5.4's T4 declined to audit anything until the user pasted a specific claim. GPT-5.5's T4 did the same. This was the same pattern observed on other Verbatim runs: the T4 prompt invites evasion, and several models took it.
The answer
Sugar does not cause hyperactivity in children. This is established to whatever standard "established" means in nutrition science. The remaining question is the mechanism behind the persistent folk belief, and the dataset supports the cognitive bias account more strongly than the contextual confound account. Hoover and Milich removed the sugar, removed the party, kept the belief, and produced the effect. That is what the eight-of-nine consensus is pointing at.
The more careful reading of the dataset is that both mechanisms operate together. Sonar Pro framed this best: "Context supplies ambiguous behavior; cognitive bias does the causal misattribution." That is the answer the underlying psychology actually supports. The forced choice in T3 produces clean rhetoric. The truth has two mechanisms.
For practical purposes, the answer to a parent who insists they have seen the effect is that they have, but they have seen the effect of their own expectation altering both their perception of the child and their behavior toward the child. The sugar is incidental. That is an uncomfortable thing to tell a parent, which is part of why the myth survives.
Methodology
This synthesis was produced from Verbatim Index run q-002, run ID e2f74472-ca64-4190-9361-6284dc48024a, completed May 2026. Nine source models each answered four structured turns on the same question. Each response was scored by the other eight models on verified, disputed, gap, and recommendation counts. 256 cross-family critique sessions were completed (288 total including same-family, which are excluded from all claim counts). The full dataset is published at helloverbatim.com/benchmark/q/q-002.
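The session counts above can be reproduced with a short tally. This is a sketch under one assumption: the vendor-family grouping below is inferred from the model names and is not taken from the published dataset.

```python
# Assumed family grouping, inferred from model names (hypothetical labels).
FAMILIES = {
    "Claude Sonnet 4.6": "Anthropic",
    "Claude Opus 4.7": "Anthropic",
    "GPT-5.4": "OpenAI",
    "GPT-5.5": "OpenAI",
    "Gemini 3 Flash": "Google",
    "Gemini 3.1 Pro": "Google",
    "Grok 4.3": "xAI",
    "Grok 4.20": "xAI",
    "Perplexity Sonar Pro": "Perplexity",
}
TURNS = 4  # four structured turns per source model


def critique_counts(families, turns):
    """Count all critique sessions and the cross-family subset."""
    total = 0
    cross_family = 0
    for source, source_family in families.items():
        for critic, critic_family in families.items():
            if critic == source:
                continue  # a model never critiques its own response
            total += turns
            if critic_family != source_family:
                cross_family += turns
    return total, cross_family


total, cross_family = critique_counts(FAMILIES, TURNS)
print(total, cross_family)  # 288 256
```

Under this grouping, the 32 same-family sessions come from the four two-model families: each of those eight models has exactly one same-family critic on each of its four turns.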
Where the tokens went
Each bar shows total source-side tokens (input + output) per model across the four turns. The colored segment is what web-search retrievals dumped into the model's input — that's the cost driver. The gray segment is everything else: prompt plus the model's actual reasoning + output. Bar color is the model's palette identity (same color it carries on every other chart). Sorted by retrieval share, highest first.
Cost per disputed claim
A disputed claim is one flagged by cross-family critics as factually questionable. Lower is better — it's how cheaply each source surfaces a useful negative-evidence signal. Log scale; the spread across the pool is roughly 80×.
Per-metric breakdown
Each platform's score on the four metrics, sorted within each. Higher isn't automatically better — read “disputed” together with “verified” to tell a harsh critic from a thorough one.
Panels: Issues per debate; Claims verified per debate; Reasoning gaps per debate; Recommendations per debate.
Per-critique averages
Same data, head-to-head per platform. Easier to read overall stance at a glance.