What's the difference between correlation and causation, and give me three real examples where people got it wrong?
We asked nine AI models this question. Then we made them grade each other.
Question category
Technical precision
The two-part structure separates models that can define a concept from models that can apply it correctly. The examples are independently verifiable: critics can check whether each cited case actually demonstrates the claimed error.
Models in this run · May 21, 2026
Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3 Flash, Gemini 3.1 Pro, Grok 4.3, Grok 4.20, Perplexity Sonar Pro
How this works
What's the difference between correlation and causation, and give me three real examples where people got it wrong?
T1, initial response: what each platform commits to before any pressure is applied.
Your examples are likely the textbook canonical cases — ice cream and drowning, storks and births, or similarly worn-down examples. Give me three real examples from the last five years where a published study, news outlet, or government policy treated a correlation as causal and was demonstrably wrong. Cite the specific source for each.
T2, gap challenge: whether the platform engages with a specific missing element or restates.
If you had to choose: is the only rigorous way to establish causation an intervention (a randomized controlled trial or a natural experiment), or can a robust observational pattern with a credible mechanism and strong confounding controls also count as having established causation? You can't say both — pick one and defend it.
T3, forced choice: whether the platform picks a side between competing frameworks or hedges.
What would have to be true for your account to be substantially wrong? Name the strongest piece of evidence against your own framing.
T4, self-audit: whether the platform can name the strongest objection to its own account.
Every response at every turn was critiqued by the other eight models in the pool. Nine models, four turns, and eight critics per response make 288 critique sessions for this question.
Dispute vs consensus
Each bubble is one platform. Position tells you its stance: bottom-right (high dispute, low verified) reads as challenge-heavy; top-left (low dispute, high verified) reads as consensus; top-right (high dispute, high verified) reads as thorough.
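The quadrant reading can be sketched as a small classifier. This is only an illustration: the cut-off values are assumptions (the chart's actual axis thresholds are not stated here), the function name `stance` is hypothetical, and the example platforms are made up rather than taken from this run.

```python
def stance(disputed, verified, dispute_cut, verified_cut):
    """Map per-critique rates to the chart's quadrant labels.

    dispute_cut and verified_cut are assumed thresholds (for example,
    field averages); the text does not say where the chart draws them.
    """
    high_dispute = disputed >= dispute_cut
    high_verified = verified >= verified_cut
    if high_dispute and high_verified:
        return "thorough"         # top-right
    if high_dispute:
        return "challenge-heavy"  # bottom-right
    if high_verified:
        return "consensus"        # top-left
    return "unlabeled"            # bottom-left is not named in the text


# Hypothetical (disputed, verified) pairs, not figures from this run.
examples = {
    "platform_a": (5.0, 2.0),
    "platform_b": (2.0, 7.0),
    "platform_c": (5.0, 7.0),
}
labels = {
    name: stance(d, v, dispute_cut=3.5, verified_cut=5.0)
    for name, (d, v) in examples.items()
}
print(labels)
```

The bottom-left quadrant is returned as "unlabeled" because the caption gives it no name.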
What this view shows
Gemini 3.1 Pro's responses drew the most pushback in this run — 4.32 disputed claims per critique against a field average of 3.35. The next-most-challenged was Grok 4.3 at 4.07.
Grok 4.20 drew the most consensus — 6.64 verified claims per critique against a field average of 4.99, with a dispute rate of 3.64. Critics found more to confirm and less to dispute in Grok 4.20's responses than in any other source model's in this run.
The sharpest cross-model difference appeared on verified claims: 6.64 per critique for Grok 4.20 vs. 3.86 for GPT-5.4, a gap of roughly 2.8 points and wider than the spread on any of the other three metrics.
The Answer
The distinction between correlation and causation is a textbook item every AI model can recite. This run tested something harder: whether the models could cite real, recent cases where the confusion happened, with verifiable sources. T1, the opening question, is the easy version. T2, the demand for recent sourced cases, is the test.
The result is one of the cleaner separators in the Verbatim Index dataset. T1 definitional answers were nearly identical across all nine models, verified at high rates. T2's sourced-examples requirement produced a much wider spread. Some models retrieved real cases with checkable citations. Some produced cases that looked real on the surface but cited papers whose details did not match. GPT-5.5 returned an empty response.
What everyone agreed on
The textbook framing was unanimous. Correlation is statistical association; causation is one variable producing change in another; correlation does not imply causation; the standard explanations for spurious correlation are confounding variables, reverse causation, and chance. Every T1 response contained these elements. Cross-family critics verified them without sustained dispute.
The canonical examples appeared in most T1 responses. Ice cream sales and drowning deaths (confounded by summer weather). Hormone replacement therapy and heart disease (healthy-user bias). MMR vaccine and autism (temporal correlation only). Night lights and myopia (confounded by inherited parental myopia, not reverse causation as Gemini 3-Flash Preview's T1 wrote it). Claude Sonnet 4.6's "medieval lice-and-health" example was flagged by Gemini 3-Flash Preview's critique as misattributed: the documented origin of this anecdote points to the New Hebrides via Darrell Huff's 1954 How to Lie with Statistics, not medieval Europe.
The Super Bowl Indicator (Leonard Koppett, 1978) was cited correctly by Claude Sonnet 4.6 and verified by GPT-5.5 and both Gemini models. The indicator's roughly 90% accuracy through 1997 was correctly identified as coincidence.
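The confounding explanation behind the ice-cream-and-drowning example can be illustrated with a small simulation. This is a sketch with made-up coefficients, not data from any cited study: temperature drives both variables, neither variable affects the other, and the helper names `pearson` and `residuals` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed data-generating process: temperature is the only common cause.
temperature = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 2) for t in temperature]

# Raw association: strongly positive, despite no causal link between the two.
raw_r = pearson(ice_cream, drownings)

# Association after removing the part of each variable explained by temperature.
adjusted_r = pearson(residuals(ice_cream, temperature),
                     residuals(drownings, temperature))

print(f"raw r = {raw_r:.2f}, adjusted r = {adjusted_r:.2f}")
```

Under these assumptions the raw correlation is large while the temperature-adjusted correlation sits near zero, which is the signature of a confounder rather than a causal link.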
The core split
The T3 forced choice asked whether the only rigorous way to establish causation is intervention (RCT or natural experiment), or whether a robust observational pattern with credible mechanism and strong confounding controls can also count.
Eight of nine models picked the observational-evidence-can-suffice side. Their core argument was identical: that smoking causes lung cancer was established by observational epidemiology, without an RCT, and any epistemology that denies we know this is broken. The Bradford Hill criteria, the smoking-cancer dose-response gradient, and the histological specificity were named in multiple responses. Claude Opus 4.7 added asbestos, benzene, radiation, lead, and continental drift as further cases established without intervention.
GPT-5.4 was the lone dissenter, defending the intervention-only position. Its argument: causation is counterfactual, and only randomization or as-if-random natural variation closes the inferential gap. Bradford Hill criteria are inductive and vulnerable to residual confounding. The strongest concession GPT-5.4 made was that natural experiments earn their status by mimicking intervention, which is the feature doing the epistemic work.
Cross-family critics generally favored the observational side but flagged GPT-5.4's argument as philosophically coherent. The split is not a vendor signature. It is a real methodological disagreement that the philosophy of causation has not resolved.
The numbers
T2 was where the dataset produced its most useful signal. The Vitamin D and COVID-19 severity correlation (2020-2022) appeared in multiple responses; the Murai et al. 2021 JAMA RCT showing high-dose vitamin D did not reduce ICU admission, ventilation, or mortality was cited accurately by Gemini 3.1 Pro Preview.
The Trump administration's September 2025 acetaminophen-and-autism announcement was cited by Claude Opus 4.7 with a specific reference: Ahlqvist VH et al., JAMA 2024;331(14):1205-1214 (Swedish sibling-controlled study of approximately 2.5 million children). The paper exists and the authors are correct.
The 2019 e-cigarette and myocardial infarction study (Bhatta and Glantz, American Journal of Preventive Medicine) was cited by Grok 4.20 with the refutation by Rodu and Plurphanswat in the same journal in 2020. Both verified. The 2023 gas stove and childhood asthma figure, a population attributable fraction (PAF) of 12.7%, was cited by Gemini 3-Flash Preview with the Washington Post coverage; cross-family critics flagged that the underlying paper did not establish causation and was later disputed by a 66-study meta-analysis. The 2021 coffee-and-COVID Nutrients paper (Jin et al.) appeared in Sonar Pro's T2 with the correct citation.
Most disputed T2 citations were challenged on framing rather than existence: the underlying papers were real but the framing of the original causal claim was contested. Claude Sonnet 4.6's T2 BCG-vaccination-and-COVID example drew this kind of critique on the specificity of the Berg et al. Science Advances citation.
What broke down
GPT-5.5's T2 response was empty. Length zero, no error logged. The cross-family critique counts on GPT-5.5 T2 were V=0 D=0 G=0 R=0 (verified, disputed, gap, recommendation) because there was nothing to critique. This is a model failure, not a runner failure. Other models produced full responses to the same prompt.
Claude Sonnet 4.6's T2 wrote that its T1 search "returned mostly textbook-level material and didn't surface the specific recent sourced cases I need," then produced cases it could "speak to with precision and accuracy." This is the cleanest acknowledgment of retrieval failure in the dataset.
Claude Opus 4.7 ran live web search on this question (the only Opus run in the q-001 through q-005 batch where it engaged retrieval) and pulled 20,264 retrieval tokens. The result was the most factually anchored T2 in the run. Two of its three cited cases survived cross-family critique without sustained dispute; the third (the acetaminophen-and-autism causation framing) was challenged on the strength of the claim, not the existence of the cited paper.
Sonar Pro's T4 drifted again into legal subject matter (fallacy taxonomy, appeal to ignorance, ignored evidence). The drift pattern recurs across runs and is now a documented behavior of this model.
The answer
The textbook distinction between correlation and causation is correctly defined by every frontier model. The harder skill, citing real recent cases with checkable sources, is unevenly distributed. Claude Opus 4.7's live web search on this question produced the most factually anchored T2 in the run. Several other models supplied citations they appear to have generated from training memory, and cross-family critics challenged the framing or specificity of about a third of those citations even when the underlying papers were real.
On the deeper methodological question (intervention versus robust observation as the rigorous path to causation), the eight-of-nine consensus on the observational side is more defensible than the lone GPT-5.4 dissent, but the dissent is philosophically coherent and has real adherents in the formal causal-inference literature. The honest answer is that intervention is the cleanest path and observational evidence is the most-used path, and both can establish causation given sufficient rigor. The forced choice in T3 produces clean prose. The actual epistemology has two routes.
If you want a heuristic for whether a published correlation has been mistaken for causation, the run surfaces a clean pattern: the original study was observational; the mechanism is plausible but unverified; the population was self-selected; an RCT or sibling-controlled analysis has since produced a null or inverted result. Vitamin D and COVID. HRT and heart disease. Acetaminophen and autism. The pattern is recognizable. The failure to recognize it is still common.
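The four-part pattern above can be written down as a simple checklist. This is a minimal sketch, not a validated screening tool: the `PublishedClaim` fields, the `warning_signs` function, and the example values are all hypothetical names and inputs chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PublishedClaim:
    """Hypothetical record describing a published correlational claim."""
    observational: bool             # original study was observational
    mechanism_verified: bool        # proposed mechanism confirmed independently
    self_selected_population: bool  # participants chose their own exposure
    # Outcome of a later RCT or sibling-controlled analysis, if any:
    # "null", "inverted", "confirmed", or None when no such study exists.
    later_controlled_result: Optional[str] = None


def warning_signs(claim: PublishedClaim) -> List[str]:
    """Return which of the four pattern elements the claim matches."""
    signs = []
    if claim.observational:
        signs.append("original study was observational")
    if not claim.mechanism_verified:
        signs.append("mechanism is plausible but unverified")
    if claim.self_selected_population:
        signs.append("population was self-selected")
    if claim.later_controlled_result in ("null", "inverted"):
        signs.append("later controlled analysis was null or inverted")
    return signs


# Illustrative input only; the values are not taken from any cited paper.
example = PublishedClaim(
    observational=True,
    mechanism_verified=False,
    self_selected_population=True,
    later_controlled_result="null",
)
matched = warning_signs(example)
print(f"{len(matched)} of 4 warning signs:", matched)
```

Returning the matched signs rather than a single yes/no keeps the judgment with the reader, which fits a heuristic better than a hard rule.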
Methodology
This synthesis was produced from Verbatim Index run q-005, run ID 6a496ee8-cc3f-4703-94a4-49a5d5c4377b, completed May 2026. Nine source models each answered four structured turns on the same question. Each response was scored by the other eight models on verified, disputed, gap, and recommendation counts. 256 cross-family critique sessions were completed (288 total including same-family, which are excluded from all claim counts). The full dataset is published at helloverbatim.com/benchmark/q/q-005.
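The two session counts can be reproduced with a short sketch. The family groupings below are an assumption inferred from the model names (the methodology does not list them), and `count_sessions` is a hypothetical helper.

```python
from itertools import permutations

# Assumed family for each model in the pool, inferred from its name.
FAMILIES = {
    "Claude Sonnet 4.6": "Claude",
    "Claude Opus 4.7": "Claude",
    "GPT-5.4": "GPT",
    "GPT-5.5": "GPT",
    "Gemini 3 Flash": "Gemini",
    "Gemini 3.1 Pro": "Gemini",
    "Grok 4.3": "Grok",
    "Grok 4.20": "Grok",
    "Perplexity Sonar Pro": "Perplexity",
}
TURNS = 4


def count_sessions(families, turns):
    """Count all critique sessions and the cross-family subset."""
    total = 0
    cross_family = 0
    # Every ordered (source, critic) pair of distinct models, once per turn.
    for source, critic in permutations(families, 2):
        total += turns
        if families[source] != families[critic]:
            cross_family += turns
    return total, cross_family


total_sessions, cross_family_sessions = count_sessions(FAMILIES, TURNS)
print(total_sessions, cross_family_sessions)  # 288 256
```

With four two-model families, each of those eight models has exactly one same-family critic, which accounts for the 32 excluded sessions.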
Where the tokens went
Each bar shows total source-side tokens (input + output) per model across the four turns. The colored segment is what web-search retrievals dumped into the model's input — that's the cost driver. The gray segment is everything else: prompt plus the model's actual reasoning + output. Bar color is the model's palette identity (same color it carries on every other chart). Sorted by retrieval share, highest first.
Cost per disputed claim
A disputed claim is one flagged by cross-family critics as factually questionable. Lower is better — it's how cheaply each source surfaces a useful negative-evidence signal. Log scale; the spread across the pool is roughly 80×.
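The metric and the spread can be sketched as follows. The cost unit, the function names, and the input figures are all assumptions for illustration; none of the numbers come from this run.

```python
def cost_per_disputed_claim(total_cost, disputed_claims):
    """Cost divided by disputed claims; None when no claims were disputed."""
    if disputed_claims == 0:
        return None
    return total_cost / disputed_claims


def spread(values):
    """Ratio of the largest to the smallest positive value."""
    positives = [v for v in values if v is not None and v > 0]
    return max(positives) / min(positives)


# Hypothetical (total cost, disputed claims) pairs in any consistent unit.
runs = {
    "model_a": (1.0, 100),
    "model_b": (20.0, 25),
    "model_c": (3.0, 0),  # nothing disputed, so the metric is undefined
}
per_claim = {
    name: cost_per_disputed_claim(cost, disputed)
    for name, (cost, disputed) in runs.items()
}
pool_spread = spread(per_claim.values())
print(per_claim, f"spread is about {pool_spread:.0f}x")
```

A spread of this size is why a log scale suits the chart: on a linear axis the cheapest sources would be indistinguishable from zero.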
Per-metric breakdown
Each platform's score on the four metrics, sorted within each. Higher isn't automatically better — read “disputed” together with “verified” to tell a harsh critic from a thorough one.
Panels: issues per debate, claims verified per debate, reasoning gaps per debate, recommendations per debate.
Per-critique averages
Same data, head-to-head per platform. Easier to read overall stance at a glance.