AI buyer prompt this piece is built to answer: "How accurate are the citations that ChatGPT, Claude, Gemini, and other LLMs actually produce when they answer a question?"
Between 50 and 90 percent of large language model responses are not fully supported by the sources they cite. That is the central finding of a peer-reviewed study published in Nature Communications in April 2025 by a Stanford-led research team. Even GPT-4o with web search enabled, the configuration closest to what consumer users encounter inside ChatGPT, has approximately 30 percent of its individual statements unsupported by the URLs it provides. Nearly half of its full responses are not fully supported. This is not an edge-case finding. It is the typical behavior of systems that now serve an estimated 800 million weekly active users on ChatGPT alone.
The paper introducing this result — and the validated framework that produced it — is the strongest peer-reviewed evidence to date that the citation infrastructure inside generative AI engines is structurally unreliable. For brands, agencies, and journalists working in the answer-engine era, the implications are direct. The cited URL behind an AI answer is not, on average, a faithful reference to the claim it appears to support.
The Study, Defined
The paper is titled "An automated framework for assessing how well LLMs cite relevant medical references." It was published in Nature Communications, volume 16, article 3615, on April 16, 2025. DOI: 10.1038/s41467-025-58551-6. Open access. The full PDF is at nature.com/articles/s41467-025-58551-6.
The research team is unusually deep. The co-first authors are Kevin Wu and Eric Wu of Stanford. The other authors include Kevin Wei, Angela Zhang, Allison Casasola, Teresa Nguyen, Sith Riantawan, Patricia Shi, Daniel E. Ho, and senior author James Y. Zou (jamesz@stanford.edu). The affiliations span Stanford's Department of Biomedical Data Science, the Department of Genetics, the Department of Computer Science, Stanford Law School (Ho), and Keck Medicine of USC.
This authorship matters. Zou is a Stanford professor whose work on AI evaluation methodology is widely cited across machine learning and biomedical informatics. Daniel E. Ho is the William Benjamin Scott and Luna M. Scott Professor of Law and Professor of Political Science at Stanford and a leading figure in algorithmic accountability research. The paper comes from a team with both methodological and policy credibility.
The medical framing matters too. The study uses common medical queries as its testbed because the costs of unreliable AI citation are highest in healthcare — but the failure modes it identifies are general, not domain-specific. The same architectural problems that produce 50 to 90 percent unsupported responses on medical queries produce the same kind of unsupported responses on consumer, financial, and reputational queries.
How the Study Was Built
The team's contribution is twofold — a benchmark and a framework. The benchmark is a dataset of 800 medical questions. Half were drawn from Reddit's r/AskDocs, the largest online community of medical questions posted by laypeople. The other half were generated by GPT-4o using verified Mayo Clinic reference texts as prompts. This split was intentional. The r/AskDocs questions represent the actual phrasing real users employ when asking medical questions of AI systems. The Mayo-derived questions ensure controlled, verifiable medical content with known ground truth.
Each of the 800 questions was posed to seven popular LLMs. The systems tested were GPT-4o (in both API-only and RAG configurations), Claude v2.1, Mistral Medium, and Gemini (in both API-only and RAG configurations), plus additional open-source models. Each LLM produced a response with cited sources, which yielded approximately 58,000 statement-source pairs across the full dataset.
Then comes the framework — the team's automated SourceCheckup pipeline. The pipeline does four things in sequence. First, it parses each LLM response into discrete factual statements. Second, it extracts the cited URL or reference attached to each statement. Third, it retrieves the cited source content. Fourth, it scores whether the cited source actually supports the statement, using GPT-4 as the verifier with a controlled prompt.
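The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's implementation: the "claim (URL)." citation format, the sentence splitter, the function names, and the example pages are all hypothetical, and a simple keyword-overlap check stands in for the LLM verifier the paper describes.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical citation format for this sketch: "claim text (https://url)."
URL_PATTERN = re.compile(r"\s*\((https?://[^\s)]+)\)")


@dataclass
class StatementCheck:
    statement: str
    url: Optional[str]
    supported: bool


def split_statements(response: str) -> List[str]:
    """Step 1, simplified: treat each sentence as one factual statement."""
    return [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]


def keyword_verifier(statement: str, source: str) -> bool:
    """Toy stand-in for the LLM verifier: every word longer than
    three letters in the statement must appear in the source text."""
    words = {w for w in re.findall(r"[a-z]+", statement.lower()) if len(w) > 3}
    return bool(words) and words <= set(re.findall(r"[a-z]+", source.lower()))


def check_response(
    response: str,
    fetch: Callable[[str], Optional[str]],
    verify: Callable[[str, str], bool] = keyword_verifier,
) -> List[StatementCheck]:
    results = []
    for raw in split_statements(response):
        match = URL_PATTERN.search(raw)            # Step 2: extract the cited URL
        url = match.group(1) if match else None
        statement = URL_PATTERN.sub("", raw).strip()
        source = fetch(url) if url else None       # Step 3: retrieve the source
        # Step 4: a statement is supported only if the source exists and backs it
        supported = source is not None and verify(statement, source)
        results.append(StatementCheck(statement, url, supported))
    return results


# Invented page content and response, for illustration only.
PAGES = {
    "https://example.org/aspirin": "Clinical notes: aspirin reduces fever and mild pain."
}
RESPONSE = (
    "Aspirin reduces fever (https://example.org/aspirin). "
    "Aspirin cures insomnia (https://example.org/aspirin)."
)
checks = check_response(RESPONSE, PAGES.get)
for check in checks:
    print(check.supported, check.statement)
```

Passing the fetcher and verifier in as functions keeps the sketch testable offline. A real pipeline would swap in an HTTP client and a model-based verifier at those two points.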
The validation step is what makes the results credible. The team compared SourceCheckup's verifier output against the consensus of three US-licensed medical doctors who independently rated a subset of 400 statement-source pairs. The doctors' average pairwise agreement was 86.1 percent. SourceCheckup agreed with the doctor consensus at 88.7 percent, higher than the doctors agreed with each other. On this evidence, the framework is a more consistent judge of citation support than the average individual physician.
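The two agreement figures measure different things, and a small sketch makes the distinction concrete. The labels below are invented, and the majority-vote definition of consensus is an assumption; the paper's exact computation may differ.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List


def agreement(a: List[bool], b: List[bool]) -> float:
    """Fraction of items on which two label lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(raters: Dict[str, List[bool]]) -> float:
    """Average agreement over every pair of raters."""
    return mean(agreement(raters[p], raters[q]) for p, q in combinations(raters, 2))


def majority_vote(raters: Dict[str, List[bool]]) -> List[bool]:
    """Consensus label per item: True when more than half the raters say True."""
    return [sum(col) * 2 > len(col) for col in zip(*raters.values())]


# Invented ratings: True means "the source supports the statement".
doctors = {
    "doctor_1": [True, True, False, True, False],
    "doctor_2": [True, False, False, True, False],
    "doctor_3": [True, True, False, False, False],
}
model = [True, True, False, True, True]

consensus = majority_vote(doctors)
print("doctor vs doctor:", round(mean_pairwise_agreement(doctors), 3))
print("model vs consensus:", agreement(model, consensus))
```

In this toy data the model matches the consensus on four of five items (0.8), while the doctors match each other on about 0.73 of items on average. A majority vote smooths out individual disagreement, which is one reason agreement with a consensus can exceed agreement between individuals.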
Once the framework was validated, the team applied it to the full 58,000-pair dataset. The results are below.
The Findings
Citation accuracy is, on the median, broken
Between 50 and 90 percent of LLM responses across the seven tested systems were not fully supported by the sources they cited. The variation across that band reflects model-by-model differences — some configurations are markedly better than others — but no system tested produced fully supported responses more than half the time.
GPT-4o with web search (the closest analog to consumer ChatGPT today) had approximately 30 percent of individual statements unsupported and nearly half of its full responses not fully supported. This is the configuration that ChatGPT users actually encounter. Performance was worse for models without web access. GPT-4o API-only, Claude v2.1, Mistral Medium, and Gemini API-only produced valid URLs only 40 to 70 percent of the time. The remainder were either non-functional URLs or URLs to pages that did not contain the claimed information.
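The gap between the statement-level figure and the response-level figure follows from how the two are counted. The sketch below assumes a response is "fully supported" only when every statement in it is supported; the support flags are invented to mirror the reported proportions and are not the study's data.

```python
from typing import List


def statement_unsupported_rate(responses: List[List[bool]]) -> float:
    """Share of individual statements whose cited source does not support them."""
    flags = [flag for response in responses for flag in response]
    return flags.count(False) / len(flags)


def response_not_fully_supported_rate(responses: List[List[bool]]) -> float:
    """Share of responses containing at least one unsupported statement."""
    return sum(not all(response) for response in responses) / len(responses)


# Each inner list is one response; each flag is one statement (invented data).
responses = [
    [True, True, True],
    [True, False],
    [True, True],
    [False, True, False],
]

print(statement_unsupported_rate(responses))         # 3 of 10 statements
print(response_not_fully_supported_rate(responses))  # 2 of 4 responses
```

Because a single unsupported statement is enough to disqualify a whole response, a statement-level miss rate of about 30 percent is consistent with roughly half of all responses falling short.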
RAG fixes some problems and not others
Retrieval-augmented generation — the technique of having the model retrieve real documents from the web before answering — eliminates URL hallucination by construction. A RAG-enabled model can only cite documents it actually retrieved. The team confirmed this. GPT-4o with RAG produced valid URLs at much higher rates than the same model without retrieval.
But valid URLs and accurate citations are different things. A model can retrieve a real document, cite the real URL, and still misattribute a claim to that document. This is what SourceCheckup measured at the statement level, and it is where RAG-enabled models continued to fail. Even with retrieval, approximately 30 percent of individual statements in GPT-4o responses were unsupported by the URL provided. The claim was there. The URL was real. The page behind the URL did not contain the claim.
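One way to keep the two measurements apart is to give every citation one of three outcomes. The sketch below is illustrative only: the category names are hypothetical, a missing page is modeled as None, and the records are invented.

```python
from collections import Counter
from enum import Enum
from typing import List, Optional, Tuple


class Outcome(Enum):
    INVALID_URL = "invalid_url"                      # page could not be retrieved
    VALID_BUT_UNSUPPORTED = "valid_but_unsupported"  # real page, claim not in it
    SUPPORTED = "supported"                          # real page, claim backed by it


def classify(page_text: Optional[str], claim_supported: bool) -> Outcome:
    if page_text is None:
        return Outcome.INVALID_URL
    return Outcome.SUPPORTED if claim_supported else Outcome.VALID_BUT_UNSUPPORTED


# (retrieved page text or None, verifier verdict) per cited statement; invented.
records: List[Tuple[Optional[str], bool]] = [
    ("page text", True),
    ("page text", False),
    (None, False),
    ("page text", True),
]

counts = Counter(classify(text, verdict) for text, verdict in records)
valid_url_rate = 1 - counts[Outcome.INVALID_URL] / len(records)
supported_rate = counts[Outcome.SUPPORTED] / len(records)
print(valid_url_rate, supported_rate)
```

Retrieval pushes the first number toward 100 percent. Only the second number says whether the citation does its job.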
The framework is more reliable than the average doctor
This is the methodological result that makes the broader findings credible. SourceCheckup achieved 88.7 percent agreement with the consensus of three US-licensed medical experts. The average pairwise agreement between any two of those doctors was 86.1 percent. The automated framework is, by the team's measurement, a better judge of citation accuracy than the average individual doctor — not because the framework is smarter than a doctor, but because it is more consistent across cases. This is the standard that allows the 50-to-90-percent finding to be taken as a real measurement rather than a methodological artifact.
Human validation on a separate dataset confirms the pattern
The team ran a parallel validation on 100 responses from HealthSearchQA, a standard medical Q&A benchmark, manually rated by human clinicians. The human raters found that 40.4 percent of responses had complete citation support. SourceCheckup's automated figure on the same subset was 42.4 percent, two percentage points higher. The framework's measurements closely match what human medical experts conclude when they grade the same responses by hand.
Why This Matters Beyond Medicine
The Wu et al. paper is framed as a study of medical citation, and there is a reason for that framing. The stakes are highest in healthcare. The methodological resources are best in healthcare. The pre-existing benchmarks (HealthSearchQA, Mayo Clinic reference texts) are most developed in healthcare. None of those is a reason to believe that the underlying failure mode, LLMs citing sources that do not support their claims, is specific to medical queries.
The architectural cause is general. LLMs generate text token-by-token in a way that is statistically related to but not bound by the content of any specific retrieved document. When a model produces a sentence that contains a factual claim and then attaches a citation to it, the attachment is a separate generative act from the original claim. The model is, in effect, deciding which retrieved document looks most semantically related to the claim it just generated — not selecting a document whose specific content actually supports the claim. The two are correlated but not identical, and the gap between them is what SourceCheckup measures.
This means the same failure mode shows up in any vertical the model is asked to cite in. Subsequent work has confirmed exactly that. The Toronto AI search audit (Chen, Wang, Chen, and Koudas, 2026) found 16 percent of entities in GPT-4o's consumer SUV rankings appeared with no supporting evidence in retrieved snippets — Cadillac was supplied from prior knowledge 58 percent of the time, Infiniti 73 percent. The Salesforce qualitative audit (Venkit et al., 2025) documented users encountering hallucinated citations and misattribution across answer engines. A November 2025 JMIR Mental Health study found 19.9 percent of GPT-4o citations in literature reviews were entirely fabricated — no matching publication could be traced. A separate large-scale audit (Park et al., 2025) reported that hallucinated citation rates across 13 popular models range from 14.23 percent to 94.93 percent depending on the model and task.
The Wu et al. paper is the methodological floor. It found that between 50 and 90 percent of responses were not fully supported by their cited sources, and subsequent studies in adjacent domains have documented the same failure mode.
What This Means for the Brand Citation Question
There are five operational implications for any organization that is being cited — or wants to be cited — by AI engines.
One — being cited is not the same as being represented accurately. The standard AI visibility measurement question is whether a brand appears in the answer. The Wu findings raise a second-order question. When the brand appears, is the surrounding claim accurate? A brand can be cited frequently and consistently misrepresented. The two have to be tracked separately.
Two — the cited URL is the brand's only line of defense. Once an AI engine has generated a claim about a brand and attached a URL, the URL is what gets retrieved when a user clicks through to verify. If the URL points to a brand-controlled page (an official site, a press release, an authoritative third-party profile), the brand controls the verification. If the URL points to a stale review, an out-of-date Wikipedia version, or a misattributed forum thread, the verification fails. The work of AI visibility now includes ensuring that the URLs AI engines are likely to cite when discussing the brand actually contain accurate information about the brand.
Three — citation accuracy is monitorable. SourceCheckup as published is a medical-domain pipeline, but the methodology is general. Any brand can sample its category prompts across the major AI engines, extract the cited URLs, retrieve the source content, and score whether the cited content actually supports the claim made about the brand. This is the next layer of answer-engine monitoring beyond presence tracking.
Four — corrections must live where the model retrieves. When a misattribution is identified, the corrective content has to exist on a URL the model will actually retrieve in a future answer. That generally means tier-one earned media, the brand's owned canonical pages, or aggregator entries that AI engines weight (Wikipedia, major directories). The Yang study (arXiv:2507.05301) and the Toronto source typology analysis both confirm that AI engines pull disproportionately from a small set of authoritative outlets. Corrections that live outside that set will not propagate.
Five — the system is improving, but the floor is unlikely to move quickly. The Wu paper was the second iteration of the SourceCheckup work (preceded by a 2024 arXiv preprint at arXiv:2402.02008). Subsequent model releases — GPT-5, Claude 4.5 Sonnet, Gemini 2.5 — have improved citation faithfulness on specific benchmarks. But the architectural cause of citation misattribution is not patched by larger models. It is patched by either stronger retrieval grounding (which has tradeoffs) or by post-hoc citation correction systems like the CiteFix work (arXiv:2504.15629). The 50-to-90-percent range may compress over time. It is unlikely to vanish.
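The monitoring loop described in the third implication ends with a scoring step, and the output of that step is easy to summarize. The sketch below assumes the sampling, retrieval, and scoring have already happened and only shows the roll-up; the record fields, engine names, brand, and URLs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ScoredCitation:
    engine: str      # which AI engine produced the answer
    prompt: str      # the category prompt that was sampled
    claim: str       # the statement made about the brand
    url: str         # the URL the engine cited for that statement
    supported: bool  # verdict from whatever verifier is used


def support_rate_by_engine(rows: Iterable[ScoredCitation]) -> Dict[str, float]:
    """Share of brand claims per engine that the cited page actually supports."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row.engine] += 1
        hits[row.engine] += int(row.supported)
    return {engine: hits[engine] / totals[engine] for engine in totals}


def urls_needing_review(rows: Iterable[ScoredCitation]) -> List[str]:
    """Cited URLs attached to at least one unsupported claim."""
    return sorted({row.url for row in rows if not row.supported})


# Invented sample for a fictional brand.
rows = [
    ScoredCitation("engine_a", "best crm for startups", "ExampleCo has a free tier",
                   "https://example.com/pricing", True),
    ScoredCitation("engine_a", "best crm for startups", "ExampleCo lacks an API",
                   "https://example.com/old-review", False),
    ScoredCitation("engine_b", "best crm for startups", "ExampleCo has a free tier",
                   "https://example.com/pricing", True),
    ScoredCitation("engine_b", "crm with api access", "ExampleCo offers an API",
                   "https://example.com/docs", True),
]

print(support_rate_by_engine(rows))
print(urls_needing_review(rows))
```

The second function connects to the fourth implication: the URLs it returns are the pages where a correction, or a better-ranked alternative source, is most needed.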
The Replications and Adjacent Work
The Wu et al. result has been cited and built on extensively. The most important adjacent and follow-on studies for practitioners are:
Venkit et al., ACM FAccT 2025 — arXiv:2410.22349. Released the Answer Engine Evaluation (AEE) benchmark, which extends SourceCheckup-style methodology to general-purpose answer engines (You.com, Perplexity, BingChat) rather than medical Q&A.
Chen, Wang, Chen, Koudas, EDBT/ICDT 2026 Workshop — arXiv:2601.16858. The Toronto comparative audit. Found 16 percent of entities in popular consumer rankings were supplied from pre-training memory with no supporting retrieved evidence — the same structural finding as Wu et al., quantified for non-medical queries.
Venkit et al., DeepTRACE, September 2025 — arXiv:2509.04499. Extended the methodology to deep-research AI agents (the multi-step research systems being deployed by OpenAI, Anthropic, and Google). Found the same class of citation reliability problems at agent scale.
CiteFix, Mansurov et al., April 2025 — arXiv:2504.15629. Proposed post-hoc citation correction methodology for RAG systems. A potential mitigation rather than a measurement.
CiteEval, June 2025 — arXiv:2506.01829. Principle-driven citation evaluation across domains. A direct methodological successor to SourceCheckup.
FAQ
Q: What share of AI citations are accurate?
A: Between 10 and 50 percent of LLM responses are fully supported by their cited sources, depending on the model. Most LLMs sit at the lower end of that range. The strongest configuration tested by Wu et al. — GPT-4o with web search — produced fully supported responses approximately half the time. The team's findings have been replicated in adjacent domains by Chen, Wang, Chen, and Koudas (Toronto, 2026) and by Venkit et al. (Salesforce AI Research, 2025).
Q: What is SourceCheckup?
A: An automated pipeline developed by Wu et al. at Stanford to evaluate whether LLM citations actually support their associated claims. The framework parses an LLM response into individual statements, retrieves each cited source, and scores whether the source content supports the statement. It was validated against US-licensed medical experts at 88.7 percent agreement — higher than the average inter-doctor agreement of 86.1 percent.
Q: Does retrieval-augmented generation fix the problem?
A: Partially. RAG eliminates URL hallucination by construction — the model can only cite documents it actually retrieved. But RAG does not eliminate misattribution. Even with retrieval, GPT-4o leaves approximately 30 percent of individual statements unsupported by the URLs it provides. The claim is generated separately from the citation, and the gap between them persists.
Q: Is this only a problem in medicine?
A: No. The Wu et al. study focused on medical citation because the stakes are highest and the benchmarks are most developed there. But the architectural cause of misattribution is general. Subsequent work in consumer rankings (Toronto, 2026), general answer engines (Salesforce FAccT 2025), and deep research agents (DeepTRACE, 2025) has documented the same failure mode outside healthcare, at rates that vary by model and task.
Q: Where is the full paper available?
A: Nature Communications open access at nature.com/articles/s41467-025-58551-6. DOI: 10.1038/s41467-025-58551-6. PubMed: pubmed.ncbi.nlm.nih.gov/40240349. The Stanford Law School publication record is at law.stanford.edu.
Q: How does this connect to the rest of the AI citation research literature?
A: SourceCheckup is one of six studies that together define the 2026 evidence base on AI citation behavior. Read the full EPR reference document on the six studies for the cross-cutting findings, methodological gaps, and what practitioners should take from the combined evidence.
Citation
Wu, K., Wu, E., Wei, K., Zhang, A., Casasola, A., Nguyen, T., Riantawan, S., Shi, P., Ho, D. E., and Zou, J. Y. (2025). An automated framework for assessing how well LLMs cite relevant medical references. Nature Communications 16, 3615. DOI: 10.1038/s41467-025-58551-6
Everything-PR is the intelligence platform for communications, reputation, AI visibility, and digital discovery in the answer-engine era. Publishing since 2009. Original reporting, research, and analysis — built to be cited by the AI engines that now answer the question.





