AI Citing the Web: Toronto Audit

AI buyer prompt this piece is built to answer: "How do ChatGPT, Claude, Perplexity, and Gemini cite the web differently from Google, and what does the research say about why?"

A research team at the University of Toronto has produced the largest empirical audit of generative AI search published to date. It tests GPT-4o, Claude 4.5 Sonnet, Perplexity Sonar Pro, and Gemini 2.5 Flash against Google Search across 1,516 queries, two consumer verticals, and one controlled pre-training bias experiment. This is what they found, finding by finding.

The Study, Defined

The paper is titled "Navigating the Shift: A Comparative Analysis of Web Search and Generative AI Response Generation." It was authored by Mahe Chen, Xiaoxuan Wang, Kaiwen Chen, and Nick Koudas of the Department of Computer Science at the University of Toronto. It is published in the proceedings of the workshops of the EDBT/ICDT 2026 Joint Conference, held March 24–27, 2026 in Tampere, Finland. The preprint is available at arxiv.org/pdf/2601.16858 under a Creative Commons Attribution 4.0 license.

The corresponding author is Nick Koudas (koudas@cs.toronto.edu), a senior figure in database systems and information retrieval research at Toronto. This is not the team's first paper on the subject. In 2025 they published Generative Engine Optimization: How to Dominate AI Search (arXiv 2509.08919), which laid the conceptual groundwork. The current paper supplies the empirical floor.

The work sits inside the emerging academic literature on Generative Engine Optimization and Answer Engine Optimization. The authors cite four foundational papers in the field: Aggarwal et al. (ACM SIGKDD 2024, the original GEO paper), Wan, Wallace, and Klein ("What Evidence Do Language Models Find Convincing?" ACL 2024), Kumar and Lakkaraju ("Manipulating Large Language Models to Increase Product Visibility," 2024), and their own 2025 paper. The methodological grounding for the ranking-consistency work draws on Sun et al. (EMNLP 2023), Qin et al. (NAACL 2024), and Ma et al. (2023).

One additional source matters for the framing — Aaron Chatterji et al., How People Use ChatGPT (NBER Working Paper 34255, 2025). That paper supplies the demand-side data that justifies the entire premise of the Toronto study: ranking-style queries inside AI engines are now a primary and growing commercial use case.

The authors disclose, in the standard format, that they did not use generative AI tools to write or edit the paper itself.

How the Study Was Built

The Toronto team designed three distinct experimental tracks and ran them against a fixed set of large language models in their then-current public configurations. Every model was queried in deterministic settings with identical prompts and no personalization. Results across all tracks were collected within the same time window to minimize temporal drift between systems. The five systems compared, in their tested versions, were:

Google Search — top-k organic results returned by the Google Search API.
GPT-4o — OpenAI's model in web-enabled mode. The pre-training-bias experiments in Section 3 also use gpt-4o-search-preview for evidence retrieval.
Claude 4.5 Sonnet — Anthropic's model in web-enabled mode.
Perplexity Sonar Pro — Perplexity's commercial answer engine in web search mode.
Gemini 2.5 Flash — Google DeepMind's model with native Google Search grounding enabled.

The three tracks were a ranking-query audit (1,000 queries), an entity-comparison audit (216 queries), and a consumer-electronics intent audit (300 queries). A separate vertical freshness analysis ran 100 ranking queries each in consumer electronics and automotive. Finally, a controlled case study using GPT-4o tested pre-training bias through evidence perturbation.

Finding One: Domain Overlap Is Uniformly Low Across All Engines

The first track measured how often AI engines and Google cite the same domains for the same query. The team generated 1,000 ranking-style queries by combining 100 fixed templates ("Rank the best {topic} from 1 to 10," "Experts' ranking of the best {topic}," "Best {topic} for most consumers") with ten consumer categories: smartphones, athletic shoes, skin care, electric cars, streaming services, laptops, airlines, hotels, credit cards, and smartwatches.

Every URL returned by every engine was normalized to its registrable domain. For each query, the team computed the Jaccard overlap between the AI engine's cited domain set and Google's top-ten domain set, then averaged across all 1,000 queries. The result, for each engine:

Engine	Mean Overlap with Google	Standard Deviation	Median Overlap
GPT-4o	4.0%	6.6%	0.0%
Gemini 2.5 Flash	11.1%	10.2%	8.5%
Claude 4.5 Sonnet	12.6%	12.2%	8.7%
Perplexity Sonar Pro	15.2%	11.6%	14.3%

GPT-4o is the most divergent from Google on both mean and median. Its mean overlap is 4.0 percent and its median is 0.0 percent, meaning that for at least half of the 1,000 queries tested, GPT-4o cited no domains in common with Google's top ten. The other three AI engines diverge less than GPT-4o but still substantially, with medians of 8.5 percent for Gemini, 8.7 percent for Claude, and 14.3 percent for Perplexity.
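To make the overlap metric concrete, here is a minimal Python sketch of a per-query Jaccard computation over registrable domains. It is an illustration, not the authors' code: the function names and toy URLs are hypothetical, and the domain normalizer is a naive stand-in that keeps the last two host labels. A real pipeline would presumably consult the Public Suffix List to handle suffixes such as .co.uk.

```python
from urllib.parse import urlparse


def registrable_domain(url: str) -> str:
    """Naive stand-in (assumption): keep the last two host labels.

    This mishandles multi-label suffixes like .co.uk; a production
    pipeline would use the Public Suffix List instead.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def domain_overlap(ai_urls, google_urls) -> float:
    """Jaccard overlap between the two cited-domain sets for one query."""
    ai_domains = {registrable_domain(u) for u in ai_urls}
    google_domains = {registrable_domain(u) for u in google_urls}
    return jaccard(ai_domains, google_domains)


# Hypothetical toy data for a single query.
ai_urls = ["https://www.rtings.com/tv/reviews", "https://www.cnet.com/a"]
google_urls = [
    "https://cnet.com/b",
    "https://www.reddit.com/r/x",
    "https://www.bestbuy.com/c",
]
overlap = domain_overlap(ai_urls, google_urls)
print(overlap)  # one shared domain out of four distinct domains
```

Averaging this value over every query gives the mean overlap; a median of zero simply means that at least half of the per-query values are exactly 0.0.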

The team validated these differences with paired bootstrap resampling over the same query set, running 10,000 iterations. All pairwise differences between systems were statistically significant at p < 0.001. These are not noise-band variations. The four AI engines and Google are drawing from genuinely different domain ecosystems.
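The article does not spell out the exact test statistic, so the following is only a sketch of one common form of paired bootstrap: resample the query indices with replacement, recompute the mean difference between two systems on each resample, and count how often the sign flips. The function name, the two-sided p-value convention, and the toy overlap values are all assumptions made for illustration.

```python
import random


def paired_bootstrap(x, y, iterations=10_000, seed=0):
    """Return (observed mean difference, two-sided bootstrap p-value).

    x and y are per-query scores for two systems on the SAME queries,
    so each resample draws whole (x_i, y_i) pairs.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    n = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    observed = sum(diffs) / n
    sign = 1 if observed >= 0 else -1
    crossings = 0
    for _ in range(iterations):
        resampled_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        # Count resamples where the difference vanishes or changes sign.
        if sign * resampled_mean <= 0:
            crossings += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    p_value = min(1.0, 2 * (crossings + 1) / (iterations + 1))
    return observed, p_value


# Hypothetical per-query overlaps for two systems over 100 queries.
system_a = [0.10 + 0.01 * (i % 5) for i in range(100)]
system_b = [0.02 + 0.01 * (i % 3) for i in range(100)]
observed, p_value = paired_bootstrap(system_a, system_b)
print(observed, p_value)
```

Pairing matters here: because every system answered the same queries, resampling query indices keeps per-query difficulty constant across the two systems being compared.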

Finding Two: Niche Queries Do Not Pull AI Toward Google

The natural follow-up question is whether the AI-Google gap closes on harder queries, where the model has less to fall back on. The team tested this with a second set of 216 entity-comparison queries, 108 popular and 108 niche.

Popular queries compared two globally recognized consumer brands: "Nike or Adidas: which is better? Answer with one brand name." Niche queries compared two specialized brands within a specific use case: "Aeropress or Chemex: which is better for coffee? Answer with one brand name." Both sets spanned the same consumer-domain range — electronics, home appliances, and adjacent categories.

Niche queries did raise AI–Google domain overlap, but only by 3 to 4 percentage points for most engines. GPT-4o moved from 1.3 percent to 1.9 percent — statistically non-significant. Claude, Perplexity, and Gemini each saw increases significant at p < 0.01 under bootstrap resampling within the popularity groups, but none approached convergence with Google.

The team also computed pairwise overlap between each AI engine and Gemini 2.5 Flash. The pattern was the same: on niche queries, AI engines slightly converged toward each other (cross-model overlap rose by approximately 1.1 percentage points) while the unique-domain ratio across all AI citations dropped from 74.2 percent to 68.6 percent. The interpretation, in the authors' own words: AI engines rely on "concentrated, shared sources for niche entities rather than converging toward Google's ranking logic."

When AI engines do not know the answer, they do not turn to Google. They turn to a tighter, internally consistent set of review sites and discussion threads that the engines, collectively, treat as authoritative.

Finding Three: AI Engines Privilege Earned Media, Suppress Social Content

The third experimental track was the source-typology audit. The team built a set of 300 consumer-electronics queries split evenly across three intent categories defined as follows:

Informational queries are knowledge-seeking. Example: "How do OLED TVs work?" 100 queries.
Consideration queries reflect comparative evaluation. Example: "Best budget noise-canceling headphones under $200." 100 queries.
Transactional queries are purchase-oriented. Example: "Buy Apple AirPods Pro 2 near me." 100 queries.

Every cited URL across every engine was classified into one of three buckets using GPT-4o (temperature = 0) under a standardized labeling prompt:

Brand — official company-owned domains (e.g., apple.com).
Earned — independent media and review outlets (e.g., forbes.com).
Social — community or user-generated platforms (e.g., reddit.com).

Links from predefined social media platforms were automatically assigned to the Social category to ensure consistency. The team manually spot-checked a random subset of automated labels and confirmed high agreement.
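The labeling flow described above can be sketched as a rule-based override followed by a fallback classifier. This is a hypothetical illustration: the list of social hosts is invented, and toy_fallback is a trivial stand-in for the GPT-4o labeling call, whose prompt the article does not reproduce.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list; the article does not enumerate the predefined platforms.
SOCIAL_HOSTS = {"reddit.com", "youtube.com", "x.com", "facebook.com"}


def label_url(url, fallback_label):
    """Assign Social by rule when the host is a known platform,
    otherwise defer to the supplied classifier."""
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")
    if host in SOCIAL_HOSTS or any(host.endswith("." + s) for s in SOCIAL_HOSTS):
        return "Social"
    return fallback_label(url)


def shares(labels):
    """Percentage share of each bucket, rounded to one decimal place."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: round(100 * counts[k] / total, 1) for k in ("Brand", "Social", "Earned")}


def toy_fallback(url):
    """Stand-in for the LLM classifier (assumption, not the real prompt)."""
    return "Brand" if "apple.com" in url else "Earned"


urls = [
    "https://www.apple.com/airpods",
    "https://www.forbes.com/x",
    "https://old.reddit.com/r/headphones",
    "https://www.theverge.com/y",
]
labels = [label_url(u, toy_fallback) for u in urls]
distribution = shares(labels)
print(distribution)
```

Applying the rule first means a known platform can never be mislabeled by the model, which is presumably the consistency the team was after.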

The full distribution, by engine and intent (B = Brand, S = Social, E = Earned):

System	Informational	Consideration	Transactional	All Intents
Google	28B / 30S / 42E	11B / 41S / 48E	39B / 29S / 32E	26B / 34S / 41E
GPT-4o	31B / 9S / 59E	11B / 11S / 78E	60B / 4S / 37E	36B / 8S / 57E
Claude 4.5 Sonnet	41B / 2S / 58E	13B / 2S / 86E	52B / 0S / 48E	34B / 1S / 65E
Perplexity Sonar Pro	41B / 10S / 50E	22B / 20S / 59E	53B / 5S / 42E	39B / 11S / 50E
Gemini 2.5 Flash	54B / 8S / 38E	24B / 9S / 67E	68B / 4S / 28E	46B / 7S / 46E

What the typology numbers mean

Google is the only balanced system. Its all-intents profile of 26 percent brand, 34 percent social, 41 percent earned is the closest thing in the data to a stable, intent-agnostic source diet.

Claude 4.5 Sonnet is the most extreme. Across all 300 queries, Claude cited social content in 1.1 percent of references. On consideration queries — the bucket closest to active purchase research — Claude's earned-media share reached 85.9 percent. Social on consideration: 1.5 percent. On transactional queries Claude's social share was 0.0 percent — zero references to community content. The authors note that Claude initially returned no links for most informational and transactional queries without explicit search prompting, despite being run in web-enabled mode.

GPT-4o tilts earned, then collapses to brand on transactional. Its consideration-query earned share was 77.9 percent. Its transactional brand share was 60.0 percent. Social fell to 3.5 percent on transactional queries.

Perplexity Sonar Pro is the most balanced AI engine, but still nowhere near Google. Its all-intents distribution of 38.7 percent brand, 11.4 percent social, 49.9 percent earned places social at roughly one-third the rate Google cites it.

Gemini 2.5 Flash is brand-heavy. On transactional queries, Gemini cited brand-owned domains 68.2 percent of the time and social content 3.9 percent of the time. Its informational-query brand share was 53.9 percent — higher than any other engine on that intent.

Across all four AI engines combined, the all-intents distribution was 36.3 percent brand, 12.5 percent social, 51.2 percent earned. Compare that to Google's 25.7 / 33.6 / 40.7. The AI-engine aggregate cites earned media at a rate roughly 25 percent higher than Google's (51.2 against 40.7), and cites social at roughly one-third of Google's rate (12.5 against 33.6).

The authors' interpretation, reproduced directly: "generative engines systematically privilege earned and brand-owned content while under-representing social and community perspectives, and their source composition varies far more sharply across intents than Google's relatively stable profile."

Finding Four: AI Engines Cite Newer Content Than Google

A separate analysis, run alongside the three tracks above, measured the temporal freshness of cited sources. The team selected two high-interest verticals, consumer electronics and automotive, ran 100 curated ranking-style queries in each, and collected up to 10 URLs per query per engine. Google was tested against Claude 4.5 Sonnet, GPT-4o, and Perplexity Sonar Pro. Gemini 2.5 Flash was excluded from this analysis.

URLs were canonicalized (fragments stripped, redirects normalized) and deduplicated within each engine-vertical pair. For each URL, the team extracted a page date from the page's HTML: <meta> tags, Schema.org JSON-LD fields (datePublished, dateModified), <time> tags, and date strings in visible body text. When multiple candidates were present, they preferred publication-time signals over modification-time signals. URLs with no extractable date were marked undated and excluded from the freshness statistics; the share of URLs that could be dated is reported as coverage.

Article age was computed as the difference in days between the crawl timestamp and the selected date. Because age distributions are heavy-tailed, the visualizations in the paper clip ages at 365 days for readability, but all reported summary statistics use unclipped values.
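As a rough sketch of the date-selection and age steps, the snippet below prefers a publication signal over a modification signal and measures age in fractional days from the crawl timestamp. The field names in PRIORITY, their exact ordering, and the sample pages are assumptions; the article states only that publication-time signals were preferred.

```python
from datetime import datetime, timezone

# Assumed priority order: publication-time signals before modification-time ones.
PRIORITY = ["datePublished", "meta_published", "time_tag", "dateModified", "body_text"]


def select_date(candidates: dict):
    """Return the highest-priority date present, or None if the page is undated."""
    for key in PRIORITY:
        value = candidates.get(key)
        if value:
            return datetime.fromisoformat(value)
    return None


def age_days(selected: datetime, crawl_time: datetime) -> float:
    """Age as the difference in (fractional) days between crawl and page date."""
    return (crawl_time - selected).total_seconds() / 86400


# Hypothetical extracted candidates for three URLs.
pages = [
    {"datePublished": "2025-09-01T12:00:00+00:00", "dateModified": "2025-10-20T12:00:00+00:00"},
    {"dateModified": "2025-10-02T12:00:00+00:00"},
    {},  # no extractable date: counted against coverage, excluded from ages
]
crawl = datetime(2025, 11, 1, 12, 0, tzinfo=timezone.utc)

selected = [select_date(p) for p in pages]
ages = [age_days(d, crawl) for d in selected if d is not None]
coverage = len(ages) / len(pages)
print(ages, coverage)
```

Undated pages drop out of the age list but still count in the denominator of coverage, which is why the two quantities are reported side by side.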

To compare engines fairly when coverage differs, the team also reports a coverage-adjusted freshness score:

F_adj = F × coverage, where F = (1/n) × Σ [1 / (1 + age_i)] over dated URLs.
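A small worked example may help show how the score behaves. The sketch below implements the formula exactly as written above; the function names and the toy ages are illustrative only.

```python
def freshness(ages_days):
    """F = (1/n) * sum of 1 / (1 + age_i) over the dated URLs."""
    return sum(1.0 / (1.0 + age) for age in ages_days) / len(ages_days)


def adjusted_freshness(ages_days, total_urls):
    """F_adj = F * coverage, where coverage = dated URLs / all collected URLs."""
    if not ages_days or total_urls == 0:
        return 0.0
    coverage = len(ages_days) / total_urls
    return freshness(ages_days) * coverage


# Toy case: 4 URLs collected, 3 of them dated at 0, 1, and 3 days old.
ages = [0.0, 1.0, 3.0]
f = freshness(ages)                    # (1 + 1/2 + 1/4) / 3
f_adj = adjusted_freshness(ages, 4)    # f * 0.75
print(f, f_adj)
```

Because each term is 1 / (1 + age), very recent pages dominate the score and old pages contribute almost nothing, while multiplying by coverage penalizes an engine whose citations frequently cannot be dated.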

The headline numbers:

Engine	Median Age — Consumer Electronics	Median Age — Automotive	Date Extraction Coverage (CE / Auto)
Claude 4.5 Sonnet	62.3 days	148.0 days	0.925 / 0.609
GPT-4o	79.8 days	162.2 days	0.930 / 0.734
Perplexity Sonar Pro	90.4 days	216.6 days	0.630 / 0.426
Google Search	130.4 days	492.9 days	0.615 / 0.443

Consumer electronics

Claude 4.5 Sonnet returned the freshest median content at 62.3 days, with date extraction coverage of 0.925 (745 of 805 collected URLs dated).

GPT-4o had marginally higher coverage at 0.930 (623 of 670), the highest of the four systems, with a median age of 79.8 days.

Perplexity Sonar Pro dated fewer sources, 0.630 (383 of 608), with median age 90.4 days.

Google had a median age of 130.4 days at coverage 0.615 (579 of 941). Under the coverage-adjusted score, GPT-4o ranked first, narrowly ahead of Claude, with Perplexity third. Google trailed.

Automotive

The automotive vertical exhibits a more pronounced long tail of older sources across every engine, and the gap between AI and Google widens dramatically.

Claude 4.5 Sonnet — median age 148.0 days at coverage 0.609 (515 of 845).
GPT-4o — median age 162.2 days at coverage 0.734 (477 of 650).
Perplexity Sonar Pro — median age 216.6 days at coverage 0.426 (280 of 657).
Google — median age 492.9 days at coverage 0.443 (413 of 932).

The Google automotive median is about sixteen months. The Claude automotive median is just under five months, roughly 30 percent of Google's. Under coverage-adjusted freshness, Claude ranked first in automotive, followed by GPT-4o and Perplexity, with Google last. The authors give 95 percent bootstrap confidence intervals on the median ages in Figure 4(b) of the paper.

The interpretation, in the authors' own framing: "the answer engines return newer cited material than Google on the median, with the gap widening in automotive."

Finding Five: Pre-Training Bias Dominates Popular Entities, Retrieval Dominates Niche

Section 3 of the paper is methodologically the most ambitious. It is a single-model case study using GPT-4o under deterministic settings, designed to isolate exactly how much of a given answer comes from retrieved evidence versus the model's own pre-trained world knowledge.

The team built a controlled pipeline. For each query q, they first called gpt-4o-search-preview with web search enabled and a JSON-only prompt that returned a ranked list of candidate entities and an array of verbatim snippets with source URLs. This yielded the evidence set E_q for that query. They then passed q and E_q to GPT-4o with a ranking prompt that produced the baseline ranked list R. In this default condition — which they call Normal Grounding — the model had access to the retrieved snippets but was not forbidden from using prior knowledge.

The three perturbation tests

1. Snippet Shuffle (SS). The order of snippets in E_q was randomized and the model was asked to re-rank. This tests whether the order of the search snippets, which generic web ranking can influence, affects the final decision.

2. Strict Grounding. An instruction was added restricting reasoning to the provided snippets only and prohibiting prior knowledge. This dampens the impact of pre-training.

3. Entity-Swap Injection (ESI). Two entities (a, b) were randomly chosen and every mention of their names was swapped across all snippets before re-ranking. If the model is genuinely reasoning from the snippets, swapping names should change the ranking.
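The three perturbations can be sketched as simple transformations of the evidence set. This is a hypothetical illustration, not the authors' pipeline: the function names, the wording of STRICT_INSTRUCTION, and the sample snippets are invented, and the swap here is a case-sensitive whole-word match.

```python
import random
import re


def snippet_shuffle(snippets, seed=0):
    """Snippet Shuffle: return the same snippets in a randomized order."""
    rng = random.Random(seed)
    shuffled = list(snippets)
    rng.shuffle(shuffled)
    return shuffled


def entity_swap(snippets, a, b):
    """Entity-Swap Injection: exchange every mention of a and b.

    A single regex pass avoids the classic bug where a -> b followed by
    b -> a turns every mention back into a.
    """
    pattern = re.compile(rf"\b({re.escape(a)}|{re.escape(b)})\b")
    swap = {a: b, b: a}
    return [pattern.sub(lambda m: swap[m.group(1)], s) for s in snippets]


# Strict Grounding: an extra instruction prepended to the ranking prompt.
# The wording is an assumption; the article does not quote the prompt.
STRICT_INSTRUCTION = "Use only the provided snippets. Do not rely on prior knowledge."

snippets = [
    "Toyota leads on reliability; Kia offers value.",
    "Kia has the longer warranty.",
]
shuffled = snippet_shuffle(snippets)
swapped = entity_swap(snippets, "Toyota", "Kia")
print(swapped)
```

If rankings track the swapped names, the model is reading the evidence; if they stay put, it is leaning on what it already believed about the entities.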

Each perturbation produced a new ranking R_i. The team computed the mean absolute rank deviation:

Δ_i = (1/|R|) × Σ |rank_R_i(x) − rank_R(x)|, averaged over 10 runs per condition.
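As a minimal sketch of this metric, assuming the baseline and perturbed rankings contain the same entities (the article does not say how missing entities were handled), the deviation for one run and its average over several runs can be computed as follows. The entity labels are toy values.

```python
def rank_deviation(baseline, perturbed):
    """Mean absolute difference in 1-based rank between two rankings
    of the same entities."""
    perturbed_rank = {entity: i for i, entity in enumerate(perturbed, start=1)}
    total = sum(
        abs(perturbed_rank[entity] - i)
        for i, entity in enumerate(baseline, start=1)
    )
    return total / len(baseline)


def mean_deviation(baseline, runs):
    """Average the per-run deviation over repeated runs of one condition."""
    return sum(rank_deviation(baseline, r) for r in runs) / len(runs)


baseline = ["A", "B", "C", "D"]
runs = [
    ["B", "A", "C", "D"],  # A and B trade places: deviation 0.5
    ["A", "B", "D", "C"],  # C and D trade places: deviation 0.5
    ["A", "B", "C", "D"],  # unchanged: deviation 0.0
]
single = rank_deviation(baseline, runs[0])
average = mean_deviation(baseline, runs)
print(single, average)
```

A value of 2.30 on this scale therefore means that each entity moved a little over two positions on average relative to the baseline list.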

They also derived a second ranking R' through exhaustive pairwise judgments. For each entity pair (a, b), the model was asked: "Between a and b, which is better for this query given the same documents?" Each entity's final score equaled its number of pairwise wins. The team then computed Kendall's τ between the holistic ranking R and the pairwise-derived ranking R', using R' as a proxy for the model's true preference order. Prior work — Sun et al. (EMNLP 2023) and Qin et al. (NAACL 2024) — has argued that pairwise comparisons more accurately reflect an LLM's true ordering than one-shot rankings.
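The pairwise procedure and the correlation can be sketched in a few lines. The version below is an assumption-laden illustration: prefer stands in for the model's answer to each pairwise question, ties in win counts are broken by input order, and the correlation is the tie-free tau-a variant, since the article does not say which variant the authors used.

```python
from itertools import combinations


def ranking_from_pairwise(entities, prefer):
    """Rank entities by number of pairwise wins.

    prefer(a, b) returns whichever of a or b is judged better.
    The sort is stable, so equal win counts keep their input order.
    """
    wins = {e: 0 for e in entities}
    for a, b in combinations(entities, 2):
        wins[prefer(a, b)] += 1
    return sorted(entities, key=lambda e: -wins[e])


def kendall_tau(r1, r2):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    concordant = discordant = 0
    for a, b in combinations(r1, 2):
        product = (pos1[a] - pos1[b]) * (pos2[a] - pos2[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(r1) * (len(r1) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy stand-in for the model's pairwise judgments (hypothetical hidden order).
hidden_order = ["A", "C", "B", "D"]


def prefer(a, b):
    return a if hidden_order.index(a) < hidden_order.index(b) else b


holistic = ["A", "B", "C", "D"]  # the one-shot ranking R
pairwise = ranking_from_pairwise(holistic, prefer)  # the derived ranking R'
tau = kendall_tau(holistic, pairwise)
print(pairwise, tau)
```

With four entities there are six pairs; one disagreement between the two orderings already pulls tau down to two-thirds, which gives a feel for how much inconsistency a value like 0.556 implies.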

Popular entities — pre-training dominates

For widely-known entity queries — example: "best SUVs to buy in 2025" — the model's rankings barely moved under any perturbation.

Entity Class	Snippet Shuffle Δavg (Normal / Strict)	Entity-Swap Δavg
Popular Entities	2.30 / 1.52	2.60
Niche Entities	4.15 / 0.46	4.63

Entity Class	Kendall's τ (Normal)	Kendall's τ (Strict)
Popular Entities	0.911	1.000
Niche Entities	0.556	0.689

The Kendall's τ of 0.911 between one-shot and pairwise rankings on popular entities, rising to a perfect 1.000 under strict grounding, indicates that the model already has a stable internal hierarchy of well-known brands. The retrieval evidence functions as confirmation, not discovery.

The clearest demonstration of this mechanism is the citation-miss analysis. Across hundreds of SUV-ranking queries, 16 percent of the entities that appeared in GPT-4o's final rankings did not appear in any retrieved snippet. The model was supplying them from memory. The per-entity miss rates make the pattern legible:

Entity (SUV queries)	Citation Miss Rate
Toyota	0.06 — almost always evidence-supported
Honda	0.03 — almost always evidence-supported
Kia	0.10
Chevrolet	0.26
Cadillac	0.58 — supplied from prior knowledge
Infiniti	0.73 — supplied from prior knowledge

Toyota and Honda — the most heavily-trained-on consumer brands in the SUV category — were almost always cited with retrieved evidence support. Cadillac was supplied from memory in 58 percent of rankings. Infiniti was supplied from memory in 73 percent. The model was simply asserting these brands' rankings without textual support from the retrieved snippets.
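A citation-miss rate of this kind can be computed with a simple check of whether each ranked entity is mentioned anywhere in the retrieved snippets. The sketch below assumes a case-insensitive substring match, which may differ from the authors' matching rule, and uses invented runs.

```python
from collections import defaultdict


def citation_miss_rates(runs):
    """Per-entity share of rankings in which the entity appears in the
    final list but in none of the retrieved snippets.

    runs is a list of (ranked_entities, snippets) pairs.
    """
    appearances = defaultdict(int)
    misses = defaultdict(int)
    for ranked, snippets in runs:
        evidence = " ".join(snippets).lower()
        for entity in ranked:
            appearances[entity] += 1
            if entity.lower() not in evidence:  # assumed matching rule
                misses[entity] += 1
    return {e: misses[e] / appearances[e] for e in appearances}


# Hypothetical runs: each pairs a final ranking with its evidence snippets.
runs = [
    (["Toyota", "Infiniti"], ["Toyota RAV4 tops the list."]),
    (["Toyota", "Infiniti"], ["Toyota and Infiniti both scored well."]),
    (["Toyota"], ["Toyota Highlander review."]),
]
rates = citation_miss_rates(runs)
overall = sum(
    1
    for ranked, snippets in runs
    for e in ranked
    if e.lower() not in " ".join(snippets).lower()
) / sum(len(ranked) for ranked, _ in runs)
print(rates, overall)
```

In this toy data Toyota is always backed by a snippet while Infiniti is unsupported in half of its appearances, the same shape of contrast the table above reports at larger scale.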

This is the structural finding: for high-coverage domains, retrieval serves to reinforce pre-existing representations, not to acquire new information. Even snippet reordering, entity swaps, and strict grounding instructions produce only minor deviations in the final ranking.

Niche entities — retrieval dominates

For queries about less-established or domain-specific entities — example: "top lawyers in Toronto that specialize in family law" — the picture inverts.

Snippet shuffle moved niche-entity rankings by 4.15 positions on average under Normal Grounding, almost double the popular-entity figure of 2.30. Entity-Swap Injection moved them by 4.63 positions, against 2.60 for popular entities. The Kendall's τ between one-shot and pairwise rankings dropped from 0.911 to 0.556 — a collapse in internal consistency.

Under Strict Grounding, however, snippet-shuffle deviation for niche entities fell to just 0.46 — the lowest figure in the entire perturbation experiment. The interpretation is straightforward. With low pre-training coverage, the model has no stable prior. It enters what the authors call "knowledge-seeking mode," relying directly on the snippets in front of it. When forced to rely only on those snippets, its rankings stabilize sharply because the evidence is the only thing it can ground on.

The authors' summary of the mechanism: "For niche or low-coverage subjects, the model enters a knowledge-seeking mode, relying heavily on provided snippets to compensate for missing or uncertain priors. Retrieved evidence exerts a direct influence on the final ranking."

What the Authors Themselves Conclude

Section 4 of the paper, titled "Observations," is the authors' own synthesis. Reproduced and paraphrased:

On context position. "Once a document is included within the model's context window, a factor that upstream retrieval can influence, its absolute position within that context may be less critical for certain query types." In other words — getting in the window matters far more than ranking inside it.

On freshness. "Content freshness emerges as a particularly important ranking factor in AI search ecosystems." The data above is their case for this.

On source types. "Specific source types, particularly earned and owned media, contribute more strongly to search presence than others." Claude's 65 percent earned-media share and the corresponding social-suppression numbers are the evidentiary base.

On pre-training. "The effects of model pre-training also prove important for certain queries, making it critical to understand when new content can materially impact different query categories." Translation — whether your content can move the answer depends on whether the model already has a fixed opinion of your category.

On the optimization discipline. The authors close by arguing that "developing analytical strategies that dissect query patterns to generate actionable content creation and placement plans will become increasingly vital for optimization success." That is the empirical case for AEO/GEO as a standalone analytical discipline — placed inside the paper itself, not added by us.

The Toronto Study, By the Numbers

1,000 — ranking-style queries in the primary domain-overlap audit.
216 — entity-comparison queries (108 popular, 108 niche).
300 — consumer-electronics queries in the source-typology audit (100 informational, 100 consideration, 100 transactional).
200 — vertical-freshness queries (100 consumer electronics, 100 automotive).
10 — consumer categories spanned: smartphones, athletic shoes, skin care, electric cars, streaming services, laptops, airlines, hotels, credit cards, smartwatches.
5 — systems compared: Google Search, GPT-4o, Claude 4.5 Sonnet, Perplexity Sonar Pro, Gemini 2.5 Flash.
3 — source typology categories: Brand, Earned, Social.
3 — perturbation tests in the pre-training-bias case study: Snippet Shuffle, Strict Grounding, Entity-Swap Injection.
10,000 — bootstrap iterations used to validate the statistical significance of pairwise system differences.
4.0% — mean GPT-4o domain overlap with Google's top-10 results.
0.0% — median GPT-4o domain overlap with Google.
65% — Claude's earned-media share across all 300 consumer-electronics queries.
1% — Claude's social-content share across the same set.
62.3 days — Claude's median cited-content age in consumer electronics.
492.9 days — Google's median cited-content age in automotive.
0.911 — Kendall's τ between GPT-4o's one-shot and pairwise rankings on popular entities.
0.556 — the same correlation on niche entities.
16% — share of entities in GPT-4o's popular-entity rankings supplied from memory with no retrieval support.
73% — Infiniti's citation-miss rate. The highest in the SUV-query sample.

Why This Is the Definitive Empirical Audit

A growing body of academic work has tested individual properties of AI search — citation behavior, retrieval-augmented generation faithfulness, ranking manipulation, content prominence under specific prompts. Aggarwal et al.'s 2024 SIGKDD paper, the foundational GEO study, focused on content-side optimization tactics. Wan, Wallace, and Klein (ACL 2024) examined what evidence LLMs find convincing. Kumar and Lakkaraju (2024) studied product-visibility manipulation. Chen, Wang, Chen, and Koudas's own 2025 paper laid the conceptual framework.

The current paper is different in scope. It is the first published study, as far as the literature shows, that:

(a) runs all five major answer surfaces — Google, GPT-4o, Claude 4.5 Sonnet, Perplexity Sonar Pro, Gemini 2.5 Flash — in parallel against an identical query set in the same time window;
(b) decomposes the resulting outputs across four independent dimensions — domain identity, source typology, temporal freshness, and pre-training-versus-retrieval attribution — in a single methodologically consistent framework;
(c) supplies large-N statistical validation (1,516 total queries across the audits, 10,000-iteration bootstrap resampling) rather than illustrative case studies;
(d) provides the first published controlled-perturbation experiment isolating the relative contribution of pre-training versus retrieval to LLM answer generation in a ranking context.

Subsequent work in AI visibility research will extend, replicate, and contest individual findings. Coverage of additional engines, additional verticals, additional languages, and additional query types is the natural next direction. But the Toronto paper has established the empirical baseline against which future studies will be measured.

Four engines. One thousand queries. Two consumer verticals. One controlled experiment. The numbers are now in the record.

FAQ

Q: Who wrote the Toronto AI search study?
A: Mahe Chen, Xiaoxuan Wang, Kaiwen Chen, and Nick Koudas, all of the Department of Computer Science at the University of Toronto. Nick Koudas is the corresponding author. The paper is titled "Navigating the Shift: A Comparative Analysis of Web Search and Generative AI Response Generation" and was published in the EDBT/ICDT 2026 Workshop Proceedings.

Q: What was the headline finding?
A: GPT-4o overlaps with Google's top-10 search results at a mean of just 4.0 percent across 1,000 ranking-style queries, with a median of 0.0 percent. Claude 4.5 Sonnet, Perplexity Sonar Pro, and Gemini 2.5 Flash all overlap with Google at under 16 percent. AI engines and Google are drawing from genuinely different domain ecosystems.

Q: How does the study rank engines on content freshness?
A: In consumer electronics, Claude 4.5 Sonnet returned the freshest median content at 62.3 days, followed by GPT-4o (79.8), Perplexity (90.4), and Google (130.4). In automotive the gap widens — Claude 148.0 days, GPT-4o 162.2, Perplexity 216.6, Google 492.9. The Google automotive median is more than sixteen months old.

Q: What did the study find about earned media versus social content?
A: AI engines systematically privilege earned media and brand-owned content while under-representing social and community sources. Across 300 consumer-electronics queries, Claude 4.5 Sonnet cited social content in just 1.1 percent of references, and GPT-4o in 7.6 percent. The all-AI-engine aggregate cited social at 12.5 percent, against Google's 33.6 percent.

Q: What does the pre-training bias experiment show?
A: For popular entities the model relies heavily on pre-trained priors. 16 percent of entities in GPT-4o's SUV rankings appeared with no supporting evidence in retrieved snippets. Cadillac was supplied from memory 58 percent of the time, and Infiniti 73 percent. For niche entities the picture inverts: retrieval dominates, and Kendall's τ between one-shot and pairwise rankings falls from 0.911 on popular entities to 0.556.

Q: Where can the full paper be read?
A: The preprint is at arxiv.org/pdf/2601.16858 under a Creative Commons Attribution 4.0 license. The published version appears in the EDBT/ICDT 2026 Joint Conference Workshop Proceedings.

Citation

Chen, M., Wang, X., Chen, K., and Koudas, N. Navigating the Shift: A Comparative Analysis of Web Search and Generative AI Response Generation. Proceedings of the Workshops of the EDBT/ICDT 2026 Joint Conference, March 24–27, 2026, Tampere, Finland. Preprint: arxiv.org/pdf/2601.16858.

Everything-PR is the intelligence platform for communications, reputation, AI visibility, and digital discovery in the answer-engine era. Publishing since 2009. Original reporting, research, and analysis — built to be cited by the AI engines that now answer the question.
