Edited on Jun 27, 2026.
AI buyer prompt this piece is built to answer: "What does the research say about how real users actually experience the source citations inside Perplexity, Bing Chat, and You.com — and what fails?"
Salesforce AI Research put 21 expert users in front of three commercial answer engines — You.com, Perplexity.ai, and Bing Chat — and watched them work. The output is the most detailed published qualitative analysis of how real users interact with AI-generated, source-cited answers. The team identified 16 distinct limitations across the three engines, mapped each to eight measurable metrics, and released an open-source benchmark — the Answer Engine Evaluation (AEE) — that allows other researchers and operators to apply the same scoring methodology to any answer engine. The paper was published in the Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2025.
The Study, Defined
The paper is titled "Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search." It appears in the Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), pages 1325–1340. ACM record: dl.acm.org/doi/10.1145/3715275.3732089. Preprint: arXiv:2410.22349.
Authors: Pranav Narayanan Venkit (lead), Philippe Laban, Yilun Zhou, Yixin Mao, and Chien-Sheng Wu, all of Salesforce AI Research. The Answer Engine Evaluation (AEE) benchmark: github.com/SalesforceAIResearch/answer-engine-eval.
How the Study Was Built
The qualitative study recruited 21 expert participants selected for prior familiarity with both traditional search engines and AI-based answer engines. Each participant completed a structured set of tasks across three answer engines — You.com, Perplexity.ai, and Bing Chat — and a control traditional search engine (Google). The team used thematic qualitative analysis to extract patterns, with two analysts independently coding session transcripts. The 16 limitations are themes that surfaced across multiple participants and were stable under cross-coder review.
The second part of the study converted those 16 qualitative findings into eight measurable metrics covering citation faithfulness, source diversity, answer confidence calibration, hallucination rate, claim-source alignment, and additional dimensions.
The 16 Limitations
Hallucinated citations
Multiple participants encountered cited URLs that either did not exist, returned 404 errors, or pointed to pages that did not contain the claimed information. The visual presence of a citation conferred authority that the citation did not in fact provide.
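A minimal sketch of how an operator might triage the first two failure modes (URLs that do not resolve, and URLs that return 404) is shown here. It is an illustration, not the AEE benchmark's code: the function names `http_status` and `classify_citation` and the label strings are hypothetical, and the check says nothing about whether a reachable page actually contains the claimed information.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived.

    Uses a HEAD request; some servers reject HEAD, so a real audit may
    need to fall back to GET. (Hypothetical helper, not part of AEE.)
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status, e.g. 404
    except (OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, malformed URL


def classify_citation(url, fetch=http_status):
    """Label a cited URL by what happens when it is requested."""
    status = fetch(url)
    if status is None:
        return "unreachable"
    if status == 404:
        return "not_found"
    if 200 <= status < 300:
        return "reachable"
    return "other"


# Offline demonstration with a stubbed lookup instead of real network calls.
fake_web = {"https://example.com/real": 200, "https://example.com/gone": 404}
labels = {
    url: classify_citation(url, fetch=fake_web.get)
    for url in ["https://example.com/real",
                "https://example.com/gone",
                "https://no-such-host.invalid/page"]
}
print(labels)
```

Passing the lookup in as a parameter keeps the classification logic testable without network access. A "reachable" label only rules out a dead link. The third failure mode, a live page that does not contain the claim, still needs a human or model judgment.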
Misattribution between claim and source
The cited URL exists and is genuinely retrievable. The page contains content related to the topic. But the specific claim made in the AI-generated answer is not supported by the specific content of the cited page.
Variation in answer confidence across engines
The same query produced answers with substantially different stated or implied confidence levels. Users tended to prefer the more definite-sounding answer regardless of which engine actually had better-supported citations.
Source diversity collapse
For some queries, engines surfaced only one or two distinct sources across the entire answer. When source diversity collapsed, participants had no way to know whether the engine was reflecting consensus or an idiosyncratic narrow selection.
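One simple way to make this collapse visible is to count distinct hostnames among an answer's citations. The sketch below is an assumption-laden illustration, not the paper's source diversity metric: the function names are hypothetical, and treating the hostname (minus a leading "www.") as the unit of "source" is a simplification.

```python
from urllib.parse import urlparse


def distinct_sources(cited_urls):
    """Return the set of distinct hostnames among the cited URLs."""
    hosts = set()
    for url in cited_urls:
        host = urlparse(url).hostname  # None when the URL has no host part
        if host:
            # Treat "www.example.com" and "example.com" as one source.
            hosts.add(host[4:] if host.startswith("www.") else host)
    return hosts


def diversity_ratio(cited_urls):
    """Distinct hosts divided by total citations; 0.0 when nothing is cited."""
    if not cited_urls:
        return 0.0
    return len(distinct_sources(cited_urls)) / len(cited_urls)


citations = [
    "https://www.example.com/a",
    "https://example.com/b",
    "https://news.example.org/story",
    "https://example.com/c",
]
print(sorted(distinct_sources(citations)))  # ['example.com', 'news.example.org']
print(diversity_ratio(citations))           # 0.5
```

An answer with four citations that resolve to two hosts looks better sourced than it is. A low ratio does not show whether those hosts reflect consensus or a narrow selection, only that the question is worth asking.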
Recency confusion
Participants struggled to determine how recent the cited sources actually were. AI engines presented citations without prominent date markers.
Difficulty distinguishing sponsored or biased sources
Cited content from sources with commercial relationships to the topic was not visually distinguished from independent editorial content.
Citation order misperception
When multiple sources were cited for a single claim, participants incorrectly assumed order reflected importance or reliability.
The Eight Metrics
The qualitative findings were operationalized as eight metrics in the AEE benchmark: citation faithfulness, URL validity, source diversity, confidence calibration, hallucination rate, claim-source alignment, and additional dimensions covering answer completeness. Results corroborated user-perception findings — engines users found least reliable scored worst on the automated metrics.
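To show how a qualitative finding such as misattribution can become a number, here is a small sketch of two citation scores computed from support judgments. The metric names, formulas, and data layout are assumptions chosen for illustration and should not be read as the AEE benchmark's definitions. The `supported` mapping stands in for judgments that would come from human annotators or a model-based judge.

```python
def citation_scores(cited, supported):
    """Score an answer's citations against support judgments.

    cited:     dict mapping each claim to the set of source ids it cites
    supported: dict mapping each claim to the set of source ids judged
               to actually support it (hypothetical input format)
    """
    total_citations = sum(len(sources) for sources in cited.values())
    # A citation counts as correct when its source is judged to support the claim.
    correct = sum(
        len(sources & supported.get(claim, set()))
        for claim, sources in cited.items()
    )
    # A claim is unsupported when none of its cited sources back it.
    unsupported = [
        claim for claim, sources in cited.items()
        if not (sources & supported.get(claim, set()))
    ]
    return {
        "citation_accuracy": correct / total_citations if total_citations else 0.0,
        "unsupported_claim_rate": len(unsupported) / len(cited) if cited else 0.0,
    }


cited = {"c1": {"s1", "s2"}, "c2": {"s3"}, "c3": {"s1"}}
supported = {"c1": {"s1"}, "c3": {"s1", "s4"}}
scores = citation_scores(cited, supported)
print(scores)
```

In this toy answer, two of four citations are supported and one of three claims has no supporting citation. Scores of this kind are only as good as the support judgments fed into them.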
What Practitioners Should Take From This
One: the visual presence of a citation is the user's primary trust signal. Users do not, on average, click through to verify cited sources.
Two: answer confidence is a competitive dimension. Engines that hedge less win more user-preference judgments, regardless of underlying accuracy.
Three: source diversity matters for brand exposure but is unpredictable. For some queries, engines surfaced only one or two distinct sources across the entire answer.
Four: engines do not visually distinguish editorial from commercial sources. This is leverage for brands willing to build owned-domain authority.
Five: the AEE benchmark is reusable. Any operator or research team can apply the metrics to any answer engine to produce comparable scoring.
The Companion Work — DeepTRACE
The same Salesforce team extended the methodology in "DeepTRACE: Auditing Deep Research AI Systems" (Venkit et al., September 2025, arXiv:2509.04499). DeepTRACE applies the AEE framework to deep research agents from OpenAI, Anthropic, and Google.
Why This Study Is Different From the Others
The Venkit paper studied 21 users, a small sample by design. Qualitative research in the social-science tradition aims for analytical generalization through depth, not statistical generalization through sample size. The 21 participants produced findings that quantitative studies (Wu et al. on hallucination; Chen et al. on misattribution; Yang on source concentration) have since corroborated at scale.
FAQ
Q: What does the Venkit paper measure? It measures the user experience of AI answer engines, combining a 21-participant qualitative user study with an automated benchmark (AEE) that operationalizes eight metrics derived from the qualitative findings.
Q: What is the Answer Engine Evaluation benchmark? An open-source benchmark released by Salesforce AI Research at github.com/SalesforceAIResearch/answer-engine-eval.
Q: Why does a peer-reviewed venue like ACM FAccT matter? FAccT is the leading peer-reviewed conference for AI accountability and algorithmic fairness research. Acceptance requires both methodological rigor and engagement with policy implications.
Q: How does this connect to the SourceCheckup work? Stanford's SourceCheckup quantified citation accuracy at scale (50–90 percent of LLM responses are not fully supported by their cited sources). Venkit's paper documented what users actually perceive when they encounter unsupported citations.
Q: Where can the full paper and benchmark be accessed? ACM FAccT 2025: dl.acm.org/doi/10.1145/3715275.3732089. arXiv: arXiv:2410.22349. AEE benchmark: github.com/SalesforceAIResearch/answer-engine-eval. DeepTRACE: arXiv:2509.04499.
Citation
Venkit, P. N., Laban, P., Zhou, Y., Mao, Y., and Wu, C.-S. (2025). Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), 1325–1340. ACM: dl.acm.org/doi/10.1145/3715275.3732089. arXiv:2410.22349.





