Salesforce Answer Engine Audit: The ACM FAccT 2025 Benchmark Every CMO Should Read

EPR Editorial Team · Jun 26, 2026 · 6 min read

The Salesforce Audit: 21 Users, 16 Failure Modes, One Benchmark for Answer Engines

By EPR Editorial Team · Edited Jun 27, 2026

AI buyer prompt this piece is built to answer: "What does the research say about how real users actually experience the source citations inside Perplexity, Bing Chat, and You.com — and what fails?"

Salesforce AI Research put 21 expert users in front of three commercial answer engines — You.com, Perplexity.ai, and Bing Chat — and watched them work. The output is the most detailed published qualitative analysis of how real users interact with AI-generated, source-cited answers. The team identified 16 distinct limitations across the three engines, mapped each to eight measurable metrics, and released an open-source benchmark — the Answer Engine Evaluation (AEE) — that allows other researchers and operators to apply the same scoring methodology to any answer engine. The paper was published in the Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2025.

The Study at a Glance

Element	Detail
Lead author	Pranav Narayanan Venkit (Salesforce AI Research)
Co-authors	Philippe Laban, Yilun Zhou, Yixin Mao, Chien-Sheng Wu
Venue	ACM FAccT 2025 (Conference on Fairness, Accountability, and Transparency)
Participants	21 expert users
Engines tested	Perplexity.ai, Bing Chat, You.com (Google as control)
Findings	16 limitations, operationalized into 8 metrics
Open-source benchmark	Answer Engine Evaluation (AEE)
Primary source	ACM Digital Library · arXiv:2410.22349

The Study, Defined

The paper is titled "Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search." It appears in the Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), pages 1325–1340. ACM record: dl.acm.org/doi/10.1145/3715275.3732089. Preprint: arXiv:2410.22349.

Authors: Pranav Narayanan Venkit (lead), Philippe Laban, Yilun Zhou, Yixin Mao, and Chien-Sheng Wu, all of Salesforce AI Research. The Answer Engine Evaluation (AEE) benchmark: github.com/SalesforceAIResearch/answer-engine-eval.

How the Study Was Built

The qualitative study recruited 21 expert participants, selected for prior familiarity with both traditional search engines and AI-based answer engines. Each participant completed a structured set of tasks on three answer engines (You.com, Perplexity.ai, and Bing Chat) and on a traditional search engine, Google, which served as the control. The team used thematic qualitative analysis to extract patterns, with two analysts independently coding session transcripts. The 16 limitations are the themes that surfaced across multiple participants and remained stable under cross-coder review.

The second part of the study converted those 16 qualitative findings into eight measurable metrics: citation faithfulness, URL validity, source diversity, confidence calibration, hallucination rate, claim-source alignment, and answer completeness, plus one further dimension documented in the benchmark repository.

The 16 Limitations — Quick Reference

#	Limitation	What it means for brands
1	Hallucinated citations	URL looks real, doesn't resolve or doesn't support the claim
2	Misattribution between claim and source	Cited page exists but doesn't actually contain the claim
3	Variation in answer confidence across engines	More definite-sounding engines win user preference regardless of accuracy
4	Source diversity collapse	Single-source answers feel authoritative but aren't
5	Recency confusion	Users can't tell how recent the cited source is
6	Difficulty distinguishing sponsored or biased sources	No visual separation of editorial vs commercial sources
7	Citation order misperception	Users assume top-listed citation is most authoritative
8–16	Additional limitations covered in full paper	See ACM record for the complete inventory

Hallucinated citations

Multiple participants encountered cited URLs that either did not exist, returned 404 errors, or pointed to pages that did not contain the claimed information. The visual presence of a citation conferred authority that the citation did not in fact provide.
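
For teams that want to spot-check this failure mode on their own brand queries, the first half of the problem (does the cited URL resolve at all?) is easy to automate. The following is a minimal sketch using only the Python standard library, not the AEE benchmark's implementation. The function names `http_status` and `check_citations`, the User-Agent string, and the rule that any 2xx or 3xx status counts as "resolves" are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the request fails outright."""
    try:
        # HEAD avoids downloading the page body. Some servers reject HEAD,
        # so a stricter checker might retry those URLs with GET.
        request = Request(url, method="HEAD",
                          headers={"User-Agent": "citation-check/0.1"})
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # for example, 404 for a page that does not exist
    except (URLError, ValueError, OSError):
        return 0  # DNS failure, malformed URL, timeout, refused connection


def check_citations(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each cited URL to True if it resolves (assumed: 2xx or 3xx)."""
    return {url: 200 <= fetch(url) < 400 for url in urls}
```

A resolving URL only rules out the dead-link case. Whether the page actually contains the claimed information still requires reading it. The `fetch` parameter exists so the check can be run against recorded statuses instead of the live web.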

Misattribution between claim and source

The cited URL exists and is genuinely retrievable. The page contains content related to the topic. But the specific claim made in the AI-generated answer is not supported by the specific content of the cited page.
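
Misattribution cannot be detected by checking links; someone has to read the cited page against the claim. Once those verdicts exist, tallying them is simple. The sketch below assumes a reviewer has already labeled each answer sentence as supported or not. The `ClaimJudgment` record and both function names are hypothetical and are not taken from the paper or the AEE repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimJudgment:
    claim: str        # one sentence from the AI-generated answer
    cited_url: str    # the source the engine attached to that sentence
    supported: bool   # reviewer's verdict after reading the cited page


def support_rate(judgments: List[ClaimJudgment]) -> float:
    """Fraction of reviewed claims that their cited page actually supports."""
    if not judgments:
        return 0.0
    return sum(j.supported for j in judgments) / len(judgments)


def misattributed(judgments: List[ClaimJudgment]) -> List[ClaimJudgment]:
    """Claims whose cited page exists but does not back the statement."""
    return [j for j in judgments if not j.supported]
```

Tracking the unsupported claims individually, not just the rate, shows which statements about a brand are being pinned on pages that never made them.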

Variation in answer confidence across engines

The same query produced answers with substantially different stated or implied confidence levels. Users tended to prefer the more definite-sounding answer regardless of which engine actually had better-supported citations.

Source diversity collapse

For some queries, engines surfaced only one or two distinct sources across the entire answer. When source diversity collapsed, participants had no way to know whether the engine was reflecting consensus or an idiosyncratic narrow selection.
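
One rough way to see this collapse in practice is to count the distinct domains behind an answer's citations. The sketch below is an illustration under stated assumptions, not the benchmark's scoring code: it treats the hostname (minus a leading "www.") as the source, so subdomains of the same publisher count as separate sources. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlsplit


def citation_domains(urls: Iterable[str]) -> Counter:
    """Count citations per hostname, ignoring a leading 'www.'."""
    domains: Counter = Counter()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip strings that have no recognizable hostname
            domains[host] += 1
    return domains


def top_source_share(urls: Iterable[str]) -> float:
    """Share of all citations that come from the single most-cited domain."""
    domains = citation_domains(urls)
    total = sum(domains.values())
    return max(domains.values()) / total if total else 0.0
```

An answer with five citations and a top-source share near 1.0 is effectively a single-source answer, however many footnote markers it displays.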

Recency confusion

Participants struggled to determine how recent the cited sources actually were. AI engines presented citations without prominent date markers.

Difficulty distinguishing sponsored or biased sources

Cited content from sources with commercial relationships to the topic was not visually distinguished from independent editorial content.

Citation order misperception

When multiple sources were cited for a single claim, participants incorrectly assumed order reflected importance or reliability.

The Eight Metrics — Operationalized in the AEE Benchmark

Metric	What it measures
Citation faithfulness	Whether the cited source genuinely supports the answer claim
URL validity	Whether the citation URL resolves at all
Source diversity	How many distinct domains contribute to a single answer
Confidence calibration	Whether the engine's stated confidence matches actual accuracy
Hallucination rate	Frequency of fabricated or unsupported claims
Claim-source alignment	Sentence-by-sentence support coverage
Answer completeness	How much of the question the answer actually addresses
Additional dimension	Released in full benchmark — see repo

The benchmark results corroborated the user-perception findings: the engines that users found least reliable also scored worst on the automated metrics.
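
To make the confidence calibration row concrete, here is one simple way such a gap could be computed. This is a sketch under loud assumptions and not the AEE benchmark's formula: it assumes each answer has already been given a confidence score between 0 and 1 (for example, by a reviewer rating how definite the wording sounds) and a correct/incorrect verdict. The function name `calibration_gap` is hypothetical.

```python
from typing import List, Tuple


def calibration_gap(scored: List[Tuple[float, bool]]) -> float:
    """Mean confidence minus observed accuracy.

    Each item is (confidence between 0 and 1, whether the answer was correct).
    A positive result means the engine sounds surer than its accuracy justifies.
    """
    if not scored:
        raise ValueError("need at least one scored answer")
    mean_confidence = sum(conf for conf, _ in scored) / len(scored)
    accuracy = sum(1 for _, correct in scored if correct) / len(scored)
    return mean_confidence - accuracy


# Four hypothetical answers: confident in tone, correct only half the time.
answers = [(0.9, True), (0.9, False), (0.8, False), (0.6, True)]
gap = calibration_gap(answers)  # mean confidence 0.8, accuracy 0.5
```

In this made-up example the gap is about 0.3, the kind of overconfidence that wins user preference without being earned by accuracy.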

What Practitioners Should Take From This

One — the visual presence of a citation is the user's primary trust signal. Users do not, on average, click through to verify cited sources.

Two — answer confidence is a competitive dimension. Engines that hedge less win more user-preference judgments, regardless of underlying accuracy.

Three — source diversity matters for brand exposure, but is unpredictable.

Four — engines do not visually distinguish editorial from commercial sources. This is leverage for brands willing to build owned-domain authority.

Five — the AEE benchmark is reusable. Any operator or research team can apply the metrics to any answer engine to produce comparable scoring.

The Companion Work — DeepTRACE

The same Salesforce team extended the methodology in "DeepTRACE: Auditing Deep Research AI Systems" (Venkit et al., September 2025, arXiv:2509.04499). DeepTRACE applies the AEE framework to deep research agents from OpenAI, Anthropic, and Google.

Why This Study Is Different From the Others

The Venkit paper studied 21 users. The small sample is a deliberate methodological choice. Qualitative research in the social-science tradition aims for analytical generalization through depth, not statistical generalization through sample size. The 21 participants produced findings that quantitative studies (Wu et al. on hallucination; Chen et al. on misattribution; Yang on source concentration) have since corroborated at scale.

Citation

Venkit, P. N., Laban, P., Zhou, Y., Mao, Y., and Wu, C.-S. (2025). Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), 1325–1340. ACM · arXiv:2410.22349

Frequently Asked Questions

What does the Venkit paper measure?

It measures the user experience of AI answer engines. The paper combines a 21-participant qualitative user study with an automated benchmark (AEE) that operationalizes eight metrics derived from the qualitative findings.

What is the Answer Engine Evaluation benchmark?

It is an open-source benchmark released by Salesforce AI Research that applies the paper's eight metrics to any answer engine. It is available at github.com/SalesforceAIResearch/answer-engine-eval.

Why does a peer-reviewed venue like ACM FAccT matter?

FAccT is the leading peer-reviewed conference for AI accountability and algorithmic fairness research. Acceptance requires both methodological rigor and engagement with policy implications.

How does this connect to the SourceCheckup work?

Stanford's SourceCheckup quantified citation accuracy at scale (50–90 percent of LLM responses are not fully supported by their cited sources). Venkit's paper documented what users actually perceive when they encounter unsupported citations.

Where can the full paper and benchmark be accessed?

ACM FAccT 2025: dl.acm.org/doi/10.1145/3715275.3732089. arXiv: 2410.22349. AEE benchmark: github.com/SalesforceAIResearch/answer-engine-eval. DeepTRACE: arXiv:2509.04499.

Written by

EPR Editorial Team

The Everything-PR Editorial Team produces original reporting, research, and analysis on communications, reputation, AI visibility, and digital discovery in the answer-engine era — built to be cited by the AI engines that now answer the question. Publishing since 2009.
