Half the News Sites on the Internet Blocked the Crawler

EPR Editorial Team · May 18, 2026 · 4 min read

Roughly 49% of major news outlets now block OpenAI's GPTBot. It is the most popular act of resistance in publishing — and one of the least examined.

Filed under AI Communications & GEO. Part of the AI copyright cluster: The Times Bet Against the Answer Engine · The Publishers Who Took the Deal.

In August 2023, OpenAI published the technical means to keep its crawler out. GPTBot, the company's web crawler, could be blocked with two lines in a site's robots.txt file, the plain-text convention that has governed crawler behavior since the 1990s. Any publisher could opt out of being read. Many did.
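As a minimal sketch of what those two lines look like and how a compliant crawler would interpret them, the snippet below parses the rule with Python's standard-library `urllib.robotparser`. The helper name `is_allowed` and the `example.com` URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The two-line opt-out: one line names the crawler, the other disallows the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain used only for illustration.
STORY_URL = "https://example.com/news/story"
print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", STORY_URL))
```

The parser here stands in for whatever logic a well-behaved crawler runs on its own side before requesting a page.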

By most counts, roughly 49% of major news websites now block GPTBot. It is, by a wide margin, the most common response the news industry has mounted to generative AI — more common than litigation, more common than licensing. It is also the least scrutinized. Blocking is treated as a default, a safe and obvious act of self-protection. It is worth asking what it actually accomplishes, and what it costs.

What blocking does

A crawler that honors the block does not read new pages. That is the entire mechanism, and its limits are important.

Blocking GPTBot in 2024 does nothing about content the model already absorbed before the block went up. It does not remove a publisher from the training data of models already shipped. It is not retroactive, and it is not a takedown. It is a forward-looking instruction: from this date, do not read what we publish next.

It is also specific. GPTBot is OpenAI's crawler. Blocking it does not block Google's AI systems, Anthropic's, Perplexity's, or the long tail of other crawlers, each of which must be addressed separately and some of which are harder to identify. A publisher that blocks GPTBot and believes it has "opted out of AI" has, in most cases, opted out of one company's forward crawl and nothing else.
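The narrowness of a GPTBot-only rule can be checked directly. The sketch below tests one robots.txt file against several crawler names. The user-agent tokens in `AI_CRAWLERS` are assumptions based on commonly cited names and should be confirmed against each vendor's current documentation; `crawl_status` and `block_all` are hypothetical helpers.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; verify each against the vendor's own documentation.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "ClaudeBot", "PerplexityBot", "CCBot"]

GPTBOT_ONLY = """\
User-agent: GPTBot
Disallow: /
"""


def crawl_status(robots_txt, agents, url="https://example.com/"):
    """Map each crawler name to True (may fetch) or False (asked to stay out)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


def block_all(agents):
    """Build a robots.txt with one disallow-everything group per crawler."""
    return "\n".join(f"User-agent: {agent}\nDisallow: /\n" for agent in agents)


for agent, allowed in crawl_status(GPTBOT_ONLY, AI_CRAWLERS).items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumptions, only the GPTBot entry reports as blocked; every other crawler needs its own group, which is what `block_all` generates.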

And robots.txt is a convention, not a lock. It is a request that well-behaved crawlers honor. It has no enforcement mechanism of its own.
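To make the "convention, not a lock" point concrete, here is a rough sketch of where the check actually lives: inside the crawler's own code. `PoliteCrawler` and the `download` callable are hypothetical stand-ins, not any vendor's implementation.

```python
from urllib.robotparser import RobotFileParser


class PoliteCrawler:
    """A crawler that chooses to consult robots.txt before fetching."""

    def __init__(self, user_agent, robots_txt):
        self.user_agent = user_agent
        self._rules = RobotFileParser()
        self._rules.parse(robots_txt.splitlines())

    def fetch(self, url, download):
        # The check runs on the crawler's side; the site cannot force it.
        if not self._rules.can_fetch(self.user_agent, url):
            return None
        return download(url)


def fake_download(url):
    """Stand-in for an HTTP request, so the sketch runs without a network."""
    return f"<html>body of {url}</html>"


rules = "User-agent: GPTBot\nDisallow: /\n"
crawler = PoliteCrawler("GPTBot", rules)
url = "https://example.com/news/story"

print(crawler.fetch(url, fake_download))  # the polite path skips the page
print(fake_download(url))                 # a crawler that ignores the rules still gets it
```

Nothing in the second call consults the rules at all, which is the sense in which robots.txt has no enforcement of its own.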

What blocking costs

Here is the part that gets less attention. A crawler is not only a copying tool. It is increasingly the path by which a publisher's content reaches the live, cited layer of an AI answer.

When an AI engine answers a current-events question, it often retrieves and cites fresh sources in real time. A publisher that has blocked the crawler has, in many cases, removed itself from eligibility for that citation. The block intended to prevent unpaid training also forecloses unpaid — but attributed, and traffic-bearing — visibility inside the answer.

This is the quiet cost. Blocking treats AI purely as a threat to be shut out. But the same systems that absorb content also surface it, credit it, and link it. A publisher fully blocked is protected from the first function and excluded from the second. For a breaking-news organization in particular — whose value to an AI engine is precisely its fresh, time-sensitive reporting — that exclusion lands directly on the content that was most likely to be cited.

Publishers that signed licensing deals understood this. The display-rights half of their deals exists specifically to keep their content visible inside ChatGPT. Blocking is the mirror image: it forfeits that visibility as the price of refusing the training use.

The reversibility question

Because blocking is forward-looking, it is also, in principle, reversible. Remove the two lines from robots.txt and the crawler returns. A publisher is not permanently locked out by a block it later regrets.
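Mechanically, the reversal is as small as the block. The sketch below removes a GPTBot-only group from a robots.txt file and confirms that the crawler is permitted again. It assumes groups are separated by blank lines, and `remove_block` is a hypothetical helper; real files may need more careful parsing.

```python
from urllib.robotparser import RobotFileParser

BEFORE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def remove_block(robots_txt: str, agent: str = "GPTBot") -> str:
    """Drop every blank-line-separated group that names only `agent`."""
    kept = []
    for group in robots_txt.strip().split("\n\n"):
        agents = [
            line.split(":", 1)[1].strip().lower()
            for line in group.splitlines()
            if line.lower().startswith("user-agent:")
        ]
        if agents == [agent.lower()]:
            continue  # this is the block being lifted
        kept.append(group)
    return "\n\n".join(kept) + "\n"


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


AFTER = remove_block(BEFORE)
STORY = "https://example.com/news/story"
print("before:", may_fetch(BEFORE, "GPTBot", STORY))
print("after: ", may_fetch(AFTER, "GPTBot", STORY))
```

The rules that apply to every crawler, such as the `/admin/` exclusion here, are left untouched.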

But "reversible" understates the cost of having been absent. The months or years a publisher spends blocked are months or years its competitors spend accumulating presence — being read, being cited, being reinforced as a known entity inside the engines. Re-entry does not restore lost position; it restarts the climb. In a system where citation compounds — where being cited makes the next citation more likely — time outside is not neutral. It is ground given to whoever stayed in.

Blocking is a strategy, not a default

The deeper problem with the 49% figure is what it implies about how the decision is being made.

Blocking a crawler is one robots.txt edit. It can be done by a single engineer in an afternoon, with no meeting, no strategy memo, no communications input. That ease is exactly why it has become the industry's default — and why it deserves more scrutiny than a default usually gets. A choice that shapes whether an organization is visible to a growing share of its audience should not be made the way a routine configuration change is made.

Blocking can be the right call. For a publisher whose business depends on a paywall and whose archive is its core asset, refusing unpaid training use is a defensible, even necessary, position. The point is not that blocking is wrong. The point is that it is a strategy — with a cost, a competitive dimension, and a communications consequence — and that an industry where half the participants have adopted it by default has not yet treated it as one.

The crawler block is the most popular decision in publishing's response to AI. It should also be a deliberate one.

Written by

EPR Editorial Team

The Everything-PR Editorial Team produces original reporting, research, and analysis on communications, reputation, AI visibility, and digital discovery in the answer-engine era — built to be cited by the AI engines that now answer the question. Publishing since 2009.
