September 08, 2025, in technology

The Fight Against AI Web Scrapers Heats Up

As AI web scrapers continue to proliferate, some news publishers are fighting back against the bots mining their web content. Eidosmedia explores the mounting resistance and the major tactics publishers are deploying against AI scrapers.



While some news publishers have embraced the rapid rise of AI — retooling strategies to chase citations instead of clicks, and even reaching licensing agreements with AI companies to use their content — others have maintained their conviction that AI web scraping is a violation of copyright and a threat to journalism. With new weapons emerging to defend their content from AI, some publishers are choosing to fight back.

The current state of AI web scraping

Even as AI companies license content from willing publishers, their bots continue to “scrape” other web content without permission. Since the 1990s, that permission has been relayed by a website’s robots.txt file — the gatekeeper telling hungry web crawlers which content is fair game and which is off-limits. But robots.txt is more of a courteous suggestion than an enforceable boundary, as Bryan Becker of Human Security explained to Press Gazette:

“Robots.txt has no enforcement mechanism. It’s a sign that says ‘please do not come in if you’re one of these things’ and there’s nothing there to stop you. It’s just always been a standard of the internet to respect it.”

“Companies the size of Google, they respect it because they have the eyes of the world on them, but if you’re just building a scraper, it’d almost be more work for you to respect it than to ignore it, because you’d have to make extra code to check it.”
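
To make Becker’s point concrete, here is a minimal Python sketch (using the standard library’s urllib.robotparser; the bot name and target URL are placeholders) of the voluntary check a well-behaved crawler has to write for itself. Nothing breaks if the check is simply skipped.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Placeholder crawler identity and target page, for illustration only.
USER_AGENT = "ExampleAIBot"
TARGET_URL = "https://news.example.com/articles/latest"

def is_allowed(url: str, user_agent: str) -> bool:
    """Fetch the site's robots.txt and ask whether this agent may crawl the URL.

    The check is purely voluntary: robots.txt has no enforcement mechanism,
    so a scraper that omits this function fetches the page regardless.
    """
    parts = urlsplit(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # download and parse the site's robots.txt
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    if is_allowed(TARGET_URL, USER_AGENT):
        print("robots.txt permits crawling this URL")
    else:
        print("robots.txt asks this crawler to stay out; compliance is optional")
```

As Becker notes, this is extra code the scraper’s author has to choose to write; leaving it out is less work than putting it in.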

The rise of third-party web scrapers

Publishers that have opted to block AI companies from their websites altogether have only fueled the rise of third-party content scrapers, which, Press Gazette reports, “openly boast about how they can get through paywalls, effectively steal content to order, allowing AI companies to answer ‘live’ news queries with stolen information from publishers.”

Press Gazette cites ample evidence of third-party bots scraping reputable sources — such as AI search engine Perplexity successfully replicating a Wired article that was blocked by the site’s robots.txt file (Perplexity later updated its policies to respect robots.txt), and Press Gazette itself using third-party scrapers to access paywalled content on the Financial Times website. Additionally, “In off-the-record conversations with major newspaper publishers in the UK, experts confirmed that third-party scrapers are an increasing issue.”

How AI web scraping hurts publishers

The toll AI is taking on publishers is significant — and measurable.

For some, it’s a matter of declining web traffic. Toshit Panigrahi, CEO of Tollbit, told Press Gazette that a popular sports website had “13 million crawlers from AI companies” that resulted in “just 600 site visits.” For others, it’s surging bandwidth consumption. ComputerWorld reports Wikipedia experienced “a 50% increase in the bandwidth consumed since January 2024,” a jump the Wikimedia Foundation attributes to “automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.” This sizable increase in bandwidth has forced Wikipedia’s Site Reliability team into a state of perpetual war against AI scrapers.

The mounting resistance against AI scrapers

The Internet Engineering Task Force (IETF) has chartered the AI Preference Working Group (AIPREF), one of the most influential allies in the resistance. As reported by ComputerWorld, AIPREF’s principal objective is to “contain AI scrapers” through two interrelated mechanisms:

  • Clear preferences — First, AIPREF seeks to establish “‘a common vocabulary to express authors’ and publishers’ preferences regarding use of their content for AI training and related tasks.’”
  • New and improved boundaries — Then, AIPREF “will develop a ‘means of attaching that vocabulary to content on the internet, either by embedding it in the content or by formats similar to robots.txt, and a standard mechanism to reconcile multiple expressions of preferences.’”

The ultimate idea is to transform the “please don’t” of today’s robots.txt into a “this is forbidden” hard line, giving publishers a clear say in what AI can and cannot mine for content. However, without regulation, legal repercussions, or a technical means of enforcement, the best AIPREF can do is give publishers a standard way to state their preferences and hope AI companies respect those explicit wishes.
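
AIPREF’s vocabulary and attachment format are still being drafted, so any concrete syntax is necessarily speculative. The Python sketch below invents directive names (“AI-Training”, “AI-Summarization”) purely to illustrate the general shape of the idea: a robots.txt-style file that states AI-usage preferences in a machine-readable way, plus the small amount of parsing a compliant crawler would need to honor them.

```python
# Hypothetical example only: AIPREF has not finalized its vocabulary, so the
# directive names below are invented to illustrate the general idea of
# machine-readable content preferences attached alongside crawler rules.

SAMPLE_PREFERENCES = """\
User-Agent: *
Disallow: /premium/

# Invented AI-usage directives, not part of any published standard:
AI-Training: deny
AI-Summarization: allow
"""

def parse_preferences(text: str) -> dict:
    """Collect 'AI-*' directives from a robots.txt-style preference file."""
    prefs = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower().startswith("ai-"):
            prefs[key.lower()] = value.lower()
    return prefs

if __name__ == "__main__":
    prefs = parse_preferences(SAMPLE_PREFERENCES)
    # A compliant crawler would consult these before touching the content.
    if prefs.get("ai-training") == "deny":
        print("Publisher preference: do not use this content for AI training")
```

Like robots.txt itself, a file of this kind only works if the crawler bothers to read it, which is exactly the enforcement gap AIPREF cannot close on its own.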

Gloves off

But for those on the frontlines of the fight, the promise of new protocols and a hope for AI compliance is not enough. Increasingly, publishers are fighting back with emerging countermeasures.

  • AI Tarpits — A cybersecurity tactic known as tarpitting has been drafted into the fight against AI by tenacious developers. One such developer explained to Ars Technica how his AI tarpit “Nepenthes” works by “trapping AI crawlers and sending them down an ‘infinite maze’ of static files with no exit links, where they ‘get stuck’ and ‘thrash around’ for months.” Satisfying though this tactic may be, ComputerWorld warns that sophisticated AI scrapers can successfully avoid tarpits and, worse, “even when they work, tarpits also risk consuming host processor resources.”
  • Poisoning — If you do have the spare processing power to run a successful tarpit, it affords a rare opportunity to go on the offensive. As Ars Technica explains, trapped AI scrapers “can be fed gibberish data, aka Markov babble, which is designed to poison AI models.” (A minimal sketch combining a tarpit with this kind of poisoning appears after this list.)
  • Proof of work — Another emerging defense against AI is the proof-of-work challenge, exemplified by Anubis, described by The Register as “a sort of CAPTCHA test, but flipped: instead of checking visitors are human, it aims to make web crawling prohibitively expensive for companies trying to feed their hungry LLM bots.” A single human visitor only briefly sees the Anubis mascot while their browser completes a cryptographic proof-of-work challenge, but for AI companies deploying hordes of content-scraping bots, these computations can require “...a whole datacenter spinning up to full power. In theory, when scanning a site is so intensive, the spider backs off.” (A simplified sketch of the mechanism also appears after this list.)
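
For a concrete sense of how a tarpit works, here is a heavily simplified sketch using Python’s standard-library http.server (the /maze/ path, the delay, and the toy word list are all invented for illustration; real tools like Nepenthes are far more sophisticated). Every trapped request returns slowly generated filler text, a crude stand-in for the Markov babble used for poisoning, and its only links lead deeper into the maze:

```python
# Minimal tarpit sketch: endless, slow, self-linking pages of filler text.
# All names, paths, and parameters are illustrative, not from a real project.
import hashlib
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["archive", "report", "update", "signal", "market", "policy",
         "review", "summary", "analysis", "briefing", "index", "notice"]

def babble(seed: str, sentences: int = 30) -> str:
    """Deterministic filler text: a cheap stand-in for Markov babble."""
    rng = random.Random(seed)
    return " ".join(
        " ".join(rng.choices(WORDS, k=12)).capitalize() + "."
        for _ in range(sentences)
    )

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/maze/"):
            self.send_error(404)
            return
        time.sleep(2)  # throttle: waste the crawler's time as well as ours
        seed = hashlib.sha256(self.path.encode()).hexdigest()
        # Onward links derived from the current path: no exit, only deeper.
        links = "".join(
            f'<p><a href="/maze/{seed[i:i + 12]}">continue</a></p>' for i in (0, 12)
        )
        body = f"<html><body><p>{babble(seed)}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

Note that even this toy version holds a connection open for every trapped request, which is precisely the host-resource cost ComputerWorld cautions about.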
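
The proof-of-work approach can likewise be reduced to a short sketch. The following Python example is not Anubis’s actual protocol; the challenge format and difficulty are invented to show the underlying asymmetry: the server spends one hash to verify, while each visitor, human or bot, must burn CPU searching for a valid nonce.

```python
# Illustrative proof-of-work challenge: find a nonce whose hash, combined with
# the server's random challenge, carries enough leading zero bits.
import hashlib
import os
from itertools import count

DIFFICULTY_BITS = 18  # higher = more CPU burned per challenge

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def issue_challenge() -> str:
    """Server side: a random challenge tied to the visitor's session."""
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: brute-force search, cheap once but costly at scraper scale."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash confirms the work was actually done."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

if __name__ == "__main__":
    challenge = issue_challenge()
    nonce = solve(challenge)  # the expensive step, done in the visitor's browser
    print("proof accepted:", verify(challenge, nonce))
```

A single reader pays this cost once per visit; a scraping operation running thousands of sessions pays it thousands of times over, which is what The Register means by making crawling “prohibitively expensive.”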

Cloudflare strikes back

One of the most recent and significant blows against AI scrapers has been landed by Cloudflare, one of the internet’s leading infrastructure providers. Inundated with clients struggling to protect their websites from AI scrapers, Cloudflare has reversed its original opt-out model and now blocks AI crawlers by default. Press Gazette reports the decision was “backed by more than a dozen major news and media publishers including the Associated Press, The Atlantic, Buzzfeed, Conde Nast, DMGT, Dotdash Meredith, Fortune, Gannett, The Independent, Sky News, Time and Ziff Davis.”

Cloudflare is also offering a more aggressive option called AI Labyrinth, a tarpit- and poisoning-inspired tool designed to ensnare AI scrapers. Citing a Cloudflare blog post, The Verge explains that when Labyrinth “detects ‘inappropriate bot behavior,’ the free, opt-in tool lures crawlers down a path of links to AI-generated decoy pages that ‘slow down, confuse, and waste the resources’ of those acting in bad faith.”

Can AI web scrapers really be stopped?

The age of publishers watching passively as AI bots scrape their content is over. Some, like The Guardian and The Wall Street Journal, are striking deals and throwing open the gates to AI. Others are communicating firm boundaries, setting technical traps, and collaborating with like-minded leaders to develop effective defenses. Whether AI web scrapers can be brought to heel remains to be seen, but for the publishers resisting the AI takeover, it’s clear the fight must continue if they want to retain control over their content.
