The intense battle to stop AI bots from taking over the internet

Artificial intelligence systems need to be trained on text – which has led their creators to gather up words from right across the web

Andrew Griffin
Friday 05 July 2024 17:08


A number of companies have taken major steps to stop scrapers from harvesting their text.

It is the latest front in an ongoing and apparently escalating battle between the websites that publish text and the AI companies that want to use it to build their new tools.

The rise of artificial intelligence has brought with it a number of companies looking to train new and smarter AI technologies. But the large language model systems that underpin many of them – such as ChatGPT – require vast amounts of text to be trained.

That has led some companies to scrape text from across the web to feed into those systems for training. That in turn has frustrated the owners of text-based websites, who argue not only that the companies do not have permission to use their data, but also that the scraping is slowing down the performance of the internet.

Elon Musk, for instance, has repeatedly suggested that X, formerly Twitter, gets a huge amount of traffic from such scraping systems. X is one of many sites that have introduced strict “rate limiting” rules, which try to stop bots from requesting pages too often – though some have suggested the limits have also been used to disguise problems with X’s seemingly troubled website.
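
Rate limiting of this kind typically works by capping how many requests a single client can make within a set time window and rejecting the excess. None of the sites involved have published their implementations, but a minimal fixed-window limiter – sketched here in Python, with illustrative numbers – looks something like this:

    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60   # length of each counting window (illustrative value)
    MAX_REQUESTS = 100    # requests allowed per client per window (illustrative value)

    # Maps a client identifier (e.g. an IP address) to (window start, request count).
    counters = defaultdict(lambda: (0.0, 0))

    def allow_request(client_id: str) -> bool:
        """Return True if the client is still under its per-window request budget."""
        now = time.time()
        window_start, count = counters[client_id]
        if now - window_start >= WINDOW_SECONDS:
            # A new window has begun: reset the client's counter.
            counters[client_id] = (now, 1)
            return True
        if count < MAX_REQUESTS:
            counters[client_id] = (window_start, count + 1)
            return True
        return False  # Over budget: the server would answer with HTTP 429.

Production systems tend to use gentler variants, such as sliding windows or token buckets, but the principle – count each client’s requests, refuse the excess – is the same.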

Last week, Reddit introduced a host of changes intended to block bots from scraping its website. It said that it too would use rate limiting, as well as blocking unknown bots and updating its robots.txt file – the standard way for a site to instruct automated systems to stay away.
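
A robots.txt file only works if crawlers choose to honour it. A well-behaved bot checks the file before fetching anything; in Python, for example, the standard library can perform that check (the “ExampleBot” name below is a placeholder, not a real crawler):

    from urllib import robotparser

    # Fetch and parse Reddit's robots.txt, which is publicly served at this address.
    parser = robotparser.RobotFileParser()
    parser.set_url("https://www.reddit.com/robots.txt")
    parser.read()

    # "ExampleBot" is a placeholder user-agent used here for illustration.
    if parser.can_fetch("ExampleBot", "https://www.reddit.com/r/news/"):
        print("robots.txt permits fetching this page")
    else:
        print("robots.txt disallows fetching this page")

Nothing enforces that check, however; scrapers that skip it are exactly what rate limiting and bot blocking are meant to catch.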

It noted that those rules could potentially limit other automated systems that are important for transparency, such as the Internet Archive, which saves web pages for later access. But it insisted that important tools for researchers would still have access to Reddit.

“Anyone accessing Reddit content must abide by our policies, including those in place to protect redditors. We are selective about who we work with and trust with large-scale access to Reddit content,” it said when it introduced those new rules.

Some companies have instead struck deals giving AI firms access to their own or their users’ data. Both OpenAI and Google have signed agreements with Reddit allowing them to use its users’ posts to train their artificial intelligence systems, for instance.

Others have launched legal proceedings. The New York Times has sued OpenAI and Microsoft over their artificial intelligence systems, arguing that the companies have infringed the paper’s copyright by using its articles to train them.

Now internet infrastructure company Cloudflare has introduced a range of similar tools, telling customers they are a way of declaring their “AIndependence”. All Cloudflare customers will get an “easy button” to “block all AI bots”, it said.

Last year, Cloudflare introduced an option to block AI bots that “behave well”. Even though that option was aimed at bots that do follow the rules, Cloudflare’s customers “overwhelmingly” chose to block them, it said.

The new feature goes further, blocking all known AI bots outright. It looks for the fingerprints of scrapers – telltale patterns in how they make requests – and stops them from ever visiting customers’ websites, it said.
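
Cloudflare has not published the details of its fingerprinting. At its most basic, though, bot blocking starts with matching request characteristics – such as the user-agent header – against a list of known crawlers. The Python sketch below uses real, publicly documented crawler names, but the blocklist and logic are illustrative rather than Cloudflare’s actual rules:

    # A heavily simplified sketch of user-agent-based bot blocking.
    # The names are published AI crawler user-agents; the list itself is illustrative.
    KNOWN_AI_BOTS = {"GPTBot", "CCBot", "ClaudeBot", "Bytespider"}

    def is_known_ai_bot(user_agent: str) -> bool:
        """Return True if the user-agent string names a known AI crawler."""
        return any(bot.lower() in user_agent.lower() for bot in KNOWN_AI_BOTS)

    def handle_request(headers: dict) -> int:
        """Return an HTTP status code: 403 for known AI bots, 200 otherwise."""
        if is_known_ai_bot(headers.get("User-Agent", "")):
            return 403  # Refused before any content is served.
        return 200

User-agent strings are trivial to spoof, which is why serious bot detection also weighs behavioural signals – request timing, network characteristics and the like – rather than relying on headers alone.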
