Reddit to add new tools to try and repel AI bots from scraping user data
Company has deals in place with OpenAI and Google to share data to train AI systems
Reddit says it will add new protections to try to repel bots that attempt to scrape its posts to train AI systems.
Many companies have pitched their large language models, such as OpenAI’s ChatGPT and Google’s Gemini, as the future. But training such a system requires feeding it vast amounts of written text, which companies have often taken from publicly available websites.
In recent months, sites including Reddit and Twitter have complained that visits from those crawlers have both slowed down their sites and allowed companies to take data in contravention of their policies.
Last month, Reddit published a new “Public Content Policy” that aimed to control how its data is used, both by researchers and by companies looking to train automated systems. Now it has announced that it will add new technologies to try to enforce that.
It will update its “Robots Exclusion Protocol” file, known as robots.txt, a publicly accessible file that gives automated crawlers instructions about which parts of the site third parties are allowed to take.
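The article does not reproduce Reddit’s actual directives, but a robots.txt file of the kind described might look like the hypothetical sketch below, which disallows a named AI crawler while permitting an archival bot. Python’s standard-library `urllib.robotparser` shows how a well-behaved crawler is expected to check the file before fetching any page (the bot names and URL here are illustrative, not Reddit’s real rules):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt directives of the kind the article describes:
# a named AI crawler is shut out, an archival bot is allowed through,
# and everything else is disallowed by default.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: archive.org_bot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler calls can_fetch() before requesting a URL.
print(parser.can_fetch("ExampleAIBot", "https://example.com/r/some_post"))    # False
print(parser.can_fetch("archive.org_bot", "https://example.com/r/some_post")) # True
```

Note that robots.txt is purely advisory: it relies on crawlers choosing to obey it, which is why Reddit is pairing the update with active bot detection and blocking.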
It will also deploy technology that aims to spot unknown bots and crawlers, and either stop them from repeatedly hitting the site or block them entirely.
“This update shouldn’t impact the vast majority of folks who use and enjoy Reddit,” Reddit said.
The company also stated that the change would not affect “good faith actors”, including those who might scrape the site for research and other purposes. It pointed to the Internet Archive, for instance, and shared a quote from the director of its Wayback Machine, which crawls the web so that users can see a version of a page as it appeared at a given time.
“The Internet Archive is grateful that Reddit appreciates the importance of helping to ensure the digital records of our times are archived and preserved for future generations to enjoy and learn from,” said Mark Graham. “Working in collaboration with Reddit we will continue to record and make available archives of Reddit, along with the hundreds of millions of URLs from other sites we archive every day.”
Reddit also allows companies that it has deals with to scrape its posts to train AI systems. Both OpenAI and Google have agreements in place that see them pay Reddit for access to users’ data.
Those deals led the company’s share price to soar after they were announced. Users are not compensated for their posts, but the site will gain new AI features that may become available to users as a result.
The use of Reddit to train AI models has, however, sometimes caused problems for those technology companies. Last month, when Google’s “AI Overview” feature began recommending adding glue to pizza, the advice was traced back to a sarcastic Reddit post.