OpenAI launches bot that will crawl the internet to educate GPT
Website owners will have to explicitly opt out if they do not want their data harvested
OpenAI has built a new bot that will crawl over the internet, gathering information to educate artificial intelligence systems.
Operators of websites will have to actively opt out by blocking the bot if they want to stop it from taking data from their sites.
Artificial intelligence systems such as OpenAI's ChatGPT rely on vast amounts of data to train their models and learn how to give the correct outputs. So far, much of that data has been taken freely from the web.
That has prompted numerous complaints from authors and other web users. Many have criticised OpenAI and others for taking personal information and copyrighted content to train their models, with that writing potentially informing or even being replicated in the system's answers.
Artificial intelligence companies have also faced criticism from others who claim that such crawlers are stretching their web infrastructure. Elon Musk, for instance, has said that the load from such bots has forced Twitter to place limits on how many posts users could see on the site.
OpenAI's existing GPT-3.5 and GPT-4 models were trained on data taken from the internet up to late 2021. There is no way for the owners of that data, or of the websites it was gathered from, to remove it from OpenAI's models.
Now OpenAI says that the new system, named 'GPTBot', will crawl data and writing across the web to gather more information for training future models.
It told website administrators that they should include instructions for the bot if they do not want their information to be gathered. Administrators can add those instructions to a file called "robots.txt", which also gives instructions to other crawlers, such as those used by Google for its search results.
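In practice, the opt-out is two lines in the site's robots.txt file, addressed to the bot by its published user-agent name, GPTBot. A sketch of a full block (a narrower path after "Disallow:" would, by the standard robots.txt convention, block only that part of the site):

```
User-agent: GPTBot
Disallow: /
```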
OpenAI says data gathered by the bot "may potentially be used to improve future models". It also says the crawl is filtered to "remove sources" that require paywall access, are known to gather personally identifiable information, or have text that violates its policies.
It suggested that letting the bot access sites "can help AI models become more accurate and improve their general capabilities and safety".