OpenAI launches bot that will crawl the internet to educate GPT

Website owners will have to explicitly opt out if they do not want their data harvested

Andrew Griffin
Tuesday 08 August 2023 15:48 BST


OpenAI has built a new bot that will crawl the internet, gathering information to educate artificial intelligence systems.

Website operators will have to actively opt out and block the bot if they want to stop it from taking data from their sites.

Artificial intelligence systems such as OpenAI's ChatGPT rely on vast amounts of data to train their models and learn how to give the correct outputs. So far, much of that data has been taken freely from the web.

That has prompted numerous complaints from authors and other web users. Many have criticised OpenAI and others for taking personal information and copyrighted content to train their models, with that writing potentially informing or even being replicated in the system's answers.

Artificial intelligence companies have also faced criticism from others who claim that such crawlers are stretching their web infrastructure. Elon Musk, for instance, has said that the load from such bots has forced Twitter to place limits on how many posts users could see on the site.

OpenAI's existing GPT-3.5 and GPT-4 models were trained on data gathered from the internet up to late 2021. There is no way for the owners of that data, or of the websites it was taken from, to remove it from OpenAI's models.

Now OpenAI says that the new crawler, named "GPTBot", will gather data and writing from across the web to train future models.

It told website administrators that, if they do not want their content gathered, they should include instructions telling the bot not to crawl their site. Administrators can add such instructions to a file called "robots.txt", which also carries directions for other crawlers, such as those used by Google for its search results.
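According to OpenAI's published guidance, the crawler identifies itself with the user-agent token "GPTBot", so a site owner who wants to opt out entirely can add a rule like the following to the robots.txt file at the root of their site (the paths shown are standard robots.txt syntax, not specific to any one site):

```
# Block OpenAI's crawler from the whole site
User-agent: GPTBot
Disallow: /
```

The same file can instead allow or block only parts of a site, for example `Allow: /public/` alongside `Disallow: /private/`, since robots.txt rules are matched per path prefix.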

OpenAI says the bot "may potentially be used to improve future models". It also says the bot is built to "remove sources" that require paywall access, are known to gather personally identifiable information, or have text that violates its rules.

It suggested that letting the bot access sites "can help AI models become more accurate and improve their general capabilities and safety".
