Scientists warn of threat to internet from AI-trained AIs

Steps must be taken to distinguish AI-generated content from human-generated material, scientists say

Vishwam Sankaran
Tuesday 20 June 2023 06:33 BST

Future generations of artificial intelligence chatbots trained using data from other AIs could lead to a downward spiral of gibberish on the internet, a new study has found.

Large language models (LLMs) such as ChatGPT have taken off on the internet, with many users adopting the technology to produce a whole new ecosystem of AI-generated texts and images.

But using the output of such AI systems to train subsequent generations of AI models could result in “irreversible defects” and junk content, according to a new study that is yet to be peer-reviewed.

AI models like ChatGPT are trained on vast amounts of data scraped from internet platforms, data that until now has been mostly human-generated.

But data generated by such AI models now has a growing presence on the internet.

Researchers, including those from the University of Oxford in the UK, set out to understand what happens when several successive generations of AI models are trained on one another's output.

They found that widespread use of LLMs to publish content on the internet at scale “will pollute the collection of data to train them” and lead to “model collapse”.

“We discover that learning from data produced by other models causes model collapse – a degenerative process whereby, over time, models forget the true underlying data distribution,” scientists wrote in the study, posted as a preprint on arXiv.
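The core mechanism can be demonstrated with a toy simulation, written here in Python with NumPy. It is an illustrative sketch, not code from the study: a simple Gaussian model is repeatedly fitted to samples drawn from the previous generation's fitted model, and with each cycle the fitted spread tends to shrink, so the tails of the original distribution are gradually forgotten.

```python
# Toy sketch of "model collapse" (illustrative only, not from the study):
# each generation fits a Gaussian to the previous generation's output,
# then the next generation trains only on samples from that fit.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(31):
    # "Training": fit a Gaussian to the current data by maximum likelihood.
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation sees only this model's output. Finite-sample
    # fitting error biases the estimated spread downward on average, so
    # the tails of the true distribution are progressively lost.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

Because each generation both inherits and adds estimation error, the fitted distribution performs a drifting random walk that, on average, narrows over time: a toy analogue of a model forgetting the true underlying data distribution.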

The new findings suggest there is a “first mover advantage” when it comes to training LLMs.

Scientists liken the process to training an AI model on music created by human composers and played by human musicians: the AI's output is then used to train further models, and the quality of the music deteriorates with each generation.

With subsequent generations of AI models likely to encounter poorer-quality data at the source, they may start to misinterpret information and insert false information of their own, a process scientists call “data poisoning”.

They warned that the scale at which data poisoning can happen has changed drastically with the advent of LLMs.

Just a few iterations of this training cycle can lead to major degradation, even when some of the original data is preserved, scientists said.

And over time, mistakes could compound, forcing models that learn from generated data to misunderstand reality ever further.

“This in turn causes the model to misperceive the underlying learning task,” researchers said.

Scientists cautioned that steps must be taken to distinguish AI-generated content from human-generated content, along with efforts to preserve original human-made data for future AI training.

“To make sure that learning is sustained over a long time period, one needs to make sure that access to the original data source is preserved and that additional data not generated by LLMs remain available over time,” they wrote in the study.

“Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that was crawled from the Internet prior to the mass adoption of the technology, or direct access to data generated by humans at scale.”
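In the toy setting sketched earlier, this recommendation corresponds to keeping the original human sample in every generation's training mix rather than training purely on model output. The 9:1 synthetic-to-human split below is an assumption for illustration only, not a figure from the study; with the real data anchoring each fit, the drift is slowed.

```python
# Variant of the earlier sketch: every generation's training set keeps
# the preserved "human" data alongside model-generated samples.
# The mixing ratio is an illustrative assumption, not from the study.
import numpy as np

rng = np.random.default_rng(1)

# Original human-generated data, preserved for all future generations.
original = rng.normal(loc=0.0, scale=1.0, size=100)
data = original.copy()

for generation in range(31):
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # Mix the preserved human data with fresh model output (1:9 here).
    synthetic = rng.normal(loc=mu, scale=sigma, size=900)
    data = np.concatenate([original, synthetic])
```

Compared with the first sketch, the preserved sample acts as an anchor: the fitted statistics stay close to those of the original distribution instead of drifting away.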
