
Parts of the internet are disappearing around us – and they might be lost forever

We see the internet as storing vast swathes of information forever – but parts of it are under more threat than we realise, writes Andrew Griffin

Monday 27 May 2024 10:46 BST
Billions of sites have already disappeared from the web (Reuters)

The Internet Archive – the most capacious library ever made – is home to 835 billion webpages. A single backup of its library collection requires more than 145 petabytes of space. By comparison, the world’s largest physical library, the US Library of Congress, is home to about 175 million items, according to Guinness World Records.

And yet even as the world generates more data than ever, much of it is falling away. Some 38 per cent of webpages that existed in 2013 are no longer there, according to new research from the Pew Research Center, and even 8 per cent of those that existed in 2023 are now gone.

Increasingly, the web is made up of content that is being both produced and consumed by automated systems: a report last month from cybersecurity company Imperva said that almost exactly half of all internet traffic came from bots.

“As a lifelong lover of the web, it’s hard not to feel a little hopeless right now,” wrote software engineer and writer Molly White in a piece published earlier this month. She catalogued a host of problems: search engines that are now filled with “auto-generated nothingness”, social networks with algorithms that “encourage sameness, vapid engagement farming and rage bait while stifling creativity”, and much else besides.

This can lead to a kind of nostalgia for an earlier version of the internet, she noted. Some of that is the usual banal yearning for youth that afflicts everyone who ages. But some significant part of it is a recognition that the internet has changed fundamentally, and become less of a fun place to be.

The version of the internet that we want to return to may no longer be there, however. The historic web needs upkeep, which does not always happen.

Some 54 per cent of Wikipedia pages include a link in their references section to a page that no longer exists, that same Pew Research Center study found. Government and news pages likewise tend to point to pages that have disappeared.

Sometimes that happens because sites are moved wholesale, breaking the links to them. Other times, pages and servers may be disconnected either on purpose or because they are neglected and fall into disuse.

On social networks, the effect is even more dramatic. Almost one in five tweets disappear within months, for the most part because the accounts that posted them were removed. That might be important for content moderation or privacy, but it is also another way that the web disappears.

The web is not just a metaphor. Like one spun by a spider, it can be shaken as a whole when links disappear, in seemingly unconnected places. A tweet may disappear from an article; an article’s link may disappear from Wikipedia’s reference page. The internet has been built to depend on itself, and so a relatively small number of disappearances can ripple right across it.

That is why the Internet Archive is working so hard to preserve the web and stop it from rotting away beneath us. Its “Wayback Machine” is a valuable archive of parts of the web that are disappearing. But even at its huge scale, it is not complete: it cannot capture more complex kinds of website, and pages that nothing links to may be missed entirely.

What’s more, large parts of the internet now exist in private. Facebook works hard to ensure that its pages can only be viewed by people who are logged in; social networks might be today’s town square, but they come with entry requirements that can make them hard to see and archive.

And as the web fills up with bots that post and bots that interact with posts – what experts have referred to as “sludge” that sits alongside the valuable parts of the internet – it becomes yet harder to collect the parts that matter. Some sites such as Reddit and Twitter/X have stopped allowing third parties access to their data, ostensibly to stop it being hoovered up by automated systems that use it to train yet more AI systems and, in doing so, clog up bandwidth.

The disappearance of links might not be the only way that the internet falls away. A decade ago, Vint Cerf – often described as one of the “fathers of the internet” – warned that much of the early web could be lost forever because, even if we have the files themselves, we may no longer have the programs needed to view them. He called it “bit rot”, analogous to the “link rot” that means websites disappear.

The danger was such that we faced a “forgotten generation, or even a forgotten century”, he said in 2015. That did not apply just to digital files, he noted: the push to digitise important documents in an attempt to secure them might actually mean that we lose them, since unlike physical documents they could require software and hardware that no longer exist.

Molly White’s article, which documented in fine detail all of the problems with the modern web, ended with a half-plea and half-prediction that things could turn around. It might feel like we are trapped within the walls of an increasingly small and stifling internet, she noted, but we can climb over them and build a new internet. “We can have a different web, if we want it,” she concluded.
