Facebook outage: Single wrong command took down ‘backbone’ of network, says company

The outage on WhatsApp, Instagram and Facebook occurred because of a wrong command issued during a ‘routine maintenance job’

Anuj Pant
Wednesday 06 October 2021 08:18 BST

Facebook’s largest outage in history was caused by a wrong command that resulted in what the social media giant said was “an error of our own making”.

“We’ve done extensive work hardening our systems to prevent unauthorised access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making,” the company said in a blog post published on Tuesday.

Santosh Janardhan, Facebook’s vice president of engineering and infrastructure, explained in the post why and how the six-hour shutdown occurred and the technical, physical and security challenges the company’s engineers faced in restoring services.

The primary reason for the outage was a wrong command during routine maintenance work, according to Mr Janardhan.

Facebook’s engineers were forced to physically access data centres that form the “global backbone network” and overcome several hurdles in fixing the error caused by the wrong command.

Once these errors were fixed, however, engineers faced another challenge: managing the “surge in traffic” that would come as services were restored.

Mr Janardhan, in the post, explained how the error was triggered “by the system that manages our global backbone network capacity.”

“The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fibre-optic cables crossing the globe and linking all our data centres,” the post said.

Two phases of the newly completed Facebook data centre sit at the base of mountains in the Rush Valley on 5 October 2021 in Eagle Mountain, Utah. Facebook was shut down yesterday for more than seven hours reportedly due in part to a major disruption in communication between the company's data centres (Getty Images)

All of Facebook’s user requests, from loading news feeds to accessing messages, travel over this network, which links the company’s smaller facilities to its larger data centres.

To effectively manage these centres, engineers perform day-to-day infrastructure maintenance, including taking part of the “backbone” offline, adding more capacity or updating software on routers that manage all the data traffic.

“This was the source of yesterday’s outage,” Mr Janardhan said.

“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally,” he added.

What complicated matters was that the erroneous command should have been caught by the company’s audit systems, but a bug in the audit tool prevented it from stopping the command, the post said.
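Facebook has not published the command or its audit tool, but the safeguard described can be sketched in general terms. The Python snippet below is a hypothetical illustration only: a pre-flight audit is supposed to reject any maintenance command that would leave the backbone with no working links, and a failure in a check like this is the kind of bug that lets a destructive command through. The names and data structures here are invented for the example.

```python
# Hypothetical illustration only - not Facebook's audit tool.
# A pre-flight audit should reject any maintenance command that would
# leave the backbone with zero active links.

from dataclasses import dataclass


@dataclass
class MaintenanceCommand:
    name: str
    links_to_disable: set[str]   # backbone links the command would take offline


ACTIVE_BACKBONE_LINKS = {"dc1-dc2", "dc2-dc3", "dc1-dc3"}  # invented example topology


def audit(command: MaintenanceCommand) -> bool:
    """Return True if the command is safe to run."""
    remaining = ACTIVE_BACKBONE_LINKS - command.links_to_disable
    if not remaining:
        print(f"BLOCKED: {command.name} would disconnect every backbone link")
        return False
    return True


# A capacity-assessment job that unintentionally targets every link.
cmd = MaintenanceCommand("assess-global-capacity", set(ACTIVE_BACKBONE_LINKS))

if audit(cmd):
    print("Running command...")
else:
    print("Command rejected by audit")
```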

The result was a “complete disconnection” between Facebook’s data centres and the internet, which “caused a second issue that made things worse.”

With the entirety of the “backbone” removed from operation, the company’s data centre locations designated themselves as “unhealthy”.

“The end result was that our DNS servers became unreachable even though they were still operational,” said the post.
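The article does not spell out the mechanism, but the self-designation it describes can be sketched: if a location’s probes to the rest of the backbone all fail, it stops advertising its DNS service. The Python sketch below is a simplified, hypothetical rendering of that idea; the peer hostnames, port and probe are placeholders, not details of Facebook’s systems.

```python
# Simplified, hypothetical sketch of a self-health check - illustrative only.
import socket

BACKBONE_PEERS = ["dc-east.internal.example", "dc-west.internal.example"]  # placeholder names


def backbone_reachable(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Crude reachability probe: can we open a TCP connection to a peer?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_check() -> None:
    if any(backbone_reachable(peer) for peer in BACKBONE_PEERS):
        print("healthy: keep advertising this location's DNS service")
    else:
        # With every backbone link down, the location marks itself unhealthy
        # and withdraws its advertisement - so even DNS servers that are
        # still running become unreachable from the wider internet.
        print("unhealthy: withdraw advertisement for this location")


health_check()
```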

The Domain Name System (DNS) is the system through which web addresses typed by users are translated into the Internet Protocol (IP) addresses that machines can read.

“This made it impossible for the rest of the internet to find our servers.”
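That translation step can be seen in a few lines of standard-library Python: a lookup either returns IP addresses or fails outright. During the outage, queries for Facebook’s domains behaved like the failure branch below, because the authoritative DNS servers had effectively vanished from the internet (the domain queried here is just an example).

```python
import socket


def resolve(hostname: str) -> None:
    """Translate a hostname into IP addresses, as DNS does for every web request."""
    try:
        results = socket.getaddrinfo(hostname, None)
        addresses = sorted({r[4][0] for r in results})
        print(f"{hostname} -> {addresses}")
    except socket.gaierror as err:
        # Roughly what the rest of the internet saw for Facebook's domains
        # during the outage: the name simply could not be resolved.
        print(f"{hostname}: lookup failed ({err})")


resolve("example.com")
```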

Mr Janardhan said this gave rise to two challenges. The first was that Facebook’s engineers could not access the data centres through normal means because of the network disruption.

The second was that many of the internal tools the company would normally use to investigate and resolve such issues were themselves broken by the outage.

The engineers were forced to go onsite to these data centres, where they would have to “debug the issue and restart the systems”.

This, however, was no easy task, because Facebook’s data centres are protected by extensive physical security measures and are designed to be “hard to get into”.

Mr Janardhan pointed out that the company’s routers and hardware are designed to be difficult to modify even with physical access.

“So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” he said.

Engineers then faced a final hurdle - they could not simply restore access to all users worldwide, because the surge in traffic could result in more crashes. Reversing the vast dips in power usage by the data centres could also put “everything from electrical systems to caches at risk”.

“Storm drills” previously conducted by the company meant they knew how to bring systems back online slowly and safely, the post said.
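The post does not describe how the ramp-up was carried out, but the principle of a staged recovery is straightforward: readmit traffic in increasing fractions and watch the systems between steps. The sketch below is a hypothetical illustration in Python, not Facebook’s tooling; the step sizes and pause are arbitrary.

```python
# Hypothetical staged-recovery sketch - not Facebook's tooling.
import time

RAMP_STEPS = [0.05, 0.10, 0.25, 0.50, 0.75, 1.00]  # fraction of traffic admitted
PAUSE_SECONDS = 1  # kept short here; in practice each step would be monitored far longer


def set_admitted_fraction(fraction: float) -> None:
    """Stand-in for whatever mechanism actually sheds or admits load."""
    print(f"admitting {fraction:.0%} of normal traffic")


for fraction in RAMP_STEPS:
    set_admitted_fraction(fraction)
    # Watch electrical load, cache hit rates and error rates before continuing.
    time.sleep(PAUSE_SECONDS)

print("full service restored")
```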

“I believe a tradeoff like this is worth it - greatly increased day-to-day security vs a slower recovery from a hopefully rare event like this,” Mr Janardhan concluded.

Facebook’s outage - which impacted all its services including WhatsApp and Instagram - led to a personal loss of around $7bn for chief executive Mark Zuckerberg as the company’s stock value dropped. Mr Zuckerberg has apologised to users for any inconvenience the break in service caused.
