Facebook Downtime: A Tragic Song of Remote Work

by admin

We have all lived through “Weibo is down”, “Zhihu is down” and “Little Red Book is down”. But can you imagine “WeChat is down”?

On the afternoon of January 18, 2021, many netizens reported that they could not receive messages in WeChat group chats or private chats. “WeChat bug” briefly climbed the hot-search list. Tencent’s WeChat team responded:

“Due to system jitter, some WeChat users encountered a delay in receiving messages at around 14:00 today, and the repair has now been completed.”

The number of people affected by that “system jitter” was indeed relatively small. Historically, only individual WeChat modules such as Moments and Red Packets have occasionally “crashed”. Relatively speaking, the service has been “as stable as Mount Tai”.

On October 4, however, netizens everywhere except mainland China got a real taste of what their own “WeChat is down” feels like. Facebook, with 3.5 billion active users, saw its entire line of services become inaccessible worldwide for 6 hours.

The affected services included Facebook and its sister services Instagram, WhatsApp, Messenger and Oculus, as well as its enterprise products and even Facebook’s corporate intranet. Of these, WhatsApp and Facebook Messenger are the company’s two WeChat-like instant messaging products, with 2 billion and 1.3 billion users worldwide respectively (with some overlap). Both figures are higher than WeChat’s 1.24 billion users (including overseas WeChat) and QQ’s 606 million.

This unprecedented failure lasted as long as it did because Facebook had moved heavily to remote work after the epidemic, so no maintenance staff were on site and the repair kept being delayed. As a result, the remote work of countless small and medium-sized enterprises, and even government departments, around the world was severely affected, setting off waves of secondary disasters.

The once-in-a-century COVID-19 epidemic forced people to stay home and rely on the Internet for most of their work and personal communication. Temporary remote-work measures gradually became normal and permanent, and made people look forward to a new way of life. But a single service interruption is enough to send all of this back to square one. The six-hour Facebook outage is an excellent occasion to rethink it all.

What happened?

According to the information currently available, this large-scale Facebook failure began with routine maintenance.

Facebook’s vice president of infrastructure, Santosh Janardhan, said that a command issued during the maintenance inadvertently took down the backbone connections to all of Facebook’s data centers worldwide.

There are mainly two different conspiracy theories surrounding this matter.

  • First, the outage came on the eve of a U.S. congressional hearing at which a “whistleblower” was to accuse Facebook and Instagram of ignoring child safety; 6 hours might be enough to “destroy the evidence”;

  • Second, 1.5 billion Facebook user profiles were recently reported to have leaked. Some say the black-market price is $5,000 per 1 million profiles. Six hours might also have been used to remedy or cover something up.

At present, the possibility that Facebook staged the outage itself because of the “whistleblower” looks close to zero. The company has repeatedly stated that the outage was not caused by a hacker attack, and there is no evidence that user data was leaked as a result.

As Occam’s razor has it, “entities should not be multiplied beyond necessity.” That the incident was caused by a simple operational mistake is perhaps the simpler and more reliable explanation.

Besides Facebook’s own statement, Cloudflare, which provides third-party public DNS resolution and CDN services, also published an analysis on its official blog. Observed from the outside, the problem lay in Facebook’s BGP (Border Gateway Protocol) routing.

In layman’s terms, DNS is the “map” of the Internet, used to tell you “where x is”; while BGP is the “navigation” part of this “map”, telling you “how to get to x the fastest”.

To understand this concept accurately, one must first understand:

What we call the “Internet” literally means “inter-network”, that is, a “network of networks”. It is the result of countless small networks, like “islands”, being connected to one another. These small networks might be “China Telecom”, “Tsinghua University” or “X Company Beijing Office”.

Compared with all the networked computers in the world, the national network of an entire country, such as China or Russia, can itself be regarded as one very large “small network”, connected to the “islands” outside it through “bridges” such as submarine cables. Because all of them follow the same protocol, they interconnect in exactly the same way.

BGP tells traffic which “islands” and “bridges” it must pass through to reach its destination. Generally speaking, BGP chooses the shortest of the available routes. Of course, “shortest” does not always mean “best”, because some “bridges”, such as 5G data connections, cost money to cross.
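The “map” and “navigation” roles can be sketched in a few lines of Python. This is only a toy model: the domain name, the addresses and the network labels are all invented for illustration, and real BGP weighs policy attributes such as local preference before it compares path length. What the sketch shows is that a name can still have a known address while no route to that address remains.

```python
# Toy model only: every name, address, and AS label here is hypothetical.

# DNS, the "map": where is x?
DNS_TABLE = {"example.test": "203.0.113.10"}

# BGP, the "navigation": known paths (lists of "islands") to each address block.
ROUTES = {
    "203.0.113.0/24": [
        ["AS-A", "AS-B", "AS-TARGET"],
        ["AS-A", "AS-TARGET"],
        ["AS-A", "AS-C", "AS-D", "AS-TARGET"],
    ]
}


def resolve(name):
    """Return the address recorded for a name, or None if unknown."""
    return DNS_TABLE.get(name)


def best_path(prefix):
    """Pick the path crossing the fewest islands, or None if no path is announced."""
    candidates = ROUTES.get(prefix, [])
    return min(candidates, key=len) if candidates else None


def withdraw(prefix):
    """Simulate the owner of a prefix withdrawing all of its route announcements."""
    ROUTES.pop(prefix, None)


address = resolve("example.test")
print(address, best_path("203.0.113.0/24"))  # address known, path available
withdraw("203.0.113.0/24")
print(address, best_path("203.0.113.0/24"))  # address known, but no way to get there
```

Once the routes are withdrawn, the “map” entry is useless: the destination still exists, but nobody outside knows how to reach it.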

When Facebook’s DNS servers noticed the problem, they automatically stopped announcing their BGP routes and waited for the connection to return to normal. Because devices all over the world then kept sending requests that could not succeed, the load on upstream DNS servers rose sharply and the impact spread further.

Something similar once happened in China. On May 19, 2009, a private feud between two hackers who stole game assets led to an attack that paralyzed the third-party domain name resolution service DNSPOD. China Telecom cut off its network service, leaving it unable to resolve domain names, and many websites that relied on DNSPOD became inaccessible.

At that time, Baofeng Yingyin, a media player installed on about 120 million machines nationwide, regularly contacted its server to check for updates. With DNSPOD down, it kept issuing domain name resolution requests, which eventually overwhelmed the telecom operators’ local domain name servers. Internet access was severely disrupted across the country.
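A common way to keep clients from piling onto a failed service like this is capped exponential backoff with random jitter: each failed attempt waits longer than the last, and the randomness stops millions of clients from retrying in lockstep. The sketch below is a generic illustration of that technique. Nothing in the reports says that Baofeng Yingyin or the apps affected on October 4 worked this way, and the function name and default values are assumptions chosen for the example.

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Return wait times in seconds for successive retries.

    Uses capped exponential backoff with "full jitter": the n-th retry waits
    a random time between 0 and min(cap, base * 2**n).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Hypothetical usage: a client that failed to resolve a name plans 8 retries.
for number, delay in enumerate(backoff_delays(8), start=1):
    print(f"retry {number}: wait up to {delay:.1f}s before asking again")
```

A client that retries immediately and forever, by contrast, turns one outage into a flood of extra requests aimed at whatever part of the infrastructure is still standing.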

In this incident, Facebook’s own DNS servers were still working, but they deliberately stopped announcing themselves in order to protect the wider network. Although this was not hard to fix in itself, a chain of knock-on effects made the problem worse.
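The protective rule described above can be sketched as follows. This is a simplified illustration, not Facebook’s actual implementation; the class, the method names and the address blocks are hypothetical. It shows why a rule that is sensible when one server loses its link becomes self-defeating when the backbone fails everywhere at once.

```python
class DnsNode:
    """A hypothetical DNS server that announces its own route only while healthy."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.announcing = True

    def health_check(self, can_reach_data_center):
        # Withdraw the announcement when the data centers are unreachable,
        # so that clients are steered toward healthier nodes.
        self.announcing = can_reach_data_center
        return self.announcing


def announced_prefixes(nodes):
    """Return the prefixes that the outside world can still find."""
    return sorted(node.prefix for node in nodes if node.announcing)


nodes = [DnsNode("192.0.2.0/24"), DnsNode("198.51.100.0/24")]

# One node loses its link: the other keeps serving.
nodes[0].health_check(False)
print(announced_prefixes(nodes))

# The backbone fails everywhere: every node withdraws, and nothing is left.
for node in nodes:
    node.health_check(False)
print(announced_prefixes(nodes))
```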

How could it be so serious?

The lost network connection and the failure of domain name resolution cut Facebook’s remote engineers off from the servers and disabled many of their usual maintenance tools. A Facebook insider described the situation on Reddit at the time:

  • Those who knew how to fix the problem could not reach the routers and had no login rights;

  • Those with login rights neither knew how to fix it nor could connect;

  • The only employees with physical access to the routing equipment in the machine room had no login rights and did not know how to fix it.

Because the internal communication tools were down as well, the three groups had difficulty cooperating, which made matters worse.

The chaos inside the company was total. Employees normally communicated through the company’s own tools, and even when they needed third-party products such as Google Docs or Zoom, those too required single sign-on with a Facebook account. The outage brought all of this to a halt.

Some employees had already signed in to Google Docs and similar services with their company accounts before the incident, and were less affected. Others came online in a hurry and found that they could reach colleagues only through alternative services such as Microsoft Outlook-based work email or Apple’s FaceTime.

Zheng Jun, a reporter from Sina Technology in Silicon Valley, wrote:

“A FB friend said that everyone was embarrassed today. They didn’t know what happened or what to do. They had to pretend that nothing happened and work for a non-existent website.”

Obviously, the repair could not be completed remotely. Engineers rushed to the main data center in California to take part. In the meantime, some employees could not badge into company buildings and conference rooms, whose doors can be opened only with an access card and have no keyhole.

At one point The Verge carried even more dramatic news: because the access cards had failed, engineers supposedly had to take a cutting tool and saw their way into the data center’s server cage. That report was never confirmed, however, and was later withdrawn.

Once people had been “physically transferred” to the right place, things were relatively easy to handle. They only needed to “activate the secure access protocol”, with no saw required.

Even after the problem was solved, however, the gates had to be opened gradually. Opening every channel at once would have released a sudden surge of traffic and caused further crashes. The load had to be increased step by step, and users outside the United States had to wait longer before access resumed.
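The idea of “opening the gates” gradually can be sketched as a staged ramp. The official statements quoted here do not say how the load was actually stepped up, so the linear schedule, the function names and the numbers below are purely illustrative assumptions.

```python
def ramp_schedule(target_load, steps):
    """Return the load allowed at each stage, rising linearly to target_load."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [target_load * (i + 1) / steps for i in range(steps)]


def admitted(demand, allowed):
    """Serve at most `allowed` units of the waiting demand; the rest keeps waiting."""
    return min(demand, allowed)


# Hypothetical numbers: 250 units of pent-up demand, capacity restored in 4 stages.
for stage, allowed in enumerate(ramp_schedule(100, 4), start=1):
    print(f"stage {stage}: admit {admitted(250, allowed)} of 250 waiting")
```

Each stage admits only part of the waiting demand, so systems that have been idle for hours are not hit with the full backlog at once.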

In the end, everything went back to normal, including Facebook’s stock price, which had fallen by as much as 5%.

Is “remote” entirely to blame?

In May 2020, China had roughly contained the first wave of the epidemic, while the epidemic in the United States was heating up sharply. At that time, Facebook stated that, with its public office space closed, all eligible employees needed to work from home. This was meant as a short-term, temporary measure: once the epidemic was under control and offices reopened, only certain employees, especially the most senior and experienced, would be allowed to work remotely over the long term.

A year later, on June 9, 2021, Facebook updated its policy and extended long-term remote work to any employee whose job can be done from home.

Zuckerberg wrote:

“In the past year, we have learned that employees can do good work anywhere. I am more optimistic that remote work is possible on a large scale, especially as remote video and virtual reality continue to improve.”

Naturally, the jobs that cannot be done remotely usually include those involving hardware or data centers. But judging from this incident, it is obvious that even data center and gateway positions had already gone “remote”.

Zuckerberg also said that Facebook will begin to allow employees to work remotely across national borders. Facebook will allow US employees to request remote work in Canada, the UK and the European Union. By January 2022, the company will allow employees to move permanently between seven countries in Europe.

According to statistics, Facebook’s total workforce is about 60,000. Most of the offices in the United States reopened at 50% of the capacity in early September and are scheduled to be fully opened in October.

Facebook is not the only tech giant that has chosen to embrace telecommuting more deeply. In its own case, one reason is that the epidemic and its aggressive variants have become a prolonged, normal condition. Another is that Facebook’s business is itself about letting people communicate remotely. Augmented reality and the so-called “metaverse” services built on Oculus are also part of its plans. The company intends to create a network territory that crosses physical distances and national borders, and having its own employees use it first serves as an internal exercise.

Until now, people’s main concerns have been whether remote work hurts efficiency, and whether it can fully simulate and replace the experience of working on site and produce the so-called “chemistry” among colleagues.

Zuckerberg said that employees who want to work in a Facebook office will be required to spend at least half of their working time there. This is to keep the offices lively and to ensure that employees who come in make the most of the space and become part of the community. In addition, the company plans to organize regular in-person gatherings for office and remote staff to maintain relationships between colleagues.

Another issue worth attention is pay. Remote work can easily lead to “working in my hometown on a Beijing salary” or “working in Thailand on a Silicon Valley salary”. The salary a company sets for an employee generally takes local prices and housing costs into account (where rental subsidies are not paid separately), so long-term remote work means that some degree of pay cut has to be negotiated. However, given that people differ in work pace and preferences, both employees and companies are often willing to accept such changes for the sake of maximizing efficiency.

From now on, people will have to think about a more fundamental question on top of these: what do I do if I cannot connect to the network infrastructure that telecommuting depends on?

Until now this question came up only in Zoom meetings, because network conditions and distances differ from one participant to the next. Even today, more than a year after the outbreak, a voice or video conference can still be torture. An image circulating online says that attending a work meeting now feels a bit like “summoning spirits”.

Still, we have WeChat groups and DingTalk groups. Voice and video may be a luxury, but sending short voice messages in a group, or simply typing, has long been routine for us, and we assume nothing can go wrong with it. In this sense WeChat has become a kind of telecom operator that spans physical networks and national borders, a true piece of infrastructure.

This time, it was exactly this kind of service, the kind we thought could never go down, that went down.

The risk that was deliberately ignored

For the first hour or two, people simply posted memes and joked about it next door on Twitter. The longer it went on, the fewer people could laugh.

Many people suddenly realized that the only way they had of reaching colleagues, friends and even family members living far away was online. The phone number they had on file might be years old and no longer in use. If they lost touch now, they really did not know when they would find each other again.

During the challenging epidemic period, WhatsApp, as the international version of “WeChat,” allows people from all over the world to keep in touch with the communities around them, and therefore made many unique contributions. Many important activities would not be possible without it. The official WhatsApp website lists some of them:

  • Survivors of India’s “indentured labor” system shared epidemic information through WhatsApp groups, helping peers with little education and poor access to news;

  • A WhatsApp group in Pakistan raised 21 million rupees to help disadvantaged groups;

  • Jordan’s Employment Promotion Program uses WhatsApp to help women find jobs;

  • Teachers in Syrian refugee camps shared video lessons with parents on WhatsApp;

  • A group of Italian mayors used WhatsApp to follow each other’s updates in real time, and an elementary school in Naples used WhatsApp to send homework assignments while classes were suspended;

  • Medical staff in Paris formed a WhatsApp group to share up-to-date information on hospital beds, resources and more.

For many people, not being able to access Facebook is just an inconvenience. But for some small businesses in developing countries that have no other reliable way to reach their customers, it can be a serious problem.

India has 340 million Facebook users, the most in the world, and WhatsApp is also an important tool for communication between individuals and businesses in the country. According to research firm eMarketer, India has nearly 490 million active WhatsApp users.

These two platforms not only play the roles of China’s “Weibo” and “Official Accounts” in promoting products; like “mini programs”, they also serve as online stores that sell them. Thousands of Indian businesses were forced to suspend trading, and their customers were unable to buy daily necessities online.

In Brazil, government officials and even the education system are using WhatsApp. Students can receive test results from WhatsApp. The hospital also uses WhatsApp to make appointments and remote consultations.

The six-hour service interruption reduced Zuckerberg’s paper wealth by $6 billion. But the losses suffered by governments, businesses, charities, communities and ordinary people around the world who lost contact through tools such as WhatsApp, Messenger and Instagram are probably incalculable.

This physical outage clearly shows that remote work rests on fragile infrastructure, and that it is fragile for lack of backup. Of course, the more redundant backups there are, the safer things are, but the more burdensome they become for users. So people are always tempted to throw off the shackles of security and “run naked.”

Just as it did with its own laughable intranet setup, this technology giant has handed people around the world another fragile structure. People find it hard to shake off their daily reliance on social media tools, especially Facebook’s products.

Even if technology giants such as Facebook were broken up, as some U.S. senators insist they should be, the situation might not change much, because that would change only the entity that develops the products, not people’s willingness to gather on a single platform.

Regardless of whether Facebook’s failure was caused by unintentional error, human manipulation or malicious code, it is disturbing that the temporary shutdown of a company can affect so many Internet users across national borders around the world. This shows how fragile the international Internet ecosystem that supports the global operation of the post-epidemic era is, and new risks are almost at hand.

In human history, there have been many similar lessons:

  • During the Second World War, Nazi Germany began to invade small countries and gradually attacked the Soviet Union, swallowed France, and pointed at England;

  • The US government had obtained relevant intelligence before 9/11;

  • The signs of a housing bubble before the 2008 financial tsunami were quite obvious;

  • When the overhaul of education and training institutions began a few months ago, the relevant policies had already been in place for several years.

The same holds within the field of information security. In May of this year, Colonial Pipeline in the United States was attacked by ransomware. The oil artery, which runs across many states, was shut down for a time and resumed operation only after a ransom of about 4 million U.S. dollars was paid in cryptocurrency.

Yet common malware and ransomware attacks can be prevented by basic measures such as keeping corporate security software up to date, enforcing strict security practices for employees, and upgrading to the latest version of the operating system. A Microsoft representative, explaining why Windows 11 raised its hardware requirements so sharply that some machines only 3 to 5 years old cannot be upgraded, said:

“What we learned from Windows 10 is that if you make security settings dispensable, people won’t turn them on. This is a big lesson.”

The risks were there, and the warnings were never absent. But because of other pressing issues, people deliberately ignored both until they finally erupted.

The Internet is a fragile ecosystem built on a global network of submarine cables and distributed servers. It is easy to forget that the Internet is not just a conceptual network, let alone an intangible “metaverse”. It is built on physical infrastructure, which matters a great deal; who controls that infrastructure matters even more.

Because human beings are social animals who gather in groups, they are perhaps destined not to accept constantly switching social media providers in the manner of “the cunning rabbit with three burrows”. Quitting Facebook, Twitter, YouTube or any other centralized service costs too much and seems unnecessary. In that case, the giants themselves and the regulators above them take on greater responsibility, because what is at stake is no longer just the steady profitability of a business but the livelihoods of the hundreds of millions of people and small and medium-sized enterprises that depend on it.

Society is scrutinizing whether the giants can fulfill their social responsibilities and keep infrastructure running smoothly. If they cannot, policy will soon shift toward accelerating competition, breaking up monopolies, providing alternatives and strengthening supervision. This is exactly what countries are doing now: no Internet company should become “too big to fail”.

Within a few hours of Facebook restoring its network, a congressional hearing opened in which its former employees accused the company of “just making money, regardless of public interest”. As usual, we will hear rebuttals from everyone from the vice president of legal affairs to Zuckerberg himself. But the outage itself shows, more eloquently than any other material, that people need to stay vigilant about large platforms and keep them from encroaching further on our lives and work until we are in effect left with no other choice.

Perhaps the first thing everyone should do in the name of this vigilance is to exchange other kinds of contact information with the friends and colleagues in their WeChat groups whom they have never met in person.

References

  • https://www.theverge.com/2021/10/4/22709575/facebook-outage-instagram-whatsapp

  • Update about the October 4th outage

  • https://www.newsweek.com/15-billion-facebook-users-personal-information-posted-sale-after-hack-1635439

  • https://blog.cloudflare.com/october-2021-facebook-outage/

  • https://www.theverge.com/2021/10/4/22709260/what-is-bgp-border-gateway-protocol-explainer-internet-facebook-outage

  • https://www.cnbeta.com/articles/deep/84831.htm

  • Reddit comment by u/Gunjob in the r/sysadmin discussion

  • https://weibo.com/1644101945/KBgLTaqUD

  • https://www.facebook.com/careers/life/what-remote-and-flexible-work-will-look-like-at-facebook

  • https://www.whatsapp.com/coronavirus

  • https://www.bbc.com/zhongwen/simp/world-58814884

  • http://www.xinhuanet.com/world/2021-05/12/c_1127438689.htm

  • https://www.cnbeta.com/articles/tech/1186687.htm

This article is from the WeChat public account “航通社” (ID: lifeissohappy), author: Shuhang, 36氪 published with authorization.
