Recent outages at critical services across the net have caused massive disruption over the last few days: whether it was Amazon’s S3 failure, which took down thousands of sites, Cloudflare’s “Cloudbleed” security issue, which forced many sites to ask users to reset their passwords, or Google Wifi’s accidental reset, which wiped out customers’ internet profiles, the Internet’s infrastructure seems to have grown substantially less stable recently.
The packet-switching technology that underlies most of the Internet was created by Paul Baran as part of an effort to protect communications by moving from a centralized model of communication to a distributed one. While the Internet Society questions whether the creation of the Internet was a direct response to concerns about nuclear threat, it clearly agrees that “later work on Internetting did emphasize robustness and survivability, including the capability to withstand losses of large portions of the underlying networks.”
From there, the foundation was laid for an Internet that treated the distributed model as a key component of reliability. Almost 50 years later, consolidation around hosting and the rise of the cloud have concentrated that infrastructure in the hands of a few key players: Amazon, Microsoft, and Google now host a large number of sites across the web. Many of those companies’ customers have opted to host their infrastructure in a single set of data centers, potentially increasing the frailty of the web by re-centralizing large portions of the net.
That’s what happened when Amazon’s S3 service, essentially a large hard drive used by companies like Spotify, Pinterest, Dropbox, Trello, Quora, and many others, lost one of its data centers. Companies that had stored their content in that one data center essentially stopped functioning properly, prompting experts to recommend that companies look at storing data across multiple data centers to increase reliability.
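The multi-data-center advice can be reduced to a simple pattern: try the primary location first, and fall back to replicas when it is unreachable. A minimal sketch in Python, where the region names and fetch functions are purely illustrative stand-ins, not Amazon’s actual API:

```python
def read_with_failover(key, fetchers):
    """Try each region's fetcher in order; return the first success.

    `fetchers` is an ordered mapping of region name -> callable that
    either returns the stored object or raises (that region is down).
    """
    errors = {}
    for region, fetch in fetchers.items():
        try:
            return fetch(key)
        except Exception as exc:  # region unavailable; try the next one
            errors[region] = exc
    raise RuntimeError(f"all regions failed for {key!r}: {errors}")


# Hypothetical scenario: the primary region is down, a replica serves.
def primary(key):
    raise ConnectionError("data center offline")

def replica(key):
    return f"contents of {key}"

regions = {"us-east-1": primary, "us-west-2": replica}
print(read_with_failover("logo.png", regions))  # served by the replica
```

A company that had wired even this much fallback logic into its reads would have degraded gracefully rather than gone dark when a single data center failed.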
On a different end of the spectrum, services that exist precisely for reliability purposes have been experiencing their own issues. Cloudflare, which provides security and hosting services for thousands of websites, revealed last week that a bug in its service could leak passwords from its customers’ sites, forcing thousands of sites to ask their users to change their passwords.
While those issues can only be fixed by the owners of the respective sites, the problem of centralization is slowly expanding back into the consumer realm. People using Google Wifi and Google Chromecast found themselves forced to reset their devices last week after a bug wiped out the centralized configuration files for many of those devices, forcing them offline for a period of time.
As more people and more devices get connected to the Internet, the lure of centralizing control to make management easier for companies is running up against the Internet’s original design for reliability and scalability. With every new largely centralized system that comes online, the Internet becomes more brittle, as centralization creates a growing number of single points of failure. In a world where hackers are looking for new ways to take down infrastructure, those centralized services must double down on security and reliability if we want the Internet to survive.
Startups relying on standardized infrastructure can get to market quickly, but complete reliance on a single set of servers is akin to building a castle on a swamp. While companies like Amazon, Microsoft, Google, and others have a responsibility to keep the infrastructure they provide stable, every company should consider how best to balance its systems across different data centers and how to adapt in case of failure.
The challenges exposed by these recent outages are nothing new to the Internet, and many of the smarter companies have taken lessons from history and built their offerings in ways that ensure reliability and stability. For example, while many companies were flailing because of this week’s S3 failure, Netflix, one of the poster children for Amazon’s services, was fine. After suffering a major outage in 2012, the company learned its lesson: it built a set of tools to keep content streaming even if the underlying data centers go dark, and created a suite of programs called “the Simian Army” to deliberately disrupt its own services.
Having proven that these tools work, the company has since open sourced the software so anyone can use and improve it. Many of the companies that failed yesterday would be well served to take advantage of it to avoid the next infrastructure failure.
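The core idea behind Chaos Monkey, the best-known member of the Simian Army, can be illustrated in a few lines: randomly terminate a live instance, then confirm the service still answers by routing around the casualty. A toy simulation, assuming nothing about Netflix’s actual code, in which the instance names and classes are invented for illustration:

```python
import random

class Service:
    """A toy service backed by a pool of redundant instances."""
    def __init__(self, instances):
        self.alive = set(instances)

    def handle(self, request):
        if not self.alive:
            raise RuntimeError("total outage: no instances left")
        # Any surviving instance can serve the request.
        instance = next(iter(self.alive))
        return f"{instance} served {request}"

def chaos_monkey(service, rng=random):
    """Terminate one random live instance, as Chaos Monkey does in production."""
    victim = rng.choice(sorted(service.alive))
    service.alive.discard(victim)
    return victim

svc = Service(["i-1", "i-2", "i-3"])
killed = chaos_monkey(svc)          # kill one instance at random
print(f"killed {killed}; {svc.handle('GET /')}")  # service still answers
```

The point of running such disruption continuously, rather than waiting for a real outage, is that redundancy which is never exercised tends to quietly rot; the real tool operates on actual cloud instances rather than an in-memory set.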