Many digital companies, including travel startups like Airbnb and Foursquare, have been knocked offline for hours during the business day today.
The startups rely on web hosting and data services that are provided "in the cloud" on EC2, rented computing capacity via huge data centers run by Amazon.
In its most recent update, Amazon Web Services says it is "experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region."
Some analysts suspect the cause is a bug in EBS provisioned IOPS.
One travel startup reports that its "volumes that don't have provisioned iops are totally fine." Companies that set up a multi-region fault-tolerant configuration are also fine.
While we don't yet know the cause of this crash, a recent academic study found that vulnerable software includes "Amazon’s EC2 Java library and all cloud clients based on it."
Whatever the reason, the disruption is enough to make a startup think about at least having a spare server.
Airbnb had a Zen tweet about the news:
Apologies. Our site is having a case of the Mondays... We'll Airbrb as soon as possible
AWS holds an estimated 80% of the cloud services market. But given its recent series of crashes, one could be forgiven for wondering if it is doing enough to prevent outages.
On the other hand, the cloud has brought advantages. Cost savings are one. Hosting 1 terabyte of data a decade ago could cost $1 million a year and now it costs $50, according to Jim Davidson of Farelogix.
The AWS SLA sets an annual uptime percentage of 99.95%, which translates to about 263 minutes of allowed downtime per year.
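For readers who want to check that figure, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function name is ours for illustration, not anything from Amazon's SLA.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget implied by an uptime percentage."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


# 0.05% of 525,600 minutes is about 262.8 minutes, roughly 4.4 hours.
print(round(allowed_downtime_minutes(99.95)))  # 263
```

In other words, an outage lasting several hours in one day can use up most of a year's allowance at that uptime level.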
Switching hosts isn't easy either. Moving EC2 instances would require copying hundreds of gigabytes of files for startups the size of Airbnb.
As Tnooz contributor Steven Joyce of Rezgo has pointed out:
10 years ago these start-ups were hosting on dedicated managed servers in a single data center somewhere. Only these data centers were small and expensive and suffered from (arguably) more outage issues.
We hosted with a data center whose building power was cut off by a construction accident. It took over 12 hours to get service restored. I’m not saying crashes should be tolerated, I’m just saying that they happen and to assume otherwise is foolish.
One company benefiting from the crisis seems to be PagerDuty, which provides SaaS tools for IT on-call schedule management, alerting and incident tracking.