How a storm in the cloud brought Room 77 and other websites back down to earth

Last week, hotel booking service Room 77 was knocked offline in a mammoth cloud-computing outage that also crashed the websites of dozens of companies.

Some very big players, such as Yelp, were among them after a lightning storm downed the off-site data centers run by Amazon Web Services (AWS).

At 11:10 pm EDT on Friday 29 June, Room 77's engineering team received an automated notification from its alert system that its website was down. That signaled the start of a 19-hour outage that also affected Netflix, Pinterest and Instagram.

Like many start-ups, Room 77 relies on cloud-based web hosting and data services, renting computing capacity from huge data centers run by Amazon.

On Friday evening, the AWS Service Health Dashboard revealed that the company had lost its main power supply as well as its backup power generator in the Washington, DC, metro area.

A glitch in time, ain’t Amazon prime

Two causes are to blame for the service disruption, says Amazon spokesman Drew Herdener: a lightning storm on the US East Coast that took out power to several main AWS data centers and their back-up generator, plus an additional second added to the world clock between Saturday and Sunday to make up for quirks in the Earth’s rotation.

"You ever wish you had an extra second or two? This is not one of those times," tweeted Reddit, the social news site that was also knocked offline by the leap-second change and the storm-related outage.

Life’s a glitch and then you cry

At the time of the outage, most of Room 77’s staff members were having dinner or were at home. Within minutes, they began to react to the problem.

Says Kevin Fliess, vice president of products and general manager:

"As a web-based business you have to expect some unplanned downtime. Within ten minutes of the outage everyone on the team from technology to marketing knew the situation and we began working through a set of tasks to ensure that our customers were well supported during the outage. Our first priority was to ensure that our existing customers holding reservations were well supported."

He adds:

"During the outage we were able to provide support via email and through our toll-free number. Having these other support channels in place helped ensure smooth business continuity. If we had web-based support only, it would have been much more painful."

Room 77’s servers were restored on Saturday at about 7:30 PM EDT.

Amazon looking more like a dwarf now?

AWS holds an estimated 80% of the cloud services market. Yet this week some experts were wondering whether this is the end of AWS’s near-monopoly grip on the infrastructure-as-a-service market. They point to Research in Motion (RIM), whose loss of smartphone market share accelerated dramatically after an October 2011 global outage of service to BlackBerry devices.

This was not AWS's first outage. Two weeks earlier, it had a six-hour outage. In April, it had a four-day “epic fail”. In March, a former AWS employee who left for travel search startup Hipmunk wrote publicly on online forum Reddit that AWS services were full of glitches. (Incidentally, Hipmunk's servers are on AWS, but on AWS West rather than AWS East, so it avoided the power outage trouble.)

Google is a likely new rival to AWS. Last Thursday, the Internet behemoth announced its plans to launch a cloud-services platform called Compute Engine at prices that undercut AWS’s rate sheet on a like-for-like service basis.

Google is offering 3.75GB of memory with 1 virtual core and 420GB of hard disk space for $0.145 an hour, compared with Amazon’s nearly identical service (except for 10 gigabytes less of hard disk space) at $0.16 an hour.

The closest offering by Rackspace, the second-largest provider after AWS, costs $0.24 an hour and includes merely 4 gigabytes of hard disk space. It's rumored in online forums that Microsoft will adjust the price list for its similar service to respond to the new market dynamics.

Amazon’s Herdener says AWS prices its plans by depth of reliability. The costliest plans include redundancies that distribute customers' loads among multiple data centers, making outages up to 99 percent unlikely on AWS’s much-touted Elastic Compute Cloud (EC2) service.

Meanwhile, startups scared off by the cloud and worried that AWS and other services are too broad or risky for their needs might consider buying a Storage Area Network (SAN), which comes with plenty of data redundancies (including back-up power supplies) for about $50,000 from a supplier like HP or IBM.

Developing back-ups for the back-ups

At Room 77, Fliess says they are taking several measures to ensure that this does not happen again.

"For example, we will be investing in greater data center redundancy across geographical regions, improving monitoring and alert systems, and providing better customer support."

The lightning bolt that hit AWS is an opportunity for companies that can help startups back up their data seamlessly, instead of keeping all their data in one basket.

New startups, like CliQr, claim to have technology that lets companies easily move internal enterprise applications between clouds, whether "private clouds" run on their own machines or "public clouds" operated by cloud service providers like Amazon.com, Google, Microsoft and Rackspace.

In Room 77’s case, Fliess plans for greater "data center redundancy across geographical regions" and "improving monitoring and alert systems."

Fliess adds:

"As a start-up we've leveraged AWS as a cost-effective channel for data storage and computing capacity. AWS as well as their competitors are offering services from multiple datacenters, although it costs a lot more to deploy servers across multiple geographical regions. As our traffic grows, we are increasing our infrastructure investment in order to create redundancy across geographical regions. That way, if a catastrophic event knocks out a single data center, we'll have uninterrupted web and mobile operations."

Twitter as emergency customer tool

Many companies used Twitter to keep their consumers updated, shining a spotlight on social media’s importance as a back-up customer service arm for many start-ups.

Developing an emergency plan for using social media sites like Facebook, Twitter, and Google+ to alert customers may be something travel start-ups will consider.

Two hours into the crisis, Room 77 tweeted the bad news to its customers. Room 77 was offline for 19 hours, a frustratingly long time for a consumer-facing company not to be able to reach customers. Social media may earn a reputation for being handy in a pinch.

NB: Lightning bolt image via Shutterstock.

Clarification: This post was updated on 12 July to redact a reference to IHG Hotels. A representative of the company had originally confirmed that IHG had experienced downtime but another representative has clarified that this was not the case.
