I’m grateful for Netflix’s transparency. It helps the broader cloud community understand the complexity of building redundant architectures on public cloud resources.
Originally posted on Gigaom:
Last week’s Amazon Web Services outage might have outsmarted Netflix’s Chaos Monkeys, but the content-distribution giant isn’t about to turn its back on cloud computing. According to a Friday blog post from the Netflix cloud team, the outage (which started with a generator failure and resulted in a cascading bug that took down AWS’s Elastic Load Balancer feature) exposed some flaws in Netflix’s operations, both within and beyond its control. Even so, it was a relatively small blip in what has been better overall availability since the company moved entirely to the cloud.
The control plane backlog that resulted from the AWS outage, which prevented customers from failing over into Availability Zones unaffected by the generator failure, was Amazon’s fault. However, Netflix’s Greg Orzell and Ariel Tseitlin write, the outage also highlighted some problems with the company’s own load-balancing architecture that compounded the situation by “essentially caus[ing] gridlock inside most of our services as they tried to traverse our middle-tier.” Netflix is working to fix this problem.
Still, they note, Netflix has had better overall uptime since moving to the cloud and is “still bullish on the cloud.” In part, that’s because Netflix has been able to architect its cloud-based services to be resilient even when AWS fails. Some of those decisions proved wise last week: