So, Amazon had four major outages of its cloud services last year, the longest and most notable being the one that took Netflix streaming down for 23 hours during the busy holiday period. This article over on TechTarget discusses how Cloud uptime is many times higher than enterprise data center uptime. This got me thinking about how Cloud uptime should be measured compared to data center uptime. If you have any experience managing SLAs, you have discovered that how uptime is defined (if at all) determines your vendor’s incentive, and therefore its level of effort, in restoring your application.
Applications are the name of the game, especially in Cloud-based services. Normally when you consider a Cloud solution, you are mainly looking at hosting a specific application or set of applications. This is where the legal jargon of what a vendor considers availability versus what you may consider availability comes into play. In the traditional data center we measure the availability of the major subsystems, from HVAC to SAN and network. But how does this all translate to a service provider? If your virtual machine (VM) is up and running but the private connection from the Cloud to the database hosted in your DC is unavailable, how is uptime calculated? Who is responsible for ensuring the connection stays available? This is just one example of the cross-organizational support issues that exist in a Cloud environment.
It doesn’t mean much to have a VM up and running if you can’t access the application. Some vendors will still consider that state operational and not count it against uptime. I do sympathize with Cloud vendors, as this is a slippery slope. How do you create a demarcation between the Cloud provider’s assets and the customer’s? You can use application monitoring tools that create synthetic transactions to measure uptime, but what if the application failure is on your end and not the Cloud provider’s?
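To make the measurement question concrete, here is a minimal Python sketch of how the same month of synthetic-transaction results can produce three different uptime numbers depending on which definition the SLA uses. Everything in it is an assumption for illustration: the `Probe` fields, the three definitions, and the probe counts are made up, not taken from any vendor's SLA or real outage data.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One hypothetical hourly synthetic-transaction result."""
    vm_up: bool    # the provider's VM answered a health check
    link_up: bool  # the private link from the Cloud to the customer's DC was up
    app_ok: bool   # the end-to-end application transaction succeeded


def availability(probes, is_up):
    """Percentage of probes that count as 'up' under the given definition."""
    up = sum(1 for p in probes if is_up(p))
    return 100.0 * up / len(probes)


# Made-up month of 720 hourly probes (illustrative numbers only).
probes = (
    [Probe(True, True, True)] * 700      # everything healthy
    + [Probe(True, False, False)] * 15   # VM up, private link down
    + [Probe(True, True, False)] * 3     # VM and link up, app broken on the customer side
    + [Probe(False, True, False)] * 2    # the VM itself was down
)

# Three hypothetical ways an SLA could define "up".
definitions = {
    "VM only": lambda p: p.vm_up,
    "VM and private link": lambda p: p.vm_up and p.link_up,
    "end-to-end application": lambda p: p.app_ok,
}

for name, is_up in definitions.items():
    print(f"{name}: {availability(probes, is_up):.2f}%")
```

With these made-up counts, the "VM only" definition reports about 99.72%, while the end-to-end definition reports about 97.22% for exactly the same month. The gap between the two is the ground the SLA wording has to cover, including the three hours where the failure was on the customer's side.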
This is one of the areas where I think providers like Rackspace have an opportunity to gain market share with their OpenStack Cloud. Their “Fanatical” support approach looks to help customers keep their applications up and running no matter the source of the issue. They will even, upon request, log into your VM and help figure out the issue. I’ve dinged Rackspace in the past for playing word games with their SLAs but, bottom line, they do have some of the best and most transparent support in the hosting business.
Even with this different take on Cloud support, measuring and evaluating uptime metrics is an interesting challenge for Cloud customers. It introduces a new skill: vendor SLA management. If you are looking to migrate your first application to the Cloud, I’d take an especially hard look at the Cloud provider’s wording for service availability, which parts of the infrastructure and service they commit to keeping available, and the level of support offered.
As an enterprise customer looking at Cloud providers, how do you evaluate (and value) them on availability?