Does the Cloud fail?
The short answer: yes.
The consultant answer: it depends.
One could also ask the question “Should the Cloud fail?”
My answer is: yes, failures are bound to happen in the Cloud. Why? Because the Cloud is designed and built by humans. And we all know humans are not perfect, present company included.
A “perfect” example of this is the latest Azure outage. While we still have to wait for the final RCA (Root Cause Analysis), a lot can be (and already has been) written on that topic. I do, however, recommend listening to the Microsoft Azure Podcast on the matter. It provides some fundamental human insight into the problem.
Yes, the forces of nature showed that Mother Nature will always have the final word. But one part of the discussion was that some internal Microsoft services still relied on Azure ASM (aka Classic aka old school) technology. While Microsoft has been pushing everyone to ARM (Azure Resource Manager), they kept on using ASM themselves. And that has now failed. There’s no doubt that migrating from ASM to ARM was on some project sheet or to-do list. But they just never got around to it. Doesn’t that sound familiar? It should, and it’s called being human.
There’s also the matter of some storage account that wasn’t correctly configured and therefore not as highly available as everyone believed. Again, this sounds all too familiar, doesn’t it?
Still, I do find myself puzzled by the fact that the failure also impacted European users. But I’m not going to speculate any further; let’s wait for the final RCA in a few weeks.
I’ve heard and read comments on Microsoft’s poor design, their lack of in-region availability zones, etc. The fact of the matter is, humans have designed the infrastructure of the competition too. So are they less prone to failure? Probably not. More likely, their failures just haven’t surfaced yet.
But let’s take a moment to put this into perspective. The Cloud doesn’t present more failures compared to on-premises Enterprise IT. In fact, one might argue the opposite is even true. Cloud failures are likely to be more visible, but that doesn’t mean they happen more often.
What differentiates a good IT engineer from a great one? In my opinion, their ability to learn from their mistakes. So far, Microsoft has been very open about the nature of the failure and has presented a clear vision for learning from its mistakes and adapting.
Some will say “too little, too late”. I’m not one of them. I do believe Microsoft does deserve some credit for their willingness to move forward and solve their issues. Those who think it can’t happen to them… that’s probably worth a future blog article.
September 24, 2018