Multi-Cloud Resilience
Recently I heard a senior IT Executive question why we can't have an Active-Active high availability pattern across multiple clouds. That's a great question, and I have no doubt plenty of IT Executives are pondering the exact same thing during their sleepless nights. Well, the simple answer is YES, but with a few caveats.
Before you get overly excited about snatching the multi-cloud always-on Holy Grail, let's examine some basic principles of resilience architecture. I would sum them up with two well-known acronyms: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). High availability should not focus solely on reducing system downtime; data loss caused by component failure or a disaster event must also be considered.
For example, let's say you have successfully implemented a multi-cloud distributed application which runs simultaneously on AWS and Azure. Super cool. This Active-Active resilient pattern, combined with each vendor's multi-AZ resilience (i.e. multiple data centres), can withstand the total loss of one cloud vendor without suffering any outage. That's a great achievement, because unplanned system downtime is virtually zero (i.e. 100% uptime with RTO = 0). However, what if I told you the small price to pay for this architectural masterpiece is that up to 15 minutes of customer transaction data could be lost, never to be recovered? Do you still think this Active-Active multi-cloud design is going to be well received by the stakeholders, knowing some customer data will be permanently lost? Perhaps not.
In my view it's relatively easy, or at least not impossible, to achieve always-on high availability. It's certainly easier for some applications than others. For instance, a stateless application, one where each request or transaction is completely self-contained and requires no client or session state for processing, can be deployed across multiple cloud vendors operating in Active-Active mode. In the current diverse IT landscape a stateless application could be one that collects telemetry data from machinery, like speed and temperature readings.
It could be an application that receives Kafka messages and distributes each one to a different queue or microservice based on the message header. It could also be an application that lets you leave comments on a social media post. Generally speaking, stateless data-ingestion applications are much easier to fit into an Active-Active resilient pattern because each application stack can operate independently. It is certainly a lot harder to implement Active-Active for a stateful mobile banking application where transaction sequence, data integrity and consistency are absolutely paramount.
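To illustrate, here is a minimal sketch of what such header-based routing might look like. The message types, queue names and message shape are all hypothetical, invented purely for illustration; the point is that the routing decision depends only on the message itself, so identical copies of this code can run in any cloud without coordinating with each other:

```python
def route_message(message: dict) -> str:
    """Pick a destination queue from the message header alone.

    No client or session state is consulted, which is what makes the
    application stateless and therefore easy to run Active-Active.
    """
    routes = {
        "telemetry": "telemetry-queue",  # e.g. speed/temperature readings
        "comment":   "social-queue",     # e.g. social media comments
    }
    msg_type = message.get("header", {}).get("type")
    # Unknown message types go to a dead-letter queue for inspection
    return routes.get(msg_type, "dead-letter-queue")


print(route_message({"header": {"type": "telemetry"}, "body": "42 km/h"}))
# prints "telemetry-queue"
```

Because no instance holds state, losing the AWS stack simply means the Azure stack keeps consuming the same topics on its own.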
Regardless of the type of application there is still one big catch: can you afford to lose data (i.e. RPO > 0)? If telemetry data is collected every 2 minutes, then losing a few minutes of data is not going to destroy the application. If the lost Kafka message is a request for the latest ASX stock quote, then losing one message means the end user has to repeat the same request after receiving a time-out error. No real material damage, apart from a poor user experience. The simple question comes down to whether you are prepared to accept some data loss in order to achieve Active-Active always-on resilience.
If the answer to the data loss question is synonymous with 'having your cake and eating it too' (i.e. RTO = RPO = 0), then the task becomes a lot harder. You'd need to introduce bidirectional synchronous data replication between databases (or data nodes) deployed in separate clouds. Even assuming you have found the perfect replication technology, network latency will still be a major challenge to overcome: some applications are very sensitive to response time, and the performance penalty could render the application unusable.
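To put a rough number on that latency penalty: with synchronous replication, every commit must wait for the remote acknowledgement before returning to the caller. The figures below are assumptions for illustration only, not measured AWS-to-Azure latencies:

```python
def max_serial_commits_per_second(local_commit_ms: float,
                                  cross_cloud_rtt_ms: float) -> float:
    """Upper bound on commit rate for a single serial writer.

    Each synchronous commit blocks for the local write plus at least one
    cross-cloud round trip before it can acknowledge the client.
    """
    return 1000.0 / (local_commit_ms + cross_cloud_rtt_ms)


# Assumed 2 ms local commit plus an assumed 30 ms cross-cloud round trip:
print(max_serial_commits_per_second(2.0, 30.0))   # prints 31.25
# The same writer within one region (sub-millisecond RTT) would manage
# hundreds of commits per second, so the cross-cloud hop dominates.
```

Concurrency can claw back aggregate throughput, but the per-transaction response time floor set by the round trip never goes away, which is exactly what latency-sensitive applications cannot tolerate.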
How about choosing asynchronous data replication, which would significantly reduce the performance impact? That would help, but how do we then guarantee no data loss while running with asynchronous replication? Well, don't despair: it can be done if your application is designed for distributed transactions with two-phase commit capability (e.g. XA). In a nutshell, if transaction data has not been successfully committed to all distributed data stores within the permitted time threshold, the application will roll back the transaction on the assumption that data is either corrupted or lost. So the bar for committed data in this context is higher than for a non-distributed application, where data only needs to be committed to one data store rather than all of them.
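As a toy illustration of that all-or-nothing rule, here is a simplified two-phase commit sketch using hypothetical in-memory "data stores". A production XA coordinator also handles crash recovery, in-doubt transactions and real timeouts, all omitted here:

```python
class DataStore:
    """Hypothetical participant in a distributed transaction."""

    def __init__(self, name: str, healthy: bool = True):
        self.name, self.healthy = name, healthy
        self.committed = []
        self.pending = None

    def prepare(self, txn) -> bool:
        # Phase 1: vote. A store that is down, or that fails to respond
        # within the permitted time threshold, counts as a 'no' vote.
        self.pending = txn if self.healthy else None
        return self.healthy

    def commit(self):
        # Phase 2: make the prepared transaction durable.
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        self.pending = None


def two_phase_commit(stores, txn) -> bool:
    """Commit txn to every store, or to none of them (RPO = 0 across stores)."""
    if all(store.prepare(txn) for store in stores):
        for store in stores:
            store.commit()
        return True
    for store in stores:  # any 'no' vote aborts the transaction everywhere
        store.rollback()
    return False
```

For example, a healthy AWS store and a healthy Azure store both end up with the transaction; if either votes 'no', neither keeps it, which is precisely the stricter definition of "committed" described above.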
I'd also like to mention another common practice, which I call the hybrid Active-Active-Passive pattern. It is not strictly Active-Active because there is a shared common component between the two active sites/cloud vendors, most likely the database or data store layer. For example, two identical sets of application components are deployed to AWS and Azure, but they share a single primary database in AWS. A passive standby database is configured in Azure, running synchronous replication with the primary.
The database can be failed over from AWS to Azure without data loss, but it may take a few minutes to do so (i.e. RTO > 0). Needless to say there will be a performance impact for cross-site/cloud data access (i.e. due to network latency), but zero data loss is guaranteed. Some people may argue this should also be recognised as an Active-Active resilient pattern, but I'll leave you to be the judge.
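The failover decision in this hybrid pattern can be sketched in a few lines. The endpoint names below are made up, and promoting a real standby database is a vendor-specific operation that typically takes minutes, which is exactly why RTO > 0 here even though synchronous replication keeps RPO = 0:

```python
# Hypothetical endpoints: both application stacks (AWS and Azure) always
# write to whichever database currently holds the "primary" role.
PRIMARY = {"cloud": "aws", "endpoint": "db.aws.example.com", "role": "primary"}
STANDBY = {"cloud": "azure", "endpoint": "db.azure.example.com", "role": "standby"}


def active_database(primary_healthy: bool) -> dict:
    """Return the database both application stacks should write to.

    While the primary is healthy, Azure's application stack pays the
    cross-cloud latency cost to reach it. On failure, the synchronous
    standby is promoted; the promotion itself is the RTO window.
    """
    if primary_healthy:
        return PRIMARY
    STANDBY["role"] = "primary"  # promotion: minutes of downtime, zero data loss
    return STANDBY
```

In practice the health check and promotion would be driven by the database vendor's own tooling rather than application code; the sketch only shows why this pattern trades a non-zero RTO for a guaranteed zero RPO.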
If I haven't completely lost you by now, I hope you'd ask: 'do we really need an Active-Active multi-cloud resilient pattern?'. Some applications are easier to implement Active-Active across multiple cloud vendors; is my application one of them? Can my application handle distributed transactions? Can we afford to lose some data in pursuit of always-on zero downtime? What compromises would I have to make? I hope this article gives you some inspiration, so that next time your senior IT Executive asks why we can't run Active-Active in multi-cloud, your answer will be a resounding YES WE CAN but…
This article is a guest post by Tommy Tang (https://www.linkedin.com/in/tangtommy/). Tommy is a well-rounded and knowledgeable Cloud Evangelist with over 25 years of IT experience covering many industries in Australia, including Telco, Finance, Banking and Government agencies. He is currently focusing on the Cloud phenomenon and ways to best advise customers on their often confusing and thorny Cloud journey.