Is Disaster Recovery Really Worth The Trouble
(Part 4 of 4 part series)
Guest Post by Tommy Tang – Cloud Evangelist
In this final chapter of the Disaster Recovery (part 1, part 2 and part 3) discussion I am going to explore some of the common practices, and myths, regarding DR in the Cloud. I’m sure you must have heard about the argument for deploying application to the Cloud because of the inherited built-in resilience and disaster recovery capability. Multi-AZ, 11 9’s durability, auto-scaling, Availability Sets and multi-region recovery (e.g. Azure Site Recovery) and many more, are widely adopted and embraced without hesitation. No doubt these resilient features are part of the charm of using Cloud services and each vendor will invest and promote their own unique strength and differentiation to win market share. It’ll only take 30 minutes to fail over to another AZ so she’ll be right, yes?
If you remember in Part 2 of the DR article I stated the number one resilience design principle is to “eliminate single point of failure”. Any Cloud vendor could also become the single point of failure. If you’ve deployed the well architected, highly modularised and API rich application in Amazon Web Services (AWS), do you still need to worry about DR? The short answer is YES. You ought to consider DR capability provided by AWS, or any other Cloud vendor for that matter, to determine whether it does meet your requirement. The solution is indeed fit for purpose. Do not assume anything just because it is in the Cloud.
AWS is not immune to unplanned outages because Could infrastructure is also built on physical devices like disks, compute and network switches. Some online stores like Big W and Afterpay had been impacted due to unexpected AWS outage on 14th Feb 2019 for about 2 hours. What is your Recovery Time Objective (RTO) requirement? Similarly Microsoft Azure is also not immune to outages either. On 1st February 2019 Microsoft had inadvertently deleted several Transparent Data Encryption (TDE) databases after encountering DNS issues. The TDE database were quickly restored from snapshot backup, but unfortunately customers would have lost 5 minutes worth of transactions. Image what would you do if your Recovery Point Objective (RPO) is meant to be Zero? No data loss?
At this very moment I hope I have stirred up plenty of emotions and a good dose of anxiety. Cloud infrastructure and Cloud service provider is not the imaginative Nirvana or Utopia that you have been searching for. It’s perhaps multi-generation better than what you have installed in your data centre today, but any Cloud deployment still warrants careful consideration, design and planning. I’m going to briefly discuss 3 areas in which you should start exploring in your own IT environment tomorrow.
1. Backup and Restore
As a common practice you’d take regular backup for your precious application and data so you’d be able to recover in the most unfortunate event. Same logic applies here when you have deployed applications in AWS or Azure Cloud. Ensure you are taking regular backup in the Cloud, which is likely to be auto-configured, as well as a secondary backup stored outside the Cloud service provider. It’s exactly the same concept and reason for taking offsite backup which is, proverbially speaking, you don’t put all your eggs in one basket. Unless you don’t have a data centre anymore, your own data centre would be the perfect offsite backup location. I understand getting backup off AWS S3 could pose a bit of challenge and I’d urge you to consider using AWS Storage Gateway for managing offsite backups. It should make backup management a lot easier.
Once you’ve secured the backup of application and data away from the Cloud vendor, you’re now empowered to restore (or relocate) the application to your own data centre or to different Cloud provider as desired. Bearing in mind that you’re likely to suffer some data loss using backup and restore technique. Depending on the backup cycle it’s typically a daily backup (i.e. 24 hours) or weekly backup (i.e. 7 days). You must diligently consider all recovery scenarios to determine if backup and restore is sufficed for the Recovery Point Objective (RPO) of the targeted application.
2. Data Replication
What if you can’t afford to lose data for your Tier-1 critical application? (i.e. RPO is Zero) Can you still deploy it to the Cloud? Again the short answer is YES but it probably requires some amendment to the existing architecture design, and notwithstanding the additional cost and effort involved. I believe I have already touched on the design patterns Active-Active and Active-Passive in Part 2 of the DR discussion. If Recovery Point Objective (RPO) is Zero then you must establish synchronous data replication across 2 sites, 2 regions or 2 separate Cloud vendors. Ok, even though it’s feasible to establish synchronous data replication over long distances, the Law of Physics still applies and that means your application performance is likely to suffer from elevated network latency. Is it still worth pursuing? It’s your call.
There are generally 2 ways to achieve data replication across multi-region or multi-cloud. The first method is to leverage storage replication technology. It’s the most common and proven data replication solution found in the modern data centre, however it’s extremely difficult to implement in the Cloud. The simple reason is you don’t own Cloud storage but vendors do. There will be limited API and software available for you to synchronise data say between AWS S3 and the on-premises EMC storage array. The only alternative solution I can think of, and you might have other brilliant idea, is to deploy your own Cloud edge storage (e.g. NetApp Cloud Volumes OnTap) and presented to the applications hosted in various Cloud vendors. Effectively you still own and manage the storage (and data) rather than utilising the unlimited storage generously provisioned by the vendor, and as such you are able to synchronise your storage to any secondary location of your choice. You have the power!
As opposed to using storage replication technology you can opt for host or software based replication. Generally you are probably more concerned of the data stored in database than say the configuration file saved on the Tomcat server. Following on this logic data replication at the database tier is our first and foremost consideration. If you are running Oracle database then you can choose to configure Data Guard with synchronous data replication between AWS EC2 and on-premises Linux database server. On the other hand if your preference is Microsoft SQL Server then you’d configure SQL Server Always On Cluster with synchronous replication for databases hosted in Azure Cloud and on-premises VMWare Windows server. You can even set up database replication between different Cloud vendors as long as the Cloud infrastructure supports it. The single most important prerequisite for implementing database replication, wether it is between Cloud vendors or Cloud to on-premises, is the underlying Operating System (OS). Ideally you’d have already standardised your on-premises operating environment to be Cloud ready. For example, retaining large scale AIX or Solaris servers in your data centre, rather than switching to Windows or Linux based Cloud compatible OS, does nothing to inspire a romantic Cloud story.
3. Orchestration Tool
The last area I’d like to explore is how to minimise RTO while recovering application to your on-premises data centre or to another Cloud vendor during major disaster event. If you are well versed in the DevOps world and being a good practitioner then you are already standing on good foundation. The most common problem found during recovery is the complexity and human intervention required to instantiate the targeted application software and hardware. Keeping with the true CI/CD spirit the proliferation use of orchestration tool to deploy immutable infrastructure and application is the very heart and soul of DevOps. By adopting the same principle you’d be able to recover the entire application stack via orchestration tool like Jenkins to another Cloud or on-premises Cloud like environment with minimal effort and time. No more human fat finger syndrome and slack think time during recovery. Consider using open source and Cloud vendor agnostic tool like Terraform (as opposed to AWS CloudFormation) can greatly enhance portability and reusability for recovery. Armed with the suitable containerisation technology (e.g. Kubernetes) that is harmonised in your IT landscape, you’d further enhance deployment flexibility and manageability. Running DR at an alternate site becomes a breeze.
In closing, I’d like to remind you that just because your application is deployed to the Cloud (i.e. someone else infrastructure) you are not exonerated from neglecting the basic Disaster Recovery design principles and making ill-informed decision. Certainly it’s my opinion that the buck will stop with you when the application is blown to smithereens in the Cloud. This is the last article of the Disaster Recovery series and hopefully I have imparted a little bit of the knowledge, practical examples and stories to you that you can tackle DR from a whole new light without fear and prejudice. I’m looking forward to sharing with you some more Cloud stories in a not too distant future. Stay tuned.