Is Disaster Recovery Really Worth The Trouble (Part 3)

Is Disaster Recovery Really Worth The Trouble

(Part 3 of 4 part series)

Guest Post by Tommy Tang – Cloud Evangelist

In previous articles (part 1 and part 2) I've emphasised that the Disaster Recovery (DR) design principle is simply about eliminating the data centre as a single point of failure, and providing service and application resilience that is fit for purpose. An over-engineered, gold-plated architecture does not always fit the bill, and conversely a low-tech, simple and cost-effective solution isn't necessarily sub-standard. There are three common DR patterns that you are likely to find in your organisation, known as "Active-Active", "Active-Passive" and "Active-Cold". As a DR solution architect you have been tasked with implementing the most cost-effective and satisfactory DR solution for your stakeholders. You might wonder where to begin, what the pros and cons of each DR pattern are, and what the gotchas might be. Let me tell you there is no perfect solution or "one-size-fits-all" silver bullet. But don't despair: I will be sharing some of the key design considerations and relevant technology that are instrumental to a successful DR implementation.

Network and Distance Considerations

Imagine two data centres that are geographically dispersed: the underlying network infrastructure (e.g. DWDM or MPLS) is the bloodline that interconnects every service, such as HTTP servers, databases, storage, Active Directory and backup. So, without doubt, network performance and capability rate high on my checklist. How do we measure and attain good network performance? First of all you need to understand the two key measurements, network latency and bandwidth, which I will briefly explain below.

Network latency is defined as the time it takes to transfer a data packet from point A to point B, expressed in milliseconds (ms). In some cases latency also includes the round trip with acknowledgement (ACK). Network bandwidth is the maximum data transfer rate between A and B (aka network throughput), expressed in megabits per second (Mbps). Both of these metrics are governed by the laws of physics (i.e. the speed of light), so the distance separating the two data centres plays a pivotal role in determining network performance and ultimately the effectiveness of the DR implementation.

Having data centres located in Sydney and Melbourne sounds like a good risk mitigation strategy until you are confronted with the "Zero RPO" dilemma. How could you keep data in sync between two data centres stretched over 800 km, leveraging the existing SAN storage-based replication technology, without causing noticeable degradation to storage performance? How about the inconsistent user experience felt by users who are farther away from the data centre? Remember the laws of physics? Unless you own a telecommunications company or have unlimited funds, implementing synchronous data replication over long distances, regardless of whether it is host-based or storage-based replication technology, will surely cost a large sum of money, not to mention the adverse I/O performance impact.
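
To put the laws of physics into numbers, here is a rough back-of-the-envelope sketch in Python (one of the scripting options mentioned later in this article) of the theoretical latency floor for a synchronous write and its acknowledgement. The ~200,000 km/s figure for light in optical fibre is an assumption, and real-world latency will always be higher once switching, serialisation and protocol overhead are added.

```python
# Rough physics floor for synchronous replication latency.
# Assumes light travels at roughly 200,000 km/s in optical fibre (about 2/3 of c)
# and ignores switching, serialisation and protocol overhead.

def min_round_trip_ms(distance_km: float, fibre_speed_km_s: float = 200_000) -> float:
    """Theoretical minimum time for a write plus its acknowledgement, in ms."""
    return 2 * (distance_km / fibre_speed_km_s) * 1000

print(f"Sydney-Melbourne (~800 km): >= {min_round_trip_ms(800):.1f} ms per synchronous write")
print(f"Metro pair (~15 km):        >= {min_round_trip_ms(15):.2f} ms per synchronous write")
# Roughly 8 ms versus 0.15 ms, before any equipment or protocol overhead is counted.
```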

For those brave souls who are game enough to implement a dual-site Active-Active extended Oracle RAC cluster, the maximum recommended distance between the two sites is 100 km. However, after taking into consideration the super-low network latency requirement and the relatively high cost, it's more palatable to implement an extended Oracle RAC cluster in data centres that are 10-15 km apart. You may find similar distance constraints exist for other high availability DR technologies. The Active-Active pattern is especially sensitive to network latency because of the constant chit-chat between services at both sites. If the distance between the two data centres becomes the major impediment to implementing Active-Active DR or synchronous data replication, then you should diligently pursue alternative solutions. It's quite possible that Active-Passive or a non-zero RPO is an acceptable architecture, so don't be afraid to explore all options with your stakeholders.

Mix and Match Pattern

I have come across application systems architected with a flurry of mix-and-match DR design flair that left me slightly bemused. Let's examine a simple example. A "Category A" service (i.e. highly critical, customer facing) is composed of an Active-Active DR pattern for the web server (pretty standard), an Active-Passive pattern for the Oracle database (also stock standard), and an Active-Cold pattern for the Windows application server. So you may ask, what is the problem if the RTO is being met?

As you may recall, each DR pattern comes with a predefined RTO capability and the prescribed technology that underpins it. Combining different DR design patterns into a single architecture will undoubtedly dilute the desired DR capability. In this example the Active-Cold pattern is the lowest common denominator as far as capability is concerned, so it will inadvertently dictate the overall DR capability. The issue is: why would you invest in a relatively high-cost and complex Active-Active pattern when the end result is comparable to the lowly Active-Cold design? The return on investment is greatly diminished by including a lower-calibre pattern such as Active-Cold in the mix.

Another point you should consider is whether the mix-and-match design can really stand up in a real DR situation and meet the expected RTO. I have heard the argument that the chosen design works perfectly well in an isolated application DR test. But what about a real DR situation, when you are competing for human resources (e.g. sysadmins, DBAs, network engineers) and system resources like IOPS, CPU, memory and network? It's my belief that all DR design patterns should be regularly tested in a simulated DR scenario involving many applications, in order to determine the true DR capability and effectiveness. You may find the mix-and-match DR architecture does not work as well as expected.

Finally, the technology that underpins each DR pattern could change and evolve over time. Software vendors often change functionality and capability in future releases, so a DR pattern must be engineered to adapt to change. As a result, there is inherent risk in mixing different DR patterns, which will certainly increase the dependency and complexity of maintaining the expected DR capability in a fast-changing technology landscape.

A mix-and-match DR pattern may sound like a good practical solution, and in many cases it is driven by cost optimisation. However, after considering the associated risks and pitfalls, I'd recommend choosing the pattern that best matches the corresponding service criticality. Although it's not a hard and fast rule, I do find the service-to-DR-pattern mapping guidelines below simple to understand and follow (a small code sketch of this mapping follows the list). You may also wish to come up with a different set of guidelines that are more attuned to your own IT landscape and requirements.

  1. Category A (Highly Critical) – Active-Active (preferred) or Active-Passive
  2. Category B (Critical) – Active-Passive (preferred) or Active-Cold
  3. Category C (Important) – Active-Passive or Active-Cold
  4. Category D (Insignificant) – Active-Cold
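
For those who like to codify such guidelines, a minimal Python sketch of the mapping above might look like the following. The category labels and pattern names mirror the list; everything else is purely illustrative.

```python
# Illustrative mapping of service criticality to candidate DR patterns,
# mirroring the guidelines above. List order expresses preference.
DR_PATTERN_GUIDELINES = {
    "A": ["Active-Active", "Active-Passive"],   # Highly Critical
    "B": ["Active-Passive", "Active-Cold"],     # Critical
    "C": ["Active-Passive", "Active-Cold"],     # Important
    "D": ["Active-Cold"],                       # Insignificant
}

def recommend_pattern(category: str) -> str:
    """Return the preferred DR pattern for a given service category."""
    try:
        return DR_PATTERN_GUIDELINES[category.upper()][0]
    except KeyError:
        raise ValueError(f"Unknown service category: {category!r}")

print(recommend_pattern("A"))   # Active-Active
print(recommend_pattern("c"))   # Active-Passive
```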

Automation

Last but not least, I'd like to bring automation into the DR discussion. In the current era of Cloud euphoria, automation is the very DNA that defines its existence and success. Many orchestration and automation tools are readily available for building compute infrastructure, programming APIs and configuring PaaS services, just to list a few. The same set of tools can also be applied to DR implementation with great benefit.

In my mind there is no doubt that Active-Active is the best architecture pattern; however, it does come with a hefty implementation price tag and design constraints. For example, some applications do not support a distributed processing model (i.e. XA transactions) so they can't run in a dual-site Active-Active environment. Even for the almighty Active-Active pattern, automation can further improve RTO when applied appropriately. For instance, the client and application workload redistribution via a Global DNS Service or Global Traffic Manager (GTM) needed for DR can be automated via a pre-configured smart policy. Following the same idea, database failover can also be automated based on well-tested, configurable rules. This is where automation can simplify and vastly improve the quality of DR execution.
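
As a hedged illustration of what such a pre-configured policy might look like in code, here is a minimal Python sketch of rule-based failover. The health-check URL, the failure threshold and the commented-out gtm_client/db_client calls are hypothetical placeholders, not a real GTM or database API.

```python
# Minimal, illustrative policy-driven failover sketch. The probe URL, threshold
# and the gtm_client / db_client modules are hypothetical placeholders; a real
# implementation would call your GTM/DNS provider's API and your DB failover tooling.
import urllib.request

FAILURE_THRESHOLD = 3                                       # consecutive failed probes before acting
PRIMARY_PROBE_URL = "https://primary.example.com/health"    # placeholder endpoint

def site_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Probe the primary site's health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def evaluate_failover(consecutive_failures: int) -> str:
    """Apply the pre-configured rule: fail over only after repeated failures."""
    if consecutive_failures >= FAILURE_THRESHOLD:
        # gtm_client.shift_traffic(to_site="secondary")      # hypothetical GTM API call
        # db_client.initiate_failover(target="standby")      # hypothetical DB failover call
        return "FAILOVER"
    return "HOLD"
```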

The same design principle applies to the Active-Passive and Active-Cold DR patterns as well. Automation is the secret sauce of a quality DR implementation, so consider incorporating automation into all service components where possible. But here is the reality check: implementing automation is not trivial, and it is especially difficult for a service component that is not well documented or designed, or that lacks suitable automation tools. Furthermore, it is not advisable to automate a DR process if there is no suitable production-like environment (e.g. cross-site infrastructure) in which to conduct quality assurance testing. The implementation work itself can be extremely frustrating because you'll need to delicately negotiate and cooperate with different departments and third-party vendors. Having said that, I believe the benefits far outweigh the pain in most cases. I know of one case where automation reduced DR failover time from 4 hours down to 30 minutes. No pain no gain, right?

For the DevOps-savvy techies there are many orchestration tools in the marketplace that you can pick from to develop the automation framework of your choice: Chef, Puppet and Jenkins for orchestration, and Python, PowerShell and C Shell for scripting, just to name a few. If you don't want to build your own automation framework then you might want to consider vendor software like Selenium, Ansible Tower or Blue Prism.

In conclusion, a successful DR implementation should be planned with a detailed impact assessment of network latency between data centres, careful consideration of the most appropriate DR patterns and relevant technology for the targeted service or application, and policy- or rule-based automation to replace manual tasks where feasible. In the next article I will be exploring the various DR scenarios presented by Cloud deployment.

This article is a guest post by Tommy Tang (https://www.linkedin.com/in/tangtommy/). Tommy is a well-rounded and knowledgeable Cloud Evangelist with over 25 years of IT experience covering many industries, including telco, finance, banking and government agencies in Australia. He is currently focusing on the Cloud phenomenon and ways to best advise customers on their often confusing and thorny Cloud journey.

Is Disaster Recovery Really Worth The Trouble (Part 2)

Is Disaster Recovery Really Worth The Trouble

(Part 2 of 4 part series)

Guest Post by Tommy Tang – Cloud Evangelist

In the previous article I mentioned that architecture is the foundation, the bedrock, for implementing Disaster Recovery (DR), and that it must be part of the broader discussion on system resilience and availability, otherwise funding will be hard to come by. You may ask what the key design criteria for DR are. I believe, first and foremost, the design must be 'fit for purpose'. In other words, you need to understand what the customer wants in terms of requirements, objectives and expected outcomes. The following technical terms are commonly used to measure DR capability, and I will provide a brief explanation of each metric.

Recovery Time Objective (RTO)

  • It is the targeted duration within which a service or system needs to be restored during Disaster Recovery. Typically RTO is measured in hours or days, and it's no surprise to find human 'think' time often exacerbates the recovery time. RTO should be tightly aligned with the Business Continuity requirement (i.e. the Maximum Acceptable Outage, MAO) given system recovery is only one aspect of the business service restoration process.

Recovery Point Objective (RPO)

  • It is the maximum targeted period in which data or transactions might be lost without recovery. You can view RPO as the maximum data loss that you can afford, so 'Zero RPO' means no data loss is permitted. Not even a second. The actual amount of data loss depends very much on the affected system. For example, an online stock trading system that suffers a 5-minute data loss could lose hundreds of transactions worth millions of dollars. Conversely, an in-house Human Resources (HR) system is unlikely to suffer any data loss over the same 5-minute interval, given changes to the HR system are scarce.

Mean Time To Recovery (MTTR)

  • It is the average time taken for a device, component or service to recover from failure after the failure is detected. Unlike RTO and RPO, MTTR includes the element of monitoring and detection, and it is not limited to DR events but covers any failure scenario. When you are designing the appropriate DR solution for your customer, MTTR must be rigorously scrutinised for each software and hardware component in order to meet the targeted RTO.

Let's move over to the business side of the DR coin and see how these metrics are applied. I think it is a safe bet to assume each business service has already been assigned a predetermined service criticality classification, and each classification includes an RTO and RPO requirement. For illustration purposes, let's say a "Category A" service is a highly critical customer portal, so it might have a 2-hour RTO and a zero RPO requirement, while a "Category C" internal timesheet service could have its RTO set to 12 hours with a 1-hour RPO.
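
A tiny worked example, using the illustrative targets above, shows how a DR test result could be checked against its category. The numbers and the helper function below are mine, not a formal standard.

```python
# Illustrative check of a DR test result against the category targets above
# (Category A: 2-hour RTO, zero RPO; Category C: 12-hour RTO, 1-hour RPO).
TARGETS = {
    "A": {"rto_minutes": 120, "rpo_minutes": 0},
    "C": {"rto_minutes": 720, "rpo_minutes": 60},
}

def meets_targets(category: str, measured_rto_min: float, measured_rpo_min: float) -> bool:
    """True when both the measured RTO and RPO are within the category targets."""
    t = TARGETS[category]
    return measured_rto_min <= t["rto_minutes"] and measured_rpo_min <= t["rpo_minutes"]

# A Category A service restored in 95 minutes with no data loss passes;
# the same result with 5 minutes of lost transactions fails the zero-RPO target.
print(meets_targets("A", measured_rto_min=95, measured_rpo_min=0))   # True
print(meets_targets("A", measured_rto_min=95, measured_rpo_min=5))   # False
```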

In a real DR event (or DR exercise) the classification is used to determine the order in which services are restored. It is neither practical nor sensible to have all services weighted equally, or to have too many services rated critical, given the limited resources and immense pressure exerted during DR. The right balance must be sought and agreed upon by all business owners.

Now you have a basic understanding of the DR requirements and are keen to get started. But hold off on launching the Microsoft Visio app and drawing those beautiful boxes just yet. I'd like to share the one simple resilience design principle which I have been using, and that is to eliminate the "single point of failure". By virtue of having two working and functionally identical components in the system you improve resilience by 100%: the 2x system is now capable of handling a single component failure without loss of service. The "single point of failure" principle applies to the physical data centre as well, and therefore it is very much relevant to DR design.

As an IT architect you have a number of tried and proven solutions (aka architecture patterns) in your toolkit. The DR patterns described below are commonly found in most organisations.

Active-Active

The Active-Active DR pattern provisions two or more active, working software components spread across two data centres. For example, an N-tier system architecture may consist of 2x web servers, 2x application servers and 2x database servers. Client connections and application workload are distributed between the two sites, either equally weighted or otherwise, via a Global DNS Service or Global Traffic Manager (GTM). The primary objective of the Active-Active DR design is to eliminate the data centre as a single point of failure. Under this design there is no need to initiate failover during Disaster Recovery because an identical system is already running at the alternate site and sharing the application workload (i.e. effectively zero RTO).
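
To illustrate why Active-Active yields an effectively zero RTO, here is a toy Python sketch of GTM-style weighted distribution across two sites. Real Global DNS and GTM products do this with health checks and policies; the site names and weights below are purely illustrative.

```python
# Toy illustration of GTM-style weighted workload distribution across two
# active sites. Site names and weights are arbitrary examples.
import random

SITE_WEIGHTS = {"dc-sydney": 50, "dc-melbourne": 50}   # equal split between active sites

def pick_site(weights: dict) -> str:
    """Choose a site for an incoming request according to its weight."""
    sites = list(weights)
    return random.choices(sites, weights=list(weights.values()), k=1)[0]

# Both sites already serve traffic, so losing one simply removes it from the
# weight table; there is no failover step to perform, hence the near-zero RTO.
SITE_WEIGHTS.pop("dc-sydney")
print(pick_site(SITE_WEIGHTS))   # every request now lands on dc-melbourne
```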

The Active-Active pattern is best suited to critical systems or services (i.e. Category A) because of the high cost and complexity associated with implementing a distributed system. Not every application is capable of running in a distributed environment across two sites; the reason could be attributed to software limitations such as global sequencing or two-phase commit. It's highly desirable to formulate a prescriptive Active-Active design pattern to help mitigate the inherent cost and risks, and to align with existing technology investment and the future roadmap.

The biggest challenge is often encountered at the database tier. Are you able to run the database simultaneously across two sites? If so, is the data being replicated synchronously or asynchronously? Designing a fully distributed database solution with zero data loss (i.e. Zero RPO) is not trivial. Obviously you can choose to implement a partial Active-Active solution where every component except the database is active across both sites. Alternatively, you may want to relax the RPO requirement to allow a non-zero value so asynchronous data replication can be used (e.g. a 5-minute RPO).

From general observation I've found a critical system database is typically configured as a warm standby DB with Zero RPO, where the failover operation can be manually initiated or automated. The warm standby DB configuration is also known as the Active-Passive DR pattern, which is explored further in the next section.

Recently I heard a story about Disaster Recovery. A service owner proclaimed the targeted system was fully Active-Active across two sites during the DR exercise and therefore no failover would ever be required. Thirty minutes later the same service owner, with much reluctance, scrambled to contact the DBA team requesting an urgent Oracle DB failover to the DR site. A word of advice: many supposedly Active-Active implementations are only truly Active-Active above the database tier, so it pays to understand your system design. A one-page high-level system architecture diagram with network flows should suffice to summarise the DR capability without confusion.

Active-Passive

The Active-Passive DR pattern stipulates that one or more redundant software components are configured in warm standby mode at the alternate data centre. DR failover can be either manually initiated or automated for each component, in a predetermined order for the respective application stack. Client connections and application workload are directed to the active live data centre via a Global DNS Service or Global Traffic Manager (GTM). Remember, the key differentiator from the Active-Active DR pattern is that only one active site can accept and process workload while the passive site lies dormant.

The primary objective of the Active-Passive design is, the same as Active-Active, to eliminate the data centre as a single point of failure, albeit with a higher RTO. The time required to fail over will vary and depends on the underlying design and technology deployed for each software component. Component failover can typically take 5 to 30 minutes (or even longer) to complete. The aggregated component failover time plus human think time is therefore roughly equivalent to the RTO (e.g. a 4-hour RTO).
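
As a quick back-of-the-envelope illustration of that arithmetic, here is a small Python sketch; the component names and minutes below are invented for the example.

```python
# Back-of-the-envelope RTO estimate for an Active-Passive stack: the sum of
# sequential component failover times plus human "think" time.
# All figures below are illustrative only.
failover_minutes = {
    "storage replication promote": 15,
    "database failover":           30,
    "application server start":    20,
    "web tier / GTM redirect":     10,
}
human_think_time = 45   # decision making, approvals, coordination

estimated_rto = sum(failover_minutes.values()) + human_think_time
print(f"Estimated RTO: {estimated_rto} minutes")   # 120 minutes, i.e. about 2 hours
```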

The Active-Passive design is suitable for most systems because it is relatively simple and cost effective. The two key technology enablers are storage replication and application-native replication. Leveraging storage replication for DR is probably the most popular option because it is application agnostic. The storage replication technology itself is simple, mature and proven, and it is generally regarded as low risk. The data replicated between sites can be synchronous (i.e. Zero RPO) or asynchronous (i.e. non-zero RPO), and both options are equally valid depending on the RPO requirement.

Application-specific replication will typically utilise the TCP/IP network to keep data in sync between the two sites. It can also be synchronous or asynchronous depending on the technology and configuration. The underlying replication technology is vendor specific and proprietary, so you'll need to rely on the vendor's tools for monitoring, configuration and problem diagnosis. For example, if you implement a SQL Server Always On Availability Group for the warm standby DB, you'll have to learn how to manage and monitor Windows Server Failover Clustering (WSFC). Application-native replication is most often found at the database tier, such as SQL Server Always On or Oracle Data Guard. Every vendor publishes a recommended DR configuration, so it would be foolhardy not to follow it.

Active-Cold

Last of all is the Active-Cold DR pattern. This pattern is similar to Active-Passive except that the software component at the alternate site has not been instantiated. In some cases it may require a brand new virtual server to be provisioned before configuring and starting the application component; in others it may be necessary to manually mount the replicated filesystem and then start the application, or to run a backup restoration process to recover the software to the desired operating state.

The word 'Cold' implies much work is needed, whatever it takes, to bring the service online. In many cases it will take hours or even days to complete the recovery tasks, hence the RTO for an Active-Cold design is expected to be larger than for Active-Passive. However, just because it takes longer to recover doesn't mean it is a bad solution. For example, it is perfectly acceptable to take one or two days to recover an internal timesheet system without causing much outrage. Put simply, it is "horses for courses". Also, you can still achieve Zero RPO (i.e. no data loss) with an Active-Cold design by leveraging synchronous storage replication between sites. Not bad at all!

In this article I have covered the common DR-related metrics like RTO, RPO and MTTR. I have also shared with you the 'single point of failure' resilience design principle which has served me well over the years. I have summarised, perhaps at a tad more length than a summary deserves, the three common DR design patterns interlaced with practical examples: Active-Active, Active-Passive and Active-Cold. I realise I might have gone on a bit longer than expected in this article, so I'm saving some of the interesting thoughts and stories for the next article, which focuses on DR implementation.

This article is a guest post by Tommy Tang (https://www.linkedin.com/in/tangtommy/). Tommy is a well-rounded and knowledgeable Cloud Evangelist with over 25 years of IT experience covering many industries, including telco, finance, banking and government agencies in Australia. He is currently focusing on the Cloud phenomenon and ways to best advise customers on their often confusing and thorny Cloud journey.

Is Disaster Recovery Really Worth The Trouble (Part 1)

Is Disaster Recovery Really Worth The Trouble

(Part 1 of a 4 part series)

Guest Post by Tommy Tang – Cloud Evangelist 

Often when you talk to your IT colleagues or business owners about protecting their precious systems with adequate Disaster Recovery (DR) capability, you will get a typical response like 'I have no money for Disaster Recovery' or 'We don't need Disaster Recovery because our system is highly available'. Before you blow your fuse and serve them a comprehensive lecture on why Disaster Recovery is important, you should understand the rationale behind their thinking.

People normally associate the word 'disaster' with an insurance policy: a natural disaster such as flooding, a thunderstorm or an earthquake, or a man-made disaster like fire, loss of power or a terrorist attack. These events are 'meant' to happen so infrequently that the inertia of human behaviour tries to brush them off, particularly when you are asking for money to improve Disaster Recovery capability!

You may ask how to overcome such deep-rooted prejudice towards DR in your organisation. The first thing you must do is NOT talk about Disaster Recovery alone. DR should be one of the subjects covered by the wider discussion on system resilience and availability. Before your IT manager or business sponsor will cough up some hard-fought budget for your disposal, you'll need to articulate the benefit in clear, precise and easily understood layman's terms. Do not overplay technology benefits such as 'it's highly modularised and flexible to change', 'it's a loosely coupled micro-service design that is good for business growth', or 'it's well aligned to the enterprise's hybrid Cloud architecture roadmap'. Quite frankly, they don't give a toss about technology; they only care about operational impact and business return.

For the IT manager, it's your job to paint the rosy picture of how a well designed and implemented DR system can help meet the expected Recovery Time Objective (RTO), minimise human error brought on by a pressure-cooker DR exercise, and save the manager from humiliation amongst peers and superiors in the WAR room during a real DR event. As for the business sponsor, it's only natural not to spend money unless there is a material benefit or consequence. You'll need to apply shock tactics that will scare the 'G' out of them. For certain systems it's not difficult to get the message across. Take, for example, an Internet banking system that requires urgent funding to improve DR capability and resilience: the consequence of not having the banking system available to customers during business hours is severe material and reputational impact. The bad publicity generated in today's omnipresent digital media is both brutal and scathing, and will leave no place to hide.

So now you have done the hard sell and secured funding to work on the DR project: how would you go about delivering maximum value with limited resources? This could be the very golden ticket for you to ascend to a senior or executive position. Here is my simple three-phase approach, outlined below, and I'm sure there are many other ways to achieve a similar outcome.

Architecture

  • This is the foundation of a resilient and highly available design that can be applied to different systems, not just a gold-plated, one-size-fits-all solution. The design must be prescriptive yet pragmatic, with well-defined costs and benefits.

Implementation

  • It has to be agile, with a risk mitigation strategy incorporated in all delivery phases. I believe automation is the key enabler of quality assurance, operational efficiency and manageability.

On-Premises and Cloud

  • The proliferation and adoption of Cloud has certainly changed the DR game. Many of the conversations taking place today are about "To Cloud" or "Not To Cloud", and if it is Cloud, then how? Disaster Recovery, along with system resilience, must be included in such critical decisions, and it ought to be adaptive to whatever path the business has chosen.

Understanding what DR really means in your organisation is utterly important, and it can often lead to a change in prejudicial thinking when the benefits and consequences are well articulated. In the coming weeks I'm going to share my insights on each phase of the aforementioned approach.

This article is a guest post by Tommy Tang (https://www.linkedin.com/in/tangtommy/). Tommy is a well-rounded and knowledgeable Cloud Evangelist with over 25 years of IT experience covering many industries, including telco, finance, banking and government agencies in Australia. He is currently focusing on the Cloud phenomenon and ways to best advise customers on their often confusing and thorny Cloud journey.

Multi-Cloud Deployment – Are you Ready?

Are you ready for Multi-Cloud?

Guest Post by Tommy Tang – Cloud Evangelist 

Lately I have heard colleagues earnestly discussing (or perhaps debating) the prospect of adopting a Multi-Cloud strategy, and how it could effectively mitigate risks and protect the business, as if it were a prized trophy everyone should be striving for. For the uninitiated, a Multi-Cloud strategy in a nutshell is a set of architecture principles that facilitate and promote the absolute freedom to select any cloud vendor for any desired service at a time of your choosing, with no material impact when moving from one cloud service provider to another.

Before you get too excited about Multi-Cloud, I'd like to mention the much publicised US Department of Defense Joint Enterprise Defense Infrastructure cloud contract (aka JEDI). Amongst the usual objectives and strategies stated in the JEDI strategy document, the most contentious issue revolves around the explicit requirement to choose a single cloud service provider to help modernise and transform the department's IT systems for the next 10 years. Not Multi-Cloud. The reaction to the single-cloud approach has certainly brought on some fierce debate in the IT world, with both IBM and Oracle trying to register their displeasure through legal avenues. Both companies have since been dismissed and are now out of the running for the JEDI contract.

While you are pondering why the Department of Defense would seemingly go against the conventional wisdom of Multi-Cloud, let's briefly examine some of the advantages and disadvantages of the Multi-Cloud strategy.

Advantages

  • Mitigate both service and commercial risks by procuring from multiple cloud vendors (i.e. not putting all eggs in one basket)
  • Select the best-of-breed service from a wide range of cloud providers (e.g. AWS for DevOps, Azure for Business Intelligence and Google for Artificial Intelligence)
  • Strive for favourable commercial outcome by encouraging competition between different players
  • Leverage fast emerging new technologies and services offered by the incumbents or new cloud entrants
  • Promote innovation and continuous improvement without artificial cloud boundaries

Disadvantages

  • Multi-Cloud architecture design can be more complex (i.e. integration, replication and backup solutions need to work across different cloud vendors)
  • Unable to take advantage of vendor-specific features or services (e.g. Lambda is a unique AWS service)
  • Difficult to track and consolidate finance with different contracts and rates
  • No single pane-of-glass view for monitoring and managing cloud services
  • Need extensive and continuous training for different and never-ending cloud technologies

After learning the good and the bad of pursuing the Multi-Cloud dream, do you think the JEDI approach is wrong? Well, the answer, in my humble opinion, is: it depends. For example, if you're managing an online holiday booking service then you're probably already using cloud services, and thus it's unlikely you'd face any impediments to deploying your Java applications to a different cloud vendor. On the other hand, if you're running a traditional supermarket and warehouse business using predominantly on-premises IT systems, then it is much more difficult to move them to the cloud, let alone run them on different cloud vendors, without a massive overhaul.

If you're still keen to explore the Multi-Cloud strategy then consider the following guidelines. They are not prerequisites, but they certainly help achieve the ultimate cloud-agnostic goal.

Modernise IT Infrastructure

Modernise the on-premises IT systems to align with common cloud infrastructure so they are Cloud Ready. This is the most important step regardless of whether you are aiming for single-cloud or Multi-Cloud deployment. During the modernisation phase you'll soon find out that certain IT systems are difficult (and insanely expensive) to move to the cloud. This is the reality check you ought to have. It is perfectly OK to retain some on-premises systems because, quite frankly, not every system is suitable for the cloud. For instance, a large and complex application that requires specialised hardware, or a highly latency-sensitive application, is probably not for the cloud. Quarantine your cloud-disenchanted applications quickly while consolidating cloud-friendly applications onto an Intel-based virtualised platform (e.g. VMware or Hyper-V). A modernised on-premises virtualised platform provides the cloud foundation with the added benefits of running virtual infrastructure. It is a good strategy for either Multi-Cloud or hybrid cloud, and you should take full advantage of the existing data centre while you embark on the 3-5 year cloud journey.

Modular Application Design

Application development cost typically outweighs infrastructure cost by a factor of 3x-5x. Given AppDev is quite expensive, it is absolutely paramount to get it right from the start. The key design objective is to create an application that is highly modularised, loosely coupled and platform agnostic, so the application can run on different cloud services without incurring massive redevelopment cost. The trendy term everyone has been using is microservices. A microservice is not bound to a specific framework or programming language; any mainstream language like Java, C# or Python is suitable depending on one's own preference. Apart from the programming language I'd also like to touch on application integration. I understand many people prefer developing their own APIs because it is highly customisable and flexible. However, in today's cloud era it requires a lot of effort and resources to develop and maintain APIs for different cloud vendors as well as on-premises IT systems. Unless there is a compelling reason, I'd consider using a specialised API vendor like MuleSoft to speed up and simplify development. Last but not least, I'd also embrace container technology for managing application deployment (e.g. Kubernetes). Containerised applications can significantly enhance portability when moving between clouds.

Data Mobility

Data mobility is about your prerogative over your own data. When you are considering a Multi-Cloud strategy, one of the burning issues is how to maintain data mobility: data stored in the cloud should be able to be extracted and moved to on-premises IT systems or to another cloud service provider as desired, without restriction. Any impediment to data mobility would seriously diminish the benefits of using cloud in the first place. In the new digital world data should be treated as capital with intrinsic monetary value, and therefore it is unacceptable for data movement to be restricted. So how do you overcome data mobility challenges? Here are some basic principles you should consider. The first is data replication. For instance, is it acceptable to the business if the application takes 5 days to move from AWS to Azure? How about 4 weeks? The technology that underpins the Multi-Cloud strategy must meet the business needs, otherwise it becomes totally irrelevant. Data replication between different cloud platforms can be implemented to ensure data is always available in multiple destinations of your choice; a native database replication tool is a relatively straightforward solution for maintaining two independent data sources (e.g. SQL Server Always On, Oracle Data Guard). The second principle is to leverage a specialised cloud storage provider. Imagine you could deploy applications to many different cloud vendors while retaining data in a single, constantly accessible location; the boundaries of Multi-Cloud would simply dissipate. For example, NetApp Data ONTAP is one of the leading contenders in the cloud storage area. The third principle is the humble, long-standing offsite backup practice. Maintaining a secondary data backup at an alternate site is an absolute requirement for both cloud and non-cloud systems, and it is a very cost-effective way of retaining full data control and avoiding vendor lock-in.
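
To make the "5 days or 4 weeks" question concrete, here is a rough Python estimate of bulk transfer time given a dataset size and a sustained link speed. It assumes the link is the bottleneck and ignores egress throttling, retries and cut-over work, so treat the output as a lower bound.

```python
# Rough estimate of how long a bulk data move between clouds might take,
# given a dataset size and a sustained transfer rate. Assumes the link is
# the bottleneck and ignores egress throttling, retries and cut-over steps.
def transfer_days(dataset_tb: float, sustained_gbps: float) -> float:
    bits = dataset_tb * 8 * 10**12            # decimal terabytes to bits
    seconds = bits / (sustained_gbps * 10**9)
    return seconds / 86_400

# e.g. 100 TB over a sustained 1 Gbps link:
print(f"{transfer_days(100, 1):.1f} days")    # roughly 9 days at best
```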

Multi-Cloud is a prudent, agile and commercially sound strategy with many benefits, but I believe it is not suitable for everyone. Blindly pursuing a Multi-Cloud strategy without a compelling reason is fraught with danger. The decision by the US Department of Defense to partner with only one cloud vendor, which is yet to be determined at the time of writing this article, is one of the high-profile exceptions. Time will tell.

Check out this link where we dive deeper into the differences in IaaS resilience on AWS and Azure.

This article is a guest post by Tommy Tang (https://www.linkedin.com/in/tangtommy/). Tommy is a well-rounded and knowledgeable Cloud Evangelist with over 25 years of IT experience covering many industries, including telco, finance, banking and government agencies in Australia. He is currently focusing on the Cloud phenomenon and ways to best advise customers on their often confusing and thorny Cloud journey.

How to upgrade to SCCM 1810

Step by step how to upgrade SCCM to version 1810

What’s new in SCCM 1810?

Here is a quick run down of the exciting new features that Microsoft has added to SCCM for release 1810. You can see more information around this update on the Microsoft blog site.

Specify the drive for offline OS image servicing

Now you can specify the drive that Configuration Manager uses when adding software updates to OS images and OS upgrade packages.

Task sequence support for boundary groups

When a device runs a task sequence and needs to acquire content, it now uses boundary group behaviors similar to the Configuration Manager client.

Improvements to driver maintenance

Driver packages now have additional metadata fields for Manufacturer and Model which can be used to tag driver packages for general housekeeping.

Phased deployment of software updates

You can now create phased deployments for software updates. Phased deployments allow you to orchestrate a coordinated, sequenced rollout of software based on customizable criteria and groups.

Management insights dashboard

The Management Insights node now includes a graphical dashboard. This dashboard displays an overview of the rule states, which makes it easier for you to show your progress.

Management insights rule for peer cache source client version

The Management Insights node has a new rule to identify clients that serve as a peer cache source but haven’t upgraded from a pre-1806 client version.

Improvement to lifecycle dashboard

The product lifecycle dashboard now includes information for System Center 2012 Configuration Manager and later.

Windows Autopilot for existing devices task sequence template

This new native Configuration Manager task sequence allows you to reimage and re-provision an existing Windows 7 device into an Azure AD-joined, co-managed Windows 10 device using Windows Autopilot user-driven mode.

Improvements to co-management dashboard

The co-management dashboard is enhanced with more detailed information about enrollment status.

Required app compliance policy for co-managed devices

You can now define compliance policy rules in Configuration Manager for required applications. This app assessment is part of the overall compliance state sent to Microsoft Intune for co-managed devices.

SMS Provider API

The SMS Provider now provides read-only API interoperability access to WMI over HTTPS.

Site system on Windows cluster node

The Configuration Manager setup process no longer blocks installation of the site server role on a computer with the Windows role for Failover Clustering. With this change, you can create a highly available site with fewer servers by using SQL Always On and a site server in passive mode.

Configuration Manager administrator authentication

You can now specify the minimum authentication level for administrators to access Configuration Manager sites.

Improvements to CMPivot

CMPivot now allows you to save your favorite queries and create collections from the query summary tab. Over 100 new queryable entities added, including for extended hardware inventory properties. Additional improvements to performance.

New client notification action to wake up device

You can now wake up clients from the Configuration Manager console, even if the client isn’t on the same subnet as the site server.

New boundary group options

Boundary groups now include two new settings to give you more control over content distribution in your environment.

Improvements to collection evaluation

There are two changes to collection evaluation scheduling behavior that can improve site performance.

Approve application requests via email

You can now configure email notifications for application approval requests.

Repair applications

You can now specify a repair command line for Windows Installer and Script Installer deployment types.

Convert applications to MSIX

Now you can convert your existing Windows Installer (.msi) applications to the MSIX format.

Improvement to data warehouse

 You can now synchronize more tables from the site database to the data warehouse.

Support Center

Use Support Center for client troubleshooting, real-time log viewing, or capturing the state of a Configuration Manager client computer for later analysis. Find the Support Center installer on the site server in the cd.latest\SMSSETUP\Tools\SupportCenter folder.

Support for Windows Server 2019

Configuration Manager now supports Windows Server 2019 and Windows Server, version 1809, as site systems.

SCCM 1810 prerequisites

As with any update, you should make sure that you have all the prerequisites to install this update to Configuration Manager, prior to starting the upgrade process.

These prerequisites to SCCM 1810 are;

  • Every site server within your existing Configuration Manager environment should be at the same version
  • To install the update, the minimum SCCM version you can currently be on is 1710; versions 1802 and 1806 are also accepted
  • SQL 2017 CU2 Standard and Enterprise
  • SQL 2016 SP2 Standard and Enterprise
  • SQL 2016 SP1 Standard and Enterprise
  • SQL 2016 Standard and Enterprise
  • SQL 2014 SP3 Standard and Enterprise
  • SQL 2014 SP2 Standard and Enterprise
  • SQL 2014 SP1 Standard and Enterprise
  • SQL 2012 SP4 Standard and Enterprise
  • SQL 2012 SP3 Standard and Enterprise
  • Windows Server x64
  • Windows Server 2012 R2 x64
  • Windows Server 2016
  • Windows Server 2019 version 1809

How to upgrade SCCM to release 1810.

Step 1 – Administration Tab

Open your System Center Configuration Manager console and navigate to Administration.


Step 2 – Updates and Servicing

Now click on Updates and Servicing, and you should see the Configuration Manager 1810 update listed.


Step 3 – Check SCCM 1810 Prerequisites

Next, right click on the Configuration Manager 1810 update and choose Run Prerequisite Check


Step 4 – Checking Prerequisites

Now the SCCM 1810 prerequisite check will run and verify that the Configuration Manager 1810 update is compatible with your current system. This will take some time, so perhaps go make a coffee while you wait.


Step 5 – ConfigMgrPrereq.log

You can check the status of the prerequisite check by looking at ConfigMgrPrereq.log, located in the root of the C: drive on your Configuration Manager server.

As you can see in my logs, the prerequisite check has passed.
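
If you prefer not to scroll through the log by eye, a small script can surface any failed or warning entries. This is just a convenience sketch in Python, and the "Failed"/"Warning" keywords are an assumption about how the checks are worded in the log, so adjust them to match what you see.

```python
# Scan ConfigMgrPrereq.log for entries that look like failures or warnings.
# The keyword matching is a simple assumption about the log wording.
from pathlib import Path

log = Path(r"C:\ConfigMgrPrereq.log")
hits = [line for line in log.read_text(errors="ignore").splitlines()
        if "Failed" in line or "Warning" in line]

for line in hits:
    print(line)
if not hits:
    print("No failed or warning prerequisite checks found.")
```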


Step 6 – Install Update Pack

Now the fun stuff begins. We are ready to start the upgrade process for SCCM.

Right click the Configuration Manager 1810 update and choose Install Update Pack.


Step 7 – Start the installation process for SCCM 1810

On the Configuration Manager Updates Wizard you can choose to ignore any prerequisite check warnings and install the update regardless of missing requirements, if you so wish.

As with any production environment, it's always best practice never to ignore warnings, but we had none in the previous check, so we do not need to tick this checkbox.

When you are ready to start the update process, click Next.


Step 8 – Features included in update pack

The next page of the wizard lists various features you can install as part of this update.

Check any of the features you need, and when ready click Next.


Step 9 – Review and accept the terms

You can review the license terms that Microsoft has for this update. Accept these by ticking the checkbox and clicking Next.


Step 10 – Summary

Review this page to confirm that all the settings and features you have chosen previously are correct, and again when ready click Next.


Step 11 – Installation Completed

Finally, the last screen of the Configuration Manager 1810 upgrade wizard is the completed screen. Review the summary and then click on Close.

SCCM will upgrade in the background. This can take some time, depending on your infrastructure setup.


Step 12 – Check Installation Status

To check the status of your SCCM upgrade, you need to go to Overview, then Updates and Servicing Status. 


Step 13 – Show Upgrade Status

Select the Configuration Manager 1810 update, then right-click and choose Show Status.


Step 14 – Update Pack Installation Status

Highlight Installation and you will see the status of all the components that are upgrading.

Keep clicking Refresh until you see all the tasks with a green tick. Be mindful, this does take some time.

Click on Close when they are all green.


Step 15 – Update the Configuration Manager Console

Once all the ticks have gone green, click Refresh within the SCCM console and you should be prompted with the Console Update.

Click on OK to proceed.


Step 16 – Update the Configuration Manager Console

The SCCM console update will download the required files and update your Configuration Manager console to the latest version.


Step 17 – SCCM 1810 Upgrade Finished

Finally, SCCM has updated your Configuration Manager environment to release 1810.


How to Snapshot your VMs before patching with SCCM and SnaPatch

Now that you have upgraded SCCM to current branch 1810, here is a quick rundown of how to use SnaPatch with SCCM to quickly and easily snapshot your VMs prior to patching.