Risk Archives -

Departed M365 Users

by Mark | Jul 1, 2025 | Office 365, Onedrive, Risk

What Happens to Microsoft 365 Data After an Employee Leaves?

When someone leaves your organization, the first step IT usually takes is to disable their Microsoft 365 account. But have you ever stopped to ask:

“What happens to all their data — their files, emails, and chats — after that?”

The answer might surprise you.

If you’re not actively managing this, Microsoft will automatically delete that data — often in as little as 30 days.

This post explains exactly what gets deleted (and when), why this is a problem, and what you can do to protect that data — without paying for unnecessary licenses.

Microsoft’s Countdown to Data Deletion

Let’s start with a simple truth:

Disabling a user in Microsoft 365 doesn’t save their data forever.

Instead, Microsoft starts a ticking clock. Unless you take action, data begins disappearing — fast.

Here’s what typically happens:

Service	Default Retention	What’s Deleted	Can You Recover It?
OneDrive	30 days	All files and folders	Maybe, but not always
Exchange	30–60 days	Mailbox content	Sometimes
Teams Chat	Up to 93 days	All chat history and attachments	Usually not

So if an employee leaves on January 1st, by April their Teams messages, OneDrive files, and mailbox may be completely gone.

Why This Matters — And Who Should Care

Most people assume Microsoft keeps this data for legal or security reasons. But that’s not how it works.

Microsoft isn’t your backup provider. Its job is to deliver service, not long-term data retention.

This creates big risks for:

IT teams, who might need to retrieve a user’s files later
Legal and compliance officers, who must retain emails and chat records
HR and management, who need access to handover materials, customer comms, etc.

And unless you assign a license to the user’s account forever, that data is eventually lost.

A Real-World Scenario

Let’s say you’re offboarding an employee named Sarah. She’s been with the company for 5 years.

She has:

200 GB of OneDrive files
50,000 emails in Exchange
Years of chats with project teams in Microsoft Teams

You disable her Microsoft 365 account. Now what?

30 days later, her OneDrive starts purging
60 days later, her mailbox may be gone
By day 93, her Teams chat history is unrecoverable

Now legal asks for chat logs from a project she was on 6 months ago… and it’s too late.

Can’t I Just Use Microsoft Retention Policies?

Yes — but it’s not as easy as it sounds.

You’d need to:

Set up custom retention policies in Microsoft Purview
Create inactive mailboxes (which still require licenses)
Use PowerShell scripts to export OneDrive manually
Deal with Teams data that isn’t easily exportable

And even then, you’re not guaranteed to retain everything — especially chat data.

It’s complex, time-consuming, and risky.

The Simpler Option: Use Chipmunk

Chipmunk is a tool built specifically to solve this exact problem.

It watches for when you disable a user and automatically backs up their:

OneDrive files and folder structure
Exchange emails
Teams chats (including private and group chats)

All the data is stored in your own Azure Blob Storage, so:

You own the data
You don’t need to keep paying Microsoft licenses
You can access it anytime — for audits, legal cases, or handovers

No scripts. No licenses. No data loss.

How It Works — In Plain English

Here’s how Chipmunk fits into your offboarding process:

User is disabled in Microsoft Entra ID (formerly Azure AD)
Chipmunk detects it automatically
It downloads all key data — OneDrive, Email, and Teams
It uploads the data into organized folders in your Azure storage
It updates your central dashboard with status and logs
It emails you the results of the archive activity for each user.
You don’t have to remember to do anything. It just works in the background.

Bonus: Save on Microsoft Licenses

Did you know keeping a disabled user’s data often requires a paid Microsoft 365 license?

That could mean paying $20–$40/month per user just to retain inactive data.

With Chipmunk, you can archive it once — and delete the user safely.

For companies with hundreds of staff turnover each year, that’s tens of thousands in savings.

Compliant. Secure. Yours.

Chipmunk is built for:

Data compliance (GDPR, ISO, HIPAA-ready)
Cost reduction (free up licenses without losing data)
IT simplicity (no need to learn Microsoft Purview or eDiscovery)

And because all archived data is stored in your own Azure tenant, you stay in control at all times.

TL;DR

What Happens by Default	What Chipmunk Does
Microsoft deletes data after 30–93 days	Chipmunk backs it up automatically
You must set complex retention rules	No configuration needed
Teams chat is hard to retain	Chipmunk grabs it for you
Ongoing license may be required	Chipmunk lets you delete users safely
Risk of permanent data loss	Permanent backup in Azure

Ready to Never Lose Ex-Employee Data Again?

Don’t wait for day 93.

If you want peace of mind, predictable offboarding, and full control of your M365 user data — Chipmunk can help.

Microsoft Purview: A Complete Overview

by Mark | Feb 19, 2025 | Office 365, Risk, SharePoint

What is Microsoft Purview?

Microsoft Purview is a unified data governance, risk, and compliance solution designed to help organizations manage, protect, and gain insights into their data across on-premises, multi-cloud, and SaaS environments.

It enables businesses to discover, classify, and secure sensitive information while ensuring compliance with regulations such as GDPR, HIPAA, and ISO 27001. By integrating data cataloging, lifecycle management, access controls, and auditing capabilities, Purview provides a comprehensive approach to data protection and regulatory compliance.

With Microsoft Purview, organizations can monitor data usage, prevent data loss, and enforce policies to mitigate security risks. It integrates seamlessly with Microsoft 365, Azure, Power BI, and third-party platforms, offering a centralized way to govern structured and unstructured data. Whether for data discovery, risk management, or compliance reporting, Microsoft Purview empowers businesses to maintain control over their data while supporting innovation and operational efficiency.

Evolution of Microsoft Purview

Previously, Microsoft offered separate tools for compliance and governance, such as Azure Purview and Microsoft Compliance Center. In April 2022, these tools were rebranded and combined into Microsoft Purview, providing a unified experience for data security, classification, and lifecycle management (Microsoft’s official announcement).

Key Takeaways

Key Aspect	Summary
Unified Platform	Combines data governance, compliance, and risk management into a single solution.
Data Discovery & Protection	Uses AI-powered classification, labeling, and access control to safeguard sensitive data.
Regulatory Compliance	Helps organizations comply with GDPR, HIPAA, CCPA, and other industry regulations.
Multi-Cloud Support	Works across Microsoft 365, Azure, AWS, Google Cloud, and on-premises environments.
Risk & Threat Management	Detects insider threats, prevents data loss, and enables compliance auditing.

What Can Microsoft Purview Do?

1. Data Governance & Discovery

Microsoft Purview Data Map: Provides automated data discovery, classification, and lineage tracking.
Data Catalog: Allows users to search for and understand data assets across an enterprise.
Sensitive Data Classification: Automatically detects and labels sensitive data using AI-powered classifiers.

2. Data Protection & Compliance

Information Protection: Applies encryption and access control to sensitive files.
Data Loss Prevention (DLP): Helps prevent accidental or unauthorized sharing of sensitive data.
Compliance Manager: Assists organizations in assessing compliance with regulations such as GDPR, HIPAA, and CCPA (Microsoft Purview Compliance Overview).

3. Risk Management & Insider Threat Protection

Microsoft Purview Insider Risk Management: Detects and mitigates risks from within an organization.
Communication Compliance: Monitors internal communications for policy violations and potential risks.
eDiscovery & Audit: Enables legal teams to identify, hold, and review data for investigations and compliance audits.

4. Integration with Multi-Cloud & On-Premises Environments

Microsoft 365 & Azure Integration: Provides seamless governance across Microsoft cloud services.
Support for AWS, Google Cloud, and On-Premises Data Sources: Helps organizations manage data across multiple environments (Microsoft Purview Documentation).

Why Organizations Should Use Microsoft Purview

1. Unified Data Governance

Microsoft Purview consolidates governance, security, and compliance into a single platform, reducing the need for multiple standalone tools.

2. Automated Compliance & Risk Assessment

Organizations can streamline compliance processes and reduce regulatory risks through automated compliance tracking and AI-powered risk detection.

3. Enhanced Security & Protection

With built-in encryption, access control, and insider risk management, businesses can minimize data leaks and cyber threats.

4. Improved Data Visibility & Discovery

By using Purview Data Map and Catalog, organizations can gain better insights into their data landscape, enabling better decision-making and compliance reporting.

How to Get Started with Microsoft Purview

Step 1: Assess Your Organization’s Needs

Before deploying Purview, evaluate your data governance, compliance, and risk management requirements.

Step 2: Enable Microsoft Purview in Your Environment

Microsoft Purview is available through Microsoft 365 Compliance Center and Azure Portal.
IT administrators can configure Purview’s governance policies based on organizational needs.

Step 3: Classify and Label Data

Use Microsoft’s built-in sensitivity labels to classify and protect your most valuable data assets.

Step 4: Implement Data Protection & Compliance Policies

Leverage Data Loss Prevention (DLP), Insider Risk Management, and Compliance Manager to automate policy enforcement and monitoring.

Step 5: Monitor & Optimize Governance Strategies

Regularly review compliance dashboards and audit logs to identify gaps and improve governance policies.

Is Microsoft Purview Worth It?

For organizations seeking a centralized, AI-powered compliance and data governance solution, Microsoft Purview is one of the most comprehensive and scalable platforms available today. Whether you need to protect sensitive data, ensure regulatory compliance, or manage insider risks, Microsoft Purview provides the tools necessary to meet these challenges effectively.

By implementing Microsoft Purview, businesses can enhance data security, simplify governance, and reduce compliance risks—all within a unified and intelligent platform.

Frequently Asked Questions about Microsoft Purview

1. What is Microsoft Purview used for?

Microsoft Purview is used for data governance, compliance, and risk management across multi-cloud, SaaS, and on-premises environments.

2. How does Microsoft Purview help with compliance?

It provides automated compliance assessments, built-in regulatory templates, and tools like Data Loss Prevention (DLP) and Compliance Manager.

3. Can Microsoft Purview work with non-Microsoft platforms?

Yes, it supports data management across AWS, Google Cloud, and on-premises environments.

4. Does Microsoft Purview protect against insider threats?

Yes, it has Insider Risk Management features that detect and mitigate internal security threats.

5. Is Microsoft Purview included in Microsoft 365?

Some Purview features are included in Microsoft 365 E5 plans, while others may require additional licensing.

6. What industries benefit the most from Microsoft Purview?

Industries like finance, healthcare, and government benefit significantly due to strict regulatory requirements.

7. Can Microsoft Purview automate data classification?

Yes, it uses AI-driven classifiers to automatically detect and label sensitive data.

8. What’s the difference between Microsoft Purview and Azure Purview?

Azure Purview was rebranded and integrated into Microsoft Purview, expanding its capabilities to include compliance and risk management.

9. How does Purview integrate with Microsoft 365?

It integrates natively, allowing organizations to apply governance policies to SharePoint, OneDrive, Exchange, and Teams.

10. Where can I learn more about Microsoft Purview?

You can refer to Microsoft’s official documentation for detailed insights.

Is Disaster Recovery Really Worth The Trouble (Part 4)

by Mark | May 10, 2019 | Features, How To, Risk, Security

Is Disaster Recovery Really Worth The Trouble

(Part 4 of 4 part series)

Guest Post by Tommy Tang – Cloud Evangelist

In this final chapter of the Disaster Recovery (part 1, part 2 and part 3) discussion I am going to explore some of the common practices, and myths, regarding DR in the Cloud. I’m sure you must have heard about the argument for deploying application to the Cloud because of the inherited built-in resilience and disaster recovery capability. Multi-AZ, 11 9’s durability, auto-scaling, Availability Sets and multi-region recovery (e.g. Azure Site Recovery) and many more, are widely adopted and embraced without hesitation. No doubt these resilient features are part of the charm of using Cloud services and each vendor will invest and promote their own unique strength and differentiation to win market share. It’ll only take 30 minutes to fail over to another AZ so she’ll be right, yes?

If you remember in Part 2 of the DR article I stated the number one resilience design principle is to “eliminate single point of failure”. Any Cloud vendor could also become the single point of failure. If you’ve deployed the well architected, highly modularised and API rich application in Amazon Web Services (AWS), do you still need to worry about DR? The short answer is YES. You ought to consider DR capability provided by AWS, or any other Cloud vendor for that matter, to determine whether it does meet your requirement. The solution is indeed fit for purpose. Do not assume anything just because it is in the Cloud.

AWS is not immune to unplanned outages because Could infrastructure is also built on physical devices like disks, compute and network switches. Some online stores like Big W and Afterpay had been impacted due to unexpected AWS outage on 14th Feb 2019 for about 2 hours. What is your Recovery Time Objective (RTO) requirement? Similarly Microsoft Azure is also not immune to outages either. On 1st February 2019 Microsoft had inadvertently deleted several Transparent Data Encryption (TDE) databases after encountering DNS issues. The TDE database were quickly restored from snapshot backup, but unfortunately customers would have lost 5 minutes worth of transactions. Image what would you do if your Recovery Point Objective (RPO) is meant to be Zero? No data loss?

At this very moment I hope I have stirred up plenty of emotions and a good dose of anxiety. Cloud infrastructure and Cloud service provider is not the imaginative Nirvana or Utopia that you have been searching for. It’s perhaps multi-generation better than what you have installed in your data centre today, but any Cloud deployment still warrants careful consideration, design and planning. I’m going to briefly discuss 3 areas in which you should start exploring in your own IT environment tomorrow.

1. Backup and Restore

As a common practice you’d take regular backup for your precious application and data so you’d be able to recover in the most unfortunate event. Same logic applies here when you have deployed applications in AWS or Azure Cloud. Ensure you are taking regular backup in the Cloud, which is likely to be auto-configured, as well as a secondary backup stored outside the Cloud service provider. It’s exactly the same concept and reason for taking offsite backup which is, proverbially speaking, you don’t put all your eggs in one basket. Unless you don’t have a data centre anymore, your own data centre would be the perfect offsite backup location. I understand getting backup off AWS S3 could pose a bit of challenge and I’d urge you to consider using AWS Storage Gateway for managing offsite backups. It should make backup management a lot easier.

Once you’ve secured the backup of application and data away from the Cloud vendor, you’re now empowered to restore (or relocate) the application to your own data centre or to different Cloud provider as desired. Bearing in mind that you’re likely to suffer some data loss using backup and restore technique. Depending on the backup cycle it’s typically a daily backup (i.e. 24 hours) or weekly backup (i.e. 7 days). You must diligently consider all recovery scenarios to determine if backup and restore is sufficed for the Recovery Point Objective (RPO) of the targeted application.

2. Data Replication

What if you can’t afford to lose data for your Tier-1 critical application? (i.e. RPO is Zero) Can you still deploy it to the Cloud? Again the short answer is YES but it probably requires some amendment to the existing architecture design, and notwithstanding the additional cost and effort involved. I believe I have already touched on the design patterns Active-Active and Active-Passive in Part 2 of the DR discussion. If Recovery Point Objective (RPO) is Zero then you must establish synchronous data replication across 2 sites, 2 regions or 2 separate Cloud vendors. Ok, even though it’s feasible to establish synchronous data replication over long distances, the Law of Physics still applies and that means your application performance is likely to suffer from elevated network latency. Is it still worth pursuing? It’s your call.

There are generally 2 ways to achieve data replication across multi-region or multi-cloud. The first method is to leverage storage replication technology. It’s the most common and proven data replication solution found in the modern data centre, however it’s extremely difficult to implement in the Cloud. The simple reason is you don’t own Cloud storage but vendors do. There will be limited API and software available for you to synchronise data say between AWS S3 and the on-premises EMC storage array. The only alternative solution I can think of, and you might have other brilliant idea, is to deploy your own Cloud edge storage (e.g. NetApp Cloud Volumes OnTap) and presented to the applications hosted in various Cloud vendors. Effectively you still own and manage the storage (and data) rather than utilising the unlimited storage generously provisioned by the vendor, and as such you are able to synchronise your storage to any secondary location of your choice. You have the power!

As opposed to using storage replication technology you can opt for host or software based replication. Generally you are probably more concerned of the data stored in database than say the configuration file saved on the Tomcat server. Following on this logic data replication at the database tier is our first and foremost consideration. If you are running Oracle database then you can choose to configure Data Guard with synchronous data replication between AWS EC2 and on-premises Linux database server. On the other hand if your preference is Microsoft SQL Server then you’d configure SQL Server Always On Cluster with synchronous replication for databases hosted in Azure Cloud and on-premises VMWare Windows server. You can even set up database replication between different Cloud vendors as long as the Cloud infrastructure supports it. The single most important prerequisite for implementing database replication, wether it is between Cloud vendors or Cloud to on-premises, is the underlying Operating System (OS). Ideally you’d have already standardised your on-premises operating environment to be Cloud ready. For example, retaining large scale AIX or Solaris servers in your data centre, rather than switching to Windows or Linux based Cloud compatible OS, does nothing to inspire a romantic Cloud story.

3. Orchestration Tool

The last area I’d like to explore is how to minimise RTO while recovering application to your on-premises data centre or to another Cloud vendor during major disaster event. If you are well versed in the DevOps world and being a good practitioner then you are already standing on good foundation. The most common problem found during recovery is the complexity and human intervention required to instantiate the targeted application software and hardware. Keeping with the true CI/CD spirit the proliferation use of orchestration tool to deploy immutable infrastructure and application is the very heart and soul of DevOps. By adopting the same principle you’d be able to recover the entire application stack via orchestration tool like Jenkins to another Cloud or on-premises Cloud like environment with minimal effort and time. No more human fat finger syndrome and slack think time during recovery. Consider using open source and Cloud vendor agnostic tool like Terraform (as opposed to AWS CloudFormation) can greatly enhance portability and reusability for recovery. Armed with the suitable containerisation technology (e.g. Kubernetes) that is harmonised in your IT landscape, you’d further enhance deployment flexibility and manageability. Running DR at an alternate site becomes a breeze.

In closing, I’d like to remind you that just because your application is deployed to the Cloud (i.e. someone else infrastructure) you are not exonerated from neglecting the basic Disaster Recovery design principles and making ill-informed decision. Certainly it’s my opinion that the buck will stop with you when the application is blown to smithereens in the Cloud. This is the last article of the Disaster Recovery series and hopefully I have imparted a little bit of the knowledge, practical examples and stories to you that you can tackle DR from a whole new light without fear and prejudice. I’m looking forward to sharing with you some more Cloud stories in a not too distant future. Stay tuned.

This article is a guest post by Tommy Tang (https://www.linkedin.com/in/tangtommy/). Tommy is a well rounded and knowledgeable Cloud Evangelist with over 25+ years IT experience covering many industries like Telco, Finance, Banking and Government agencies in Australia. He is currently focusing on the Cloud phenomena and ways to best advise customers on their often confused and thorny Cloud journey.

Is Disaster Recovery Really Worth The Trouble (Part 3)

by Mark | May 10, 2019 | How To, Risk, Security

Is Disaster Recovery Really Worth The Trouble

(Part 3 of 4 part series)

Guest Post by Tommy Tang – Cloud Evangelist

In previous articles (part 1 and part 2) I’ve emphasised Disaster Recovery (DR) design principle is simply about eliminating single point of failure for data centre, and to provide adequate service and application resilience that’s fit for purpose. Over-engineered gold plated architecture solution does not always fit the bill and conversely low-tech simple and cost effective solution doesn’t necessary mean it’s sub-standard. There are 3 common DR patterns that you are likely to find in your organisation and they are known as “Active-Active”, “Active-Passive” and “Active-Cold”. As a DR solution architect you have been tasked to implement the most cost effective and satisfactory DR solution for your stakeholders. You might wonder where to begin, Pros and Cons of each DR pattern and what are the gotchas? Well, let me tell you there is no perfect solution or “one-size-fit-all” silver bullet. But don’t feel despair as I will be sharing with you some of the key design consideration and relevant technology that is instrumental to successful DR implementation.

Network and Distance Consideration

Imagine your two data centres that are geographically dispersed, the underlying network infrastructure (e.g DWDM or MPLS) is the very bloodline that interconnects every service together such as HTTP server, database, storage, Active Directory, backup etc. So without doubt network performance and capability is rated high on my checklist. How do we measure and attain good network performance? First of all you’d need to understand the two key measurements; network latency and bandwidth and I will briefly explain them below.

Network latency is defined as the time it takes to transfer a data packet from destination A to B and expressed in Millisecond (ms). In some cases latency also includes the data packet roundtrip with acknowledgement (ACK). Network bandwidth is the maximum data transfer rate between destination A and B (aka network throughput), and the transfer rate is expressed in Megabits per second (Mbps). Both of these metrics are governed by the law of physics (i.e. speed of light) so the distance in which separated the two data centres plays a pivotal role in determining the network performance and ultimately the effectiveness of DR implementation.

Having data centres located in Sydney and Melbourne sounds like a good risk mitigation strategy until you are confronted with the “Zero RPO” dilemma. How could you keep data in-sync between 2 data centres stretched over 800Km, leveraging the existing SAN storage based replication technology, without causing noticeable degradation to storage performance? How about the inconsistent user experience being felt by users who are farther away from the data centre? Remember the law of physics? Unless you own a telephony company or unlimited funds, trying to implement synchronous data replication over long distance, regardless whether it is host or storage based replication technology, will surely cost a large sum of money and not to mention the adverse IO performance impact.

For those brave souls who are game enough to implement dual site Active-Active extended Oracle RAC cluster, the maximum recommended distance between 2 sites is 100Km. However after taking into consideration of super-low network latency requirement and relatively high cost, it’s more palatable to implement extended Oracle RAC cluster in data centres that are 10-15Km apart. You may find similar distance constraint exists for other high availability DR technology. Active-Active pattern is especially sensitive to network latency because of the constant chit-chatting between services at both sites. If the distance between 2 data centres is becoming the major impediment for implementing Active-Active DR or synchronous data replication, then you should diligently pursue alternative solutions. It’s quite possible that Active-Passive or non-zero RPO is acceptable architecture so don’t be afraid to explore all options with your stakeholders.

Mix and Match Pattern

I have come across application systems which have been architected with the flurry of mix and match DR design flair that got me slightly bemused. Let us examine a simple example. A “Category A” service (i.e. highly critical customer facing) is composed of Active-Active DR pattern for the Web Server (pretty standard), Active-Passive pattern for the Oracle database (also stock standard), and Active-Cold pattern for the Windows application server. So you may ask what is the problem if RTO is being met?

As you may recall each DR pattern comes with predefined RTO capability and prescribed technology that underpins it. By combining different DR design patterns into a single architecture will undoubtedly dilute the desired DR capability. In this example the Active-Cold pattern is the lowest common denominator as far as capability is concerned, so it will inadvertently dictate the overall DR capability. The issue being is why would you invest in a relatively high cost and complex Active-Active pattern when the end result is comparable to the lowly Active-Cold design? The return on investment has greatly diminished by including lower calibre pattern such as Active-Cold in the mix.

Another point you should consider is can the mix and match design really stand up in the real DR situation and meet the expected RTO. I have heard the argument that the chosen design works perfectly well in the isolated application DR test. What about in the real DR situation when you are facing competing human resources (e.g. Sysadm, DBA, Network dude) and system resources like IOPS, CPU, Memory, Network etc. It’s my belief that all DR design patterns should be regularly tested in simulated DR scenario with many applications, in the interest of determining the true DR capability and effectiveness. You may find the mix and match DR architecture does not work as well as expected.

Finally the technology that underpins each DR pattern could have changed and evolved over time. Software vendors often change functionality and capability with future releases so DR pattern must be engineered to be adaptive to change. As a result there’s inherited risk for mixing different DR patterns that will certainly increase the dependency and complexity for maintaining expected DR capability in the fast changing technology landscape.

Mix and match DR pattern may sound like a good practical solution and in many cases it is driven by cost optimisation. However after consideration of the associated risks and pitfalls I’d recommend choosing the pattern that is best matched for the corresponding service criticality. Although it’s not a hard and fast rule but I do find the service to DR pattern mapping guidelines below are simple to understand and follow. You may also wish to come up with different set of guidelines that are more attuned to your IT landscape and requirement.

Category A (Highly Critical) – Active-Active (preferred) or Active-Passive
Category B (Critical) – Active-Passive (preferred) or Active-Cold
Category C (Important) – Active-Passive or Active-Cold
Category D (Insignificant) – Active-Cold

Automation

Last but not least I’d like to bring automation into DR discussion. In the current Cloud euphoria era automation is the very DNA that defines its existence and success. Many orchestration and automation tools are readily available for building compute infrastructure, programming API and PaaS services configuration just to list a few. The same set of tools can also be applied to DR implementation with great benefits.

In my mind there is no doubt that Active-Active is the best architecture pattern, however it does come with a hefty implementation price tag and design constraints. For example some application does not support distributed processing model (i.e. XA transaction) so it can’t run in dual-site Active-Active environment. Even for the all mighty Active-Active pattern automation can further improve RTO when applied appropriately. For instance client and application workload distribution via Global DNS Service or Global Traffic Manager (GTM) needed for DR can be automated via pre-configured smart policy. Following the same idea database failover can also be automated based on well tested configurable rules. This is where automation can simplify and vastly improve the quality of DR execution.

Same design principle applies to Active-Passive and Active-Cold DR pattern as well. Automation is the secret source for quality DR implementation. Consider incorporating automation to all service components where possible. But here is the reality check. Implementing automation is not trivial and it is especially difficult for service component that is not well documented or designed, or lack of the suitable automation tools. Furthermore it is not advised to automate DR process if there is no suitable production like environment (e.g. cross-site infrastructure) to conduct quality assurance test. The implementation work itself can be extremely frustrating because you’d need to delicately negotiate and cooperate with different departments and third-party vendors. Having that said I believe the benefits are far outweighed the pain in most cases. I have known one case where automation has reduced DR failover time from 4 hours down to 30 minutes. No pain no gain right?

For those who are DevOps savvy techies there are many orchestration tools out in the marketplace that you can pick to develop the automation framework of your choice. Chef, Puppet, Jenkins for orchestration and Python, Powershell, and C Shell for scripting just to name a few. If you don’t want to build your owner automation framework then you might want to consider vendor software like Selenium, Ansible Tower or Blueprism.

In conclusion a successful DR implementation should be planned with detailed impact assessment of network latency between data centres, carefully consider the most appropriate DR patterns and relevant technology for the targeted service application, and leverage automation infused with artificial intelligence (i.e. policy or rule based) to replace manual tasks where feasible. In the next article I will be exploring the various DR scenarios presented for Cloud deployment.

Is Disaster Recovery Really Worth The Trouble (Part 2)

by Mark | May 10, 2019 | How To, Risk, Security

Is Disaster Recovery Really Worth The Trouble

(Part 2 of 4 part series)

Guest Post by Tommy Tang – Cloud Evangelist

In the previous article I’ve mentioned Architecture is the foundation, the bedrock, for implementing Disaster Recovery (DR), and it must be part of the broader discussion on system resilience and availability otherwise funding will be hard to come by. You may ask what are the key design criteria for DR? I believe, first and foremost, the design must be ‘Fit for Purpose’. In other words you’d need to understand what the customer wants in terms of requirement, objective and expected outcome. The following technical jargons are commonly used to measure DR capability and I will provide a brief explanation for each metric.

Recovery Time Objective (RTO)

It is the targeted time duration of which a service or system needs to be restored during Disaster Recovery. Typically RTO is measured in hours or days and it’s no surprise to find human ‘think’ time often exacerbates the recovery time. RTO should be tightly aligned with the Business Continuity requirement (i.e. Maximum Acceptable Outage MAO) given system recovery is only one aspect of the business service restoration process.

Recovery Point Objective (RPO)

It is the maximum targeted time of which data or transaction might be lost without recovery. You can view RPO as the maximum data loss that you can afford. So ‘Zero RPO’ is interpreted as no data loss is permitted. Not even a second. The actual amount of data loss is very much dependent on the affected system. For example, an online stock trading system that suffers a 5-minute data loss could result hundreds of lost transaction worth millions of dollars. Conversely, an in-house Human Resource (HR) system is unlikely to suffer any data loss for the same 5-minute interval given changes to HR system are scarce.

Mean Time To Recovery (MTTR)

It is the average time taken for a device, component or service to recover from failure after being detected. Unlike RTO and RPO, MTTR includes the element of monitoring and detection, and it’s not limited to DR event but any failure scenario. When you’re designing the appropriate DR solution for your customer, MTTR must be vigorously scrutinised for each software & hardware component in order to meet the targeted RTO.

Let’s move over to the business side of the DR coin and see how these metrics are being applied. I think it is a safe bet to assume each business service would have already been assigned to the predetermined service criticality classification, and each classification must have included RTO and RPO requirement. For illustration purposes let say “Category A” service is a highly critical customer portal so it might have 2-hour RTO and Zero RPO requirement, and for “Category C” internal timesheet service it could have RTO set to 12-hour with 1-hour RPO.

In a real DR event (or DR exercise) the classification is used to determine the order in which a service is being restored. It is neither practical or sensible to have all services weighed equally, or have too many services that are rated critical given the limited resources and immense pressure being exerted during DR. The right balance must be sought and agreed upon by all business owners.

Now you have the basic understanding of the DR requirement and keen to get started. Hold off launching the Microsoft Visio app and start drawing those beautiful boxes just yet. I’d like to share with you the one simple resilience design principle which I have been using, and that is to eliminate “Single Point of Failure”. By the virtue of having 2 working and functionally identical components in the system you’d improve resilience by 100%! The 2x system is now capable of handling single component failure without loss of service. The “Single Point of Failure” principle does apply to physical data centre and therefore it is very much relevant to DR design.

As an IT architect you have a number of tried and proven solutions (aka architecture patterns) available in the toolkit at your disposal. The DR patterns described below are commonly found in most organisations.

Active-Active

The definition of Active-Active DR pattern is to provision two or more active working software components that spread across 2 data centres. E.g. A N-tier system architecture may consist of 2x Web servers, 2x Application servers and 2x Database servers. Client connection and application workload is distributed between 2 sites, either equally weighed or otherwise, via Global DNS Service or Global Traffic Manager (GTM). The primary objective of the Active-Active DR design is to eliminate data centre as single point of failure. Under this design there is no need to initiate failover during Disaster Recovery because an identical system is already running at the alternate site and sharing the application workload. (E.g. Zero RTO)

The Active-Active pattern is best suited for critical system or service (i.e. Category A) because of the high cost and complexity associated with implementing distributed system. Not every application is capable of running in a distributed environment across 2 sites. The reason could be attributed to software limitation like Global Sequencing or Two-Phase Commit. It’s highly desirable to have formulated a prescriptive Active-Active design pattern to help mitigate the inherited cost and risks, and to align with the existing technology investment and future roadmap.

The biggest challenge is often encountered at the database tier. Are you able to run the database simultaneously across 2 sites? If so, is the data being replicated synchronously or asynchronously? Designing a fully distributed database solution with zero data loss (i.e. Zero RPO) is not trivial. Obviously you can choose to implement a partial Active-Active solution where every component except the database is active across 2 sites. Alternatively, you may want to relax the RPO requirement to allow non-zero value so asynchronous data replication can be applied. (E.g. 5-minute RPO)

From general observation I’ve found critical system database is typically configured with a warm standby DB with Zero RPO, where failover operation can be manually initiated or automated. The warm standby DB configuration is also known as Active-Passive DR pattern of which is going to be explored further in the next section.

Recently I’ve heard a story about Disaster Recovery. A service owner proclaimed the targeted system is fully Active-Active across 2 sites during the DR exercise and therefore no failover is ever required. 30 minutes later the same service owner, with much reluctance, scrambled to contact the DBA team requesting an urgent Oracle DB failover to the DR site. A word of advice: many supposed to be Active-Active implementations are only truely Active-Active up to the database tier so it does pay to understand your system design. A one page high-level system architecture diagram with network flow should be suffice to summarise the DR capability without confusion.

Active-Passive

The Active-Passive DR pattern stipulates that there is one or more redundant software components configured in warm standby mode at the alternate data centre. DR failover operation can be either manually initiated or automated for each component in the predetermined order for the respective application stack. Client connection and application workload will be redirected to the active live data centre via Global DNS Service or Global Traffic Manager (GTM). Remember the key differentiator from Active-Active DR pattern is only one active site can accept and process workload while the passive site is in dormant.

The primary objective of the Active-Passive design is, same as Active-Active, to eliminate data centre as single point of failure but albeit with higher RTO value. Time required to failover will vary and is dependent on the underlying design and technology deployed for the corresponding software component. Component failover can typically take 5 to 30 minutes (or even longer) to complete. Therefore the aggregated component failover time + human think time is roughly equivalent to the RTO value. (E.g. 4-hour RTO)

The Active-Passive design is suitable for most systems because it is relatively simple and cost effective, The two key technology enablers are storage replication and application native replication. Leveraging the storage replication for DR is probably the most popular option because it is application agnostic. The storage replication technology itself is simple, mature and proven and it’s generally regarded to be low risk. The data being replicated between sites can be synchronous (i.e. Zero RPO) or asynchronous (i.e. Non-zero RPO) and both options are just as good depending on the RPO requirement.

As for the application specific replication it will typically utilise TCP/IP network to keep data in-sync between 2 sites. It could also be synchronous or asynchronous depending on the technology and configuration. The underlying replication technology is vendor specific and proprietary so you’d need to rely on the vendor’s tools for monitoring, configuration and problem diagnosis. For example, you may want to implement SQL Server Always On Availability Group for the warm standby DB so you’d have to learn how to manage and monitor Windows Server Failover Cluster (WSFC). Application native replication is often found at the database tier like SQL Server Always On or Oracle Data Guard. Every vendor would have published the recommended DR configuration so it’d be foolhardy not to follow their recommendation.

Active-Cold

Last of all it is about the Active-Cold DR pattern. This pattern is similar to Active-Passive except the software component at the alternate site has not been instantiated. In some case it may require a brand new virtual server for configuring and starting the application component. Or it may need to manually mount the replicated filesystem and then start up the application. Or it may require certain backup restoration process to recover the software to the desired operating state.

The word ‘Cold’ implies much work is needed, and whatever it takes, to bring the service online. In many cases it’ll take hours or even days to complete the recovery task. Hence RTO for Active-Cold design is expected to be larger than Active-Passive. However just because it takes longer to recover doesn’t mean it is a bad solution. For example, it is perfectly acceptable to take one or two days to recover an internal timesheet system without causing much outrage. Put it simply it is “Horses for Courses”. Also you can still achieve Zero RPO (i.e. no data loss) with Active-Cold design by leveraging synchronous storage replication between sites. Not bad at all!

In this article I have covered the common DR related metrics like RTO, RPO and MTTR. I have also shared with you the ‘Single Point of Failure’ resilience design principle which has served me well over the years. I have summarised, perhaps a tad longer than summary, the three common DR design patterns interlaced with practical examples and the patterns are: Active-Active, Active-Passive and Active-Cold. I realised I might have gone a bit longer than expected in this article so I’m saving some of the interesting thoughts and stories for the next article which is focusing on DR implementation.

MICROSOFT’S December 2016 PATCH RELEASES

by Mark | Dec 14, 2016 | Deployment, Fixes, Patch Management, Patch Releases, Risk, SCCM, Security

MICROSOFT’S December 2016 PATCH RELEASES

Microsoft have released 12 new Patch Tuesday releases for deployment this month of December.

See how you can remove the risk of patch deployment by adding SnaPatch to your SCCM patching infrastructure?

MS16-144 – Critical

Cumulative Security Update for Internet Explorer (3204059)
This security update resolves vulnerabilities in Internet Explorer. The most severe of the vulnerabilities could allow remote code execution if a user views a specially crafted webpage using Internet Explorer. An attacker who successfully exploited the vulnerabilities could gain the same user rights as the current user. If the current user is logged on with administrative user rights, an attacker could take control of an affected system. An attacker could then install programs; view, change, or delete data; or create new accounts with full user rights.

MS16-145 – Critical

Cumulative Security Update for Microsoft Edge (3204062)
This security update resolves vulnerabilities in Microsoft Edge. The most severe of the vulnerabilities could allow remote code execution if a user views a specially crafted webpage using Microsoft Edge. An attacker who successfully exploited the vulnerabilities could gain the same user rights as the current user. Customers whose accounts are configured to have fewer user rights on the system could be less impacted than users with administrative user rights.

MS16-146 – Critical

Security Update for Microsoft Graphics Component (3204066)
This security update resolves vulnerabilities in Microsoft Windows. The most severe of the vulnerabilities could allow remote code execution if a user either visits a specially crafted website or opens a specially crafted document. Users whose accounts are configured to have fewer user rights on the system could be less impacted than users who operate with administrative user rights.

MS16-147 – Critical

Security Update for Microsoft Uniscribe (3204063)
This security update resolves a vulnerability in Windows Uniscribe. The vulnerability could allow remote code execution if a user visits a specially crafted website or opens a specially crafted document. Users whose accounts are configured to have fewer user rights on the system could be less impacted than users who operate with administrative user rights.

MS16-148 – Critical

Security Update for Microsoft Office (3204068)
This security update resolves vulnerabilities in Microsoft Office. The most severe of the vulnerabilities could allow remote code execution if a user opens a specially crafted Microsoft Office file. An attacker who successfully exploited the vulnerabilities could run arbitrary code in the context of the current user. Customers whose accounts are configured to have fewer user rights on the system could be less impacted than those who operate with administrative user rights.

MS16-149 – Important

Security Update for Microsoft Windows (3205655)
This security update resolves vulnerabilities in Microsoft Windows. The more severe of the vulnerabilities could allow elevation of privilege if a locally authenticated attacker runs a specially crafted application.

MS16-150 – Important

Security Update for Secure Kernel Mode (3205642)
This security update resolves a vulnerability in Microsoft Windows. The vulnerability could allow elevation of privilege if a locally-authenticated attacker runs a specially crafted application on a targeted system. An attacker who successfully exploited the vulnerability could violate virtual trust levels (VTL).

MS16-151 – Important

Security Update for Windows Kernel-Mode Drivers (3205651)
This security update resolves vulnerabilities in Microsoft Windows. The vulnerabilities could allow elevation of privilege if an attacker logs on to an affected system and runs a specially crafted application that could exploit the vulnerabilities and take control of an affected system.

MS16-152 – Important

Security Update for Windows Kernel (3199709)
This security update resolves a vulnerability in Microsoft Windows. The vulnerability could allow information disclosure when the Windows kernel improperly handles objects in memory.

MS16-153 – Important

Security Update for Common Log File System Driver (3207328)
This security update resolves a vulnerability in Microsoft Windows. The vulnerability could allow information disclosure when the Windows Common Log File System (CLFS) driver improperly handles objects in memory. In a local attack scenario, an attacker could exploit this vulnerability by running a specially crafted application to bypass security measures on the affected system allowing further exploitation.

MS16-154 – Critical

Security Update for Adobe Flash Player (3209498)
This security update resolves vulnerabilities in Adobe Flash Player when installed on all supported editions of Windows 8.1, Windows Server 2012, Windows Server 2012 R2, Windows RT 8.1, Windows 10, and Windows Server 2016.

MS16-155 – Important

Security Update for .NET Framework (3205640)
This security update resolves a vulnerability in Microsoft .NET 4.6.2 Framework’s Data Provider for SQL Server. A security vulnerability exists in Microsoft .NET Framework 4.6.2 that could allow an attacker to access information that is defended by the Always Encrypted feature.

Now that you have made it this far, a quick shameless plug for our software portfolio. 🙂

SnaPatch – Patch Management Addon for Microsoft’s SCCM.

SnapShot Master – Take control of your virtual machine snapshots, works with both Hyper-V and Vmware.

Azure Virtual Machine Scheduler – Save money and schedule the shutdown and power on of your virtual machines within Microsoft’s Azure Cloud.

Azure Virtual Machine Deployer – Deploy VMs to Microsoft’s Azure cloud easily, without the need for powershell.

« Older Entries