Disaster recovery for an Azure data platform

This article is the first in a series that describes how to design a disaster recovery (DR) strategy for an Azure enterprise data platform. The series complements the following guidance:

Azure offers many reliability options that provide service continuity during a disaster. But higher service levels can add complexity and increase cost. When you make decisions about DR, consider the trade-offs between cost, reliability, and complexity.

Occasional point failures occur across the Azure platform, but Azure datacenters and services have multiple layers of redundancy built in. These failures typically have limited scope and are remediated within hours. A partial service disruption, like an identity management outage, is more common than a complete Azure region failure.

Cyberattacks, particularly ransomware, pose a tangible threat to any modern data ecosystem and can result in a data platform outage. This threat is out of scope for this series, but you should implement controls against such attacks as part of any data platform's security and reliability design.

For more information, see Backup and restore plan to protect against ransomware.

Scope

This series covers service recovery of an Azure data platform from a physical disaster. The example customer in the scenario has the following characteristics:

  • Medium‑sized to large organization that has a defined operational support function that follows Information Technology Infrastructure Library (ITIL) service management methodology.

  • Not cloud-native. Core enterprise shared services, like identity management and incident management, remain on-premises.

  • Migrating to Azure by using automation-enabled deployments.

The data platform implements the following designs within the customer's Azure environment:

  • An enterprise landing zone that provides the platform foundation, including networking, monitoring, security, and other capabilities

  • An Azure analytics platform that provides the data components for various solutions and data products

This series covers service failover operations from the primary region to the secondary region. To follow this guidance, you need the following knowledge:

  • Working knowledge of Azure, its core services, and data components. For more information, see Azure fundamentals.

  • Working knowledge of Azure DevOps, including source control navigation and pipeline execution.

Out of scope

This series doesn't cover:

  • Failback from the secondary region to the primary region.

  • Non-Azure applications, components, or systems, like on-premises systems, other cloud vendors, and external web services.

  • Upstream service recovery, like on-premises networks, gateways, and enterprise shared services, even if the data platform depends on them.

  • Downstream service recovery, like on-premises operational systems, external reporting systems, and data modeling or data science applications, even if they depend on the data platform.

  • Data loss scenarios, including recovery from ransomware or similar data security incidents.

  • Data backup strategies and data restoration plans.

  • Root cause analysis (RCA) for a DR event. For Azure service incidents, Microsoft publishes RCA reports on the Azure status history page.

Key assumptions

This example assumes that:

  • The organization follows an ITIL-based service management methodology for operational support of the Azure data platform.

  • The organization has an existing DR process as part of its IT service restoration framework.

  • The organization uses infrastructure as code (IaC) to deploy the Azure data platform through an automation service like Azure DevOps.

  • The organization completes a business impact assessment for each solution on the data platform, with defined recovery point objective (RPO), recovery time objective (RTO), and mean time to repair (MTTR) metrics.
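As a rough illustration of the last assumption, the following Python sketch shows one way the outcome of a business impact assessment could be recorded per solution and compared against the results of a DR event or drill. The class name, function name, solution name, and all time values are hypothetical and aren't part of this series. Real targets come from the organization's own assessment.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets recorded by a business impact assessment for one solution."""
    solution: str     # hypothetical solution name
    rpo: timedelta    # maximum tolerable window of data loss
    rto: timedelta    # maximum tolerable time to restore service
    mttr: timedelta   # mean time to repair, tracked for reference only


def breached_objectives(targets, data_loss, downtime):
    """Return the names of the objectives that a DR event or drill missed."""
    breaches = []
    if data_loss > targets.rpo:
        breaches.append("RPO")
    if downtime > targets.rto:
        breaches.append("RTO")
    return breaches


# Illustrative values only (assumptions, not recommendations).
sales_reporting = RecoveryObjectives(
    solution="sales-reporting",
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
    mttr=timedelta(hours=6),
)

# A drill that lost 1 hour of data and took 10 hours to restore service.
result = breached_objectives(
    sales_reporting,
    data_loss=timedelta(hours=1),
    downtime=timedelta(hours=10),
)
print(result)  # ['RTO']
```

Keeping the targets in a structured form like this makes it easier to check each solution against its own objectives after a failover test, rather than applying one target to the whole platform.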

Next step

After you review the scenario, see the architecture for this use case.