Disaster Recovery Planning
Design and maintain disaster recovery (DR) plans including site selection, failover strategies, RTO/RPO targets, failover testing, and recovery procedures for complete business continuity. Use when developing DR plans, selecting DR sites, configuring failov...
Comprehensive disaster recovery strategy design, implementation, and testing to ensure business continuity during catastrophic events including data center failures, natural disasters, and extended outages.
Workflow
- Conduct business impact analysis (BIA): identify critical business functions, maximum tolerable downtime (MTD), financial impact of downtime, regulatory requirements, interdependencies between systems and departments.
- Define DR strategy per application: RTO/RPO targets aligned with BIA; recovery strategy (active-active, active-passive, pilot light, warm standby, cold site); budget allocation.
- Design DR architecture: DR site selection (secondary data center, cloud region), network connectivity (Direct Connect, ExpressRoute, dedicated link), data replication strategy, DNS failover.
- Build DR environment: provision infrastructure in DR site; configure replication; establish connectivity; deploy monitoring and alerting; document runbooks.
- Develop DR runbooks: step-by-step procedures for failover and failback; contact lists; decision matrices; escalation paths; communication templates.
- Test DR plan: tabletop exercises (quarterly), technical failover tests (semi-annually), full DR drill (annually); measure actual RTO vs. target; document lessons learned.
- Maintain DR readiness: keep DR environment current; update runbooks after changes; verify replication health; review contact lists; train staff on DR procedures.
- Execute DR when needed: declare disaster; activate DR plan; execute failover; verify services; communicate status; begin recovery operations.
- Failback to primary: stabilize primary site; replicate data from DR to primary; execute failback; verify operations; post-DR review.
- Continuous improvement: update DR plan after every test or real event; incorporate lessons learned; adjust strategy based on business changes; annual DR plan review.
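The first two workflow steps imply a consistency check: an application's RTO target should never exceed the maximum tolerable downtime (MTD) found in the BIA. A minimal sketch of that check follows; the `AppTargets` record, its field names, and the sample figures are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class AppTargets:
    """Hypothetical record joining BIA output with DR targets (minutes)."""
    name: str
    mtd_minutes: int   # maximum tolerable downtime from the BIA
    rto_minutes: int   # recovery time objective chosen for the application
    rpo_minutes: int   # recovery point objective chosen for the application


def find_target_violations(apps):
    """Return the names of applications whose RTO exceeds their MTD."""
    return [a.name for a in apps if a.rto_minutes > a.mtd_minutes]


# Illustrative figures only.
apps = [
    AppTargets("payments", mtd_minutes=15, rto_minutes=5, rpo_minutes=0),
    AppTargets("hris", mtd_minutes=240, rto_minutes=480, rpo_minutes=60),
]
violations = find_target_violations(apps)
print(violations)
```

Any application reported here needs either a faster recovery strategy or a revised BIA before the DR plan is signed off.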
DR Strategy Models
DISASTER RECOVERY STRATEGY MODELS
===================================
MODEL 1: ACTIVE-ACTIVE (Highest Availability)
Architecture:
→ Two (or more) fully operational sites processing traffic simultaneously
→ Load balanced across sites (global server load balancer, DNS-based routing)
→ Synchronous data replication between sites (zero data loss)
→ Both sites fully staffed and monitored
RTO: Near-zero (automatic failover within seconds to minutes)
RPO: Zero (synchronous replication)
Advantages:
→ No downtime during failover (users may not even notice)
→ Zero data loss
→ Load distribution improves performance
→ Geographic redundancy for latency optimization
Disadvantages:
→ Highest cost (2x infrastructure, 2x licensing, 2x staffing)
→ Complex application architecture (must support multi-write)
→ Data consistency challenges (conflict resolution needed)
→ Network latency between sites limits geographic distance
Best for: Revenue-critical systems (payment processing, e-commerce, trading platforms)
Cost: 150-200% of single-site cost
MODEL 2: HOT STANDBY / ACTIVE-PASSIVE
Architecture:
→ Primary site handles all production traffic
→ DR site has full infrastructure running in standby mode
→ Asynchronous data replication (RPO: seconds to minutes)
→ DR site sized for full production workload (or right-sized for priority systems)
RTO: 15 minutes - 2 hours (DNS failover + service startup)
RPO: 1-15 minutes (asynchronous replication lag)
Advantages:
→ Good balance of cost and recovery speed
→ DR site ready to accept traffic quickly
→ Simpler than active-active (single-write architecture)
→ Can use DR site for non-production workloads (testing, development)
Disadvantages:
→ Idle infrastructure cost (DR site running but not processing production)
→ Brief downtime during failover (DNS propagation, service startup)
→ Some data loss possible (replication lag window)
Best for: Customer-facing applications, ERP systems, email, core business applications
Cost: 80-120% of single-site cost
MODEL 3: WARM STANDBY
Architecture:
→ Primary site handles all production traffic
→ DR site has minimal infrastructure (skeleton environment)
→ Database replication active; application servers not fully provisioned
→ Auto-scaling or rapid provisioning to scale DR site during failover
RTO: 2-8 hours (provisioning + data catch-up + service startup)
RPO: 5-30 minutes (asynchronous replication)
Advantages:
→ Lower ongoing cost (DR site minimally provisioned)
→ Scalable during failover (cloud-based provisioning)
→ Good for organizations with variable workload patterns
Disadvantages:
→ Longer RTO (time to provision and scale)
→ Failover testing more complex (must validate provisioning automation)
→ Risk of provisioning failures during actual disaster
Best for: Internal applications, development environments, non-customer-facing systems
Cost: 30-50% of single-site cost
MODEL 4: PILOT LIGHT
Architecture:
→ Primary site handles all production traffic
→ DR site maintains core infrastructure only (network, DNS, key databases)
→ Critical data replicated continuously; full application environment not pre-provisioned
→ Automated scripts to provision full environment from templates during failover
RTO: 4-12 hours
RPO: 15 minutes - 1 hour
Advantages:
→ Minimal ongoing cost (only core services running in DR)
→ Core data always available and current
→ Infrastructure as code enables rapid environment provisioning
Disadvantages:
→ Longer RTO (must provision full environment)
→ Requires mature IaC and automation capabilities
→ Failover testing essential (provisioning must work when needed)
Best for: Organizations with strong automation capabilities; moderate-criticality systems
Cost: 15-30% of single-site cost
MODEL 5: COLD SITE / BACKUP-ONLY
Architecture:
→ Primary site handles all production traffic
→ DR site is empty facility or cloud account (no running infrastructure)
→ Data backed up to DR location (not replicated in real-time)
→ Full environment built from scratch during recovery
RTO: 24-72 hours (or longer)
RPO: 24 hours (daily backup window)
Advantages:
→ Lowest ongoing cost
→ Simplest architecture
→ Meets basic compliance requirements
Disadvantages:
→ Very long recovery time
→ Significant data loss possible
→ Full rebuild required (error-prone under pressure)
→ Staff may need to relocate to DR site
Best for: Non-critical systems, archival data, organizations with extended tolerance for downtime
Cost: 5-15% of single-site cost
STRATEGY SELECTION MATRIX:
Application           | RTO Target | RPO Target | Recommended Strategy
──────────────────────|────────────|────────────|─────────────────────
Payment processing    | < 5 min    | 0          | Active-Active
Customer-facing API   | < 30 min   | < 5 min    | Hot Standby
Email (O365/G Suite)  | N/A        | N/A        | Provider-managed DR
ERP (SAP/Oracle)      | < 2 hours  | < 15 min   | Hot Standby
Internal HRIS         | < 8 hours  | < 1 hour   | Warm Standby
Development env       | < 24 hours | < 24 hours | Pilot Light
File shares           | < 24 hours | < 24 hours | Backup + Cloud
Archive / Compliance  | < 72 hours | < 24 hours | Cold Site
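One way to apply the models above mechanically is to pick the cheapest model whose worst-case RTO and RPO still meet the targets. The sketch below does this using the upper end of each model's RTO/RPO range. That is a deliberately conservative assumption, so it can recommend a more expensive model than the matrix does. The 5-minute figure for Active-Active is also an assumption standing in for "near-zero".

```python
# (model name, worst-case RTO in minutes, worst-case RPO in minutes),
# ordered from cheapest to most expensive per the cost ranges above.
MODELS = [
    ("Cold Site", 72 * 60, 24 * 60),
    ("Pilot Light", 12 * 60, 60),
    ("Warm Standby", 8 * 60, 30),
    ("Hot Standby", 2 * 60, 15),
    ("Active-Active", 5, 0),  # "near-zero" RTO taken as 5 minutes (assumption)
]


def recommend_strategy(rto_target_min, rpo_target_min):
    """Return the cheapest model whose worst case meets both targets, or None."""
    for name, rto_max, rpo_max in MODELS:
        if rto_max <= rto_target_min and rpo_max <= rpo_target_min:
            return name
    return None


print(recommend_strategy(8 * 60, 60))    # internal HRIS targets
print(recommend_strategy(72 * 60, 1440))  # archive targets
print(recommend_strategy(5, 0))           # payment processing targets
```

Using best-case figures instead would select cheaper models but leaves no margin when a failover runs slowly, so treat the output as a starting point for discussion rather than a decision.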
DR Architecture Design
CLOUD-BASED DR ARCHITECTURE (AWS EXAMPLE)
============================================
PRIMARY SITE: us-east-1 (N. Virginia)
DR SITE: us-west-2 (Oregon)
NETWORK ARCHITECTURE:
Primary Region (us-east-1):
→ VPC: 10.0.0.0/16
→ Public subnets: 10.0.1.0/24, 10.0.2.0/24 (load balancers, NAT gateways)
→ Private subnets: 10.0.10.0/24 - 10.0.20.0/24 (application tiers)
→ Database subnet: 10.0.30.0/24, 10.0.31.0/24, 10.0.32.0/24 (Multi-AZ)
→ Transit Gateway: Connects VPCs, VPN, Direct Connect
DR Region (us-west-2):
→ VPC: 10.1.0.0/16 (different CIDR to avoid overlap)
→ Mirrored subnet structure (public, private, database)
→ Transit Gateway: Peered with primary region Transit Gateway
→ Global Accelerator: Static anycast IPs with health-check-based endpoint failover (not DNS-based; an alternative or complement to Route 53 failover)
Inter-Region Connectivity:
→ VPC peering: Direct peering between primary and DR VPCs
→ Transit Gateway inter-region peering: For multi-VPC connectivity
→ AWS Direct Connect: Dedicated connection (if on-prem involved)
→ Latency: ~60ms between us-east-1 and us-west-2
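The non-overlap requirement for the two VPC CIDRs can be verified before any peering is attempted. A minimal sketch using Python's standard `ipaddress` module and the CIDRs listed above; the helper names are illustrative.

```python
import ipaddress

primary_vpc = ipaddress.ip_network("10.0.0.0/16")
dr_vpc = ipaddress.ip_network("10.1.0.0/16")


def cidrs_overlap(a, b):
    """True if the two networks share any address (peering would conflict)."""
    return a.overlaps(b)


def subnets_outside(vpc, subnets):
    """Return the subnets (as strings) that do not fall inside the VPC CIDR."""
    return [s for s in subnets if not ipaddress.ip_network(s).subnet_of(vpc)]


print(cidrs_overlap(primary_vpc, dr_vpc))
print(subnets_outside(primary_vpc, ["10.0.1.0/24", "10.0.30.0/24", "10.1.1.0/24"]))
```

Running the same check against on-prem ranges is worthwhile when Direct Connect or VPN is involved, since overlapping on-prem and DR ranges cause the same routing conflicts.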
DATA REPLICATION:
Databases:
→ RDS (PostgreSQL/MySQL): Cross-region read replica
→ Replication lag: 1-5 seconds typically
→ Failover: Promote read replica to standalone (5-15 minutes)
→ Aurora: Global Database (cross-region replication)
→ Replication lag: < 1 second
→ Failover: < 2 minutes to promote secondary
→ DynamoDB: Global Tables (multi-region active-active)
→ Replication: Automatic, near-real-time
→ Conflict resolution: Last-writer-wins (built in, not configurable)
→ Redshift: Cross-region snapshots (hourly)
→ Restore time: Proportional to data size
Storage:
→ S3: Cross-Region Replication (CRR)
→ Replication: Near-real-time
→ Failover: Repoint applications to the DR bucket (configuration change or S3 Multi-Region Access Point)
→ EBS: Cross-region snapshots (scheduled, hourly/daily)
→ Restore: Create volume from snapshot in DR region
→ EFS: EFS Replication to a file system in the DR region (or scheduled copies via AWS Backup)
Compute:
→ EC2: Not replicated (stateless design); provision from AMI in DR region
→ AMI: Copied to DR region (automated via AWS Backup or script)
→ Auto Scaling: Pre-configured launch templates in DR region
→ ECS/EKS: Infrastructure replicated; application deployed from container registry
→ Lambda: Function code and configuration deployed to the DR region via the deployment pipeline (infrastructure as code)
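The replication mechanisms listed under DATA REPLICATION give different effective RPOs. For snapshot-based replication the worst-case data loss is the snapshot interval plus the time the cross-region copy takes. For continuous replication it is the observed lag. A small sketch of both checks, with illustrative function names and sample values:

```python
def snapshot_worst_case_rpo(interval_s, copy_duration_s):
    """Worst-case data loss (seconds) for scheduled cross-region snapshots.

    A write made just after a snapshot starts is only protected once the
    next snapshot has been taken and copied to the DR region.
    """
    return interval_s + copy_duration_s


def meets_rpo(observed_lag_s, rpo_target_s):
    """True if the worst observed replication lag is within the RPO target."""
    return max(observed_lag_s) <= rpo_target_s


# Hourly snapshots with an assumed 10-minute cross-region copy.
print(snapshot_worst_case_rpo(3600, 600))
# Sampled replica lag in seconds against a 15-minute RPO.
print(meets_rpo([1.2, 4.8, 3.0], 15 * 60))
```

In practice the lag samples would come from the database's own replication metrics, collected over a period long enough to include peak write load.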
DNS FAILOVER:
Route 53 Configuration:
→ Record type: Failover (primary + secondary)
→ Health checks: Every 10 seconds; 3 failures = failover
→ TTL: 60 seconds (balance between failover speed and DNS resolution overhead)
→ Routing policy: Failover (not weighted during normal operations)
→ Failover trigger: Health check failure OR manual failover (record update via AWS CLI/API, or Route 53 Application Recovery Controller routing controls)
Failover DNS Propagation:
→ With 60-second TTL: Full failover within 60-120 seconds
→ With lower TTL (10-30 seconds): Faster failover but higher DNS query load
→ Client-side DNS caching: May delay failover (uncontrollable)
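The propagation figures above can be cross-checked with simple arithmetic: detection time is the health check interval multiplied by the failure threshold, and resolvers may then serve the old record for up to one TTL. A minimal sketch under those assumptions; it ignores client-side caching beyond the TTL and any delay inside the DNS service itself.

```python
def detection_seconds(check_interval_s, failure_threshold):
    """Time for consecutive failed health checks to mark the primary unhealthy."""
    return check_interval_s * failure_threshold


def worst_case_dns_failover_seconds(check_interval_s, failure_threshold, ttl_s):
    """Detection time plus one full TTL for cached records to expire."""
    return detection_seconds(check_interval_s, failure_threshold) + ttl_s


# Settings from the configuration above: 10 s checks, 3 failures, 60 s TTL.
print(worst_case_dns_failover_seconds(10, 3, 60))
# Same health checks with a 30 s TTL.
print(worst_case_dns_failover_seconds(10, 3, 30))
```

With the documented settings the estimate is 90 seconds, which falls inside the 60-120 second range quoted above.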
DR Testing Program
DISASTER RECOVERY TESTING PROGRAM
===================================
TEST TYPES AND SCHEDULE:
Tabletop Exercise (Quarterly):
→ Participants: Incident response team (IRT) members, business stakeholders, IT leadership
→ Format: Walkthrough of DR scenario; verbal discussion of actions
→ Scenarios: Data center fire, regional outage, ransomware, cloud provider outage
→ Duration: 2-4 hours
→ Objectives:
- Validate DR plan completeness
- Identify gaps in procedures
- Test communication procedures
- Verify contact list accuracy
→ Output: After-action report; improvement items
Technical Failover Test (Semi-Annually):
→ Participants: Infrastructure team, database team, application teams
→ Format: Actual failover of non-critical systems to DR site
→ Scope: One application or tier (e.g., database failover only)
→ Duration: 4-8 hours
→ Objectives:
- Validate replication health and data integrity
- Measure actual failover time (RTO validation)
- Test DNS failover mechanisms
- Verify DR environment provisioning
→ Output: Technical test report; RTO/RPO measurements; issues log
Full DR Drill (Annually):
→ Participants: Entire organization (IT + business units)
→ Format: Simulated full disaster; actual failover of production systems
→ Scope: All critical systems; business continuity operations
→ Duration: 1-3 days (including failback)
→ Objectives:
- End-to-end DR capability validation
- Business impact assessment during extended outage
- Staff readiness and training assessment
- Communication procedure validation (internal + external)
→ Preparation: 4-week advance notice; stakeholder briefing; rollback plan
→ Output: Comprehensive DR test report; executive summary; improvement plan
Unannounced DR Test (Annually — Advanced Organizations):
→ Participants: On-call team (no advance notice)
→ Format: Surprise failover during business hours
→ Scope: Single critical system (limited blast radius)
→ Duration: Until recovery (measured)
→ Objectives:
- Test real-world response (not rehearsed)
- Measure true RTO under pressure
- Identify procedural gaps only visible without preparation
→ Constraints: Limited scope; rapid failback; senior leadership aware
→ Output: Honest assessment of real-world DR readiness
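The schedule above is easy to let slip, so a simple overdue check can be run from a scheduled job. The sketch below maps each test type to a maximum age in days. The day counts (91, 182, 365) are approximations of quarterly, semi-annual, and annual cadences, and the test-type keys are illustrative names.

```python
from datetime import date

# Maximum allowed days since the last run, per the schedule above.
CADENCE_DAYS = {
    "tabletop": 91,             # quarterly
    "technical_failover": 182,  # semi-annually
    "full_drill": 365,          # annually
}


def overdue_tests(last_run, today):
    """Return test types never run or last run longer ago than their cadence."""
    overdue = []
    for test, max_age in CADENCE_DAYS.items():
        last = last_run.get(test)
        if last is None or (today - last).days > max_age:
            overdue.append(test)
    return overdue


# Illustrative dates only; the full drill has never been run.
last_run = {
    "tabletop": date(2024, 5, 1),
    "technical_failover": date(2023, 11, 1),
}
print(overdue_tests(last_run, today=date(2024, 6, 30)))
```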
DR TEST REPORT TEMPLATE:
Test Name: [Annual DR Drill 2024]
Date: [Date]
Duration: [Start time] to [End time] = [Total duration]
Scope: [Systems tested]
Scenario: [Description of simulated disaster]
Declaration: [Who declared disaster and when]
Results Summary:
→ Systems failed over: [X of Y] (success rate: Z%)
→ Systems that failed to fail over: [List and reasons]
→ Actual RTO: [X hours] vs. Target RTO: [Y hours]
→ Actual RPO: [X minutes] vs. Target RPO: [Y minutes]
→ Data integrity: [Verified / Issues found]
→ DNS failover: [Success / Issues / Time taken]
→ Communication: [Timely / Delayed / Issues]
Detailed Timeline:
[HH:MM] — Disaster declared
[HH:MM] — IRT bridge call activated
[HH:MM] — Failover procedures initiated
[HH:MM] — DNS updated
[HH:MM] — Database promoted
[HH:MM] — Application services started
[HH:MM] — Health checks passed
[HH:MM] — Users verified access
[HH:MM] — Full operations confirmed at DR site
Issues Identified:
1. [Issue description, impact, root cause]
2. [Issue description, impact, root cause]
Improvement Actions:
[ ] [Action item], Owner: [Name], Due Date: [Date]
[ ] [Action item], Owner: [Name], Due Date: [Date]
Sign-off:
DR Manager: [Name, Date]
CISO: [Name, Date]
CIO: [Name, Date]
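The Results Summary fields in the template can be derived from the timeline rather than filled in by hand. A minimal sketch, assuming actual RTO is measured from the disaster declaration to the "full operations confirmed" entry; the function name, field names, and sample values are illustrative.

```python
from datetime import datetime


def summarize_dr_test(declared_at, confirmed_at, target_rto_minutes,
                      systems_total, systems_failed_over):
    """Compute actual RTO and failover success rate for the test report."""
    actual_rto = (confirmed_at - declared_at).total_seconds() / 60
    return {
        "actual_rto_minutes": actual_rto,
        "rto_met": actual_rto <= target_rto_minutes,
        "success_rate_pct": round(100 * systems_failed_over / systems_total, 1),
    }


# Illustrative drill: declared 09:00, full operations confirmed 10:45.
summary = summarize_dr_test(
    declared_at=datetime(2024, 6, 15, 9, 0),
    confirmed_at=datetime(2024, 6, 15, 10, 45),
    target_rto_minutes=120,
    systems_total=20,
    systems_failed_over=18,
)
print(summary)
```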
Integration Points
- AWS Disaster Recovery: Elastic Disaster Recovery (DRS) provides continuous block-level replication; RPO seconds, RTO minutes; billed per replicating source server per hour plus staging-area resources (EBS volumes, snapshots, replication servers); Route 53 failover; cross-region replication for RDS, S3, DynamoDB, Aurora Global Database
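- Azure Site Recovery: Replicate on-prem VMs (VMware, Hyper-V) to Azure or Azure VMs between regions; hyper-converged infrastructure support; priced per protected instance plus storage costs (check current Azure pricing); Azure Traffic Manager or cross-region Load Balancer for traffic failover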
- GCP Disaster Recovery: VMware Engine cross-cluster replication; Cloud SQL read replicas; Storage transfer service; DNS failover with Cloud DNS health checks
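- Zerto: Cross-cloud DR platform; continuous data replication; application-aware consistency; RPO of seconds and RTO of minutes (vendor figures; validate in your own failover tests); pricing depends on license model and protected VM count (confirm with vendor)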
- Druva Disaster Recovery (DRaaS): Cloud DR for on-prem workloads; lift-and-shift to cloud; self-service failover; immutable backups; per-VM pricing
- Veritas Resiliency Platform: DR orchestration with automated failover and failback; VMware/Hyper-V support; on-prem or cloud targets; licensing per protected workload (confirm with vendor)
- Palo Alto Networks Prisma Cloud: Cloud security posture and workload protection (not a DR tool itself); use it to verify that the DR environment carries the same security policies and compliance controls as the primary site
- HashiCorp Vault: DR for secrets management; auto-unseal; DR replication between clusters (a Vault Enterprise feature); critical for DR environment credential availability
Edge Cases
- Cross-cloud DR (on-prem to cloud or cloud to cloud): Different APIs and tools per platform; Zerto or Veritas for cross-cloud replication; test cross-cloud failover thoroughly; network latency between providers; data egress costs
- Regulated data sovereignty (data cannot leave country): DR site must be in same country; limited region options; consider sovereign cloud providers; document data residency in DR plan; test with regulatory requirements in mind
- Large database recovery (10+ TB): Database replication may lag significantly; consider Aurora Global Database (<1 second lag) or database-specific DR (SQL Server Always On, Oracle Data Guard); test restore times regularly; have contingency for incomplete replication
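- Application state during failover: In-flight transactions lost; design applications for idempotency; queue-based architectures help (messages persist in a durable queue such as SQS, whereas SNS only delivers and does not store them; queues are regional, so messages held in the failed region stay unavailable until it recovers); document expected data loss window; communicate to business stakeholders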
- Staff availability during disaster: Remote work capability essential for DR team; communication tools must function during disaster (avoid dependency on failed infrastructure); pre-arranged backup communication channels (SMS, satellite phone); on-call rotation must include DR coverage
- Failback complexity: Often overlooked; failback can be more complex than initial failover; data may have been written to DR during extended outage; bidirectional replication conflict resolution; plan and test failback procedures; document failback runbook
- DR cost management: Idle DR infrastructure can cost 30-120% of primary; use cloud cost optimization (spot instances for non-critical DR resources, auto-shutdown for warm standby, pilot light model); track DR costs separately; present cost/benefit to business
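The idempotency recommendation under "Application state during failover" can be illustrated with a handler that records which message IDs it has already applied, so a message redelivered after failover has no second effect. This is a minimal in-memory sketch; the class and method names are illustrative, and a real system would keep the processed-ID set in a durable store that is itself replicated to the DR site.

```python
class IdempotentProcessor:
    """Applies each message at most once, keyed by a unique message ID."""

    def __init__(self):
        self._seen = set()  # in production: a durable, replicated store
        self.balance = 0

    def handle(self, message_id, amount):
        """Apply the message if new; return False for a duplicate delivery."""
        if message_id in self._seen:
            return False
        self.balance += amount
        self._seen.add(message_id)
        return True


processor = IdempotentProcessor()
first = processor.handle("msg-001", 100)
# The same message is redelivered after failover and must not double-apply.
second = processor.handle("msg-001", 100)
print(first, second, processor.balance)
```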