IT AI Skill

Disaster Recovery Planning

Design and maintain disaster recovery (DR) plans including site selection, failover strategies, RTO/RPO targets, failover testing, and recovery procedures for complete business continuity. Use when developing DR plans, selecting DR sites, configuring failover systems, conducting DR drills, documenting recovery procedures, or aligning DR strategy with business requirements. Triggers on phrases like "disaster recovery", "DR plan", "business continuity", "BCP", "failover", "site recovery", "DR site", "warm standby", "hot site", "cold site", "pilot light", "active-active", "active-passive", "DR test", "failback", "recovery strategy".

Disaster Recovery Planning

Comprehensive disaster recovery strategy design, implementation, and testing to ensure business continuity during catastrophic events including data center failures, natural disasters, and extended outages.

Workflow

1. Conduct business impact analysis (BIA): identify critical business functions, maximum tolerable downtime (MTD), financial impact of downtime, regulatory requirements, interdependencies between systems and departments.
2. Define DR strategy per application: RTO/RPO targets aligned with BIA; recovery strategy (active-active, active-passive, pilot light, warm standby, cold site); budget allocation.
3. Design DR architecture: DR site selection (secondary data center, cloud region), network connectivity (Direct Connect, ExpressRoute, dedicated link), data replication strategy, DNS failover.
4. Build DR environment: provision infrastructure in DR site; configure replication; establish connectivity; deploy monitoring and alerting; document runbooks.
5. Develop DR runbooks: step-by-step procedures for failover and failback; contact lists; decision matrices; escalation paths; communication templates.
6. Test DR plan: tabletop exercises (quarterly), technical failover tests (semi-annually), full DR drill (annually); measure actual RTO vs. target; document lessons learned.
7. Maintain DR readiness: keep DR environment current; update runbooks after changes; verify replication health; review contact lists; train staff on DR procedures.
8. Execute DR when needed: declare disaster; activate DR plan; execute failover; verify services; communicate status; begin recovery operations.
9. Failback to primary: stabilize primary site; replicate data from DR to primary; execute failback; verify operations; post-DR review.
10. Continuous improvement: update DR plan after every test or real event; incorporate lessons learned; adjust strategy based on business changes; annual DR plan review.
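
The first two workflow steps lend themselves to a small data model: one BIA record per business function, ranked by downtime cost, with a check that each proposed RTO fits inside the function's MTD. The sketch below is illustrative only; the `BiaEntry` class, its field names, and the sample figures are hypothetical and not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class BiaEntry:
    """One business function from the BIA (hypothetical schema)."""
    function: str
    mtd_hours: float               # maximum tolerable downtime
    downtime_cost_per_hour: float  # estimated financial impact
    proposed_rto_hours: float      # RTO proposed for the DR strategy


def rank_by_impact(entries):
    """Order functions by hourly downtime cost, most expensive first."""
    return sorted(entries, key=lambda e: e.downtime_cost_per_hour, reverse=True)


def rto_violations(entries):
    """Return the functions whose proposed RTO exceeds their MTD."""
    return [e.function for e in entries if e.proposed_rto_hours > e.mtd_hours]


# Illustrative figures only.
bia = [
    BiaEntry("Internal HRIS", mtd_hours=24, downtime_cost_per_hour=500, proposed_rto_hours=8),
    BiaEntry("Payment processing", mtd_hours=0.5, downtime_cost_per_hour=50_000, proposed_rto_hours=1),
    BiaEntry("ERP", mtd_hours=4, downtime_cost_per_hour=8_000, proposed_rto_hours=2),
]

ranked = [e.function for e in rank_by_impact(bia)]
violations = rto_violations(bia)
print(ranked)
print(violations)
```

An RTO that exceeds the MTD signals that either the target or the chosen recovery strategy needs to be revisited before any budget is allocated.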

DR Strategy Models

DISASTER RECOVERY STRATEGY MODELS
===================================

MODEL 1: ACTIVE-ACTIVE (Highest Availability)

  Architecture:
    → Two (or more) fully operational sites processing traffic simultaneously
    → Load balanced across sites (global server load balancer, DNS-based routing)
    → Synchronous data replication between sites (zero data loss)
    → Both sites fully staffed and monitored

  RTO: Near-zero (automatic failover within seconds to minutes)
  RPO: Zero (synchronous replication)

  Advantages:
    → No downtime during failover (users may not even notice)
    → Zero data loss
    → Load distribution improves performance
    → Geographic redundancy for latency optimization

  Disadvantages:
    → Highest cost (2x infrastructure, 2x licensing, 2x staffing)
    → Complex application architecture (must support multi-write)
    → Data consistency challenges (conflict resolution needed)
    → Network latency between sites limits geographic distance

  Best for: Revenue-critical systems (payment processing, e-commerce, trading platforms)
  Cost: 150-200% of single-site cost

MODEL 2: HOT STANDBY / ACTIVE-PASSIVE

  Architecture:
    → Primary site handles all production traffic
    → DR site has full infrastructure running in standby mode
    → Asynchronous data replication (typical lag: seconds to minutes)
    → DR site sized for full production workload (or right-sized for priority systems)

  RTO: 15 minutes - 2 hours (DNS failover + service startup)
  RPO: 1-15 minutes (asynchronous replication lag)

  Advantages:
    → Good balance of cost and recovery speed
    → DR site ready to accept traffic quickly
    → Simpler than active-active (single-write architecture)
    → Can use DR site for non-production workloads (testing, development)

  Disadvantages:
    → Idle infrastructure cost (DR site running but not processing production)
    → Brief downtime during failover (DNS propagation, service startup)
    → Some data loss possible (replication lag window)

  Best for: Customer-facing applications, ERP systems, email, core business applications
  Cost: 80-120% of single-site cost

MODEL 3: WARM STANDBY

  Architecture:
    → Primary site handles all production traffic
    → DR site has minimal infrastructure (skeleton environment)
    → Database replication active; application servers not fully provisioned
    → Auto-scaling or rapid provisioning to scale DR site during failover

  RTO: 2-8 hours (provisioning + data catch-up + service startup)
  RPO: 5-30 minutes (asynchronous replication)

  Advantages:
    → Lower ongoing cost (DR site minimally provisioned)
    → Scalable during failover (cloud-based provisioning)
    → Good for organizations with variable workload patterns

  Disadvantages:
    → Longer RTO (time to provision and scale)
    → Failover testing more complex (must validate provisioning automation)
    → Risk of provisioning failures during actual disaster

  Best for: Internal applications, development environments, non-customer-facing systems
  Cost: 30-50% of single-site cost

MODEL 4: PILOT LIGHT

  Architecture:
    → Primary site handles all production traffic
    → DR site maintains core infrastructure only (network, DNS, key databases)
    → Critical data replicated continuously; full application environment not pre-provisioned
    → Automated scripts to provision full environment from templates during failover

  RTO: 4-12 hours
  RPO: 15 minutes - 1 hour

  Advantages:
    → Minimal ongoing cost (only core services running in DR)
    → Core data always available and current
    → Infrastructure as code enables rapid environment provisioning

  Disadvantages:
    → Longer RTO (must provision full environment)
    → Requires mature IaC and automation capabilities
    → Failover testing essential (provisioning must work when needed)

  Best for: Organizations with strong automation capabilities; moderate-criticality systems
  Cost: 15-30% of single-site cost

MODEL 5: COLD SITE / BACKUP-ONLY

  Architecture:
    → Primary site handles all production traffic
    → DR site is empty facility or cloud account (no running infrastructure)
    → Data backed up to DR location (not replicated in real-time)
    → Full environment built from scratch during recovery

  RTO: 24-72 hours (or longer)
  RPO: 24 hours (daily backup window)

  Advantages:
    → Lowest ongoing cost
    → Simplest architecture
    → Meets basic compliance requirements

  Disadvantages:
    → Very long recovery time
    → Significant data loss possible
    → Full rebuild required (error-prone under pressure)
    → Staff may need to relocate to DR site

  Best for: Non-critical systems, archival data, organizations with extended tolerance for downtime
  Cost: 5-15% of single-site cost

STRATEGY SELECTION MATRIX:

  Application            | RTO Target | RPO Target | Recommended Strategy
  ──────────────────────|────────────|────────────|─────────────────────
  Payment processing     | < 5 min    | 0          | Active-Active
  Customer-facing API    | < 30 min   | < 5 min    | Hot Standby
  Email (O365/G Suite)   | N/A        | N/A        | Provider-managed DR
  ERP (SAP/Oracle)       | < 2 hours  | < 15 min   | Hot Standby
  Internal HRIS          | < 8 hours  | < 1 hour   | Warm Standby
  Development env        | < 24 hours | < 24 hours | Pilot Light
  File shares            | < 24 hours | < 24 hours | Backup + Cloud
  Archive / Compliance   | < 72 hours | < 24 hours | Cold Site
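
One way to reason about the matrix above is as a lookup: choose the cheapest model whose worst-case RTO and RPO still meet the targets. The sketch below encodes the upper ends of the ranges quoted for each model. It is an assumption-laden illustration: the `DrModel` class and `select_model` function are hypothetical, active-active "near-zero" RTO is taken as 5 minutes, and targets are treated as inclusive bounds. Because it judges models by their worst case, it is more conservative than the matrix in places (a 30-minute RTO lands on active-active, since hot standby can take up to 2 hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrModel:
    name: str
    worst_rto_min: float  # upper end of the RTO range quoted for the model
    worst_rpo_min: float  # upper end of the RPO range quoted for the model
    cost_pct: tuple       # (low, high) as % of single-site cost


# Ordered cheapest first. The 5-minute active-active RTO is an assumption.
MODELS = [
    DrModel("Cold Site", 72 * 60, 24 * 60, (5, 15)),
    DrModel("Pilot Light", 12 * 60, 60, (15, 30)),
    DrModel("Warm Standby", 8 * 60, 30, (30, 50)),
    DrModel("Hot Standby", 2 * 60, 15, (80, 120)),
    DrModel("Active-Active", 5, 0, (150, 200)),
]


def select_model(rto_target_min, rpo_target_min):
    """Return the cheapest model whose worst-case RTO and RPO meet the targets."""
    for model in MODELS:
        if model.worst_rto_min <= rto_target_min and model.worst_rpo_min <= rpo_target_min:
            return model
    return None  # targets are tighter than any model listed here


for app, rto, rpo in [
    ("Payment processing", 5, 0),
    ("ERP", 120, 15),
    ("Internal HRIS", 480, 60),
    ("Archive / Compliance", 72 * 60, 24 * 60),
]:
    choice = select_model(rto, rpo)
    print(f"{app}: {choice.name} ({choice.cost_pct[0]}-{choice.cost_pct[1]}% of single-site cost)")
```

Treat the result as a starting point for discussion; compliance rules, existing contracts, and application architecture often override a purely numeric choice.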

DR Architecture Design

CLOUD-BASED DR ARCHITECTURE (AWS EXAMPLE)
============================================

PRIMARY SITE: us-east-1 (N. Virginia)
DR SITE: us-west-2 (Oregon)

NETWORK ARCHITECTURE:

  Primary Region (us-east-1):
    → VPC: 10.0.0.0/16
    → Public subnets: 10.0.1.0/24, 10.0.2.0/24 (load balancers, NAT gateways)
    → Private subnets: 10.0.10.0/24 - 10.0.20.0/24 (application tiers)
    → Database subnet: 10.0.30.0/24, 10.0.31.0/24, 10.0.32.0/24 (Multi-AZ)
    → Transit Gateway: Connects VPCs, VPN, Direct Connect

  DR Region (us-west-2):
    → VPC: 10.1.0.0/16 (different CIDR to avoid overlap)
    → Mirrored subnet structure (public, private, database)
    → Transit Gateway: Peered with primary region Transit Gateway
    → Global Accelerator: Static anycast IPs with health-check-based failover between regional endpoints (not DNS-based)

  Inter-Region Connectivity:
    → VPC peering: Direct peering between primary and DR VPCs
    → Transit Gateway inter-region peering: For multi-VPC connectivity
    → AWS Direct Connect: Dedicated connection (if on-prem involved)
    → Latency: ~60ms between us-east-1 and us-west-2
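
The address plan above can be checked mechanically before anything is provisioned. The sketch below uses Python's standard `ipaddress` module to confirm that the two VPC ranges do not overlap, that every primary subnet sits inside the primary VPC, and to derive the mirrored DR subnet for a given primary subnet. The helper names `validate_layout` and `mirror_subnet` are hypothetical, and only the first and last private subnets from the stated range are listed.

```python
import ipaddress

PRIMARY_VPC = ipaddress.ip_network("10.0.0.0/16")
DR_VPC = ipaddress.ip_network("10.1.0.0/16")

PRIMARY_SUBNETS = [
    "10.0.1.0/24", "10.0.2.0/24",                    # public
    "10.0.10.0/24", "10.0.20.0/24",                  # first and last private subnets
    "10.0.30.0/24", "10.0.31.0/24", "10.0.32.0/24",  # database
]


def validate_layout(primary, dr, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    if primary.overlaps(dr):
        problems.append(f"{primary} overlaps {dr}")
    nets = [ipaddress.ip_network(s) for s in subnets]
    for net in nets:
        if not net.subnet_of(primary):
            problems.append(f"{net} is outside {primary}")
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{a} overlaps {b}")
    return problems


def mirror_subnet(subnet, primary, dr):
    """Map a primary subnet to the same offset inside the DR VPC."""
    net = ipaddress.ip_network(subnet)
    offset = int(net.network_address) - int(primary.network_address)
    mirrored_base = ipaddress.ip_address(int(dr.network_address) + offset)
    return ipaddress.ip_network(f"{mirrored_base}/{net.prefixlen}")


print(validate_layout(PRIMARY_VPC, DR_VPC, PRIMARY_SUBNETS))
print(mirror_subnet("10.0.30.0/24", PRIMARY_VPC, DR_VPC))
```

Non-overlapping ranges matter because peering and Transit Gateway routing cannot disambiguate two networks that share address space.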

DATA REPLICATION:

  Databases:
    → RDS (PostgreSQL/MySQL): Cross-region read replica
       → Replication lag: 1-5 seconds typically
       → Failover: Promote read replica to standalone (5-15 minutes)
    → Aurora: Global Database (cross-region replication)
       → Replication lag: < 1 second
       → Failover: < 2 minutes to promote secondary
    → DynamoDB: Global Tables (multi-region active-active)
       → Replication: Automatic, near-real-time
       → Conflict resolution: Last-writer-wins (not configurable)
    → Redshift: Cross-region snapshots (hourly)
       → Restore time: Proportional to data size

  Storage:
    → S3: Cross-Region Replication (CRR)
       → Replication: Near-real-time
       → Failover: Update DNS to point to DR bucket
    → EBS: Cross-region snapshots (scheduled, hourly/daily)
       → Restore: Create volume from snapshot in DR region
    → EFS: EFS Replication to a file system in the DR region (or cross-region backup copy via AWS Backup)

  Compute:
    → EC2: Not replicated (stateless design); provision from AMI in DR region
    → AMI: Copied to DR region (automated via AWS Backup or script)
    → Auto Scaling: Pre-configured launch templates in DR region
    → ECS/EKS: Infrastructure replicated; application deployed from container registry
    → Lambda: Function code and configuration redeployed to the DR region via the deployment pipeline
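
The replication mechanisms above imply very different worst-case data loss. A rough way to compare them against an RPO target: for continuous replication the exposure is approximately the current lag, while for scheduled snapshots it is the snapshot interval plus the time to copy the snapshot to the DR region. The sketch below illustrates that arithmetic; the function names and the 10-minute copy duration are assumptions for illustration.

```python
def continuous_rpo_seconds(replication_lag_s):
    """Continuous replication: data at risk is roughly the current lag."""
    return replication_lag_s


def snapshot_rpo_seconds(interval_s, copy_duration_s=0):
    """Scheduled snapshots: worst case is a failure just before the next
    snapshot has finished copying to the DR region."""
    return interval_s + copy_duration_s


def meets_rpo(worst_case_s, rpo_target_s):
    """True when the worst-case exposure fits inside the RPO target."""
    return worst_case_s <= rpo_target_s


aurora = continuous_rpo_seconds(1)          # "< 1 second" lag quoted above
ebs_hourly = snapshot_rpo_seconds(          # hourly snapshots
    interval_s=3600, copy_duration_s=600)   # 10-minute copy is an assumed figure

print(meets_rpo(aurora, rpo_target_s=5 * 60))       # 5-minute RPO target
print(meets_rpo(ebs_hourly, rpo_target_s=15 * 60))  # 15-minute RPO target
```

Under these assumptions an hourly snapshot schedule cannot satisfy a 15-minute RPO, which is why snapshot-only protection suits the lower tiers of the strategy matrix.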

DNS FAILOVER:

  Route 53 Configuration:
    → Records: Primary and secondary failover records (A/AAAA or alias)
    → Health checks: Every 10 seconds; 3 failures = failover
    → TTL: 60 seconds (balance between failover speed and DNS resolution overhead)
    → Routing policy: Failover (not weighted during normal operations)
    → Failover trigger: Health check failure OR manual failover (AWS CLI or Route 53 API)

  Failover DNS Propagation:
    → With 60-second TTL: Full failover within 60-120 seconds
    → With lower TTL (10-30 seconds): Faster failover but higher DNS query load
    → Client-side DNS caching: May delay failover (uncontrollable)
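
The failover window quoted above follows from simple arithmetic: detection time (health check interval multiplied by the failure threshold) plus the record TTL. The sketch below works through that estimate. It is a simplification that ignores client-side caching and any aggregation delay across health checkers, and the function names are hypothetical.

```python
def detection_seconds(check_interval_s, failure_threshold):
    """Time for consecutive failed health checks to mark the endpoint unhealthy."""
    return check_interval_s * failure_threshold


def worst_case_failover_seconds(check_interval_s, failure_threshold, ttl_s):
    """Detection time plus one full TTL for resolvers to pick up the new answer."""
    return detection_seconds(check_interval_s, failure_threshold) + ttl_s


fast = worst_case_failover_seconds(check_interval_s=10, failure_threshold=3, ttl_s=60)
lower_ttl = worst_case_failover_seconds(check_interval_s=10, failure_threshold=3, ttl_s=30)
standard = worst_case_failover_seconds(check_interval_s=30, failure_threshold=3, ttl_s=60)

print(fast, lower_ttl, standard)
```

With 10-second checks, a threshold of 3, and a 60-second TTL the estimate is 90 seconds, which falls inside the 60-120 second range stated above.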

DR Testing Program

DISASTER RECOVERY TESTING PROGRAM
===================================

TEST TYPES AND SCHEDULE:

  Tabletop Exercise (Quarterly):

    → Participants: Incident response team (IRT) members, business stakeholders, IT leadership
    → Format: Walkthrough of DR scenario; verbal discussion of actions
    → Scenarios: Data center fire, regional outage, ransomware, cloud provider outage
    → Duration: 2-4 hours
    → Objectives:
       - Validate DR plan completeness
       - Identify gaps in procedures
       - Test communication procedures
       - Verify contact list accuracy
    → Output: After-action report; improvement items

  Technical Failover Test (Semi-Annually):

    → Participants: Infrastructure team, database team, application teams
    → Format: Actual failover of non-critical systems to DR site
    → Scope: One application or tier (e.g., database failover only)
    → Duration: 4-8 hours
    → Objectives:
       - Validate replication health and data integrity
       - Measure actual failover time (RTO validation)
       - Test DNS failover mechanisms
       - Verify DR environment provisioning
    → Output: Technical test report; RTO/RPO measurements; issues log

  Full DR Drill (Annually):

    → Participants: Entire organization (IT + business units)
    → Format: Simulated full disaster; actual failover of production systems
    → Scope: All critical systems; business continuity operations
    → Duration: 1-3 days (including failback)
    → Objectives:
       - End-to-end DR capability validation
       - Business impact assessment during extended outage
       - Staff readiness and training assessment
       - Communication procedure validation (internal + external)
    → Preparation: 4-week advance notice; stakeholder briefing; rollback plan
    → Output: Comprehensive DR test report; executive summary; improvement plan

  Unannounced DR Test (Annually — Advanced Organizations):

    → Participants: On-call team (no advance notice)
    → Format: Surprise failover during business hours
    → Scope: Single critical system (limited blast radius)
    → Duration: Until recovery (measured)
    → Objectives:
       - Test real-world response (not rehearsed)
       - Measure true RTO under pressure
       - Identify procedural gaps only visible without preparation
    → Constraints: Limited scope; rapid failback; senior leadership aware
    → Output: Honest assessment of real-world DR readiness
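
The schedule above can be tracked with a small check that flags any test type whose last run is older than its cadence allows. The sketch below is illustrative: the `CADENCE_DAYS` keys, the day counts chosen for "quarterly" and "semi-annually", and the sample dates are all assumptions.

```python
from datetime import date

# Maximum days allowed between runs (approximations of the stated cadence).
CADENCE_DAYS = {
    "tabletop": 91,             # quarterly
    "technical_failover": 182,  # semi-annually
    "full_drill": 365,          # annually
    "unannounced": 365,         # annually
}


def overdue_tests(last_run, today):
    """Return test types that were never run or are past their cadence."""
    overdue = []
    for test_type, max_age_days in CADENCE_DAYS.items():
        last = last_run.get(test_type)
        if last is None or (today - last).days > max_age_days:
            overdue.append(test_type)
    return overdue


# Sample history; the unannounced test has never been run.
history = {
    "tabletop": date(2025, 1, 15),
    "technical_failover": date(2025, 3, 1),
    "full_drill": date(2024, 9, 10),
}

overdue = overdue_tests(history, today=date(2025, 6, 30))
print(overdue)
```

A report like this fits naturally into the "Maintain DR readiness" step, alongside replication health and contact list reviews.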

DR TEST REPORT TEMPLATE:

  Test Name: [Annual DR Drill 2024]
  Date: [Date]
  Duration: [Start time] to [End time] = [Total duration]
  Scope: [Systems tested]

  Scenario: [Description of simulated disaster]
  Declaration: [Who declared disaster and when]

  Results Summary:
    → Systems failed over: [X of Y] (success rate: Z%)
    → Systems failed to failover: [List and reasons]
    → Actual RTO: [X hours] vs. Target RTO: [Y hours]
    → Actual RPO: [X minutes] vs. Target RPO: [Y minutes]
    → Data integrity: [Verified / Issues found]
    → DNS failover: [Success / Issues / Time taken]
    → Communication: [Timely / Delayed / Issues]

  Detailed Timeline:
    [HH:MM] — Disaster declared
    [HH:MM] — IRT bridge call activated
    [HH:MM] — Failover procedures initiated
    [HH:MM] — DNS updated
    [HH:MM] — Database promoted
    [HH:MM] — Application services started
    [HH:MM] — Health checks passed
    [HH:MM] — Users verified access
    [HH:MM] — Full operations confirmed at DR site

  Issues Identified:
    1. [Issue description, impact, root cause]
    2. [Issue description, impact, root cause]

  Improvement Actions:
    [ ] [Action item], Owner: [Name], Due Date: [Date]
    [ ] [Action item], Owner: [Name], Due Date: [Date]

  Sign-off:
    DR Manager: [Name, Date]
    CISO: [Name, Date]
    CIO: [Name, Date]
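
The Results Summary fields in the template can be derived from the timeline and per-system outcomes rather than filled in by hand. The sketch below computes actual RTO as the time from disaster declaration to confirmed operations at the DR site and compares it with the target. The `summarize` function, the dictionary layout, and the sample systems are hypothetical, and the timestamps are assumed to fall on the same day.

```python
from datetime import datetime


def elapsed_minutes(start, end, fmt="%H:%M"):
    """Minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def summarize(timeline, results, target_rto_min):
    """Build the Results Summary from timeline entries and per-system outcomes."""
    actual_rto = elapsed_minutes(
        timeline["Disaster declared"],
        timeline["Full operations confirmed at DR site"],
    )
    succeeded = sum(1 for ok in results.values() if ok)
    return {
        "actual_rto_min": actual_rto,
        "rto_met": actual_rto <= target_rto_min,
        "failed_over": f"{succeeded} of {len(results)}",
        "success_rate_pct": round(100 * succeeded / len(results), 1),
        "failed_systems": [name for name, ok in results.items() if not ok],
    }


# Sample drill data (illustrative only).
timeline = {
    "Disaster declared": "09:00",
    "Database promoted": "09:40",
    "Full operations confirmed at DR site": "10:45",
}
results = {"orders-api": True, "erp": True, "reporting": False, "hris": True}

summary = summarize(timeline, results, target_rto_min=120)
print(summary)
```

For a multi-day drill, full date-time stamps would be needed in place of HH:MM values.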

Integration Points

AWS Disaster Recovery: Elastic Disaster Recovery (DRS) with continuous block-level replication; RPO seconds, RTO minutes; priced per replicating source server per hour plus staging-area resources; Route 53 failover; Cross-region replication for RDS, S3, DynamoDB, Aurora Global
Azure Site Recovery: Replicate on-prem VMs to Azure or between Azure regions; hyper-converged infrastructure support; priced per protected instance per month plus storage costs; Azure Load Balancer failover
GCP Disaster Recovery: VMware Engine cross-cluster replication; Cloud SQL read replicas; Storage Transfer Service; DNS failover with Cloud DNS health checks
Zerto: Cross-cloud DR platform; continuous data replication; application-aware consistency; RPO < 1 second; RTO < 5 minutes; $15/TB/month + VM licensing
Druva: Cloud DR for on-prem workloads; lift-and-shift to cloud; self-service failover; immutable backups; per-VM pricing
Veritas: Cloud DR as a service; VMware/Hyper-V support; predictable pricing per protected VM
Palo Alto Networks Prisma Cloud: Cloud security posture and workload protection rather than a DR product; use it to keep security policies consistent between primary and DR environments
HashiCorp Vault: DR for secrets management; auto-unseal; DR replication between clusters (Vault Enterprise feature); critical for DR environment credential availability

Edge Cases

Cross-cloud DR (on-prem to cloud or cloud to cloud): Different APIs and tools per platform; Zerto or Veritas for cross-cloud replication; test cross-cloud failover thoroughly; network latency between providers; data egress costs
Regulated data sovereignty (data cannot leave country): DR site must be in same country; limited region options; consider sovereign cloud providers; document data residency in DR plan; test with regulatory requirements in mind
Large database recovery (10+ TB): Database replication may lag significantly; consider Aurora Global Database (<1 second lag) or database-specific DR (SQL Server Always On, Oracle Data Guard); test restore times regularly; have contingency for incomplete replication
Application state during failover: In-flight transactions lost; design applications for idempotency; queue-based architectures help (messages persist in a durable queue such as SQS until consumed; SNS alone does not retain messages); document expected data loss window; communicate to business stakeholders
Staff availability during disaster: Remote work capability essential for DR team; communication tools must function during disaster (avoid dependency on failed infrastructure); pre-arranged backup communication channels (SMS, satellite phone); on-call rotation must include DR coverage
Failback complexity: Often overlooked; failback can be more complex than initial failover; data may have been written to DR during extended outage; bidirectional replication conflict resolution; plan and test failback procedures; document failback runbook
DR cost management: Idle DR infrastructure can cost 30-120% of primary; use cloud cost optimization (spot instances for non-critical DR resources, auto-shutdown for warm standby, pilot light model); track DR costs separately; present cost/benefit to business
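
The "Application state during failover" case above recommends idempotent processing, because queued messages are often redelivered after a failover. The sketch below shows the core idea: each message carries an idempotency key, and a message whose key has already been applied is skipped. The `IdempotentProcessor` class and message fields are hypothetical, and the in-memory set stands in for a durable store that would need to be replicated to the DR site.

```python
class IdempotentProcessor:
    """Applies each message at most once, keyed by an idempotency key.

    The in-memory set is a stand-in for a durable, replicated store
    (an assumption; a real system needs persistence across sites).
    """

    def __init__(self):
        self._seen = set()
        self.balance = 0

    def handle(self, message):
        """Apply the message unless its key was already processed."""
        key = message["idempotency_key"]
        if key in self._seen:
            return False  # duplicate delivery, e.g. a replay after failover
        self.balance += message["amount"]
        self._seen.add(key)
        return True


processor = IdempotentProcessor()
deliveries = [
    {"idempotency_key": "txn-1", "amount": 100},
    {"idempotency_key": "txn-2", "amount": -30},
    {"idempotency_key": "txn-2", "amount": -30},  # redelivered after failover
]

applied = [processor.handle(m) for m in deliveries]
print(applied, processor.balance)
```

Without the key check, the replayed message would be applied twice and the balance would be wrong after recovery.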

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.