IT AI Skill

Disaster Recovery

Plan, implement, and test disaster recovery capabilities including business continuity planning, RTO/RPO definition, failover architecture, DR testing, crisis communication, and recovery execution. Use when creating DR plans, defining recovery objectives, d...

Disaster Recovery & Business Continuity

Ensure organizational resilience through comprehensive disaster recovery planning, testing, and execution.

Workflow

1. Business Impact Analysis & Planning

  1. Business impact analysis (BIA)
  2. Risk assessment and scenario planning
  3. DR plan development

2. DR Architecture & Infrastructure

  1. Data protection strategy
  2. Failover infrastructure design
  3. Technology stack resilience

3. DR Testing & Validation

  1. Testing strategy and cadence
  2. DR test execution
  3. Post-test analysis and improvement

4. Incident Response & Recovery Execution

  1. Incident declaration and activation
  2. Recovery execution
  3. Crisis communication

5. Recovery & Restoration

  1. Service restoration
  2. Post-disaster review
  3. Continuous improvement

Templates & Frameworks

DR Plan Framework

DISASTER RECOVERY PLAN FRAMEWORK
=================================

RECOVERY OBJECTIVES:
  Tier 1 (Mission Critical):
    RTO: < 1 hour
    RPO: < 15 minutes
    Systems: Core transaction processing, customer-facing applications

  Tier 2 (Business Critical):
    RTO: < 4 hours
    RPO: < 1 hour
    Systems: Internal applications, reporting, collaboration

  Tier 3 (Standard):
    RTO: < 24 hours
    RPO: < 4 hours
    Systems: Development, test, archival systems

DR SITE CONFIGURATION:
  Primary site: [Location] — Full production
  DR site: [Cloud Region / Secondary Location] — Warm standby
  Failover mode: Semi-automated (manual activation, automated execution)
  DR capacity: 100% of production capacity
  Network connectivity: Dedicated fiber + satellite backup

RECOVERY SEQUENCE:
  Phase 1 (0-30 min): Core infrastructure (network, identity, DNS)
  Phase 2 (30-60 min): Tier 1 applications and databases
  Phase 3 (1-4 hours): Tier 2 applications and services
  Phase 4 (4-24 hours): Tier 3 systems and full restoration

DR TEAM ROSTER:
  DR Commander: [Name, role, phone, email, alternate]
  IT Lead: [Name, role, phone, email, alternate]
  Application Lead: [Name, role, phone, email, alternate]
  Data Lead: [Name, role, phone, email, alternate]
  Communications Lead: [Name, role, phone, email, alternate]
  Facilities Lead: [Name, role, phone, email, alternate]

CONTACT DISTRIBUTION:
  Internal: Executive team, department heads, all employees
  External: Customers, partners, regulators, insurers
  Communication channels: SMS, email, phone tree, mass notification system
  Update frequency: Every 2 hours during active incident

TESTING CADENCE:
  Documentation review: Quarterly
  Tabletop exercise: Semi-annually
  Simulation test: Annually
  Full failover test: Annually (off-peak)
  After-action review: Within 5 business days of each test
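
The recovery objectives and recovery sequence above can be kept in machine-readable form so that scripts and dashboards work from the same targets as the written plan. The following is a minimal Python sketch under stated assumptions: the names `Tier`, `TIERS`, `PHASES`, `meets_objectives`, and `current_phase` are hypothetical, and the strict "less than" comparison mirrors the "<" used in the targets.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data-loss window


# Targets copied from RECOVERY OBJECTIVES in the framework above.
TIERS = {
    1: Tier("Mission Critical", timedelta(hours=1), timedelta(minutes=15)),
    2: Tier("Business Critical", timedelta(hours=4), timedelta(hours=1)),
    3: Tier("Standard", timedelta(hours=24), timedelta(hours=4)),
}

# (start, end, scope) windows copied from RECOVERY SEQUENCE in the framework above.
PHASES = [
    (timedelta(0), timedelta(minutes=30), "Phase 1: core infrastructure"),
    (timedelta(minutes=30), timedelta(minutes=60), "Phase 2: Tier 1 applications and databases"),
    (timedelta(hours=1), timedelta(hours=4), "Phase 3: Tier 2 applications and services"),
    (timedelta(hours=4), timedelta(hours=24), "Phase 4: Tier 3 systems and full restoration"),
]


def meets_objectives(tier: int, downtime: timedelta, data_loss: timedelta) -> bool:
    """True when both measured values are strictly under the tier targets."""
    target = TIERS[tier]
    return downtime < target.rto and data_loss < target.rpo


def current_phase(elapsed: timedelta) -> str:
    """Return the recovery phase that should be in progress at `elapsed`."""
    for start, end, label in PHASES:
        if start <= elapsed < end:
            return label
    # Assumption: anything outside the planned windows needs escalation.
    return "Beyond the planned 24-hour window: escalate"
```

For example, `current_phase(timedelta(minutes=45))` falls in the 30-60 minute window and returns the Phase 2 label, and `meets_objectives(1, timedelta(minutes=45), timedelta(minutes=12))` returns `True`.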

DR Test After-Action Report

DR TEST AFTER-ACTION REPORT
==============================

TEST DETAILS:
  Test type: [Tabletop / Simulation / Parallel / Full Failover]
  Date: [Date]
  Duration: [Duration]
  Scenario: [Disaster scenario tested]
  Participants: [Names, roles]
  DR plan version: [Version]

OBJECTIVE ASSESSMENT:
  Test objectives defined: [List]
  Objectives achieved: [X of Y]
  Overall result: [Pass / Partial Pass / Fail]

RTO/RPO ACHIEVEMENT:
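  Tier 1 systems:
    Target RTO: < 1 hour | Actual: [X minutes] | Result: [Pass/Fail]
    Target RPO: < 15 min | Actual: [X minutes] | Result: [Pass/Fail]
  Tier 2 systems:
    Target RTO: < 4 hours | Actual: [X hours] | Result: [Pass/Fail]
    Target RPO: < 1 hour | Actual: [X minutes] | Result: [Pass/Fail]
  Tier 3 systems:
    Target RTO: < 24 hours | Actual: [X hours] | Result: [Pass/Fail]
    Target RPO: < 4 hours | Actual: [X hours] | Result: [Pass/Fail]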

ISSUES IDENTIFIED:
  1. [Issue description]
     Severity: [Critical / High / Medium / Low]
     Root cause: [Description]
     Impact: [Description]
     Resolution: [Immediate fix if applied]

  2. [Issue description]
     Severity: [Critical / High / Medium / Low]
     Root cause: [Description]
     Impact: [Description]
     Resolution: [Immediate fix if applied]

LESSONS LEARNED:
  What worked well: [List]
  What didn't work: [List]
  Unexpected challenges: [List]
  Plan gaps identified: [List]

IMPROVEMENT ACTIONS:
  1. [Action item] — Owner: [Name] — Due: [Date] — Priority: [High/Med/Low]
  2. [Action item] — Owner: [Name] — Due: [Date] — Priority: [High/Med/Low]
  3. [Action item] — Owner: [Name] — Due: [Date] — Priority: [High/Med/Low]

RECOMMENDATIONS:
  DR plan updates needed: [List]
  Infrastructure improvements: [List]
  Training gaps identified: [List]
  Next test focus area: [Recommendation]
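
The RTO/RPO ACHIEVEMENT and OBJECTIVE ASSESSMENT fields of the report can be filled in mechanically once the measured values are known. The sketch below is one possible way to do that in Python. The names `Measurement`, `overall_result`, and `format_line` are hypothetical, and two rules are assumptions rather than part of the template: a result passes only when the actual value is strictly below the target, and "Partial Pass" means at least one but not all measurements passed.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    label: str             # e.g. "Tier 1 RTO"
    target_minutes: float  # target from the DR plan
    actual_minutes: float  # value observed during the test

    @property
    def passed(self) -> bool:
        # Assumption: targets are strict upper bounds ("< 1 hour").
        return self.actual_minutes < self.target_minutes


def overall_result(measurements: list[Measurement]) -> str:
    """Map individual results onto Pass / Partial Pass / Fail."""
    if not measurements:
        raise ValueError("at least one measurement is required")
    passed = sum(m.passed for m in measurements)
    if passed == len(measurements):
        return "Pass"
    return "Partial Pass" if passed > 0 else "Fail"


def format_line(m: Measurement) -> str:
    """Render one line in the style of the RTO/RPO ACHIEVEMENT section."""
    verdict = "Pass" if m.passed else "Fail"
    return (
        f"{m.label}: target {m.target_minutes:g} min | "
        f"actual {m.actual_minutes:g} min | {verdict}"
    )
```

With illustrative (not measured) values, `Measurement("Tier 1 RTO", 60, 45)` passes, and pairing it with `Measurement("Tier 1 RPO", 15, 20)` makes `overall_result` return `"Partial Pass"`.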

Integration Points

Edge Cases

Output

DR Readiness Dashboard

DISASTER RECOVERY READINESS — April 2025
==========================================

OVERALL DR READINESS SCORE: 87/100 ✓

RECOVERY OBJECTIVE STATUS:
  Tier 1 (Mission Critical):
    RTO compliance: 98% ✓ (last test: 45 min, target: 60 min)
    RPO compliance: 100% ✓ (last test: 12 min, target: 15 min)
  Tier 2 (Business Critical):
    RTO compliance: 95% ✓ (last test: 3.2 hrs, target: 4 hrs)
    RPO compliance: 97% ✓ (last test: 52 min, target: 60 min)
  Tier 3 (Standard):
    RTO compliance: 100% ✓
    RPO compliance: 100% ✓

DATA PROTECTION:
  Backup coverage: 99.2% ✓
  Backup success rate: 98.7% ✓
  Last full backup verification: 3 days ago ✓
  Immutable backup status: Active ✓
  Geo-redundant replication: Active (2 regions)
  RPO actual (avg): 18 minutes

INFRASTRUCTURE READINESS:
  DR site capacity: 100% available
  Network failover: Tested ✓
  DNS failover configuration: Verified ✓
  Identity redundancy: Active ✓
  Security infrastructure: Redundant ✓

TESTING STATUS:
  Last tabletop exercise: 45 days ago ✓
  Last simulation test: 120 days ago
  Last full failover test: 340 days ago (due within 30 days)
  Test pass rate (last 4 tests): 100% ✓
  Open issues from last test: 3 (all medium, resolution in progress)

PLAN GOVERNANCE:
  DR plan version: 4.2 (current)
  Last plan review: 30 days ago ✓
  Contact list accuracy: Verified within 30 days ✓
  Team training completion: 94%
  Awareness campaign: Active

IMPROVEMENT TRACKING:
  Open improvement actions: 7
  Completed this quarter: 12
  Infrastructure upgrades planned: 3
  Next test scheduled: [Date]
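
The "due within 30 days" note under TESTING STATUS follows from comparing each test's age with the TESTING CADENCE in the plan framework. A minimal Python sketch of that check follows. The names `CADENCE_DAYS`, `days_until_due`, and `status` are hypothetical, and the day counts chosen for "quarterly", "semi-annually", and "annually" are assumptions.

```python
from datetime import date, timedelta

# Maximum interval between runs, taken from TESTING CADENCE.
# Assumption: quarterly = 91 days, semi-annually = 182 days, annually = 365 days.
CADENCE_DAYS = {
    "documentation review": 91,
    "tabletop exercise": 182,
    "simulation test": 365,
    "full failover test": 365,
}


def days_until_due(test_type: str, last_run: date, today: date) -> int:
    """Days remaining before the next run is due; negative means overdue."""
    due = last_run + timedelta(days=CADENCE_DAYS[test_type])
    return (due - today).days


def status(test_type: str, last_run: date, today: date, warn_days: int = 30) -> str:
    """Classify a test as overdue, due soon, or on schedule."""
    remaining = days_until_due(test_type, last_run, today)
    if remaining < 0:
        return f"overdue by {-remaining} days"
    if remaining <= warn_days:
        return f"due within {warn_days} days"
    return "on schedule"
```

Applied to the sample dashboard, a full failover test run 340 days ago has 25 days left on a 365-day cadence, so `status` reports "due within 30 days", while a tabletop exercise run 45 days ago is "on schedule".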

Trigger Phrases

"disaster recovery", "DR plan", "business continuity", "RTO", "RPO", "failover", "BCP", "DR test", "crisis management", "recovery architecture", "backup strategy", "failover test", "incident response", "crisis communication", "after-action review"