IT AI Skill
Disaster Recovery
Plan, implement, and test disaster recovery capabilities including business continuity planning, RTO/RPO definition, failover architecture, DR testing, crisis communication, and recovery execution. Use when creating DR plans, defining recovery objectives, d...
Disaster Recovery & Business Continuity
Ensure organizational resilience through comprehensive disaster recovery planning, testing, and execution.
Workflow
1. Business Impact Analysis & Planning
- Business impact analysis (BIA):
- Critical business function identification
- Dependency mapping (people, processes, technology, vendors)
- Financial and operational impact quantification
- Recovery time objective (RTO) definition per function
- Recovery point objective (RPO) definition per function
- Risk assessment and scenario planning:
- Threat identification (natural, technical, human, cyber)
- Vulnerability assessment
- Likelihood and impact analysis
- Risk mitigation strategy per scenario
- Worst-case scenario planning
- DR plan development:
- DR strategy selection (hot, warm, cold site, cloud)
- Recovery priority and sequence definition
- Resource requirement identification
- DR team role and responsibility assignment
- Communication plan and contact list
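The risk assessment step above can be made concrete with a simple scoring model. The following is a minimal sketch, assuming a 1 to 5 scale for both likelihood and impact and a multiplicative score; the `Scenario` class, the scale, and the sample values are illustrative assumptions, not part of this skill's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    def __post_init__(self):
        # Reject values outside the assumed 1-5 scale.
        for field_name in ("likelihood", "impact"):
            value = getattr(self, field_name)
            if not 1 <= value <= 5:
                raise ValueError(f"{field_name} must be between 1 and 5, got {value}")

    @property
    def score(self) -> int:
        # Simple multiplicative risk matrix; other weightings are equally valid.
        return self.likelihood * self.impact


def rank_scenarios(scenarios):
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)


# Illustrative inputs only; real values come from the risk assessment.
scenarios = [
    Scenario("Regional flood", likelihood=2, impact=5),
    Scenario("Ransomware", likelihood=4, impact=5),
    Scenario("Operator error", likelihood=4, impact=2),
]
ranked = rank_scenarios(scenarios)
for s in ranked:
    print(f"{s.name}: {s.score}")
```

Ranking by score gives a starting order for assigning mitigation strategies; scenarios with a low score but catastrophic impact may still deserve worst-case planning.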
2. DR Architecture & Infrastructure
- Data protection strategy:
- Backup strategy (full, incremental, differential)
- Replication technology and frequency
- Off-site and geo-redundant storage
- Immutable and air-gapped backup for ransomware protection
- Backup verification and integrity testing
- Failover infrastructure design:
- Primary and secondary site architecture
- Cloud-based DR (AWS Elastic Disaster Recovery, Azure Site Recovery)
- Network failover and DNS routing
- Application failover sequence and dependency order
- Capacity planning for DR site
- Technology stack resilience:
- High availability architecture (active-active, active-passive)
- Load balancer and traffic routing failover
- Database replication and failover
- Identity and authentication redundancy
- Security infrastructure continuity
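The application failover sequence and dependency order mentioned above can be derived from a dependency map rather than maintained by hand. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the service names and the dependency map are hypothetical examples.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
dependencies = {
    "network": set(),
    "dns": {"network"},
    "identity": {"network", "dns"},
    "database": {"network", "identity"},
    "app": {"database", "identity", "dns"},
}


def failover_order(deps):
    """Return services in an order that starts every dependency first.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(deps).static_order())


order = failover_order(dependencies)
print(" -> ".join(order))
```

A circular dependency raises an error instead of producing an order, which is a useful check to run whenever the dependency map changes.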
3. DR Testing & Validation
- Testing strategy and cadence:
- Test type selection (documentation review, tabletop, simulation, parallel, full interruption)
- Testing frequency by criticality (annual, semi-annual, quarterly)
- Test scope and objective definition
- Test environment preparation
- Success criteria and acceptance threshold
- DR test execution:
- Test scenario and script development
- Team briefing and role assignment
- Test execution with time tracking
- Issue and deviation logging
- Test completion and recovery validation
- Post-test analysis and improvement:
- After-action review and lessons learned
- RTO/RPO achievement assessment
- Gap and deficiency identification
- Improvement action plan development
- DR plan update and version control
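The RTO/RPO achievement assessment can be computed directly from the timestamps logged during a test. This is a minimal sketch, assuming three recorded timestamps (outage start, service restored, last recoverable data point); the function name and the sample times are illustrative.

```python
from datetime import datetime, timedelta


def assess_recovery(outage_start, service_restored, last_recoverable_data,
                    rto_target, rpo_target):
    """Compare measured recovery time and data loss against targets."""
    actual_rto = service_restored - outage_start       # downtime experienced
    actual_rpo = outage_start - last_recoverable_data  # data-loss window
    return {
        "actual_rto": actual_rto,
        "actual_rpo": actual_rpo,
        # Targets in this document are strict upper bounds (e.g. "< 1 hour").
        "rto_met": actual_rto < rto_target,
        "rpo_met": actual_rpo < rpo_target,
    }


# Illustrative Tier 1 test: 45 minutes of downtime, 12 minutes of data loss.
result = assess_recovery(
    outage_start=datetime(2025, 4, 1, 2, 0),
    service_restored=datetime(2025, 4, 1, 2, 45),
    last_recoverable_data=datetime(2025, 4, 1, 1, 48),
    rto_target=timedelta(hours=1),
    rpo_target=timedelta(minutes=15),
)
print(result)
```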
4. Incident Response & Recovery Execution
- Incident declaration and activation:
- Incident detection and assessment
- DR activation criteria and authority
- DR team notification and mobilization
- Stakeholder communication activation
- Escalation path execution
- Recovery execution:
- Recovery sequence execution per priority
- System and application failover
- Data restoration and validation
- Service verification and testing
- Operational status update
- Crisis communication:
- Internal communication (employees, management)
- External communication (customers, partners, regulators)
- Media and public communication
- Status update cadence and channel
- Post-incident communication and lessons learned
5. Recovery & Restoration
- Service restoration:
- Normal operations resumption criteria
- Failback planning and execution
- Data reconciliation and sync
- Service validation and verification
- Stakeholder notification of restoration
- Post-disaster review:
- Comprehensive incident review and analysis
- DR plan effectiveness assessment
- Financial impact assessment
- Insurance claim preparation and filing
- Regulatory notification and reporting
- Continuous improvement:
- DR plan update based on lessons learned
- Infrastructure improvement implementation
- Training and awareness program update
- Vendor and partnership review
- DR program maturity assessment
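Data reconciliation during failback usually means deciding, record by record, which site holds the authoritative copy. The sketch below assumes every record carries an integer version counter on both sites; that schema, the function name, and the sample records are hypothetical, and real systems may rely on timestamps or replication logs instead.

```python
def plan_failback(primary, dr_site):
    """Classify records by comparing version counters on both sites.

    Both arguments map a record ID to an integer version (hypothetical schema).
    """
    copy_to_primary = []  # DR site has newer or brand-new data
    needs_review = []     # primary holds data the DR site never received
    for record_id, dr_version in dr_site.items():
        primary_version = primary.get(record_id)
        if primary_version is None or primary_version < dr_version:
            copy_to_primary.append(record_id)
        elif primary_version > dr_version:
            needs_review.append(record_id)
    # Records present only on the primary were also never replicated.
    needs_review.extend(r for r in primary if r not in dr_site)
    return sorted(copy_to_primary), sorted(needs_review)


# Illustrative state after running at the DR site for a while.
primary = {"order-1": 3, "order-2": 5, "order-3": 1, "order-5": 1}
dr_site = {"order-1": 4, "order-2": 4, "order-3": 1, "order-4": 1}
to_copy, to_review = plan_failback(primary, dr_site)
print("copy to primary:", to_copy)
print("manual review:", to_review)
```

Records flagged for review are writes that reached the primary but were not replicated before failover, so they fall inside the data-loss window and need a business decision rather than an automatic overwrite.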
Templates & Frameworks
DR Plan Framework
DISASTER RECOVERY PLAN FRAMEWORK
=================================
RECOVERY OBJECTIVES:
Tier 1 (Mission Critical):
RTO: < 1 hour
RPO: < 15 minutes
Systems: Core transaction processing, customer-facing applications
Tier 2 (Business Critical):
RTO: < 4 hours
RPO: < 1 hour
Systems: Internal applications, reporting, collaboration
Tier 3 (Standard):
RTO: < 24 hours
RPO: < 4 hours
Systems: Development, test, archival systems
DR SITE CONFIGURATION:
Primary site: [Location] — Full production
DR site: [Cloud Region / Secondary Location] — Warm standby
Failover mode: Semi-automated (manual activation, automated failover)
DR capacity: 100% of production capacity
Network connectivity: Dedicated fiber + satellite backup
RECOVERY SEQUENCE:
Phase 1 (0-30 min): Core infrastructure (network, identity, DNS)
Phase 2 (30-60 min): Tier 1 applications and databases
Phase 3 (1-4 hours): Tier 2 applications and services
Phase 4 (4-24 hours): Tier 3 systems and full restoration
DR TEAM ROSTER:
DR Commander: [Name, role, phone, email, alternate]
IT Lead: [Name, role, phone, email, alternate]
Application Lead: [Name, role, phone, email, alternate]
Data Lead: [Name, role, phone, email, alternate]
Communications Lead: [Name, role, phone, email, alternate]
Facilities Lead: [Name, role, phone, email, alternate]
CONTACT DISTRIBUTION:
Internal: Executive team, department heads, all employees
External: Customers, partners, regulators, insurers
Communication channels: SMS, email, phone tree, mass notification system
Update frequency: Every 2 hours during active incident
TESTING CADENCE:
Documentation review: Quarterly
Tabletop exercise: Semi-annually
Simulation test: Annually
Full failover test: Annually (off-peak)
After-action review: Within 5 business days of each test
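The tier targets in the framework above can be used to place a business function into a tier once the business impact analysis has produced its tolerances. This is a minimal sketch; the `Tier` class, the `assign_tier` function, and the rule of picking the least demanding tier that satisfies both tolerances are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_minutes: int
    rpo_minutes: int


# Targets copied from the recovery objectives in the framework.
TIERS = [
    Tier("Tier 1", rto_minutes=60, rpo_minutes=15),
    Tier("Tier 2", rto_minutes=240, rpo_minutes=60),
    Tier("Tier 3", rto_minutes=1440, rpo_minutes=240),
]


def assign_tier(max_downtime_minutes, max_data_loss_minutes):
    """Return the least demanding tier that still satisfies both tolerances."""
    for tier in reversed(TIERS):  # try the least demanding tier first
        if (tier.rto_minutes <= max_downtime_minutes
                and tier.rpo_minutes <= max_data_loss_minutes):
            return tier
    return None  # tolerances are tighter than Tier 1 can promise


# Illustrative tolerances from a business impact analysis.
print(assign_tier(120, 30))     # tolerates 2 h downtime, 30 min data loss
print(assign_tier(480, 60))     # tolerates 8 h downtime, 1 h data loss
print(assign_tier(2880, 1440))  # tolerates 2 days downtime, 1 day data loss
```

A result of `None` signals that the function needs a stricter objective than the framework currently defines, for example an active-active design.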
DR Test After-Action Report
DR TEST AFTER-ACTION REPORT
==============================
TEST DETAILS:
Test type: [Tabletop / Simulation / Parallel / Full Failover]
Date: [Date]
Duration: [Duration]
Scenario: [Disaster scenario tested]
Participants: [Names, roles]
DR plan version: [Version]
OBJECTIVE ASSESSMENT:
Test objectives defined: [List]
Objectives achieved: [X/X]
Overall result: [Pass / Partial Pass / Fail]
RTO/RPO ACHIEVEMENT:
Tier 1 systems:
Target RTO: < 1 hour | Actual: [X minutes] | Result: [Pass/Fail]
Target RPO: < 15 min | Actual: [X minutes] | Result: [Pass/Fail]
Tier 2 systems:
Target RTO: < 4 hours | Actual: [X hours] | Result: [Pass/Fail]
Target RPO: < 1 hour | Actual: [X minutes] | Result: [Pass/Fail]
ISSUES IDENTIFIED:
1. [Issue description]
Severity: [Critical / High / Medium / Low]
Root cause: [Description]
Impact: [Description]
Resolution: [Immediate fix if applied]
2. [Issue description]
Severity: [Critical / High / Medium / Low]
Root cause: [Description]
Impact: [Description]
Resolution: [Immediate fix if applied]
LESSONS LEARNED:
What worked well: [List]
What didn't work: [List]
Unexpected challenges: [List]
Plan gaps identified: [List]
IMPROVEMENT ACTIONS:
1. [Action item] — Owner: [Name] — Due: [Date] — Priority: [High/Med/Low]
2. [Action item] — Owner: [Name] — Due: [Date] — Priority: [High/Med/Low]
3. [Action item] — Owner: [Name] — Due: [Date] — Priority: [High/Med/Low]
RECOMMENDATIONS:
DR plan updates needed: [List]
Infrastructure improvements: [List]
Training gaps identified: [List]
Next test focus area: [Recommendation]
Integration Points
- Cloud DR services (AWS Elastic Disaster Recovery, Azure Site Recovery, Zerto, VMware): DR infrastructure
- Backup platforms (Veeam, Commvault, Rubrik, Veritas): Data protection
- Mass notification systems (Everbridge, OnSolve, AlertMedia): Emergency communication
- Business continuity management platforms (Resilient, Everbridge, Fusion): BCM orchestration
- Monitoring platforms: System health and failover trigger
- Communication platforms: Crisis communication
- IT service management (ServiceNow): Incident and problem management
- Cybersecurity platforms: Threat detection and response
Edge Cases
- Ransomware during DR: Immutable backup verification; air-gapped recovery point; clean state restoration; incident response coordination; forensic investigation
- Extended outage beyond planned RTO: Extended operations at DR site; resource exhaustion planning; vendor dependency management; employee fatigue and shift management
- Multi-region/cloud disaster: Cross-region replication strategy; DNS failover complexity; data consistency challenges; regulatory data sovereignty compliance
- Pandemic/business continuity: Remote work activation; workforce health and safety; supply chain disruption; customer communication; long-term operational adjustment
- Regulated industry DR (financial, healthcare): Regulatory notification requirements; compliance documentation; audit trail maintenance; regulatory testing requirements
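For the ransomware case, backup verification before a clean-state restore can be as simple as comparing each backup file against a checksum recorded when the backup was taken. The sketch below uses SHA-256 from Python's standard library; the manifest format (file name mapped to hex digest) and the file names are assumptions, and the manifest itself is assumed to be stored somewhere the attacker could not modify.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest, backup_root):
    """Return the names of backups that are missing or fail their checksum."""
    failed = []
    for name, expected in manifest.items():
        path = Path(backup_root) / name
        if not path.is_file() or sha256_of(path) != expected:
            failed.append(name)
    return failed


# Self-contained demo using a temporary directory as the backup store.
with tempfile.TemporaryDirectory() as root:
    backup = Path(root) / "db-full.bak"
    backup.write_bytes(b"backup contents")
    manifest = {"db-full.bak": sha256_of(backup), "missing.bak": "0" * 64}

    clean_check = verify_backups({"db-full.bak": manifest["db-full.bak"]}, root)
    backup.write_bytes(b"tampered contents")  # simulate modification
    tampered_check = verify_backups(manifest, root)

print("clean:", clean_check)
print("after tampering:", tampered_check)
```

A checksum match only shows the file is unchanged since the manifest was written; it does not prove the backup predates the compromise, so the recovery point still has to be chosen with the incident response team.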
Output
DR Readiness Dashboard
DISASTER RECOVERY READINESS — April 2025
==========================================
OVERALL DR READINESS SCORE: 87/100 ✓
RECOVERY OBJECTIVE STATUS:
Tier 1 (Mission Critical):
RTO compliance: 98% ✓ (last test: 45 min, target: 60 min)
RPO compliance: 100% ✓ (last test: 12 min, target: 15 min)
Tier 2 (Business Critical):
RTO compliance: 95% ✓ (last test: 3.2 hrs, target: 4 hrs)
RPO compliance: 97% ✓ (last test: 52 min, target: 60 min)
Tier 3 (Standard):
RTO compliance: 100% ✓
RPO compliance: 100% ✓
DATA PROTECTION:
Backup coverage: 99.2% ✓
Backup success rate: 98.7% ✓
Last full backup verification: 3 days ago ✓
Immutable backup status: Active ✓
Geo-redundant replication: Active (2 regions)
RPO actual (avg across all tiers): 18 minutes
INFRASTRUCTURE READINESS:
DR site capacity: 100% available
Network failover: Tested ✓
DNS failover configuration: Verified ✓
Identity redundancy: Active ✓
Security infrastructure: Redundant ✓
TESTING STATUS:
Last tabletop exercise: 45 days ago ✓
Last simulation test: 120 days ago
Last full failover test: 340 days ago (due within 30 days)
Test pass rate (last 4 tests): 100% ✓
Open issues from last test: 3 (all medium, resolution in progress)
PLAN GOVERNANCE:
DR plan version: 4.2 (current)
Last plan review: 30 days ago ✓
Contact list accuracy: Verified within 30 days ✓
Team training completion: 94%
Awareness campaign: Active
IMPROVEMENT TRACKING:
Open improvement actions: 7
Completed this quarter: 12
Infrastructure upgrades planned: 3
Next test scheduled: [Date]
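The testing status lines in a dashboard like this one can be derived from the last run date of each test type and the testing cadence. This is a minimal sketch; the day counts are approximations of "quarterly", "semi-annually", and "annually", and the reporting date is an illustrative assumption.

```python
from datetime import date, timedelta

# Cadence in days, approximating the testing cadence in the DR plan framework.
CADENCE_DAYS = {
    "Documentation review": 91,  # quarterly
    "Tabletop exercise": 182,    # semi-annually
    "Simulation test": 365,      # annually
    "Full failover test": 365,   # annually
}


def days_until_due(test_type, last_run, today):
    """Days remaining before the next test is due; negative means overdue."""
    due = last_run + timedelta(days=CADENCE_DAYS[test_type])
    return (due - today).days


today = date(2025, 4, 30)  # illustrative reporting date
last_runs = {
    "Tabletop exercise": today - timedelta(days=45),
    "Simulation test": today - timedelta(days=120),
    "Full failover test": today - timedelta(days=340),
}
status = {name: days_until_due(name, last, today) for name, last in last_runs.items()}
for name, remaining in status.items():
    label = "OVERDUE" if remaining < 0 else f"due in {remaining} days"
    print(f"{name}: {label}")
```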
Trigger Phrases
"disaster recovery", "DR plan", "business continuity", "RTO", "RPO", "failover", "BCP", "DR test", "crisis management", "recovery architecture", "backup strategy", "failover test", "incident response", "crisis communication", "after-action review"