IT AI Skill
Service Continuity
Plan and execute business continuity and disaster recovery for IT services. Use when developing BCP/DR plans, running disaster recovery tests, managing failover processes, defining RTO/RPO targets, or coordinating recovery operations. Triggers on phrases li...
Business Continuity & Disaster Recovery
Ensure organizational resilience through comprehensive business continuity and disaster recovery planning and execution.
Workflow
1. Business Impact Analysis (BIA)
- Critical service identification:
- Inventory all IT services and applications
- Map services to business functions and processes
- Interview business owners for impact assessment
- Classify services by criticality: Mission Critical, Critical, Important, Standard
- Document service dependencies and relationships
- Impact assessment and quantification:
- Financial impact per hour of downtime by service
- Operational impact (business process disruption)
- Regulatory and compliance impact
- Reputational and customer impact
- Tolerable downtime determination per service
- Recovery objective definition:
- Recovery Time Objective (RTO): maximum acceptable downtime
- Recovery Point Objective (RPO): maximum acceptable data loss
- Service Level Objective (SLO): performance level during recovery
- Mission Critical: RTO < 1 hour, RPO < 15 minutes
- Critical: RTO < 4 hours, RPO < 1 hour
- Important: RTO < 8 hours, RPO < 4 hours
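The tier targets above can be captured as data so that test results are checked the same way every time. The following is a minimal sketch, not a prescribed implementation: the names `RecoveryTarget`, `TIER_TARGETS`, and `meets_targets` are hypothetical, and the Standard tier is left out because no target is stated for it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss


# Targets copied from the tier list above. "Standard" has no stated
# target in the source, so it is intentionally omitted.
TIER_TARGETS = {
    "Mission Critical": RecoveryTarget(timedelta(hours=1), timedelta(minutes=15)),
    "Critical": RecoveryTarget(timedelta(hours=4), timedelta(hours=1)),
    "Important": RecoveryTarget(timedelta(hours=8), timedelta(hours=4)),
}


def meets_targets(tier: str, downtime: timedelta, data_loss: timedelta) -> bool:
    """Return True when both measured values are strictly below the tier targets."""
    target = TIER_TARGETS[tier]
    return downtime < target.rto and data_loss < target.rpo
```

For example, `meets_targets("Critical", timedelta(hours=3), timedelta(minutes=30))` passes, while a Mission Critical service that lost 20 minutes of data fails on RPO even if it was restored in time.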
2. Continuity & DR Plan Development
- Disaster scenario planning:
- Identify potential disaster scenarios: natural disaster, cyberattack, data center failure, cloud provider outage, supply chain disruption, pandemic
- Assess likelihood and impact for each scenario
- Define response procedures for each scenario
- Identify required resources and personnel
- Define escalation and command structure
- Recovery strategy selection:
- Infrastructure recovery: cloud failover, cold/warm/hot standby, mutual aid agreements
- Data recovery: continuous replication, frequent backups, offsite storage
- Application recovery: container orchestration, multi-region deployment, automated failover
- Workplace recovery: remote work capability, alternate work sites
- Communication recovery: alternative communication channels
- Recovery runbook development:
- Step-by-step recovery procedures for each critical service
- Priority order for service restoration
- Required credentials, access points, and contact information
- Decision trees for recovery path selection
- Rollback procedures if recovery fails
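One way to derive the restoration priority order in a runbook is to sort services by their documented dependencies, so that nothing is restored before the services it relies on. The sketch below uses Python's standard `graphlib` module; the function name `restoration_order` and the service names are illustrative assumptions, not part of any plan.

```python
from graphlib import TopologicalSorter


def restoration_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return services in an order where every dependency is restored first.

    `dependencies` maps each service to the set of services it depends on.
    graphlib raises CycleError if the dependency map contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map for illustration only.
deps = {
    "core-network": set(),
    "database": {"core-network"},
    "web-app": {"database", "core-network"},
}
order = restoration_order(deps)
```

A dependency sort only gives a technically valid order. Business criticality from the BIA still has to break ties between services that do not depend on each other.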
3. DR Infrastructure & Technology
- Backup strategy and implementation:
- Backup types: full, incremental, differential, continuous data protection
- Backup frequency aligned with RPO requirements
- 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
- Immutable backups for ransomware protection
- Backup encryption and access controls
- Replication and failover infrastructure:
- Database replication (synchronous for critical, asynchronous for others)
- Application multi-region deployment
- DNS failover configuration
- Load balancer health check and failover
- Cloud region failover automation
- Data protection validation:
- Automated backup verification (daily)
- Backup integrity testing (weekly)
- Restoration testing (monthly for critical systems)
- Recovery time measurement against RTO
- Recovery point validation against RPO
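Two of the checks above lend themselves to simple automation: the 3-2-1 rule and the alignment of backup frequency with RPO. The sketch below is an assumption-laden illustration: `BackupCopy`, `satisfies_3_2_1`, and `interval_fits_rpo` are hypothetical names, and the copy list is assumed to include every copy of the data that the organization counts toward the rule.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BackupCopy:
    media: str     # e.g. "disk", "tape", "object-storage" (labels are illustrative)
    offsite: bool  # True if the copy is stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check the 3-2-1 rule: at least 3 copies, 2 media types, 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def interval_fits_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With periodic backups, worst-case data loss is roughly one interval.

    This ignores the time a backup job takes to complete, which adds to the
    real exposure; treat the result as an optimistic first check.
    """
    return backup_interval < rpo
```

For instance, hourly backups cannot satisfy a 15-minute RPO, which is why the tighter tiers usually rely on replication or continuous data protection rather than scheduled backups alone.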
4. Testing & Exercises
- Testing program design:
- Annual full-scale DR exercise (all critical services)
- Quarterly component-level DR tests
- Monthly backup restoration tests
- Tabletop exercises for leadership (twice a year)
- Progressive complexity increase year-over-year
- Test execution and documentation:
- Pre-test: notify stakeholders, validate current state, prepare test environment
- During test: execute recovery procedures, document timing and issues
- Post-test: compare results against RTO/RPO targets
- Capture lessons learned and improvement actions
- Update DR plans based on test findings
- Test types and scope:
- Notification test: validate contact lists and notification process
- Readiness test: review plans, verify resources, validate contact information
- Simulated test: practice recovery in isolated environment
- Parallel test: run operations from DR site alongside production
- Full-interruption test: actual failover (planned maintenance window)
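The testing cadences in the program design can be tracked with a small overdue check. This is a sketch under stated assumptions: the key names and the day counts (365, 91, 31) are approximations chosen for illustration, and tabletop exercises are left out of the table.

```python
from datetime import date, timedelta

# Approximate cadences for the annual, quarterly, and monthly tests listed
# above. The day counts are assumptions, not values from any standard.
CADENCE_DAYS = {
    "full_scale_exercise": 365,
    "component_test": 91,
    "backup_restore_test": 31,
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return test types that have never run or whose last run is too old."""
    return sorted(
        name
        for name, days in CADENCE_DAYS.items()
        if name not in last_run or today - last_run[name] > timedelta(days=days)
    )


# Hypothetical history for illustration.
history = {
    "full_scale_exercise": date(2025, 3, 15),
    "component_test": date(2025, 1, 1),
    "backup_restore_test": date(2025, 4, 1),
}
overdue = overdue_tests(history, today=date(2025, 4, 20))
```

Here only the component-level test is flagged, because 109 days have passed since its last run.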
5. Plan Maintenance & Continuous Improvement
- Plan review and update:
- Quarterly plan review and update (minimum)
- Immediate update after significant infrastructure change
- Annual comprehensive review with all stakeholders
- Version control and change log maintenance
- Distribution to relevant personnel and stakeholders
- Contact and resource management:
- Emergency contact list maintenance (quarterly validation)
- Vendor and partner emergency contact information
- Resource inventory validation (spare hardware, licenses, cloud credits)
- Credential rotation for recovery access
- Alternate communication channel testing
- Training and awareness:
- DR team training (quarterly)
- General staff awareness of business continuity procedures
- Leadership tabletop exercise participation
- New team member onboarding includes BCP/DR awareness
- Cross-training for critical recovery roles
Templates & Frameworks
Business Continuity Plan Summary
BUSINESS CONTINUITY PLAN — 2025
================================
CRITICAL SERVICES RECOVERY PRIORITY:
1. Core network and DNS — RTO: 30 min, RPO: 0 min
2. Customer-facing web applications — RTO: 1 hour, RPO: 15 min
3. CRM and sales systems — RTO: 2 hours, RPO: 30 min
4. Email and collaboration — RTO: 4 hours, RPO: 1 hour
5. Internal applications — RTO: 8 hours, RPO: 4 hours
6. Reporting and analytics — RTO: 24 hours, RPO: 8 hours
DISASTER RECOVERY SITE:
Primary data center: [Location, Provider]
DR site: [Location, Provider] (hot standby)
Cloud failover: AWS us-east-1 → us-west-2
Estimated failover time: 45 minutes (automated)
EMERGENCY CONTACT LIST:
Incident Commander: [Name, Phone, Email]
IT Director: [Name, Phone, Email]
Security Lead: [Name, Phone, Email]
Communications Lead: [Name, Phone, Email]
Executive Sponsor: [Name, Phone, Email]
Key Vendors: [List with emergency contact numbers]
RECOVERY DECISION FRAMEWORK:
If single system failure → restart/patch in place
If data center failure → failover to DR site
If cloud region failure → failover to secondary region
If cyberattack detected → isolate, contain, investigate, restore from clean backup
If extended outage (>4 hours) → activate BCP, shift to remote work
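The recovery decision framework above is essentially a lookup from incident type to response, plus a duration rule for extended outages. A minimal sketch of that mapping follows; the `Incident` enum and the `recovery_actions` function are hypothetical names, and the response strings are copied from the template.

```python
from enum import Enum


class Incident(Enum):
    SINGLE_SYSTEM_FAILURE = "single system failure"
    DATA_CENTER_FAILURE = "data center failure"
    CLOUD_REGION_FAILURE = "cloud region failure"
    CYBERATTACK = "cyberattack detected"


# Responses taken from the decision framework in the plan summary.
RESPONSES = {
    Incident.SINGLE_SYSTEM_FAILURE: "restart/patch in place",
    Incident.DATA_CENTER_FAILURE: "failover to DR site",
    Incident.CLOUD_REGION_FAILURE: "failover to secondary region",
    Incident.CYBERATTACK: "isolate, contain, investigate, restore from clean backup",
}

EXTENDED_OUTAGE_HOURS = 4  # threshold from the framework (>4 hours)


def recovery_actions(incident: Incident, outage_hours: float = 0.0) -> list[str]:
    """Return the response for the incident, adding BCP activation for long outages."""
    actions = [RESPONSES[incident]]
    if outage_hours > EXTENDED_OUTAGE_HOURS:
        actions.append("activate BCP, shift to remote work")
    return actions
```

Keeping the mapping in one table makes it easy to review against the written plan. A real incident still needs the Incident Commander's judgment, so the output is best treated as a prompt rather than an automatic trigger.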
DR Test Checklist
DISASTER RECOVERY TEST CHECKLIST
================================
PRE-TEST PREPARATION:
[ ] Test scope and objectives defined
[ ] Stakeholder notification sent (7 days advance)
[ ] Test environment validated
[ ] Current production state documented and backed up
[ ] DR team briefed and assigned roles
[ ] Communication channels tested
[ ] Timing and measurement tools prepared
TEST EXECUTION:
[ ] Test start time recorded: [HH:MM]
[ ] Failover initiated — time recorded
[ ] DNS update and propagation verified
[ ] Critical services restored (check each service):
[ ] Core network/DNS — restored at [HH:MM] — time: [X] min
[ ] Web applications — restored at [HH:MM] — time: [X] min
[ ] Database — restored at [HH:MM] — time: [X] min
[ ] Email/collaboration — restored at [HH:MM] — time: [X] min
[ ] Data integrity verified (RPO validation)
[ ] Application functionality verified
[ ] User access validated (sample test)
[ ] Failback to primary executed (if applicable)
POST-TEST:
[ ] Test end time recorded: [HH:MM]
[ ] RTO achieved: [Yes/No] — Actual: [X] min vs Target: [Y] min
[ ] RPO achieved: [Yes/No] — Actual data loss: [X] min vs Target: [Y] min
[ ] Issues documented and categorized
[ ] Lessons learned captured
[ ] Improvement action items assigned
[ ] DR plan updated based on findings
[ ] Test report distributed to stakeholders
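The checklist records times as HH:MM, so the actual recovery duration has to be computed before it can be compared with the target. The sketch below shows one way to do that; `elapsed_minutes` and `rto_met` are hypothetical helpers, and treating the target as an inclusive upper bound is an assumption.

```python
from datetime import datetime, timedelta


def elapsed_minutes(start_hhmm: str, restored_hhmm: str) -> int:
    """Minutes between two HH:MM clock times, allowing one midnight rollover."""
    fmt = "%H:%M"
    start = datetime.strptime(start_hhmm, fmt)
    end = datetime.strptime(restored_hhmm, fmt)
    if end < start:  # the test ran past midnight
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)


def rto_met(start_hhmm: str, restored_hhmm: str, target_min: int) -> tuple[bool, int]:
    """Return (achieved, actual_minutes) for one service.

    Assumption: a recovery that takes exactly the target time counts as achieved.
    """
    actual = elapsed_minutes(start_hhmm, restored_hhmm)
    return actual <= target_min, actual
```

For a failover started at 09:00 with the core network restored at 09:22, `rto_met("09:00", "09:22", 30)` reports the target as achieved with an actual time of 22 minutes. Tests that span more than 24 hours would need full timestamps rather than clock times.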
Integration Points
- Cloud DR services (AWS Elastic Disaster Recovery, Azure Site Recovery, Google Cloud Backup and DR Service): Infrastructure failover
- Backup platforms (Veeam, Commvault, Rubrik, Druva): Data protection and backup
- Replication tools (Zerto, Storage Replica, database native replication): Data synchronization
- DNS failover (Route 53 failover, Cloudflare Load Balancing): Traffic redirection
- Communication platforms (Slack, Teams, emergency notification services): Emergency communication
- CMDB and service mapping: Dependency identification
- Monitoring platforms: Recovery validation and health checks
- Compliance systems: BCP/DR audit evidence
Edge Cases
- Extended outage (>48 hours): Activate alternate work arrangements; implement manual workarounds for critical processes; daily stakeholder briefings; monitor employee well-being
- Simultaneous multi-site failure: Activate cloud-based contingency environment; prioritize mission-critical services only; manual routing of essential operations
- Ransomware during recovery: Validate backup integrity before restoration; use immutable backups; forensic investigation parallel to recovery; law enforcement and insurer notification
- Cloud provider regional outage: Cross-cloud failover capability; DNS-based traffic shifting; vendor communication coordination; customer notification management
- Supply chain disruption for hardware: Maintain minimum spare inventory; multi-vendor hardware strategy; cloud burst capacity agreement
Output
BCP/DR Status Dashboard
BUSINESS CONTINUITY STATUS — April 2025
=========================================
RECOVERY READINESS:
Plan last reviewed: 2025-04-01 (current ✓)
Last full DR test: 2025-03-15 (on schedule ✓)
Next scheduled test: 2025-06-15 (quarterly)
Plan version: 4.2 (distributed to 47 stakeholders)
RECOVERY OBJECTIVES STATUS:
Service | RTO Target | Last Test | Status
--------------------|-----------|-----------|--------
Core Network | 30 min | 22 min | ✓
Web Applications | 1 hour | 48 min | ✓
CRM Systems | 2 hours | 1h 45min | ✓
Email/Collaboration | 4 hours | 3h 20min | ✓
Internal Apps | 8 hours | 6h 45min | ✓
Analytics | 24 hours | 18 hours | ✓
BACKUP STATUS:
Backup success rate: 99.2%
Last successful full backup: 2025-04-15
Backup integrity test (last): Passed ✓
Immutable backup coverage: 100% critical systems
RECOVERY INFRASTRUCTURE:
DR site status: Hot standby — synchronized ✓
Replication lag: 12 seconds (target: <60 seconds ✓)
Cloud failover ready: ✓
DNS failover configured: ✓
Emergency credentials: Validated ✓
CONTACT VALIDATION:
Emergency contacts validated: 94% (3 overdue)
Vendor contacts current: 100%
Communication channels tested: Last 7 days ✓
IMPROVEMENT ACTIONS:
[ ] Update DR contact list (3 overdue) — Due: April 18
[ ] Test email failover (next quarterly test) — Due: June 15
[ ] Renew cloud burst agreement — Due: May 30
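A pass mark in the recovery objectives table hides how much of the RTO budget each service used. The following sketch computes that share from the example figures in the dashboard above; the 85% warning threshold and the names `budget_used` and `thin_margin` are assumptions made for illustration.

```python
# (target_minutes, last_test_minutes) taken from the example dashboard above.
RESULTS_MIN = {
    "Core Network": (30, 22),
    "Web Applications": (60, 48),
    "CRM Systems": (120, 105),
    "Email/Collaboration": (240, 200),
    "Internal Apps": (480, 405),
    "Analytics": (1440, 1080),
}


def budget_used(target_min: int, actual_min: int) -> float:
    """Fraction of the RTO budget consumed in the last test."""
    return actual_min / target_min


def thin_margin(results: dict[str, tuple[int, int]], threshold: float = 0.85) -> list[str]:
    """Services that passed but used at least `threshold` of their RTO budget."""
    return [
        name
        for name, (target, actual) in results.items()
        if threshold <= budget_used(target, actual) <= 1.0
    ]
```

With these figures, CRM Systems used 87.5% of its two-hour budget and is the only service flagged, which suggests it is the first candidate for recovery-time improvements.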
Trigger Phrases
"business continuity", "disaster recovery", "DR plan", "BCP", "failover", "RTO", "RPO", "site recovery", "backup strategy", "failover testing", "business impact analysis", "contingency planning", "recovery plan", "drill exercise", "site failover"