---
name: service-continuity
description: Plan and execute business continuity and disaster recovery for IT services. Use when developing BCP/DR plans, running disaster recovery tests, managing failover processes, defining RTO/RPO targets, or coordinating recovery operations. Triggers on phrases like "business continuity", "disaster recovery", "DR plan", "failover", "RTO", "RPO", "BCP", "site recovery", "backup strategy", "failover testing".
---

# Business Continuity & Disaster Recovery

Plan, build, test, and maintain the ability to keep critical IT services running through a disruption and to restore them within agreed recovery targets.

## Workflow

### 1. Business Impact Analysis (BIA)

1. **Critical service identification**:
   - Inventory all IT services and applications
   - Map services to business functions and processes
   - Interview business owners for impact assessment
   - Classify services by criticality: Mission Critical, Critical, Important, Standard
   - Document service dependencies and relationships

2. **Impact assessment and quantification**:
   - Financial impact per hour of downtime by service
   - Operational impact (business process disruption)
   - Regulatory and compliance impact
   - Reputational and customer impact
   - Tolerable downtime determination per service

3. **Recovery objective definition**:
   - Recovery Time Objective (RTO): maximum acceptable downtime
   - Recovery Point Objective (RPO): maximum acceptable data loss
   - Service Level Objective (SLO): performance level during recovery
   - Mission Critical: RTO < 1 hour, RPO < 15 minutes
   - Critical: RTO < 4 hours, RPO < 1 hour
   - Important: RTO < 8 hours, RPO < 4 hours
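
The tier targets above can be kept as data so that a service's committed objectives are checked against its tier automatically. The following is a minimal sketch, not a prescribed implementation: the names `RecoveryTarget`, `TIER_TARGETS`, and `meets_tier` are illustrative, and the Standard tier is left out because the list above defines no targets for it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss


# Upper bounds copied from the tier list above. "Standard" is omitted
# because no targets are defined for it there.
TIER_TARGETS = {
    "Mission Critical": RecoveryTarget(timedelta(hours=1), timedelta(minutes=15)),
    "Critical": RecoveryTarget(timedelta(hours=4), timedelta(hours=1)),
    "Important": RecoveryTarget(timedelta(hours=8), timedelta(hours=4)),
}


def meets_tier(tier: str, rto: timedelta, rpo: timedelta) -> bool:
    """Return True if the committed RTO and RPO are strictly below the tier's bounds."""
    target = TIER_TARGETS[tier]
    return rto < target.rto and rpo < target.rpo


# Hypothetical service commitments, for illustration only.
print(meets_tier("Mission Critical", timedelta(minutes=45), timedelta(minutes=5)))  # True
print(meets_tier("Mission Critical", timedelta(hours=2), timedelta(minutes=5)))     # False
```

The comparison is strict because the targets are written as "less than" bounds; a service committed to exactly 1 hour would not qualify as Mission Critical under this reading.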

### 2. Continuity & DR Plan Development

1. **Disaster scenario planning**:
   - Identify potential disaster scenarios: natural disaster, cyberattack, data center failure, cloud provider outage, supply chain disruption, pandemic
   - Assess likelihood and impact for each scenario
   - Define response procedures for each scenario
   - Identify required resources and personnel
   - Define escalation and command structure

2. **Recovery strategy selection**:
   - Infrastructure recovery: cloud failover, cold/warm/hot standby, mutual aid agreements
   - Data recovery: continuous replication, frequent backups, offsite storage
   - Application recovery: container orchestration, multi-region deployment, automated failover
   - Workplace recovery: remote work capability, alternate work sites
   - Communication recovery: alternative communication channels

3. **Recovery runbook development**:
   - Step-by-step recovery procedures for each critical service
   - Priority order for service restoration
   - Required credentials, access points, and contact information
   - Decision trees for recovery path selection
   - Rollback procedures if recovery fails
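
One way to derive the priority order for service restoration is to sort services by their documented dependencies, so that nothing is restored before the services it relies on. The sketch below assumes a hypothetical dependency map (`DEPENDS_ON`) and uses the standard-library `graphlib` module (Python 3.9+); the service names are placeholders, and business priority between independent services would still need to be applied on top.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# running before it can be restored.
DEPENDS_ON = {
    "core-network-dns": set(),
    "database": {"core-network-dns"},
    "web-app": {"core-network-dns", "database"},
    "crm": {"database"},
    "reporting": {"crm", "database"},
}


def restoration_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every service comes after its dependencies."""
    # TopologicalSorter takes a mapping of node -> predecessors and raises
    # graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(depends_on).static_order())


order = restoration_order(DEPENDS_ON)
print(order)  # one valid order; services with no mutual dependency may appear in either order
```

A `CycleError` here is useful in itself: a circular dependency between two critical services is a finding the runbook has to resolve before a real recovery.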

### 3. DR Infrastructure & Technology

1. **Backup strategy and implementation**:
   - Backup types: full, incremental, differential, continuous data protection
   - Backup frequency aligned with RPO requirements
   - 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
   - Immutable backups for ransomware protection
   - Backup encryption and access controls

2. **Replication and failover infrastructure**:
   - Database replication (synchronous for critical, asynchronous for others)
   - Application multi-region deployment
   - DNS failover configuration
   - Load balancer health check and failover
   - Cloud region failover automation

3. **Data protection validation**:
   - Automated backup verification (daily)
   - Backup integrity testing (weekly)
   - Restoration testing (monthly for critical systems)
   - Recovery time measurement against RTO
   - Recovery point validation against RPO
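
Two of the checks above lend themselves to automation: the 3-2-1 rule and recovery point validation against RPO. The following is a rough sketch under stated assumptions: `BackupCopy` is a hypothetical record, not the schema of any backup product, and the copy count covers whatever copies are passed in (whether the production copy counts toward the three is a policy decision).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupCopy:
    media: str              # e.g. "disk", "tape", "object-storage"
    offsite: bool           # stored away from the primary site
    completed_at: datetime  # when this copy finished successfully


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def within_rpo(copies: list[BackupCopy], rpo: timedelta, now: datetime) -> bool:
    """True if restoring the newest copy would lose at most `rpo` worth of data."""
    newest = max(c.completed_at for c in copies)
    return now - newest <= rpo


# Illustrative data only.
now = datetime(2025, 4, 15, 12, 0)
copies = [
    BackupCopy("disk", False, now - timedelta(minutes=10)),
    BackupCopy("disk", False, now - timedelta(hours=6)),
    BackupCopy("object-storage", True, now - timedelta(hours=24)),
]
print(satisfies_3_2_1(copies))                         # True
print(within_rpo(copies, timedelta(minutes=15), now))  # True
```

Note that in this example only the onsite copy is inside the 15-minute RPO; if the primary site is lost, the newest offsite copy is 24 hours old. Running `within_rpo` against the offsite copies alone is a stricter and often more realistic check.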

### 4. Testing & Exercises

1. **Testing program design**:
   - Annual full-scale DR exercise (all critical services)
   - Quarterly component-level DR tests
   - Monthly backup restoration tests
   - Tabletop exercises for leadership (twice a year)
   - Increase exercise complexity each year

2. **Test execution and documentation**:
   - Pre-test: notify stakeholders, validate current state, prepare test environment
   - During test: execute recovery procedures, document timing and issues
   - Post-test: compare results against RTO/RPO targets
   - Capture lessons learned and improvement actions
   - Update DR plans based on test findings

3. **Test types and scope**:
   - Notification test: validate contact lists and notification process
   - Readiness test: review plans, verify resources, validate contact information
   - Simulated test: practice recovery in isolated environment
   - Parallel test: run operations from DR site alongside production
   - Full-interruption test: actual failover of production, run in a planned maintenance window
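
The post-test comparison against RTO targets can be computed directly from the timestamps recorded during the exercise. The sketch below is one possible shape for that calculation; the function name `evaluate_rto`, the service names, and the timestamps are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def evaluate_rto(
    failover_started: datetime,
    restored_at: dict[str, datetime],
    rto_targets: dict[str, timedelta],
) -> dict[str, tuple[timedelta, bool]]:
    """Map each service to (actual recovery time, whether it met its RTO)."""
    results = {}
    for service, restored in restored_at.items():
        actual = restored - failover_started
        results[service] = (actual, actual <= rto_targets[service])
    return results


# Hypothetical test run: failover initiated at 09:00.
started = datetime(2025, 3, 15, 9, 0)
restored = {
    "core-network": datetime(2025, 3, 15, 9, 22),
    "web-apps": datetime(2025, 3, 15, 10, 10),
}
targets = {
    "core-network": timedelta(minutes=30),
    "web-apps": timedelta(hours=1),
}

for service, (actual, met) in evaluate_rto(started, restored, targets).items():
    minutes = int(actual.total_seconds() // 60)
    print(f"{service}: {minutes} min, RTO {'met' if met else 'missed'}")
# core-network: 22 min, RTO met
# web-apps: 70 min, RTO missed
```

Measuring every service from the same failover start time keeps the results comparable with the RTO, which is defined from the start of the outage rather than from the moment each service's own recovery step began.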

### 5. Plan Maintenance & Continuous Improvement

1. **Plan review and update**:
   - Quarterly plan review and update (minimum)
   - Immediate update after significant infrastructure change
   - Annual comprehensive review with all stakeholders
   - Version control and change log maintenance
   - Distribution to relevant personnel and stakeholders

2. **Contact and resource management**:
   - Emergency contact list maintenance (quarterly validation)
   - Vendor and partner emergency contact information
   - Resource inventory validation (spare hardware, licenses, cloud credits)
   - Credential rotation for recovery access
   - Alternate communication channel testing

3. **Training and awareness**:
   - DR team training (quarterly)
   - General staff awareness of business continuity procedures
   - Leadership tabletop exercise participation
   - New team member onboarding includes BCP/DR awareness
   - Cross-training for critical recovery roles
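
The quarterly validation of emergency contacts described above is easy to let slip, so it can help to flag overdue entries automatically. This is a minimal sketch that assumes "quarterly" means 90 days and that validation dates are tracked per role; the roles and dates shown are invented for illustration.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
VALIDATION_INTERVAL = timedelta(days=90)


def overdue_contacts(last_validated: dict[str, date], today: date) -> list[str]:
    """Return the roles whose contact details were validated more than one interval ago."""
    return sorted(
        role
        for role, validated in last_validated.items()
        if today - validated > VALIDATION_INTERVAL
    )


# Illustrative data only.
contacts = {
    "Incident Commander": date(2025, 3, 1),
    "Security Lead": date(2024, 12, 20),
    "Communications Lead": date(2025, 2, 1),
}
print(overdue_contacts(contacts, today=date(2025, 4, 15)))  # ['Security Lead']
```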

## Templates & Frameworks

### Business Continuity Plan Summary

```
BUSINESS CONTINUITY PLAN — 2025
================================

CRITICAL SERVICES RECOVERY PRIORITY:
  1. Core network and DNS — RTO: 30 min, RPO: 0 min
  2. Customer-facing web applications — RTO: 1 hour, RPO: 15 min
  3. CRM and sales systems — RTO: 2 hours, RPO: 30 min
  4. Email and collaboration — RTO: 4 hours, RPO: 1 hour
  5. Internal applications — RTO: 8 hours, RPO: 4 hours
  6. Reporting and analytics — RTO: 24 hours, RPO: 8 hours

DISASTER RECOVERY SITE:
  Primary data center: [Location, Provider]
  DR site: [Location, Provider] (hot standby)
  Cloud failover: AWS us-east-1 → us-west-2
  Estimated failover time: 45 minutes (automated)

EMERGENCY CONTACT LIST:
  Incident Commander: [Name, Phone, Email]
  IT Director: [Name, Phone, Email]
  Security Lead: [Name, Phone, Email]
  Communications Lead: [Name, Phone, Email]
  Executive Sponsor: [Name, Phone, Email]
  Key Vendors: [List with emergency contact numbers]

RECOVERY DECISION FRAMEWORK:
  If single system failure → restart/patch in place
  If data center failure → failover to DR site
  If cloud region failure → failover to secondary region
  If cyberattack detected → isolate, contain, investigate, restore from clean backup
  If extended outage (>4 hours) → activate BCP, shift to remote work
```
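
The recovery decision framework in the template can also be expressed as a small lookup, which makes it testable and keeps the on-call decision consistent. The sketch below is an illustration, not part of the plan itself: the `Incident` enum and `recovery_actions` function are hypothetical names, and it assumes the extended-outage rule is applied in addition to the incident-specific action.

```python
from enum import Enum


class Incident(Enum):
    SINGLE_SYSTEM_FAILURE = "single system failure"
    DATA_CENTER_FAILURE = "data center failure"
    CLOUD_REGION_FAILURE = "cloud region failure"
    CYBERATTACK = "cyberattack"


# Actions copied from the decision framework in the template above.
ACTIONS = {
    Incident.SINGLE_SYSTEM_FAILURE: "restart/patch in place",
    Incident.DATA_CENTER_FAILURE: "failover to DR site",
    Incident.CLOUD_REGION_FAILURE: "failover to secondary region",
    Incident.CYBERATTACK: "isolate, contain, investigate, restore from clean backup",
}

EXTENDED_OUTAGE_HOURS = 4  # threshold from the template


def recovery_actions(incident: Incident, outage_hours: float) -> list[str]:
    """Return the incident-specific action, plus BCP activation for extended outages."""
    actions = [ACTIONS[incident]]
    # Assumption: BCP activation is additive, not a replacement for the first action.
    if outage_hours > EXTENDED_OUTAGE_HOURS:
        actions.append("activate BCP, shift to remote work")
    return actions


print(recovery_actions(Incident.DATA_CENTER_FAILURE, outage_hours=1))
# ['failover to DR site']
print(recovery_actions(Incident.CYBERATTACK, outage_hours=6))
# ['isolate, contain, investigate, restore from clean backup', 'activate BCP, shift to remote work']
```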

### DR Test Checklist

```
DISASTER RECOVERY TEST CHECKLIST
================================

PRE-TEST PREPARATION:
  [ ] Test scope and objectives defined
  [ ] Stakeholder notification sent (7 days advance)
  [ ] Test environment validated
  [ ] Current production state documented and backed up
  [ ] DR team briefed and assigned roles
  [ ] Communication channels tested
  [ ] Timing and measurement tools prepared

TEST EXECUTION:
  [ ] Test start time recorded: [HH:MM]
  [ ] Failover initiated — time recorded
  [ ] DNS update and propagation verified
  [ ] Critical services restored (check each service):
    [ ] Core network/DNS — restored at [HH:MM] — time: [X] min
    [ ] Web applications — restored at [HH:MM] — time: [X] min
    [ ] Database — restored at [HH:MM] — time: [X] min
    [ ] Email/collaboration — restored at [HH:MM] — time: [X] min
  [ ] Data integrity verified (RPO validation)
  [ ] Application functionality verified
  [ ] User access validated (sample test)
  [ ] Failback to primary executed (if applicable)

POST-TEST:
  [ ] Test end time recorded: [HH:MM]
  [ ] RTO achieved: [Yes/No] — Actual: [X] min vs Target: [Y] min
  [ ] RPO achieved: [Yes/No] — Actual data loss: [X] min vs Target: [Y] min
  [ ] Issues documented and categorized
  [ ] Lessons learned captured
  [ ] Improvement action items assigned
  [ ] DR plan updated based on findings
  [ ] Test report distributed to stakeholders
```

## Integration Points

- Cloud DR services (AWS Elastic Disaster Recovery, Azure Site Recovery, Google Cloud Backup and DR Service): Infrastructure failover
- Backup platforms (Veeam, Commvault, Rubrik, Druva): Data protection and backup
- Replication tools (Zerto, Storage Replica, database native replication): Data synchronization
- DNS failover (Route 53 failover, Cloudflare Load Balancing): Traffic redirection
- Communication platforms (Slack, Teams, emergency notification services): Emergency communication
- CMDB and service mapping: Dependency identification
- Monitoring platforms: Recovery validation and health checks
- Compliance systems: BCP/DR audit evidence

## Edge Cases

- **Extended outage (>48 hours)**: Activate alternate work arrangements; implement manual workarounds for critical processes; daily stakeholder briefings; monitor employee well-being
- **Simultaneous multi-site failure**: Activate cloud-based contingency environment; prioritize mission-critical services only; manual routing of essential operations
- **Ransomware during recovery**: Validate backup integrity before restoration; use immutable backups; forensic investigation parallel to recovery; law enforcement and insurer notification
- **Cloud provider regional outage**: Cross-cloud failover capability; DNS-based traffic shifting; vendor communication coordination; customer notification management
- **Supply chain disruption for hardware**: Maintain minimum spare inventory; multi-vendor hardware strategy; cloud burst capacity agreement

## Output

### BCP/DR Status Dashboard

```
BUSINESS CONTINUITY STATUS — April 2025
=========================================

RECOVERY READINESS:
  Plan last reviewed: 2025-04-01 (current ✓)
  Last full DR test: 2025-03-15 (on schedule ✓)
  Next scheduled test: 2025-06-15 (quarterly)
  Plan version: 4.2 (distributed to 47 stakeholders)

RECOVERY OBJECTIVES STATUS:
  Service             | RTO Target | Last Test | Status
  --------------------|------------|-----------|--------
  Core Network        | 30 min     | 22 min    | ✓
  Web Applications    | 1 hour     | 48 min    | ✓
  CRM Systems         | 2 hours    | 1h 45min  | ✓
  Email/Collaboration | 4 hours    | 3h 20min  | ✓
  Internal Apps       | 8 hours    | 6h 45min  | ✓
  Analytics           | 24 hours   | 18 hours  | ✓

BACKUP STATUS:
  Backup success rate: 99.2%
  Last successful full backup: 2025-04-15
  Backup integrity test (last): Passed ✓
  Immutable backup coverage: 100% critical systems

RECOVERY INFRASTRUCTURE:
  DR site status: Hot standby — synchronized ✓
  Replication lag: 12 seconds (target: <60 seconds ✓)
  Cloud failover ready: ✓
  DNS failover configured: ✓
  Emergency credentials: Validated ✓

CONTACT VALIDATION:
  Emergency contacts validated: 94% (3 overdue)
  Vendor contacts current: 100%
  Communication channels tested: Last 7 days ✓

IMPROVEMENT ACTIONS:
  [ ] Update DR contact list (3 overdue) — Due: April 18
  [ ] Test email failover (next quarterly test) — Due: June 15
  [ ] Renew cloud burst agreement — Due: May 30
```

## Trigger Phrases

"business continuity", "disaster recovery", "DR plan", "BCP", "failover", "RTO", "RPO", "site recovery", "backup strategy", "failover testing", "business impact analysis", "contingency planning", "recovery plan", "drill exercise", "site failover"
