IT AI Skill

Service Continuity

Plan and execute business continuity and disaster recovery for IT services. Use when developing BCP/DR plans, running disaster recovery tests, managing failover processes, defining RTO/RPO targets, or coordinating recovery operations. Triggers on phrases like "business continuity", "disaster recovery", "DR plan", "failover", "RTO", "RPO", "BCP", "site recovery", "backup strategy", "failover testing".

Business Continuity & Disaster Recovery

Ensure organizational resilience through comprehensive business continuity and disaster recovery planning and execution.

Workflow

1. Business Impact Analysis (BIA)

Critical service identification:

Inventory all IT services and applications
Map services to business functions and processes
Interview business owners for impact assessment
Classify services by criticality: Mission Critical, Critical, Important, Standard
Document service dependencies and relationships

Impact assessment and quantification:

Financial impact per hour of downtime by service
Operational impact (business process disruption)
Regulatory and compliance impact
Reputational and customer impact
Tolerable downtime determination per service
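
The financial impact figure above can be turned into a rough outage estimate. The following is a minimal sketch, assuming hourly cost figures have already been gathered from business owners; the service names and amounts are illustrative placeholders, not data from any real assessment.

```python
# Hypothetical hourly downtime costs collected during the BIA (illustrative values only).
HOURLY_DOWNTIME_COST = {
    "web-storefront": 12_000.0,
    "crm": 3_500.0,
    "reporting": 200.0,
}


def downtime_cost(service: str, outage_minutes: float) -> float:
    """Estimate financial impact as hourly cost times outage duration in hours."""
    if outage_minutes < 0:
        raise ValueError("outage_minutes must be non-negative")
    return HOURLY_DOWNTIME_COST[service] * (outage_minutes / 60.0)


def rank_by_impact(outage_minutes: float) -> list[str]:
    """Order services from highest to lowest estimated cost for the same outage."""
    return sorted(
        HOURLY_DOWNTIME_COST,
        key=lambda name: downtime_cost(name, outage_minutes),
        reverse=True,
    )
```

A ranking like this only covers the financial dimension; operational, regulatory, and reputational impact would still need to be weighed separately.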

Recovery objective definition:

Recovery Time Objective (RTO): maximum acceptable downtime
Recovery Point Objective (RPO): maximum acceptable data loss
Service Level Objective (SLO): performance level during recovery
Mission Critical: RTO < 1 hour, RPO < 15 minutes
Critical: RTO < 4 hours, RPO < 1 hour
Important: RTO < 8 hours, RPO < 4 hours
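
One way to make the tier targets above machine-checkable is a small lookup table. This is a sketch only: the class and function names are hypothetical, and the Standard tier is left out because no targets are given for it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss


# Targets copied from the tier list above; "Standard" is omitted because
# no targets are defined for it there.
TIER_TARGETS = {
    "Mission Critical": RecoveryTarget(timedelta(hours=1), timedelta(minutes=15)),
    "Critical": RecoveryTarget(timedelta(hours=4), timedelta(hours=1)),
    "Important": RecoveryTarget(timedelta(hours=8), timedelta(hours=4)),
}


def meets_targets(tier: str, downtime: timedelta, data_loss: timedelta) -> bool:
    """Return True when both measured values are strictly under the tier's targets."""
    target = TIER_TARGETS[tier]
    return downtime < target.rto and data_loss < target.rpo
```

The comparison is strict because the tiers are written with "<"; an organization that treats the target itself as acceptable would use "<=" instead.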

2. Continuity & DR Plan Development

Disaster scenario planning:

Identify potential disaster scenarios: natural disaster, cyberattack, data center failure, cloud provider outage, supply chain disruption, pandemic
Assess likelihood and impact for each scenario
Define response procedures for each scenario
Identify required resources and personnel
Define escalation and command structure
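
The likelihood and impact assessment above is often reduced to a simple score for ranking scenarios. The sketch below assumes a 1 to 5 scale for both factors and a plain product as the score; the ratings shown are illustrative, and other scoring schemes are equally valid.

```python
# Illustrative 1-5 ratings; real values would come from the organization's own assessment.
SCENARIOS = {
    "data center failure": {"likelihood": 2, "impact": 5},
    "cyberattack": {"likelihood": 4, "impact": 5},
    "cloud provider outage": {"likelihood": 3, "impact": 4},
    "pandemic": {"likelihood": 1, "impact": 3},
}


def risk_score(likelihood: int, impact: int) -> int:
    """Score a scenario as likelihood times impact (assumed scoring scheme)."""
    return likelihood * impact


def prioritize(scenarios: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (scenario, score) pairs, highest score first, ties broken by name."""
    scored = [
        (name, risk_score(r["likelihood"], r["impact"]))
        for name, r in scenarios.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```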

Recovery strategy selection:

Infrastructure recovery: cloud failover, cold/warm/hot standby, mutual aid agreements
Data recovery: continuous replication, frequent backups, offsite storage
Application recovery: container orchestration, multi-region deployment, automated failover
Workplace recovery: remote work capability, alternate work sites
Communication recovery: alternative communication channels

Recovery runbook development:

Step-by-step recovery procedures for each critical service
Priority order for service restoration
Required credentials, access points, and contact information
Decision trees for recovery path selection
Rollback procedures if recovery fails
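
The priority order for service restoration generally has to respect service dependencies. As a sketch, assuming a hypothetical dependency map (the service names below are illustrative), Python's standard `graphlib` module can group services into restoration waves:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDENCIES = {
    "core-network": set(),
    "database": {"core-network"},
    "web-app": {"core-network", "database"},
    "crm": {"database"},
    "email": {"core-network"},
}


def restoration_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable order
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Services within one wave can be restored in parallel; business criticality would then decide the order inside each wave.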

3. DR Infrastructure & Technology

Backup strategy and implementation:

Backup types: full, incremental, differential, continuous data protection
Backup frequency aligned with RPO requirements
3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
Immutable backups for ransomware protection
Backup encryption and access controls
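
The 3-2-1 rule above lends itself to an automated check. This is a minimal sketch, assuming an inventory of copies is available and that the production copy counts as one of the three; the `BackupCopy` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # e.g. "primary-dc", "dr-site"
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check for at least 3 copies, 2 distinct media types, and 1 offsite copy."""
    distinct_media = {copy.media for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(copies) >= 3 and len(distinct_media) >= 2 and has_offsite
```

The check says nothing about immutability or encryption, which would need their own validation.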

Replication and failover infrastructure:

Database replication (synchronous for critical, asynchronous for others)
Application multi-region deployment
DNS failover configuration
Load balancer health check and failover
Cloud region failover automation
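
Health-check-driven failover usually waits for several consecutive failures before switching, to avoid flapping on a single missed check. The sketch below illustrates that idea only; the threshold of three and the site labels are assumptions, and it does not model any particular load balancer or DNS product.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    failure_threshold: int = 3     # consecutive failed checks before failing over
    consecutive_failures: int = 0
    active_site: str = "primary"

    def record_check(self, healthy: bool) -> str:
        """Record one primary-site health check and return the site to serve traffic."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if (
                self.active_site == "primary"
                and self.consecutive_failures >= self.failure_threshold
            ):
                self.active_site = "dr"
        return self.active_site
```

Failback is deliberately not automatic here: once traffic moves to the DR site it stays there until an operator resets the monitor, which keeps failback a separate, deliberate step.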

Data protection validation:

Automated backup verification (daily)
Backup integrity testing (weekly)
Restoration testing (monthly for critical systems)
Recovery time measurement against RTO
Recovery point validation against RPO
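
Measuring recovery time and recovery point comes down to three timestamps. The following sketch assumes those timestamps are recorded during a restoration test; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def measure_recovery(
    outage_start: datetime,
    service_restored: datetime,
    last_recoverable_data: datetime,
) -> tuple[timedelta, timedelta]:
    """Return (recovery_time, data_loss_window) for one incident or test."""
    recovery_time = service_restored - outage_start
    data_loss = outage_start - last_recoverable_data
    return recovery_time, data_loss


def evaluate(
    recovery_time: timedelta,
    data_loss: timedelta,
    rto: timedelta,
    rpo: timedelta,
) -> dict[str, bool]:
    """Compare measured values against targets, treating the target itself as a pass."""
    return {"rto_met": recovery_time <= rto, "rpo_met": data_loss <= rpo}
```

For example, an outage starting at 09:00 with service restored at 09:48 and the last recoverable data from 08:50 gives a 48-minute recovery time and a 10-minute data loss window.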

4. Testing & Exercises

Testing program design:

Annual full-scale DR exercise (all critical services)
Quarterly component-level DR tests
Monthly backup restoration tests
Tabletop exercises for leadership (twice a year)
Progressive complexity increase year-over-year

Test execution and documentation:

Pre-test: notify stakeholders, validate current state, prepare test environment
During test: execute recovery procedures, document timing and issues
Post-test: compare results against RTO/RPO targets
Capture lessons learned and improvement actions
Update DR plans based on test findings

Test types and scope:

Notification test: validate contact lists and notification process
Readiness test: review plans, verify resources, validate contact information
Simulated test: practice recovery in isolated environment
Parallel test: run operations from DR site alongside production
Full-interruption test: actual failover of production, run during a planned maintenance window

5. Plan Maintenance & Continuous Improvement

Plan review and update:

Quarterly plan review and update (minimum)
Immediate update after significant infrastructure change
Annual comprehensive review with all stakeholders
Version control and change log maintenance
Distribution to relevant personnel and stakeholders

Contact and resource management:

Emergency contact list maintenance (quarterly validation)
Vendor and partner emergency contact information
Resource inventory validation (spare hardware, licenses, cloud credits)
Credential rotation for recovery access
Alternate communication channel testing

Training and awareness:

DR team training (quarterly)
General staff awareness of business continuity procedures
Leadership tabletop exercise participation
New team member onboarding includes BCP/DR awareness
Cross-training for critical recovery roles

Templates & Frameworks

Business Continuity Plan Summary

BUSINESS CONTINUITY PLAN — 2025
================================

CRITICAL SERVICES RECOVERY PRIORITY:
  1. Core network and DNS — RTO: 30 min, RPO: 0 min
  2. Customer-facing web applications — RTO: 1 hour, RPO: 15 min
  3. CRM and sales systems — RTO: 2 hours, RPO: 30 min
  4. Email and collaboration — RTO: 4 hours, RPO: 1 hour
  5. Internal applications — RTO: 8 hours, RPO: 4 hours
  6. Reporting and analytics — RTO: 24 hours, RPO: 8 hours

DISASTER RECOVERY SITE:
  Primary data center: [Location, Provider]
  DR site: [Location, Provider] (hot standby)
  Cloud failover: AWS us-east-1 → us-west-2
  Estimated failover time: 45 minutes (automated)

EMERGENCY CONTACT LIST:
  Incident Commander: [Name, Phone, Email]
  IT Director: [Name, Phone, Email]
  Security Lead: [Name, Phone, Email]
  Communications Lead: [Name, Phone, Email]
  Executive Sponsor: [Name, Phone, Email]
  Key Vendors: [List with emergency contact numbers]

RECOVERY DECISION FRAMEWORK:
  If single system failure → restart/patch in place
  If data center failure → failover to DR site
  If cloud region failure → failover to secondary region
  If cyberattack detected → isolate, contain, investigate, restore from clean backup
  If extended outage (>4 hours) → activate BCP, shift to remote work
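
The recovery decision framework in the plan summary can also be expressed as a lookup, which makes it easier to test and to embed in tooling. This is an illustrative sketch; the `Event` enum and function name are hypothetical, and the response strings are copied from the framework.

```python
from enum import Enum


class Event(Enum):
    SINGLE_SYSTEM_FAILURE = "single system failure"
    DATA_CENTER_FAILURE = "data center failure"
    CLOUD_REGION_FAILURE = "cloud region failure"
    CYBERATTACK = "cyberattack detected"


# Responses copied from the decision framework in the plan summary.
RESPONSES = {
    Event.SINGLE_SYSTEM_FAILURE: "restart/patch in place",
    Event.DATA_CENTER_FAILURE: "failover to DR site",
    Event.CLOUD_REGION_FAILURE: "failover to secondary region",
    Event.CYBERATTACK: "isolate, contain, investigate, restore from clean backup",
}

BCP_THRESHOLD_HOURS = 4.0  # outages longer than this also activate the BCP


def recovery_actions(event: Event, outage_hours: float = 0.0) -> list[str]:
    """Return the primary response, plus BCP activation for extended outages."""
    actions = [RESPONSES[event]]
    if outage_hours > BCP_THRESHOLD_HOURS:
        actions.append("activate BCP, shift to remote work")
    return actions
```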

DR Test Checklist

DISASTER RECOVERY TEST CHECKLIST
================================

PRE-TEST PREPARATION:
  [ ] Test scope and objectives defined
  [ ] Stakeholder notification sent (7 days in advance)
  [ ] Test environment validated
  [ ] Current production state documented and backed up
  [ ] DR team briefed and assigned roles
  [ ] Communication channels tested
  [ ] Timing and measurement tools prepared

TEST EXECUTION:
  [ ] Test start time recorded: [HH:MM]
  [ ] Failover initiated — time recorded
  [ ] DNS update and propagation verified
  [ ] Critical services restored (check each service):
    [ ] Core network/DNS — restored at [HH:MM] — time: [X] min
    [ ] Web applications — restored at [HH:MM] — time: [X] min
    [ ] Database — restored at [HH:MM] — time: [X] min
    [ ] Email/collaboration — restored at [HH:MM] — time: [X] min
  [ ] Data integrity verified (RPO validation)
  [ ] Application functionality verified
  [ ] User access validated (sample test)
  [ ] Failback to primary executed (if applicable)

POST-TEST:
  [ ] Test end time recorded: [HH:MM]
  [ ] RTO achieved: [Yes/No] — Actual: [X] min vs Target: [Y] min
  [ ] RPO achieved: [Yes/No] — Actual data loss: [X] min vs Target: [Y] min
  [ ] Issues documented and categorized
  [ ] Lessons learned captured
  [ ] Improvement action items assigned
  [ ] DR plan updated based on findings
  [ ] Test report distributed to stakeholders

Integration Points

Cloud DR services (AWS Elastic Disaster Recovery, Azure Site Recovery, Google Cloud Backup and DR Service): Infrastructure failover
Backup platforms (Veeam, Commvault, Rubrik, Druva): Data protection and backup
Replication tools (Zerto, Storage Replica, database native replication): Data synchronization
DNS failover (Route 53 failover, Cloudflare Load Balancing): Traffic redirection
Communication platforms (Slack, Teams, emergency notification services): Emergency communication
CMDB and service mapping: Dependency identification
Monitoring platforms: Recovery validation and health checks
Compliance systems: BCP/DR audit evidence

Edge Cases

Extended outage (>48 hours): Activate alternate work arrangements; implement manual workarounds for critical processes; daily stakeholder briefings; monitor employee well-being
Simultaneous multi-site failure: Activate cloud-based contingency environment; prioritize mission-critical services only; manual routing of essential operations
Ransomware during recovery: Validate backup integrity before restoration; restore only from immutable backups; run forensic investigation in parallel with recovery; notify law enforcement and the insurer
Cloud provider regional outage: Cross-cloud failover capability; DNS-based traffic shifting; vendor communication coordination; customer notification management
Supply chain disruption for hardware: Maintain minimum spare inventory; multi-vendor hardware strategy; cloud burst capacity agreement

Output

BCP/DR Status Dashboard

BUSINESS CONTINUITY STATUS — April 2025
=========================================

RECOVERY READINESS:
  Plan last reviewed: 2025-04-01 (current ✓)
  Last full DR test: 2025-03-15 (on schedule ✓)
  Next scheduled component test: 2025-06-15 (quarterly)
  Plan version: 4.2 (distributed to 47 stakeholders)

RECOVERY OBJECTIVES STATUS:
  Service             | RTO Target | Last Test | Status
  --------------------|------------|-----------|--------
  Core Network        | 30 min     | 22 min    | ✓
  Web Applications    | 1 hour     | 48 min    | ✓
  CRM Systems         | 2 hours    | 1h 45min  | ✓
  Email/Collaboration | 4 hours    | 3h 20min  | ✓
  Internal Apps       | 8 hours    | 6h 45min  | ✓
  Analytics           | 24 hours   | 18 hours  | ✓

BACKUP STATUS:
  Backup success rate: 99.2%
  Last successful full backup: 2025-04-15
  Backup integrity test (last): Passed ✓
  Immutable backup coverage: 100% critical systems

RECOVERY INFRASTRUCTURE:
  DR site status: Hot standby — synchronized ✓
  Replication lag: 12 seconds (target: <60 seconds ✓)
  Cloud failover ready: ✓
  DNS failover configured: ✓
  Emergency credentials: Validated ✓

CONTACT VALIDATION:
  Emergency contacts validated: 94% (3 overdue)
  Vendor contacts current: 100%
  Communication channels tested: Last 7 days ✓

IMPROVEMENT ACTIONS:
  [ ] Update DR contact list (3 overdue) — Due: April 18
  [ ] Test email failover (next quarterly test) — Due: June 15
  [ ] Renew cloud burst agreement — Due: May 30

Trigger Phrases

"business continuity", "disaster recovery", "DR plan", "BCP", "failover", "RTO", "RPO", "site recovery", "backup strategy", "failover testing", "business impact analysis", "contingency planning", "recovery plan", "drill exercise", "site failover"

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.