---
name: backup-disaster-recovery
description: Design, implement, and test backup and disaster recovery (DR) strategies including RPO/RTO targets, backup automation, replication, failover testing, and business continuity planning. Use when creating backup strategies, designing DR solutions, testing failover procedures, or developing business continuity plans. Triggers on phrases like "backup", "disaster recovery", "DR", "RPO", "RTO", "failover", "business continuity", "BCP", "replication", "snapshot", "archive", "recovery point objective", "recovery time objective", "backup strategy", "failback", "site recovery", "BCDR", "active-active", "active-passive", "warm standby", "hot standby".
---

# Backup & Disaster Recovery

Design, implement, and test backup and disaster recovery (DR) strategies including RPO/RTO targets, backup automation, replication, and failover testing.

## Workflow

### 1. DR Strategy & Classification

```
DISASTER RECOVERY STRATEGY
═══════════════════════════════════════

Application Classification:
═══════════════════════════════════════

Tier    RPO          RTO           Strategy              Applications
───────────────────────────────────────────────────────────────────────
1       ≤ 1 minute   ≤ 5 minutes   Active-Active         Payment processing, auth
2       ≤ 1 hour     ≤ 1 hour      Active-Passive (Hot)  Customer database, email
3       ≤ 4 hours    ≤ 4 hours     Warm Standby          Internal tools, analytics
4       ≤ 24 hours   ≤ 24 hours    Cold Standby          Archive, development

DR SITES:
═══════════════════════════════════════

Primary Site: us-east-1 (Virginia)
  → All Tier 1-4 applications
  → 2,500 VMs, 150 cloud instances
  → Primary data center

DR Site: us-west-2 (Oregon)
  → Tier 1: Active-Active; Tier 2: Hot standby
  → Tier 3: Warm standby; Tier 4: Cold standby (restore from backup)
  → Connected via Direct Connect (1 Gbps)

  Distance: 2,130 miles (seismic separation)
  Network latency: 45ms

ACTIVE-ACTIVE (Tier 1):
═══════════════════════════════════════

  → Database: Cross-region replication (read replicas)
  → Application: Global load balancer (Route 53)
  → Failover: Automatic (health check failure)
  → RTO: <5 minutes (bounded by health-check interval and DNS TTL)
  → RPO: <1 minute (async replication)

ACTIVE-PASSIVE (Tier 2):
═══════════════════════════════════════

  → Database: Cross-region replication (standby)
  → Application: Pre-provisioned, idle
  → Failover: Semi-automatic (1-click)
  → RTO: <1 hour (warming + DNS)
  → RPO: <1 hour (replication lag)
```
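
The classification table can be applied mechanically when onboarding a new application. The following is a minimal sketch, assuming the tier targets listed above; the `Tier` class and `classify` function are hypothetical names used for illustration. It picks the least strict (and typically cheapest) tier whose guaranteed RPO and RTO still satisfy what the application requires.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    number: int
    max_rpo: timedelta
    max_rto: timedelta
    strategy: str


# Targets copied from the classification table, strictest tier first.
TIERS = [
    Tier(1, timedelta(minutes=1), timedelta(minutes=5), "Active-Active"),
    Tier(2, timedelta(hours=1), timedelta(hours=1), "Active-Passive (Hot)"),
    Tier(3, timedelta(hours=4), timedelta(hours=4), "Warm Standby"),
    Tier(4, timedelta(hours=24), timedelta(hours=24), "Cold Standby"),
]


def classify(required_rpo: timedelta, required_rto: timedelta) -> Tier:
    """Return the least strict tier whose targets meet both requirements."""
    for tier in reversed(TIERS):
        if tier.max_rpo <= required_rpo and tier.max_rto <= required_rto:
            return tier
    raise ValueError("requirements are stricter than the Tier 1 targets")


# An app that tolerates 1 hour of data loss and 2 hours of downtime.
tier = classify(timedelta(hours=1), timedelta(hours=2))
print(tier.number, tier.strategy)  # 2 Active-Passive (Hot)
```

Note that a requirement falling between two tiers rounds toward the stricter one: an application that tolerates only 30 minutes of data loss lands in Tier 1, because Tier 2 guarantees no better than 1 hour.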

### 2. Backup Strategy

```
BACKUP STRATEGY
═══════════════════════════════════════

Backup Types:
═══════════════════════════════════════

Type                       Schedule    Retention  Scope              Recovery Impact
───────────────────────────────────────────────────────────────────────────────────────────
Full                       Weekly      30 days    Everything         Base restore
Incremental                Daily       7 days     Since last backup  Small backups, longer restore chain
Differential               Daily       7 days     Since last full    Full + latest differential only
Snapshot                   Hourly      24 hours   Block-level        Fast rollback
CDC (Change Data Capture)  Continuous  7 days     DB changes         Point-in-time

BACKUP SCHEDULE:
═══════════════════════════════════════

  Sunday    02:00 AM    Full backup (all systems)
  Mon-Sat   02:00 AM    Incremental backup
  Every 1h  ---         Snapshot (critical systems)
  Continuous ---        CDC (databases)
  Every 15m ---         Snapshot (Tier 1 systems)

BACKUP SCOPE:
═══════════════════════════════════════

  Systems:
    → Databases: Full + CDC + snapshot
    → File servers: Full + incremental
    → VMs: Snapshot + full weekly
    → Cloud: Automated (EBS, S3 versioning, RDS)
    → Configs: Git repository + snapshot
    → Backups: Back up the backups (offsite)

  Storage:
    → Primary: On-site NAS (fast restore)
    → Secondary: Cloud storage (S3/GCS, encrypted)
    → Archive: Glacier/Archive tier (long-term, 3-7 years)
    → Offsite: Different region (3-2-1 rule)

3-2-1 RULE:
═══════════════════════════════════════

  3 copies of data (primary + 2 backups)
  2 different storage media (local + cloud)
  1 offsite copy (different region)

  Plus: Automated, encrypted, tested
```
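
The practical difference between incremental and differential backups shows up at restore time. The sketch below, assuming one backup per day on top of a weekly full, lists which backup sets a restore would need; `restore_chain` and the `kind@date` labels are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta


def restore_chain(target: date, last_full: date, mode: str) -> list[str]:
    """List the backups needed to restore to `target`, oldest first.

    Assumes one backup per day and a full backup taken on `last_full`.
    """
    if target < last_full:
        raise ValueError("target predates the last full backup")
    days = (target - last_full).days
    chain = [f"full@{last_full.isoformat()}"]
    if days == 0:
        return chain
    if mode == "incremental":
        # Every daily incremental since the full must be replayed in order.
        chain += [
            f"incremental@{(last_full + timedelta(days=d)).isoformat()}"
            for d in range(1, days + 1)
        ]
    elif mode == "differential":
        # Only the most recent differential is needed on top of the full.
        chain.append(f"differential@{target.isoformat()}")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return chain


sunday = date(2024, 12, 1)     # weekly full
wednesday = date(2024, 12, 4)  # restore target
print(len(restore_chain(wednesday, sunday, "incremental")))   # 4
print(len(restore_chain(wednesday, sunday, "differential")))  # 2
```

A longer chain means more sets to read and more places for a single corrupt set to break the restore, which is one reason to test restores from late in the week and not only from the day after the full.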

### 3. Backup Automation

```
BACKUP AUTOMATION
═══════════════════════════════════════

Cloud Backups (AWS):
═══════════════════════════════════════

  EBS Snapshots:
    → AWS Backup (centralized policy)
    → Rule: Daily (EBS snapshots are incremental after the first)
    → Retention: 35 days
    → Encryption: KMS (customer-managed key)
    → Cross-region: Copy to us-west-2

  RDS:
    → Automated backups: 35-day retention
    → Point-in-time recovery: Enabled
    → Snapshots: Manual (before changes)
    → Cross-region: Snapshot copy

  S3:
    → Versioning: Enabled
    → Lifecycle: Transition to Glacier after 90 days
    → MFA delete: Enabled (critical buckets)
    → Replication: Cross-region (CRR)

  EC2:
    → AMI: Weekly (gold image)
    → EBS: Automated snapshots
    → Config: Packer (reproducible)

On-Premise Backups:
═══════════════════════════════════════

  Tool: Veeam / Commvault / Rubrik
  → VM backup: Nightly (incremental forever)
  → Full backup: Weekly (Sunday)
  → Synthetic full: Weekly (no production impact)
  → Offsite replication: Cloud (AWS S3)
  → Immutable backups: WORM / object lock, plus an air-gapped copy (ransomware protection)

BACKUP MONITORING:
═══════════════════════════════════════

  → Daily: Backup success report (email)
  → Alert: Failed backup (immediate, PagerDuty)
  → Weekly: Backup health dashboard
  → Monthly: Restore test (automated validation)

  Current backup success rate: 99.2% (target: 99.5%)
  Failed backups this month: 3 (investigating)
```
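
The EBS policy above (daily, 35-day retention, copy to us-west-2) maps onto an AWS Backup plan document. The following is a sketch of that document built as a plain Python dictionary, so it runs without AWS access. The plan name, rule name, vault name, account ID, and ARN are placeholders, and the 02:00 UTC schedule is an assumption borrowed from the backup schedule section. Encryption is not part of the plan itself; in AWS Backup it is a property of the backup vault.

```python
import json


def build_backup_plan(source_vault: str, dr_vault_arn: str) -> dict:
    """Build a BackupPlan document for AWS Backup's CreateBackupPlan call."""
    return {
        "BackupPlanName": "daily-35d-cross-region",  # placeholder name
        "Rules": [
            {
                "RuleName": "daily-0200-utc",  # placeholder name
                "TargetBackupVaultName": source_vault,
                # AWS cron syntax: 02:00 UTC every day.
                "ScheduleExpression": "cron(0 2 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        # Cross-region copy into the DR region's vault.
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }


plan = build_backup_plan(
    source_vault="primary-vault",  # placeholder vault name
    dr_vault_arn="arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",
)
print(json.dumps(plan, indent=2))
# With boto3 installed and credentials configured, this could be submitted as:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Keeping the plan as data makes it easy to review in version control and to assert on in tests before anything is applied to an account.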

### 4. Failover Testing

```
FAILOVER TESTING
═══════════════════════════════════════

Test Schedule:
═══════════════════════════════════════

Test Type           Frequency    Duration    Scope          Impact
───────────────────────────────────────────────────────────────────────
Tabletop            Quarterly    2 hours     Planning       None
Component           Monthly      1 hour      Single system  Isolated
Partial             Quarterly    4 hours     Tier 2-3       Limited
Full DR Drill       Annually     8 hours     All tiers      Planned

TEST PROCEDURES:
═══════════════════════════════════════

Full DR Drill (Annual):
═══════════════════════════════════════

  Pre-Test (1 week before):
    → Notify stakeholders
    → Schedule maintenance window
    → Document pre-test state
    → Prepare rollback plan

  Test Execution (day of):
    08:00  Declare DR event (simulated)
    08:05  Activate DR team
    08:15  Begin failover (Tier 1)
    08:30  Verify Tier 1 apps (payment, auth)
    09:00  Begin failover (Tier 2)
    09:30  Verify Tier 2 apps (database, email)
    10:00  Begin failover (Tier 3)
    10:30  Run application tests
    11:00  Verify RPO/RTO met
    11:30  Begin failback
    13:00  Complete failback
    13:30  Verify primary site
    14:00  Declare test complete

  Post-Test (1 week after):
    → Document results (RTO/RPO achieved vs target)
    → Identify gaps
    → Update DR runbook
    → Lessons learned meeting

TEST RESULTS (Last Drill):
═══════════════════════════════════════

  Tier    Target RTO    Actual RTO    Target RPO    Actual RPO    Status
  ────────────────────────────────────────────────────────────────────────
  1       5 minutes     4 minutes     1 minute      30 seconds    ✓ Pass
  2       1 hour        45 minutes    1 hour        15 minutes    ✓ Pass
  3       4 hours       3 hours       4 hours       2 hours       ✓ Pass
  4       24 hours      12 hours      24 hours      8 hours       ✓ Pass

  Issues found: 3
    → DNS propagation slower than expected (resolved)
    → Application config missing in DR (fixed)
    → Network ACL blocking DR traffic (updated)
```
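
The pass/fail column in the drill results follows directly from comparing measured values against targets. A minimal sketch of that comparison, using the figures from the results table above, is shown here; the `evaluate` function and the tuple layout are illustrative assumptions, not a prescribed report format.

```python
from datetime import timedelta

# (tier, target RTO, actual RTO, target RPO, actual RPO) from the last drill.
RESULTS = [
    (1, timedelta(minutes=5), timedelta(minutes=4), timedelta(minutes=1), timedelta(seconds=30)),
    (2, timedelta(hours=1), timedelta(minutes=45), timedelta(hours=1), timedelta(minutes=15)),
    (3, timedelta(hours=4), timedelta(hours=3), timedelta(hours=4), timedelta(hours=2)),
    (4, timedelta(hours=24), timedelta(hours=12), timedelta(hours=24), timedelta(hours=8)),
]


def evaluate(results):
    """Return {tier: (passed, rto_headroom, rpo_headroom)} for each row."""
    report = {}
    for tier, target_rto, actual_rto, target_rpo, actual_rpo in results:
        passed = actual_rto <= target_rto and actual_rpo <= target_rpo
        report[tier] = (passed, target_rto - actual_rto, target_rpo - actual_rpo)
    return report


for tier, (passed, rto_margin, rpo_margin) in evaluate(RESULTS).items():
    status = "Pass" if passed else "FAIL"
    print(f"Tier {tier}: {status} (RTO headroom {rto_margin}, RPO headroom {rpo_margin})")
```

Tracking headroom as well as pass/fail is useful between drills: Tier 1 passed with only one minute of RTO margin, so a small regression there would turn into a failure.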

### 5. Business Continuity Planning

```
BUSINESS CONTINUITY PLAN (BCP)
═══════════════════════════════════════

BCP Components:
═══════════════════════════════════════

  1. Business Impact Analysis (BIA):
     → Critical business functions identified
     → MTPD (Maximum Tolerable Period of Disruption) defined
     → Resource dependencies mapped
     → Financial impact assessed

  2. Recovery Strategies:
     → People: Remote work, alternate workspace
     → Technology: DR site, cloud failover
     → Processes: Manual workarounds
     → Suppliers: Alternate vendors
     → Facilities: Alternate office

  3. Communication Plan:
     → Internal: Employees, management
     → External: Customers, partners, media
     → Regulatory: Regulators, authorities
     → Status page: Public updates

  4. Activation Criteria:
     → Who can declare disaster? (CISO, CIO, CEO)
     → Escalation thresholds
     → Decision tree for response level

BCP CONTACT TREE:
═══════════════════════════════════════

  Level 1: Incident Commander (on-call)
    → Level 2: CISO, CIO
      → Level 3: Department Heads
        → Level 4: Team Leads
          → Level 5: All Staff

  Communication Methods (redundant):
    → Primary: Phone calls
    → Secondary: SMS (mass notification)
    → Tertiary: Email
    → Quaternary: Social media / status page
```
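
The redundant communication methods amount to a fallback chain: try the primary channel and move to the next only when delivery is not confirmed. The sketch below illustrates that logic with stub senders; `notify` and the stub functions are hypothetical stand-ins for whatever phone, SMS, and email integrations are actually in place.

```python
from typing import Callable, Iterable

# A channel is a name plus a send function that returns True on confirmed delivery.
Channel = tuple[str, Callable[[str, str], bool]]


def notify(recipient: str, message: str, channels: Iterable[Channel]) -> str:
    """Try channels in priority order; return the name of the first that succeeds."""
    for name, send in channels:
        try:
            if send(recipient, message):
                return name
        except Exception:
            # A broken channel must not stop the escalation; try the next one.
            continue
    raise RuntimeError(f"all channels failed for {recipient}")


# Stub senders standing in for real integrations.
def phone(recipient: str, message: str) -> bool:
    return False  # simulate an unanswered call


def sms(recipient: str, message: str) -> bool:
    return True


def email(recipient: str, message: str) -> bool:
    return True


used = notify(
    "incident-commander",
    "DR event declared",
    [("phone", phone), ("sms", sms), ("email", email)],
)
print(used)  # sms
```

Returning the channel that succeeded gives the incident log a record of how each person was actually reached, which helps when reviewing the contact tree after a drill.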

## Edge Cases

- **Ransomware**: Immutable backups, air-gapped copies
- **Region outage**: Multi-region strategy, DNS failover
- **Data corruption**: Point-in-time recovery, validation
- **Partial failure**: Component-level recovery
- **Long-term outage**: Extended BCP (weeks/months)
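
For the data corruption case, validation can be as simple as comparing a checksum recorded at backup time with one computed on the restored file. The following is a minimal sketch, assuming a SHA-256 digest was stored alongside each backup; `sha256_of` and `restore_is_valid` are hypothetical helper names, and the temporary files only simulate a backup and a restore.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_is_valid(restored: Path, expected_sha256: str) -> bool:
    """Compare the restored file's digest with the one recorded at backup time."""
    return sha256_of(restored) == expected_sha256.lower()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "db.dump"
    original.write_bytes(b"customer records")
    recorded = sha256_of(original)  # stored with the backup

    good_restore = Path(tmp) / "restored_ok.dump"
    good_restore.write_bytes(b"customer records")
    bad_restore = Path(tmp) / "restored_bad.dump"
    bad_restore.write_bytes(b"customer recorts")  # simulated corruption

    ok_result = restore_is_valid(good_restore, recorded)
    bad_result = restore_is_valid(bad_restore, recorded)

print(ok_result, bad_result)  # True False
```

A checksum only proves the restored bytes match what was backed up. If the source was already corrupt when the backup ran, point-in-time recovery to an earlier state is still needed.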

## Integration Points

- **Backup tools**: Veeam, Commvault, Rubrik, AWS Backup
- **Cloud**: AWS, Azure, GCP (native backup)
- **Monitoring**: Nagios, Zabbix, CloudWatch
- **Communication**: PagerDuty, Opsgenie, Slack
- **DR orchestration**: Azure Site Recovery, AWS DRS, Zerto
- **Status page**: Atlassian Statuspage, Cachet

## Output

### Backup & DR Status

```
BACKUP & DR STATUS — Q4 2024
═══════════════════════════════════════

Backup success rate: 99.2% (target: 99.5%)
Systems backed up: 98% of in-scope
Ransomware protection: Immutable backups ✓
Last full DR drill: Q4 2024 (all tiers passed)
Next DR drill: Q1 2025
RPO achieved: All within target
RTO achieved: All within target
Open issues: 1 (DNS propagation optimization)
```
