---
name: backup-recovery
description: Manage backup and disaster recovery operations including backup strategy design, backup scheduling and execution, restore testing, disaster recovery planning, business continuity, RPO/RTO management, and backup compliance. Use when designing backup strategies, managing backup schedules, performing restore tests, planning disaster recovery, or ensuring backup compliance. Triggers on phrases like "backup strategy", "disaster recovery", "DR plan", "backup scheduling", "restore test", "RPO", "RTO", "business continuity", "BCP", "backup compliance", "offsite backup", "immutability", "ransomware backup", "cross-region backup", "backup verification", "backup retention", "tape backup", "cloud backup", "backup automation".
---

# Backup & Disaster Recovery

Protect against data loss with comprehensive backup strategies, regular restore testing, and validated disaster recovery plans.

## Backup Strategy

### Enterprise Backup Architecture

```
ENTERPRISE BACKUP ARCHITECTURE:
════════════════════════════════

BACKUP PLATFORM: Veeam (primary) + AWS Backup + Azure Backup + native tools
  Scope: 100% of production data (servers, databases, cloud, endpoints)
  Backup jobs: 185 (scheduled)
  Backup success rate: 99.6% (target: >99%) ✓
  Daily backup window: 10:00 PM - 4:00 AM (6 hours)
  Daily backup data: 4.5 TB (incremental, deduplicated: 1.2 TB)

BACKUP TOPOLOGY:
  ┌──────────────────────────┬──────────┬─────────────────────┬────────────┐
  │ Asset Type               │ Count    │ Backup Method       │ Frequency  │
  ├──────────────────────────┼──────────┼─────────────────────┼────────────┤
  │ VMware VMs               │ 45       │ Veeam (image)       │ Daily      │
  │ AWS EC2 instances        │ 45       │ AWS Backup          │ Daily      │
  │ Azure VMs                │ 30       │ Azure Backup        │ Daily      │
  │ Databases (PostgreSQL)   │ 12       │ pgBackRest + RDS    │ Hourly     │
  │ Databases (MySQL)        │ 6        │ XtraBackup + RDS    │ Hourly     │
  │ Databases (MongoDB)      │ 4        │ MongoDB Backup      │ Hourly     │
  │ S3 buckets               │ 28       │ S3 versioning +     │ Continuous │
  │                          │          │ cross-region        │            │
  │ Azure Blob storage       │ 8        │ Blob versioning +   │ Continuous │
  │                          │          │ cross-region        │            │
  │ File servers             │ 6        │ Veeam (file-level)  │ Daily      │
  │ Endpoints (laptops)      │ 450      │ CrowdStrike +       │ Daily      │
  │                          │          │ OneDrive/SharePoint │            │
  │ Email (M365)             │ 485      │ M365 native +       │ Continuous │
  │                          │          │ third-party         │            │
  │ K8s persistent volumes   │ 35       │ Velero              │ Daily      │
  │ Configuration/IaC        │ 721      │ Git (versioned)     │ Continuous │
  ├──────────────────────────┼──────────┼─────────────────────┼────────────┤
  │ TOTAL                    │ 1,875    │ Multiple            │ Variable   │
  └──────────────────────────┴──────────┴─────────────────────┴────────────┘

BACKUP 3-2-1 RULE COMPLIANCE:
  3 copies of data: ✓
    - Primary (production)
    - Local backup (on-prem backup server)
    - Offsite backup (cloud — AWS S3 + Azure Blob)
  
  2 different media: ✓
    - Local disk (backup server — Dell EMC)
    - Cloud storage (AWS S3 + Azure Blob)
  
  1 offsite copy: ✓
    - Cross-region (AWS us-east-1 → us-west-2)
    - Cross-cloud (AWS → Azure — critical data)
  
  Extension: 3-2-1-1-0
    1 immutable copy: ✓ (AWS S3 Object Lock — WORM)
    0 errors: ✓ (backup verification — automated)

RETENTION POLICY:
  ┌──────────────────────────┬────────────┬───────────┬──────────┐
  │ Data Type                │ Short      │ Medium    │ Long     │
  │                          │ (<30d)     │ (30-90d)  │ (>90d)   │
  ├──────────────────────────┼────────────┼───────────┼──────────┤
  │ VM images                │ Daily      │ Weekly    │ Monthly  │
  │ Database full backup     │ Daily      │ Weekly    │ Monthly  │
  │ Database incremental     │ Hourly     │ Daily     │ N/A      │
  │ Database WAL/logs        │ Continuous │ 7 days    │ N/A      │
  │ File server data         │ Daily      │ Weekly    │ Monthly  │
  │ Object storage (S3/Blob) │ Version    │ Lifecycle │ Glacier  │
  │ Email                    │ 30 days    │ 1 year    │ 7 years  │
  │ Endpoint (laptop)        │ 14 days    │ 30 days   │ N/A      │
  │ Compliance data          │ Daily      │ Monthly   │ 7 years  │
  ├──────────────────────────┼────────────┼───────────┼──────────┤
  │ Storage used             │ 15 TB      │ 25 TB     │ 40 TB    │
  └──────────────────────────┴────────────┴───────────┴──────────┘

  Total backup storage: 80 TB (on-prem: 20 TB, cloud: 60 TB)
  Monthly storage cost: ~$800 (cloud backup storage)
  Lifecycle: Auto-archive (cold → glacier at 90 days)
  Deletion: Automated (per retention policy — compliance hold override)

IMMUTABLE BACKUPS (Ransomware Protection):
  Immutable storage: AWS S3 Object Lock (WORM — Write Once, Read Many)
  Coverage: 100% of production data (critical: 18 databases + 45 VMs)
  Retention (immutable): 30 days (non-deletable, non-modifiable)
  Legal hold: Overrides retention expiry (compliance data kept 7 years, including immutable copies)
  
  Ransomware protection:
    - Air-gapped backup: 1 copy (offline tape — quarterly sync)
    - Immutable cloud: 1 copy (S3 Object Lock — 30 days)
    - Cross-region: 1 copy (us-west-2 — separate account)
    - Access: Break-glass (2-person approval + MFA)
  
  Testing: Quarterly ransomware simulation (restore from immutable)
  Last test: Q4 2024, passed (all data recovered)
```
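
The 3-2-1-1-0 checklist above can be evaluated mechanically per dataset. The following is a minimal Python sketch rather than a feature of any backup product: the `BackupCopy` fields and the `check_3_2_1_1_0` helper are hypothetical names introduced for illustration, and the sample copies only mirror the layout described above (production, on-prem backup server, immutable cloud copy).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical model; field names are illustrative)."""
    location: str       # e.g. "on-prem backup server" or "aws:us-west-2"
    media: str          # e.g. "disk", "object-storage", "tape"
    offsite: bool       # stored away from the primary site
    immutable: bool     # WORM-protected (e.g. Object Lock) or offline
    verify_errors: int  # errors reported by the last verification run


def check_3_2_1_1_0(copies: list[BackupCopy]) -> dict[str, bool]:
    """Evaluate each clause of the 3-2-1-1-0 rule for one dataset."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        "0_errors": all(c.verify_errors == 0 for c in copies),
    }


copies = [
    BackupCopy("production", "disk", offsite=False, immutable=False, verify_errors=0),
    BackupCopy("on-prem backup server", "disk", offsite=False, immutable=False, verify_errors=0),
    BackupCopy("aws:us-west-2", "object-storage", offsite=True, immutable=True, verify_errors=0),
]
report = check_3_2_1_1_0(copies)
print(report)
```

Returning one boolean per clause, instead of a single pass/fail, makes it easier to report which part of the rule a dataset violates.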

## Restore Operations

### Recovery Testing & Validation

```
RESTORE OPERATIONS:
═══════════════════

RECOVERY OBJECTIVES (RTO / RPO):
  ┌──────────────────────────┬──────────┬──────────┬──────────────────┐
  │ Asset Type               │ RTO      │ RPO      │ Method           │
  ├──────────────────────────┼──────────┼──────────┼──────────────────┤
  │ VM (full)                │ <2 hrs   │ <24 hrs  │ Image restore    │
  │ VM (critical)            │ <30 min  │ <1 hr    │ Instant recovery │
  │ Database (full)          │ <1 hr    │ <5 min   │ PITR             │
  │ Database (single table)  │ <10 min  │ <5 min   │ Point-in-time    │
  │ File (individual)        │ <15 min  │ <1 hr    │ File-level       │
  │ Email (individual)       │ <1 hr    │ <1 hr    │ M365 restore     │
  │ S3 object                │ <5 min   │ <5 sec   │ Version restore  │
  │ K8s PV                   │ <30 min  │ <24 hrs  │ Velero           │
  ├──────────────────────────┼──────────┼──────────┼──────────────────┤
  │ Overall (worst case)     │ <2 hrs   │ <24 hrs  │ Multiple         │
  └──────────────────────────┴──────────┴──────────┴──────────────────┘

RESTORE TESTING PROGRAM:
  Cadence:
    Individual file/email: Weekly (spot check — random sample)
    Database restore: Monthly (all production databases)
    VM restore: Quarterly (sample — 5 VMs)
    Full DR test: Semi-annual (every 6 months, full failover)
    Ransomware simulation: Quarterly (restore from immutable)
  
  Testing methodology:
    1. Select assets (random or targeted)
    2. Restore to isolated environment (no production impact)
    3. Validate data integrity (checksum, application test)
    4. Measure recovery time against the RTO target (restore request to validation)
    5. Document results (pass/fail, root cause if fail)
    6. Remediate (if fail — retest)
    7. Report (backup team + management)
  
  Testing results (January 2025):
    File restores tested: 15 (weekly sample)
    Database restores tested: 18 (all production — monthly)
    VM restores tested: 5 (quarterly sample)
    
    ┌──────────────────────────┬──────────┬───────────┬──────────┐
    │ Asset Type               │ Tested   │ Passed    │ RTO met  │
    ├──────────────────────────┼──────────┼───────────┼──────────┤
    │ File                     │ 15       │ 15 (100%) │ 15/15    │
    │ Database                 │ 18       │ 18 (100%) │ 18/18    │
    │ VM                       │ 5        │ 5 (100%)  │ 5/5      │
    ├──────────────────────────┼──────────┼───────────┼──────────┤
    │ TOTAL                    │ 38       │ 38 (100%) │ 38/38    │
    └──────────────────────────┴──────────┴───────────┴──────────┘
  
  Avg. actual RTO vs. target:
    Database: 25 minutes (target: <60 min) ✓
    VM: 45 minutes (target: <120 min) ✓
    File: 8 minutes (target: <15 min) ✓

BACKUP VERIFICATION:
  Automated verification (every backup):
    1. Backup job completion (success/fail alert)
    2. Backup size validation (expected range — anomaly alert)
    3. Checksum verification (data integrity)
    4. Backup catalog update (inventory)
    5. Synthetic full (weekly — consolidate incrementals)
  
  Manual verification (monthly):
    1. Sample restore (random selection)
    2. Data validation (application-level test)
    3. Report generation (Veeam report + custom dashboard)
  
  Verification results (January 2025):
    Automated: 1,825/1,825 passed (100%)
    Manual: 18/18 passed (100%)
    Failures: 0

RESTORE REQUEST PROCESS:
  User-initiated restore:
    1. Submit request (ServiceNow ticket — IT portal)
    2. Auto-approve (file-level, <100 MB)
    3. Manual approve (large restore, database)
    4. Execute (automated or manual)
    5. Validate (user confirmation)
    6. Close (ticket + documentation)
  
  Emergency restore (disaster):
    1. Declare incident (P1 — backup team lead)
    2. Activate DR plan (runbook)
    3. Execute restore (parallel, prioritized)
    4. Validate (application + data integrity)
    5. Failback (when production restored)
    6. Post-incident review (RCA + improvement)
  
  Restore statistics (January 2025):
    User restores: 42 (file-level — avg. 1.4/day)
    Database restores: 3 (point-in-time — troubleshooting)
    VM restores: 1 (testing — not production)
    Emergency restores: 0 (no disaster)
    Average restore time: 12 minutes (file), 25 minutes (DB)
```
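
Steps 3 and 4 of the testing methodology (validate integrity, then compare the measured recovery time with the target) can be sketched as below. This is an illustrative Python outline, not the tooling used to produce the figures above: `evaluate_restore`, `RestoreTestResult`, and `sha256_hex` are hypothetical names, SHA-256 is an assumed checksum choice, and the targets are copied from the recovery objectives table.

```python
import hashlib
from dataclasses import dataclass

# RTO targets in minutes, taken from the recovery objectives table above.
RTO_TARGET_MIN = {"file": 15, "database": 60, "vm": 120}


def sha256_hex(data: bytes) -> str:
    """Checksum used to compare source data with restored data."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class RestoreTestResult:
    asset_type: str
    integrity_ok: bool  # restored bytes match the source checksum
    rto_met: bool       # measured recovery time is under the target
    elapsed_min: float

    @property
    def passed(self) -> bool:
        return self.integrity_ok and self.rto_met


def evaluate_restore(asset_type: str, source_checksum: str,
                     restored: bytes, elapsed_min: float) -> RestoreTestResult:
    """Score one restore test: integrity check plus RTO comparison."""
    return RestoreTestResult(
        asset_type=asset_type,
        integrity_ok=sha256_hex(restored) == source_checksum,
        rto_met=elapsed_min < RTO_TARGET_MIN[asset_type],
        elapsed_min=elapsed_min,
    )


original = b"orders table snapshot"  # stand-in for real backup contents
result = evaluate_restore("database", sha256_hex(original), original, elapsed_min=25)
print(result.passed)
```

Keeping `integrity_ok` and `rto_met` separate matches the results table, which reports "Passed" and "RTO met" as distinct columns.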

## Disaster Recovery

### DR Planning & Execution

```
DISASTER RECOVERY PLAN:
════════════════════════

DR STRATEGY: Pilot Light + Warm Standby
  Pilot Light (most systems):
    - Minimal infrastructure running (standby)
    - Database replication active (continuous)
    - Scale up on disaster (auto — Terraform)
    - RTO: <2 hours (scale up + validation)
  
  Warm Standby (critical systems):
    - Full infrastructure running (standby — reduced size)
    - Database replication active (real-time)
    - Scale up + traffic switch (auto)
    - RTO: <30 minutes (traffic switch + validation)
    - Systems: Customer API, authentication, payment

DR SITE:
  Primary: AWS us-east-1 (N. Virginia)
  DR site: AWS us-west-2 (Oregon)
  Cross-cloud DR: Azure eastus (secondary DR — critical data)
  
  DR site readiness:
    Infrastructure: Pre-provisioned (pilot light)
    Data: Replicated (real-time — critical, daily — standard)
    Testing: Semi-annual (full failover; last: November 2024)
    Next test: May 2025 (scheduled)

DR ACTIVATION CRITERIA:
  Automatic activation:
    - Region-wide outage (confirmed by the AWS Health Dashboard)
    - Data center loss (physical — natural disaster)
    - Cyberattack (ransomware, destructive — immutable backup)
  
  Manual activation:
    - Extended outage (>2 hours, no resolution ETA)
    - Data corruption (widespread, unrecoverable in primary)
    - Compliance/regulatory (data breach, legal requirement)
  
  Activation authority:
    Technical: IT Director (regional or extended outage)
    Business: CTO (cyberattack, data corruption)
    Executive: CEO (full company-wide DR)

DR RUNBOOK:
  Phase 1: Assessment (0-15 minutes)
    1. Confirm disaster (monitoring + provider status)
    2. Classify severity (regional vs. localized)
    3. Notify stakeholders (IT, management, customer comms)
    4. Activate DR team (on-call + management)
  
  Phase 2: Activation (15-60 minutes)
    5. Scale DR infrastructure (auto — Terraform apply)
    6. Verify data replication (consistency check)
    7. DNS failover (Route53 health check → DR endpoint)
    8. Verify application health (smoke tests)
  
  Phase 3: Operations (1-24 hours)
    9. Monitor DR environment (performance, errors)
    10. Customer communication (status page, email)
    11. Business operations (continued on DR site)
    12. Data sync (continue replication, if primary partially available)
  
  Phase 4: Failback (24-72 hours, when primary restored)
    13. Validate primary (full health check)
    14. Data sync (DR → primary — final replication)
    15. DNS failback (traffic → primary)
    16. DR scale-down (cost reduction)
    17. Post-incident review (RCA + improvement)

BUSINESS CONTINUITY PLAN (BCP):
  Business impact analysis (BIA):
    ┌──────────────────────────┬──────────┬──────────┬───────────────┐
    │ Business Function        │ Max Down │ Priority │ DR Strategy   │
    │                          │ Time     │          │               │
    ├──────────────────────────┼──────────┼──────────┼───────────────┤
    │ Customer API             │ 30 min   │ Critical │ Warm standby  │
    │ Payment processing       │ 30 min   │ Critical │ Warm standby  │
    │ Authentication (SSO)     │ 1 hour   │ Critical │ Warm standby  │
    │ Internal web app         │ 4 hours  │ High     │ Pilot light   │
    │ Email (M365)             │ 4 hours  │ High     │ Cloud native  │
    │ CRM (Salesforce)         │ 8 hours  │ Medium   │ SaaS (vendor) │
    │ ERP (NetSuite)           │ 8 hours  │ Medium   │ SaaS (vendor) │
    │ HR system (Rippling)     │ 24 hrs   │ Low      │ SaaS (vendor) │
    ├──────────────────────────┼──────────┼──────────┼───────────────┤
    │ Communications           │ 1 hour   │ Critical │ Multi-channel │
    └──────────────────────────┴──────────┴──────────┴───────────────┘

  Communication plan:
    Internal: Teams alert + email (every 30 min during incident)
    External: Status page + customer email (every 1 hour)
    Executive: Phone call + Teams (immediate + every 2 hours)
    Regulatory: As required (e.g., GDPR breach notification within 72 hours)
  
  Key contacts:
    DR team: 12 members (IT, security, comms, legal)
    Vendor contacts: 8 (cloud, SaaS, hardware)
    Executive team: 5 (CEO, CTO, CFO, CISO, COO)
    Regulatory: 3 (legal counsel, compliance, data protection officer)

DR TESTING RESULTS (November 2024):
  Full failover test (us-east-1 → us-west-2):
    Infrastructure activation: 22 minutes (target: <60 min) ✓
    Data consistency: 100% (verified) ✓
    DNS failover: 5 minutes (target: <10 min) ✓
    Application health: 100% (smoke tests passed) ✓
    Total RTO: 45 minutes (target: <2 hours) ✓
    Achieved RPO: 30 seconds (target: <5 min) ✓
    Failback: 2 hours (target: <4 hours) ✓
    
    Findings:
      - All objectives met
      - 2 minor issues (documentation updates needed)
      - Team response: Excellent (coordinated, timely)
      - Improvement: Automate DNS failover (manual step → auto)
```
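
The activation criteria above can be read as a small decision table. The Python sketch below is one possible encoding, offered as an illustration only: the `Trigger`, `Decision`, and `decide` names are hypothetical, and the authority mapping covers only the pairings the plan states explicitly (other triggers return `None` rather than a guessed owner). For manual triggers, `activate=True` means the criteria are met and approval can be sought, not that failover starts on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trigger(Enum):
    REGION_OUTAGE = "region-wide outage"
    DATA_CENTER_LOSS = "data center loss"
    CYBERATTACK = "cyberattack"
    EXTENDED_OUTAGE = "extended outage"
    DATA_CORRUPTION = "data corruption"
    REGULATORY = "compliance/regulatory"


AUTOMATIC = {Trigger.REGION_OUTAGE, Trigger.DATA_CENTER_LOSS, Trigger.CYBERATTACK}

# Only the pairings the plan names explicitly; other triggers map to None.
AUTHORITY = {
    Trigger.REGION_OUTAGE: "IT Director",
    Trigger.EXTENDED_OUTAGE: "IT Director",
    Trigger.CYBERATTACK: "CTO",
    Trigger.DATA_CORRUPTION: "CTO",
}


@dataclass(frozen=True)
class Decision:
    activate: bool
    mode: str                 # "automatic" or "manual"
    authority: Optional[str]  # None when the plan does not name one


def decide(trigger: Trigger, outage_hours: float = 0.0,
           eta_known: bool = True) -> Decision:
    """Map a trigger to an activation decision per the criteria above."""
    mode = "automatic" if trigger in AUTOMATIC else "manual"
    activate = True
    if trigger is Trigger.EXTENDED_OUTAGE:
        # Qualifies only past 2 hours with no resolution ETA.
        activate = outage_hours > 2 and not eta_known
    return Decision(activate, mode, AUTHORITY.get(trigger))


print(decide(Trigger.EXTENDED_OUTAGE, outage_hours=3, eta_known=False))
```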

## Output

### Backup & DR Dashboard

```
BACKUP & DR DASHBOARD — Jan 2025
═══════════════════════════════

Backup Operations:
  Total backup jobs: 185 (scheduled)
  Success rate: 99.6% (target: >99%) ✓
  Daily backup data: 4.5 TB (deduplicated: 1.2 TB)
  Backup storage: 80 TB (on-prem: 20 TB, cloud: 60 TB)
  Monthly cost: ~$800 (cloud storage)

Recovery Testing:
  File restores: 15/15 passed (100%)
  Database restores: 18/18 passed (100%)
  VM restores: 5/5 passed (100%)
  Avg. RTO: File 8 min, DB 25 min, VM 45 min
  Next full test: May 2025

Disaster Recovery:
  Strategy: Pilot light + warm standby (critical)
  DR site: AWS us-west-2 + Azure eastus
  Last DR test: November 2024 (all passed)
  RTO achieved: 45 min (target: <2 hrs) ✓
  RPO achieved: 30 sec (target: <5 min) ✓

Protection:
  3-2-1-1-0 rule: Compliant ✓
  Immutable backups: 100% coverage (30 days WORM)
  Air-gapped: Quarterly (tape — offline)
  Cross-region: 100% (critical: real-time, standard: daily)
  Ransomware test: Q4 2024 (passed)

Compliance:
  Retention: 7 years (compliance data)
  Verification: 100% (automated + manual)
  Audit evidence: Complete (SOC 2, ISO 27001)
  Findings: 0

Actions:
  1. DR test (May 2025 — scheduled)
  2. Ransomware simulation (Q1 — February)
  3. DNS failover automation (improvement from last test)
  4. Backup retention review (annual — March)
  5. BIA update (annual — April)
```
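
Dashboard rows of the form "value (target: ...) ✓" can be generated from raw numbers instead of being typed by hand. The following Python sketch shows one way to do that; `success_rate` and `status_line` are hypothetical helpers, and the run counts in the example are invented solely to reproduce the 99.6% figure shown above.

```python
def success_rate(succeeded: int, total: int) -> float:
    """Percentage of successful job runs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * succeeded / total


def status_line(label: str, value: float, target: float, unit: str = "%",
                higher_is_better: bool = True) -> str:
    """Render 'label: value (target: op target) mark' like the dashboard rows."""
    met = value > target if higher_is_better else value < target
    op = ">" if higher_is_better else "<"
    mark = "✓" if met else "✗"
    return f"{label}: {value:g}{unit} (target: {op}{target:g}{unit}) {mark}"


# Hypothetical run counts chosen only to reproduce the 99.6% figure.
rate = round(success_rate(succeeded=249, total=250), 1)
print(status_line("Success rate", rate, 99))
print(status_line("RTO achieved", 45, 120, unit=" min", higher_is_better=False))
```

The `higher_is_better` flag covers both kinds of target on the dashboard: rates that must stay above a floor and recovery times that must stay below a ceiling.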

## Integration Points

- Backup platforms (Veeam, AWS Backup, Azure Backup): Orchestration, scheduling
- Cloud providers (AWS, Azure, GCP): Storage, replication, cross-region
- Database engines (PostgreSQL, MySQL, MongoDB): Native backup, PITR
- Storage (AWS S3, Azure Blob, on-prem NAS): Backup target, lifecycle
- Immutable storage (S3 Object Lock, Azure Immutable Blob): Ransomware protection
- Orchestration (Terraform, Ansible): DR infrastructure provisioning
- DNS (Route53, Azure DNS): Failover routing
- Monitoring (Datadog, Veeam Monitor): Backup health, alerting
- Communication (Teams, Slack, email): DR alerts, status updates
- ITSM (ServiceNow): Restore requests, DR incident tracking
- CMDB: Asset inventory, recovery priority
- Compliance (Vanta, Drata): Audit evidence, retention policy

## Edge Cases

- **Ransomware attack**: Immutable backup verification; air-gapped restore; scope assessment; incident response
- **Region-wide outage**: DR activation; DNS failover; data consistency; customer communication
- **Data corruption (undetected)**: Backup verification failure; older backup restore; root cause; prevention
- **Backup window overrun**: Job prioritization; window extension; performance optimization; deduplication
- **Backup storage exhaustion**: Lifecycle enforcement; retention review; capacity expansion; cost
- **Cross-region replication failure**: Alternative replication path; manual sync; DR readiness check
- **DR test failure**: Root cause; remediation; retest; runbook update
- **Compliance retention conflict**: Legal hold override; exception process; audit documentation
- **Cloud provider backup outage**: Alternative backup method; manual backup; vendor escalation
- **Backup credential rotation**: Automated update; verification; failover credential; monitoring
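
For the backup window overrun case, job prioritization can be sketched as a greedy plan: run jobs in priority order while they still fit in the window and defer the rest. The Python example below is a simplified illustration that assumes jobs run sequentially; the `Job` model, the `plan_window` helper, and the job names and runtimes are hypothetical, while 360 minutes corresponds to the 6-hour daily backup window from the Backup Strategy section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    priority: int     # 1 = most critical
    est_minutes: int  # expected runtime


def plan_window(jobs: list[Job], window_minutes: int) -> tuple[list[Job], list[Job]]:
    """Greedy plan: take jobs in priority order while they fit; defer the rest."""
    scheduled: list[Job] = []
    deferred: list[Job] = []
    remaining = window_minutes
    for job in sorted(jobs, key=lambda j: (j.priority, j.est_minutes)):
        if job.est_minutes <= remaining:
            scheduled.append(job)
            remaining -= job.est_minutes
        else:
            deferred.append(job)
    return scheduled, deferred


# Hypothetical jobs and runtimes for a 360-minute window.
jobs = [
    Job("file-servers", priority=3, est_minutes=90),
    Job("databases", priority=1, est_minutes=120),
    Job("vm-images", priority=2, est_minutes=200),
    Job("endpoints", priority=3, est_minutes=30),
]
scheduled, deferred = plan_window(jobs, window_minutes=360)
print([j.name for j in scheduled], [j.name for j in deferred])
```

Deferred jobs are the candidates for a window extension or for the performance and deduplication work listed in that edge case.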
