Backup & Disaster Recovery

Design, implement, and test backup and disaster recovery (DR) strategies, including RPO/RTO targets, backup automation, replication, failover testing, and business continuity planning.

Workflow

1. DR Strategy & Classification

DISASTER RECOVERY STRATEGY
═══════════════════════════════════════

Application Classification:
═══════════════════════════════════════

Tier    RPO          RTO           Strategy               Applications
──────────────────────────────────────────────────────────────────────────────
1       ≤ 1 minute   ≤ 5 minutes   Active-Active          Payment processing, auth
2       ≤ 1 hour     ≤ 1 hour      Active-Passive (Hot)   Customer database, email
3       ≤ 4 hours    ≤ 4 hours     Warm Standby           Internal tools, analytics
4       ≤ 24 hours   ≤ 24 hours    Cold Standby           Archive, development
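
The classification table can be read as a lookup: given the data loss (RPO) and downtime (RTO) an application can tolerate, pick the least demanding tier that still meets both. The following is a minimal sketch of that lookup; the function name `classify` and the data layout are illustrative assumptions, not part of any tool referenced here.

```python
from datetime import timedelta

# (tier, max RPO, max RTO, strategy), copied from the classification table.
TIERS = [
    (1, timedelta(minutes=1), timedelta(minutes=5), "Active-Active"),
    (2, timedelta(hours=1), timedelta(hours=1), "Active-Passive (Hot)"),
    (3, timedelta(hours=4), timedelta(hours=4), "Warm Standby"),
    (4, timedelta(hours=24), timedelta(hours=24), "Cold Standby"),
]


def classify(required_rpo: timedelta, required_rto: timedelta) -> tuple[int, str]:
    """Return the least demanding tier whose RPO and RTO both meet the requirement."""
    # Walk from the cheapest tier (4) towards the most expensive (1).
    for tier, rpo, rto, strategy in reversed(TIERS):
        if rpo <= required_rpo and rto <= required_rto:
            return tier, strategy
    raise ValueError("requirement is tighter than Tier 1 targets")


# An application that tolerates 1 hour of data loss and 2 hours of downtime:
print(classify(timedelta(hours=1), timedelta(hours=2)))  # (2, 'Active-Passive (Hot)')
```

Note that an application tolerating only 30 minutes of data loss lands in Tier 1 under this rule, because Tier 2 only promises an RPO of up to 1 hour.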

DR SITES:
═══════════════════════════════════════

Primary Site: us-east-1 (Virginia)
  → All Tier 1-4 applications
  → 2,500 VMs, 150 cloud instances
  → Primary data center

DR Site: us-west-2 (Oregon)
  → Tier 1: Active-active (serving traffic)
  → Tier 2: Hot standby
  → Tier 3: Warm standby
  → Tier 4: Cold standby (restore from backup)
  → Connected via Direct Connect (1 Gbps)

  Distance: 2,130 miles (seismic separation)
  Network latency: 45ms
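
The 1 Gbps link puts a ceiling on how fast data can move between sites, which matters for any tier restored from backup rather than from a live replica. A rough estimate, assuming decimal terabytes and a hypothetical 80% usable-throughput factor (the `efficiency` parameter is an assumption, not a measured value):

```python
def transfer_hours(data_tb: float, link_gbps: float = 1.0, efficiency: float = 0.8) -> float:
    """Hours needed to move data_tb terabytes over a link of link_gbps."""
    bits = data_tb * 1e12 * 8                  # decimal terabytes to bits
    usable_bps = link_gbps * 1e9 * efficiency  # assumed usable share of the link
    return bits / usable_bps / 3600


# 10 TB over the 1 Gbps link at the assumed 80% efficiency:
print(round(transfer_hours(10), 2))  # 27.78
```

Under these assumptions, a 10 TB bulk copy takes longer than the 24-hour Tier 4 RTO, so large datasets would need to be staged at the DR site ahead of time rather than copied during an incident.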

ACTIVE-ACTIVE (Tier 1):
═══════════════════════════════════════

  → Database: Cross-region replication (read replicas)
  → Application: Global load balancer (Route 53)
  → Failover: Automatic (health check failure)
  → RTO: ≤ 5 minutes (health-check detection + DNS TTL expiry)
  → RPO: ≤ 1 minute (async replication lag)

ACTIVE-PASSIVE (Tier 2):
═══════════════════════════════════════

  → Database: Cross-region replication (standby)
  → Application: Pre-provisioned, idle
  → Failover: Semi-automatic (1-click)
  → RTO: ≤ 1 hour (standby warm-up + DNS cutover)
  → RPO: ≤ 1 hour (replication lag)

2. Backup Strategy

BACKUP STRATEGY
═══════════════════════════════════════

Backup Types:
═══════════════════════════════════════

Type                        Schedule     Retention   Scope                       Restore Role
─────────────────────────────────────────────────────────────────────────────────────────────────────
Full                        Weekly       30 days     Everything                  Base for every restore
Incremental                 Daily        7 days      Changes since last backup   Full + every incremental since
Differential                Daily        7 days      Changes since last full     Full + latest differential
Snapshot                    Hourly       24 hours    Block-level                 Fast rollback
CDC (Change Data Capture)   Continuous   7 days      DB changes                  Point-in-time recovery
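
The practical difference between incremental and differential backups shows up at restore time: incrementals must all be replayed on top of the last full, while only the newest differential is needed. The sketch below illustrates this, assuming each system uses one scheme or the other after its full backup; the `Backup` type and `restore_chain` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    day: date
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup], target: date) -> list[Backup]:
    """Return the backups to apply, in order, to restore the state as of target."""
    usable = sorted((b for b in backups if b.day <= target), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target date")
    base = fulls[-1]
    after_base = [b for b in usable if b.day > base.day]
    diffs = [b for b in after_base if b.kind == "differential"]
    if diffs:
        # A differential holds every change since the full: only the newest is needed.
        return [base, diffs[-1]]
    # Each incremental holds changes since the previous backup: all are needed.
    return [base] + [b for b in after_base if b.kind == "incremental"]


full = Backup(date(2024, 12, 1), "full")
incrementals = [Backup(date(2024, 12, d), "incremental") for d in (2, 3, 4)]
print(len(restore_chain([full, *incrementals], date(2024, 12, 4))))  # 4
```

With three daily differentials in place of the incrementals, the same restore needs only two backups: the full and the differential from the target day.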

BACKUP SCHEDULE:
═══════════════════════════════════════

  Sunday      02:00   Full backup (all systems)
  Mon-Sat     02:00   Incremental backup
  Every 1h    ---     Snapshot (critical systems)
  Every 15m   ---     Snapshot (Tier 1 systems)
  Continuous  ---     CDC (databases)
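
The schedule above could be encoded as a pair of small predicates that a scheduler evaluates each minute. This is only a sketch: the function names are hypothetical, and a real deployment would more likely express the same rules as cron entries or backup-tool policies.

```python
from datetime import datetime
from typing import Optional


def nightly_job(ts: datetime) -> Optional[str]:
    """Return the backup type due at ts, or None outside the 02:00 slot."""
    if (ts.hour, ts.minute) != (2, 0):
        return None
    # weekday(): Monday is 0, Sunday is 6.
    return "full" if ts.weekday() == 6 else "incremental"


def snapshot_due(ts: datetime, tier1: bool) -> bool:
    """Tier 1 systems snapshot every 15 minutes, other critical systems hourly."""
    interval_minutes = 15 if tier1 else 60
    return (ts.hour * 60 + ts.minute) % interval_minutes == 0


print(nightly_job(datetime(2024, 12, 1, 2, 0)))  # full (1 December 2024 is a Sunday)
print(nightly_job(datetime(2024, 12, 2, 2, 0)))  # incremental
```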

BACKUP SCOPE:
═══════════════════════════════════════

  Systems:
    → Databases: Full + CDC + snapshot
    → File servers: Full + incremental
    → VMs: Snapshot + full weekly
    → Cloud: Automated (EBS, S3 versioning, RDS)
    → Configs: Git repository + snapshot
    → Backup repositories: Replicated offsite (a backup of the backups)

  Storage:
    → Primary: On-site NAS (fast restore)
    → Secondary: Cloud storage (S3/GCS, encrypted)
    → Archive: Glacier/Archive tier (long-term, 3-7 years)
    → Offsite: Different region (3-2-1 rule)

3-2-1 RULE:
═══════════════════════════════════════

  3 copies of data (primary + 2 backups)
  2 different storage media (local + cloud)
  1 offsite copy (different region)

  Plus: Automated, encrypted, tested
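
The 3-2-1 rule is simple enough to check mechanically against an inventory of data copies. A minimal sketch, assuming a hypothetical `DataCopy` record that tracks the storage medium and location of each copy (the medium labels are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    medium: str  # e.g. "disk", "nas", "cloud-object"
    region: str  # site or region holding the copy


def check_3_2_1(copies: list[DataCopy], primary_region: str) -> list[str]:
    """Return the list of 3-2-1 violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 storage media")
    if not any(c.region != primary_region for c in copies):
        problems.append("no offsite copy")
    return problems


copies = [
    DataCopy("production volume", "disk", "us-east-1"),
    DataCopy("on-site NAS backup", "nas", "us-east-1"),
    DataCopy("cloud backup", "cloud-object", "us-west-2"),
]
print(check_3_2_1(copies, primary_region="us-east-1"))  # []
```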

3. Backup Automation

BACKUP AUTOMATION
═══════════════════════════════════════

Cloud Backups (AWS):
═══════════════════════════════════════

  EBS Snapshots:
    → AWS Backup (centralized policy)
    → Rule: Daily snapshots (EBS snapshots are incremental after the first)
    → Retention: 35 days
    → Encryption: KMS (customer-managed key)
    → Cross-region: Copy to us-west-2

  RDS:
    → Automated backups: 35-day retention
    → Point-in-time recovery: Enabled
    → Snapshots: Manual (before changes)
    → Cross-region: Snapshot copy

  S3:
    → Versioning: Enabled
    → Lifecycle: Transition to Glacier after 90 days
    → MFA delete: Enabled (critical buckets)
    → Replication: Cross-region (CRR)

  EC2:
    → AMI: Weekly (gold image)
    → EBS: Automated snapshots
    → Config: Packer (reproducible)
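
As an illustration of the EBS policy above, the rule set can be written as an AWS Backup plan document: a daily schedule, 35-day retention, and a copy to us-west-2. The vault names, plan name, and account ID below are placeholders, and the 02:00 UTC start time is an assumption borrowed from the backup schedule. Encryption settings are not part of the plan document itself.

```python
import json

# Placeholder ARN: the account ID and vault name are hypothetical.
DR_VAULT_ARN = "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault"

backup_plan = {
    "BackupPlanName": "ebs-daily-35d",  # hypothetical name
    "Rules": [
        {
            "RuleName": "daily-0200-utc",
            "TargetBackupVaultName": "primary-vault",   # hypothetical vault in us-east-1
            "ScheduleExpression": "cron(0 2 * * ? *)",  # every day at 02:00 UTC
            "Lifecycle": {"DeleteAfterDays": 35},
            "CopyActions": [
                {
                    # Cross-region copy with the same retention as the source.
                    "DestinationBackupVaultArn": DR_VAULT_ARN,
                    "Lifecycle": {"DeleteAfterDays": 35},
                }
            ],
        }
    ],
}

print(json.dumps(backup_plan, indent=2))
```

With boto3 installed and credentials configured, a document of this shape could be submitted with `boto3.client("backup").create_backup_plan(BackupPlan=backup_plan)`; resources are then attached to the plan through a separate backup selection.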

On-Premise Backups:
═══════════════════════════════════════

  Tool: Veeam / Commvault / Rubrik
  → VM backup: Nightly (incremental forever)
  → Full backup: Weekly (Sunday)
  → Synthetic full: Weekly (no production impact)
  → Offsite replication: Cloud (AWS S3)
  → Immutable backups: Immutable and air-gapped copies (ransomware protection)

BACKUP MONITORING:
═══════════════════════════════════════

  → Daily: Backup success report (email)
  → Alert: Failed backup (immediate, PagerDuty)
  → Weekly: Backup health dashboard
  → Monthly: Restore test (automated validation)

  Current backup success rate: 99.2% (target: 99.5%)
  Failed backups this month: 3 (investigating)
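
The success rate is the share of backup jobs that completed, and the target translates directly into a failure budget. The sketch below shows the arithmetic; the job count of 375 is a hypothetical figure chosen because 3 failures out of 375 jobs yields 99.2%, and is not taken from the report.

```python
def success_rate(total_jobs: int, failed_jobs: int) -> float:
    """Percentage of backup jobs that succeeded."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return 100.0 * (total_jobs - failed_jobs) / total_jobs


def failure_budget(total_jobs: int, target_pct: float) -> int:
    """Largest number of failed jobs that still meets target_pct."""
    # Small epsilon guards against floating-point results like 4.999999.
    return int(total_jobs * (100.0 - target_pct) / 100.0 + 1e-9)


print(round(success_rate(375, 3), 1))  # 99.2
print(failure_budget(375, 99.5))       # 1
```

At that assumed volume, the 99.5% target leaves room for only one failed job per month, so each of the three failures under investigation matters.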

4. Failover Testing

FAILOVER TESTING
═══════════════════════════════════════

Test Schedule:
═══════════════════════════════════════

Test Type           Frequency    Duration    Scope           Impact
───────────────────────────────────────────────────────────────────────
Tabletop            Quarterly    2 hours     Planning        None
Component           Monthly      1 hour      Single system   Isolated
Partial             Quarterly    4 hours     Tier 2-3        Limited
Full DR Drill       Annually     8 hours     All tiers       Planned

TEST PROCEDURES:
═══════════════════════════════════════

Full DR Drill (Annual):
═══════════════════════════════════════

  Pre-Test (1 week before):
    → Notify stakeholders
    → Schedule maintenance window
    → Document pre-test state
    → Prepare rollback plan

  Test Execution (day of):
    08:00  Declare DR event (simulated)
    08:05  Activate DR team
    08:15  Begin failover (Tier 1)
    08:30  Verify Tier 1 apps (payment, auth)
    09:00  Begin failover (Tier 2)
    09:30  Verify Tier 2 apps (database, email)
    10:00  Begin failover (Tier 3)
    10:30  Run application tests
    11:00  Verify RPO/RTO met
    11:30  Begin failback
    13:00  Complete failback
    13:30  Verify primary site
    14:00  Declare test complete

  Post-Test (1 week after):
    → Document results (RTO/RPO achieved vs target)
    → Identify gaps
    → Update DR runbook
    → Lessons learned meeting

TEST RESULTS (Last Drill):
═══════════════════════════════════════

  Tier    Target RTO    Actual RTO    Target RPO    Actual RPO    Status
  ────────────────────────────────────────────────────────────────────────
  1       5 minutes     4 minutes     1 minute      30 seconds    ✓ Pass
  2       1 hour        45 minutes    1 hour        15 minutes    ✓ Pass
  3       4 hours       3 hours       4 hours       2 hours       ✓ Pass
  4       24 hours      12 hours      24 hours      8 hours       ✓ Pass

  Issues found: 3
    → DNS propagation slower than expected (resolved)
    → Application config missing in DR (fixed)
    → Network ACL blocking DR traffic (updated)
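
The pass/fail column in the results table follows from comparing each measured value with its target. A small sketch of that comparison, using the figures from the last drill; the `evaluate` function and the "headroom" metric (unused share of the RTO budget) are illustrative additions rather than part of the drill report.

```python
from datetime import timedelta

# (tier, target RTO, actual RTO, target RPO, actual RPO) from the last drill.
RESULTS = [
    (1, timedelta(minutes=5), timedelta(minutes=4), timedelta(minutes=1), timedelta(seconds=30)),
    (2, timedelta(hours=1), timedelta(minutes=45), timedelta(hours=1), timedelta(minutes=15)),
    (3, timedelta(hours=4), timedelta(hours=3), timedelta(hours=4), timedelta(hours=2)),
    (4, timedelta(hours=24), timedelta(hours=12), timedelta(hours=24), timedelta(hours=8)),
]


def evaluate(results):
    """Map each tier to (passed, RTO headroom as a fraction of the target)."""
    report = {}
    for tier, rto_target, rto_actual, rpo_target, rpo_actual in results:
        passed = rto_actual <= rto_target and rpo_actual <= rpo_target
        headroom = 1 - rto_actual / rto_target  # timedelta / timedelta yields a float
        report[tier] = (passed, round(headroom, 2))
    return report


for tier, (passed, headroom) in evaluate(RESULTS).items():
    print(tier, "Pass" if passed else "Fail", f"{headroom:.0%} RTO headroom")
```

Tracking headroom alongside pass/fail makes it easier to see which tiers are close to their limit: Tier 1 passed with only 20% of its RTO budget to spare.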

5. Business Continuity Planning

BUSINESS CONTINUITY PLAN (BCP)
═══════════════════════════════════════

BCP Components:
═══════════════════════════════════════

  1. Business Impact Analysis (BIA):
     → Critical business functions identified
     → MTPD (Maximum Tolerable Period of Disruption) defined
     → Resource dependencies mapped
     → Financial impact assessed

  2. Recovery Strategies:
     → People: Remote work, alternate workspace
     → Technology: DR site, cloud failover
     → Processes: Manual workarounds
     → Suppliers: Alternate vendors
     → Facilities: Alternate office

  3. Communication Plan:
     → Internal: Employees, management
     → External: Customers, partners, media
     → Regulatory: Authorities (required notifications)
     → Status page: Public updates

  4. Activation Criteria:
     → Who can declare a disaster (CISO, CIO, CEO)
     → Escalation thresholds
     → Decision tree for response level

BCP CONTACT TREE:
═══════════════════════════════════════

  Level 1: Incident Commander (on-call)
    → Level 2: CISO, CIO
      → Level 3: Department Heads
        → Level 4: Team Leads
          → Level 5: All Staff

  Communication Methods (redundant):
    → Primary: Phone calls
    → Secondary: SMS (mass notification)
    → Tertiary: Email
    → Quaternary: Social media / status page
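
The redundant communication methods amount to a fallback chain: try each channel in priority order until one is acknowledged. A minimal sketch of that logic follows; the sender functions are stubs standing in for real phone, SMS, and email integrations, which are not specified here.

```python
from typing import Callable, Iterable, Optional

# A channel is a name plus a sender that returns True once the message is acknowledged.
Channel = tuple[str, Callable[[str, str], bool]]


def notify(person: str, message: str, channels: Iterable[Channel]) -> Optional[str]:
    """Try each channel in order; return the first acknowledged channel's name."""
    for name, send in channels:
        try:
            if send(person, message):
                return name
        except Exception:
            # A broken channel must not stop the escalation.
            continue
    return None


def phone_down(person: str, message: str) -> bool:
    raise ConnectionError("phone system unavailable")  # simulated outage


def sms_ok(person: str, message: str) -> bool:
    return True  # simulated acknowledgement


channels = [("phone", phone_down), ("sms", sms_ok), ("email", sms_ok)]
print(notify("incident-commander", "DR event declared", channels))  # sms
```

A return value of `None` means no channel reached the person, which would be the cue to escalate to the next level of the contact tree.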

Edge Cases

Integration Points

Output

Backup & DR Status

BACKUP & DR STATUS — Q4 2024
═══════════════════════════════════════

Backup success rate: 99.2% (target: 99.5%)
Systems backed up: 98% of in-scope
Ransomware protection: Immutable backups ✓
Last full DR drill: Q4 2024 (all tiers passed)
Next DR drill: Q1 2025
RPO achieved: All within target
RTO achieved: All within target
Open issues: 1 (DNS propagation optimization)