Backup & Disaster Recovery
Design, implement, and test backup and disaster recovery (DR) strategies including RPO/RTO targets, backup automation, replication, failover testing, and business continuity planning.
Workflow
1. DR Strategy & Classification
DISASTER RECOVERY STRATEGY
═══════════════════════════════════════
Application Classification:
═══════════════════════════════════════
Tier  RPO          RTO          Strategy              Applications
───────────────────────────────────────────────────────────────────────
1     ≤ 1 minute   ≤ 5 minutes  Active-Active         Payment processing, auth
2     ≤ 1 hour     ≤ 1 hour     Active-Passive (Hot)  Customer database, email
3     ≤ 4 hours    ≤ 4 hours    Warm Standby          Internal tools, analytics
4     ≤ 24 hours   ≤ 24 hours   Cold Standby          Archive, development
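The classification above can be applied mechanically. A minimal sketch, assuming an application states the worst RPO and RTO it can tolerate and should land in the cheapest tier that still meets both (the function name `classify` is illustrative, not part of any tool):

```python
from datetime import timedelta

# Thresholds copied from the classification table:
# (tier, guaranteed RPO, guaranteed RTO, strategy)
TIERS = [
    (1, timedelta(minutes=1), timedelta(minutes=5), "Active-Active"),
    (2, timedelta(hours=1), timedelta(hours=1), "Active-Passive (Hot)"),
    (3, timedelta(hours=4), timedelta(hours=4), "Warm Standby"),
    (4, timedelta(hours=24), timedelta(hours=24), "Cold Standby"),
]


def classify(required_rpo, required_rto):
    """Return (tier, strategy) for the cheapest tier meeting both requirements."""
    # Walk from the cheapest tier (4) towards the most expensive (1).
    for tier, tier_rpo, tier_rto, strategy in reversed(TIERS):
        if tier_rpo <= required_rpo and tier_rto <= required_rto:
            return tier, strategy
    raise ValueError("requirements are stricter than Tier 1 provides")
```

For example, an application that tolerates 2 hours of data loss and 6 hours of downtime does not fit Tier 3 (4-hour RPO), so `classify(timedelta(hours=2), timedelta(hours=6))` selects Tier 2.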
DR SITES:
═══════════════════════════════════════
Primary Site: us-east-1 (Virginia)
→ All Tier 1-4 applications
→ 2,500 VMs, 150 cloud instances
→ Primary data center
DR Site: us-west-2 (Oregon)
→ Tier 1: Active-Active; Tier 2: Hot standby
→ Tier 3: Warm standby; Tier 4: Cold standby (restore from backup)
→ Connected via Direct Connect (1 Gbps)
Distance: 2,130 miles (seismic separation)
Network latency: 45ms
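The 1 Gbps link bounds how quickly data can be seeded or re-synchronized to the DR site. A rough sketch of that arithmetic, assuming decimal terabytes and a sustained utilization of 70% (the utilization figure is an assumption, not a measured value):

```python
def transfer_hours(data_tb, link_gbps=1.0, utilization=0.7):
    """Hours needed to move data_tb terabytes (10**12 bytes each) over the link."""
    bits_to_move = data_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * utilization
    return bits_to_move / usable_bits_per_second / 3600
```

At full line rate, 1 TB takes 8,000 seconds (about 2.2 hours); at the assumed 70% utilization, `transfer_hours(10)` comes to roughly 32 hours, which matters when planning an initial full copy or a failback.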
ACTIVE-ACTIVE (Tier 1):
═══════════════════════════════════════
→ Database: Cross-region replication (read replicas)
→ Application: Global load balancer (Route 53)
→ Failover: Automatic (health check failure)
→ RTO: ≤ 5 minutes (health-check detection plus DNS TTL expiry)
→ RPO: ≤ 1 minute (async replication lag)
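Automatic failover on health-check failure usually means "N consecutive failed checks", and the time clients take to reach the DR site also includes DNS caching. A minimal sketch of both ideas; the class, the threshold of 3, and the 30-second interval in the usage note are assumptions for illustration, not values taken from this environment:

```python
from dataclasses import dataclass


def estimated_failover_seconds(check_interval_s, failure_threshold, dns_ttl_s):
    """Rough estimate: detection time plus one DNS TTL of client-side caching."""
    return check_interval_s * failure_threshold + dns_ttl_s


@dataclass
class FailoverMonitor:
    failure_threshold: int = 3          # assumed consecutive failures required
    active: str = "us-east-1"
    standby: str = "us-west-2"
    consecutive_failures: int = 0

    def record(self, healthy):
        """Record one health-check result; return the region that should serve."""
        if healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Swap roles so traffic moves to the former standby.
            self.active, self.standby = self.standby, self.active
            self.consecutive_failures = 0
        return self.active
```

With a 30-second check interval, a threshold of 3, and a 60-second TTL, `estimated_failover_seconds(30, 3, 60)` gives 150 seconds, inside the 5-minute Tier 1 RTO. Resolvers that ignore TTLs can stretch this in practice.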
ACTIVE-PASSIVE (Tier 2):
═══════════════════════════════════════
→ Database: Cross-region replication (standby)
→ Application: Pre-provisioned, idle
→ Failover: Semi-automatic (1-click)
→ RTO: ≤ 1 hour (standby warm-up + DNS cutover)
→ RPO: ≤ 1 hour (replication lag)
2. Backup Strategy
BACKUP STRATEGY
═══════════════════════════════════════
Backup Types:
═══════════════════════════════════════
Type                       Schedule    Retention  Scope                Restore Characteristics
─────────────────────────────────────────────────────────────────────────────────────────────────────────
Full                       Weekly      30 days    Everything           Base for every restore
Incremental                Daily       7 days     Changes since last   Small backups; restore needs full + every incremental
Differential               Daily       7 days     Changes since full   Larger backups; restore needs full + latest differential
Snapshot                   Hourly      24 hours   Block-level          Fast rollback
CDC (Change Data Capture)  Continuous  7 days     DB changes           Point-in-time
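The difference between incremental and differential backups shows up at restore time. A minimal sketch of building the restore chain, assuming a given system uses either incrementals or differentials after each full, not both (the `Backup` type and `restore_chain` function are hypothetical names):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str           # "full", "incremental", or "differential"
    taken_at: datetime


def restore_chain(backups, target):
    """Backups to apply, in order, to reach the latest state at or before target."""
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    later = [b for b in usable if b.taken_at > base.taken_at]
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        # A differential holds everything since the full: only the newest is needed.
        return [base, differentials[-1]]
    # Each incremental holds changes since the previous backup: all are needed.
    return [base] + [b for b in later if b.kind == "incremental"]
```

With a Sunday full and three daily incrementals, the chain has four entries; with three daily differentials instead, it has two.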
BACKUP SCHEDULE:
═══════════════════════════════════════
Sunday      02:00   Full backup (all systems)
Mon-Sat     02:00   Incremental backup
Every 15m   ---     Snapshot (Tier 1 systems)
Every 1h    ---     Snapshot (critical systems)
Continuous  ---     CDC (databases)
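The schedule can be expressed as a small function that a scheduler might call once per minute. This is a sketch only: the `critical` flag is an assumption, since the schedule does not say which systems count as critical, and CDC is omitted because it runs continuously instead of on a timer.

```python
from datetime import datetime


def jobs_due(now, tier, critical=False):
    """Backup jobs that start at the minute `now` for one system."""
    jobs = []
    if now.hour == 2 and now.minute == 0:
        # datetime.weekday(): Monday is 0, Sunday is 6.
        jobs.append("full" if now.weekday() == 6 else "incremental")
    if tier == 1:
        if now.minute % 15 == 0:
            jobs.append("snapshot")   # Tier 1: every 15 minutes
    elif critical and now.minute == 0:
        jobs.append("snapshot")       # other critical systems: hourly
    return jobs
```

For instance, `jobs_due(datetime(2024, 1, 7, 2, 0), tier=3)` (a Sunday) returns only the full backup, while a Tier 1 system at 02:00 on a Monday gets both the incremental and a snapshot.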
BACKUP SCOPE:
═══════════════════════════════════════
Systems:
→ Databases: Full + CDC + snapshot
→ File servers: Full + incremental
→ VMs: Snapshot + full weekly
→ Cloud: Automated (EBS, S3 versioning, RDS)
→ Configs: Git repository + snapshot
→ Backup repositories: Replicated offsite (a second copy of the backups themselves)
Storage:
→ Primary: On-site NAS (fast restore)
→ Secondary: Cloud storage (S3/GCS, encrypted)
→ Archive: Glacier/Archive tier (long-term, 3-7 years)
→ Offsite: Different region (3-2-1 rule)
3-2-1 RULE:
═══════════════════════════════════════
3 copies of data (primary + 2 backups)
2 different storage media (local + cloud)
1 offsite copy (different region)
Plus: Automated, encrypted, tested
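The 3-2-1 rule is easy to check automatically from an inventory of copies. A minimal sketch, assuming each copy is described by its storage medium and location (the `DataCopy` type and field values are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    medium: str     # e.g. "disk", "nas", "object-storage"
    location: str   # e.g. "us-east-1"


def check_321(copies, primary_location):
    """Evaluate each part of the 3-2-1 rule for one dataset."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.medium for c in copies}) >= 2,
        "one_offsite": any(c.location != primary_location for c in copies),
    }
```

A dataset with a primary disk copy, an on-site NAS backup, and an object-storage copy in another region passes all three checks; drop the remote copy and both `three_copies` and `one_offsite` fail.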
3. Backup Automation
BACKUP AUTOMATION
═══════════════════════════════════════
Cloud Backups (AWS):
═══════════════════════════════════════
EBS Snapshots:
→ AWS Backup (centralized policy)
→ Rule: Daily (EBS snapshots are incremental after the first one)
→ Retention: 35 days
→ Encryption: KMS (customer-managed key)
→ Cross-region: Copy to us-west-2
RDS:
→ Automated backups: 35-day retention
→ Point-in-time recovery: Enabled
→ Snapshots: Manual (before changes)
→ Cross-region: Snapshot copy
S3:
→ Versioning: Enabled
→ Lifecycle: Transition to Glacier after 90 days
→ MFA delete: Enabled (critical buckets)
→ Replication: Cross-region (CRR)
EC2:
→ AMI: Weekly (gold image)
→ EBS: Automated snapshots
→ Config: Packer (reproducible)
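The EBS policy above (daily, 35-day retention, copy to us-west-2) maps onto an AWS Backup plan document. The sketch below only builds that document as a plain dictionary; the plan name, rule name, 02:00 UTC schedule, and vault ARN are placeholders chosen for illustration, and KMS encryption is not shown.

```python
def build_backup_plan(dr_vault_arn):
    """Build a BackupPlan document mirroring the EBS policy listed above."""
    return {
        "BackupPlanName": "ebs-daily",              # hypothetical name
        "Rules": [
            {
                "RuleName": "daily-35d",            # hypothetical name
                "TargetBackupVaultName": "Default",
                # AWS cron syntax, evaluated in UTC: every day at 02:00.
                "ScheduleExpression": "cron(0 2 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        # Cross-region copy into a vault in the DR region.
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }


# Placeholder account ID and vault name.
plan = build_backup_plan(
    "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault"
)
# With boto3 installed and credentials configured, the plan could be created via:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Keeping the plan as data makes it easy to review in version control before anything is applied to an account.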
On-Premises Backups:
═══════════════════════════════════════
Tool: Veeam / Commvault / Rubrik
→ VM backup: Nightly (incremental forever)
→ Full backup: Weekly (Sunday)
→ Synthetic full: Weekly (no production impact)
→ Offsite replication: Cloud (AWS S3)
→ Immutable backups plus air-gapped copies (ransomware protection)
BACKUP MONITORING:
═══════════════════════════════════════
→ Daily: Backup success report (email)
→ Alert: Failed backup (immediate, PagerDuty)
→ Weekly: Backup health dashboard
→ Monthly: Restore test (automated validation)
Current backup success rate: 99.2% (target: 99.5%)
Failed backups this month: 3 (investigating)
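The success-rate check behind the daily report is simple arithmetic. A minimal sketch; the total of 375 jobs in the usage note is an inference, not a reported figure: it is the job count at which 3 failures yield 99.2%, assuming both numbers cover the same month.

```python
def success_rate(total_jobs, failed_jobs):
    """Backup success rate as a percentage."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return 100.0 * (total_jobs - failed_jobs) / total_jobs


def below_target(total_jobs, failed_jobs, target_pct=99.5):
    """True when the success rate misses the target and should be flagged."""
    return success_rate(total_jobs, failed_jobs) < target_pct
```

With 375 jobs and 3 failures the rate is 99.2%, so `below_target(375, 3)` is true. Meeting the 99.5% target at that volume would allow at most 1 failed job.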
4. Failover Testing
FAILOVER TESTING
═══════════════════════════════════════
Test Schedule:
═══════════════════════════════════════
Test Type      Frequency  Duration  Scope          Impact
───────────────────────────────────────────────────────────────────────
Tabletop       Quarterly  2 hours   Planning       None
Component      Monthly    1 hour    Single system  Isolated
Partial        Quarterly  4 hours   Tier 2-3       Limited
Full DR Drill  Annually   8 hours   All tiers      Planned
TEST PROCEDURES:
═══════════════════════════════════════
Full DR Drill (Annual):
═══════════════════════════════════════
Pre-Test (1 week before):
→ Notify stakeholders
→ Schedule maintenance window
→ Document pre-test state
→ Prepare rollback plan
Test Execution (day of):
08:00 Declare DR event (simulated)
08:05 Activate DR team
08:15 Begin failover (Tier 1)
08:30 Verify Tier 1 apps (payment, auth)
09:00 Begin failover (Tier 2)
09:30 Verify Tier 2 apps (database, email)
10:00 Begin failover (Tier 3)
10:30 Run application tests
11:00 Verify RPO/RTO met
11:30 Begin failback
13:00 Complete failback
13:30 Verify primary site
14:00 Declare test complete
Post-Test (1 week after):
→ Document results (RTO/RPO achieved vs target)
→ Identify gaps
→ Update DR runbook
→ Lessons learned meeting
TEST RESULTS (Last Drill):
═══════════════════════════════════════
Tier  Target RTO  Actual RTO  Target RPO  Actual RPO  Status
────────────────────────────────────────────────────────────────────────
1     5 minutes   4 minutes   1 minute    30 seconds  ✓ Pass
2     1 hour      45 minutes  1 hour      15 minutes  ✓ Pass
3     4 hours     3 hours     4 hours     2 hours     ✓ Pass
4     24 hours    12 hours    24 hours    8 hours     ✓ Pass
Issues found: 3
→ DNS propagation slower than expected (mitigated; further optimization still open)
→ Application config missing in DR (fixed)
→ Network ACL blocking DR traffic (updated)
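Pass/fail alone hides how close each tier came to its limit. A small sketch that recomputes the status column from the drill results and adds the remaining headroom; the `headroom_pct` metric is an addition for illustration, not part of the original report.

```python
from datetime import timedelta as td

# (target, actual) pairs copied from the drill results table.
DRILL = {
    1: {"RTO": (td(minutes=5), td(minutes=4)), "RPO": (td(minutes=1), td(seconds=30))},
    2: {"RTO": (td(hours=1), td(minutes=45)), "RPO": (td(hours=1), td(minutes=15))},
    3: {"RTO": (td(hours=4), td(hours=3)), "RPO": (td(hours=4), td(hours=2))},
    4: {"RTO": (td(hours=24), td(hours=12)), "RPO": (td(hours=24), td(hours=8))},
}


def headroom_pct(target, actual):
    """Share of the target left unused, in percent (negative means a miss)."""
    return 100.0 * (target - actual) / target


def evaluate(drill):
    """Per tier and metric: whether the target was met and with how much headroom."""
    return {
        tier: {
            name: {
                "pass": actual <= target,
                "headroom_pct": round(headroom_pct(target, actual), 1),
            }
            for name, (target, actual) in metrics.items()
        }
        for tier, metrics in drill.items()
    }
```

On these numbers every tier passes, but Tier 1 RTO has only 20% headroom (4 of 5 minutes used), which is the natural place to focus before the next drill.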
5. Business Continuity Planning
BUSINESS CONTINUITY PLAN (BCP)
═══════════════════════════════════════
BCP Components:
═══════════════════════════════════════
1. Business Impact Analysis (BIA):
→ Critical business functions identified
→ MTPD (Maximum Tolerable Period of Disruption) defined
→ Resource dependencies mapped
→ Financial impact assessed
2. Recovery Strategies:
→ People: Remote work, alternate workspace
→ Technology: DR site, cloud failover
→ Processes: Manual workarounds
→ Suppliers: Alternate vendors
→ Facilities: Alternate office
3. Communication Plan:
→ Internal: Employees, management
→ External: Customers, partners, media
→ Regulatory: Authorities and regulators (required notifications)
→ Status page: Public updates
4. Activation Criteria:
→ Who can declare a disaster (CISO, CIO, CEO)
→ Escalation thresholds
→ Decision tree for response level
BCP CONTACT TREE:
═══════════════════════════════════════
Level 1: Incident Commander (on-call)
→ Level 2: CISO, CIO
→ Level 3: Department Heads
→ Level 4: Team Leads
→ Level 5: All Staff
Communication Methods (redundant):
→ Primary: Phone calls
→ Secondary: SMS (mass notification)
→ Tertiary: Email
→ Quaternary: Social media / status page
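Redundant communication methods amount to "try each channel in priority order until one works". A minimal sketch, assuming each channel is a callable that returns True on confirmed delivery; the channel functions themselves are hypothetical and would wrap whatever telephony, SMS, or email service is in use.

```python
def notify(contact, message, channels):
    """Try channels in priority order; return the name of the first that succeeds.

    `channels` is a list of (name, send) pairs, where send(contact, message)
    returns True on confirmed delivery. Returns None if every channel fails.
    """
    for name, send in channels:
        try:
            if send(contact, message):
                return name
        except Exception:
            # A broken channel must not stop the fallback sequence.
            continue
    return None
```

A `None` result is the signal to escalate to the next level of the contact tree instead of retrying the same person indefinitely.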
Edge Cases
- Ransomware: Immutable backups, air-gapped copies
- Region outage: Multi-region strategy, DNS failover
- Data corruption: Point-in-time recovery, validation
- Partial failure: Component-level recovery
- Long-term outage: Extended BCP (weeks/months)
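For the data-corruption case, validation typically means comparing a restored file against a checksum recorded at backup time. A minimal sketch using SHA-256, assuming the expected digest was stored alongside the backup (where and how it is stored is outside this example):

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_is_valid(restored_path, expected_sha256):
    """True when the restored file matches the digest recorded at backup time."""
    return sha256_of(restored_path) == expected_sha256.lower()
```

A matching digest shows the restore is bit-identical to what was backed up; it does not show the data was uncorrupted before the backup ran, which is why point-in-time recovery to an earlier state is listed alongside validation.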
Integration Points
- Backup tools: Veeam, Commvault, Rubrik, AWS Backup
- Cloud: AWS, Azure, GCP (native backup)
- Monitoring: Nagios, Zabbix, CloudWatch
- Communication: PagerDuty, Opsgenie, Slack
- DR orchestration: Azure Site Recovery, AWS DRS, Zerto
- Status page: Atlassian Statuspage, Cachet
Output
Backup & DR Status
BACKUP & DR STATUS — Q4 2024
═══════════════════════════════════════
Backup success rate: 99.2% (target: 99.5%)
Systems backed up: 98% of in-scope
Ransomware protection: Immutable backups ✓
Last full DR drill: Q4 2024 (all tiers passed)
Next DR drill: Q1 2025
RPO achieved: All within target
RTO achieved: All within target
Open issues: 1 (DNS propagation optimization)