---
name: data-backup-recovery
description: Design and manage comprehensive data backup and recovery strategies including backup scheduling, retention policies, storage tiering, recovery testing, immutability, and cross-region replication. Use when implementing backup solutions, defining RPO/RTO targets, configuring backup retention, testing recovery procedures, managing backup storage costs, implementing immutable backups, or preparing for disaster recovery scenarios. Triggers on phrases like "data backup", "backup strategy", "recovery point objective", "RPO", "recovery time objective", "RTO", "backup retention", "immutable backup", "backup verification", "recovery testing", "cross-region backup", "tape backup", "backup encryption", "disaster recovery backup".
---

# Data Backup & Recovery

Comprehensive data backup and recovery strategies ensuring business continuity through properly designed backup architectures, tested recovery procedures, and defensible retention policies.

## Workflow

1. Classify data by criticality: identify crown jewels (customer databases, financial records, intellectual property), high-value data (email, documents, application configs), and low-value data (logs, temporary files, cached data).
2. Define RPO (Recovery Point Objective) and RTO (Recovery Time Objective) for each data class: align with business requirements, regulatory mandates, and SLA commitments.
3. Design backup architecture: backup types (full, incremental, differential, snapshot), backup windows, storage targets (local, cloud, offsite), encryption requirements, immutability.
4. Select backup solutions based on environment: Veeam (virtualization), Commvault (enterprise), Rubrik (cloud-native), AWS Backup (AWS-native), Azure Backup (Azure-native), Druva (SaaS), Spectra Logic (air-gapped and tiered storage).
5. Implement backup scheduling: daily incremental, weekly full, monthly archive; align with backup windows to minimize production impact; stagger backups to avoid resource contention.
6. Configure backup retention: align with legal/regulatory requirements (tax records: typically 7 years, HIPAA documentation: 6 years, PCI-DSS audit logs: 1 year minimum); implement tiered retention (30-day hot, 1-year warm, 7-year cold).
7. Encrypt all backups: AES-256 encryption at rest; TLS 1.2+ in transit; customer-managed encryption keys (CMEK) for sensitive data; key rotation every 12 months.
8. Enable immutability: WORM (Write Once Read Many) storage for cloud backups; air-gapped copies for on-prem; immutable retention locks to prevent ransomware deletion.
9. Test recovery procedures: quarterly restore tests for critical data; annual full disaster recovery exercise; measure actual RTO vs. target; document and improve.
10. Monitor and report: backup success/failure rates, storage utilization trends, compliance with retention policies, recovery test results; monthly report to stakeholders.
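
The tiered retention in step 6 can be sketched as a lookup on backup age. This is a minimal illustration, assuming the 30-day / 1-year / 7-year boundaries named in that step. The function name `retention_tier` and the `"expired"` label are hypothetical, and backup products normally express this through lifecycle policies rather than custom code.

```python
from datetime import date

# Tier boundaries taken from workflow step 6 (30-day hot, 1-year warm, 7-year cold).
# The tier labels and the "expired" result are illustrative assumptions.
TIER_LIMITS_DAYS = [
    ("hot", 30),
    ("warm", 365),
    ("cold", 7 * 365),
]


def retention_tier(backup_date: date, today: date) -> str:
    """Return the storage tier that a backup of the given age belongs in."""
    age_days = (today - backup_date).days
    if age_days < 0:
        raise ValueError("backup_date is in the future")
    for tier, limit in TIER_LIMITS_DAYS:
        if age_days <= limit:
            return tier
    # Older than the longest retention period: candidate for deletion.
    return "expired"


if __name__ == "__main__":
    today = date(2024, 6, 30)
    for taken in (date(2024, 6, 15), date(2023, 12, 1), date(2020, 1, 1), date(2015, 1, 1)):
        print(taken, retention_tier(taken, today))
```

An `"expired"` result only marks a backup as a deletion candidate. Anything under legal hold would need to be excluded before a cleanup job acts on it.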

## RPO/RTO Framework

```
RECOVERY OBJECTIVES BY DATA CLASSIFICATION
=============================================

CRITICAL DATA (Crown Jewels):

  Examples: Production databases, customer data, payment systems, email servers
  RPO (Recovery Point Objective): 0-15 minutes
    → Near-zero data loss tolerance
    → Achieved via: continuous replication, transaction log shipping, real-time sync
    → Backup frequency: Every 15 minutes (transaction logs) or real-time replication
  RTO (Recovery Time Objective): 0-2 hours
    → Maximum 2 hours to restore and resume operations
    → Achieved via: automated failover, pre-staged recovery environment, pilot light DR
  Recovery Strategy:
    → Primary: Synchronous replication to DR site (zero RPO but performance impact)
    → Alternative: Asynchronous replication (RPO 1-5 minutes) + automated failover
    → Tertiary: Frequent backups (15-min increments) + automated restore scripts
  Cost Impact: 2-5x of primary infrastructure cost (for synchronous replication)

HIGH-VALUE DATA:

  Examples: Application data, file shares, ERP/CRM data, document management systems
  RPO: 1-4 hours
    → Acceptable data loss: up to 4 hours of changes
    → Achieved via: hourly snapshots or replication
    → Backup frequency: Every 1-4 hours
  RTO: 4-8 hours
    → Business can operate with degraded functionality for up to 8 hours
    → Achieved via: backup restore with automated scripts; warm standby DR
  Recovery Strategy:
    → Primary: Hourly snapshots to secondary location
    → Alternative: 4-hour backup windows + staged restore (databases first, then files)
    → Tertiary: Daily full + hourly incremental + offsite copy
  Cost Impact: 1.5-2x of primary infrastructure cost

STANDARD DATA:

  Examples: Development environments, test data, internal collaboration data, logs
  RPO: 4-24 hours
    → Acceptable data loss: up to 1 day
    → Achieved via: daily backups
    → Backup frequency: Once daily (during off-peak hours)
  RTO: 24-48 hours
    → Business can operate without this data for up to 2 days
    → Achieved via: standard backup restore process
  Recovery Strategy:
    → Primary: Daily full backup to cloud storage
    → Alternative: Weekly full + daily incremental + cloud copy
    → Tertiary: On-demand snapshots before maintenance
  Cost Impact: 0.3-0.5x of primary infrastructure cost

ARCHIVAL / COMPLIANCE DATA:

  Examples: Historical records, audit logs, regulatory archives, legal hold data
  RPO: 24-72 hours (data is append-only; new entries backed up periodically)
  RTO: 72 hours - 7 days (retrieval time acceptable due to infrequent access)
  Recovery Strategy:
    → Primary: Cloud archive storage (AWS Glacier Deep Archive, Azure Archive, GCP Coldline)
    → Retention: 7-10 years (or as mandated by regulation)
    → Retrieval: 12-48 hours retrieval time; plan for advance retrieval requests
  Cost Impact: $0.001-$0.003 per GB/month (deep archive tier)

RPO/RTO ALIGNMENT WITH REGULATIONS:

  PCI-DSS:
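    → Requirement 12.10.2: Incident response plan, including recovery and backup procedures, reviewed and tested at least annually
    → Requirement 3.4 (3.5.1 in v4.0): Stored PAN rendered unreadable, including on backup media
    → RPO: ≤ 24 hours for cardholder data (common target; not prescribed by the standard)
    → RTO: ≤ 24 hours to resume card processing (common target; not prescribed by the standard)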

  HIPAA:
    → §164.308(a)(7): Contingency plan including data backup plan
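    → RPO: ≤ 24 hours for ePHI (common target; the rule sets no numeric value)
    → RTO: ≤ 48 hours to restore ePHI access (common target; the rule sets no numeric value)
    → Testing: Periodic testing and revision of contingency plans (§164.308(a)(7)(ii)(D)); annual testing is common practice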

  SOC 2:
    → Availability criterion: Backup procedures and recovery testing
    → Evidence: Backup logs, restore test results, DR test reports
    → RPO/RTO: Defined and documented per service SLA
```
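
The RPO targets above can be checked against a backup history by measuring the largest gap between recovery points. The sketch below assumes the upper bound of each RPO range from the framework. The dictionary keys and the function names `worst_case_data_loss` and `meets_rpo` are hypothetical, and a real check would read timestamps from the backup tool's job log.

```python
from datetime import datetime, timedelta

# Upper RPO bounds from the framework above; the keys are illustrative labels.
RPO_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=4),
    "standard": timedelta(hours=24),
    "archival": timedelta(hours=72),
}


def worst_case_data_loss(backup_times: list[datetime], now: datetime) -> timedelta:
    """Largest gap between consecutive recovery points, including the gap up to `now`."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(data_class: str, backup_times: list[datetime], now: datetime) -> bool:
    """True if no gap in the backup history exceeds the class's RPO target."""
    return worst_case_data_loss(backup_times, now) <= RPO_TARGETS[data_class]


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    # Hourly backups with two missed runs between hour 2 and hour 5.
    history = [start + timedelta(hours=h) for h in (0, 1, 2, 5)]
    now = start + timedelta(hours=6)
    print(worst_case_data_loss(history, now))
    print(meets_rpo("high", history, now), meets_rpo("critical", history, now))
```

Measuring the worst gap rather than the average matters because a single missed job is what determines how much data would be lost at the worst moment.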

## Backup Architecture Design

```
BACKUP ARCHITECTURE COMPONENTS
================================

BACKUP TYPES:

  Full Backup:
    → Copies all selected data
    → Advantages: Fastest restore (single backup set); independent of other backups
    → Disadvantages: Longest backup time; most storage consumed
    → Schedule: Weekly (recommended); daily for small datasets (<1 TB)

  Incremental Backup:
    → Copies only data changed since last backup (any type)
    → Advantages: Fastest backup; least storage per backup
    → Disadvantages: Restore requires full + all incrementals (chain dependency)
    → Schedule: Daily or multiple times per day

  Differential Backup:
    → Copies data changed since last FULL backup
    → Advantages: Faster restore than incremental (only full + latest differential needed)
    → Disadvantages: Backup time grows throughout the week
    → Schedule: Daily (with weekly full)

  Synthetic Full Backup:
    → Created from previous full + incrementals (no production I/O impact)
    → Advantages: Full backup available without production impact
    → Disadvantages: Relies on integrity of source backups
    → Schedule: Replace scheduled full backup (Veeam, Commvault support)

  Snapshot:
    → Point-in-time copy at storage level (near-instant)
    → Advantages: Sub-second creation; minimal production impact; RPO equals the snapshot interval
    → Disadvantages: Typically on same storage array (not offsite); dependency on primary
    → Schedule: Every 1-4 hours; copy to offsite for true backup

  Continuous Data Protection (CDP):
    → Every write operation captured; restore to any point in time
    → Advantages: Near-zero RPO; granular recovery
    → Disadvantages: Higher cost; performance overhead
    → Schedule: Real-time (continuous)

THE 3-2-1-1-0 BACKUP RULE:

  3 copies of data: Primary + 2 backup copies
  2 different media types: Disk + cloud (or tape); prevents single-failure-mode loss
  1 offsite copy: Geographically separate (different region or cloud)
  1 immutable/air-gapped copy: Cannot be modified or deleted (ransomware protection)
  0 errors: Automated backup verification; no silent corruption

BACKUP STORAGE TIERS:

  Tier 1 — Hot (Fast Recovery, 0-90 Days):
    → Storage: Local disk (NAS/SAN), cloud standard storage
    → Retention: 30-90 days
    → Recovery time: Minutes to hours
    → Cost: $0.02-$0.10/GB/month
    → Use: Daily operational restore, file recovery, database point-in-time

  Tier 2 — Warm (Standard Recovery, 90 Days - 2 Years):
    → Storage: Cloud cool storage (AWS S3 Standard-IA, Azure Cool)
    → Retention: 90 days - 2 years
    → Recovery time: Hours
    → Cost: $0.005-$0.02/GB/month
    → Use: Monthly backup retention, compliance backups

  Tier 3 — Cold (Archive Recovery, 2-7 Years):
    → Storage: Cloud archive (AWS Glacier, Azure Archive, GCP Coldline)
    → Retention: 2-7 years
    → Recovery time: 12-48 hours
    → Cost: $0.001-$0.003/GB/month
    → Use: Long-term compliance, legal hold, regulatory archives

  Tier 4 — Deep Archive (7+ Years):
    → Storage: AWS Glacier Deep Archive, Azure Archive, physical tape
    → Retention: 7-10+ years
    → Recovery time: 24-72 hours
    → Cost: $0.00099/GB/month (Glacier Deep Archive)
    → Use: Regulatory archives, legal requirements, historical records

BACKUP ENCRYPTION:

  At Rest:
    → Algorithm: AES-256 (industry standard)
    → Key management: Customer-managed keys (CMK) in KMS/HSM
    → Key rotation: Annual (automated via KMS)
    → Key backup: Keys stored separately from backup data (different region/account)

  In Transit:
    → Protocol: TLS 1.2+ (TLS 1.3 preferred)
    → Certificate validation: Server certificate verified
    → Self-managed certs: Validated against trusted CA

  Key Custody:
    → Encryption keys NEVER stored with backup data
    → Break-glass key access: 2-person control; audited access
    → Key destruction: Automated per retention policy; manual for legal holds
```
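
The restore dependencies described under BACKUP TYPES can be made concrete by computing which backup sets a restore actually needs. This is a minimal sketch using the definitions above: an incremental depends on the previous backup of any type, and a differential depends only on the last full. The `Backup` class, its `seq` field, and `restore_chain` are hypothetical names, not any product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    seq: int   # position in time order (hypothetical field)
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup], target_seq: int) -> list[Backup]:
    """Return the backups needed, in apply order, to restore to `target_seq`."""
    history = sorted((b for b in backups if b.seq <= target_seq), key=lambda b: b.seq)
    if not history or history[-1].seq != target_seq:
        raise ValueError("no backup exists at the requested point")

    chain: list[Backup] = []
    only_full_needed = False
    # Walk backwards from the target until a full backup closes the chain.
    for backup in reversed(history):
        if only_full_needed and backup.kind != "full":
            continue  # already covered by a later differential
        chain.append(backup)
        if backup.kind == "full":
            return list(reversed(chain))
        if backup.kind == "differential":
            # A differential holds every change since the last full.
            only_full_needed = True
    raise ValueError("chain is broken: no full backup found")


if __name__ == "__main__":
    week = [Backup(1, "full"), Backup(2, "incremental"),
            Backup(3, "differential"), Backup(4, "incremental")]
    print([b.seq for b in restore_chain(week, 4)])
```

The length of the returned chain is a rough proxy for restore risk: every extra link is another backup set that has to be intact for the restore to succeed.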

## Recovery Testing

```
RECOVERY TESTING PROGRAM
==========================

TEST TYPES AND FREQUENCY:

  File-Level Restore Test (Monthly):
    → Restore 5-10 random files from backup
    → Verify file integrity (checksum comparison)
    → Measure restore time (actual vs. expected)
    → Test for: individual files, directories, specific dates
    → Participants: Backup administrator
    → Documentation: Screenshot of restored files, integrity verification

  Database Restore Test (Quarterly):
    → Restore production database to isolated test environment
    → Verify data integrity (row counts, checksums, referential integrity)
    → Verify point-in-time recovery (restore to specific transaction)
    → Measure restore time (RTO validation)
    → Participants: DBA, application team
    → Documentation: Restore log, integrity test results, RTO measurement

  System-Level Restore Test (Quarterly):
    → Restore entire virtual machine or server from backup
    → Verify system boots and services start
    → Verify application functionality
    → Measure RTO from backup to operational system
    → Participants: System administrator, application owner
    → Documentation: System verification checklist, RTO measurement

  Full Disaster Recovery Exercise (Annually):
    → Simulate complete site failure (primary data center down)
    → Activate DR site / cloud recovery environment
    → Restore critical systems in priority order
    → Verify end-to-end business functionality
    → Measure actual RTO and RPO vs. targets
    → Participants: Full incident response team (IRT), business stakeholders, IT leadership
    → Documentation: DR test report, lessons learned, improvement plan

  Backup Integrity Verification (Continuous):
    → Automated backup verification (Veeam SureBackup, Commvault Media Verification)
    → Regular integrity checks (scrubbing, checksum validation)
    → Cloud storage: AWS S3 checksum verification, Azure blob integrity
    → Alert on: integrity failures, silent corruption, incomplete backups

RECOVERY TEST REPORT TEMPLATE:

  Test Date: [Date]
  Test Type: [File/Database/System/DR Exercise]
  Systems Tested: [List of systems/databases]
  Backup Source: [Backup type, date, storage location]
  Restore Target: [Test environment details]

  Results:
    → Restore success: [Yes/No] for each system
    → Data integrity: [Verified/Partial/Failed]
    → Restore time: [Actual duration vs. target RTO]
    → Issues encountered: [List any issues]

  Metrics:
    → Data volume restored: [GB/TB]
    → Restore throughput: [GB/hour]
    → RTO achieved: [X hours] vs. target: [Y hours]
    → RPO achieved: [X minutes/hours] vs. target: [Y minutes/hours]

  Lessons Learned:
    → What worked well
    → What needs improvement
    → Action items for improvement

  Sign-off:
    → Test lead: [Name, signature]
    → Backup administrator: [Name, signature]
    → Business stakeholder: [Name, signature]
```
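
The checksum comparison in the file-level restore test and the throughput metric in the report template can be sketched as follows. This assumes a manifest of SHA-256 hashes was recorded when the backup was taken. The manifest format and the names `verify_restore` and `restore_throughput_gb_per_hour` are hypothetical, and the throughput figure uses decimal gigabytes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict[str, str], restored_dir: Path) -> dict[str, str]:
    """Compare restored files against hashes recorded at backup time.

    `manifest` maps a relative path to its expected SHA-256 hex digest.
    Returns a mapping of relative path to "ok", "missing", or "mismatch".
    """
    results = {}
    for rel_path, expected in manifest.items():
        restored = restored_dir / rel_path
        if not restored.is_file():
            results[rel_path] = "missing"
        elif sha256_of(restored) != expected:
            results[rel_path] = "mismatch"
        else:
            results[rel_path] = "ok"
    return results


def restore_throughput_gb_per_hour(bytes_restored: int, seconds: float) -> float:
    """Throughput figure for the Metrics section of the test report."""
    return (bytes_restored / 1e9) / (seconds / 3600)
```

Comparing against a manifest rather than the live source avoids false mismatches on files that changed after the backup ran.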

## Integration Points

- **Veeam Backup & Replication**: Virtualization backup leader; VMware/Hyper-V backup; instant VM recovery; cloud tier (AWS/Azure/GCP); immutability; $3,200/year base + per-socket pricing
- **Commvault Complete Data Protection**: Enterprise-grade backup; 400+ connectors; cloud-native; AI-driven optimization (Command Center); info archive for compliance; $5,000+/year + per-petabyte pricing
- **Rubrik**: Cloud-native backup platform; immutable backups (RMM); cloud-to-cloud backup (O365, Salesforce); policy-based; self-service recovery; $5,000+/year + per-node pricing
- **AWS Backup**: Native AWS backup service; centralized backup policy; supports EC2, EBS, RDS, DynamoDB, EFS, FSx; cross-region copy; VSS-aware; pay-per-GB
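- **Azure Backup**: Native Azure backup; Recovery Services Vault; supports Azure VMs, SQL Server in Azure VMs, Azure Files, and on-premises workloads such as SharePoint via Azure Backup Server (Microsoft 365 data requires a separate backup service); cross-region restore; pay-per-GB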
- **Druva Phoenix**: Cloud-first backup; SaaS backup (O365, Google Workspace, Salesforce); end-user self-service recovery; in-place deduplication; $15/user/year
- **Spectra Logic (Shapedot)**: Scalable backup platform; inline deduplication; cloud tiering; air-gapped immutable storage; hybrid cloud support
- **Restic / BorgBackup**: Open-source backup tools; encrypted; deduplication; CLI-based; suitable for Linux servers and development environments; free

## Edge Cases

- **Ransomware targeting backups**: Modern ransomware specifically targets and encrypts backups; defense: immutable backups (WORM), air-gapped copies, offline tape, separate backup network with no access from production
- **Massive data volumes (100+ TB)**: Deduplication critical (ratios of 10:1-50:1 are often cited but depend heavily on data type and change rate); synthetic full backups to avoid production impact; parallel backup streams; backup window may require off-peak scheduling over multiple days
- **SaaS application backup (Office 365, Google Workspace, Salesforce)**: Under the shared responsibility model, the provider keeps the service available, but native retention is limited and is not a substitute for backup; use a third-party cloud-to-cloud backup tool with API access; back up user data, SharePoint sites, and CRM records; retain deleted items for 30-90 days
- **Database point-in-time recovery (PITR)**: Requires transaction log backups (SQL Server), WAL archiving (PostgreSQL), binary logs (MySQL); test PITR regularly; verify log chain integrity; RPO determined by log backup frequency
- **Cross-region/cross-cloud backup**: Data sovereignty rules may require data to stay in-country; where they apply, back up to same-country regions only; consider sovereign cloud options; perform transfer impact assessments before replicating across borders
- **Backup window constraints**: Production systems cannot tolerate backup I/O impact; solution: CDP (continuous protection), snapshots (sub-second), backup proxy (offload I/O to dedicated server), off-peak scheduling
- **Legal hold and eDiscovery**: Backup data may be subject to legal hold; cannot delete per retention policy; implement legal hold capability in backup solution; coordinate with legal team; flag held data in retention management
