Data Backup Recovery

Design and manage comprehensive data backup and recovery strategies including backup scheduling, retention policies, storage tiering, recovery testing, immutability, and cross-region replication. Use when implementing backup solutions, defining RPO/RTO targ...

Data Backup & Recovery

Comprehensive data backup and recovery strategies ensuring business continuity through properly designed backup architectures, tested recovery procedures, and defensible retention policies.

Workflow

  1. Classify data by criticality: identify crown jewels (customer databases, financial records, intellectual property), high-value data (email, documents, application configs), and low-value data (logs, temporary files, cached data).
  2. Define RPO (Recovery Point Objective) and RTO (Recovery Time Objective) for each data class: align with business requirements, regulatory mandates, and SLA commitments.
  3. Design backup architecture: backup types (full, incremental, differential, snapshot), backup windows, storage targets (local, cloud, offsite), encryption requirements, immutability.
  4. Select backup solutions based on environment: Veeam (virtualization), Commvault (enterprise), Rubrik (cloud-native), AWS Backup (AWS-native), Azure Backup (Azure-native), Druva/Spectra (SaaS).
  5. Implement backup scheduling: daily incremental, weekly full, monthly archive; align with backup windows to minimize production impact; stagger backups to avoid resource contention.
  6. Configure backup retention: align with legal/regulatory requirements (for example, tax records: commonly 7 years; HIPAA required documentation: 6 years; PCI-DSS audit logs: at least 12 months); implement tiered retention (30-day hot, 1-year warm, 7-year cold).
  7. Encrypt all backups: AES-256 encryption at rest; TLS 1.2+ in transit; customer-managed encryption keys (CMEK) for sensitive data; key rotation every 12 months.
  8. Enable immutability: WORM (Write Once Read Many) storage for cloud backups; air-gapped copies for on-prem; immutable retention locks to prevent ransomware deletion.
  9. Test recovery procedures: quarterly restore tests for critical data; annual full disaster recovery exercise; measure actual RTO vs. target; document and improve.
  10. Monitor and report: backup success/failure rates, storage utilization trends, compliance with retention policies, recovery test results; monthly report to stakeholders.
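
The scheduling step (step 5) can be sketched in a few lines of Python. This is a minimal illustration, not a vendor scheduler: running the monthly archive on the 1st, the weekly full on Sunday, and spacing job starts 20 minutes apart are assumptions chosen for the example, and the function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def backup_type_for(day: date) -> str:
    """Pick the backup type for a calendar day.

    Assumed policy: monthly archive on the 1st, weekly full on Sunday,
    daily incremental on every other day.
    """
    if day.day == 1:
        return "monthly-archive"
    if day.weekday() == 6:  # Monday is 0, Sunday is 6
        return "weekly-full"
    return "daily-incremental"


def staggered_starts(jobs, window_start: datetime, gap_minutes: int = 20):
    """Spread job start times across the backup window to limit contention."""
    return {
        job: window_start + timedelta(minutes=index * gap_minutes)
        for index, job in enumerate(jobs)
    }


if __name__ == "__main__":
    print(backup_type_for(date(2024, 6, 3)))
    print(staggered_starts(["orders-db", "file-share"], datetime(2024, 6, 3, 1, 0)))
```

In practice the backup product's own scheduler would carry this policy; the sketch only makes the decision rule explicit so it can be reviewed.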

RPO/RTO Framework

RECOVERY OBJECTIVES BY DATA CLASSIFICATION
=============================================

CRITICAL DATA (Crown Jewels):

  Examples: Production databases, customer data, payment systems, financial records
  RPO (Recovery Point Objective): 0-15 minutes
    → Near-zero data loss tolerance
    → Achieved via: continuous replication, transaction log shipping, real-time sync
    → Backup frequency: Every 15 minutes (transaction logs) or real-time replication
  RTO (Recovery Time Objective): 0-2 hours
    → Maximum 2 hours to restore and resume operations
    → Achieved via: automated failover, pre-staged recovery environment, pilot light DR
  Recovery Strategy:
    → Primary: Synchronous replication to DR site (zero RPO but performance impact)
    → Alternative: Asynchronous replication (RPO 1-5 minutes) + automated failover
    → Tertiary: Frequent backups (15-min increments) + automated restore scripts
  Cost Impact: 2-5x of primary infrastructure cost (for synchronous replication)

HIGH-VALUE DATA:

  Examples: Application data, file shares, ERP/CRM data, document management systems
  RPO: 1-4 hours
    → Acceptable data loss: up to 4 hours of changes
    → Achieved via: hourly snapshots or replication
    → Backup frequency: Every 1-4 hours
  RTO: 4-8 hours
    → Business can operate with degraded functionality for up to 8 hours
    → Achieved via: backup restore with automated scripts; warm standby DR
  Recovery Strategy:
    → Primary: Hourly snapshots to secondary location
    → Alternative: 4-hour backup windows + staged restore (databases first, then files)
    → Tertiary: Daily full + hourly incremental + offsite copy
  Cost Impact: 1.5-2x of primary infrastructure cost

STANDARD DATA:

  Examples: Development environments, test data, internal collaboration data, logs
  RPO: 4-24 hours
    → Acceptable data loss: up to 1 day
    → Achieved via: daily backups
    → Backup frequency: Once daily (during off-peak hours)
  RTO: 24-48 hours
    → Business can operate without this data for up to 2 days
    → Achieved via: standard backup restore process
  Recovery Strategy:
    → Primary: Daily full backup to cloud storage
    → Alternative: Weekly full + daily incremental + cloud copy
    → Tertiary: On-demand snapshots before maintenance
  Cost Impact: 0.3-0.5x of primary infrastructure cost

ARCHIVAL / COMPLIANCE DATA:

  Examples: Historical records, audit logs, regulatory archives, legal hold data
  RPO: 24-72 hours (data is append-only; new entries backed up periodically)
  RTO: 72 hours - 7 days (retrieval time acceptable due to infrequent access)
  Recovery Strategy:
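    → Primary: Cloud archive storage (Amazon S3 Glacier Deep Archive, Azure Archive, Google Cloud Storage Archive class)
    → Retention: 7-10 years (or as mandated by regulation)
    → Retrieval: Typically 12-48 hours on S3 Glacier Deep Archive (standard vs. bulk) and several hours on Azure Archive; Google Cloud Archive data is readable immediately but carries retrieval fees; submit retrieval requests in advance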
  Cost Impact: $0.001-$0.003 per GB/month (deep archive tier)
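
The four classes above can be captured as data and checked mechanically. The following is a minimal sketch: it uses the upper bound of each RPO/RTO range listed above, and it assumes that worst-case data loss equals the backup interval (a failure just before the next backup). The class keys and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    max_rpo: timedelta
    max_rto: timedelta


# Upper bounds of the ranges in the framework above.
OBJECTIVES = {
    "critical": RecoveryObjective(timedelta(minutes=15), timedelta(hours=2)),
    "high-value": RecoveryObjective(timedelta(hours=4), timedelta(hours=8)),
    "standard": RecoveryObjective(timedelta(hours=24), timedelta(hours=48)),
    "archival": RecoveryObjective(timedelta(hours=72), timedelta(days=7)),
}


def meets_objectives(data_class: str, backup_interval: timedelta,
                     measured_restore_time: timedelta) -> dict:
    """Compare a backup interval and a measured restore time with the targets."""
    target = OBJECTIVES[data_class]
    return {
        # Worst case: everything written since the last backup is lost.
        "rpo_met": backup_interval <= target.max_rpo,
        "rto_met": measured_restore_time <= target.max_rto,
    }


if __name__ == "__main__":
    print(meets_objectives("high-value", timedelta(hours=1), timedelta(hours=9)))
```

Feeding measured restore times from recovery tests into a check like this turns the RTO targets into something that can fail visibly instead of staying on paper.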

RPO/RTO ALIGNMENT WITH REGULATIONS:

  PCI-DSS:
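    → Requirement 12.10.2: Incident response plan, including data backup and business recovery procedures, reviewed and tested at least once every 12 months
    → Requirement 3.4 (v3.2.1; 3.5.1 in v4.0): Stored PAN rendered unreadable wherever it is kept, including backup media
    → RPO: PCI-DSS sets no numeric value; ≤ 24 hours for cardholder data is a reasonable internal target
    → RTO: PCI-DSS sets no numeric value; ≤ 24 hours to resume card processing is a reasonable internal target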

  HIPAA:
    → §164.308(a)(7): Contingency plan including data backup plan
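    → RPO: Not prescribed by the rule; ≤ 24 hours for ePHI is a common target, to be confirmed by risk analysis
    → RTO: Not prescribed by the rule; ≤ 48 hours to restore ePHI access is a common target
    → Testing: §164.308(a)(7)(ii)(D) calls for periodic testing and revision of contingency plans; annual testing is common practice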

  SOC 2:
    → Availability criterion: Backup procedures and recovery testing
    → Evidence: Backup logs, restore test results, DR test reports
    → RPO/RTO: Defined and documented per service SLA

Backup Architecture Design

BACKUP ARCHITECTURE COMPONENTS
================================

BACKUP TYPES:

  Full Backup:
    → Copies all selected data
    → Advantages: Fastest restore (single backup set); independent of other backups
    → Disadvantages: Longest backup time; most storage consumed
    → Schedule: Weekly (recommended); daily for small datasets (<1 TB)

  Incremental Backup:
    → Copies only data changed since last backup (any type)
    → Advantages: Fastest backup; least storage per backup
    → Disadvantages: Restore requires full + all incrementals (chain dependency)
    → Schedule: Daily or multiple times per day

  Differential Backup:
    → Copies data changed since last FULL backup
    → Advantages: Faster restore than incremental (only full + latest differential needed)
    → Disadvantages: Backup time grows throughout the week
    → Schedule: Daily (with weekly full)

  Synthetic Full Backup:
    → Created from previous full + incrementals (no production I/O impact)
    → Advantages: Full backup available without production impact
    → Disadvantages: Relies on integrity of source backups
    → Schedule: Replace scheduled full backup (Veeam, Commvault support)

  Snapshot:
    → Point-in-time copy at storage level (near-instant)
    → Advantages: Near-instant creation with minimal production impact; RPO equals the snapshot interval, so frequent snapshots keep it low
    → Disadvantages: Typically on same storage array (not offsite); dependency on primary
    → Schedule: Every 1-4 hours; copy to offsite for true backup

  Continuous Data Protection (CDP):
    → Every write operation captured; restore to any point in time
    → Advantages: Near-zero RPO; granular recovery
    → Disadvantages: Higher cost; performance overhead
    → Schedule: Real-time (continuous)
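
The chain dependencies described for incremental and differential backups can be made concrete with a small resolver. This is a sketch under simplifying assumptions: backups are identified only by timestamp and kind, a differential covers every change since the last full, and an incremental covers changes since the previous backup of any type. The `Backup` type and `restore_chain` function are hypothetical, not part of any backup product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups, target: datetime):
    """Return the backups to apply, in order, to reach the newest state at or before target."""
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target time")

    base = full_positions[-1]
    chain = [usable[base]]
    later = usable[base + 1:]

    # The latest differential supersedes everything between the full and itself.
    diff_positions = [i for i, b in enumerate(later) if b.kind == "differential"]
    start = 0
    if diff_positions:
        chain.append(later[diff_positions[-1]])
        start = diff_positions[-1] + 1

    # Every incremental after that point is required; one missing link breaks the chain.
    chain.extend(b for b in later[start:] if b.kind == "incremental")
    return chain


if __name__ == "__main__":
    week = [Backup(datetime(2024, 6, 2), "full")] + [
        Backup(datetime(2024, 6, day), "incremental") for day in (3, 4, 5)
    ]
    for item in restore_chain(week, datetime(2024, 6, 5, 12, 0)):
        print(item.taken_at.date(), item.kind)
```

With a weekly full and daily incrementals, a Wednesday restore needs four backup sets; with daily differentials it needs two, which is the restore-time trade-off noted above.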

THE 3-2-1-1-0 BACKUP RULE:

  3 copies of data: Primary + 2 backup copies
  2 different media types: Disk + cloud (or tape); prevents single-failure-mode loss
  1 offsite copy: Geographically separate (different region or cloud)
  1 immutable/air-gapped copy: Cannot be modified or deleted (ransomware protection)
  0 errors: Automated backup verification; no silent corruption
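
The rule lends itself to an automated policy check. The sketch below assumes an inventory of data copies (the primary counts as one of the three) with a few hand-maintained attributes; the `DataCopy` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str          # e.g. "disk", "cloud", "tape"
    offsite: bool
    immutable: bool     # WORM-locked or air-gapped
    verify_errors: int  # errors reported by the last verification run


def check_3_2_1_1_0(copies) -> dict:
    """Evaluate each clause of the 3-2-1-1-0 rule for a set of copies."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        "0_errors": all(c.verify_errors == 0 for c in copies),
    }


if __name__ == "__main__":
    inventory = [
        DataCopy("primary", "disk", offsite=False, immutable=False, verify_errors=0),
        DataCopy("local-backup", "disk", offsite=False, immutable=False, verify_errors=0),
        DataCopy("cloud-vault", "cloud", offsite=True, immutable=True, verify_errors=0),
    ]
    print(check_3_2_1_1_0(inventory))
```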

BACKUP STORAGE TIERS:

  Tier 1 — Hot (Fast Recovery, 0-90 Days):
    → Storage: Local disk (NAS/SAN), cloud standard storage
    → Retention: 30-90 days
    → Recovery time: Minutes to hours
    → Cost: $0.02-$0.10/GB/month
    → Use: Daily operational restore, file recovery, database point-in-time

  Tier 2 — Warm (Standard Recovery, 90 Days - 2 Years):
    → Storage: Cloud cool storage (AWS S3 Standard-IA, Azure Cool)
    → Retention: 1-2 years
    → Recovery time: Hours
    → Cost: $0.005-$0.02/GB/month
    → Use: Monthly backup retention, compliance backups

  Tier 3 — Cold (Archive Recovery, 2-7 Years):
    → Storage: Cloud archive (AWS Glacier, Azure Archive, GCP Coldline)
    → Retention: 2-7 years
    → Recovery time: 12-48 hours
    → Cost: $0.001-$0.003/GB/month
    → Use: Long-term compliance, legal hold, regulatory archives

  Tier 4 — Deep Archive (7+ Years):
    → Storage: AWS Glacier Deep Archive, Azure Archive, physical tape
    → Retention: 7-10+ years
    → Recovery time: 24-72 hours
    → Cost: $0.00099/GB/month (Glacier Deep Archive)
    → Use: Regulatory archives, legal requirements, historical records
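
Tier placement by backup age, plus a rough monthly cost, can be sketched as follows. The age boundaries follow the tier headings above; the prices are the upper end of each listed range and are illustrative only, since real pricing varies by provider and region and excludes retrieval and request fees. Function and constant names are hypothetical.

```python
# Upper end of each price range listed above, in USD per GB per month (illustrative).
PRICE_PER_GB_MONTH = {
    "hot": 0.10,
    "warm": 0.02,
    "cold": 0.003,
    "deep-archive": 0.00099,
}


def tier_for_age(age_days: int) -> str:
    """Map a backup's age to the storage tier it should live in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 90:
        return "hot"
    if age_days <= 2 * 365:
        return "warm"
    if age_days <= 7 * 365:
        return "cold"
    return "deep-archive"


def monthly_cost_usd(size_gb: float, tier: str) -> float:
    """Storage-only cost estimate for one month in the given tier."""
    return size_gb * PRICE_PER_GB_MONTH[tier]


if __name__ == "__main__":
    for age in (30, 400, 1000, 3000):
        tier = tier_for_age(age)
        print(age, tier, round(monthly_cost_usd(1000, tier), 2))
```

Cloud lifecycle rules normally perform these transitions automatically; a function like this is mainly useful for forecasting spend before committing to a retention policy.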

BACKUP ENCRYPTION:

  At Rest:
    → Algorithm: AES-256 (industry standard)
    → Key management: Customer-managed keys (CMK) in KMS/HSM
    → Key rotation: Annual (automated via KMS)
    → Key backup: Keys stored separately from backup data (different region/account)

  In Transit:
    → Protocol: TLS 1.2+ (TLS 1.3 preferred)
    → Certificate validation: Server certificate verified
    → Self-managed certs: Issued by an internal CA that backup clients explicitly trust; never accept unverified self-signed certificates

  Key Custody:
    → Encryption keys NEVER stored with backup data
    → Break-glass key access: 2-person control; audited access
    → Key destruction: Automated per retention policy; manual for legal holds
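
The in-transit requirements above (TLS 1.2 or later, verified server certificate) map directly onto Python's standard `ssl` module. This is a minimal sketch for a custom backup client; the function name and the optional `ca_file` parameter for an internal CA bundle are assumptions for illustration.

```python
import ssl
from typing import Optional


def backup_transport_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side TLS context for sending backup data.

    ca_file is an optional path to a PEM bundle for an internal CA;
    when omitted, the system trust store is used.
    """
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse anything older than TLS 1.2; TLS 1.3 is negotiated when both sides support it.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = backup_transport_context()
    print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

The context can then be passed to `http.client.HTTPSConnection` or `ssl.SSLContext.wrap_socket` by whatever transfer code the backup job uses.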

Recovery Testing

RECOVERY TESTING PROGRAM
==========================

TEST TYPES AND FREQUENCY:

  File-Level Restore Test (Monthly):
    → Restore 5-10 random files from backup
    → Verify file integrity (checksum comparison)
    → Measure restore time (actual vs. expected)
    → Test for: individual files, directories, specific dates
    → Participants: Backup administrator
    → Documentation: Screenshot of restored files, integrity verification

  Database Restore Test (Quarterly):
    → Restore production database to isolated test environment
    → Verify data integrity (row counts, checksums, referential integrity)
    → Verify point-in-time recovery (restore to specific transaction)
    → Measure restore time (RTO validation)
    → Participants: DBA, application team
    → Documentation: Restore log, integrity test results, RTO measurement

  System-Level Restore Test (Quarterly):
    → Restore entire virtual machine or server from backup
    → Verify system boots and services start
    → Verify application functionality
    → Measure RTO from backup to operational system
    → Participants: System administrator, application owner
    → Documentation: System verification checklist, RTO measurement

  Full Disaster Recovery Exercise (Annually):
    → Simulate complete site failure (primary data center down)
    → Activate DR site / cloud recovery environment
    → Restore critical systems in priority order
    → Verify end-to-end business functionality
    → Measure actual RTO and RPO vs. targets
    → Participants: Full incident response team (IRT), business stakeholders, IT leadership
    → Documentation: DR test report, lessons learned, improvement plan

  Backup Integrity Verification (Continuous):
    → Automated recovery verification (Veeam SureBackup, Commvault Media Verification)
    → Regular integrity checks (scrubbing, checksum validation)
    → Cloud storage: AWS S3 checksum verification, Azure blob integrity
    → Alert on: integrity failures, silent corruption, incomplete backups
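
The checksum comparison used in the file-level restore test can be automated with the standard library. The sketch assumes a manifest of SHA-256 digests was recorded at backup time, keyed by relative path; the manifest format and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict, restore_root) -> list:
    """Return the relative paths that are missing or whose checksum differs.

    manifest maps relative path -> SHA-256 hex digest recorded at backup time.
    """
    failures = []
    for relative_path, expected in manifest.items():
        restored = Path(restore_root) / relative_path
        if not restored.is_file() or sha256_of(restored) != expected:
            failures.append(relative_path)
    return failures
```

An empty result means every sampled file restored intact; any returned path should raise the integrity alert described above.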

RECOVERY TEST REPORT TEMPLATE:

  Test Date: [Date]
  Test Type: [File/Database/System/DR Exercise]
  Systems Tested: [List of systems/databases]
  Backup Source: [Backup type, date, storage location]
  Restore Target: [Test environment details]

  Results:
    → Restore success: [Yes/No] for each system
    → Data integrity: [Verified/Partial/Failed]
    → Restore time: [Actual duration vs. target RTO]
    → Issues encountered: [List any issues]

  Metrics:
    → Data volume restored: [GB/TB]
    → Restore throughput: [GB/hour]
    → RTO achieved: [X hours] vs. target: [Y hours]
    → RPO achieved: [X minutes/hours] vs. target: [Y minutes/hours]

  Lessons Learned:
    → What worked well
    → What needs improvement
    → Action items for improvement

  Sign-off:
    → Test lead: [Name, signature]
    → Backup administrator: [Name, signature]
    → Business stakeholder: [Name, signature]
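
The Metrics section of the template can be derived from three measured values. The following is a small sketch, assuming restore duration and data volume are recorded per system; the `RestoreTestResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreTestResult:
    system: str
    data_restored_gb: float
    restore_hours: float
    target_rto_hours: float

    def __post_init__(self):
        if self.restore_hours <= 0:
            raise ValueError("restore_hours must be positive")

    @property
    def throughput_gb_per_hour(self) -> float:
        return self.data_restored_gb / self.restore_hours

    @property
    def rto_met(self) -> bool:
        return self.restore_hours <= self.target_rto_hours

    def metrics_lines(self) -> list:
        """Render the Metrics rows in the same layout as the template."""
        return [
            f"    → Data volume restored: {self.data_restored_gb:g} GB",
            f"    → Restore throughput: {self.throughput_gb_per_hour:g} GB/hour",
            f"    → RTO achieved: {self.restore_hours:g} hours vs. target: "
            f"{self.target_rto_hours:g} hours",
        ]


if __name__ == "__main__":
    result = RestoreTestResult("orders-db", 500.0, 2.5, 2.0)
    print("\n".join(result.metrics_lines()))
    print("RTO met:", result.rto_met)
```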

Integration Points

Edge Cases