Data Backup Recovery
Design and manage comprehensive data backup and recovery strategies including backup scheduling, retention policies, storage tiering, recovery testing, immutability, and cross-region replication. Use when implementing backup solutions, defining RPO/RTO targ...
Data Backup & Recovery
Comprehensive data backup and recovery strategies ensuring business continuity through properly designed backup architectures, tested recovery procedures, and defensible retention policies.
Workflow
- Classify data by criticality: identify crown jewels (customer databases, financial records, intellectual property), high-value data (email, documents, application configs), and low-value data (logs, temporary files, cached data).
- Define RPO (Recovery Point Objective) and RTO (Recovery Time Objective) for each data class: align with business requirements, regulatory mandates, and SLA commitments.
- Design backup architecture: backup types (full, incremental, differential, snapshot), backup windows, storage targets (local, cloud, offsite), encryption requirements, immutability.
- Select backup solutions based on environment: Veeam (virtualization), Commvault (enterprise), Rubrik (cloud-native), AWS Backup (AWS-native), Azure Backup (Azure-native), Druva (SaaS), Spectra Logic (tape and object storage).
- Implement backup scheduling: daily incremental, weekly full, monthly archive; align with backup windows to minimize production impact; stagger backups to avoid resource contention.
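- Configure backup retention: align with legal/regulatory requirements, which vary by jurisdiction and record type (for example, tax records: commonly 7 years; HIPAA-required documentation: 6 years; PCI-DSS audit logs: at least 1 year); implement tiered retention (30-day hot, 1-year warm, 7-year cold).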
- Encrypt all backups: AES-256 encryption at rest; TLS 1.2+ in transit; customer-managed encryption keys (CMEK) for sensitive data; key rotation every 12 months.
- Enable immutability: WORM (Write Once Read Many) storage for cloud backups; air-gapped copies for on-prem; immutable retention locks to prevent ransomware deletion.
- Test recovery procedures: quarterly restore tests for critical data; annual full disaster recovery exercise; measure actual RTO vs. target; document and improve.
- Monitor and report: backup success/failure rates, storage utilization trends, compliance with retention policies, recovery test results; monthly report to stakeholders.
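The scheduling step above (daily incremental, weekly full, monthly archive) can be sketched as a small function that maps a calendar day to a backup type. This is a minimal illustration only: running the weekly full on Sunday, the monthly archive on the 1st, and letting the archive take precedence when both fall on the same day are assumptions, not requirements from the workflow.

```python
from datetime import date

def backup_type_for(day: date) -> str:
    """Return the backup type to run on a given calendar day.

    Assumed (hypothetical) policy: monthly archive on the 1st,
    weekly full on Sundays, incremental on every other day.
    """
    if day.day == 1:
        return "monthly-archive"
    if day.weekday() == 6:  # Monday is 0, Sunday is 6
        return "weekly-full"
    return "daily-incremental"

if __name__ == "__main__":
    for d in (date(2024, 6, 1), date(2024, 6, 2), date(2024, 6, 3)):
        print(d.isoformat(), backup_type_for(d))
```

Keeping the rule in one function makes it easy to stagger jobs later, for example by giving each system group a different weekday for its full backup.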
RPO/RTO Framework
RECOVERY OBJECTIVES BY DATA CLASSIFICATION
=============================================
CRITICAL DATA (Crown Jewels):
Examples: Production databases, customer data, payment systems, email servers
RPO (Recovery Point Objective): 0-15 minutes
→ Near-zero data loss tolerance
→ Achieved via: continuous replication, transaction log shipping, real-time sync
→ Backup frequency: Every 15 minutes (transaction logs) or real-time replication
RTO (Recovery Time Objective): 0-2 hours
→ Maximum 2 hours to restore and resume operations
→ Achieved via: automated failover, pre-staged recovery environment, pilot light DR
Recovery Strategy:
→ Primary: Synchronous replication to DR site (zero RPO but performance impact)
→ Alternative: Asynchronous replication (RPO 1-5 minutes) + automated failover
→ Tertiary: Frequent backups (15-min increments) + automated restore scripts
Cost Impact: 2-5x of primary infrastructure cost (for synchronous replication)
HIGH-VALUE DATA:
Examples: Application data, file shares, ERP/CRM data, document management systems
RPO: 1-4 hours
→ Acceptable data loss: up to 4 hours of changes
→ Achieved via: hourly snapshots or replication
→ Backup frequency: Every 1-4 hours
RTO: 4-8 hours
→ Business can operate with degraded functionality for up to 8 hours
→ Achieved via: backup restore with automated scripts; warm standby DR
Recovery Strategy:
→ Primary: Hourly snapshots to secondary location
→ Alternative: 4-hour backup windows + staged restore (databases first, then files)
→ Tertiary: Daily full + hourly incremental + offsite copy
Cost Impact: 1.5-2x of primary infrastructure cost
STANDARD DATA:
Examples: Development environments, test data, internal collaboration data, logs
RPO: 4-24 hours
→ Acceptable data loss: up to 1 day
→ Achieved via: daily backups
→ Backup frequency: Once daily (during off-peak hours)
RTO: 24-48 hours
→ Business can operate without this data for up to 2 days
→ Achieved via: standard backup restore process
Recovery Strategy:
→ Primary: Daily full backup to cloud storage
→ Alternative: Weekly full + daily incremental + cloud copy
→ Tertiary: On-demand snapshots before maintenance
Cost Impact: 0.3-0.5x of primary infrastructure cost
ARCHIVAL / COMPLIANCE DATA:
Examples: Historical records, audit logs, regulatory archives, legal hold data
RPO: 24-72 hours (data is append-only; new entries backed up periodically)
RTO: 72 hours - 7 days (retrieval time acceptable due to infrequent access)
Recovery Strategy:
→ Primary: Cloud archive storage (AWS S3 Glacier Deep Archive, Azure Archive, Google Cloud Storage Archive)
→ Retention: 7-10 years (or as mandated by regulation)
→ Retrieval: 12-48 hours retrieval time; plan for advance retrieval requests
Cost Impact: $0.001-$0.003 per GB/month (deep archive tier)
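The objectives per data class can be captured as a lookup table and checked against what a system actually achieves. The sketch below uses the upper bound of each RPO/RTO range listed above; the class keys and the function name `meets_objectives` are illustrative assumptions.

```python
from datetime import timedelta

# Upper bounds of the RPO/RTO ranges for each data class.
OBJECTIVES = {
    "critical":   {"rpo": timedelta(minutes=15), "rto": timedelta(hours=2)},
    "high-value": {"rpo": timedelta(hours=4),    "rto": timedelta(hours=8)},
    "standard":   {"rpo": timedelta(hours=24),   "rto": timedelta(hours=48)},
    "archival":   {"rpo": timedelta(hours=72),   "rto": timedelta(days=7)},
}

def meets_objectives(data_class, backup_interval, measured_restore):
    """Compare a backup interval and a measured restore time with the class targets.

    Worst-case data loss is roughly one backup interval, so the interval is
    compared with the RPO; the measured restore time is compared with the RTO.
    """
    target = OBJECTIVES[data_class]
    return {
        "rpo_ok": backup_interval <= target["rpo"],
        "rto_ok": measured_restore <= target["rto"],
    }

if __name__ == "__main__":
    print(meets_objectives("critical", timedelta(minutes=15), timedelta(hours=3)))
```

In this example a critical system backed up every 15 minutes satisfies its RPO, but a 3-hour measured restore misses the 2-hour RTO.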
RPO/RTO ALIGNMENT WITH REGULATIONS:
PCI-DSS:
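→ Requirement 12.10.2: Incident response plan, including business recovery and data backup processes, reviewed and tested at least annually
→ Requirement 3.4 (v3.2.1; 3.5.1 in v4.0): PAN rendered unreadable wherever stored, including backup media
→ RPO: ≤ 24 hours for cardholder data (suggested target; the standard does not prescribe a value)
→ RTO: ≤ 24 hours to resume card processing (suggested target; the standard does not prescribe a value)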
HIPAA:
→ §164.308(a)(7): Contingency plan including data backup plan
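→ RPO: ≤ 24 hours for ePHI (suggested target; the rule does not prescribe a value)
→ RTO: ≤ 48 hours to restore ePHI access (suggested target; the rule does not prescribe a value)
→ Testing: §164.308(a)(7)(ii)(D) calls for periodic testing and revision of contingency plans; annual testing is a common interpretation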
SOC 2:
→ Availability criterion: Backup procedures and recovery testing
→ Evidence: Backup logs, restore test results, DR test reports
→ RPO/RTO: Defined and documented per service SLA
Backup Architecture Design
BACKUP ARCHITECTURE COMPONENTS
================================
BACKUP TYPES:
Full Backup:
→ Copies all selected data
→ Advantages: Fastest restore (single backup set); independent of other backups
→ Disadvantages: Longest backup time; most storage consumed
→ Schedule: Weekly (recommended); daily for small datasets (<1 TB)
Incremental Backup:
→ Copies only data changed since last backup (any type)
→ Advantages: Fastest backup; least storage per backup
→ Disadvantages: Restore requires full + all incrementals (chain dependency)
→ Schedule: Daily or multiple times per day
Differential Backup:
→ Copies data changed since last FULL backup
→ Advantages: Faster restore than incremental (only full + latest differential needed)
→ Disadvantages: Backup time grows throughout the week
→ Schedule: Daily (with weekly full)
Synthetic Full Backup:
→ Created from previous full + incrementals (no production I/O impact)
→ Advantages: Full backup available without production impact
→ Disadvantages: Relies on integrity of source backups
→ Schedule: Replace scheduled full backup (Veeam, Commvault support)
Snapshot:
→ Point-in-time copy at storage level (near-instant)
→ Advantages: Near-instant creation; minimal backup window (RPO equals the snapshot interval)
→ Disadvantages: Typically on same storage array (not offsite); dependency on primary
→ Schedule: Every 1-4 hours; copy to offsite for true backup
Continuous Data Protection (CDP):
→ Every write operation captured; restore to any point in time
→ Advantages: Near-zero RPO; granular recovery
→ Disadvantages: Higher cost; performance overhead
→ Schedule: Real-time (continuous)
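The chain dependency that separates incremental from differential backups can be made concrete with a short sketch. It computes which backup sets a restore needs, following the definitions above (an incremental holds changes since the last backup of any type; a differential holds changes since the last full). The `Backup` record and the `restore_chain` function are hypothetical names for illustration, not part of any product's API.

```python
from dataclasses import dataclass

@dataclass
class Backup:
    seq: int    # position in time order (higher = later)
    kind: str   # "full", "incremental", or "differential"

def restore_chain(backups, target_seq):
    """Return the backups needed to restore the state as of target_seq."""
    history = sorted((b for b in backups if b.seq <= target_seq), key=lambda b: b.seq)
    fulls = [b for b in history if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target")
    base = fulls[-1]
    after_base = [b for b in history if b.seq > base.seq]
    diffs = [b for b in after_base if b.kind == "differential"]
    chain = [base]
    start = base.seq
    if diffs:
        # A differential holds every change since the full, so only the latest is needed.
        chain.append(diffs[-1])
        start = diffs[-1].seq
    # Every incremental taken after the last full/differential in the chain must be replayed.
    chain.extend(b for b in after_base if b.kind == "incremental" and b.seq > start)
    return chain

if __name__ == "__main__":
    week = [Backup(1, "full"), Backup(2, "incremental"), Backup(3, "incremental")]
    print([b.seq for b in restore_chain(week, 3)])
```

With a full plus two incrementals, all three sets are required; replacing the incrementals with differentials would reduce the restore to the full plus the latest differential.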
THE 3-2-1-1-0 BACKUP RULE:
3 copies of data: Primary + 2 backup copies
2 different media types: Disk + cloud (or tape); prevents single-failure-mode loss
1 offsite copy: Geographically separate (different region or cloud)
1 immutable/air-gapped copy: Cannot be modified or deleted (ransomware protection)
0 errors: Automated backup verification; no silent corruption
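A backup inventory can be checked against the 3-2-1-1-0 rule mechanically. The following is a minimal sketch; the `Copy` record, its field names, and the rule keys are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    media: str          # e.g. "disk", "cloud", "tape"
    offsite: bool       # stored in a geographically separate location
    immutable: bool     # WORM-locked or air-gapped
    verify_errors: int  # errors reported by the last automated verification
    is_primary: bool = False

def check_32110(copies):
    """Return rule name -> pass/fail for a list of copies (primary included)."""
    backups = [c for c in copies if not c.is_primary]
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in backups),
        "1_immutable": any(c.immutable for c in backups),
        "0_errors": all(c.verify_errors == 0 for c in backups),
    }

if __name__ == "__main__":
    inventory = [
        Copy("disk", offsite=False, immutable=False, verify_errors=0, is_primary=True),
        Copy("disk", offsite=False, immutable=False, verify_errors=0),
        Copy("cloud", offsite=True, immutable=True, verify_errors=0),
    ]
    print(check_32110(inventory))
```

Returning one flag per rule, rather than a single boolean, shows which part of the rule a given system fails.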
BACKUP STORAGE TIERS:
Tier 1 — Hot (Fast Recovery, 0-90 Days):
→ Storage: Local disk (NAS/SAN), cloud standard storage
→ Retention: 30-90 days
→ Recovery time: Minutes to hours
→ Cost: $0.02-$0.10/GB/month
→ Use: Daily operational restore, file recovery, database point-in-time
Tier 2 — Warm (Standard Recovery, 90 Days - 2 Years):
→ Storage: Cloud cool storage (AWS S3 Standard-IA, Azure Cool)
→ Retention: 90 days to 2 years
→ Recovery time: Hours
→ Cost: $0.005-$0.02/GB/month
→ Use: Monthly backup retention, compliance backups
Tier 3 — Cold (Archive Recovery, 2-7 Years):
→ Storage: Cloud archive (AWS Glacier, Azure Archive, GCP Coldline)
→ Retention: 2-7 years
→ Recovery time: 12-48 hours
→ Cost: $0.001-$0.003/GB/month
→ Use: Long-term compliance, legal hold, regulatory archives
Tier 4 — Deep Archive (7+ Years):
→ Storage: AWS Glacier Deep Archive, Azure Archive, physical tape
→ Retention: 7-10+ years
→ Recovery time: 24-72 hours
→ Cost: $0.00099/GB/month (Glacier Deep Archive)
→ Use: Regulatory archives, legal requirements, historical records
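Tier placement and storage cost follow directly from backup age. The sketch below maps an age in days to one of the four tiers and estimates a monthly cost. The per-GB prices are assumed sample values picked from within the ranges listed above, not quotes from any provider, and the 7-year boundary is approximated as 2,555 days.

```python
# (tier name, maximum backup age in days, assumed price in USD per GB-month)
TIERS = [
    ("hot", 90, 0.023),
    ("warm", 730, 0.0125),
    ("cold", 2555, 0.003),
    ("deep-archive", None, 0.00099),  # None = no upper age limit
]

def tier_for_age(age_days):
    """Return (tier name, price per GB-month) for a backup of the given age."""
    for name, max_age, price in TIERS:
        if max_age is None or age_days <= max_age:
            return name, price

def monthly_cost(size_gb, age_days):
    """Estimate the monthly storage cost in USD for one backup set."""
    _, price = tier_for_age(age_days)
    return size_gb * price

if __name__ == "__main__":
    for age in (30, 400, 2000, 4000):
        name, _ = tier_for_age(age)
        print(f"{age:>5} days -> {name:<12} ${monthly_cost(1000, age):.2f}/month per TB")
```

Retrieval and early-deletion fees are not modeled here and can dominate the cost of archive tiers when restores are frequent.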
BACKUP ENCRYPTION:
At Rest:
→ Algorithm: AES-256 (industry standard)
→ Key management: Customer-managed keys (CMK) in KMS/HSM
→ Key rotation: Annual (automated via KMS)
→ Key backup: Keys stored separately from backup data (different region/account)
In Transit:
→ Protocol: TLS 1.2+ (TLS 1.3 preferred)
→ Certificate validation: Server certificate verified
→ Self-managed certificates: issued by and validated against a trusted internal CA
Key Custody:
→ Encryption keys NEVER stored with backup data
→ Break-glass key access: 2-person control; audited access
→ Key destruction: Automated per retention policy; manual for legal holds
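Two of the key-management rules above (annual rotation, and keys never stored with backup data) lend themselves to an automated audit. This is a minimal sketch under stated assumptions: the `KeyRecord` fields and location strings are hypothetical, and a real audit would read them from the KMS and backup inventory.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class KeyRecord:
    key_location: str      # account/region that holds the encryption key
    backup_location: str   # account/region that holds the backup data
    last_rotated: date

def key_policy_findings(rec, today, max_age_days=365):
    """Return a list of policy violations for one key (empty list = compliant)."""
    findings = []
    if (today - rec.last_rotated).days > max_age_days:
        findings.append("rotation overdue")
    if rec.key_location == rec.backup_location:
        findings.append("key stored with backup data")
    return findings

if __name__ == "__main__":
    rec = KeyRecord("acct-keys/eu-west-1", "acct-backup/eu-central-1", date(2024, 1, 10))
    print(key_policy_findings(rec, today=date(2024, 6, 1)))
```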
Recovery Testing
RECOVERY TESTING PROGRAM
==========================
TEST TYPES AND FREQUENCY:
File-Level Restore Test (Monthly):
→ Restore 5-10 random files from backup
→ Verify file integrity (checksum comparison)
→ Measure restore time (actual vs. expected)
→ Test for: individual files, directories, specific dates
→ Participants: Backup administrator
→ Documentation: Screenshot of restored files, integrity verification
Database Restore Test (Quarterly):
→ Restore production database to isolated test environment
→ Verify data integrity (row counts, checksums, referential integrity)
→ Verify point-in-time recovery (restore to specific transaction)
→ Measure restore time (RTO validation)
→ Participants: DBA, application team
→ Documentation: Restore log, integrity test results, RTO measurement
System-Level Restore Test (Quarterly):
→ Restore entire virtual machine or server from backup
→ Verify system boots and services start
→ Verify application functionality
→ Measure RTO from backup to operational system
→ Participants: System administrator, application owner
→ Documentation: System verification checklist, RTO measurement
Full Disaster Recovery Exercise (Annually):
→ Simulate complete site failure (primary data center down)
→ Activate DR site / cloud recovery environment
→ Restore critical systems in priority order
→ Verify end-to-end business functionality
→ Measure actual RTO and RPO vs. targets
→ Participants: Full incident response team (IRT), business stakeholders, IT leadership
→ Documentation: DR test report, lessons learned, improvement plan
Backup Integrity Verification (Continuous):
→ Automated recovery verification (Veeam SureBackup, Commvault data verification)
→ Regular integrity checks (scrubbing, checksum validation)
→ Cloud storage: AWS S3 checksum verification, Azure blob integrity
→ Alert on: integrity failures, silent corruption, incomplete backups
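The checksum comparison used in restore tests can be sketched with the standard library alone. The example assumes a SHA-256 digest of each file was recorded at backup time; the function names are illustrative.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_matches(expected_sha256, restored_path):
    """Return True when the restored file matches the digest recorded at backup time."""
    return sha256_of(restored_path) == expected_sha256.lower()
```

Reading in chunks keeps memory use constant, which matters when verifying multi-gigabyte database files.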
RECOVERY TEST REPORT TEMPLATE:
Test Date: [Date]
Test Type: [File/Database/System/DR Exercise]
Systems Tested: [List of systems/databases]
Backup Source: [Backup type, date, storage location]
Restore Target: [Test environment details]
Results:
→ Restore success: [Yes/No] for each system
→ Data integrity: [Verified/Partial/Failed]
→ Restore time: [Actual duration vs. target RTO]
→ Issues encountered: [List any issues]
Metrics:
→ Data volume restored: [GB/TB]
→ Restore throughput: [GB/hour]
→ RTO achieved: [X hours] vs. target: [Y hours]
→ RPO achieved: [X minutes/hours] vs. target: [Y minutes/hours]
Lessons Learned:
→ What worked well
→ What needs improvement
→ Action items for improvement
Sign-off:
→ Test lead: [Name, signature]
→ Backup administrator: [Name, signature]
→ Business stakeholder: [Name, signature]
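The Metrics part of the report template can be derived from three measurements: data volume, restore duration, and the target RTO. The sketch below is one possible way to compute them; the `RestoreTest` record and the `summarize` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class RestoreTest:
    system: str
    data_gb: float
    restore_hours: float
    target_rto_hours: float

    @property
    def throughput_gb_per_hour(self):
        return self.data_gb / self.restore_hours

    @property
    def rto_met(self):
        return self.restore_hours <= self.target_rto_hours

def summarize(test):
    """Format a one-line result suitable for the Metrics section of the report."""
    status = "met" if test.rto_met else "MISSED"
    return (f"{test.system}: {test.data_gb:.0f} GB in {test.restore_hours:.1f} h "
            f"({test.throughput_gb_per_hour:.0f} GB/hour); RTO {status} "
            f"(target {test.target_rto_hours:.1f} h)")

if __name__ == "__main__":
    print(summarize(RestoreTest("orders-db", 500, 2.5, 4)))
```

Tracking throughput across tests helps predict whether the RTO will still hold as the data volume grows.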
Integration Points
- Veeam Backup & Replication: Virtualization backup leader; VMware/Hyper-V backup; instant VM recovery; cloud tier (AWS/Azure/GCP); immutability; $3,200/year base + per-socket pricing
- Commvault Complete Data Protection: Enterprise-grade backup; 400+ connectors; cloud-native; AI-driven optimization (Command Center); info archive for compliance; $5,000+/year + per-petabyte pricing
- Rubrik: Cloud-native backup platform; immutable backups (RMM); cloud-to-cloud backup (O365, Salesforce); policy-based; self-service recovery; $5,000+/year + per-node pricing
- AWS Backup: Native AWS backup service; centralized backup policy; supports EC2, EBS, RDS, DynamoDB, EFS, FSx; cross-region copy; VSS-aware; pay-per-GB
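- Azure Backup: Native Azure backup; Recovery Services vault; supports Azure VMs, SQL Server in Azure VMs, Azure Files, and on-premises workloads such as SharePoint via MABS/DPM; Microsoft 365 data requires a separate service or third-party tool; cross-region restore; pay-per-GB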
- Druva: Cloud-first backup delivered as SaaS; Phoenix for server and cloud workloads, inSync for SaaS applications (Microsoft 365, Google Workspace, Salesforce) and endpoints; end-user self-service recovery; deduplication; $15/user/year
- Spectra Logic: Tape library and object storage platform; scalable backup target; cloud tiering; air-gapped immutable storage; hybrid cloud support
- Restic / BorgBackup: Open-source backup tools; encrypted; deduplication; CLI-based; suitable for Linux servers and development environments; free
Edge Cases
- Ransomware targeting backups: Modern ransomware specifically targets and encrypts backups; defense: immutable backups (WORM), air-gapped copies, offline tape, separate backup network with no access from production
- Massive data volumes (100+ TB): Deduplication is critical (ratios of 10:1 to 50:1 are achievable, depending on data type and change rate); use synthetic full backups to avoid production impact; run parallel backup streams; a full backup may need to be spread across several off-peak windows
- SaaS application backup (Office 365, Google Workspace, Salesforce): Under the shared responsibility model, the provider's built-in retention is not a backup of customer data; use a third-party cloud backup application with API access; back up user data, SharePoint sites, and CRM records; retain deleted items for 30-90 days
- Database point-in-time recovery (PITR): Requires transaction log backups (SQL Server), WAL archiving (PostgreSQL), binary logs (MySQL); test PITR regularly; verify log chain integrity; RPO determined by log backup frequency
- Cross-region/cross-cloud backup: Data sovereignty requirements (data must stay in-country); backup to same-country region only; consider sovereign cloud options; transfer impact assessments
- Backup window constraints: Production systems cannot tolerate backup I/O impact; solution: CDP (continuous protection), snapshots (sub-second), backup proxy (offload I/O to dedicated server), off-peak scheduling
- Legal hold and eDiscovery: Backup data may be subject to legal hold; held data must not be deleted even when its retention period expires; implement legal hold capability in the backup solution; coordinate with the legal team; flag held data in retention management
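The interaction between retention expiry and legal hold in the last item can be expressed as a single deletion check. This is a minimal sketch; the function name and parameters are assumptions, and a real system would take the hold flag from the legal team's records.

```python
from datetime import date, timedelta

def may_delete(created, retention_days, today, on_legal_hold):
    """A backup is deletable only when retention has expired and no legal hold applies."""
    if on_legal_hold:
        return False  # a hold always overrides the retention policy
    return today >= created + timedelta(days=retention_days)

if __name__ == "__main__":
    print(may_delete(date(2020, 1, 1), 365, date(2024, 1, 1), on_legal_hold=False))
    print(may_delete(date(2020, 1, 1), 365, date(2024, 1, 1), on_legal_hold=True))
```

Checking the hold first ensures that an expired retention period can never trigger deletion of held data.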