IT AI Skill
Cloud Optimization
Manage cloud infrastructure optimization including cost optimization, resource rightsizing, reserved instance management, spot instance utilization, cloud architecture review, multi-cloud strategy, cloud security posture, and FinOps practices. Use when opti...
Cloud Infrastructure Optimization
Minimize cloud costs while maximizing performance, security, and reliability through data-driven optimization.
Cloud Cost Management
FinOps Framework
FINOPS FRAMEWORK:
═════════════════
CLOUD PROVIDERS:
Primary: AWS (65% of workload)
Secondary: Azure (30% of workload)
Tertiary: GCP (5% — specific ML workloads)
Total monthly spend: ~$45K (January 2025)
Annual run rate: ~$540K
COST BREAKDOWN (Monthly — January 2025):
AWS ($29,250):
┌──────────────────────────┬──────────┬──────────┐
│ Service │ Cost │ % of AWS │
├──────────────────────────┼──────────┼──────────┤
│ EC2 (compute) │ $8,500 │ 29.1% │
│ RDS (databases) │ $4,200 │ 14.4% │
│ S3 (storage) │ $3,800 │ 13.0% │
│ Lambda (serverless) │ $2,100 │ 7.2% │
│ EKS (Kubernetes) │ $1,800 │ 6.2% │
│ CloudFront (CDN) │ $1,200 │ 4.1% │
│ ElastiCache (Redis) │ $950 │ 3.2% │
│ Data transfer │ $1,100 │ 3.8% │
│ Monitoring/logging │ $1,400 │ 4.8% │
│ Backup/snapshot │ $650 │ 2.2% │
│ Other │ $3,550 │ 12.1% │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ $29,250│ 100% │
└──────────────────────────┴──────────┴──────────┘
Azure ($13,500):
┌──────────────────────────┬──────────┬──────────┐
│ Service │ Cost │ % of Azure │
├──────────────────────────┼──────────┼──────────┤
│ Virtual Machines │ $4,200 │ 31.1% │
│ Azure SQL │ $2,100 │ 15.6% │
│ Blob Storage │ $1,500 │ 11.1% │
│ App Services │ $1,200 │ 8.9% │
│ AKS (Kubernetes) │ $1,000 │ 7.4% │
│ Data transfer │ $450 │ 3.3% │
│ Other │ $3,050 │ 22.6% │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ $13,500│ 100% │
└──────────────────────────┴──────────┴──────────┘
GCP ($2,250):
Compute Engine: $1,100 (48.9%)
Cloud SQL: $550 (24.4%)
Cloud Storage: $350 (15.6%)
Other: $250 (11.1%)
TOTAL CLOUD SPEND: $45,000/month
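A breakdown like the ones above is easy to sanity-check by summing the line items and comparing the result with the stated total. The sketch below is illustrative only: the `reconcile` helper and its dictionary layout are assumptions, not part of any billing tool, and the figures are the GCP rows from this report.

```python
def reconcile(items: dict[str, int], stated_total: int) -> dict:
    """Sum line items, compare with the stated total, and compute each share."""
    computed_total = sum(items.values())
    shares = {
        name: round(100 * cost / stated_total, 1) for name, cost in items.items()
    }
    return {
        "computed_total": computed_total,
        "difference": stated_total - computed_total,  # non-zero means a mismatch
        "shares_pct": shares,
    }


# GCP rows from the January 2025 breakdown.
gcp_items = {
    "Compute Engine": 1100,
    "Cloud SQL": 550,
    "Cloud Storage": 350,
    "Other": 250,
}
gcp_check = reconcile(gcp_items, stated_total=2250)
print(gcp_check["difference"], gcp_check["shares_pct"])
```

A non-zero `difference` points to a missing or mistyped line item and is worth resolving before the report is circulated.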
Cost trend (6 months):
Aug: $42K → Sep: $44K → Oct: $46K → Nov: $47K → Dec: $45K → Jan: $45K
Status: Stabilized (growth rate: <5% month-over-month)
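The "<5% month-over-month" status can be checked directly from the six-month trend. This is a minimal sketch; the `mom_growth` helper is a hypothetical name, and the input is the trend above in thousands of dollars.

```python
def mom_growth(spend: list[float]) -> list[float]:
    """Return month-over-month growth in percent for consecutive months."""
    return [
        round(100 * (current - previous) / previous, 1)
        for previous, current in zip(spend, spend[1:])
    ]


# Aug through Jan, in $K, taken from the cost trend above.
monthly_spend_k = [42, 44, 46, 47, 45, 45]
growth = mom_growth(monthly_spend_k)
stabilized = all(rate < 5 for rate in growth)
print(growth, stabilized)
```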
FINOPS PILLARS:
1. INFORM (visibility):
- Cost allocation tags: 95% coverage (target: 100%)
- Showback: Department-level cost reports (weekly)
- Chargeback: None (internal — showback only)
- Budget alerts: 80%, 90%, 100% (email + Teams)
- Anomaly detection: AWS Cost Anomaly + Azure Cost Management
2. OPTIMIZE (efficiency):
- Rightsizing: Monthly review (savings: $3,200/month)
- Reserved instances: 65% coverage (savings: $8,500/month)
- Spot instances: 30% of dev/test (savings: $2,100/month)
- Idle resource cleanup: Weekly (savings: $800/month)
- Storage tiering: Auto-archive (savings: $1,200/month)
- Total monthly savings: $15,800 (about 35% of the $45,000 monthly spend)
3. OPERATE (governance):
- Budget approval: Monthly (Finance + IT leadership)
- Spend policy: Max $50K/month (alert at $40K)
- Provisioning approval: >$500/month (manager approval)
- Tagging policy: Mandatory (department, project, environment)
- Compliance: Quarterly review (SOC 2, ISO 27001)
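The budget alert thresholds listed under INFORM (80%, 90%, 100%) can be evaluated with a few lines of code. The sketch below assumes the $50K monthly cap from the spend policy as the budget; `triggered_alerts` is a hypothetical helper, and it uses integer arithmetic to avoid floating-point edge cases at exact thresholds.

```python
def triggered_alerts(
    spend: int, budget: int, thresholds_pct: tuple[int, ...] = (80, 90, 100)
) -> list[int]:
    """Return every alert threshold (in percent of budget) that spend has reached."""
    return [pct for pct in thresholds_pct if spend * 100 >= pct * budget]


# January spend against the $50K policy cap.
alerts = triggered_alerts(spend=45_000, budget=50_000)
print(alerts)
```

With these inputs the 80% threshold corresponds to the $40K alert level named in the spend policy.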
COST ALLOCATION (Department showback):
┌──────────────────────────┬──────────┬──────────┐
│ Department │ Cost │ % of Total│
├──────────────────────────┼──────────┼──────────┤
│ Engineering │ $18,000 │ 40.0% │
│ Data/ML │ $7,500 │ 16.7% │
│ Sales/CRM │ $5,400 │ 12.0% │
│ Marketing │ $3,600 │ 8.0% │
│ Finance/ERP │ $3,150 │ 7.0% │
│ HR │ $1,350 │ 3.0% │
│ Operations │ $2,250 │ 5.0% │
│ Shared/Platform │ $3,750 │ 8.3% │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ $45,000│ 100% │
└──────────────────────────┴──────────┴──────────┘
Reporting: Weekly (department leads) + Monthly (leadership)
Trending: Tracked monthly (YoY comparison)
Budget adherence: 98% on-target (2 departments slightly over — flagged)
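Department showback depends on the mandatory `department` tag. One way to build the report, sketched under the assumption that billing data is available as a list of records with a `cost` and a `tags` mapping (a hypothetical layout, not a specific provider export format):

```python
from collections import defaultdict


def showback(records: list[dict]) -> dict[str, dict]:
    """Aggregate cost per department tag; untagged resources get their own bucket."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        department = record.get("tags", {}).get("department", "Untagged")
        totals[department] += record["cost"]
    grand_total = sum(totals.values())
    return {
        dept: {"cost": cost, "pct": round(100 * cost / grand_total, 1)}
        for dept, cost in totals.items()
    }


# Illustrative records only; resource IDs and costs are made up.
sample_records = [
    {"resource": "i-aaa", "cost": 600, "tags": {"department": "Engineering"}},
    {"resource": "db-bbb", "cost": 300, "tags": {"department": "Data/ML"}},
    {"resource": "vol-ccc", "cost": 100, "tags": {}},
]
report = showback(sample_records)
print(report)
```

Keeping an explicit "Untagged" bucket makes the gap between 95% and 100% tag coverage visible in the same report.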
Resource Optimization
Rightsizing & Efficiency
RESOURCE OPTIMIZATION:
══════════════════════
RIGHTSIZING PROGRAM:
Cadence: Monthly review (automated recommendation + manual approval)
Tools: AWS Compute Optimizer + Azure Advisor
Coverage: 100% of EC2 + Azure VMs (75 instances total)
January 2025 recommendations:
┌──────────────────────────┬──────────┬──────────┐
│ Recommendation │ Count │ Savings │
├──────────────────────────┼──────────┼──────────┤
│ Downsize (over-provisioned)│ 12 │ $1,800 │
│ Upsize (under-provisioned) │ 3 │ $0* │
│ Reserved instance (on-demand)│ 18 │ $4,200 │
│ Spot instance (dev/test) │ 8 │ $950 │
│ Idle (terminate) │ 5 │ $650 │
│ Optimal (no change) │ 29 │ — │
│ ────────────────────── │ ────── │ ─────── │
│ TOTAL │ 75 │ $7,600 │
└──────────────────────────┴──────────┴──────────┘
*Upsize required for performance (cost-neutral or slight increase)
Implementation:
Rightsized: 10/12 (2 deferred — maintenance window)
Reserved: 15/18 (3 deferred — workload uncertainty)
Spot: 6/8 (2 deferred — workload stability concern)
Terminated: 5/5 (all confirmed idle — terminated)
Savings realized: $6,200/month (January)
Savings committed: $4,800/month (reserved instances — 1-year term)
RESERVED INSTANCE MANAGEMENT:
AWS:
RI coverage: 68% (of EC2 + RDS spend)
RI terms: 1-year (55% of RI), 3-year (15% of RI)
RI utilization: 94% (target: >90%) ✓
Savings vs. on-demand: 35-40% (standard), 45-55% (heavy)
Expiring RIs (next 90 days): 5 (renewal planned)
RI optimization: Monthly exchange (instance type, tenancy)
Azure:
Savings Plans coverage: 62% (of compute spend)
Savings Plan terms: 1-year (70%), 3-year (30%)
Utilization: 91% (target: >90%) ✓
Savings vs. on-demand: 25-30% (compute), 15-20% (all compute)
Expiring plans (next 90 days): 3 (renewal planned)
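Coverage and utilization measure different things: coverage is the share of eligible usage that runs under a commitment, while utilization is the share of the purchased commitment that is actually used. The sketch below uses the common hour-based definitions; the function names and the sample hour counts are assumptions chosen to reproduce the AWS percentages above.

```python
def coverage_pct(covered_hours: float, total_eligible_hours: float) -> float:
    """Share of eligible usage hours that ran under a reservation or savings plan."""
    return 100 * covered_hours / total_eligible_hours


def utilization_pct(used_commitment_hours: float, purchased_hours: float) -> float:
    """Share of purchased commitment hours that were actually consumed."""
    return 100 * used_commitment_hours / purchased_hours


# Hypothetical hour counts for illustration.
print(coverage_pct(680, 1000))     # usage covered by commitments
print(utilization_pct(940, 1000))  # commitments actually used
```

Low coverage with high utilization suggests room to buy more commitments; high coverage with low utilization suggests over-commitment.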
SPOT INSTANCE UTILIZATION:
Workloads on spot:
Development environment: 30% of dev instances
Testing/QA: 40% of test instances
CI/CD runners: 60% of runners
Batch processing: 80% of batch jobs
Data analysis: 50% of analysis instances
Spot interruption rate: 3.2% (AWS), 2.8% (Azure)
Interruption handling:
Graceful shutdown: Checkpoint + save within the provider's notice window (2 minutes on AWS; as little as 30 seconds on Azure)
Auto-recovery: Launch replacement (same AZ or different AZ)
Data persistence: EBS/Snapshot (state saved before shutdown)
Impact: Minimal (stateless workloads + checkpoint)
Savings: 60-70% vs. on-demand (spot pricing)
Monthly savings: ~$2,100 (spot optimization)
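The interruption handling described above (poll for a notice, checkpoint, then exit) might look like the following sketch. The notice source is injected as a hypothetical `fetch_notice` callable so the loop can be exercised without cloud access; on AWS such a callable would typically query the instance metadata path `spot/instance-action`, which returns nothing until an interruption is scheduled.

```python
import time
from typing import Callable, Optional


def run_until_interrupted(
    do_work_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
    fetch_notice: Callable[[], Optional[dict]],
    poll_seconds: float = 5.0,
    max_steps: Optional[int] = None,
) -> int:
    """Run work steps until an interruption notice arrives, then checkpoint.

    Returns the number of completed steps.
    """
    steps_done = 0
    while max_steps is None or steps_done < max_steps:
        if fetch_notice() is not None:
            # Persist progress so a replacement instance can resume from here.
            save_checkpoint(steps_done)
            return steps_done
        do_work_step(steps_done)
        steps_done += 1
        time.sleep(poll_seconds)
    return steps_done


# Simulated run: the notice appears on the third poll.
notices = iter([None, None, {"action": "terminate"}])
saved: list[int] = []
completed = run_until_interrupted(
    do_work_step=lambda step: None,
    save_checkpoint=saved.append,
    fetch_notice=lambda: next(notices),
    poll_seconds=0,
)
print(completed, saved)
```

The polling interval should be short relative to the provider's notice window so that the checkpoint has time to finish.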
STORAGE OPTIMIZATION:
AWS S3:
Total storage: 18 TB
┌──────────────────────────┬──────────┬──────────┐
│ Storage Tier │ Volume │ Cost/Mo │
├──────────────────────────┼──────────┼──────────┤
│ S3 Standard │ 8 TB │ $232 │
│ S3 Intelligent-Tiering │ 5 TB │ $165 │
│ S3 Standard-IA │ 3 TB │ $60 │
│ S3 Glacier │ 1.5 TB │ $15 │
│ S3 Glacier Deep Archive │ 0.5 TB │ $3 │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ 18 TB │ $475 │
└──────────────────────────┴──────────┴──────────┘
Lifecycle policies:
30 days: Standard → Intelligent-Tiering
90 days: → Standard-IA
180 days: → Glacier
365 days: → Glacier Deep Archive
730 days: → Delete (if no compliance requirement)
Savings: $1,200/month (vs. all-standard tier)
Compliance: 15 TB (retention override — legal hold)
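The age-based schedule above can be expressed as a simple lookup from object age to target tier. This sketch models the schedule exactly as stated, including the retention override; it is not an S3 lifecycle configuration, and `target_tier` and the `legal_hold` flag are hypothetical names.

```python
# (minimum age in days, target tier), ordered from oldest threshold to newest.
LIFECYCLE_SCHEDULE = [
    (730, "Delete"),
    (365, "Glacier Deep Archive"),
    (180, "Glacier"),
    (90, "Standard-IA"),
    (30, "Intelligent-Tiering"),
]


def target_tier(age_days: int, legal_hold: bool = False) -> str:
    """Return the tier an object of the given age should be in."""
    for minimum_age, tier in LIFECYCLE_SCHEDULE:
        if age_days >= minimum_age:
            if tier == "Delete" and legal_hold:
                continue  # retention override: keep in the coldest tier instead
            return tier
    return "Standard"


for age in (10, 45, 120, 200, 400, 800):
    print(age, target_tier(age))
print(800, target_tier(800, legal_hold=True))
```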
Azure Blob:
Total storage: 8 TB
Tiering: Hot (4 TB), Cool (3 TB), Archive (1 TB)
Lifecycle: Auto-transition (30/90/180 day policy)
Savings: $450/month (vs. all-hot tier)
IDLE RESOURCE DETECTION:
Weekly scan (automated):
Unattached EBS volumes: 3 (250 GB — $25/month)
Unassociated Elastic IPs: 2 ($8/month)
Idle RDS instances: 1 (dev DB — $120/month)
Idle load balancers: 1 ($16/month)
Stopped EC2 instances: 2 ($0 compute; attached EBS volumes still bill)
Unused NAT Gateways: 1 ($32/month)
Total idle cost: $201/month
Action: Terminate/archive (weekly cleanup — auto-approved)
Prevention: Auto-shutdown (off-hours, dev/test)
WASTE REDUCTION SUMMARY:
Monthly waste identified: ~$3,500
Monthly waste eliminated: ~$2,800
Remaining waste: ~$700 (under review)
Waste rate: 7.8% of total spend (target: <5%)
Trend: Improving (March: 12% → Dec: 8% → Jan: 7.8%)
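The waste figures above follow from three inputs: waste identified, waste eliminated, and total spend. A minimal sketch of the arithmetic, with `waste_summary` as a hypothetical helper name:

```python
def waste_summary(identified: int, eliminated: int, total_spend: int) -> dict:
    """Derive remaining waste and rates from the monthly waste figures."""
    return {
        "remaining": identified - eliminated,
        "waste_rate_pct": round(100 * identified / total_spend, 1),
        "elimination_rate_pct": round(100 * eliminated / identified, 1),
    }


january = waste_summary(identified=3_500, eliminated=2_800, total_spend=45_000)
print(january)
```

Note that the waste rate here is based on waste identified, not on what remains after cleanup.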
Cloud Security Posture
CSPM & Governance
CLOUD SECURITY POSTURE MANAGEMENT (CSPM):
═══════════════════════════════════════════
CSPM TOOLS:
AWS: AWS Security Hub + AWS Config + AWS Trusted Advisor
Azure: Microsoft Defender for Cloud
GCP: Security Command Center
Cross-cloud: Wiz (consolidated view)
SECURITY POSTURE SCORE:
AWS: 92/100 (target: ≥90) ✓
Azure: 89/100 (target: ≥90), 1 point short
GCP: 95/100 (target: ≥90) ✓
Overall: 91/100 (improving trend)
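The overall score of 91 is consistent with weighting each provider's score by its workload share (65/30/5). That weighting is an assumption, since the report does not state how the overall figure is derived; the sketch below shows the calculation under it.

```python
def weighted_score(scores: dict[str, int], weights_pct: dict[str, int]) -> float:
    """Average provider scores, weighted by each provider's workload share."""
    total_weight = sum(weights_pct[provider] for provider in scores)
    weighted_sum = sum(scores[p] * weights_pct[p] for p in scores)
    return weighted_sum / total_weight


posture = {"AWS": 92, "Azure": 89, "GCP": 95}
workload_share = {"AWS": 65, "Azure": 30, "GCP": 5}  # percent of workload
overall = weighted_score(posture, workload_share)
print(overall, round(overall))
```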
COMPLIANCE FRAMEWORKS (Cloud):
┌──────────────────────────┬──────────┬──────────┬──────────┐
│ Framework │ AWS │ Azure │ GCP │
├──────────────────────────┼──────────┼──────────┼──────────┤
│ CIS Foundations │ 95% │ 92% │ 97% │
│ SOC 2 │ 100% │ 98% │ 100% │
│ ISO 27001 │ 98% │ 95% │ 99% │
│ NIST 800-53 │ 92% │ 90% │ 95% │
│ PCI DSS (cloud scope) │ 100% │ N/A │ N/A │
│ ────────────────────── │ ────── │ ────── │ ────── │
│ Average │ 97% │ 94% │ 98% │
└──────────────────────────┴──────────┴──────────┴──────────┘
Gap remediation:
Azure CIS gap: 2 findings (network config — remediation in progress)
Azure ISO gap: 3 findings (logging config — remediation in progress)
NIST gap: 5 findings (encryption, access — planned for February)
Target: 95%+ across all frameworks (all providers)
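Per-provider averages need to skip frameworks that are out of scope (the N/A cells) rather than count them as zero. A small sketch, assuming the table is held as a nested dictionary with `None` for N/A; the `provider_average` helper is hypothetical.

```python
from typing import Optional

# Compliance percentages from the framework table; None marks N/A.
framework_results: dict[str, dict[str, Optional[int]]] = {
    "CIS Foundations": {"AWS": 95, "Azure": 92, "GCP": 97},
    "SOC 2": {"AWS": 100, "Azure": 98, "GCP": 100},
    "ISO 27001": {"AWS": 98, "Azure": 95, "GCP": 99},
    "NIST 800-53": {"AWS": 92, "Azure": 90, "GCP": 95},
    "PCI DSS (cloud scope)": {"AWS": 100, "Azure": None, "GCP": None},
}


def provider_average(results: dict, provider: str) -> float:
    """Average a provider's scores over the frameworks that apply to it."""
    values = [row[provider] for row in results.values() if row[provider] is not None]
    return sum(values) / len(values)


for name in ("AWS", "Azure", "GCP"):
    print(name, round(provider_average(framework_results, name)))
```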
CLOUD CONFIGURATION STANDARDS:
Automated enforcement (AWS Config + Azure Policy):
1. Storage encryption: 100% (AES-256, KMS managed)
2. Network encryption: 100% (TLS 1.2+, no unencrypted endpoints)
3. Public access: 0 S3 buckets publicly accessible (enforced)
4. Security groups: No 0.0.0.0/0 on sensitive ports (22, 3389)
5. MFA: Required for all root/admin accounts (enforced)
6. Logging: Enabled for all services (CloudTrail, Azure Activity Log)
7. Tagging: Mandatory tags (department, environment, owner)
8. Backup: Enabled for all production databases (automated)
9. VPC flow logs: Enabled (all VPCs)
10. IAM: No inline policies (managed policies only)
Compliance rate: 96% (automated checks)
Non-compliant: 12 resources (remediation in progress)
Auto-remediation: 8 of 10 rules (auto-correct)
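The security group standard (no internet-wide access to ports 22 and 3389) lends itself to an automated check. The sketch below assumes ingress rules are available as dictionaries with `cidr`, `from_port`, and `to_port` keys; that layout and the helper names are hypothetical, not the format of a specific cloud API.

```python
import ipaddress

SENSITIVE_PORTS = {22, 3389}  # SSH and RDP


def is_world_open(cidr: str) -> bool:
    """True when the CIDR covers the whole address space (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def find_violations(rules: list[dict]) -> list[dict]:
    """Return ingress rules that expose a sensitive port to any address."""
    return [
        rule
        for rule in rules
        if is_world_open(rule["cidr"])
        and any(rule["from_port"] <= port <= rule["to_port"] for port in SENSITIVE_PORTS)
    ]


# Illustrative rules only.
sample_rules = [
    {"cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    {"cidr": "10.0.0.0/8", "from_port": 3389, "to_port": 3389},
    {"cidr": "::/0", "from_port": 3000, "to_port": 4000},
]
violations = find_violations(sample_rules)
print(violations)
```

Port ranges are checked as well as single ports, since a wide range such as 3000-4000 silently includes RDP.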
CLOUD GOVERNANCE:
Guardrails (prevention):
- Max instance type: m5.2xlarge (larger requires approval)
- Max monthly spend per project: $5K (budget alert at $4K)
- Region restriction: us-east-1, us-west-2, eastus, westeurope
- No production deletion: Delete protection (Terraform + cloud)
- No public IP: Private endpoints (NAT, VPC endpoints)
- No root usage: Root account disabled (IAM only)
Monitoring (detection):
- Cost anomaly: AWS Cost Anomaly + Azure Cost Alerts
- Security finding: Security Hub + Defender (real-time)
- Configuration drift: AWS Config + Azure Policy (hourly)
- Usage spike: CloudWatch + Azure Monitor (15-minute)
Review cadence:
Weekly: Cost + security (automated report)
Monthly: Governance review (IT + Finance + Security)
Quarterly: Cloud strategy review (architecture, multi-cloud)
Annually: Cloud provider evaluation (negotiation, migration)
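Several of the guardrails above are simple enough to evaluate before a resource is provisioned. The following sketch covers the region restriction, the $5K per-project cap, and the $500/month approval threshold from the FinOps governance rules; `check_request` and its parameters are hypothetical, and a real deployment would more likely enforce these through provider policy tooling.

```python
APPROVED_REGIONS = {"us-east-1", "us-west-2", "eastus", "westeurope"}
PROJECT_MONTHLY_CAP = 5_000   # max monthly spend per project, in dollars
APPROVAL_THRESHOLD = 500      # monthly cost above which manager approval is needed


def check_request(region: str, monthly_cost: int, current_project_spend: int) -> list[str]:
    """Return the guardrail findings for a provisioning request (empty if clean)."""
    findings = []
    if region not in APPROVED_REGIONS:
        findings.append("region not approved")
    if current_project_spend + monthly_cost > PROJECT_MONTHLY_CAP:
        findings.append("project spend cap exceeded")
    if monthly_cost > APPROVAL_THRESHOLD:
        findings.append("manager approval required")
    return findings


print(check_request("us-east-1", 200, 1_000))
print(check_request("ap-south-1", 800, 4_500))
```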
MULTI-CLOUD STRATEGY:
Workload distribution:
AWS (65%): Core applications, databases, storage, CDN
Azure (30%): Microsoft ecosystem (365, AD, SQL Server), HR/Finance
GCP (5%): ML/AI workloads (specific models — TensorFlow)
Multi-cloud benefits:
- Best-of-breed (per workload)
- Vendor diversification (risk reduction)
- Cost optimization (price comparison)
- Compliance (data residency, regulatory)
Multi-cloud challenges:
- Complexity (multiple consoles, APIs)
- Skill requirements (AWS + Azure + GCP)
- Cost visibility (consolidated billing)
- Security consistency (unified policy)
Mitigation:
- Terraform (unified IaC across providers)
- Wiz (unified security view)
- CloudHealth (unified cost view)
- Training (multi-cloud certification)
Output
Cloud Optimization Dashboard
CLOUD OPTIMIZATION DASHBOARD — Jan 2025
════════════════════════════════════
Cost Overview:
Monthly spend: $45,000
Annual run rate: ~$540K
Cost trend: Stabilized (<5% MoM growth)
Budget adherence: 98% on-target
AWS: $29,250 (65%)
Azure: $13,500 (30%)
GCP: $2,250 (5%)
Optimization:
Monthly savings: $15,800 (about 35% of monthly spend)
RI/SP coverage: 68% (AWS), 62% (Azure)
Spot utilization: 30-60% (dev/test/CI/CD)
Rightsizing: $3,200/month savings
Storage tiering: $1,650/month savings
Waste rate: 7.8% (target: <5%)
Efficiency:
Instance utilization: 55-65% (CPU), 60-70% (memory)
Storage optimization: 3-tier lifecycle (auto)
Idle resources: $201/month (weekly cleanup)
Auto-shutdown: Dev/test (off-hours)
Security:
CSPM score: 91/100 (AWS: 92, Azure: 89, GCP: 95)
Compliance: 95% (CIS), 99% (SOC 2), 97% (ISO 27001)
Config standards: 96% compliant
Non-compliant resources: 12 (remediation in progress)
Governance:
Guardrails: 6 preventive rules, plus 10 config standards (8 auto-remediated)
Budget alerts: 80%, 90%, 100%
Tagging: 95% coverage (target: 100%)
Region: 4 approved regions (all workloads)
Root access: Disabled (IAM only)
Actions:
1. Improve tagging (95% → 100% coverage)
2. Reduce waste rate (7.8% → <5%)
3. Azure CSPM improvement (89 → 90+)
4. RI renewal (5 expiring — next 90 days)
5. Cloud strategy review (quarterly — Q1)
Integration Points
- Cloud providers (AWS, Azure, GCP): Native services, billing, security
- Cost management (CloudHealth, AWS Cost Explorer, Azure Cost Mgmt): Visibility
- CSPM tools (Wiz, Prisma Cloud, Security Hub): Security posture
- IaC tools (Terraform, CloudFormation): Infrastructure provisioning
- FinOps platforms (Cloudability, CloudHealth, Apptio): Cost optimization
- Monitoring (CloudWatch, Azure Monitor, Google Cloud Monitoring): Resource metrics
- SIEM (Sentinel, Splunk): Cloud security logs
- ITSM (ServiceNow): Change management, cost approval
- Budgeting/Finance (NetSuite, QuickBooks): Chargeback/showback
- Configuration management (AWS Config, Azure Policy): Compliance
- Tagging/asset tools: Resource categorization, cost allocation
- Communication (Teams, Slack): Budget alerts, anomaly notifications
Edge Cases
- Cost spike (unexpected): Budget alert; anomaly detection; root cause (resource leak, misconfiguration); immediate action
- Reserved instance expiry: Renewal planning; workload reassessment; term optimization; savings projection
- Spot interruption (production-adjacent): Graceful shutdown; checkpoint; failover; workload redesign
- Cloud provider outage: Multi-region failover; DNS failover; backup activation; customer communication
- Compliance finding (CSPM): Auto-remediation; manual review; exception request; policy update
- Budget overrun (department): Showback report; department notification; spending freeze; approval override
- Region restriction (new compliance): Workload migration; data transfer; DNS update; validation
- Storage growth (unbounded): Lifecycle policy; alert threshold; cleanup; archival
- Multi-cloud complexity: Unified tooling; skill development; governance consistency
- Vendor lock-in concern: Abstraction layer; portable architecture; exit strategy; negotiation leverage
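For the unexpected cost spike case, a first-pass detector can compare the latest daily cost with a trailing baseline. The sketch below uses a mean-plus-three-standard-deviations rule, which is a generic technique chosen for illustration and not how the managed anomaly detection services named in this document work; the daily figures are made up.

```python
from statistics import mean, stdev


def is_cost_spike(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    """Flag today's cost if it exceeds the trailing mean by `sigmas` deviations."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    baseline = mean(history)
    spread = stdev(history)
    return today > baseline + sigmas * spread


# Hypothetical trailing week of daily spend, in dollars.
last_week = [1480, 1500, 1520, 1490, 1510, 1500, 1505]
print(is_cost_spike(last_week, today=2400))
print(is_cost_spike(last_week, today=1515))
```

A flagged day is only a trigger for the root cause steps listed above (resource leak, misconfiguration), not a diagnosis in itself.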