IT AI Skill
Infrastructure As Code
Manage cloud and on-premise infrastructure through code using Terraform, CloudFormation, Ansible, and similar tools for reproducible, version-controlled infrastructure deployment. Use when creating IaC templates, managing cloud provisioning, implementing Gi...
Infrastructure as Code (IaC)
Manage and provision cloud and on-premise infrastructure through version-controlled code using Terraform, CloudFormation, Ansible, and similar tools for reproducible, auditable, and automated infrastructure deployment.
Workflow
Phase 1: IaC Framework Design
- Tool selection and architecture:
  - Provisioning: Terraform (multi-cloud), AWS CloudFormation (AWS-native), Azure Bicep
  - Configuration management: Ansible (agentless), Chef, Puppet
  - Container orchestration: Kubernetes manifests, Helm charts
  - CI/CD integration: GitHub Actions, GitLab CI, Jenkins
  - State management: remote backends (S3 + DynamoDB, Terraform Cloud)
- Module library development:
  - Reusable modules for common resources (VPC, ECS, RDS, S3)
  - Standardized naming conventions and tagging strategy
  - Versioned module registry (private Terraform registry)
  - Documentation per module (inputs, outputs, dependencies)
- Environment strategy:
  - Development, staging, production (environment isolation)
  - Environment parity (same code, different variable values)
  - Promotion workflow (dev → staging → production)
  - Disaster recovery environment provisioning
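
The environment parity idea (same code, different variable values) can be sketched as a root module that calls one shared module and takes every environment-specific value from variables. This is a minimal sketch: the module path, the variable names, and the assumption that the shared module accepts `environment` and `vpc_cidr` inputs are all hypothetical.

```hcl
# environments/dev/main.tf (hypothetical root module)
# The same file can be reused in staging/ and production/;
# only the values in terraform.tfvars differ per environment.

variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for this environment's VPC"
  type        = string
}

module "network" {
  # Assumed relative path to a shared module; a private registry
  # source with a pinned version would be used the same way.
  source = "../../modules/network"

  # Assumed input names; the real module defines its own interface.
  environment = var.environment
  vpc_cidr    = var.vpc_cidr
}
```

With this layout, `environments/dev/terraform.tfvars` would set `environment = "dev"` and a dev-only CIDR, and promotion to staging or production changes only the variable values, not the module code.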
Phase 2: Implementation & Automation
- Code development standards:
  - Code style and linting (terraform fmt, tflint, checkov)
  - Security scanning (tfsec, checkov, Prisma Cloud)
  - Cost estimation (infracost, AWS Pricing Calculator integration)
  - Peer review process (pull request workflow)
- CI/CD pipeline for infrastructure:
  - Plan phase: generate and validate infrastructure plan
  - Review phase: human approval for production changes
  - Apply phase: execute plan with error handling
  - Post-deployment: validation tests, smoke tests
  - Rollback: automated rollback on failure (re-apply the last known-good configuration; Terraform has no native rollback command)
- State management and drift detection:
  - Remote state storage with locking
  - State versioning and backup
  - Automated drift detection (scheduled plan runs)
  - Drift remediation workflow (auto-remediate or alert)
Phase 3: Governance & Operations
- Security and compliance:
  - Policy-as-code (Open Policy Agent, Sentinel, Conftest)
  - Compliance scanning (CIS benchmarks, SOC 2 controls)
  - Secret management (HashiCorp Vault, AWS Secrets Manager)
  - Access control (least privilege, role-based)
- Cost management:
  - Cost tagging strategy
  - Budget alerts and cost anomaly detection
  - Right-sizing recommendations
  - Reserved instance and savings plan optimization
- Monitoring and observability:
  - Infrastructure health monitoring
  - Change tracking and audit trail
  - Performance metrics and alerting
  - Incident response integration
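
As an illustration of the secret management point, the sketch below reads a database password from AWS Secrets Manager instead of hardcoding it. It assumes a configured AWS provider and an existing secret; the secret name, resource identifiers, and instance sizing are hypothetical.

```hcl
# Assumes a secret with this (hypothetical) name already exists
# in AWS Secrets Manager and holds the password as a plain string.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/orders/db-master-password"
}

variable "db_username" {
  description = "Master username for the database"
  type        = string
  sensitive   = true # redacted in plan and apply output
}

resource "aws_db_instance" "orders" {
  # Illustrative values following {environment}-{service}-{component}-{unique-id}
  identifier        = "prod-orders-db-01"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  storage_encrypted = true

  username = var.db_username
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  final_snapshot_identifier = "prod-orders-db-01-final"
}
```

The secret value never appears in the repository, but Terraform still records it in the state file, which is one reason the state backend should be encrypted and access-controlled.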
Templates
IaC Pipeline Configuration
INFRASTRUCTURE AS CODE — CI/CD Pipeline
=========================================
Version: [2.3] | Platform: [GitHub Actions + Terraform]
PIPELINE STAGES:
┌────────────────────────────────────────────────────────────┐
│ 1. CODE COMMIT │
│ • Developer pushes IaC changes to feature branch │
│ • Git hooks: terraform fmt (auto-format) │
│ • Branch protection: required reviews for infra changes │
├────────────────────────────────────────────────────────────┤
│ 2. PRE-CI VALIDATION │
│ • terraform validate — syntax check │
│ • tflint — best practices enforcement │
│ • terraform fmt -check — formatting check │
│ • Failed? → Block PR with specific error messages │
├────────────────────────────────────────────────────────────┤
│ 3. SECURITY SCAN │
│ • checkov — policy compliance (CIS, SOC 2) │
│ • tfsec — security vulnerability detection │
│ • Failed? → Block PR with security findings report │
│ • Warning? → Flag for review (non-blocking) │
├────────────────────────────────────────────────────────────┤
│ 4. COST ESTIMATION │
│ • infracost diff — cost delta analysis │
│ • Comment on PR with cost impact: │
│ "This change adds ~$340/month (EC2: +$200, RDS: +$140)"│
│ • Cost increase > $1000/month? → Require manager review │
├────────────────────────────────────────────────────────────┤
│ 5. PLAN (dev environment) │
│ • terraform plan — dev variables │
│ • Generate execution plan (add/mod/destroy) │
│ • Comment on PR with plan summary │
│ • Dev auto-apply (if approved by code owner) │
├────────────────────────────────────────────────────────────┤
│ 6. REVIEW & APPROVAL │
│ • Required: 1 code owner approval │
│ • Required: 1 security team approval (if prod change) │
│ • Required: cost approval (if > $1000/month increase) │
│ • Plan output visible for manual review │
├────────────────────────────────────────────────────────────┤
│ 7. APPLY (staging environment) │
│ • terraform apply — staging variables (on merge) │
│ • Post-deploy validation tests │
│ • Smoke tests against staging infrastructure │
│ • Failed? → Auto-rollback + alert │
├────────────────────────────────────────────────────────────┤
│ 8. APPLY (production) │
│ • Manual trigger (after staging validation passes) │
│ • terraform apply — production variables │
│ • Change window enforcement (no prod changes 22:00-06:00)│
│ • Post-deploy validation + monitoring alert │
│ • Slack notification to #infra-changes channel │
└────────────────────────────────────────────────────────────┘
STATE MANAGEMENT:
Backend: AWS S3 + DynamoDB (state locking)
State file encryption: AES-256 (S3 server-side encryption)
State backup: Daily snapshot to separate S3 bucket
State retention: 90 days (versioned)
Access: IAM role-based (dev: read-only, deployer: read-write)
DRIFT DETECTION:
Schedule: Every 6 hours (cron job)
Action: terraform plan → diff detection
Threshold: Any unplanned change = alert
Notification: Slack #infra-drift channel + PagerDuty (if prod)
Remediation: Auto-remediate (if safe) or create Jira ticket
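
The state management settings above (S3 backend, DynamoDB locking, server-side encryption) could be expressed in a `backend.tf` along these lines. This is a sketch only: the bucket name, lock table name, and region are hypothetical placeholders.

```hcl
# environments/production/backend.tf (illustrative names)
terraform {
  backend "s3" {
    # Hypothetical bucket; versioning on this bucket provides the
    # state history described under "State retention".
    bucket = "example-org-terraform-state"

    # Follows the {environment}-{service}-terraform.tfstate convention.
    key = "production-network-terraform.tfstate"

    # Assumed region.
    region = "us-east-1"

    # Store the state object with S3 server-side encryption.
    encrypt = true

    # Hypothetical DynamoDB table used for state locking; it needs a
    # string partition key named "LockID".
    dynamodb_table = "terraform-state-locks"
  }
}
```

Running `terraform init` in the environment directory configures this backend. Newer Terraform releases also support S3-native locking through the `use_lockfile` setting, so it is worth checking the backend documentation for the version in use.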
Terraform Module Structure
IaC MODULE LIBRARY — Standard Structure
=========================================
Repository layout:
infrastructure/
├── modules/                     # Reusable modules
│   ├── network/                 # VPC, subnets, security groups
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── README.md
│   ├── compute/                 # EC2, ECS, EKS
│   ├── database/                # RDS, ElastiCache, DynamoDB
│   ├── storage/                 # S3, EBS, EFS
│   ├── security/                # IAM, KMS, WAF, Shield
│   ├── monitoring/              # CloudWatch, Prometheus, Grafana
│   └── serverless/              # Lambda, API Gateway, EventBridge
│
├── environments/                # Environment configurations
│   ├── dev/
│   │   ├── main.tf              # Root module (calls child modules)
│   │   ├── variables.tf
│   │   ├── terraform.tfvars     # Environment-specific values
│   │   └── backend.tf           # State backend config
│   ├── staging/
│   └── production/
│
├── policies/                    # Policy-as-code
│   ├── opa/                     # Open Policy Agent policies
│   └── sentinel/                # Terraform Enterprise policies
│
├── scripts/                     # Helper scripts
│   ├── validate.sh
│   ├── cost-estimate.sh
│   └── drift-detect.sh
│
├── .github/                     # CI/CD workflows
│   └── workflows/
│       ├── validate.yml
│       ├── plan.yml
│       └── apply.yml
│
└── docs/                        # Documentation
    ├── architecture.md
    ├── module-catalog.md
    └── runbooks/
MODULE DESIGN STANDARDS:
Every module must include:
✓ main.tf — resource definitions
✓ variables.tf — input variables with descriptions and types
✓ outputs.tf — output values with descriptions
✓ versions.tf — provider and terraform version constraints
✓ README.md — usage examples, inputs/outputs table, dependencies
✓ examples/ — working example configurations
Naming conventions:
Resources: {environment}-{service}-{component}-{unique-id}
Tags: Environment, Service, Owner, CostCenter, ManagedBy (IaC)
State files: {environment}-{service}-terraform.tfstate
Best practices:
✓ Use explicit versions for all providers and modules
✓ Never hardcode secrets (inject them from a secret manager or environment variables; keep any var files that hold secrets out of version control)
✓ Use depends_on sparingly (prefer implicit dependencies)
✓ Use count/for_each instead of duplicating code
✓ Add lifecycle rules (create_before_destroy, prevent_destroy)
✓ Document all assumptions and known limitations
✓ Test modules in dev before staging/production use
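
A compact sketch of a module that follows these standards is shown below: pinned versions, typed and documented inputs, the naming and tagging conventions, `for_each` instead of duplicated resources, and a lifecycle rule. The version constraints, variable names, and the choice of S3 buckets as the example resource are assumptions for illustration, and the four files are shown in one block for brevity.

```hcl
# versions.tf: explicit Terraform and provider constraints (assumed versions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# variables.tf: every input has a description and a type
variable "environment" {
  description = "Environment name used in resource names and tags"
  type        = string
}

variable "service" {
  description = "Service that owns these resources"
  type        = string
}

variable "owner" {
  description = "Team or person responsible for the resources"
  type        = string
}

variable "cost_center" {
  description = "Cost center used for cost allocation tags"
  type        = string
}

variable "unique_id" {
  description = "Suffix that keeps resource names unique"
  type        = string
}

variable "components" {
  description = "Component names; one bucket is created per entry"
  type        = set(string)
}

# main.tf: resource definitions
locals {
  # Tag set from the naming conventions above.
  common_tags = {
    Environment = var.environment
    Service     = var.service
    Owner       = var.owner
    CostCenter  = var.cost_center
    ManagedBy   = "IaC"
  }
}

resource "aws_s3_bucket" "this" {
  # for_each creates one bucket per component without copy-pasted blocks.
  for_each = var.components

  # {environment}-{service}-{component}-{unique-id}
  bucket = "${var.environment}-${var.service}-${each.key}-${var.unique_id}"
  tags   = local.common_tags

  lifecycle {
    # Any plan that would destroy this bucket fails instead.
    prevent_destroy = true
  }
}

# outputs.tf: documented outputs
output "bucket_arns" {
  description = "ARNs of the created buckets, keyed by component name"
  value       = { for name, bucket in aws_s3_bucket.this : name => bucket.arn }
}
```

Keying the resources by component name with `for_each` means that adding or removing one component touches only that bucket, whereas `count` would shift indexes and could force unrelated changes.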
Integration Points
- IaC tools: Terraform, AWS CloudFormation, Azure Bicep, Pulumi
- Configuration management: Ansible, Chef, Puppet, SaltStack
- CI/CD platforms: GitHub Actions, GitLab CI (with GitLab Runner), Jenkins
- Cloud platforms: AWS, Azure, GCP, OCI, multi-cloud
- Container orchestration: Kubernetes, Docker Swarm, ECS
- Secret management: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
- Policy engines: Open Policy Agent (OPA), HashiCorp Sentinel, Conftest
- Security scanning: checkov, tfsec, Prisma Cloud, Snyk IaC
- Cost management: Infracost, CloudHealth, Cloudability, Spot by NetApp
- State backends: AWS S3, Terraform Cloud, GCS, Azure Blob Storage
Edge Cases
| Scenario | Handling |
|----------|----------|
| Terraform state corruption | Restore from versioned backup; manual state fix if needed; post-mortem |
| Drift detected in production | Assess impact; auto-remediate if safe; manual intervention if risky; root cause analysis |
| Module breaking change affects multiple environments | Version pinning; staged migration; backward compatibility testing |
| CI/CD pipeline fails during production apply | Auto-rollback to previous state; incident response; manual recovery if needed |
| Cost estimate shows unexpected increase | Block deployment; notify team; investigate root cause; optimize or approve |
| Provider API changes break existing configurations | Provider version pinning; upgrade testing in dev; scheduled provider updates |
| Multi-region deployment with state conflicts | Separate state files per region; remote state data sources for cross-region refs |
| Emergency change needed (bypass CI/CD) | Break-glass procedure: manual apply with extra approval; post-hoc PR within 24h |
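
For the multi-region case, a remote state data source lets one region's configuration read outputs from another region's separate state file. The sketch below assumes an S3 backend, hypothetical bucket and key names, and that the referenced state exposes an output named `vpc_id`.

```hcl
# Read-only view of the us-east-1 network state (names are hypothetical).
data "terraform_remote_state" "network_us_east_1" {
  backend = "s3"

  config = {
    bucket = "example-org-terraform-state"
    key    = "us-east-1/production-network-terraform.tfstate"
    region = "us-east-1"
  }
}

# Aliased provider for the region this configuration manages.
provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"
}

# Example consumer: publish the peer region's VPC ID in eu-west-1.
# Assumes the remote state defines an output called "vpc_id".
resource "aws_ssm_parameter" "peer_vpc_id" {
  provider = aws.eu_west_1

  name  = "/production/network/us-east-1/vpc-id"
  type  = "String"
  value = data.terraform_remote_state.network_us_east_1.outputs.vpc_id
}
```

Because each region keeps its own state file and lock, applies in one region do not contend with the other; the data source only reads the root-level outputs of the remote state.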
Output
Infrastructure Dashboard
INFRASTRUCTURE AS CODE — Operations Dashboard
================================================
As of: 2025-01-15
INFRASTRUCTURE OVERVIEW:
Total managed resources: 2,847
Environments: 3 (dev, staging, production)
Cloud providers: 2 (AWS, GCP)
IaC coverage: 94.2% (267 resources managed, 16 manual)
Last full reconciliation: 2025-01-14
RECENT CHANGES (Last 7 Days):
Pull requests merged: 12
Resources added: 34 | Modified: 89 | Destroyed: 7
Environments affected: dev (8), staging (3), production (1)
Avg deployment time: 8.2 minutes
Failed deployments: 1 (auto-rolled back successfully)
DRIFT STATUS:
Last drift scan: 2025-01-15 12:00 UTC (6h ago)
Drift detected: 3 resources
1. prod-ec2-web-server-03 — Security group modified (manual change) → Auto-remediated
2. staging-rds-analytics — Parameter group changed → Ticket created
3. dev-s3-logs — Lifecycle rule missing → Auto-remediated
Drift rate: 0.1% (3/2,847 resources) — within acceptable threshold
COST OVERVIEW (This Month):
AWS: $42,340/month (IaC-managed: 96%)
GCP: $18,720/month (IaC-managed: 91%)
─────────────────────────────
Total: $61,060/month
vs. Budget: $65,000/month → 93.9% utilization ✓
vs. Last month: +$2,100 (+3.5%) — within 5% variance ✓
TOP COST DRIVERS:
1. EC2 instances: $18,400/month (30.1%)
2. RDS databases: $12,800/month (21.0%)
3. S3 storage: $8,200/month (13.4%)
4. EKS clusters: $7,600/month (12.4%)
5. Data transfer: $5,400/month (8.8%)
COMPLIANCE & SECURITY:
Policy violations: 0 (all resources compliant)
CIS benchmark score: 94/100 (4 minor findings — non-critical)
Secrets exposed: 0
Unencrypted resources: 0
Over-privileged IAM roles: 2 (remediation tickets open)
CI/CD PIPELINE HEALTH:
Total runs (last 30 days): 156
Success rate: 97.4% (152/156)
Avg plan time: 3.2 minutes | Avg apply time: 5.8 minutes
Mean time to deploy: 9.0 minutes
Rollback rate: 0.6% (1/156)
IMPROVEMENT RECOMMENDATIONS:
1. Migrate 16 manual resources to IaC (target: 100% coverage)
2. Enable auto-remediation for 80% of drift cases (current: 67%)
3. Implement cost guardrails for dev environment (overspending 15%)
4. Schedule quarterly provider version updates (currently 2 versions behind)
5. Add chaos engineering tests to staging pipeline
MODULE USAGE ANALYTICS:
┌────────────────────┬──────────┬─────────────┬──────────────┐
│ Module │ Versions │ Envs Used │ Last Updated │
├────────────────────┼──────────┼─────────────┼──────────────┤
│ network/vpc │ 4.2.1 │ 3/3 │ 2025-01-10 │
│ compute/ecs │ 3.1.0 │ 3/3 │ 2025-01-08 │
│ database/rds │ 3.8.2 │ 3/3 │ 2025-01-12 │
│ storage/s3 │ 2.5.0 │ 3/3 │ 2024-12-20 │
│ security/iam │ 5.0.1 │ 3/3 │ 2025-01-14 │
│ monitoring/cw │ 2.1.3 │ 2/3 │ 2025-01-05 │
│ serverless/lambda │ 1.9.0 │ 2/3 │ 2024-12-28 │
└────────────────────┴──────────┴─────────────┴──────────────┘