---
name: infrastructure-as-code
description: "Manage cloud and on-premise infrastructure through code using Terraform, CloudFormation, Ansible, and similar tools for reproducible, version-controlled infrastructure deployment. Use when creating IaC templates, managing cloud provisioning, implementing GitOps workflows, or automating infrastructure changes. Triggers on phrases like 'infrastructure as code', 'IaC', 'Terraform', 'CloudFormation', 'Ansible', 'GitOps', 'immutable infrastructure', 'provisioning automation', 'state management', 'blue-green deployment', 'environment parity', 'drift detection'."
---

# Infrastructure as Code (IaC)

Manage and provision cloud and on-premise infrastructure through version-controlled code using Terraform, CloudFormation, Ansible, and similar tools for reproducible, auditable, and automated infrastructure deployment.

## Workflow

### Phase 1: IaC Framework Design

1. **Tool selection and architecture**:
   - Provisioning: Terraform (multi-cloud), AWS CloudFormation (AWS-native), Azure Bicep
   - Configuration management: Ansible (agentless), Chef, Puppet
   - Container orchestration: Kubernetes manifests, Helm charts
   - CI/CD integration: GitHub Actions, GitLab CI, Jenkins
   - State management: remote backends (S3 + DynamoDB, Terraform Cloud)
2. **Module library development**:
   - Reusable modules for common resources (VPC, ECS, RDS, S3)
   - Standardized naming conventions and tagging strategy
   - Versioned module registry (private Terraform registry)
   - Documentation per module (inputs, outputs, dependencies)
3. **Environment strategy**:
   - Development, staging, production (environment isolation)
   - Environment parity (same code, different variable values)
   - Promotion workflow (dev → staging → production)
   - Disaster recovery environment provisioning
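
Environment parity, as listed above, means every environment sets the same variables and differs only in their values. The following is a minimal Python sketch of that check, assuming the per-environment values have already been loaded into plain dictionaries; the function name `parity_gaps` and the sample variables are hypothetical and not part of any tool.

```python
def parity_gaps(environments: dict[str, dict[str, object]]) -> dict[str, set[str]]:
    """Return, per environment, the variable names it fails to define.

    Parity here means every environment defines the same variable names;
    the values themselves are expected to differ.
    """
    all_names: set[str] = set()
    for variables in environments.values():
        all_names.update(variables)

    gaps: dict[str, set[str]] = {}
    for env, variables in environments.items():
        missing = all_names.difference(variables)
        if missing:
            gaps[env] = missing
    return gaps


# Hypothetical variable sets, e.g. parsed from each environment's tfvars file.
envs = {
    "dev": {"instance_type": "t3.small", "db_multi_az": False},
    "staging": {"instance_type": "t3.medium", "db_multi_az": True},
    "production": {"instance_type": "m5.large"},
}
print(parity_gaps(envs))
```

An empty result suggests the environments are structurally aligned; any entry names the variables an environment would silently default or fail on.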

### Phase 2: Implementation & Automation

1. **Code development standards**:
   - Code style and linting (terraform fmt, tflint, checkov)
   - Security scanning (tfsec, checkov, Prisma Cloud)
   - Cost estimation (infracost, AWS Pricing Calculator integration)
   - Peer review process (pull request workflow)
2. **CI/CD pipeline for infrastructure**:
   - Plan phase: generate and validate infrastructure plan
   - Review phase: human approval for production changes
   - Apply phase: execute plan with error handling
   - Post-deployment: validation tests, smoke tests
   - Rollback: re-apply the last known-good configuration on failure (Terraform has no built-in rollback)
3. **State management and drift detection**:
   - Remote state storage with locking
   - State versioning and backup
   - Automated drift detection (scheduled plan runs)
   - Drift remediation workflow (auto-remediate or alert)
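
A scheduled plan run can report drift through its exit status. With `-detailed-exitcode`, `terraform plan` exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below wraps that behaviour; the function names and status strings are illustrative, and it assumes the `terraform` binary is on `PATH` and the working directory has already been initialised.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
EXIT_MEANING = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a drift status; unknown codes count as errors."""
    return EXIT_MEANING.get(code, "error")


def scan_for_drift(workdir: str) -> str:
    """Run a non-interactive plan in `workdir` and return the drift status.

    Assumes `terraform init` has already been run in that directory.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduler could call `scan_for_drift("environments/production")` and raise an alert whenever the result is not `"in_sync"`.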

### Phase 3: Governance & Operations

1. **Security and compliance**:
   - Policy-as-code (Open Policy Agent, Sentinel, Conftest)
   - Compliance scanning (CIS benchmarks, SOC 2 controls)
   - Secret management (HashiCorp Vault, AWS Secrets Manager)
   - Access control (least privilege, role-based)
2. **Cost management**:
   - Cost tagging strategy
   - Budget alerts and cost anomaly detection
   - Right-sizing recommendations
   - Reserved instance and savings plan optimization
3. **Monitoring and observability**:
   - Infrastructure health monitoring
   - Change tracking and audit trail
   - Performance metrics and alerting
   - Incident response integration
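
Policy-as-code rules are normally written for the engines named above (OPA, Sentinel, Conftest). As a language-neutral illustration of what such a rule checks, here is a Python stand-in that inspects the JSON produced by `terraform show -json <planfile>` and reports planned resources that lack the required tags. The tag list mirrors the tagging strategy in this document; the function name and sample plan are hypothetical, and the sketch assumes taggable resources expose a `tags` attribute, as AWS provider resources typically do.

```python
REQUIRED_TAGS = {"Environment", "Service", "Owner", "CostCenter", "ManagedBy"}


def missing_tags(plan: dict) -> dict[str, list[str]]:
    """Return {resource address: sorted missing tag names} for planned resources."""
    findings: dict[str, list[str]] = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        # Skip deletions (after is null) and resources with no tags attribute.
        if not after or "tags" not in after:
            continue
        tags = after["tags"] or {}
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            findings[rc["address"]] = missing
    return findings


# Hypothetical, heavily trimmed plan JSON.
plan = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket.logs",
            "change": {
                "actions": ["create"],
                "after": {"tags": {"Environment": "dev", "Owner": "platform"}},
            },
        },
        {
            "address": "aws_s3_bucket.old",
            "change": {"actions": ["delete"], "after": None},
        },
    ]
}
print(missing_tags(plan))
```

In a pipeline, a non-empty result would fail the security or compliance stage; a real deployment would more likely express the same rule in Rego or Sentinel.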

## Templates

### IaC Pipeline Configuration

```
INFRASTRUCTURE AS CODE — CI/CD Pipeline
=========================================
Version: [2.3] | Platform: [GitHub Actions + Terraform]

PIPELINE STAGES:
  ┌────────────────────────────────────────────────────────────┐
  │ 1. CODE COMMIT                                            │
  │    • Developer pushes IaC changes to feature branch       │
  │    • Git hooks: terraform fmt (auto-format)                │
  │    • Branch protection: required reviews for infra changes │
  ├────────────────────────────────────────────────────────────┤
  │ 2. PRE-CI VALIDATION                                     │
  │    • terraform validate: syntax and consistency check      │
  │    • tflint: best practices enforcement                    │
  │    • terraform fmt -check: formatting check                │
  │    • Failed? → Block PR with specific error messages       │
  ├────────────────────────────────────────────────────────────┤
  │ 3. SECURITY SCAN                                         │
  │    • checkov — policy compliance (CIS, SOC 2)              │
  │    • tfsec — security vulnerability detection              │
  │    • Failed? → Block PR with security findings report      │
  │    • Warning? → Flag for review (non-blocking)             │
  ├────────────────────────────────────────────────────────────┤
  │ 4. COST ESTIMATION                                       │
  │    • infracost diff: cost delta analysis                   │
  │    • Comment on PR with cost impact:                       │
  │      "This change adds ~$340/month (EC2: +$200, RDS: +$140)"│
  │    • Cost increase > $1000/month? → Require manager review │
  ├────────────────────────────────────────────────────────────┤
  │ 5. PLAN (dev environment)                                │
  │    • terraform plan — dev variables                        │
  │    • Generate execution plan (add/mod/destroy)             │
  │    • Comment on PR with plan summary                       │
  │    • Dev auto-apply (if approved by code owner)            │
  ├────────────────────────────────────────────────────────────┤
  │ 6. REVIEW & APPROVAL                                     │
  │    • Required: 1 code owner approval                       │
  │    • Required: 1 security team approval (if prod change)   │
  │    • Required: cost approval (if > $1000/month increase)   │
  │    • Plan output visible for manual review                  │
  ├────────────────────────────────────────────────────────────┤
  │ 7. APPLY (staging environment)                            │
  │    • terraform apply — staging variables (on merge)        │
  │    • Post-deploy validation tests                          │
  │    • Smoke tests against staging infrastructure            │
  │    • Failed? → Auto-rollback + alert                       │
  ├────────────────────────────────────────────────────────────┤
  │ 8. APPLY (production)                                    │
  │    • Manual trigger (after staging validation passes)      │
  │    • terraform apply — production variables                │
  │    • Change window enforcement (no prod changes 22:00-06:00)│
  │    • Post-deploy validation + monitoring alert              │
  │    • Slack notification to #infra-changes channel          │
  └────────────────────────────────────────────────────────────┘

STATE MANAGEMENT:
  Backend: AWS S3 + DynamoDB (state locking)
  State file encryption: AES-256 (S3 server-side encryption)
  State backup: Daily snapshot to separate S3 bucket
  State retention: 90 days (versioned)
  Access: IAM role-based (dev: read-only, deployer: read-write)

DRIFT DETECTION:
  Schedule: Every 6 hours (cron job)
  Action: terraform plan → diff detection
  Threshold: Any unplanned change = alert
  Notification: Slack #infra-drift channel + PagerDuty (if prod)
  Remediation: Auto-remediate (if safe) or create Jira ticket
```
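
The approval rules in stage 6 and the change window in stage 8 can be expressed as small pure functions. This is a sketch under stated assumptions: the approval labels are hypothetical, the window is treated as inclusive of 22:00 and exclusive of 06:00, and the caller is assumed to pass the time in whatever timezone the team uses, since the template does not name one.

```python
from datetime import time

COST_APPROVAL_THRESHOLD = 1000.0  # USD per month, from stages 4 and 6
FREEZE_START, FREEZE_END = time(22, 0), time(6, 0)  # stage 8 change window


def required_approvals(environment: str, monthly_cost_delta: float) -> list[str]:
    """List the approvals a change needs before it may be applied."""
    approvals = ["code-owner"]
    if environment == "production":
        approvals.append("security-team")
    if monthly_cost_delta > COST_APPROVAL_THRESHOLD:
        approvals.append("cost")
    return approvals


def in_freeze_window(now: time) -> bool:
    """True when `now` falls in the overnight window, which wraps past midnight."""
    return now >= FREEZE_START or now < FREEZE_END


def may_apply(environment: str, now: time) -> bool:
    """Only production is subject to the change window."""
    return not (environment == "production" and in_freeze_window(now))


print(required_approvals("production", 1500.0))
print(may_apply("production", time(23, 30)))
```

Because the window crosses midnight, the check uses `or` rather than a simple range comparison; a range test with `and` would never match.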

### Terraform Module Structure

```
IaC MODULE LIBRARY — Standard Structure
=========================================

Repository layout:
  infrastructure/
  ├── modules/                    # Reusable modules
  │   ├── network/               # VPC, subnets, security groups
  │   │   ├── main.tf
  │   │   ├── variables.tf
  │   │   ├── outputs.tf
  │   │   ├── versions.tf
  │   │   └── README.md
  │   ├── compute/               # EC2, ECS, EKS
  │   ├── database/              # RDS, ElastiCache, DynamoDB
  │   ├── storage/               # S3, EBS, EFS
  │   ├── security/              # IAM, KMS, WAF, Shield
  │   ├── monitoring/            # CloudWatch, Prometheus, Grafana
  │   └── serverless/            # Lambda, API Gateway, EventBridge
  │
  ├── environments/              # Environment configurations
  │   ├── dev/
  │   │   ├── main.tf            # Root module (calls child modules)
  │   │   ├── variables.tf
  │   │   ├── terraform.tfvars   # Environment-specific values
  │   │   └── backend.tf         # State backend config
  │   ├── staging/
  │   └── production/
  │
  ├── policies/                  # Policy-as-code
  │   ├── opa/                   # Open Policy Agent policies
  │   └── sentinel/              # Terraform Enterprise policies
  │
  ├── scripts/                   # Helper scripts
  │   ├── validate.sh
  │   ├── cost-estimate.sh
  │   └── drift-detect.sh
  │
  ├── .github/                   # CI/CD workflows
  │   └── workflows/
  │       ├── validate.yml
  │       ├── plan.yml
  │       └── apply.yml
  │
  └── docs/                      # Documentation
      ├── architecture.md
      ├── module-catalog.md
      └── runbooks/

MODULE DESIGN STANDARDS:
  Every module must include:
    ✓ main.tf — resource definitions
    ✓ variables.tf — input variables with descriptions and types
    ✓ outputs.tf — output values with descriptions
    ✓ versions.tf — provider and terraform version constraints
    ✓ README.md — usage examples, inputs/outputs table, dependencies
    ✓ examples/ — working example configurations

  Naming conventions:
    Resources: {environment}-{service}-{component}-{unique-id}
    Tags: Environment, Service, Owner, CostCenter, ManagedBy (IaC)
    State files: {environment}-{service}-terraform.tfstate

  Best practices:
    ✓ Use explicit versions for all providers and modules
    ✓ Never hardcode secrets (use a secret manager; keep secret values out of committed var files)
    ✓ Use depends_on sparingly (prefer implicit dependencies)
    ✓ Use count/for_each instead of duplicating code
    ✓ Add lifecycle rules (create_before_destroy, prevent_destroy)
    ✓ Document all assumptions and known limitations
    ✓ Test modules in dev before staging/production use
```
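
The naming conventions above lend themselves to a small helper that builds names instead of relying on each author to remember the pattern. The sketch below is illustrative only: the helper names are hypothetical, and the allowed character set (lowercase alphanumerics and hyphens) and the list of environments are assumptions, since the template specifies only the order of the parts.

```python
import re

# Assumption: the three environments named in this document.
ENVIRONMENTS = ("dev", "staging", "production")
# Assumption: each part is lowercase alphanumerics, optionally hyphenated.
PART = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def _require_part(label: str, value: str) -> str:
    """Validate one name part and return it unchanged."""
    if not PART.fullmatch(value):
        raise ValueError(f"{label} {value!r} must be lowercase alphanumerics and hyphens")
    return value


def resource_name(environment: str, service: str, component: str, unique_id: str) -> str:
    """Build {environment}-{service}-{component}-{unique-id}."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {environment!r}")
    parts = [
        environment,
        _require_part("service", service),
        _require_part("component", component),
        _require_part("unique-id", unique_id),
    ]
    return "-".join(parts)


def state_key(environment: str, service: str) -> str:
    """Build {environment}-{service}-terraform.tfstate."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {environment!r}")
    return f"{environment}-{_require_part('service', service)}-terraform.tfstate"


print(resource_name("production", "checkout", "api", "01"))
print(state_key("dev", "checkout"))
```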

## Integration Points

- **IaC tools**: Terraform, AWS CloudFormation, Azure Bicep, Pulumi
- **Configuration management**: Ansible, Chef, Puppet, SaltStack
- **CI/CD platforms**: GitHub Actions, GitLab CI (with GitLab Runner), Jenkins
- **Cloud platforms**: AWS, Azure, GCP, OCI, multi-cloud
- **Container orchestration**: Kubernetes, Docker Swarm, ECS
- **Secret management**: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
- **Policy engines**: Open Policy Agent (OPA), HashiCorp Sentinel, Conftest
- **Security scanning**: checkov, tfsec, Prisma Cloud, Snyk IaC
- **Cost management**: Infracost, CloudHealth, Cloudability, Spot by NetApp
- **State backends**: AWS S3, Terraform Cloud, GCS, Azure Blob Storage

## Edge Cases

| Scenario | Handling |
|----------|----------|
| Terraform state corruption | Restore from versioned backup; manual state fix if needed; post-mortem |
| Drift detected in production | Assess impact; auto-remediate if safe; manual intervention if risky; root cause analysis |
| Module breaking change affects multiple environments | Version pinning; staged migration; backward compatibility testing |
| CI/CD pipeline fails during production apply | Re-apply the last known-good configuration (a failed apply can leave partial changes); incident response; manual recovery if needed |
| Cost estimate shows unexpected increase | Block deployment; notify team; investigate root cause; optimize or approve |
| Provider API changes break existing configurations | Provider version pinning; upgrade testing in dev; scheduled provider updates |
| Multi-region deployment with state conflicts | Separate state files per region; remote state data sources for cross-region refs |
| Emergency change needed (bypass CI/CD) | Break-glass procedure: manual apply with extra approval; post-hoc PR within 24h |

## Output

### Infrastructure Dashboard

```
INFRASTRUCTURE AS CODE — Operations Dashboard
================================================
As of: 2025-01-15

INFRASTRUCTURE OVERVIEW:
  Total managed resources: 2,847
  Environments: 3 (dev, staging, production)
  Cloud providers: 2 (AWS, GCP)
  IaC coverage: 99.4% (2,847 resources managed, 16 manual)
  Last full reconciliation: 2025-01-14

RECENT CHANGES (Last 7 Days):
  Pull requests merged: 12
  Resources added: 34 | Modified: 89 | Destroyed: 7
  Environments affected: dev (8), staging (3), production (1)
  Avg deployment time: 8.2 minutes
  Failed deployments: 1 (auto-rolled back successfully)

DRIFT STATUS:
  Last drift scan: 2025-01-15 12:00 UTC (6h ago)
  Drift detected: 3 resources
    1. prod-ec2-web-server-03 — Security group modified (manual change) → Auto-remediated
    2. staging-rds-analytics — Parameter group changed → Ticket created
    3. dev-s3-logs — Lifecycle rule missing → Auto-remediated
  Drift rate: 0.1% (3/2,847 resources) — within acceptable threshold

COST OVERVIEW (This Month):
  AWS:     $42,340/month (IaC-managed: 96%)
  GCP:     $18,720/month (IaC-managed: 91%)
  ─────────────────────────────
  Total:   $61,060/month
  vs. Budget: $65,000/month → 93.9% utilization ✓
  vs. Last month: +$2,100 (+3.6%), within 5% variance ✓

TOP COST DRIVERS:
  1. EC2 instances:    $18,400/month (30.1%)
  2. RDS databases:    $12,800/month (21.0%)
  3. S3 storage:        $8,200/month (13.4%)
  4. EKS clusters:      $7,600/month (12.4%)
  5. Data transfer:     $5,400/month  (8.8%)

COMPLIANCE & SECURITY:
  Policy violations: 0 (all resources compliant)
  CIS benchmark score: 94/100 (4 minor findings — non-critical)
  Secrets exposed: 0
  Unencrypted resources: 0
  Over-privileged IAM roles: 2 (remediation tickets open)

CI/CD PIPELINE HEALTH:
  Total runs (last 30 days): 156
  Success rate: 97.4% (152/156)
  Avg plan time: 3.2 minutes | Avg apply time: 5.8 minutes
  Mean time to deploy: 9.0 minutes
  Rollback rate: 0.6% (1/156)

IMPROVEMENT RECOMMENDATIONS:
  1. Migrate 16 manual resources to IaC (target: 100% coverage)
  2. Enable auto-remediation for 80% of drift cases (current: 67%)
  3. Implement cost guardrails for dev environment (overspending 15%)
  4. Schedule quarterly provider version updates (currently 2 versions behind)
  5. Add chaos engineering tests to staging pipeline

MODULE USAGE ANALYTICS:
┌────────────────────┬──────────┬─────────────┬──────────────┐
│ Module             │ Version  │ Envs Used   │ Last Updated │
├────────────────────┼──────────┼─────────────┼──────────────┤
│ network/vpc        │ 4.2.1    │ 3/3         │ 2025-01-10   │
│ compute/ecs        │ 3.1.0    │ 3/3         │ 2025-01-08   │
│ database/rds       │ 3.8.2    │ 3/3         │ 2025-01-12   │
│ storage/s3         │ 2.5.0    │ 3/3         │ 2024-12-20   │
│ security/iam       │ 5.0.1    │ 3/3         │ 2025-01-14   │
│ monitoring/cw      │ 2.1.3    │ 2/3         │ 2025-01-05   │
│ serverless/lambda  │ 1.9.0    │ 2/3         │ 2024-12-28   │
└────────────────────┴──────────┴─────────────┴──────────────┘
```