IT AI Skill

Infrastructure As Code

Manage cloud and on-premise infrastructure through code using Terraform, CloudFormation, Ansible, and similar tools for reproducible, version-controlled infrastructure deployment. Use when creating IaC templates, managing cloud provisioning, implementing GitOps workflows, or automating infrastructure changes. Triggers on phrases like 'infrastructure as code', 'IaC', 'Terraform', 'CloudFormation', 'Ansible', 'GitOps', 'immutable infrastructure', 'provisioning automation', 'state management', 'blue-green deployment', 'environment parity', 'drift detection'.

Infrastructure as Code (IaC)

Manage and provision cloud and on-premise infrastructure through version-controlled code using Terraform, CloudFormation, Ansible, and similar tools for reproducible, auditable, and automated infrastructure deployment.

Workflow

Phase 1: IaC Framework Design

Tool selection and architecture:

Provisioning: Terraform (multi-cloud), AWS CloudFormation (AWS-native), Azure Bicep
Configuration management: Ansible (agentless), Chef, Puppet
Container orchestration: Kubernetes manifests, Helm charts
CI/CD integration: GitHub Actions, GitLab CI, Jenkins
State management: remote backends (S3 + DynamoDB, Terraform Cloud)

Module library development:

Reusable modules for common resources (VPC, ECS, RDS, S3)
Standardized naming conventions and tagging strategy
Versioned module registry (private Terraform registry)
Documentation per module (inputs, outputs, dependencies)

Environment strategy:

Development, staging, production (environment isolation)
Environment parity (same code, different variable values)
Promotion workflow (dev → staging → production)
Disaster recovery environment provisioning
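
One way to check environment parity is to compare the variable names each environment defines. The sketch below is a minimal illustration, assuming each environment's values have already been loaded into a plain dictionary. The function name `parity_gaps` and the sample variables are hypothetical, not part of any Terraform tooling.

```python
from typing import Dict, Set


def parity_gaps(env_vars: Dict[str, Dict[str, object]]) -> Dict[str, Set[str]]:
    """For each environment, return the variable names it lacks
    compared with the union of names across all environments."""
    all_keys: Set[str] = set()
    for values in env_vars.values():
        all_keys.update(values)
    return {env: all_keys - set(values) for env, values in env_vars.items()}


# Hypothetical variable sets; values differ per environment, names should not.
envs = {
    "dev": {"instance_type": "t3.small", "min_size": 1},
    "staging": {"instance_type": "t3.medium", "min_size": 2},
    "production": {"instance_type": "m5.large"},  # min_size left out on purpose
}
gaps = parity_gaps(envs)
```

An empty set for every environment means the same inputs are defined everywhere. A non-empty set is a hint that one environment has diverged from the shared code.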

Phase 2: Implementation & Automation

Code development standards:

Code style and linting (terraform fmt, tflint, checkov)
Security scanning (tfsec, checkov, Prisma Cloud)
Cost estimation (infracost, AWS Pricing Calculator integration)
Peer review process (pull request workflow)

CI/CD pipeline for infrastructure:

Plan phase: generate and validate infrastructure plan
Review phase: human approval for production changes
Apply phase: execute plan with error handling
Post-deployment: validation tests, smoke tests
Rollback: automated rollback on failure (re-apply the last known-good configuration; Terraform has no built-in rollback command)
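
The plan phase can produce a machine-readable summary for reviewers. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes[].change.actions` field of that output. The function name `summarize_plan` and the sample resource addresses are hypothetical.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str) -> dict:
    """Count planned actions from the JSON form of a Terraform plan."""
    plan = json.loads(plan_json)
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "create" in actions and "delete" in actions:
            counts["replace"] += 1  # delete and create on the same resource
        elif actions == ["create"]:
            counts["add"] += 1
        elif actions == ["update"]:
            counts["change"] += 1
        elif actions == ["delete"]:
            counts["destroy"] += 1
        # "no-op" and "read" entries are not counted
    return {key: counts[key] for key in ("add", "change", "replace", "destroy")}


# Hypothetical plan excerpt for illustration only.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.old", "change": {"actions": ["delete"]}},
        {"address": "aws_instance.api", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    ]
})
summary = summarize_plan(sample)
```

Counting replacements separately is a design choice: a replace destroys the existing resource, so reviewers usually want it flagged apart from in-place updates.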

State management and drift detection:

Remote state storage with locking
State versioning and backup
Automated drift detection (scheduled plan runs)
Drift remediation workflow (auto-remediate or alert)
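
A scheduled drift check can rely on `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when changes are present. The sketch below wraps that convention. The function names and status labels are hypothetical, and it assumes the checked-out code matches what was last applied, so any diff can be read as drift.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
DRIFT_STATUS = {0: "in_sync", 1: "error", 2: "drift"}


def classify_exit_code(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return DRIFT_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether live infrastructure
    differs from the configuration. Requires terraform on PATH and an
    initialized working directory."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode)
```

A scheduler would call `check_drift` per environment and route a `"drift"` result to the alerting or remediation workflow.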

Phase 3: Governance & Operations

Security and compliance:

Policy-as-code (Open Policy Agent, Sentinel, Conftest)
Compliance scanning (CIS benchmarks, SOC 2 controls)
Secret management (HashiCorp Vault, AWS Secrets Manager)
Access control (least privilege, role-based)
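
The idea behind policy-as-code can be illustrated with a required-tags rule. OPA, Sentinel, and Conftest each use their own policy languages. The Python sketch below is only a stand-in for the concept, checking the `resource_changes` entries of a JSON plan against the tag set listed under the naming conventions. The function name `missing_tags` and the sample resources are hypothetical.

```python
REQUIRED_TAGS = ("Environment", "Service", "Owner", "CostCenter", "ManagedBy")


def missing_tags(resource_changes: list) -> dict:
    """Return {resource address: [missing tag names]} for planned resources."""
    violations = {}
    for rc in resource_changes:
        after = rc.get("change", {}).get("after") or {}
        if "tags" not in after:
            # Resource is being destroyed or its type has no tags attribute.
            continue
        tags = after.get("tags") or {}
        missing = [tag for tag in REQUIRED_TAGS if tag not in tags]
        if missing:
            violations[rc["address"]] = missing
    return violations


# Hypothetical plan entries for illustration only.
changes = [
    {"address": "aws_s3_bucket.logs", "change": {"after": {"tags": {
        "Environment": "dev", "Service": "logs", "Owner": "platform",
        "CostCenter": "cc-100", "ManagedBy": "IaC"}}}},
    {"address": "aws_instance.web", "change": {"after": {"tags": {"Environment": "prod"}}}},
    {"address": "aws_route.default", "change": {"after": {"route_table_id": "rtb-1"}}},
    {"address": "aws_db_instance.old", "change": {"after": None}},
]
violations = missing_tags(changes)
```

A pipeline step could fail the build when `violations` is non-empty and print the offending addresses in the PR comment.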

Cost management:

Cost tagging strategy
Budget alerts and cost anomaly detection
Right-sizing recommendations
Reserved instance and savings plan optimization
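
Budget alerts reduce to two ratios: spend against budget and change against the previous month. The sketch below is a minimal version, assuming monthly totals are already available. The function name `budget_report` and the default 5% variance threshold are assumptions for illustration.

```python
def budget_report(current: float, budget: float, previous: float,
                  max_variance_pct: float = 5.0) -> dict:
    """Compute budget utilization and month-over-month change, in percent."""
    utilization = round(current / budget * 100, 1)
    change_pct = round((current - previous) / previous * 100, 1)
    return {
        "utilization_pct": utilization,
        "over_budget": current > budget,
        "change_pct": change_pct,
        "variance_alert": abs(change_pct) > max_variance_pct,
    }


# Hypothetical figures: $9,000 spent of a $10,000 budget, up from $8,000.
report = budget_report(current=9000, budget=10000, previous=8000)
```

The same function can be run per tag value (for example per `CostCenter`) once costs are broken down by the tagging strategy.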

Monitoring and observability:

Infrastructure health monitoring
Change tracking and audit trail
Performance metrics and alerting
Incident response integration

Templates

IaC Pipeline Configuration

INFRASTRUCTURE AS CODE — CI/CD Pipeline
=========================================
Version: [2.3] | Platform: [GitHub Actions + Terraform]

PIPELINE STAGES:
  ┌────────────────────────────────────────────────────────────┐
  │ 1. CODE COMMIT                                            │
  │    • Developer pushes IaC changes to feature branch       │
  │    • Git hooks: terraform fmt (auto-format)                │
  │    • Branch protection: required reviews for infra changes │
  ├────────────────────────────────────────────────────────────┤
  │ 2. PRE-CI VALIDATION                                     │
  │    • terraform validate — syntax check                     │
  │    • tflint — best practices enforcement                   │
  │    • terraform fmt -check (formatting check)               │
  │    • Failed? → Block PR with specific error messages       │
  ├────────────────────────────────────────────────────────────┤
  │ 3. SECURITY SCAN                                         │
  │    • checkov — policy compliance (CIS, SOC 2)              │
  │    • tfsec — security vulnerability detection              │
  │    • Failed? → Block PR with security findings report      │
  │    • Warning? → Flag for review (non-blocking)             │
  ├────────────────────────────────────────────────────────────┤
  │ 4. COST ESTIMATION                                       │
  │    • infracost diff (cost delta analysis)                  │
  │    • Comment on PR with cost impact:                       │
  │      "This change adds ~$340/month (EC2: +$200, RDS: +$140)"│
  │    • Cost increase > $1000/month? → Require manager review │
  ├────────────────────────────────────────────────────────────┤
  │ 5. PLAN (dev environment)                                │
  │    • terraform plan — dev variables                        │
  │    • Generate execution plan (add/mod/destroy)             │
  │    • Comment on PR with plan summary                       │
  │    • Dev auto-apply (if approved by code owner)            │
  ├────────────────────────────────────────────────────────────┤
  │ 6. REVIEW & APPROVAL                                     │
  │    • Required: 1 code owner approval                       │
  │    • Required: 1 security team approval (if prod change)   │
  │    • Required: cost approval (if > $1000/month increase)   │
  │    • Plan output visible for manual review                  │
  ├────────────────────────────────────────────────────────────┤
  │ 7. APPLY (staging environment)                            │
  │    • terraform apply — staging variables (on merge)        │
  │    • Post-deploy validation tests                          │
  │    • Smoke tests against staging infrastructure            │
  │    • Failed? → Auto-rollback + alert                       │
  ├────────────────────────────────────────────────────────────┤
  │ 8. APPLY (production)                                    │
  │    • Manual trigger (after staging validation passes)      │
  │    • terraform apply — production variables                │
  │    • Change window enforcement (no prod changes 22:00-06:00)│
  │    • Post-deploy validation + monitoring alert              │
  │    • Slack notification to #infra-changes channel          │
  └────────────────────────────────────────────────────────────┘

STATE MANAGEMENT:
  Backend: AWS S3 + DynamoDB (state locking)
  State file encryption: AES-256 (S3 server-side encryption)
  State backup: Daily snapshot to separate S3 bucket
  State retention: 90 days (versioned)
  Access: IAM role-based (dev: read-only, deployer: read-write)

DRIFT DETECTION:
  Schedule: Every 6 hours (cron job)
  Action: terraform plan → diff detection
  Threshold: Any unplanned change = alert
  Notification: Slack #infra-drift channel + PagerDuty (if prod)
  Remediation: Auto-remediate (if safe) or create Jira ticket

Terraform Module Structure

IaC MODULE LIBRARY — Standard Structure
=========================================

Repository layout:
  infrastructure/
  ├── modules/                    # Reusable modules
  │   ├── network/               # VPC, subnets, security groups
  │   │   ├── main.tf
  │   │   ├── variables.tf
  │   │   ├── outputs.tf
  │   │   ├── versions.tf
  │   │   └── README.md
  │   ├── compute/               # EC2, ECS, EKS
  │   ├── database/              # RDS, ElastiCache, DynamoDB
  │   ├── storage/               # S3, EBS, EFS
  │   ├── security/              # IAM, KMS, WAF, Shield
  │   ├── monitoring/            # CloudWatch, Prometheus, Grafana
  │   └── serverless/            # Lambda, API Gateway, EventBridge
  │
  ├── environments/              # Environment configurations
  │   ├── dev/
  │   │   ├── main.tf            # Root module (calls child modules)
  │   │   ├── variables.tf
  │   │   ├── terraform.tfvars   # Environment-specific values
  │   │   └── backend.tf         # State backend config
  │   ├── staging/
  │   └── production/
  │
  ├── policies/                  # Policy-as-code
  │   ├── opa/                   # Open Policy Agent policies
  │   └── sentinel/              # Terraform Enterprise policies
  │
  ├── scripts/                   # Helper scripts
  │   ├── validate.sh
  │   ├── cost-estimate.sh
  │   └── drift-detect.sh
  │
  ├── .github/                   # CI/CD workflows
  │   └── workflows/
  │       ├── validate.yml
  │       ├── plan.yml
  │       └── apply.yml
  │
  └── docs/                      # Documentation
      ├── architecture.md
      ├── module-catalog.md
      └── runbooks/

MODULE DESIGN STANDARDS:
  Every module must include:
    ✓ main.tf — resource definitions
    ✓ variables.tf — input variables with descriptions and types
    ✓ outputs.tf — output values with descriptions
    ✓ versions.tf — provider and terraform version constraints
    ✓ README.md — usage examples, inputs/outputs table, dependencies
    ✓ examples/ — working example configurations

  Naming conventions:
    Resources: {environment}-{service}-{component}-{unique-id}
    Tags: Environment, Service, Owner, CostCenter, ManagedBy (IaC)
    State files: {environment}-{service}-terraform.tfstate

  Best practices:
    ✓ Use explicit versions for all providers and modules
    ✓ Never hardcode secrets (use a secret manager or environment variables; keep any secret-bearing var files out of version control)
    ✓ Use depends_on sparingly (prefer implicit dependencies)
    ✓ Use count/for_each instead of duplicating code
    ✓ Add lifecycle rules (create_before_destroy, prevent_destroy)
    ✓ Document all assumptions and known limitations
    ✓ Test modules in dev before staging/production use
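
The naming conventions above can be enforced with a small helper. The sketch below builds names in the `{environment}-{service}-{component}-{unique-id}` pattern and the matching state file name. The function names and the character rule (lowercase letters, digits, inner hyphens) are assumptions, not part of the standard itself.

```python
import re

# Assumed rule: lowercase alphanumerics, optionally joined by single hyphens.
VALID_PART = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def resource_name(environment: str, service: str, component: str, unique_id: str) -> str:
    """Build a resource name following the four-part convention."""
    parts = [environment, service, component, unique_id]
    for part in parts:
        if not VALID_PART.match(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


def state_file_name(environment: str, service: str) -> str:
    """Build the state file name for an environment and service."""
    return f"{environment}-{service}-terraform.tfstate"


name = resource_name("prod", "ec2", "web-server", "03")
state = state_file_name("prod", "billing")  # "billing" is a made-up service
```

Rejecting invalid parts early keeps names predictable across modules, which matters once tags and state files are derived from the same inputs.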

Integration Points

IaC tools: Terraform, AWS CloudFormation, Azure Bicep, Pulumi
Configuration management: Ansible, Chef, Puppet, SaltStack
CI/CD platforms: GitHub Actions, GitLab CI, Jenkins
Cloud platforms: AWS, Azure, GCP, OCI, multi-cloud
Container orchestration: Kubernetes, Docker Swarm, ECS
Secret management: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
Policy engines: Open Policy Agent (OPA), HashiCorp Sentinel, Conftest
Security scanning: checkov, tfsec, Prisma Cloud, Snyk IaC
Cost management: Infracost, CloudHealth, Cloudability, Spot by NetApp
State backends: AWS S3, Terraform Cloud, GCS, Azure Blob Storage

Edge Cases

| Scenario | Handling |
|----------|----------|
| Terraform state corruption | Restore from versioned backup; manual state fix if needed; post-mortem |
| Drift detected in production | Assess impact; auto-remediate if safe; manual intervention if risky; root cause analysis |
| Module breaking change affects multiple environments | Version pinning; staged migration; backward compatibility testing |
| CI/CD pipeline fails during production apply | Auto-rollback to previous state; incident response; manual recovery if needed |
| Cost estimate shows unexpected increase | Block deployment; notify team; investigate root cause; optimize or approve |
| Provider API changes break existing configurations | Provider version pinning; upgrade testing in dev; scheduled provider updates |
| Multi-region deployment with state conflicts | Separate state files per region; remote state data sources for cross-region refs |
| Emergency change needed (bypass CI/CD) | Break-glass procedure: manual apply with extra approval; post-hoc PR within 24h |

Output

Infrastructure Dashboard

INFRASTRUCTURE AS CODE — Operations Dashboard
================================================
As of: 2025-01-15

INFRASTRUCTURE OVERVIEW:
  Total managed resources: 2,847
  Environments: 3 (dev, staging, production)
  Cloud providers: 2 (AWS, GCP)
  IaC coverage: 99.4% (2,847 resources managed, 16 manual)
  Last full reconciliation: 2025-01-14

RECENT CHANGES (Last 7 Days):
  Pull requests merged: 12
  Resources added: 34 | Modified: 89 | Destroyed: 7
  Environments affected: dev (8), staging (3), production (1)
  Avg deployment time: 8.2 minutes
  Failed deployments: 1 (auto-rolled back successfully)

DRIFT STATUS:
  Last drift scan: 2025-01-15 12:00 UTC (6h ago)
  Drift detected: 3 resources
    1. prod-ec2-web-server-03 — Security group modified (manual change) → Auto-remediated
    2. staging-rds-analytics — Parameter group changed → Ticket created
    3. dev-s3-logs — Lifecycle rule missing → Auto-remediated
  Drift rate: 0.1% (3/2,847 resources) — within acceptable threshold

COST OVERVIEW (This Month):
  AWS:     $42,340/month (IaC-managed: 96%)
  GCP:     $18,720/month (IaC-managed: 91%)
  ─────────────────────────────
  Total:   $61,060/month
  vs. Budget: $65,000/month → 93.9% utilization ✓
  vs. Last month: +$2,100 (+3.5%) — within 5% variance ✓

TOP COST DRIVERS:
  1. EC2 instances:    $18,400/month (30.1%)
  2. RDS databases:    $12,800/month (21.0%)
  3. S3 storage:        $8,200/month (13.4%)
  4. EKS clusters:      $7,600/month (12.4%)
  5. Data transfer:     $5,400/month  (8.8%)

COMPLIANCE & SECURITY:
  Policy violations: 0 (all resources compliant)
  CIS benchmark score: 94/100 (4 minor findings — non-critical)
  Secrets exposed: 0
  Unencrypted resources: 0
  Over-privileged IAM roles: 2 (remediation tickets open)

CI/CD PIPELINE HEALTH:
  Total runs (last 30 days): 156
  Success rate: 97.4% (152/156)
  Avg plan time: 3.2 minutes | Avg apply time: 5.8 minutes
  Mean time to deploy: 9.0 minutes
  Rollback rate: 0.6% (1/156)

IMPROVEMENT RECOMMENDATIONS:
  1. Migrate 16 manual resources to IaC (target: 100% coverage)
  2. Enable auto-remediation for 80% of drift cases (current: 67%)
  3. Implement cost guardrails for dev environment (overspending 15%)
  4. Schedule quarterly provider version updates (currently 2 versions behind)
  5. Add chaos engineering tests to staging pipeline

MODULE USAGE ANALYTICS:
┌────────────────────┬──────────┬─────────────┬──────────────┐
│ Module             │ Version  │ Envs Used   │ Last Updated │
├────────────────────┼──────────┼─────────────┼──────────────┤
│ network/vpc        │ 4.2.1    │ 3/3         │ 2025-01-10   │
│ compute/ecs        │ 3.1.0    │ 3/3         │ 2025-01-08   │
│ database/rds       │ 3.8.2    │ 3/3         │ 2025-01-12   │
│ storage/s3         │ 2.5.0    │ 3/3         │ 2024-12-20   │
│ security/iam       │ 5.0.1    │ 3/3         │ 2025-01-14   │
│ monitoring/cw      │ 2.1.3    │ 2/3         │ 2025-01-05   │
│ serverless/lambda  │ 1.9.0    │ 2/3         │ 2024-12-28   │
└────────────────────┴──────────┴─────────────┴──────────────┘
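
The IaC coverage figure on a dashboard like this one is the share of resources under code management. A minimal sketch, assuming coverage is defined as managed resources divided by managed plus manually created resources; the function name `iac_coverage` is hypothetical.

```python
def iac_coverage(managed: int, manual: int) -> float:
    """Percentage of resources managed through IaC, rounded to one decimal."""
    total = managed + manual
    if total == 0:
        return 0.0  # nothing to measure yet
    return round(managed / total * 100, 1)


# Hypothetical counts: 90 resources in code, 10 created by hand.
coverage = iac_coverage(90, 10)
```

Computing the figure from raw counts on every run keeps the headline percentage consistent with the resource totals shown next to it.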

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.