Infrastructure as Code (IaC)

Manage and provision cloud and on-premises infrastructure through version-controlled code, using Terraform, CloudFormation, Ansible, and similar tools, so that deployments are reproducible, auditable, and automated.

Workflow

Phase 1: IaC Framework Design

  1. Tool selection and architecture
  2. Module library development
  3. Environment strategy

Phase 2: Implementation & Automation

  1. Code development standards
  2. CI/CD pipeline for infrastructure
  3. State management and drift detection

Phase 3: Governance & Operations

  1. Security and compliance
  2. Cost management
  3. Monitoring and observability

Templates

IaC Pipeline Configuration

INFRASTRUCTURE AS CODE — CI/CD Pipeline
=========================================
Version: [2.3] | Platform: [GitHub Actions + Terraform]

PIPELINE STAGES:
  ┌────────────────────────────────────────────────────────────┐
  │ 1. CODE COMMIT                                            │
  │    • Developer pushes IaC changes to feature branch       │
  │    • Git hooks: terraform fmt (auto-format)                │
  │    • Branch protection: required reviews for infra changes │
  ├────────────────────────────────────────────────────────────┤
  │ 2. PRE-CI VALIDATION                                     │
  │    • terraform validate — syntax check                     │
  │    • tflint — best practices enforcement                   │
  │    • terraform fmt -check — formatting check               │
  │    • Failed? → Block PR with specific error messages       │
  ├────────────────────────────────────────────────────────────┤
  │ 3. SECURITY SCAN                                         │
  │    • checkov — policy compliance (CIS, SOC 2)              │
  │    • tfsec — security vulnerability detection              │
  │    • Failed? → Block PR with security findings report      │
  │    • Warning? → Flag for review (non-blocking)             │
  ├────────────────────────────────────────────────────────────┤
  │ 4. COST ESTIMATION                                       │
  │    • infracost diff — cost delta analysis                  │
  │    • Comment on PR with cost impact:                       │
  │      "This change adds ~$340/month (EC2: +$200, RDS: +$140)"│
  │    • Cost increase > $1000/month? → Require manager review │
  ├────────────────────────────────────────────────────────────┤
  │ 5. PLAN (dev environment)                                │
  │    • terraform plan — dev variables                        │
  │    • Generate execution plan (add/mod/destroy)             │
  │    • Comment on PR with plan summary                       │
  │    • Dev auto-apply (if approved by code owner)            │
  ├────────────────────────────────────────────────────────────┤
  │ 6. REVIEW & APPROVAL                                     │
  │    • Required: 1 code owner approval                       │
  │    • Required: 1 security team approval (if prod change)   │
  │    • Required: cost approval (if > $1000/month increase)   │
  │    • Plan output visible for manual review                  │
  ├────────────────────────────────────────────────────────────┤
  │ 7. APPLY (staging environment)                            │
  │    • terraform apply — staging variables (on merge)        │
  │    • Post-deploy validation tests                          │
  │    • Smoke tests against staging infrastructure            │
  │    • Failed? → Auto-rollback + alert                       │
  ├────────────────────────────────────────────────────────────┤
  │ 8. APPLY (production)                                    │
  │    • Manual trigger (after staging validation passes)      │
  │    • terraform apply — production variables                │
  │    • Change window enforcement (no prod changes 22:00-06:00)│
  │    • Post-deploy validation + monitoring alert              │
  │    • Slack notification to #infra-changes channel          │
  └────────────────────────────────────────────────────────────┘
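
The approval rules in stage 6 and the change window in stage 8 can be written as small pure functions that a pipeline step calls before running `terraform apply`. This is a minimal sketch, not part of the template: the `ChangeRequest` fields, the function names, and the approval labels are hypothetical, and it assumes the 22:00-06:00 freeze is evaluated in one agreed time zone.

```python
from dataclasses import dataclass
from datetime import time

COST_REVIEW_THRESHOLD_USD = 1000.0  # monthly increase that triggers cost approval
FREEZE_START = time(22, 0)          # production freeze begins (assumed time zone)
FREEZE_END = time(6, 0)             # production freeze ends


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical summary of a pull request that touches infrastructure."""
    touches_production: bool
    monthly_cost_delta_usd: float


def required_approvals(change: ChangeRequest) -> list[str]:
    """Return the approvals stage 6 asks for, in a stable order."""
    approvals = ["code-owner"]
    if change.touches_production:
        approvals.append("security-team")
    if change.monthly_cost_delta_usd > COST_REVIEW_THRESHOLD_USD:
        approvals.append("cost-approval")
    return approvals


def in_production_freeze(now: time) -> bool:
    """True when `now` falls inside the overnight window, which wraps midnight."""
    return now >= FREEZE_START or now < FREEZE_END
```

Because the window wraps midnight, the check is an `or` of two comparisons rather than a single range test. A change adding exactly $1000/month does not trigger cost approval, since the template says "> $1000".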

STATE MANAGEMENT:
  Backend: AWS S3 + DynamoDB (state locking)
  State file encryption: AES-256 (S3 server-side encryption)
  State backup: Daily snapshot to separate S3 bucket
  State retention: 90 days (versioned)
  Access: IAM role-based (dev: read-only, deployer: read-write)
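
The 90-day retention rule amounts to a date cutoff. In practice an S3 lifecycle rule would normally enforce it; the sketch below only illustrates the arithmetic, and the function name and the idea of passing snapshot dates as a list are assumptions made for the example.

```python
from datetime import date, timedelta


def expired_snapshots(
    snapshot_dates: list[date], today: date, retention_days: int = 90
) -> list[date]:
    """Return snapshot dates strictly older than the retention period, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in snapshot_dates if d < cutoff)
```

A snapshot taken exactly 90 days ago is kept; only older ones are returned.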

DRIFT DETECTION:
  Schedule: Every 6 hours (cron job)
  Action: terraform plan → diff detection
  Threshold: Any unplanned change = alert
  Notification: Slack #infra-drift channel + PagerDuty (if prod)
  Remediation: Auto-remediate (if safe) or create Jira ticket
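
One way to implement the "terraform plan → diff detection" step is to export the plan with `terraform show -json` and inspect its `resource_changes` list. The following is a sketch under that assumption; the function names and the notification target strings are hypothetical placeholders. Note that a non-empty plan can also reflect committed but unapplied code changes, so this treats any planned action as drift, as the threshold above does.

```python
import json

# Actions that do not change real infrastructure.
_HARMLESS_ACTIONS = ("no-op", "read")


def drifted_resources(plan_json: str) -> list[str]:
    """Return addresses of resources whose planned actions would change them."""
    plan = json.loads(plan_json)
    drifted = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if any(action not in _HARMLESS_ACTIONS for action in actions):
            drifted.append(resource_change["address"])
    return drifted


def notification_targets(environment: str, drifted: list[str]) -> list[str]:
    """Pick where to send the alert: Slack always, PagerDuty only for production."""
    if not drifted:
        return []
    targets = ["slack:#infra-drift"]
    if environment == "production":
        targets.append("pagerduty")
    return targets
```

A scheduled job would call `drifted_resources` on each environment's plan output and pass the result to `notification_targets`.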

Terraform Module Structure

IaC MODULE LIBRARY — Standard Structure
=========================================

Repository layout:
  infrastructure/
  ├── modules/                    # Reusable modules
  │   ├── network/               # VPC, subnets, security groups
  │   │   ├── main.tf
  │   │   ├── variables.tf
  │   │   ├── outputs.tf
  │   │   ├── versions.tf
  │   │   └── README.md
  │   ├── compute/               # EC2, ECS, EKS
  │   ├── database/              # RDS, ElastiCache, DynamoDB
  │   ├── storage/               # S3, EBS, EFS
  │   ├── security/              # IAM, KMS, WAF, Shield
  │   ├── monitoring/            # CloudWatch, Prometheus, Grafana
  │   └── serverless/            # Lambda, API Gateway, EventBridge
  │
  ├── environments/              # Environment configurations
  │   ├── dev/
  │   │   ├── main.tf            # Root module (calls child modules)
  │   │   ├── variables.tf
  │   │   ├── terraform.tfvars   # Environment-specific values
  │   │   └── backend.tf         # State backend config
  │   ├── staging/
  │   └── production/
  │
  ├── policies/                  # Policy-as-code
  │   ├── opa/                   # Open Policy Agent policies
  │   └── sentinel/              # Terraform Enterprise policies
  │
  ├── scripts/                   # Helper scripts
  │   ├── validate.sh
  │   ├── cost-estimate.sh
  │   └── drift-detect.sh
  │
  ├── .github/                   # CI/CD workflows
  │   └── workflows/
  │       ├── validate.yml
  │       ├── plan.yml
  │       └── apply.yml
  │
  └── docs/                      # Documentation
      ├── architecture.md
      ├── module-catalog.md
      └── runbooks/

MODULE DESIGN STANDARDS:
  Every module must include:
    ✓ main.tf — resource definitions
    ✓ variables.tf — input variables with descriptions and types
    ✓ outputs.tf — output values with descriptions
    ✓ versions.tf — provider and terraform version constraints
    ✓ README.md — usage examples, inputs/outputs table, dependencies
    ✓ examples/ — working example configurations
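
The required-files checklist is easy to enforce mechanically, for example from a helper such as `scripts/validate.sh`. A minimal sketch follows; the function name is hypothetical and it checks only that the entries exist, not their contents.

```python
from pathlib import Path

REQUIRED_FILES = ("main.tf", "variables.tf", "outputs.tf", "versions.tf", "README.md")
REQUIRED_DIRS = ("examples",)


def missing_module_parts(module_dir: Path) -> list[str]:
    """List required files and directories absent from a module directory."""
    missing = [name for name in REQUIRED_FILES if not (module_dir / name).is_file()]
    missing += [f"{name}/" for name in REQUIRED_DIRS if not (module_dir / name).is_dir()]
    return missing
```

An empty result means the module meets the structural standard; a CI step could fail the build when the list is non-empty.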

  Naming conventions:
    Resources: {environment}-{service}-{component}-{unique-id}
    Tags: Environment, Service, Owner, CostCenter, ManagedBy (IaC)
    State files: {environment}-{service}-terraform.tfstate
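
The naming patterns above can be generated and checked in code so that names stay consistent across teams. This sketch assumes each segment is lowercase alphanumeric with optional internal hyphens (so a component like `web-server` is allowed); that character rule and the function names are assumptions, not part of the standard.

```python
import re

REQUIRED_TAGS = ("Environment", "Service", "Owner", "CostCenter", "ManagedBy")
_SEGMENT = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def resource_name(environment: str, service: str, component: str, unique_id: str) -> str:
    """Build {environment}-{service}-{component}-{unique-id}, validating each segment."""
    parts = (environment, service, component, unique_id)
    for part in parts:
        if not _SEGMENT.match(part):
            raise ValueError(f"invalid name segment: {part!r}")
    return "-".join(parts)


def state_file_name(environment: str, service: str) -> str:
    """Build {environment}-{service}-terraform.tfstate."""
    return f"{environment}-{service}-terraform.tfstate"


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]
```

For instance, `resource_name("prod", "ec2", "web-server", "03")` yields `prod-ec2-web-server-03`.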

  Best practices:
    ✓ Use explicit versions for all providers and modules
    ✓ Never hardcode secrets (use a secrets manager, or var files kept out of version control)
    ✓ Use depends_on sparingly (prefer implicit dependencies)
    ✓ Use count/for_each instead of duplicating code
    ✓ Add lifecycle rules (create_before_destroy, prevent_destroy)
    ✓ Document all assumptions and known limitations
    ✓ Test modules in dev before staging/production use

Integration Points

Edge Cases

| Scenario | Handling |
|----------|----------|
| Terraform state corruption | Restore from versioned backup; manual state fix if needed; post-mortem |
| Drift detected in production | Assess impact; auto-remediate if safe; manual intervention if risky; root cause analysis |
| Module breaking change affects multiple environments | Version pinning; staged migration; backward compatibility testing |
| CI/CD pipeline fails during production apply | Auto-rollback to previous state; incident response; manual recovery if needed |
| Cost estimate shows unexpected increase | Block deployment; notify team; investigate root cause; optimize or approve |
| Provider API changes break existing configurations | Provider version pinning; upgrade testing in dev; scheduled provider updates |
| Multi-region deployment with state conflicts | Separate state files per region; remote state data sources for cross-region refs |
| Emergency change needed (bypass CI/CD) | Break-glass procedure: manual apply with extra approval; post-hoc PR within 24h |
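
The break-glass row implies a deadline that can be checked automatically. As a rough sketch, assuming the emergency apply time and the follow-up PR time are recorded somewhere (the function and parameter names here are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Optional

POST_HOC_PR_WINDOW = timedelta(hours=24)


def post_hoc_pr_overdue(
    applied_at: datetime, pr_opened_at: Optional[datetime], now: datetime
) -> bool:
    """True if the follow-up PR missed, or has already passed, the 24-hour deadline."""
    deadline = applied_at + POST_HOC_PR_WINDOW
    if pr_opened_at is not None:
        return pr_opened_at > deadline
    return now > deadline
```

All three timestamps should share a time zone (or all be naive) so the comparisons are valid.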

Output

Infrastructure Dashboard

INFRASTRUCTURE AS CODE — Operations Dashboard
================================================
As of: 2025-01-15

INFRASTRUCTURE OVERVIEW:
  Total managed resources: 2,847
  Environments: 3 (dev, staging, production)
  Cloud providers: 2 (AWS, GCP)
  IaC coverage: 94.3% (267 resources managed, 16 manual)
  Last full reconciliation: 2025-01-14

RECENT CHANGES (Last 7 Days):
  Pull requests merged: 12
  Resources added: 34 | Modified: 89 | Destroyed: 7
  Environments affected: dev (8), staging (3), production (1)
  Avg deployment time: 8.2 minutes
  Failed deployments: 1 (auto-rolled back successfully)

DRIFT STATUS:
  Last drift scan: 2025-01-15 12:00 UTC (6h ago)
  Drift detected: 3 resources
    1. prod-ec2-web-server-03 — Security group modified (manual change) → Auto-remediated
    2. staging-rds-analytics — Parameter group changed → Ticket created
    3. dev-s3-logs — Lifecycle rule missing → Auto-remediated
  Drift rate: 0.1% (3/2,847 resources) — within acceptable threshold

COST OVERVIEW (This Month):
  AWS:     $42,340/month (IaC-managed: 96%)
  GCP:     $18,720/month (IaC-managed: 91%)
  ─────────────────────────────
  Total:   $61,060/month
  vs. Budget: $65,000/month → 93.9% utilization ✓
  vs. Last month: +$2,100 (+3.5%) — within 5% variance ✓

TOP COST DRIVERS:
  1. EC2 instances:    $18,400/month (30.1%)
  2. RDS databases:    $12,800/month (21.0%)
  3. S3 storage:        $8,200/month (13.4%)
  4. EKS clusters:      $7,600/month (12.4%)
  5. Data transfer:     $5,400/month  (8.8%)

COMPLIANCE & SECURITY:
  Policy violations: 0 (all resources compliant)
  CIS benchmark score: 94/100 (4 minor findings — non-critical)
  Secrets exposed: 0
  Unencrypted resources: 0
  Over-privileged IAM roles: 2 (remediation tickets open)

CI/CD PIPELINE HEALTH:
  Total runs (last 30 days): 156
  Success rate: 97.4% (152/156)
  Avg plan time: 3.2 minutes | Avg apply time: 5.8 minutes
  Mean time to deploy: 9.0 minutes
  Rollback rate: 0.6% (1/156)

IMPROVEMENT RECOMMENDATIONS:
  1. Migrate 16 manual resources to IaC (target: 100% coverage)
  2. Enable auto-remediation for 80% of drift cases (current: 67%)
  3. Implement cost guardrails for dev environment (overspending 15%)
  4. Schedule quarterly provider version updates (currently 2 versions behind)
  5. Add chaos engineering tests to staging pipeline

MODULE USAGE ANALYTICS:
┌────────────────────┬──────────┬─────────────┬──────────────┐
│ Module             │ Version  │ Envs Used   │ Last Updated │
├────────────────────┼──────────┼─────────────┼──────────────┤
│ network/vpc        │ 4.2.1    │ 3/3         │ 2025-01-10   │
│ compute/ecs        │ 3.1.0    │ 3/3         │ 2025-01-08   │
│ database/rds       │ 3.8.2    │ 3/3         │ 2025-01-12   │
│ storage/s3         │ 2.5.0    │ 3/3         │ 2024-12-20   │
│ security/iam       │ 5.0.1    │ 3/3         │ 2025-01-14   │
│ monitoring/cw      │ 2.1.3    │ 2/3         │ 2025-01-05   │
│ serverless/lambda  │ 1.9.0    │ 2/3         │ 2024-12-28   │
└────────────────────┴──────────┴─────────────┴──────────────┘
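
Ratios on a dashboard like this are less error-prone when recomputed from raw counts than when typed by hand. A small sketch of that calculation follows; the helper names are hypothetical and rounding to one decimal place is an assumption based on the figures shown.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


def cost_shares(costs: dict[str, float], total: float) -> dict[str, float]:
    """Map each cost driver to its share of total monthly spend."""
    return {name: percent(amount, total) for name, amount in costs.items()}
```

With the dashboard's own figures, `percent(3, 2847)` gives the 0.1% drift rate and `percent(152, 156)` gives the 97.4% pipeline success rate.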