---
name: devops-cicd
description: Manage DevOps pipelines and CI/CD automation including build and test automation, containerization, deployment strategies, infrastructure as code (IaC), environment management, release management, and pipeline optimization. Use when setting up CI/CD pipelines, managing deployments, automating infrastructure provisioning, or optimizing build processes. Triggers on phrases like "CI/CD", "continuous integration", "continuous delivery", "deployment pipeline", "build automation", "containerization", "Docker", "Kubernetes", "infrastructure as code", "IaC", "Terraform", "Ansible", "release management", "blue-green deployment", "canary deployment", "rolling update", "environment promotion", "artifact management".
---

# DevOps & CI/CD Automation

Automate the entire software delivery lifecycle from code commit to production deployment.

## CI/CD Pipeline Architecture

### Pipeline Framework

```
CI/CD PIPELINE ARCHITECTURE:
════════════════════════════

PIPELINE PLATFORM: GitHub Actions (primary) + Jenkins (legacy, migrating)
  Total pipelines: 48 (35 microservices + 8 infrastructure + 5 shared)
  Pipelines per day: 280-350 (commits → builds → tests → deployments)
  Avg. pipeline duration: 12 minutes (build to deploy — staging)
  Success rate: 96.8% (target: >95%) ✓
  Jenkins migration: 8 pipelines remaining (target: Q2 2025)

PIPELINE STAGES (Standard):
  Stage 1: CODE COMMIT
    Trigger: Push to branch or PR merge
    Actions:
      - Lint (ESLint, Pylint, golangci-lint — language-specific)
      - Format check (Prettier, Black, gofmt)
      - Security scan (Semgrep, Trivy for deps)
      - Size check (bundle size diff, max +10%)
    Duration: 2-4 minutes
    Gate: Must pass → proceed (fail → block + notify)
  
  Stage 2: BUILD
    Actions:
      - Dependency resolution (npm, pip, go mod, Maven)
      - Compilation (Java, Go, TypeScript → JS)
      - Container image build (Docker build + push to registry)
      - Artifact generation (JAR, wheel, binary)
      - Build metadata (version, commit hash, timestamp)
    Duration: 3-5 minutes
    Gate: Build success → proceed (fail → notify developer)
  
  Stage 3: TEST
    Actions:
      - Unit tests (language-specific, coverage report)
      - Integration tests (mocked external services)
      - API tests (Postman/Newman, contract testing)
      - Security tests (SAST: SonarQube, Snyk)
      - Coverage threshold: >80% (unit), >60% (integration)
    Duration: 4-8 minutes
    Gate: All tests pass + coverage threshold → proceed
  
  Stage 4: STAGING DEPLOY
    Actions:
      - Deploy to staging environment (K8s rolling update)
      - Smoke tests (health checks, basic API tests)
      - E2E tests (Cypress/Selenium — critical paths)
      - Performance tests (k6 — load test, <200ms p99)
      - Security scan (DAST: OWASP ZAP)
    Duration: 5-8 minutes
    Gate: All tests pass → manual approval for prod
  
  Stage 5: PRODUCTION DEPLOY
    Actions:
      - Manual approval (production deployment gate)
      - Blue-green or canary deployment (strategy by service)
      - Health check validation (auto)
      - Rollback automation (if health check fails)
      - Post-deploy verification (smoke tests, monitoring)
    Duration: 3-5 minutes (deploy) + 10-15 min (verification)
    Gate: Auto-rollback on failure; manual approval required

DEPLOYMENT STRATEGIES:
  ┌─────────────────────────┬──────────┬──────────┬───────────────────┐
  │ Strategy                │ Services │ Downtime │ Rollback Speed    │
  ├─────────────────────────┼──────────┼──────────┼───────────────────┤
  │ Blue-green              │ 12       │ Zero     │ Instant (DNS)     │
  │ Canary (10%→50%→100%)   │ 18       │ Zero     │ <5 min (traffic)  │
  │ Rolling update          │ 5        │ Zero     │ <3 min (K8s)      │
  ├─────────────────────────┼──────────┼──────────┼───────────────────┤
  │ TOTAL                   │ 35       │ Zero     │ Auto              │
  └─────────────────────────┴──────────┴──────────┴───────────────────┘

  Selection criteria:
    Blue-green: Critical services (API, auth, payment) — instant rollback
    Canary: Customer-facing apps — gradual rollout, error monitoring
    Rolling: Internal services — K8s native, low risk

RELEASE MANAGEMENT:
  Versioning: Semantic versioning (SemVer: MAJOR.MINOR.PATCH)
  Release cadence:
    Hotfix: As needed (critical bug — within 24 hours)
    Patch: Weekly (minor bugs, improvements — every Tuesday)
    Minor: Monthly (new features — first Monday of month)
    Major: Quarterly (breaking changes — planned, communicated)
  
  Release notes: Auto-generated (commit analysis + conventional commits)
    Breaking changes: Highlighted, requires migration guide
    Features: Categorized, linked to Jira/issue tracker
    Bug fixes: Categorized, linked to resolved issues
    Performance: Measured improvement (benchmark data)
  
  Release approval:
    Hotfix: Engineering lead + CTO (urgent)
    Patch: Engineering lead (standard)
    Minor: Engineering lead + product manager
    Major: Engineering lead + product manager + CTO

ARTIFACT MANAGEMENT:
  Container registry: AWS ECR (primary) + Harbor (backup)
    Images: 48 services × 3 envs = 144 active images
    Tagging: Version + commit hash + timestamp
    Scanning: Trivy (vulnerability scan — every build)
    Cleanup: Images >90 days (auto-delete — keep last 30 versions)
    Storage: 2.5 TB (optimized with multi-layer caching)
  
  Package repository: Nexus (internal)
    Packages: 280 (internal libraries, shared modules)
    Languages: npm, pip, Maven, Go modules
    Proxy: Upstream repos (npmjs, PyPI, Maven Central)
    Security: Known vulnerability blocking (deny on high/critical)
```
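
The canary strategy above (10%→50%→100% with auto-rollback on failed health checks) can be sketched roughly as follows. This is a minimal illustration, not the platform's actual rollout controller: `set_weight`, `error_rate`, and the 1% `max_error_rate` threshold are hypothetical names and values introduced here, and a real rollout would also wait for a soak period between steps.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Traffic steps taken from the deployment strategy table (10% -> 50% -> 100%).
CANARY_STEPS: Sequence[int] = (10, 50, 100)


@dataclass
class RolloutResult:
    promoted: bool       # True if the new version reached the last step
    final_weight: int    # traffic percentage left on the new version
    history: list        # weights applied, in order


def run_canary(
    set_weight: Callable[[int], None],   # hypothetical hook: shift traffic
    error_rate: Callable[[], float],     # hypothetical hook: read error ratio
    max_error_rate: float = 0.01,        # assumed threshold, not from the source
    steps: Sequence[int] = CANARY_STEPS,
) -> RolloutResult:
    """Walk through the traffic steps, rolling back on the first bad reading."""
    history = []
    for weight in steps:
        set_weight(weight)
        history.append(weight)
        if error_rate() > max_error_rate:
            # Auto-rollback: send all traffic back to the stable version.
            set_weight(0)
            history.append(0)
            return RolloutResult(promoted=False, final_weight=0, history=history)
    return RolloutResult(promoted=True, final_weight=steps[-1], history=history)
```

Passing the traffic-shifting and metric-reading hooks in as callables keeps the decision logic independent of any particular ingress or mesh, so it can be exercised in a unit test with stubs.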

## Infrastructure as Code (IaC)

### Automated Infrastructure Provisioning

```
INFRASTRUCTURE AS CODE:
════════════════════════

IaC TOOLS:
  Terraform: Cloud infrastructure (AWS, Azure, GCP)
  Ansible: Configuration management (servers, apps)
  Helm: Kubernetes application packaging
  Kubernetes manifests: K8s resources (native YAML)
  Packer: Custom AMI/image building

INFRASTRUCTURE INVENTORY (Terraform-managed):
  ┌──────────────────────────┬──────────┬──────────┐
  │ Resource Type            │ Count    │ Managed  │
  ├──────────────────────────┼──────────┼──────────┤
  │ EC2 instances            │ 45       │ 100%     │
  │ RDS databases            │ 15       │ 100%     │
  │ S3 buckets               │ 28       │ 100%     │
  │ VPCs + subnets           │ 12       │ 100%     │
  │ Load balancers           │ 18       │ 100%     │
  │ Security groups          │ 85       │ 100%     │
  │ IAM roles/policies       │ 120      │ 100%     │
  │ Route53 records          │ 240      │ 100%     │
  │ CloudFront distributions │ 5        │ 100%     │
  │ Azure VMs                │ 30       │ 100%     │
  │ Azure App Services       │ 15       │ 100%     │
  │ Azure SQL                │ 8        │ 100%     │
  ├──────────────────────────┼──────────┼──────────┤
  │ TOTAL                    │ 621      │ 100%     │
  └──────────────────────────┴──────────┴──────────┘

  State management:
    Backend: AWS S3 + DynamoDB (locking)
    State files: 12 (per environment + shared)
    Encryption: AES-256 (at rest) + TLS (in transit)
    Access: Terraform service account (restricted)
    Backup: S3 versioning + cross-region replication

  Drift detection:
    Schedule: Daily (terraform plan — diff only)
    Alert: Drift detected → notification + ticket
    Auto-remediation: No (manual review required)
    Drift incidents (January): 2 (both minor, corrected)

ANSIBLE CONFIGURATION MANAGEMENT:
  Managed hosts: 165 (servers) + 45 (endpoints, partial)
  Playbooks: 85 (server setup, app deployment, security hardening)
  Roles: 42 (reusable, modular)
  Inventory: Dynamic (AWS EC2 + Azure)
  Facts cache: Redis (reduce API calls)
  Execution: Parallel (forks: 50)
  
  Execution schedule: Daily (convergence check)
  Compliance check: Weekly (configuration drift)
  Drift detected: <2% (auto-remediation available)

KUBERNETES MANAGEMENT:
  Clusters: 3 (production, staging, development)
  Nodes: 48 total (prod: 25, staging: 15, dev: 8)
  Namespaces: 35 (one per service + shared)
  Helm charts: 35 (one per service, versioned)
  
  Deployment automation:
    GitOps (ArgoCD): 35 services (declarative, sync)
    Manifest apply: 5 shared services (manual trigger)
    Auto-sync: Every 3 minutes (ArgoCD)
    Rollback: 1-click (ArgoCD UI) or auto (health check)
  
  K8s scaling:
    HPA (Horizontal Pod Autoscaler): 35 services (CPU + custom metrics)
    VPA (Vertical Pod Autoscaler): 35 services (memory recommendation)
    Cluster Autoscaler: 3 clusters (node scaling)
    Target utilization: 60-70% (CPU), 70-80% (memory)

ENVIRONMENT MANAGEMENT:
  Environments: 4 (production, staging, development, QA)
  Parity: 95% (config-driven, minimal differences)
  Provisioning time:
    New environment: <30 minutes (Terraform + Ansible)
    New service (all envs): <15 minutes (pipeline)
    Environment tear-down: <5 minutes (Terraform destroy)
  
  Environment isolation:
    Network: Separate VPCs/subnets
    Data: Staging/dev use anonymized or synthetic data
    Access: Environment-specific IAM roles
    Secrets: Separate vaults (environment-scoped)
  
  Secret management: AWS Secrets Manager + HashiCorp Vault
    Secrets: 380 (across all environments)
    Rotation: Automatic (weekly for standard, daily for critical)
    Access: Least privilege (service account scoped)
    Audit: Full logging (who accessed what when)

INFRASTRUCTURE PIPELINE:
  IaC change workflow:
    1. Developer writes/updates Terraform (PR)
    2. CI pipeline: terraform fmt, terraform validate, tflint
    3. PR check: terraform plan (diff preview in PR comment)
    4. Review: Team review (security, cost, architecture)
    5. Approval: 2 approvals (infra team + stakeholder)
    6. Merge: terraform apply (auto, staging first)
    7. Validate: Smoke tests, monitoring check
    8. Promote: terraform apply (production — manual approval)
  
  Change statistics (January 2025):
    IaC PRs: 45
    Approved: 42 (93.3%)
    Rejected: 3 (security review, cost concern)
    Avg. review time: 4 hours
    Avg. deployment time: 15 minutes (post-merge)
    Drift post-deploy: 0 (IaC ensures consistency)
```
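
The daily drift check described above (a scheduled `terraform plan` that only reports a diff) could be wired up along these lines. This is a sketch under assumptions: it relies on Terraform's `-detailed-exitcode` flag, which makes `plan` exit with 0 for an empty plan, 1 for an error, and 2 when changes are present. The `DriftStatus` enum and function names are illustrative, and the notification and ticketing step is left out.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFT = "drift"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> DriftStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == 0:
        return DriftStatus.IN_SYNC   # empty plan: state matches reality
    if code == 2:
        return DriftStatus.DRIFT     # non-empty plan: something changed
    return DriftStatus.ERROR         # 1 (or anything else): plan failed


def check_drift(workdir: str) -> DriftStatus:
    """Run a plan in `workdir`; assumes terraform is on PATH and initialised."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(proc.returncode)
```

Because auto-remediation is disabled in the setup above, a `DRIFT` result would only open a ticket for manual review. It would never trigger `terraform apply`.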

## Pipeline Optimization

### Performance & Efficiency

```
PIPELINE OPTIMIZATION:
═════════════════════

BUILD PERFORMANCE:
  Total builds (January): 8,400 (280/day avg.)
  Build success rate: 96.8%
  Avg. build time: 4.2 minutes
  Median build time: 3.1 minutes
  P95 build time: 8.5 minutes (target: <10 min) ✓
  P99 build time: 12.3 minutes (target: <15 min) ✓
  
  Build caching:
    Dependency cache: 85% hit rate (S3-based)
    Layer cache (Docker): 92% hit rate
    Compilation cache (Bazel/ccache): 78% hit rate
    Impact: 40% reduction in build time (vs. no cache)
  
  Build parallelization:
    Matrix builds: 4 parallel runners (per service)
    Monorepo splitting: Independent services build in parallel
    Impact: 60% reduction in total pipeline time
    Runner utilization: 72% (well-managed)

COST OPTIMIZATION:
  CI/CD costs (January):
    GitHub Actions: $1,200 (840K minutes used)
    Self-hosted runners: $800 (EC2 spot instances)
    Container registry: $400 (ECR storage + transfers)
    Testing infrastructure: $600 (staging environment)
    ──────────────────────────────────────
    Total: $3,000/month
  
  Cost reduction (implemented):
    Spot instances: 60% of runners (spot + on-demand fallback)
    Runner auto-scale: Scale to 0 when idle (savings: 30%)
    Build cache: Reduce rebuild time (savings: $200/month)
    Artifact cleanup: Delete old artifacts (savings: $150/month)
  
  Pipeline efficiency metrics:
    Deployment frequency: 45/day (avg.)
    Lead time (commit → deploy): 18 minutes (avg.)
    Change failure rate: 3.2% (target: <5%) ✓
    Mean time to recovery (MTTR): 12 minutes (target: <15 min) ✓
    DORA metrics: All Elite (>75th percentile)

QUALITY GATES:
  Automated quality checks (every build):
    1. Code coverage: >80% (unit), >60% (integration)
    2. Code quality: SonarQube gate (no critical/blocker issues)
    3. Security: SAST (no high/critical vulnerabilities)
    4. Dependencies: No known CVEs (high/critical)
    5. License compliance: No prohibited licenses
    6. Performance: No regression (>20% slower)
    7. Bundle size: No increase >10%
  
  Gate statistics (January):
    Builds passing all gates: 96.8%
    Gate failures:
      Coverage below threshold: 12 (0.14%)
      Security vulnerability: 8 (0.10%)
      Performance regression: 5 (0.06%)
      License violation: 2 (0.02%)
      Other: 15 (0.18%)
  
  Gate effectiveness:
    Production incidents from pipeline: 0 (gates effective)
    Bugs caught in CI: 142 (vs. 12 that reached production, roughly 12:1)
    Developer feedback: Positive (catch issues early)

RELEASE AUTOMATION METRICS:
  Releases (January):
    Hotfix: 3 (critical bugs — avg. 4 hours from detection)
    Patch: 4 (weekly — avg. 45 minutes deploy time)
    Minor: 1 (monthly — avg. 2 hours deploy + verification)
    Major: 0 (planned for Q1 — March)
  
  Release success rate: 97.5% (39/40 releases successful on first attempt)
  Rollback rate: 2.5% (1 release — canary caught issue, auto-rollback)
  Post-release incidents: 0 (all releases clean)
  Customer-facing incidents (release-related): 0 ✓
```
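
The seven quality gates listed above can be expressed as a single evaluation function. The sketch below uses the thresholds from the list, but the `BuildMetrics` field names and the way each metric is collected are assumptions made for illustration. In practice these values would come from the coverage, SonarQube, SAST, and benchmark reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildMetrics:
    unit_coverage: float         # percent
    integration_coverage: float  # percent
    sonar_blockers: int          # critical/blocker issues from SonarQube
    sast_high_critical: int      # high/critical SAST findings
    dependency_cves: int         # high/critical CVEs in dependencies
    prohibited_licenses: int     # dependencies with a prohibited license
    perf_change_pct: float       # positive means slower than baseline
    bundle_change_pct: float     # positive means larger than baseline


def failed_gates(m: BuildMetrics) -> list:
    """Return the names of the gates this build fails (empty list = pass)."""
    checks = [
        ("coverage", m.unit_coverage > 80 and m.integration_coverage > 60),
        ("code_quality", m.sonar_blockers == 0),
        ("security", m.sast_high_critical == 0),
        ("dependencies", m.dependency_cves == 0),
        ("license", m.prohibited_licenses == 0),
        ("performance", m.perf_change_pct <= 20),
        ("bundle_size", m.bundle_change_pct <= 10),
    ]
    return [name for name, passed in checks if not passed]
```

Returning the list of failed gate names, rather than a single boolean, makes it straightforward to produce per-gate failure counts like the January statistics above.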

## Output

### DevOps & CI/CD Dashboard

```
DEVOPS & CI/CD DASHBOARD — Jan 2025
══════════════════════════════════

Pipelines:
  Total pipelines: 48 (35 microservices + 8 infra + 5 shared)
  Builds/day: 280-350
  Build success rate: 96.8%
  Avg. build time: 4.2 min
  Lead time (commit → deploy): 18 min
  DORA metrics: All Elite

Deployment:
  Deployment frequency: 45/day
  Downtime: 0 minutes (all zero-downtime strategies)
  Rollback rate: 2.5% (auto-rollback effective)
  Post-release incidents: 0
  Blue-green: 12 services, Canary: 18, Rolling: 5

Infrastructure:
  IaC coverage: 100% (621 resources)
  Drift: 0 (post-deploy)
  Environments: 4 (95% parity)
  K8s clusters: 3 (48 nodes, 35 services)
  Secrets: 380 (auto-rotation)

Quality:
  Gates: 7 automated checks (every build)
  Gate pass rate: 96.8%
  Bugs caught in CI: 142 (12x vs. production)
  Code coverage: >80% (unit), >60% (integration)
  Security vulnerabilities blocked: 8

Cost:
  CI/CD monthly cost: $3,000
  Runner utilization: 72%
  Runner auto-scale savings: 30%
  Artifact cleanup: $150/month savings

Actions:
  1. Complete Jenkins migration (8 pipelines remaining, Feb target)
  2. Build time optimization (P99: 12.3 → <10 min)
  3. K8s cost optimization (rightsizing — March)
  4. Release cadence review (quarterly — April)
  5. Pipeline security review (annual — Q2)
```
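
Build-time figures such as the average, median, P95, and P99 reported in the dashboard can be derived from raw build durations. The following is one possible sketch, assuming durations are available as a list of minutes and using the nearest-rank percentile definition. The source does not say which percentile method the dashboard uses, so results may differ slightly from an interpolating implementation.

```python
import math
import statistics


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_build_times(durations_min) -> dict:
    """Summarise build durations (in minutes) into dashboard-style figures."""
    durations = list(durations_min)
    return {
        "avg": statistics.fmean(durations),
        "median": statistics.median(durations),
        "p95": percentile(durations, 95),
        "p99": percentile(durations, 99),
    }
```

Tracking P95 and P99 alongside the average matters here because the action item on build time targets the slowest builds, which an average alone would hide.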

## Integration Points

- Version control (GitHub, GitLab, Bitbucket): Source code, PR workflow
- Container registry (ECR, Harbor, Docker Hub): Image storage, scanning
- Orchestration (Kubernetes, Docker Swarm, ECS): Deployment, scaling
- Configuration management (Ansible, Chef, Puppet): Server config
- IaC tools (Terraform, CloudFormation): Infrastructure provisioning
- Testing frameworks (JUnit, pytest, Cypress, k6): Automated testing
- Quality tools (SonarQube, Snyk, Trivy): Code quality, security
- Package repositories (Nexus, Artifactory): Dependency management
- Secret management (Vault, AWS Secrets Manager): Credential storage
- Monitoring (Datadog, Prometheus, Grafana): Post-deploy verification
- ITSM (ServiceNow, Jira): Change management, incident tracking
- Communication (Slack, Teams, PagerDuty): Deployment notifications
- Artifact stores (S3, GCS, Azure Blob): Build artifact storage

## Edge Cases

- **Build failure (intermittent)**: Flaky test identification; retry policy; test isolation; root cause
- **Deployment rollback (production)**: Auto-rollback triggers; data migration reversal; monitoring validation
- **Infrastructure drift**: Terraform plan alert; manual review; state correction; prevention
- **Pipeline bottleneck**: Parallelization; caching optimization; runner scaling; queue monitoring
- **Secret rotation (mid-deployment)**: Zero-downtime secret update; dual-secret period; validation
- **K8s cluster failure**: Multi-cluster failover; pod rescheduling; data persistence; recovery
- **Dependency supply chain attack**: Dependency pinning; SBOM; vulnerability scanning; vendor verification
- **Container image vulnerability**: Trivy scan; auto-rebuild; CVE tracking; patch priority
- **Environment parity drift**: Configuration audit; IaC enforcement; periodic sync
- **Resource exhaustion (runner queue)**: Auto-scaling; spot instance fallback; priority queue; capacity
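
As an illustration of the dual-secret period mentioned for mid-deployment secret rotation, the sketch below accepts both the new and the previous secret until the overlap window is closed. The `RotatingSecret` class and its method names are hypothetical. A real setup would delegate storage and rotation to Vault or AWS Secrets Manager and would close the window once all pods have picked up the new value.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass
class RotatingSecret:
    current: bytes
    previous: Optional[bytes] = None  # only honoured during the overlap window

    def rotate(self, new_secret: bytes) -> None:
        """Start a dual-secret period: the old value stays valid for now."""
        self.previous = self.current
        self.current = new_secret

    def end_overlap(self) -> None:
        """Close the dual-secret period once every consumer has switched."""
        self.previous = None

    def accepts(self, presented: bytes) -> bool:
        """Check a presented secret using constant-time comparison."""
        candidates = [self.current]
        if self.previous is not None:
            candidates.append(self.previous)
        return any(hmac.compare_digest(presented, c) for c in candidates)
```

During a rolling deployment, old pods still presenting the previous secret keep working until `end_overlap()` is called, which is what makes the rotation zero-downtime.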