IT AI Skill

DevOps CI/CD

Manage DevOps pipelines and CI/CD automation including build and test automation, containerization, deployment strategies, infrastructure as code (IaC), environment management, release management, and pipeline optimization. Use when setting up CI/CD pipelines, managing deployments, automating infrastructure provisioning, or optimizing build processes. Triggers on phrases like "CI/CD", "continuous integration", "continuous delivery", "deployment pipeline", "build automation", "containerization", "Docker", "Kubernetes", "infrastructure as code", "IaC", "Terraform", "Ansible", "release management", "blue-green deployment", "canary deployment", "rolling update", "environment promotion", "artifact management".

DevOps & CI/CD Automation

Automate the entire software delivery lifecycle from code commit to production deployment.

CI/CD Pipeline Architecture

Pipeline Framework

CI/CD PIPELINE ARCHITECTURE:
════════════════════════════

PIPELINE PLATFORM: GitHub Actions (primary) + Jenkins (legacy, migrating)
  Total pipelines: 48 (35 microservices + 8 infrastructure + 5 shared)
  Pipelines per day: 280-350 (commits → builds → tests → deployments)
  Avg. pipeline duration: 12 minutes (build through staging deploy)
  Success rate: 96.8% (target: >95%) ✓
  Jenkins migration: 8 pipelines remaining (target: Q2 2025)

PIPELINE STAGES (Standard):
  Stage 1: CODE COMMIT
    Trigger: Push to branch or PR merge
    Actions:
      - Lint (ESLint, Pylint, golangci-lint — language-specific)
      - Format check (Prettier, Black, gofmt)
      - Security scan (Semgrep, Trivy for deps)
      - Size check (bundle size diff, max +10%)
    Duration: 2-4 minutes
    Gate: Must pass → proceed (fail → block + notify)
  
  Stage 2: BUILD
    Actions:
      - Dependency resolution (npm, pip, go mod, Maven)
      - Compilation (Java, Go, TypeScript → JS)
      - Container image build (Docker build + push to registry)
      - Artifact generation (JAR, wheel, binary)
      - Build metadata (version, commit hash, timestamp)
    Duration: 3-5 minutes
    Gate: Build success → proceed (fail → notify developer)
  
  Stage 3: TEST
    Actions:
      - Unit tests (language-specific, coverage report)
      - Integration tests (mocked external services)
      - API tests (Postman/Newman, contract testing)
      - Security tests (SAST: SonarQube, Snyk)
      - Coverage threshold: >80% (unit), >60% (integration)
    Duration: 4-8 minutes
    Gate: All tests pass + coverage threshold → proceed
  
  Stage 4: STAGING DEPLOY
    Actions:
      - Deploy to staging environment (K8s rolling update)
      - Smoke tests (health checks, basic API tests)
      - E2E tests (Cypress/Selenium — critical paths)
      - Performance tests (k6 — load test, <200ms p99)
      - Security scan (DAST: OWASP ZAP)
    Duration: 5-8 minutes
    Gate: All tests pass → manual approval for prod
  
  Stage 5: PRODUCTION DEPLOY
    Actions:
      - Manual approval (production deployment gate)
      - Blue-green or canary deployment (strategy by service)
      - Health check validation (auto)
      - Rollback automation (if health check fails)
      - Post-deploy verification (smoke tests, monitoring)
    Duration: 3-5 minutes (deploy) + 10-15 min (verification)
    Gate: Auto-rollback on failure; manual approval required
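
The five stages above act as a chain of gates: a stage runs only if every earlier gate passed. A minimal Python sketch of that control flow follows. `Stage`, `run_pipeline`, and the lambda checks are hypothetical names for illustration, not part of GitHub Actions or Jenkins.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    check: Callable[[], bool]  # returns True when the stage's gate passes


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failed gate.

    Returns (succeeded, names of the stages that were executed).
    """
    executed: List[str] = []
    for stage in stages:
        executed.append(stage.name)
        if not stage.check():
            # Failed gate: later stages are blocked (notification would go here).
            return False, executed
    return True, executed


# Hypothetical run in which the TEST gate fails.
stages = [
    Stage("CODE COMMIT", lambda: True),
    Stage("BUILD", lambda: True),
    Stage("TEST", lambda: False),
    Stage("STAGING DEPLOY", lambda: True),
    Stage("PRODUCTION DEPLOY", lambda: True),
]
ok, executed = run_pipeline(stages)
```

In this run the staging and production stages never execute, which matches the "fail → block + notify" behaviour described for the gates.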

DEPLOYMENT STRATEGIES:
  ┌─────────────────────────┬──────────┬──────────┬───────────────────┐
  │ Strategy                │ Services │ Downtime │ Rollback Speed    │
  ├─────────────────────────┼──────────┼──────────┼───────────────────┤
  │ Blue-green              │ 12       │ Zero     │ Instant (DNS)     │
  │ Canary (10%→50%→100%)   │ 18       │ Zero     │ <5 min (traffic)  │
  │ Rolling update          │ 5        │ Zero     │ <3 min (K8s)      │
  ├─────────────────────────┼──────────┼──────────┼───────────────────┤
  │ TOTAL                   │ 35       │ Zero     │ Auto              │
  └─────────────────────────┴──────────┴──────────┴───────────────────┘

  Selection criteria:
    Blue-green: Critical services (API, auth, payment) — instant rollback
    Canary: Customer-facing apps — gradual rollout, error monitoring
    Rolling: Internal services — K8s native, low risk
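
The canary progression (10% → 50% → 100%) with error monitoring can be sketched as a loop that checks an error rate at each traffic step and rolls back on the first breach. This is an illustrative sketch only: `canary_rollout`, the `error_rate_at` callback, and the 1% threshold are assumptions and do not come from the source.

```python
from typing import Callable, Sequence, Tuple


def canary_rollout(
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,  # assumed threshold, not from the source
    steps: Sequence[int] = (10, 50, 100),  # traffic percentages from the table
) -> Tuple[str, int]:
    """Walk through the traffic steps and roll back on a bad error rate.

    Returns ("promoted", last_step) on success,
    or ("rolled_back", failing_step) when a step exceeds the threshold.
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return "rolled_back", percent
    return "promoted", steps[-1]
```

A real rollout would read the error rate from monitoring over a soak period at each step; the callback stands in for that measurement.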

RELEASE MANAGEMENT:
  Versioning: Semantic versioning (SemVer: MAJOR.MINOR.PATCH)
  Release cadence:
    Hotfix: As needed (critical bug — within 24 hours)
    Patch: Weekly (minor bugs, improvements — every Tuesday)
    Minor: Monthly (new features — first Monday of month)
    Major: Quarterly (breaking changes — planned, communicated)
  
  Release notes: Auto-generated (commit analysis + conventional commits)
    Breaking changes: Highlighted, requires migration guide
    Features: Categorized, linked to Jira/issue tracker
    Bug fixes: Categorized, linked to resolved issues
    Performance: Measured improvement (benchmark data)
  
  Release approval:
    Hotfix: Engineering lead + CTO (urgent)
    Patch: Engineering lead (standard)
    Minor: Engineering lead + product manager
    Major: Engineering lead + product manager + CTO
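
The versioning and approval rules above can be expressed as a small lookup plus a SemVer bump function. This is a sketch under one stated assumption: a hotfix increments PATCH like a regular patch release, which the source does not say explicitly. `APPROVERS` and `bump_version` are hypothetical names.

```python
# Approvers per release type, copied from the approval list above.
APPROVERS = {
    "hotfix": ["engineering lead", "CTO"],
    "patch": ["engineering lead"],
    "minor": ["engineering lead", "product manager"],
    "major": ["engineering lead", "product manager", "CTO"],
}


def bump_version(version: str, release_type: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a release type.

    Assumption: a hotfix increments PATCH, like a regular patch release.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if release_type == "major":
        return f"{major + 1}.0.0"
    if release_type == "minor":
        return f"{major}.{minor + 1}.0"
    if release_type in ("patch", "hotfix"):
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release type: {release_type}")
```

For example, `bump_version("2.3.4", "minor")` yields `"2.4.0"`: a MINOR bump resets PATCH to zero, as SemVer requires.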

ARTIFACT MANAGEMENT:
  Container registry: AWS ECR (primary) + Harbor (backup)
    Images: 48 services × 3 envs = 144 active images
    Tagging: Version + commit hash + timestamp
    Scanning: Trivy (vulnerability scan — every build)
    Cleanup: Auto-delete images older than 90 days (keep last 30 versions)
    Storage: 2.5 TB (optimized with multi-layer caching)
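
One possible reading of the cleanup rule is "delete an image only if it is older than 90 days and is not among the 30 most recent versions". The sketch below implements that reading. `images_to_delete` and the `(tag, pushed_at)` tuple shape are assumptions for illustration; an actual ECR setup would more likely express this as a lifecycle policy.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def images_to_delete(
    images: List[Tuple[str, datetime]],  # (tag, pushed_at) pairs, hypothetical shape
    now: datetime,
    max_age_days: int = 90,
    keep_last: int = 30,
) -> List[str]:
    """Return tags that are older than max_age_days and not in the newest keep_last."""
    newest_first = sorted(images, key=lambda item: item[1], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    # The newest `keep_last` images are protected regardless of age.
    return [tag for tag, pushed_at in newest_first[keep_last:] if pushed_at < cutoff]
```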
  
  Package repository: Nexus (internal)
    Packages: 280 (internal libraries, shared modules)
    Languages: npm, pip, Maven, Go modules
    Proxy: Upstream repos (npmjs, PyPI, Maven Central)
    Security: Known vulnerability blocking (deny on high/critical)

Infrastructure as Code (IaC)

Automated Infrastructure Provisioning

INFRASTRUCTURE AS CODE:
════════════════════════

IaC TOOLS:
  Terraform: Cloud infrastructure (AWS, Azure, GCP)
  Ansible: Configuration management (servers, apps)
  Helm: Kubernetes application packaging
  Kubernetes manifests: K8s resources (native YAML)
  Packer: Custom AMI/image building

INFRASTRUCTURE INVENTORY (Terraform-managed):
  ┌──────────────────────────┬──────────┬──────────┐
  │ Resource Type            │ Count    │ Managed  │
  ├──────────────────────────┼──────────┼──────────┤
  │ EC2 instances            │ 45       │ 100%     │
  │ RDS databases            │ 15       │ 100%     │
  │ S3 buckets               │ 28       │ 100%     │
  │ VPCs + subnets           │ 12       │ 100%     │
  │ Load balancers           │ 18       │ 100%     │
  │ Security groups          │ 85       │ 100%     │
  │ IAM roles/policies       │ 120      │ 100%     │
  │ Route53 records          │ 240      │ 100%     │
  │ CloudFront distributions │ 5        │ 100%     │
  │ Azure VMs                │ 30       │ 100%     │
  │ Azure App Services       │ 15       │ 100%     │
  │ Azure SQL                │ 8        │ 100%     │
  ├──────────────────────────┼──────────┼──────────┤
  │ TOTAL                    │ 621      │ 100%     │
  └──────────────────────────┴──────────┴──────────┘

  State management:
    Backend: AWS S3 + DynamoDB (locking)
    State files: 12 (per environment + shared)
    Encryption: AES-256 (at rest) + TLS (in transit)
    Access: Terraform service account (restricted)
    Backup: S3 versioning + cross-region replication

  Drift detection:
    Schedule: Daily (terraform plan — diff only)
    Alert: Drift detected → notification + ticket
    Auto-remediation: No (manual review required)
    Drift incidents (January): 2 (both minor, corrected)
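
A daily "diff only" drift check can rely on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below wraps that convention. `classify_plan_exit_code` and `detect_drift` are hypothetical helper names, and the alerting and ticketing steps are left out.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # plan succeeded, no changes
    if code == 2:
        return "drift"    # plan succeeded, changes are present
    return "error"        # exit code 1 (or anything unexpected)


def detect_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report the drift status.

    Requires the terraform binary and an initialised working directory.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Because auto-remediation is disabled, a `"drift"` result would only raise the notification and ticket; nothing is applied.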

ANSIBLE CONFIGURATION MANAGEMENT:
  Managed hosts: 165 (servers) + 45 (endpoints, partial)
  Playbooks: 85 (server setup, app deployment, security hardening)
  Roles: 42 (reusable, modular)
  Inventory: Dynamic (AWS EC2 + Azure)
  Facts cache: Redis (reduce API calls)
  Execution: Parallel (forks: 50)
  
  Execution schedule: Daily (convergence check)
  Compliance check: Weekly (configuration drift)
  Drift detected: <2% (auto-remediation available)

KUBERNETES MANAGEMENT:
  Clusters: 3 (production, staging, development)
  Nodes: 48 total (prod: 25, staging: 15, dev: 8)
  Namespaces: 35 (one per service + shared)
  Helm charts: 35 (one per service, versioned)
  
  Deployment automation:
    GitOps (ArgoCD): 35 services (declarative, sync)
    Manifest apply: 5 shared services (manual trigger)
    Auto-sync: Every 3 minutes (ArgoCD)
    Rollback: 1-click (ArgoCD UI) or auto (health check)
  
  K8s scaling:
    HPA (Horizontal Pod Autoscaler): 35 services (CPU + custom metrics)
    VPA (Vertical Pod Autoscaler): 35 services (memory recommendation)
    Cluster Autoscaler: 3 clusters (node scaling)
    Target utilization: 60-70% (CPU), 70-80% (memory)
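
To show how a CPU target in the 60-70% range drives scaling, here is the replica calculation documented for the Horizontal Pod Autoscaler: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The 65% target in the usage note is an assumed value inside the stated range, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,  # Kubernetes default tolerance
) -> int:
    """Replica count per the documented HPA formula."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

With 4 replicas at 90% CPU and a 65% target, the function returns 6. At 68% CPU the ratio is inside the tolerance band, so the count stays at 4.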

ENVIRONMENT MANAGEMENT:
  Environments: 4 (production, staging, development, QA)
  Parity: 95% (config-driven, minimal differences)
  Provisioning time:
    New environment: <30 minutes (Terraform + Ansible)
    New service (all envs): <15 minutes (pipeline)
    Environment tear-down: <5 minutes (Terraform destroy)
  
  Environment isolation:
    Network: Separate VPCs/subnets
    Data: Staging/dev use anonymized or synthetic data
    Access: Environment-specific IAM roles
    Secrets: Separate vaults (environment-scoped)
  
  Secret management: AWS Secrets Manager + HashiCorp Vault
    Secrets: 380 (across all environments)
    Rotation: Automatic (weekly for standard, daily for critical)
    Access: Least privilege (service account scoped)
    Audit: Full logging (who accessed what when)
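
The rotation schedule (weekly for standard secrets, daily for critical ones) reduces to a simple due-date check. This is a sketch only: `ROTATION_INTERVALS` and `rotation_due` are hypothetical names, and in practice AWS Secrets Manager or Vault would schedule rotation themselves.

```python
from datetime import datetime, timedelta

# Intervals taken from the rotation policy above.
ROTATION_INTERVALS = {
    "standard": timedelta(weeks=1),
    "critical": timedelta(days=1),
}


def rotation_due(tier: str, last_rotated: datetime, now: datetime) -> bool:
    """Return True when a secret of the given tier is due for rotation."""
    return now - last_rotated >= ROTATION_INTERVALS[tier]
```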

INFRASTRUCTURE PIPELINE:
  IaC change workflow:
    1. Developer writes/updates Terraform (PR)
    2. CI pipeline: terraform fmt, terraform validate, terrafmt
    3. PR check: terraform plan (diff preview in PR comment)
    4. Review: Team review (security, cost, architecture)
    5. Approval: 2 approvals (infra team + stakeholder)
    6. Merge: terraform apply (auto, staging first)
    7. Validate: Smoke tests, monitoring check
    8. Promote: terraform apply (production — manual approval)
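
The workflow above combines automated checks (steps 2-3) with approval gates (steps 5-8). The sketch below lists real Terraform CLI invocations for the checks and models the gates as a predicate. The plan file name `tfplan`, the function `may_apply`, and its parameters are assumptions for illustration.

```python
from typing import List

# Steps 2-3: formatting check, validation, and a saved plan for the PR preview.
CI_CHECKS: List[List[str]] = [
    ["terraform", "fmt", "-check", "-recursive"],
    ["terraform", "validate"],
    ["terraform", "plan", "-input=false", "-out=tfplan"],
]


def may_apply(
    environment: str,
    approvals: int,
    staging_validated: bool = False,
    prod_approved: bool = False,
) -> bool:
    """Decide whether `terraform apply` may run for an environment.

    Staging applies automatically after merge (2 approvals required).
    Production additionally needs a validated staging run and manual approval.
    """
    if approvals < 2:
        return False
    if environment == "staging":
        return True
    if environment == "production":
        return staging_validated and prod_approved
    raise ValueError(f"unknown environment: {environment}")
```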
  
  Change statistics (January 2025):
    IaC PRs: 45
    Approved: 42 (93.3%)
    Rejected: 3 (security review, cost concern)
    Avg. review time: 4 hours
    Avg. deployment time: 15 minutes (post-merge)
    Drift post-deploy: 0 (IaC ensures consistency)

Pipeline Optimization

Performance & Efficiency

PIPELINE OPTIMIZATION:
═════════════════════

BUILD PERFORMANCE:
  Total builds (January): 8,400 (280/day avg.)
  Build success rate: 96.8%
  Avg. build time: 4.2 minutes
  Median build time: 3.1 minutes
  P95 build time: 8.5 minutes (target: <10 min) ✓
  P99 build time: 12.3 minutes (target: <15 min) ✓
  
  Build caching:
    Dependency cache: 85% hit rate (S3-based)
    Layer cache (Docker): 92% hit rate
    Compilation cache (Bazel/ccache): 78% hit rate
    Impact: 40% reduction in build time (vs. no cache)
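
Dependency caches like the one above are commonly keyed on a hash of the lockfile, so the cache is reused until dependencies actually change. The source does not describe its keying scheme, so the following is an assumed approach, and `cache_key` is a hypothetical helper.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Build a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

A key such as `cache_key("npm", lockfile_bytes)` could then name the object in the S3-based cache. Identical lockfiles map to the same key (a hit) and any dependency change produces a new key (a miss).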
  
  Build parallelization:
    Matrix builds: 4 parallel runners (per service)
    Monorepo splitting: Independent services build in parallel
    Impact: 60% reduction in total pipeline time
    Runner utilization: 72% (well-managed)

COST OPTIMIZATION:
  CI/CD costs (January):
    GitHub Actions: $1,200 (840K minutes used)
    Self-hosted runners: $800 (EC2 spot instances)
    Container registry: $400 (ECR storage + transfers)
    Testing infrastructure: $600 (staging environment)
    ──────────────────────────────────────
    Total: $3,000/month
  
  Cost reduction (implemented):
    Spot instances: 60% of runners (spot + on-demand fallback)
    Runner auto-scale: Scale to 0 when idle (savings: 30%)
    Build cache: Reduce rebuild time (savings: $200/month)
    Artifact cleanup: Delete old artifacts (savings: $150/month)
  
  Pipeline efficiency metrics:
    Deployment frequency: 45/day (avg.)
    Lead time (commit → deploy): 18 minutes (avg.)
    Change failure rate: 3.2% (target: <5%) ✓
    Mean time to recovery (MTTR): 12 minutes (target: <15 min) ✓
    DORA metrics: All Elite (>75th percentile)
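
The four efficiency metrics above (deployment frequency, lead time, change failure rate, MTTR) can be derived from per-deployment records. This is a minimal sketch, assuming a hypothetical `Deployment` record shape; real figures would come from pipeline and incident data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Deployment:
    lead_time_min: float        # commit → deploy, in minutes
    failed: bool                # True if the change caused a failure
    recovery_min: float = 0.0   # time to recover, only meaningful when failed


def dora_summary(deployments: List[Deployment], days: int) -> Dict[str, float]:
    """Compute the four metrics over a period of `days` days."""
    failures = [d for d in deployments if d.failed]
    return {
        "deploys_per_day": len(deployments) / days,
        "lead_time_min": mean(d.lead_time_min for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_min": mean(d.recovery_min for d in failures) if failures else 0.0,
    }
```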

QUALITY GATES:
  Automated quality checks (every build):
    1. Code coverage: >80% (unit), >60% (integration)
    2. Code quality: SonarQube gate (no critical/blocker issues)
    3. Security: SAST (no high/critical vulnerabilities)
    4. Dependencies: No known CVEs (high/critical)
    5. License compliance: No prohibited licenses
    6. Performance: No regression >20% (slower than baseline)
    7. Bundle size: No increase >10%
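
The seven checks above can be evaluated as a list of named predicates over build metrics. The thresholds come from the list; the metric key names and the `failed_gates` function are hypothetical and would need to match whatever the pipeline's tools actually report.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]

# (gate name, predicate that returns True when the gate passes)
GATES: List[Tuple[str, Callable[[Metrics], bool]]] = [
    ("coverage", lambda m: m["unit_coverage_pct"] > 80
        and m["integration_coverage_pct"] > 60),
    ("code_quality", lambda m: m["critical_or_blocker_issues"] == 0),
    ("security", lambda m: m["sast_high_or_critical"] == 0),
    ("dependencies", lambda m: m["dependency_cves_high_or_critical"] == 0),
    ("licenses", lambda m: m["prohibited_licenses"] == 0),
    ("performance", lambda m: m["perf_regression_pct"] <= 20),
    ("bundle_size", lambda m: m["bundle_growth_pct"] <= 10),
]


def failed_gates(metrics: Metrics) -> List[str]:
    """Return the names of all gates the build fails, in gate order."""
    return [name for name, passes in GATES if not passes(metrics)]
```

An empty result means the build passes every gate. Returning all failures, instead of stopping at the first, gives the developer the full picture in one run.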
  
  Gate statistics (January):
    Builds passing all gates: 96.8%
    Gate failures:
      Coverage below threshold: 12 (0.14%)
      Security vulnerability: 8 (0.10%)
      Performance regression: 5 (0.06%)
      License violation: 2 (0.02%)
      Other: 15 (0.18%)
  
  Gate effectiveness:
    Production incidents from pipeline: 0 (gates effective)
    Bugs caught in CI: 142 (vs. 12 found in production; roughly 12x as many caught before release)
    Developer feedback: Positive (catch issues early)

RELEASE AUTOMATION METRICS:
  Releases (January):
    Hotfix: 3 (critical bugs — avg. 4 hours from detection)
    Patch: 4 (weekly — avg. 45 minutes deploy time)
    Minor: 1 (monthly — avg. 2 hours deploy + verification)
    Major: 0 (planned for Q1 — March)
  
  Release success rate: 97.4% (37/38 releases successful on first attempt)
  Rollback rate: 2.6% (1 of 38 releases; canary caught the issue, auto-rollback)
  Post-release incidents: 0 (all releases clean)
  Customer-facing incidents (release-related): 0 ✓

Output

DevOps & CI/CD Dashboard

DEVOPS & CI/CD DASHBOARD — Jan 2025
══════════════════════════════════

Pipelines:
  Total pipelines: 48 (35 microservices + 8 infra + 5 shared)
  Builds/day: 280-350
  Build success rate: 96.8%
  Avg. build time: 4.2 min
  Lead time (commit → deploy): 18 min
  DORA metrics: All Elite

Deployment:
  Deployment frequency: 45/day
  Downtime: 0 minutes (all zero-downtime strategies)
  Rollback rate: 2.6% (1 of 38 releases; auto-rollback effective)
  Post-release incidents: 0
  Blue-green: 12 services, Canary: 18, Rolling: 5

Infrastructure:
  IaC coverage: 100% (621 resources)
  Drift: 0 (post-deploy)
  Environments: 4 (95% parity)
  K8s clusters: 3 (48 nodes, 35 services)
  Secrets: 380 (auto-rotation)

Quality:
  Gates: 7 automated checks (every build)
  Gate pass rate: 96.8%
  Bugs caught in CI: 142 (about 12x the 12 found in production)
  Code coverage: >80% (unit), >60% (integration)
  Security vulnerabilities blocked: 8

Cost:
  CI/CD monthly cost: $3,000
  Runner utilization: 72%
  Runner auto-scale savings: 30%
  Artifact cleanup: $150/month savings

Actions:
  1. Jenkins migration completion (8 pipelines remaining, Feb target)
  2. Build time optimization (P99: 12.3 → <10 min)
  3. K8s cost optimization (rightsizing — March)
  4. Release cadence review (quarterly — April)
  5. Pipeline security review (annual — Q2)

Integration Points

Version control (GitHub, GitLab, Bitbucket): Source code, PR workflow
Container registry (ECR, Harbor, Docker Hub): Image storage, scanning
Orchestration (Kubernetes, Docker Swarm, ECS): Deployment, scaling
Configuration management (Ansible, Chef, Puppet): Server config
IaC tools (Terraform, CloudFormation): Infrastructure provisioning
Testing frameworks (JUnit, pytest, Cypress, k6): Automated testing
Quality tools (SonarQube, Snyk, Trivy): Code quality, security
Package repositories (Nexus, Artifactory): Dependency management
Secret management (Vault, AWS Secrets Manager): Credential storage
Monitoring (Datadog, Prometheus, Grafana): Post-deploy verification
ITSM (ServiceNow, Jira): Change management, incident tracking
Communication (Slack, Teams, PagerDuty): Deployment notifications
Artifact stores (S3, GCS, Azure Blob): Build artifact storage

Edge Cases

Build failure (intermittent): Flaky test identification; retry policy; test isolation; root cause
Deployment rollback (production): Auto-rollback triggers; data migration reversal; monitoring validation
Infrastructure drift: Terraform plan alert; manual review; state correction; prevention
Pipeline bottleneck: Parallelization; caching optimization; runner scaling; queue monitoring
Secret rotation (mid-deployment): Zero-downtime secret update; dual-secret period; validation
K8s cluster failure: Multi-cluster failover; pod rescheduling; data persistence; recovery
Dependency supply chain attack: Dependency pinning; SBOM; vulnerability scanning; vendor verification
Container image vulnerability: Trivy scan; auto-rebuild; CVE tracking; patch priority
Environment parity drift: Configuration audit; IaC enforcement; periodic sync
Resource exhaustion (runner queue): Auto-scaling; spot instance fallback; priority queue; capacity
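
For the intermittent build failure case, a retry policy can double as flaky-test identification: a test that fails and then passes on retry is flagged instead of silently accepted. This is a minimal sketch, assuming a hypothetical `run_with_retries` helper and a limit of three attempts, which the source does not specify.

```python
from typing import Callable


def run_with_retries(test: Callable[[], bool], max_attempts: int = 3) -> str:
    """Run a test up to max_attempts times and classify the outcome.

    Returns "passed" (first attempt), "flaky" (passed only on a retry),
    or "failed" (never passed).
    """
    for attempt in range(1, max_attempts + 1):
        if test():
            return "passed" if attempt == 1 else "flaky"
    return "failed"
```

Tests classified as `"flaky"` would then feed the isolation and root-cause steps, so retries do not hide a real defect.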

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.