IT AI Skill
DevOps CI/CD
Manage DevOps pipelines and CI/CD automation including build and test automation, containerization, deployment strategies, infrastructure as code (IaC), environment management, release management, and pipeline optimization. Use when setting up CI/CD pipelin...
DevOps & CI/CD Automation
Automate the entire software delivery lifecycle from code commit to production deployment.
CI/CD Pipeline Architecture
Pipeline Framework
CI/CD PIPELINE ARCHITECTURE:
════════════════════════════
PIPELINE PLATFORM: GitHub Actions (primary) + Jenkins (legacy, migrating)
Total pipelines: 48 (35 microservices + 8 infrastructure + 5 shared)
Pipelines per day: 280-350 (commits → builds → tests → deployments)
Avg. pipeline duration: 12 minutes (build to deploy — staging)
Success rate: 96.8% (target: >95%) ✓
Jenkins migration: 8 pipelines remaining (target: Q2 2025)
PIPELINE STAGES (Standard):
Stage 1: CODE COMMIT
Trigger: Push to branch or PR merge
Actions:
- Lint (ESLint, Pylint, golangci-lint — language-specific)
- Format check (Prettier, Black, gofmt)
- Security scan (Semgrep, Trivy for deps)
- Size check (bundle size diff, max +10%)
Duration: 2-4 minutes
Gate: Must pass → proceed (fail → block + notify)
Stage 2: BUILD
Actions:
- Dependency resolution (npm, pip, go mod, Maven)
- Compilation (Java, Go, TypeScript → JS)
- Container image build (Docker build + push to registry)
- Artifact generation (JAR, wheel, binary)
- Build metadata (version, commit hash, timestamp)
Duration: 3-5 minutes
Gate: Build success → proceed (fail → notify developer)
Stage 3: TEST
Actions:
- Unit tests (language-specific, coverage report)
- Integration tests (mocked external services)
- API tests (Postman/Newman, contract testing)
- Security tests (SAST: SonarQube, Snyk)
- Coverage threshold: >80% (unit), >60% (integration)
Duration: 4-8 minutes
Gate: All tests pass + coverage threshold → proceed
Stage 4: STAGING DEPLOY
Actions:
- Deploy to staging environment (K8s rolling update)
- Smoke tests (health checks, basic API tests)
- E2E tests (Cypress/Selenium — critical paths)
- Performance tests (k6 — load test, <200ms p99)
- Security scan (DAST: OWASP ZAP)
Duration: 5-8 minutes
Gate: All tests pass → manual approval for prod
Stage 5: PRODUCTION DEPLOY
Actions:
- Manual approval (production deployment gate)
- Blue-green or canary deployment (strategy by service)
- Health check validation (auto)
- Rollback automation (if health check fails)
- Post-deploy verification (smoke tests, monitoring)
Duration: 3-5 minutes (deploy) + 10-15 min (verification)
Gate: Auto-rollback on failure; manual approval required
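The gate sequence above can be sketched as a small runner that executes stages in order and stops at the first gate that fails. This is an illustrative sketch only: `Stage`, `run_pipeline`, and the lambda gates are hypothetical names, and in practice each gate would be backed by the tools listed for that stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    gate: Callable[[], bool]  # returns True when the stage's gate passes


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failed gate."""
    completed: List[str] = []
    for stage in stages:
        if not stage.gate():
            # A failed gate blocks every later stage.
            return False, completed
        completed.append(stage.name)
    return True, completed


# Hypothetical run in which the TEST gate fails (for example, low coverage).
example_stages = [
    Stage("code-commit", lambda: True),
    Stage("build", lambda: True),
    Stage("test", lambda: False),
    Stage("staging-deploy", lambda: True),
]
ok, done = run_pipeline(example_stages)
```

Here `ok` is false and `done` holds only the stages that passed before the failing gate, so staging deploy never runs.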
DEPLOYMENT STRATEGIES:
┌─────────────────────────┬──────────┬──────────┬───────────────────┐
│ Strategy │ Services │ Downtime │ Rollback Speed │
├─────────────────────────┼──────────┼──────────┼───────────────────┤
│ Blue-green │ 12 │ Zero │ Instant (DNS) │
│ Canary (10%→50%→100%) │ 18 │ Zero │ <5 min (traffic) │
│ Rolling update │ 5 │ Zero │ <3 min (K8s) │
│ ────────────────────── │ ────── │ ────── │ ─────────────── │
│ TOTAL │ 35 │ Zero │ Auto │
└─────────────────────────┴──────────┴──────────┴───────────────────┘
Selection criteria:
Blue-green: Critical services (API, auth, payment) — instant rollback
Canary: Customer-facing apps — gradual rollout, error monitoring
Rolling: Internal services — K8s native, low risk
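A canary rollout with the 10% → 50% → 100% steps from the table might look like the sketch below. The `error_rate_at` callback and the 1% error threshold are assumptions for illustration; the source does not state which metric or threshold triggers a rollback.

```python
from typing import Callable, List

CANARY_STEPS = [10, 50, 100]  # traffic percentages from the strategy table


def canary_rollout(
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,  # assumed threshold, not from the source
) -> List[str]:
    """Shift traffic step by step and roll back on the first unhealthy step."""
    log: List[str] = []
    for pct in CANARY_STEPS:
        if error_rate_at(pct) > max_error_rate:
            log.append(f"rollback at {pct}%")
            return log
        log.append(f"promoted to {pct}%")
    return log
```

A real implementation would read the error rate from monitoring after a soak period at each step instead of from a callback.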
RELEASE MANAGEMENT:
Versioning: Semantic versioning (SemVer: MAJOR.MINOR.PATCH)
Release cadence:
Hotfix: As needed (critical bug — within 24 hours)
Patch: Weekly (minor bugs, improvements — every Tuesday)
Minor: Monthly (new features — first Monday of month)
Major: Quarterly (breaking changes — planned, communicated)
Release notes: Auto-generated (commit analysis + conventional commits)
Breaking changes: Highlighted, requires migration guide
Features: Categorized, linked to Jira/issue tracker
Bug fixes: Categorized, linked to resolved issues
Performance: Measured improvement (benchmark data)
Release approval:
Hotfix: Engineering lead + CTO (urgent)
Patch: Engineering lead (standard)
Minor: Engineering lead + product manager
Major: Engineering lead + product manager + CTO
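One way to derive the next SemVer version from conventional commits is sketched below. It assumes the common mapping (`fix` → PATCH, `feat` → MINOR, a `!` marker or `BREAKING CHANGE:` footer → MAJOR); `next_version` is a hypothetical helper, not part of any tool named above.

```python
import re
from typing import Iterable

# Matches headers such as "feat(api)!: message" or "fix: message".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: ")


def next_version(current: str, commit_messages: Iterable[str]) -> str:
    """Return the bumped MAJOR.MINOR.PATCH version for a set of commits."""
    major, minor, patch = (int(part) for part in current.split("."))
    level = 0  # 0 = no release, 1 = patch, 2 = minor, 3 = major
    for message in commit_messages:
        match = HEADER.match(message)
        if not match:
            continue  # ignore commits that do not follow the convention
        if match.group("bang") or "BREAKING CHANGE:" in message:
            level = max(level, 3)
        elif match.group("type") == "feat":
            level = max(level, 2)
        elif match.group("type") == "fix":
            level = max(level, 1)
    if level == 3:
        return f"{major + 1}.0.0"
    if level == 2:
        return f"{major}.{minor + 1}.0"
    if level == 1:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

The highest-impact commit in the batch decides the bump, which is why a single breaking change outweighs any number of fixes.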
ARTIFACT MANAGEMENT:
Container registry: AWS ECR (primary) + Harbor (backup)
Images: 48 services × 3 envs = 144 active images
Tagging: Version + commit hash + timestamp
Scanning: Trivy (vulnerability scan — every build)
Cleanup: Images >90 days (auto-delete — keep last 30 versions)
Storage: 2.5 TB (optimized with multi-layer caching)
Package repository: Nexus (internal)
Packages: 280 (internal libraries, shared modules)
Languages: npm, pip, Maven, Go modules
Proxy: Upstream repos (npmjs, PyPI, Maven Central)
Security: Known vulnerability blocking (deny on high/critical)
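The image cleanup rule above can be read as "delete images older than 90 days, but never touch the newest 30". The sketch below implements that reading, which is an assumption about how the two conditions combine; `images_to_delete` and the `(tag, pushed_at)` tuples are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def images_to_delete(
    images: List[Tuple[str, datetime]],  # (tag, pushed_at) pairs
    now: datetime,
    max_age_days: int = 90,
    keep_last: int = 30,
) -> List[str]:
    """Select tags older than max_age_days, excluding the newest keep_last."""
    newest_first = sorted(images, key=lambda image: image[1], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    # The newest keep_last images are protected regardless of age.
    return [tag for tag, pushed_at in newest_first[keep_last:] if pushed_at < cutoff]
```

In practice a registry lifecycle policy would usually enforce this, with a function like this one serving as a dry-run check.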
Infrastructure as Code (IaC)
Automated Infrastructure Provisioning
INFRASTRUCTURE AS CODE:
════════════════════════
IaC TOOLS:
Terraform: Cloud infrastructure (AWS, Azure, GCP)
Ansible: Configuration management (servers, apps)
Helm: Kubernetes application packaging
Kubernetes manifests: K8s resources (native YAML)
Packer: Custom AMI/image building
INFRASTRUCTURE INVENTORY (Terraform-managed):
┌──────────────────────────┬──────────┬──────────┐
│ Resource Type │ Count │ Managed │
├──────────────────────────┼──────────┼──────────┤
│ EC2 instances │ 45 │ 100% │
│ RDS databases │ 15 │ 100% │
│ S3 buckets │ 28 │ 100% │
│ VPCs + subnets │ 12 │ 100% │
│ Load balancers │ 18 │ 100% │
│ Security groups │ 85 │ 100% │
│ IAM roles/policies │ 120 │ 100% │
│ Route53 records │ 240 │ 100% │
│ CloudFront distributions │ 5 │ 100% │
│ Azure VMs │ 30 │ 100% │
│ Azure App Services │ 15 │ 100% │
│ Azure SQL │ 8 │ 100% │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ 621 │ 100% │
└──────────────────────────┴──────────┴──────────┘
State management:
Backend: AWS S3 + DynamoDB (locking)
State files: 12 (per environment + shared)
Encryption: AES-256 (at rest) + TLS (in transit)
Access: Terraform service account (restricted)
Backup: S3 versioning + cross-region replication
Drift detection:
Schedule: Daily (terraform plan — diff only)
Alert: Drift detected → notification + ticket
Auto-remediation: No (manual review required)
Drift incidents (January): 2 (both minor, corrected)
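A daily drift check can be built on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below wraps that call; `check_drift`, `classify_plan_exit`, and the injectable `runner` parameter are hypothetical, and alerting and ticket creation are left out.

```python
import subprocess
from typing import Callable


def classify_plan_exit(code: int) -> str:
    """Map the -detailed-exitcode result to a drift status."""
    if code == 0:
        return "in-sync"  # plan succeeded with an empty diff
    if code == 2:
        return "drift"    # plan succeeded with a non-empty diff
    return "error"        # exit code 1 or anything unexpected


def check_drift(workdir: str, runner: Callable = subprocess.run) -> str:
    """Run a read-only plan in workdir and report whether drift exists."""
    result = runner(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Because auto-remediation is disabled, a `"drift"` result would only raise the notification and ticket; nothing is applied.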
ANSIBLE CONFIGURATION MANAGEMENT:
Managed hosts: 165 (servers) + 45 (endpoints, partial)
Playbooks: 85 (server setup, app deployment, security hardening)
Roles: 42 (reusable, modular)
Inventory: Dynamic (AWS EC2 + Azure)
Facts cache: Redis (reduce API calls)
Execution: Parallel (forks: 50)
Execution schedule: Daily (convergence check)
Compliance check: Weekly (configuration drift)
Drift detected: <2% (auto-remediation available)
KUBERNETES MANAGEMENT:
Clusters: 3 (production, staging, development)
Nodes: 48 total (prod: 25, staging: 15, dev: 8)
Namespaces: 35 (one per service + shared)
Helm charts: 35 (one per service, versioned)
Deployment automation:
GitOps (ArgoCD): 35 services (declarative, sync)
Manifest apply: 5 shared services (manual trigger)
Auto-sync: Every 3 minutes (ArgoCD)
Rollback: 1-click (ArgoCD UI) or auto (health check)
K8s scaling:
HPA (Horizontal Pod Autoscaler): 35 services (CPU + custom metrics)
VPA (Vertical Pod Autoscaler): 35 services (memory recommendation)
Cluster Autoscaler: 3 clusters (node scaling)
Target utilization: 60-70% (CPU), 70-80% (memory)
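The Horizontal Pod Autoscaler derives its replica count from the ratio of the observed metric to the target: desired = ceil(current replicas × current metric / target metric), skipping changes when the ratio is within a tolerance of 1.0. The sketch below reproduces that calculation with a 65% CPU target taken from the range above; the replica bounds are assumed example values.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,   # Kubernetes default tolerance
    min_replicas: int = 1,    # assumed bounds for illustration
    max_replicas: int = 10,
) -> int:
    """Compute the HPA replica count for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas at 90% CPU against a 65% target scale out to 6, while 4 replicas at 66% stay unchanged.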
ENVIRONMENT MANAGEMENT:
Environments: 4 (production, staging, development, QA)
Parity: 95% (config-driven, minimal differences)
Provisioning time:
New environment: <30 minutes (Terraform + Ansible)
New service (all envs): <15 minutes (pipeline)
Environment tear-down: <5 minutes (Terraform destroy)
Environment isolation:
Network: Separate VPCs/subnets
Data: Staging/dev use anonymized or synthetic data
Access: Environment-specific IAM roles
Secrets: Separate vaults (environment-scoped)
Secret management: AWS Secrets Manager + HashiCorp Vault
Secrets: 380 (across all environments)
Rotation: Automatic (weekly for standard, daily for critical)
Access: Least privilege (service account scoped)
Audit: Full logging (who accessed what when)
INFRASTRUCTURE PIPELINE:
IaC change workflow:
1. Developer writes/updates Terraform (PR)
2. CI pipeline: terraform fmt, terraform validate, terrafmt
3. PR check: terraform plan (diff preview in PR comment)
4. Review: Team review (security, cost, architecture)
5. Approval: 2 approvals (infra team + stakeholder)
6. Merge: terraform apply (auto, staging first)
7. Validate: Smoke tests, monitoring check
8. Promote: terraform apply (production — manual approval)
Change statistics (January 2025):
IaC PRs: 45
Approved: 42 (93.3%)
Rejected: 3 (security review, cost concern)
Avg. review time: 4 hours
Avg. deployment time: 15 minutes (post-merge)
Drift post-deploy: 0 (IaC ensures consistency)
Pipeline Optimization
Performance & Efficiency
PIPELINE OPTIMIZATION:
═════════════════════
BUILD PERFORMANCE:
Total builds (January): 8,400 (280/day avg.)
Build success rate: 96.8%
Avg. build time: 4.2 minutes
Median build time: 3.1 minutes
P95 build time: 8.5 minutes (target: <10 min) ✓
P99 build time: 12.3 minutes (target: <15 min) ✓
Build caching:
Dependency cache: 85% hit rate (S3-based)
Layer cache (Docker): 92% hit rate
Compilation cache (Bazel/ccache): 78% hit rate
Impact: 40% reduction in build time (vs. no cache)
Build parallelization:
Matrix builds: 4 parallel runners (per service)
Monorepo splitting: Independent services build in parallel
Impact: 60% reduction in total pipeline time
Runner utilization: 72% (well-managed)
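Dependency cache hit rates depend largely on how the cache key is chosen. A common approach, sketched here as an assumption about the setup and not a description of it, is to key the cache on a hash of the lockfile so the cache is reused until dependencies actually change. `cache_key` and the `npm-linux` prefix are hypothetical.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Build a cache key that changes only when the lockfile changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"


# Same lockfile content gives the same key, so the cached dependencies are reused.
key_a = cache_key("npm-linux", b'{"lodash": "4.17.21"}')
key_b = cache_key("npm-linux", b'{"lodash": "4.17.21"}')
key_c = cache_key("npm-linux", b'{"lodash": "4.17.20"}')
```

Here `key_a` and `key_b` match while `key_c` differs, which forces a fresh dependency install only for the changed lockfile.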
COST OPTIMIZATION:
CI/CD costs (January):
GitHub Actions: $1,200 (840K minutes used)
Self-hosted runners: $800 (EC2 spot instances)
Container registry: $400 (ECR storage + transfers)
Testing infrastructure: $600 (staging environment)
──────────────────────────────────────
Total: $3,000/month
Cost reduction (implemented):
Spot instances: 60% of runners (spot + on-demand fallback)
Runner auto-scale: Scale to 0 when idle (savings: 30%)
Build cache: Reduce rebuild time (savings: $200/month)
Artifact cleanup: Delete old artifacts (savings: $150/month)
Pipeline efficiency metrics:
Deployment frequency: 45/day (avg.)
Lead time (commit → deploy): 18 minutes (avg.)
Change failure rate: 3.2% (target: <5%) ✓
Mean time to recovery (MTTR): 12 minutes (target: <15 min) ✓
DORA metrics: All Elite (>75th percentile)
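The four efficiency metrics above can be computed from per-deployment records. The sketch below assumes a hypothetical `Deployment` record with a lead time, a failure flag, and a recovery time; how those values are collected is outside its scope.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Deployment:
    lead_time_min: float       # commit to deploy, in minutes
    failed: bool               # True if the change caused a failure
    recovery_min: float = 0.0  # time to restore service, if it failed


def dora_summary(deploys: List[Deployment], days: int) -> Dict[str, float]:
    """Summarize deployment frequency, lead time, failure rate, and MTTR."""
    failures = [d for d in deploys if d.failed]
    return {
        "deploys_per_day": len(deploys) / days,
        "lead_time_min": mean(d.lead_time_min for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_min": mean(d.recovery_min for d in failures) if failures else 0.0,
    }
```

MTTR is averaged over failed deployments only, so a month without failures reports 0.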
QUALITY GATES:
Automated quality checks (every build):
1. Code coverage: >80% (unit), >60% (integration)
2. Code quality: SonarQube gate (no critical/blocker issues)
3. Security: SAST (no high/critical vulnerabilities)
4. Dependencies: No known CVEs (high/critical)
5. License compliance: No prohibited licenses
6. Performance: No regression (>20% slower)
7. Bundle size: No increase >10%
Gate statistics (January):
Builds passing all gates: 96.8%
Gate failures: 42 total (0.50% of 8,400 builds)
Coverage below threshold: 12 (0.14%)
Security vulnerability: 8 (0.10%)
Performance regression: 5 (0.06%)
License violation: 2 (0.02%)
Other: 15 (0.18%)
Gate effectiveness:
Production incidents from pipeline: 0 (gates effective)
Bugs caught in CI: 142 (vs. 12 in production — 12x improvement)
Developer feedback: Positive (catch issues early)
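The seven quality gates can be expressed as a single check over a build's metrics, as in the sketch below. The thresholds come from the list above; the metric key names and the `failed_gates` helper are hypothetical, and each value would come from the corresponding tool's report.

```python
from typing import Dict, List


def failed_gates(metrics: Dict[str, float]) -> List[str]:
    """Return the names of the gates a build fails, in list order."""
    checks = {
        "coverage": metrics["unit_coverage_pct"] > 80
        and metrics["integration_coverage_pct"] > 60,
        "code_quality": metrics["sonar_critical_or_blocker"] == 0,
        "security": metrics["sast_high_or_critical"] == 0,
        "dependencies": metrics["cve_high_or_critical"] == 0,
        "licenses": metrics["prohibited_licenses"] == 0,
        "performance": metrics["slowdown_pct"] <= 20,
        "bundle_size": metrics["bundle_increase_pct"] <= 10,
    }
    return [name for name, passed in checks.items() if not passed]


# Hypothetical metrics for a build that passes every gate.
passing_build = {
    "unit_coverage_pct": 84.0,
    "integration_coverage_pct": 65.0,
    "sonar_critical_or_blocker": 0,
    "sast_high_or_critical": 0,
    "cve_high_or_critical": 0,
    "prohibited_licenses": 0,
    "slowdown_pct": 3.0,
    "bundle_increase_pct": 1.5,
}
```

An empty result means the build may proceed; any returned name identifies the gate to report to the developer.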
RELEASE AUTOMATION METRICS:
Releases (January):
Hotfix: 3 (critical bugs — avg. 4 hours from detection)
Patch: 4 (weekly — avg. 45 minutes deploy time)
Minor: 1 (monthly — avg. 2 hours deploy + verification)
Major: 0 (planned for Q1 — March)
Release success rate: 97.4% (37/38 releases successful on first attempt)
Rollback rate: 2.6% (1 release; canary caught the issue and triggered auto-rollback)
Post-release incidents: 0 (all releases clean)
Customer-facing incidents (release-related): 0 ✓
Output
DevOps & CI/CD Dashboard
DEVOPS & CI/CD DASHBOARD — Jan 2025
══════════════════════════════════
Pipelines:
Total pipelines: 48 (35 microservices + 8 infra + 5 shared)
Builds/day: 280-350
Build success rate: 96.8%
Avg. build time: 4.2 min
Lead time (commit → deploy): 18 min
DORA metrics: All Elite
Deployment:
Deployment frequency: 45/day
Downtime: 0 minutes (all zero-downtime strategies)
Rollback rate: 2.6% (1 of 38 releases; auto-rollback effective)
Post-release incidents: 0
Blue-green: 12 services, Canary: 18, Rolling: 5
Infrastructure:
IaC coverage: 100% (621 resources)
Drift: 0 (post-deploy)
Environments: 4 (95% parity)
K8s clusters: 3 (48 nodes, 35 services)
Secrets: 380 (auto-rotation)
Quality:
Gates: 7 automated checks (every build)
Gate pass rate: 96.8%
Bugs caught in CI: 142 (12x vs. production)
Code coverage: >80% (unit), >60% (integration)
Security vulnerabilities blocked: 8
Cost:
CI/CD monthly cost: $3,000
Runner utilization: 72%
Runner auto-scale savings: 30%
Artifact cleanup: $150/month savings
Actions:
1. Jenkins migration completion (8 pipelines remaining; target: Q2 2025)
2. Build time optimization (P99: 12.3 → <10 min)
3. K8s cost optimization (rightsizing — March)
4. Release cadence review (quarterly — April)
5. Pipeline security review (annual — Q2)
Integration Points
- Version control (GitHub, GitLab, Bitbucket): Source code, PR workflow
- Container registry (ECR, Harbor, Docker Hub): Image storage, scanning
- Orchestration (Kubernetes, Docker Swarm, ECS): Deployment, scaling
- Configuration management (Ansible, Chef, Puppet): Server config
- IaC tools (Terraform, CloudFormation): Infrastructure provisioning
- Testing frameworks (JUnit, pytest, Cypress, k6): Automated testing
- Quality tools (SonarQube, Snyk, Trivy): Code quality, security
- Package repositories (Nexus, Artifactory): Dependency management
- Secret management (Vault, AWS Secrets Manager): Credential storage
- Monitoring (Datadog, Prometheus, Grafana): Post-deploy verification
- ITSM (ServiceNow, Jira): Change management, incident tracking
- Communication (Slack, Teams, PagerDuty): Deployment notifications
- Artifact stores (S3, GCS, Azure Blob): Build artifact storage
Edge Cases
- Build failure (intermittent): Flaky test identification; retry policy; test isolation; root cause
- Deployment rollback (production): Auto-rollback triggers; data migration reversal; monitoring validation
- Infrastructure drift: Terraform plan alert; manual review; state correction; prevention
- Pipeline bottleneck: Parallelization; caching optimization; runner scaling; queue monitoring
- Secret rotation (mid-deployment): Zero-downtime secret update; dual-secret period; validation
- K8s cluster failure: Multi-cluster failover; pod rescheduling; data persistence; recovery
- Dependency supply chain attack: Dependency pinning; SBOM; vulnerability scanning; vendor verification
- Container image vulnerability: Trivy scan; auto-rebuild; CVE tracking; patch priority
- Environment parity drift: Configuration audit; IaC enforcement; periodic sync
- Resource exhaustion (runner queue): Auto-scaling; spot instance fallback; priority queue; capacity
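For the intermittent build failure case, a retry policy can double as flaky test detection: a test that fails and then passes on retry is flagged as flaky and not silently accepted. The sketch below assumes a hypothetical `run_with_retry` helper and a limit of three attempts, neither of which comes from the source.

```python
from typing import Callable, Tuple


def run_with_retry(test: Callable[[], bool], max_attempts: int = 3) -> Tuple[str, int]:
    """Run a test up to max_attempts times and classify the outcome."""
    for attempt in range(1, max_attempts + 1):
        if test():
            # Passing only after a retry marks the test as flaky.
            status = "passed" if attempt == 1 else "flaky"
            return status, attempt
    return "failed", max_attempts
```

Flaky results would then feed the root cause and test isolation work, so retries do not hide a real defect.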