DevOps Culture & Practices
Establish and mature a DevOps culture, including continuous delivery, infrastructure automation, monitoring culture, blameless postmortems, team collaboration, and DORA metrics.
Workflow
1. DORA Metrics & DevOps Maturity
DORA METRICS (Four Key Measures)
═══════════════════════════════════════
Metric                       Elite         High            Medium            Low          Current
─────────────────────────────────────────────────────────────────────────────────────────────────
Deployment Frequency         Multiple/day  1x/day-1x/week  1x/week-1x/month  1x/quarter+  15x/day
Lead Time for Changes        < 1 hour      < 1 day         < 1 week          < 1 month    25 min
MTTR (Mean Time to Recover)  < 1 hour      < 4 hours       < 1 day           > 1 day      45 min
Change Failure Rate          0-15%         15-30%          30-45%            45-60%       8%
Current Quartile: ELITE ✓
Historical Trend:
═══════════════════════════════════════
Quarter  Deploy Freq  Lead Time  MTTR       Change Fail  Quartile
─────────────────────────────────────────────────────────────────
Q1 2023  1x/week      2 days     6 hours    18%          High
Q2 2023  3x/week      12 hours   4 hours    15%          High
Q3 2023  5x/week      4 hours    3 hours    12%          High
Q4 2023  2x/day       2 hours    2 hours    10%          Elite
Q1 2024  5x/day       1 hour     1.5 hours  9%           Elite
Q2 2024  8x/day       45 min     1 hour     8%           Elite
Q3 2024  12x/day      30 min     1 hour     7%           Elite
Q4 2024  15x/day      25 min     45 min     8%           Elite
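The thresholds in the DORA table can be expressed as a small per-metric classifier. The following is a minimal sketch, not part of any DORA tooling: the names `DoraSnapshot` and `classify` are illustrative, time values are assumed to be in hours, deployment frequency in deploys per day, and a change failure rate that lands exactly on a boundary (for example 15%) is assumed to count toward the better tier, since the table does not say.

```python
from dataclasses import dataclass

TIERS = ("Elite", "High", "Medium", "Low")  # ordered best to worst


@dataclass(frozen=True)
class DoraSnapshot:
    deploys_per_day: float
    lead_time_hours: float
    mttr_hours: float
    change_failure_pct: float


def _tier(value: float, limits: tuple, inclusive: bool = False) -> str:
    """Return the first tier whose upper limit the value falls under."""
    for tier, limit in zip(TIERS, limits):
        if value < limit or (inclusive and value == limit):
            return tier
    return "Low"


def classify(s: DoraSnapshot) -> dict:
    """Map each of the four metrics to a tier using the table's thresholds."""
    if s.deploys_per_day > 1:          # "Multiple/day"
        freq = "Elite"
    elif s.deploys_per_day >= 1 / 7:   # between daily and weekly
        freq = "High"
    elif s.deploys_per_day >= 1 / 30:  # between weekly and monthly
        freq = "Medium"
    else:
        freq = "Low"
    return {
        "deployment_frequency": freq,
        "lead_time": _tier(s.lead_time_hours, (1, 24, 168)),
        "mttr": _tier(s.mttr_hours, (1, 4, 24)),
        # Assumption: boundary values (15%, 30%, 45%) go to the better tier.
        "change_failure_rate": _tier(s.change_failure_pct, (15, 30, 45), inclusive=True),
    }
```

Calling `classify(DoraSnapshot(15, 25 / 60, 45 / 60, 8))` with the Q4 2024 figures returns "Elite" for all four metrics. How the per-metric tiers combine into a single overall rating is left out on purpose, because the document does not define that rule.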
DEVOPS MATURITY MODEL:
═══════════════════════════════════════
Level 1: Initial (Ad-hoc)
→ Manual deployments
→ No automation
→ Siloed teams
→ Reactive monitoring
Level 2: Repeatable (Basic Automation)
→ CI pipeline (automated build + test)
→ Scripted deployments
→ Basic monitoring
→ Some collaboration
Level 3: Defined (Standard Practices)
→ CD pipeline (automated deployment)
→ IaC (Terraform)
→ Structured incident response
→ Shared ownership
Level 4: Managed (Optimized)
→ Continuous deployment
→ Full observability
→ Blameless culture
→ DORA metrics tracked
Level 5: Optimizing (Industry Leading)
→ Self-healing infrastructure
→ AI-assisted operations
→ Zero-touch deployments
→ Elite DORA metrics
Current Level: 4 (Managed, approaching 5)
2. Continuous Delivery Pipeline
CONTINUOUS DELIVERY PIPELINE
═══════════════════════════════════════
Commit → Build → Test → Security → Stage → Approve → Deploy → Verify → Monitor
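The stage order above can be read as a fail-fast sequence: a later stage runs only if every earlier one passed. A minimal sketch of that idea follows; `run_pipeline` and the `checks` mapping are hypothetical names, and treating a stage with no registered check as passing is an assumption made to keep the example short.

```python
from typing import Callable, Dict, List, Tuple

STAGES = ("Commit", "Build", "Test", "Security", "Stage",
          "Approve", "Deploy", "Verify", "Monitor")


def run_pipeline(
    checks: Dict[str, Callable[[], bool]],
    stages: Tuple[str, ...] = STAGES,
) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first one whose check fails.

    Returns (succeeded, names of the stages that completed).
    """
    completed: List[str] = []
    for name in stages:
        check = checks.get(name, lambda: True)  # assumption: no check means pass
        if not check():
            return False, completed
        completed.append(name)
    return True, completed
```

For example, `run_pipeline({"Security": lambda: False})` stops before staging and reports that only Commit, Build, and Test completed.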
Stage Details:
═══════════════════════════════════════
1. Commit (Developer)
→ Pre-commit hooks (lint, format, secrets scan)
→ Conventional commit messages
→ Branch protection (required status checks)
2. Build (CI)
→ Install dependencies
→ Compile (TypeScript → JavaScript)
→ Lint (ESLint, Prettier)
→ Type check (TypeScript)
→ Build artifacts (Docker image)
3. Test (CI)
→ Unit tests (Jest, ≥80% coverage)
→ Integration tests (Testcontainers)
→ Code coverage report
→ Performance tests (critical paths)
4. Security (CI)
→ SAST (SonarQube)
→ Dependency scan (npm audit, Snyk)
→ Secret scan (git-secrets, truffleHog)
→ Container scan (Trivy)
5. Stage (CD)
→ Deploy to staging (identical to prod)
→ E2E tests (Cypress)
→ Smoke tests
→ Performance baseline (k6)
6. Approve (Gate)
→ Manual approval (production)
→ Automated approval (staging, dev)
→ Approval policy: 2 engineers + manager
7. Deploy (CD)
→ Canary (5% → 25% → 50% → 100%)
→ Rolling update (zero downtime)
→ Health checks (readiness, liveness)
→ Rollback (automatic on failure)
8. Verify (Post-Deploy)
→ Smoke tests (automated)
→ Error rate monitoring (1 hour)
→ Performance monitoring (1 hour)
→ Business metrics (conversion, revenue)
9. Monitor (Ongoing)
→ APM (Datadog, New Relic)
→ Logs (ELK, CloudWatch)
→ Traces (distributed tracing)
→ Alerts (PagerDuty, Slack)
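The canary progression in the Deploy stage (5% → 25% → 50% → 100%, with automatic rollback on failure) can be sketched as a simple loop. This is an illustration under stated assumptions, not a real deployment API: `set_traffic` and `is_healthy` are hypothetical callables that would wrap whatever traffic-shifting tool and health checks are in use.

```python
from typing import Callable, Tuple

CANARY_STEPS = (5, 25, 50, 100)  # percent of traffic sent to the new version


def canary_rollout(
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Tuple[int, ...] = CANARY_STEPS,
) -> bool:
    """Shift traffic step by step; roll back to 0% on the first failed check.

    Returns True if the new version reached the final step, False otherwise.
    """
    for percent in steps:
        set_traffic(percent)
        if not is_healthy():
            set_traffic(0)  # automatic rollback: all traffic to the old version
            return False
    return True
```

In practice `is_healthy` would likely wait for a soak period and compare error rate and latency against the baseline before each promotion; the sketch leaves that policy to the caller.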
3. Blameless Postmortem Culture
BLAMELESS POSTMORTEM PROCESS
═══════════════════════════════════════
When to Conduct:
═══════════════════════════════════════
→ P1 incident (system down, data loss)
→ P2 incident (major feature broken)
→ SLA breach (significant impact)
→ Near-miss (caught before impact)
→ Customer-impacting bug (high visibility)
Postmortem Template:
═══════════════════════════════════════
# Incident: [Title]
## Summary
What happened: [Brief description, 2-3 sentences]
When: [Date/time, duration]
Impact: [Customers affected, revenue impact, data impact]
Severity: [P1/P2/P3]
## Timeline
[HH:MM] Incident detected (monitoring/alert)
[HH:MM] Incident declared (P1/P2)
[HH:MM] On-call engineer engaged
[HH:MM] Root cause identified
[HH:MM] Fix deployed
[HH:MM] Incident resolved
[HH:MM] Postmortem initiated
## Root Cause
[Technical root cause — be specific]
[Contributing factors — systemic issues]
## Impact
→ Customers affected: [Number/percentage]
→ Duration: [X hours/minutes]
→ Revenue impact: [$amount, if applicable]
→ Data impact: [None/minimal/significant]
→ Customer complaints: [Number]
## What Went Well
→ [Positive observation 1]
→ [Positive observation 2]
## What Could Be Improved
→ [Area for improvement 1]
→ [Area for improvement 2]
## Action Items
Action                     Owner    Due Date  Status
─────────────────────────────────────────────────────────
Add monitoring for X       Alice    MM/DD     In Progress
Improve alert threshold    Bob      MM/DD     Planned
Update runbook for Y       Charlie  MM/DD     Done
Implement circuit breaker  Diana    MM/DD     Planned
## Lessons Learned
→ [Key lesson 1]
→ [Key lesson 2]
KEY PRINCIPLES:
═══════════════════════════════════════
→ Blameless: Focus on process, not people
→ Transparent: Share with entire organization
→ Action-oriented: Every postmortem has action items
→ Timely: Within 48 hours of incident resolution
→ Follow-up: Track action items to completion
→ Celebrate: Recognize learnings, not just failures
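Two of the principles above, "Timely" and "Follow-up", are easy to check mechanically. The sketch below assumes a hypothetical in-memory `Postmortem` record; the class and field names are illustrative, and the status values mirror the Action Items table in the template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

POSTMORTEM_WINDOW = timedelta(hours=48)  # from the "Timely" principle


@dataclass
class ActionItem:
    description: str
    owner: str
    status: str = "Planned"  # Planned | In Progress | Done


@dataclass
class Postmortem:
    title: str
    resolved_at: datetime
    held_at: Optional[datetime] = None
    action_items: List[ActionItem] = field(default_factory=list)

    def is_overdue(self, now: datetime) -> bool:
        """True if the postmortem was not held within 48 hours of resolution."""
        deadline = self.resolved_at + POSTMORTEM_WINDOW
        if self.held_at is not None:
            return self.held_at > deadline
        return now > deadline

    def open_items(self) -> List[ActionItem]:
        """Action items that still need follow-up."""
        return [item for item in self.action_items if item.status != "Done"]
```

A weekly report could list every postmortem where `is_overdue(now)` is true or `open_items()` is non-empty, which keeps follow-up visible without assigning blame to individuals.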
4. Team Collaboration & Practices
DEVOPS COLLABORATION PRACTICES
═══════════════════════════════════════
You Build, You Run:
═══════════════════════════════════════
→ Development teams own their services in production
→ On-call rotation (shared responsibility)
→ Runbooks maintained by development teams
→ Monitoring alerts routed to service owners
→ Postmortems led by service owners
Benefits:
→ Faster incident response (domain expertise)
→ Better software quality (empathy for operations)
→ Reduced handoff delays (no silos)
→ Continuous improvement (direct feedback)
CEREMONIES:
═══════════════════════════════════════
Ceremony              Frequency  Duration  Participants             Purpose
────────────────────────────────────────────────────────────────────────────────────────
Daily standup         Daily      15 min    Dev team                 Status, blockers
Sprint planning       Bi-weekly  2 hours   Dev team + PO            Scope, estimate
Sprint review         Bi-weekly  1 hour    Dev team + stakeholders  Demo, feedback
Sprint retrospective  Bi-weekly  1 hour    Dev team                 Improvements
Incident review       As needed  1 hour    Incident team            Blameless postmortem
Tech debt review      Monthly    1 hour    Dev team + lead          Backlog, priorities
Architecture review   Quarterly  2 hours   Tech leads               Roadmap, decisions
Security review       Monthly    1 hour    Security + dev           Vulns, compliance
SHARED RESPONSIBILITIES:
═══════════════════════════════════════
Development Team:
→ Write code + tests
→ Maintain CI/CD pipeline
→ Write and maintain documentation
→ On-call rotation
→ Incident response (own services)
Platform Team:
→ Provide shared infrastructure
→ Maintain CI/CD platform
→ Provide monitoring tools
→ Security guardrails
→ Developer experience (DX)
Operations/SRE:
→ Site reliability
→ Capacity planning
→ Disaster recovery
→ Performance optimization
→ On-call (infrastructure)
5. DevOps Metrics & Reporting
DEVOPS METRICS DASHBOARD
═══════════════════════════════════════
Delivery Metrics:
═══════════════════════════════════════
Metric                Target    Current  Trend  Status
───────────────────────────────────────────────────────
Deployment frequency  ≥ 1/day   15/day   ↑      ✓ Elite
Lead time             ≤ 1 hour  25 min   ↓      ✓ Elite
Change failure rate   ≤ 15%     8%       →      ✓ Elite
MTTR                  ≤ 1 hour  45 min   ↓      ✓ Elite
Build success rate    ≥ 95%     98%      →      ✓ Good
Test coverage         ≥ 80%     84%      ↑      ✓ Good
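The first four delivery metrics can be derived from deployment records rather than entered by hand. The sketch below is one possible calculation under explicit assumptions: the `Deployment` record and its fields are hypothetical, lead time is summarised as a median, and recovery time is measured from the failed deployment to service restoration. The document does not specify any of these choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                   # first commit in the change
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # caused an incident or rollback
    restored_at: Optional[datetime] = None   # when service was restored


def delivery_metrics(deploys: List[Deployment], window: timedelta) -> Dict[str, float]:
    """Summarise the four DORA measures for deployments seen within `window`."""
    if not deploys or window <= timedelta(0):
        raise ValueError("need at least one deployment and a positive window")
    days = window / timedelta(days=1)
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 60 for d in deploys
    ]
    recoveries = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    failures = sum(1 for d in deploys if d.failed)
    return {
        "deploys_per_day": len(deploys) / days,
        "lead_time_min": median(lead_times),
        "change_failure_pct": 100 * failures / len(deploys),
        # Reported as 0.0 when nothing failed in the window.
        "mttr_min": sum(recoveries) / len(recoveries) if recoveries else 0.0,
    }
```

Computing the figures from raw events makes the dashboard reproducible and keeps the Current column consistent with the historical trend.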
Reliability Metrics:
═══════════════════════════════════════
Metric                Target    Current  Trend  Status
───────────────────────────────────────────────────────
Uptime                ≥ 99.9%   99.95%   →      ✓ Good
Error budget          ≥ 99.9%   99.97%   →      ✓ Good
P1 incidents/month    ≤ 2       1        ↓      ✓ Good
P2 incidents/month    ≤ 5       3        ↓      ✓ Good
MTTD (detect)         ≤ 5 min   2 min    ↓      ✓ Good
MTTA (acknowledge)    ≤ 15 min  8 min    ↓      ✓ Good
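An availability target translates directly into an error budget: the share of the period in which the service is allowed to be down. The worked sketch below uses the Uptime row (99.9% target, 99.95% measured) and assumes a 30-day window, which the table does not state; the function names are illustrative.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_pct) / 100


def budget_consumed_pct(slo_pct: float, measured_pct: float) -> float:
    """Share of the error budget used, given the measured availability."""
    allowed = 100 - slo_pct
    if allowed <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 100 * (100 - measured_pct) / allowed
```

With those figures, a 99.9% target over 30 days allows about 43.2 minutes of downtime, and 99.95% measured uptime corresponds to roughly 21.6 minutes, so about half of the budget is consumed.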
Cost Metrics:
═══════════════════════════════════════
Metric                Target    Current  Trend  Status
───────────────────────────────────────────────────────
Cloud cost/month      ≤ $50K    $42K     →      ✓ Good
Cost per deploy       ≤ $5      $2       ↓      ✓ Good
Resource utilization  60-80%    72%      →      ✓ Good
Waste (idle)          ≤ 5%      3%       ↓      ✓ Good
Edge Cases
- Regulated industries: Compliance-approved deployment process
- Global teams: Async collaboration, timezone coverage
- Large organizations: Platform team, guild model
- Startups: Wear many hats, rapid iteration
- Legacy migration: Strangler pattern, gradual modernization
Integration Points
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
- Monitoring: Datadog, New Relic, Prometheus, Grafana
- Incident: PagerDuty, Opsgenie, Incident.io
- Collaboration: Slack, Jira, Confluence, Notion
- IaC: Terraform, CloudFormation, Pulumi
- Container: Docker, Kubernetes, Helm
Output
DevOps Status
DEVOPS STATUS — Q4 2024
═══════════════════════════════════════
DORA quartile: Elite (all 4 metrics)
Deployments/day: 15 (↑ from 12 in Q3)
Lead time: 25 min (target: <60 min) ✓
Change failure rate: 8% (target: ≤15%) ✓
MTTR: 45 min (target: ≤60 min) ✓
Incidents: 1 P1, 3 P2 (↓ from Q3)
Error budget: 99.97% (healthy) ✓
Cloud cost: $42K/month (within budget) ✓
Next priority: Implement self-healing, improve test coverage to 90%