---
name: devops-culture-practices
description: Establish and mature DevOps culture including continuous delivery, infrastructure automation, monitoring culture, blameless postmortems, team collaboration, and DevOps metrics (DORA). Use when building DevOps practices, implementing continuous delivery, establishing blameless culture, measuring DevOps maturity, or improving team collaboration. Triggers on phrases like "DevOps culture", "DevOps practices", "continuous delivery", "DORA metrics", "blameless postmortem", "DevOps maturity", "deployment frequency", "lead time", "MTTR", "change failure rate", "site reliability", "SRE", "you build you run", "shift left", "trunk-based development", "feature flags", "continuous deployment".
---

# DevOps Culture & Practices

Establish and mature DevOps culture including continuous delivery, infrastructure automation, monitoring culture, blameless postmortems, and DORA metrics.

## Workflow

### 1. DORA Metrics & DevOps Maturity

```
DORA METRICS (Four Key Measures)
═══════════════════════════════════════

Metric                  Elite          High             Medium             Low           Current
────────────────────────────────────────────────────────────────────────────────────────────────
Deployment Frequency    Multiple/day   1x/day-1x/week   1x/week-1x/month   1x/quarter+   15x/day
Lead Time for Changes   < 1 hour       < 1 day          < 1 week           < 1 month     25 min
MTTR (Mean Time to      < 1 hour       < 4 hours        < 1 day            > 1 day       45 min
  Recover)
Change Failure Rate     0-15%          15-30%           30-45%             45-60%        8%

  Current Performance Level: ELITE ✓

  Historical Trend:
═══════════════════════════════════════

  Quarter     Deploy Freq   Lead Time   MTTR       Change Fail  Level
  ──────────────────────────────────────────────────────────────────────────
  Q1 2023     1x/week       2 days      6 hours    18%          Medium
  Q2 2023     3x/week       12 hours    4 hours    15%          High
  Q3 2023     5x/week       4 hours     3 hours    12%          High
  Q4 2023     2x/day        2 hours     2 hours    10%          High
  Q1 2024     5x/day        1 hour      1.5 hours  9%           High
  Q2 2024     8x/day        45 min      1 hour     8%           Elite
  Q3 2024     12x/day       30 min      1 hour     7%           Elite
  Q4 2024     15x/day       25 min      45 min     8%           Elite

DEVOPS MATURITY MODEL:
═══════════════════════════════════════

  Level 1: Initial (Ad-hoc)
    → Manual deployments
    → No automation
    → Siloed teams
    → Reactive monitoring

  Level 2: Repeatable (Basic Automation)
    → CI pipeline (automated build + test)
    → Scripted deployments
    → Basic monitoring
    → Some collaboration

  Level 3: Defined (Standard Practices)
    → CD pipeline (automated deployment)
    → IaC (Terraform)
    → Structured incident response
    → Shared ownership

  Level 4: Managed (Optimized)
    → Continuous deployment
    → Full observability
    → Blameless culture
    → DORA metrics tracked

  Level 5: Optimizing (Industry Leading)
    → Self-healing infrastructure
    → AI-assisted operations
    → Zero-touch deployments
    → Elite DORA metrics

  Current Level: 4 (Managed, approaching 5)
```
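The four measures above can be derived from deployment records. The following is a minimal sketch, not a reference implementation: the `Deployment` record shape, the use of a mean rather than a median, the inclusive boundaries, and the rule that the overall level is the worst of the four per-metric levels are all assumptions made for illustration. The thresholds are copied from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TIERS = ["Elite", "High", "Medium", "Low"]


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    commit_time: datetime                     # change committed
    deploy_time: datetime                     # change live in production
    failed: bool = False                      # caused a production failure?
    restored_time: Optional[datetime] = None  # service restored, if it failed


def _mean_hours(durations: list[timedelta]) -> float:
    total = sum(durations, timedelta())
    return total.total_seconds() / 3600 / len(durations)


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four key measures over one reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deploy_time - d.commit_time for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_time - d.deploy_time
                  for d in failures if d.restored_time is not None]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "lead_time_hours": _mean_hours(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None means no recovered failure was recorded in the period.
        "mttr_hours": _mean_hours(recoveries) if recoveries else None,
    }


def _tier(value: float, upper_bounds: list[float]) -> int:
    """Index of the first tier whose inclusive upper bound covers the value."""
    for index, bound in enumerate(upper_bounds):
        if value <= bound:
            return index
    return len(upper_bounds)  # worse than every bound: Low


def classify(metrics: dict) -> str:
    """Overall level is the worst of the four per-metric levels (assumption)."""
    per_day = metrics["deploys_per_day"]
    if per_day > 1:            # multiple deployments per day
        frequency = 0
    elif per_day >= 1 / 7:     # between once a day and once a week
        frequency = 1
    elif per_day >= 1 / 30:    # between once a week and once a month
        frequency = 2
    else:
        frequency = 3
    mttr = metrics["mttr_hours"]
    levels = [
        frequency,
        _tier(metrics["lead_time_hours"], [1, 24, 24 * 7]),
        _tier(metrics["change_failure_rate"], [0.15, 0.30, 0.45]),
        0 if mttr is None else _tier(mttr, [1, 4, 24]),
    ]
    return TIERS[max(levels)]
```

With the Q4 2024 figures from the trend table (15 deployments/day, 25 min lead time, 8% change failure rate, 45 min MTTR), `classify` returns `"Elite"`. Taking the worst per-metric level keeps one strong metric from masking a weak one.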

### 2. Continuous Delivery Pipeline

```
CONTINUOUS DELIVERY PIPELINE
═══════════════════════════════════════

  Commit → Build → Test → Security → Stage → Approve → Deploy → Verify → Monitor

  Stage Details:
═══════════════════════════════════════

  1. Commit (Developer)
    → Pre-commit hooks (lint, format, secrets scan)
    → Conventional commit messages
    → Branch protection (required status checks)

  2. Build (CI)
    → Install dependencies
    → Compile (TypeScript → JavaScript)
    → Lint (ESLint, Prettier)
    → Type check (TypeScript)
    → Build artifacts (Docker image)

  3. Test (CI)
    → Unit tests (Jest, ≥80% coverage)
    → Integration tests (Testcontainers)
    → Code coverage report
    → Performance tests (critical paths)

  4. Security (CI)
    → SAST (SonarQube)
    → Dependency scan (npm audit, Snyk)
    → Secret scan (git-secrets, truffleHog)
    → Container scan (Trivy)

  5. Stage (CD)
    → Deploy to staging (identical to prod)
    → E2E tests (Cypress)
    → Smoke tests
    → Performance baseline (k6)

  6. Approve (Gate)
    → Manual approval (production)
    → Automated approval (staging, dev)
    → Approval policy: 2 engineers + manager

  7. Deploy (CD)
    → Canary (5% → 25% → 50% → 100%)
    → Rolling update (zero downtime)
    → Health checks (readiness, liveness)
    → Rollback (automatic on failure)

  8. Verify (Post-Deploy)
    → Smoke tests (automated)
    → Error rate monitoring (1 hour)
    → Performance monitoring (1 hour)
    → Business metrics (conversion, revenue)

  9. Monitor (Ongoing)
    → APM (Datadog, New Relic)
    → Logs (ELK, CloudWatch)
    → Traces (distributed tracing)
    → Alerts (PagerDuty, Slack)
```
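The Deploy stage's canary progression with automatic rollback can be sketched as a small loop. This is an illustrative outline only: `set_traffic` and `is_healthy` are hypothetical callbacks standing in for whatever the deployment tool and monitoring stack actually expose, and a real rollout would also wait between steps while metrics accumulate.

```python
from typing import Callable, Sequence

# Traffic percentages from the Deploy stage above.
CANARY_STEPS = (5, 25, 50, 100)


def run_canary(set_traffic: Callable[[int], None],
               is_healthy: Callable[[int], bool],
               steps: Sequence[int] = CANARY_STEPS) -> bool:
    """Shift traffic to the new version step by step.

    Returns True if every step passed its health check. On the first
    failed check, traffic to the new version is set back to 0% and the
    function returns False.
    """
    for percent in steps:
        set_traffic(percent)
        if not is_healthy(percent):
            set_traffic(0)  # automatic rollback
            return False
    return True
```

Passing the health check in as a callback keeps the rollout logic independent of any particular monitoring tool, so the same loop can be exercised in tests with a stub.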

### 3. Blameless Postmortem Culture

```
BLAMELESS POSTMORTEM PROCESS
═══════════════════════════════════════

When to Conduct:
═══════════════════════════════════════

  → P1 incident (system down, data loss)
  → P2 incident (major feature broken)
  → SLA breach (significant impact)
  → Near-miss (caught before impact)
  → Customer-impacting bug (high visibility)

Postmortem Template:
═══════════════════════════════════════

  # Incident: [Title]

  ## Summary
  What happened: [Brief description, 2-3 sentences]
  When: [Date/time, duration]
  Impact: [Customers affected, revenue impact, data impact]
  Severity: [P1/P2/P3]

  ## Timeline
  [HH:MM] Incident detected (monitoring/alert)
  [HH:MM] Incident declared (P1/P2)
  [HH:MM] On-call engineer engaged
  [HH:MM] Root cause identified
  [HH:MM] Fix deployed
  [HH:MM] Incident resolved
  [HH:MM] Postmortem initiated

  ## Root Cause
  [Technical root cause — be specific]
  [Contributing factors — systemic issues]

  ## Impact
  → Customers affected: [Number/percentage]
  → Duration: [X hours/minutes]
  → Revenue impact: [$amount, if applicable]
  → Data impact: [None/minimal/significant]
  → Customer complaints: [Number]

  ## What Went Well
  → [Positive observation 1]
  → [Positive observation 2]

  ## What Could Be Improved
  → [Area for improvement 1]
  → [Area for improvement 2]

  ## Action Items
  Action                          Owner       Due Date    Status
  ────────────────────────────────────────────────────────────────────────
  Add monitoring for X            Alice       MM/DD       In Progress
  Improve alert threshold         Bob         MM/DD       Planned
  Update runbook for Y            Charlie     MM/DD       Done
  Implement circuit breaker       Diana       MM/DD       Planned

  ## Lessons Learned
  → [Key lesson 1]
  → [Key lesson 2]

KEY PRINCIPLES:
═══════════════════════════════════════

  → Blameless: Focus on process, not people
  → Transparent: Share with entire organization
  → Action-oriented: Every postmortem has action items
  → Timely: Within 48 hours of incident resolution
  → Follow-up: Track action items to completion
  → Celebrate: Recognize learnings, not just failures
```
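Three of the key principles above (timely, action-oriented, follow-up) are mechanical enough to check automatically. The sketch below assumes a hypothetical record shape mirroring the template's action item table; the field names and status strings are illustrative, not part of any tool.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta


@dataclass
class ActionItem:
    action: str
    owner: str
    due: date
    status: str  # "Planned", "In Progress", or "Done"


@dataclass
class Postmortem:
    title: str
    resolved_at: datetime  # when the incident was resolved
    held_at: datetime      # when the postmortem took place
    action_items: list[ActionItem] = field(default_factory=list)


def review(pm: Postmortem, today: date) -> list[str]:
    """Return process gaps found in a postmortem record."""
    problems = []
    # Timely: within 48 hours of incident resolution.
    if pm.held_at - pm.resolved_at > timedelta(hours=48):
        problems.append("postmortem held more than 48 hours after resolution")
    # Action-oriented: every postmortem has action items.
    if not pm.action_items:
        problems.append("no action items recorded")
    # Follow-up: track action items to completion.
    for item in pm.action_items:
        if item.status != "Done" and item.due < today:
            problems.append(f"overdue: {item.action} (owner: {item.owner})")
    return problems
```

The owner field exists for follow-up tracking only; the findings describe gaps in the process, in keeping with the blameless principle.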

### 4. Team Collaboration & Practices

```
DEVOPS COLLABORATION PRACTICES
═══════════════════════════════════════

You Build, You Run:
═══════════════════════════════════════

  → Development teams own their services in production
  → On-call rotation (shared responsibility)
  → Runbooks maintained by development teams
  → Monitoring alerts routed to service owners
  → Postmortems led by service owners

  Benefits:
    → Faster incident response (domain expertise)
    → Better software quality (empathy for operations)
    → Reduced handoff delays (no silos)
    → Continuous improvement (direct feedback)

CEREMONIES:
═══════════════════════════════════════

  Ceremony               Frequency      Duration   Participants             Purpose
  ──────────────────────────────────────────────────────────────────────────────────────────────
  Daily standup          Daily          15 min     Dev team                 Status, blockers
  Sprint planning        Every 2 weeks  2 hours    Dev team + PO            Scope, estimate
  Sprint review          Every 2 weeks  1 hour     Dev team + stakeholders  Demo, feedback
  Sprint retrospective   Every 2 weeks  1 hour     Dev team                 Improvements
  Incident review        As needed      1 hour     Incident team            Blameless postmortem
  Tech debt review       Monthly        1 hour     Dev team + lead          Backlog, priorities
  Architecture review    Quarterly      2 hours    Tech leads               Roadmap, decisions
  Security review        Monthly        1 hour     Security + dev           Vulns, compliance

SHARED RESPONSIBILITIES:
═══════════════════════════════════════

  Development Team:
    → Write code + tests
    → Maintain CI/CD pipeline
    → Write and maintain documentation
    → On-call rotation
    → Incident response (own services)

  Platform Team:
    → Provide shared infrastructure
    → Maintain CI/CD platform
    → Provide monitoring tools
    → Security guardrails
    → Developer experience (DX)

  Operations/SRE:
    → Site reliability
    → Capacity planning
    → Disaster recovery
    → Performance optimization
    → On-call (infrastructure)
```
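"Monitoring alerts routed to service owners" implies an ownership map that the alerting system consults. A minimal sketch follows; the service names, rotation names, and the `kind` parameter are all hypothetical, and real incident tools typically express this as routing configuration rather than code.

```python
from typing import Optional

# Hypothetical ownership map: service name -> owning team's on-call rotation.
SERVICE_OWNERS = {
    "checkout-api": "payments-oncall",
    "search-api": "discovery-oncall",
}
INFRA_ONCALL = "sre-oncall"  # hypothetical rotation for infrastructure alerts


def route_alert(source: str, kind: str = "service",
                owners: Optional[dict[str, str]] = None) -> str:
    """Return the on-call rotation that should be paged for an alert."""
    owners = SERVICE_OWNERS if owners is None else owners
    if kind == "infrastructure":
        # Infrastructure alerts go to Operations/SRE on-call.
        return INFRA_ONCALL
    if source not in owners:
        # An unowned service is itself a gap worth fixing, so fail loudly.
        raise LookupError(f"no owner registered for service {source!r}")
    return owners[source]
```

Raising on an unknown service, rather than silently falling back to a shared rotation, surfaces ownership gaps before an incident does.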

### 5. DevOps Metrics & Reporting

```
DEVOPS METRICS DASHBOARD
═══════════════════════════════════════

Delivery Metrics:
═══════════════════════════════════════

  Metric                Target      Current      Trend    Status
  ────────────────────────────────────────────────────────────────────────
  Deployment frequency  ≥ 1/day     15/day       ↑        ✓ Elite
  Lead time             ≤ 1 hour    25 min       ↓        ✓ Elite
  Change failure rate   ≤ 15%       8%           →        ✓ Elite
  MTTR                  ≤ 1 hour    45 min       ↓        ✓ Elite
  Build success rate    ≥ 95%       98%          →        ✓ Good
  Test coverage         ≥ 80%       84%          ↑        ✓ Good

Reliability Metrics:
═══════════════════════════════════════

  Metric              Target        Current      Trend    Status
  ────────────────────────────────────────────────────────────────────────
  Uptime              ≥ 99.9%       99.95%       →        ✓ Good
  Error budget        ≥ 99.9%       99.97%       →        ✓ Good
  P1 incidents/month  ≤ 2           1            ↓        ✓ Good
  P2 incidents/month  ≤ 5           3            ↓        ✓ Good
  MTTD (detect)       ≤ 5 min       2 min        ↓        ✓ Good
  MTTA (acknowledge)  ≤ 15 min      8 min        ↓        ✓ Good

Cost Metrics:
═══════════════════════════════════════

  Metric                Target        Current      Trend    Status
  ────────────────────────────────────────────────────────────────────────
  Cloud cost/month      ≤ $50K        $42K         →        ✓ Good
  Cost per deploy       ≤ $5          $2           ↓        ✓ Good
  Resource utilization  60-80%        72%          →        ✓ Good
  Waste (idle)          ≤ 5%          3%           ↓        ✓ Good
```
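The reliability table lists uptime against a 99.9% target. One common way to express an error budget is as the downtime that target permits over a period, together with the fraction of that allowance still unspent. The sketch below shows the arithmetic; the function name and the 30-day period are assumptions for illustration.

```python
def error_budget(slo: float, availability: float, period_minutes: float) -> dict:
    """Express an availability SLO as a downtime allowance.

    slo and availability are fractions (0.999 means 99.9%).
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = (1 - slo) * period_minutes        # downtime the SLO permits
    used = (1 - availability) * period_minutes  # downtime actually incurred
    return {
        "allowed_minutes": allowed,
        "used_minutes": used,
        # Negative when the budget is overspent.
        "remaining_fraction": 1 - used / allowed,
    }


# 99.9% target and 99.95% measured uptime over an assumed 30-day period.
budget = error_budget(slo=0.999, availability=0.9995, period_minutes=30 * 24 * 60)
```

For these inputs the allowance is about 43.2 minutes, of which about 21.6 minutes are used, leaving roughly half the budget.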

## Edge Cases

- **Regulated industries**: Compliance-approved deployment process
- **Global teams**: Async collaboration, timezone coverage
- **Large organizations**: Platform team, guild model
- **Startups**: Wear many hats, rapid iteration
- **Legacy migration**: Strangler fig pattern, gradual modernization

## Integration Points

- **CI/CD**: GitHub Actions, GitLab CI, Jenkins, ArgoCD
- **Monitoring**: Datadog, New Relic, Prometheus, Grafana
- **Incident**: PagerDuty, Opsgenie, Incident.io
- **Collaboration**: Slack, Jira, Confluence, Notion
- **IaC**: Terraform, CloudFormation, Pulumi
- **Container**: Docker, Kubernetes, Helm

## Output

### DevOps Status

```
DEVOPS STATUS — Q4 2024
═══════════════════════════════════════

DORA performance level: Elite (all 4 metrics)
Deployments/day: 15 (↑ from 12 in Q3)
Lead time: 25 min (target: <60 min) ✓
Change failure rate: 8% (target: ≤15%) ✓
MTTR: 45 min (target: ≤60 min) ✓
Incidents: 1 P1, 3 P2 (↓ from Q3)
Error budget: 99.97% (healthy) ✓
Cloud cost: $42K/month (within budget) ✓
Next priority: Implement self-healing, improve test coverage to 90%
```
