---
name: platform-engineering
description: Design and maintain internal developer platforms (IDP) including golden paths, self-service infrastructure, developer portals, and platform reliability. Use when building developer platforms, creating golden paths, implementing self-service infrastructure, or measuring platform adoption. Triggers on phrases like "platform engineering", "IDP", "internal developer platform", "golden path", "self-service infrastructure", "developer portal", "Backstage", "platform reliability", "developer experience", "DevEx", "platform adoption", "scaffolding", "template library", "platform SLI".
---

# Platform Engineering

Design and maintain internal developer platforms (IDP) including golden paths, self-service infrastructure, and developer portals.

## Workflow

### 1. Platform Architecture

```
INTERNAL DEVELOPER PLATFORM (IDP)
═══════════════════════════════════════

PLATFORM LAYERS:
═══════════════════════════════════════

Layer 1: Developer Portal (Backstage)
  → Service catalog (all applications registered)
  → Template library (scaffolding for new services)
  → Documentation hub
  → API references
  → Support requests

Layer 2: Golden Paths (Opinionated Workflows)
  → Create new service (1-click from template)
  → Deploy to environment (auto CI/CD + infra)
  → Add database (provision + connect)
  → Add message queue (provision + connect)
  → Add monitoring (auto-instrument + dashboards)

Layer 2b: Supported Paths (Custom Workflows)
  → Custom CI/CD pipelines
  → Custom infrastructure (Terraform)
  → Non-standard deployments

Layer 3: Shared Infrastructure (Managed by Platform Team)
  → Kubernetes clusters (EKS/GKE)
  → Databases (managed PostgreSQL, Redis)
  → Message queues (SQS, Kafka)
  → CI/CD runners
  → Monitoring stack (Prometheus, Grafana)
  → Secret management (Vault)
  → Service mesh (Istio)

Layer 4: Cloud Provider (AWS/Azure/GCP; AWS services shown as examples)
  → Networking (VPC, subnets, load balancers)
  → Compute (EC2, ECS, Lambda, EKS)
  → Storage (S3, EBS, RDS)
  → Security (IAM, KMS, security groups)

DEVELOPER PORTAL (Backstage):
═══════════════════════════════════════

  Features:
    → Service catalog: All applications with metadata
    → Software templates: Scaffolding for new services
    → Plugin ecosystem: Custom plugins for organization needs
    → Documentation: Embedded docs, runbooks, ADRs
    → API catalog: Internal API discovery
    → Support: Request infrastructure, report issues
```
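The service catalog in Layer 1 is essentially a registry of services keyed by name, each with ownership metadata. A minimal in-memory sketch follows. The names `CatalogEntry` and `ServiceCatalog` and all field names are hypothetical and do not mirror Backstage's actual entity schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One registered service. Field names are illustrative assumptions."""
    name: str
    owner: str            # owning team
    kind: str             # e.g. "api", "worker", "web", "cli"
    path: str = "golden"  # "golden" (Layer 2) or "supported" (Layer 2b)


class ServiceCatalog:
    """In-memory stand-in for the portal's service catalog."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Reject duplicates so every service has exactly one catalog record.
        if entry.name in self._entries:
            raise ValueError(f"service already registered: {entry.name}")
        self._entries[entry.name] = entry

    def owned_by(self, team: str) -> list[str]:
        """Return the names of all services owned by a team, sorted."""
        return sorted(e.name for e in self._entries.values() if e.owner == team)

    def on_path(self, path: str) -> list[str]:
        """Return the names of all services on a given path, sorted."""
        return sorted(e.name for e in self._entries.values() if e.path == path)


catalog = ServiceCatalog()
catalog.register(CatalogEntry("billing-api", owner="payments", kind="api"))
catalog.register(CatalogEntry("invoice-worker", owner="payments", kind="worker"))
catalog.register(
    CatalogEntry("legacy-batch", owner="data", kind="worker", path="supported")
)
print(catalog.owned_by("payments"))
```

Keeping the golden/supported distinction on the catalog entry itself is one way to make the adoption figures queryable from the catalog rather than tracked by hand.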

### 2. Golden Paths & Templates

```
GOLDEN PATH: Create New Microservice
═══════════════════════════════════════

  Step 1: Developer selects template in Backstage
    → Options: API service, Worker service, Web app, CLI tool
    → Language: Python, Node.js, Go, Java

  Step 2: Fill in metadata
    → Service name, owner team, description
    → Environment (dev/staging/prod)
    → Resource requirements (CPU, memory)
    → Dependencies (database, cache, queue)

  Step 3: Platform provisions automatically
    → Creates Git repository (from template)
    → Provisions Kubernetes namespace
    → Sets up CI/CD pipeline (GitHub Actions)
    → Creates Prometheus alerts
    → Provisions database (if selected)
    → Creates Grafana dashboard
    → Sets up ArgoCD application
    → Generates documentation

  Step 4: Developer starts coding
    → Clone repository
    → Run locally (Docker Compose)
    → Push to branch → CI runs tests
    → Merge to main → Deploy to dev
    → Promote to staging → QA
    → Promote to production → Live

  Time to first deployment: < 15 minutes (from zero)
  Previous time: 2-3 days (manual setup)

TEMPLATE LIBRARY:
═══════════════════════════════════════

Template          Tech Stack            Includes                    Usage (services)
────────────────────────────────────────────────────────────────────────────────────
Python API        FastAPI + SQLAlchemy  CI/CD, K8s, DB, Monitoring  45
Node.js API       Express + Prisma      CI/CD, K8s, DB, Monitoring  30
Go Service        Gin + gRPC            CI/CD, K8s, Monitoring      20
Worker Service    Celery/RQ + Redis     CI/CD, K8s, Queue           15
Web App           Next.js + Prisma      CI/CD, Vercel, DB           25
CLI Tool          Click/Go              CI/CD, Release              5
```
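Step 3 of the golden path can be modeled as an ordered plan derived from the metadata gathered in Steps 1 and 2. The sketch below only builds the plan and does not call any real provisioning system. `ServiceRequest`, `plan_provisioning`, and the step identifiers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Options offered in Step 1 of the golden path.
TEMPLATES = {"api", "worker", "web", "cli"}
LANGUAGES = {"python", "nodejs", "go", "java"}


@dataclass(frozen=True)
class ServiceRequest:
    """Metadata collected in Steps 1 and 2 (field names are assumptions)."""
    name: str
    owner: str
    template: str
    language: str
    dependencies: frozenset = frozenset()  # e.g. {"database", "cache", "queue"}


def plan_provisioning(req: ServiceRequest) -> list[str]:
    """Return the ordered Step 3 actions for a request, without running them."""
    if req.template not in TEMPLATES:
        raise ValueError(f"unknown template: {req.template}")
    if req.language not in LANGUAGES:
        raise ValueError(f"unsupported language: {req.language}")

    steps = [
        "create_git_repository",
        "provision_k8s_namespace",
        "setup_ci_cd_pipeline",
        "create_prometheus_alerts",
    ]
    # The database is the only conditional step listed in the golden path.
    if "database" in req.dependencies:
        steps.append("provision_database")
    steps += [
        "create_grafana_dashboard",
        "setup_argocd_application",
        "generate_documentation",
    ]
    return steps


request = ServiceRequest(
    name="orders-api",
    owner="checkout",
    template="api",
    language="python",
    dependencies=frozenset({"database"}),
)
for step in plan_provisioning(request):
    print(step)
```

Separating planning from execution lets the portal show the developer what will be created before anything is provisioned, and makes the plan easy to test.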

### 3. Self-Service Infrastructure

```
SELF-SERVICE CATALOG
═══════════════════════════════════════

Available Resources:
═══════════════════════════════════════

Resource               Provisioning Time    Approval    Cost Visibility
───────────────────────────────────────────────────────────────────────
Database (PostgreSQL)  <5 minutes           Auto (dev)  $150/mo base
Redis Cache            <2 minutes           Auto (dev)  $50/mo base
Message Queue (SQS)    <1 minute            Auto        Pay-per-use
K8s Namespace          <1 minute            Auto        Shared cost
Cloud Storage (S3)     <1 minute            Auto        Pay-per-use
Load Balancer          <2 minutes           Auto (dev)  $25/mo
DNS Record             <1 minute            Auto        $0
SSL Certificate        <5 minutes           Auto        $0
Secret (Vault)         <1 minute            Auto        $0

APPROVAL FLOW:
═══════════════════════════════════════

  Dev environment: Auto-approve (no gatekeeping)
  Staging: Auto-approve (with cost notification)
  Production: Team lead approval + cost review

  Guardrails:
    → Max instance size per environment
    → Allowed regions only
    → Required tags enforced
    → Budget limits per team
    → Auto-shutdown for dev resources
```
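The approval flow and guardrails above amount to a policy check that runs before any resource is provisioned. A minimal sketch follows. The policy values (`ALLOWED_REGIONS`, `REQUIRED_TAGS`, `MAX_VCPUS`) and all names are assumptions, since the source lists the guardrails but not their settings. The auto-shutdown guardrail is a runtime concern and is not modeled here.

```python
from dataclasses import dataclass, field

# Assumed policy values; the source names the guardrails but not their settings.
ALLOWED_REGIONS = {"us-east-1", "eu-west-1"}
REQUIRED_TAGS = {"team", "service", "environment"}
MAX_VCPUS = {"dev": 2, "staging": 4, "production": 16}

# Approval mode per environment, following the approval flow above.
APPROVAL = {
    "dev": "auto",
    "staging": "auto_with_cost_notification",
    "production": "team_lead_approval_and_cost_review",
}


@dataclass(frozen=True)
class ResourceRequest:
    environment: str
    region: str
    vcpus: int
    monthly_cost: float
    tags: dict = field(default_factory=dict)


def guardrail_violations(req: ResourceRequest, budget_remaining: float) -> list[str]:
    """Return a list of human-readable guardrail violations (empty if none)."""
    if req.environment not in MAX_VCPUS:
        return [f"unknown environment: {req.environment}"]
    violations = []
    if req.vcpus > MAX_VCPUS[req.environment]:
        violations.append("instance size exceeds environment maximum")
    if req.region not in ALLOWED_REGIONS:
        violations.append(f"region not allowed: {req.region}")
    missing = REQUIRED_TAGS - set(req.tags)
    if missing:
        violations.append(f"missing required tags: {sorted(missing)}")
    if req.monthly_cost > budget_remaining:
        violations.append("request exceeds remaining team budget")
    return violations


def decide(req: ResourceRequest, budget_remaining: float) -> tuple[str, list[str]]:
    """Reject on any violation; otherwise return the environment's approval mode."""
    violations = guardrail_violations(req, budget_remaining)
    if violations:
        return "rejected", violations
    return APPROVAL[req.environment], []


tags = {"team": "checkout", "service": "orders-api", "environment": "dev"}
dev_db = ResourceRequest("dev", "us-east-1", vcpus=2, monthly_cost=150.0, tags=tags)
print(decide(dev_db, budget_remaining=1000.0))
```

Running guardrails before the approval step means a team lead is only asked to review production requests that already satisfy policy.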

### 4. Platform Reliability & SLIs

```
PLATFORM RELIABILITY METRICS
═══════════════════════════════════════

Platform SLIs:
═══════════════════════════════════════

SLI                     Target    Current    Measurement
───────────────────────────────────────────────────────────────
Self-service success    99%       97%        Provisioning success rate
Provisioning latency    <5 min    3 min      P95 time to provision
CI/CD pipeline success  98%       96%        Pipeline pass rate
Platform API uptime     99.9%     99.95%     API availability
Template freshness      <30 days  15 days    Days since last update

ADOPTION METRICS:
═══════════════════════════════════════

  Golden path adoption: 68% of new services
  Supported path: 25% of new services
  Golden path (legacy): 7% (migrated services)
  Template usage: 140 services created from templates
  Portal active users: 85% of engineering

DEVELOPER SATISFACTION:
═══════════════════════════════════════

  Survey (Q4 2024, n=120):
    → Overall satisfaction: 4.1/5.0
    → Time-to-deploy: 4.5/5.0 (improved from 2.8)
    → Documentation quality: 3.8/5.0
    → Platform reliability: 4.2/5.0
    → Support responsiveness: 4.0/5.0

  Top requested features:
    1. More language templates (Rust, Ruby)
    2. One-click staging → prod promotion
    3. Cost visibility per service
    4. Environment cloning
    5. Database migration tooling
```
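Comparing each SLI's current value with its target is a mechanical check that can run on a schedule. The sketch below uses the figures from the Platform SLIs table. The `Sli` class and the `higher_is_better` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    name: str
    target: float
    current: float
    higher_is_better: bool = True

    def meets_target(self) -> bool:
        if self.higher_is_better:
            return self.current >= self.target
        # Latency- and age-style targets are written as "< target" in the table.
        return self.current < self.target


# Values copied from the Platform SLIs table.
SLIS = [
    Sli("Self-service success", target=99.0, current=97.0),
    Sli("Provisioning latency (P95, min)", 5.0, 3.0, higher_is_better=False),
    Sli("CI/CD pipeline success", target=98.0, current=96.0),
    Sli("Platform API uptime", target=99.9, current=99.95),
    Sli("Template freshness (days)", 30.0, 15.0, higher_is_better=False),
]

missed = [s.name for s in SLIS if not s.meets_target()]
print(missed)
```

Against the figures in the table, two of the five indicators miss their targets: self-service success (97% against 99%) and CI/CD pipeline success (96% against 98%).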

### 5. Platform Roadmap

```
PLATFORM ROADMAP — 2025
═══════════════════════════════════════

Q1 2025: Developer Experience
═══════════════════════════════════════

  → Rust and Ruby templates
  → Local development environment (Dev Containers)
  → Improved documentation (searchable)
  → Cost visibility per service (dashboard)
  → Environment cloning feature

Q2 2025: Automation
═══════════════════════════════════════

  → One-click staging → production promotion
  → Automated canary deployments
  → Database migration framework
  → Automated security scanning in golden path
  → Service dependency graph

Q3 2025: Scale
═══════════════════════════════════════

  → Multi-cluster support
  → Multi-region deployment
  → Advanced autoscaling
  → Disaster recovery automation
  → Cost optimization recommendations

Q4 2025: Intelligence
═══════════════════════════════════════

  → AI-assisted template selection
  → Anomaly detection for services
  → Predictive scaling
  → Automated performance tuning
  → Developer productivity analytics
```

## Edge Cases

- **Large teams**: Organizations with 1,000+ developers need a platform that scales with them
- **Legacy migration**: Golden path adoption for existing services
- **Multi-cloud**: Platform abstraction across cloud providers
- **Security**: Platform security review, vulnerability management
- **Cost management**: Platform cost allocation to teams
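
For the cost management case, one common approach is to split shared platform costs across teams in proportion to usage. The sketch below is an assumption about how that could work, not a method the source prescribes. The function name, the usage unit (for example CPU-hours), and the sample figures are all hypothetical. Amounts are in whole cents to avoid floating-point rounding.

```python
def allocate_shared_cost(total_cents: int, usage: dict[str, int]) -> dict[str, int]:
    """Split a shared cost across teams in proportion to usage, in whole cents."""
    total_usage = sum(usage.values())
    if total_usage <= 0:
        raise ValueError("usage must contain at least one positive value")

    # Integer division rounds each share down.
    shares = {
        team: total_cents * units // total_usage for team, units in usage.items()
    }
    # Hand the leftover cents to the heaviest users so shares sum to the total.
    remainder = total_cents - sum(shares.values())
    for team in sorted(usage, key=usage.get, reverse=True)[:remainder]:
        shares[team] += 1
    return shares


# Hypothetical example: $1,000.00 of shared cluster cost, usage in CPU-hours.
usage_by_team = {"payments": 50, "search": 30, "data": 20}
print(allocate_shared_cost(100_000, usage_by_team))
```

Whatever allocation key is chosen, the shares should always add up to the billed total so the platform team's numbers reconcile with the cloud invoice.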

## Integration Points

- **IDP tools**: Backstage, Humanitec, Port
- **Cloud**: AWS, Azure, GCP
- **Kubernetes**: EKS, GKE, AKS
- **CI/CD**: GitHub Actions, GitLab CI, Jenkins
- **IaC**: Terraform, Pulumi, Crossplane
- **Monitoring**: Prometheus, Grafana, Datadog

## Output

### Platform Status

```
PLATFORM STATUS — Q4 2024
═══════════════════════════════════════

Services on platform: 180 (68% golden path)
Template usage: 140 services created
Portal active users: 85% of engineering
Platform reliability: 99.95% uptime
Developer satisfaction: 4.1/5.0

Q1 roadmap: Rust/Ruby templates, cost visibility, env cloning
```
