---
name: it-documentation-cmdb
description: Build and maintain IT documentation systems including Configuration Management Database (CMDB), technical documentation, runbooks, knowledge bases, and system architecture diagrams. Use when setting up CMDB, creating IT documentation standards, building system runbooks, documenting infrastructure, managing configuration items, creating architecture diagrams, or establishing documentation workflows. Triggers on phrases like "CMDB", "configuration management", "IT documentation", "runbook", "knowledge base", "system documentation", "architecture diagram", "technical writing", "configuration item", "asset documentation".
---

# IT Documentation & CMDB

Maintain comprehensive IT documentation, configuration management, and operational knowledge bases.

## Workflow

1. Define documentation standards: templates, formats, review cycles, approval processes, and ownership.
2. Implement CMDB: discover configuration items (CIs), establish relationships, automate data collection.
3. Create system documentation: architecture diagrams, deployment topology, network diagrams, data flow maps.
4. Build operational runbooks: step-by-step procedures for common operations, incident response, and maintenance.
5. Develop knowledge base: troubleshooting guides, FAQs, known errors, workarounds, best practices.
6. Establish documentation governance: review cadence, change triggers, version control, access management.
7. Automate documentation where possible: IaC-generated docs, auto-discovered CMDB, API-generated diagrams.
8. Train team on documentation practices: when to document, how to write, quality standards.
9. Audit documentation quarterly: completeness, accuracy, accessibility, relevance.
10. Continuously improve: feedback loops, usage analytics, documentation quality metrics.

## CMDB Implementation

```
CMDB FRAMEWORK
================

Configuration Item (CI) Categories:

  1. Hardware CIs:
     - Servers (physical): model, serial, location, specs, warranty, owner
     - Network devices: router, switch, firewall, load balancer, WAP
     - Storage devices: SAN, NAS, tape library, backup appliance
     - Endpoints: laptops, desktops, phones, tablets, printers
     - Data center: racks, PDUs, UPS, CRAC units

  2. Software CIs:
     - Operating systems: version, patch level, license
     - Applications: version, configuration, dependencies, owner
     - Middleware: web server, app server, message queue
     - Databases: type, version, size, owner, backup status
     - Libraries and frameworks: version, license, known vulnerabilities

  3. Virtual CIs:
     - Virtual machines: host, specs, OS, applications, snapshots
     - Containers: image, version, registry, deployment target
     - Kubernetes: clusters, namespaces, deployments, services
     - Serverless: functions, triggers, dependencies

  4. Cloud CIs:
     - Compute: EC2, VMs, instances (type, size, region, AZ)
     - Storage: S3 buckets, blobs, disks (size, encryption, access)
     - Network: VPCs, subnets, security groups, load balancers
     - Database: RDS, DynamoDB, Cosmos DB (type, size, redundancy such as Multi-AZ)
     - Managed services: Lambda, SQS, SNS, EKS (configuration)

  5. Service CIs:
     - Business services: payroll, CRM, email, website, API
     - IT services: DNS, DHCP, NTP, proxy, VPN
     - Sub-services: components that support parent services
     - Dependencies: service-to-service relationships

  6. Document CIs:
     - Policies and procedures
     - Contracts and licenses
     - Certifications and compliance evidence
     - Architecture diagrams and documentation

CI Attributes (standard fields):

  Identity:
    CI_ID: Unique identifier (auto-generated, format: HW-0001, SW-0001, SVC-0001)
    Name: Human-readable name
    Type: Category and sub-category
    Status: Planned, Staged, In Operation, Retired, Decommissioned

  Technical:
    Model/Version: Manufacturer and version
    Serial Number: Manufacturer serial (for hardware)
    IP Address: Primary IP (for networked devices)
    Hostname: Network name
    Location: Data center, rack, U-position (for hardware)
    Specifications: CPU, RAM, storage, network (for compute)

  Ownership:
    Technical Owner: Person responsible for operation
    Business Owner: Person responsible for business value
    Support Group: Team that provides L2/L3 support
    Cost Center: Budget allocation code

  Relationships:
    Depends On: List of CIs this CI depends on
    Supports: List of services this CI supports
    Contains: List of CIs contained within this CI
    Connected To: Network connections and dependencies

  Lifecycle:
    Date Acquired: Purchase or deployment date
    Expected End of Life: Planned retirement date
    Warranty Expiration: Manufacturer warranty end date
    Last Updated: Last modification to CI record
    Created By: Person who created the record
    Approved By: Person who approved the record

CMDB Accuracy Metrics:

  Target: > 95% accuracy for critical CIs; > 85% for all CIs

  Measurement methods:
    1. Automated discovery reconciliation (quarterly)
       - Run discovery tool; compare against CMDB
       - Identify: missing CIs, stale CIs, incorrect attributes
       - Remediation: update CMDB within 5 business days

    2. Audit sample (quarterly)
       - Random sample of 50 CIs; verify each attribute
       - Accuracy = correct attributes / total attributes checked
       - Report results to IT management

    3. Change management correlation (ongoing)
       - Every change request must reference affected CIs
       - Post-change: verify CI attributes updated
       - Alert on changes without CMDB update
```
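
The reconciliation and audit-sample measurements above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes both the CMDB export and the discovery output are available as dictionaries keyed by `CI_ID`, and the names `reconcile`, `audit_accuracy`, and `ReconciliationReport` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReconciliationReport:
    missing_from_cmdb: set[str] = field(default_factory=set)  # discovered, no CMDB record
    stale_in_cmdb: set[str] = field(default_factory=set)      # CMDB record, not discovered
    mismatched: dict[str, list[str]] = field(default_factory=dict)  # CI_ID -> differing attributes


def reconcile(cmdb: dict[str, dict], discovered: dict[str, dict]) -> ReconciliationReport:
    """Compare CMDB records against discovery output, both keyed by CI_ID."""
    report = ReconciliationReport()
    report.missing_from_cmdb = set(discovered) - set(cmdb)
    report.stale_in_cmdb = set(cmdb) - set(discovered)
    for ci_id in set(cmdb) & set(discovered):
        # Only compare attributes that discovery actually reported.
        diffs = [k for k, v in discovered[ci_id].items() if cmdb[ci_id].get(k) != v]
        if diffs:
            report.mismatched[ci_id] = sorted(diffs)
    return report


def audit_accuracy(checks: list[bool]) -> float:
    """Accuracy = correct attributes / total attributes checked."""
    if not checks:
        raise ValueError("no attributes were checked")
    return sum(checks) / len(checks)


# Illustrative data only; hostnames and addresses are made up.
cmdb = {
    "HW-0001": {"hostname": "db01", "ip": "10.0.0.5"},
    "HW-0002": {"hostname": "web01", "ip": "10.0.0.9"},
}
discovered = {
    "HW-0001": {"hostname": "db01", "ip": "10.0.0.6"},    # IP drifted
    "HW-0003": {"hostname": "web02", "ip": "10.0.0.10"},  # not yet in CMDB
}
report = reconcile(cmdb, discovered)
print(report)
```

Comparing only the attributes that discovery reports keeps manually maintained fields (owner, cost center) from showing up as false mismatches.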

## Documentation Standards

```
DOCUMENTATION STANDARDS
=========================

Document types and templates:

  1. System Architecture Document (SAD):
     Purpose: Describe system architecture, components, and interactions
     Audience: Engineers, architects, new team members
     Format: Markdown or Confluence page
     Sections:
       - Overview (1 paragraph)
       - Architecture diagram (visual)
       - Component inventory (table)
       - Data flow description
       - Integration points (APIs, message queues)
       - Security architecture
       - Scaling and performance characteristics
       - Known limitations and technical debt
     Review cycle: Quarterly or on significant change
     Owner: System architect or lead engineer

  2. Operational Runbook:
     Purpose: Step-by-step procedures for routine operations
     Audience: On-call engineers, operations team
     Format: Markdown with numbered steps; copy-paste commands
     Sections:
       - Procedure name and description
       - Prerequisites (access, tools, approvals needed)
       - Estimated time to complete
       - Step-by-step procedure (numbered, with commands)
       - Expected output for each step
       - Verification steps (confirm success)
       - Rollback procedure (if something goes wrong)
       - Escalation path (when to call for help)
     Review cycle: Semi-annually or after each incident
     Owner: Operations team lead

  3. Incident Response Playbook:
     Purpose: Procedures for specific incident types
     Audience: Incident responders, on-call team
     Format: Markdown; link from alert definitions
     Sections:
       - Incident type and trigger conditions
       - Severity classification
       - Immediate actions (first 15 minutes)
       - Diagnosis steps (structured investigation)
       - Resolution options (prioritized by speed/risk)
       - Verification (confirm resolution)
       - Post-incident actions (PIR, monitoring updates)
     Review cycle: After each incident; minimum annually
     Owner: SRE team lead

  4. Knowledge Base Article:
     Purpose: Troubleshooting guide for common issues
     Audience: Support team, end users, engineers
     Format: Structured FAQ with search-friendly content
     Sections:
       - Title (searchable, includes error codes/symptoms)
       - Symptoms (what the user sees)
       - Root cause (why it happens)
       - Resolution (step-by-step fix)
       - Prevention (how to avoid recurrence)
       - Related articles (links)
     Review cycle: Annually or when issue pattern changes
     Owner: Subject matter expert

  5. Standard Operating Procedure (SOP):
     Purpose: Formal process documentation for compliance
     Audience: Auditors, management, process owners
     Format: Document with version control and approval
     Sections:
       - Document ID, version, date
       - Purpose and scope
       - Roles and responsibilities (RACI matrix)
       - Procedure (detailed steps)
       - References (related policies, regulations)
       - Revision history
       - Approval signatures
     Review cycle: Annually or on regulatory change
     Owner: Process owner

  6. Project Documentation:
     Purpose: Track project scope, decisions, and outcomes
     Audience: Project team, stakeholders, auditors
     Format: Project folder with structured documents
     Sections:
       - Project charter (scope, objectives, timeline, budget)
       - Design documents (technical specifications)
       - Meeting notes and decisions
       - Test plans and results
       - Go-live checklist
       - Post-implementation review
     Review cycle: At project closure; archive after 1 year
     Owner: Project manager

Documentation quality checklist:

  ☐ Clear and concise language (avoid jargon where possible)
  ☐ Step-by-step instructions with expected outcomes
  ☐ Screenshots or diagrams where helpful
  ☐ Commands are copy-paste ready (tested and verified)
  ☐ Version number and last updated date visible
  ☐ Owner/contact person identified
  ☐ Links to related documentation
  ☐ Search-friendly title and tags
  ☐ Accessible to intended audience (appropriate technical level)
  ☐ Reviewed and approved by subject matter expert
```
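
The review cycles listed for each document type lend themselves to an automated overdue check. The sketch below assumes each document carries a type and a last-reviewed date in some metadata store; the day counts are rough approximations of "quarterly", "semi-annually", and "annually", and the names `REVIEW_INTERVAL_DAYS` and `overdue_documents` are hypothetical.

```python
from datetime import date, timedelta

# Maximum age between reviews, approximated from the review cycles above.
REVIEW_INTERVAL_DAYS = {
    "architecture": 91,   # quarterly
    "runbook": 182,       # semi-annually
    "playbook": 365,      # minimum annually
    "kb_article": 365,    # annually
    "sop": 365,           # annually
}


def overdue_documents(docs: list[dict], today: date) -> list[str]:
    """Return titles of documents whose last review is older than their cycle."""
    overdue = []
    for doc in docs:
        limit = timedelta(days=REVIEW_INTERVAL_DAYS[doc["type"]])
        if today - doc["last_reviewed"] > limit:
            overdue.append(doc["title"])
    return overdue


# Illustrative records only.
docs = [
    {"title": "Payments architecture", "type": "architecture", "last_reviewed": date(2024, 1, 10)},
    {"title": "DB failover runbook", "type": "runbook", "last_reviewed": date(2024, 3, 1)},
]
print(overdue_documents(docs, today=date(2024, 6, 1)))  # ['Payments architecture']
```

Event-driven triggers (a significant change, an incident) still need a separate hook, since a date check only covers the calendar part of each cycle.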

## Runbook Examples

````
RUNBOOK: Database Failover (PostgreSQL)
=========================================

Trigger: Primary database unresponsive or degraded for > 60 seconds
Severity: SEV-1 (if production)

Prerequisites:
  - SSH access to database servers
  - psql client installed
  - Patroni cluster access (patronictl and /etc/patroni/patroni.yml)
  - Notification channels available (Slack, PagerDuty)

Estimated time: 5–15 minutes

Step 1: Verify Failure
  ```
  # Check if primary is responding
  psql -h primary-db.internal -U monitoring -d postgres -c "SELECT 1;"

  # Check Patroni status
  patronictl -c /etc/patroni/patroni.yml list

  # Check the last 100 log lines (path assumes PostgreSQL 14 on Debian/Ubuntu)
  tail -n 100 /var/log/postgresql/postgresql-14-main.log
  ```
  Expected: Connection refused or timeout; Patroni shows primary as stopped

Step 2: Initiate Failover
  ```
  # Automatic failover (Patroni)
  # Patroni normally promotes a replica on its own once the leader key
  # expires (default ttl: 30 seconds)

  # If Patroni has not failed over automatically, trigger it manually.
  # new-primary is the healthy replica to promote; the cluster name is
  # read from the "scope" setting in patroni.yml:
  patronictl -c /etc/patroni/patroni.yml failover --candidate new-primary

  # Verify new primary
  patronictl -c /etc/patroni/patroni.yml list
  ```
  Expected: New primary elected; cluster shows exactly one leader

Step 3: Update Connection Strings
  ```
  # If using virtual IP / service discovery, no action needed
  # If using static connection strings, update:
  # - Application configuration
  # - Connection poolers (PgBouncer)
  # - ETL jobs and scheduled tasks

  # Verify application connectivity
  curl -s http://app-health-endpoint/health
  ```

Step 4: Verify Data Integrity
  ```
  # Connect to new primary
  psql -h new-primary.internal -U monitoring -d postgres

  # Confirm this node is now the primary (expect: f)
  SELECT pg_is_in_recovery();

  # Check replication status (lists replicas streaming from the new primary;
  # empty until a replica has rejoined)
  SELECT * FROM pg_stat_replication;

  # Check database size (compare with pre-failover)
  SELECT pg_database_size('production_db');
  ```

Step 5: Restore Old Primary as Replica
  ```
  # Once the old primary host is recovered, start Patroni on it; it
  # normally rejoins the cluster as a replica on its own.
  # (Assumes Patroni runs as a systemd unit named "patroni".)
  sudo systemctl start patroni

  # If it cannot rejoin (for example, a diverged timeline), rebuild it from
  # the new primary. <cluster-name> is the "scope" value in patroni.yml:
  patronictl -c /etc/patroni/patroni.yml reinit <cluster-name> old-primary

  # Verify it joins as replica
  patronictl -c /etc/patroni/patroni.yml list
  ```

Step 6: Verification
  - Application health check passes
  - Error rate returned to baseline (< 0.1%)
  - Database queries executing normally
  - Monitoring alerts cleared

Rollback:
  If the failover causes issues and the original primary has rejoined as a
  healthy replica, switch back to it:
  patronictl -c /etc/patroni/patroni.yml switchover --leader new-primary --candidate old-primary

Escalation:
  If failover has not completed within 15 minutes: escalate to DBA team lead
  If data loss suspected: escalate to VP Engineering immediately
````
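
The "cluster shows one leader" check in the runbook can be automated instead of read off the table by eye. The sketch below assumes the JSON form of `patronictl list` (requested with `-f json`) is a list of member objects with `Role` and `State` keys; those key names and state values should be verified against the installed Patroni version, and `cluster_is_healthy` is a hypothetical helper.

```python
import json


def cluster_is_healthy(patronictl_json: str) -> bool:
    """True when exactly one member is Leader and every member is in a healthy state."""
    members = json.loads(patronictl_json)
    leaders = [m for m in members if m.get("Role") == "Leader"]
    # Assumed state names: replicas may report "running" or "streaming"
    # depending on the Patroni version.
    healthy_states = {"running", "streaming"}
    return len(leaders) == 1 and all(m.get("State") in healthy_states for m in members)


# Illustrative input shaped like the assumed JSON output.
sample = json.dumps([
    {"Member": "old-primary", "Role": "Replica", "State": "streaming"},
    {"Member": "new-primary", "Role": "Leader", "State": "running"},
])
print(cluster_is_healthy(sample))  # True
```

In practice the input would come from running `patronictl -c /etc/patroni/patroni.yml list -f json` and passing its standard output to the function.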

## Integration Points

- **ServiceNow CMDB**: Enterprise CMDB; automated discovery; reconciliation; service mapping; ITSM integration
- **Lansweeper / FlexNet / Snow Software**: Asset discovery and software asset management; license compliance; CMDB enrichment
- **Confluence / Notion / SharePoint**: Documentation platforms; wiki; templates; version control; search
- **Atlassian Documentation**: Integrated with Jira; API documentation; developer docs
- **Diagrams.net (Draw.io) / Lucidchart / Visio**: Architecture diagrams; network diagrams; flowcharts
- **AWS Config / Azure Resource Graph**: Cloud resource inventory; automated CMDB population for cloud resources
- **Wazuh / Nagios / Zabbix**: Security and monitoring platforms; host and service discovery; inventory data for CMDB enrichment
- **Structurizr / C4 Model**: Architecture as code; code-based architecture diagrams; version-controlled docs
- **Markdown / Git**: Version-controlled documentation; Git repo per system; PR-based review process

## Edge Cases

- **Legacy system documentation** (systems with no documentation, original team departed): Reverse-engineer the architecture through code analysis, log analysis, network traffic analysis, and interviews with remaining team members; prioritize critical systems first; create "living documentation" that the team maintains going forward; budget 2–4 weeks per major undocumented system
  - Approach: start with architecture diagram (boxes and arrows), then add detail iteratively
  - Tools: code dependency analyzers, network mapping tools, packet captures
  - Team effort: 1 senior engineer + 1 junior engineer for 2–4 weeks per system
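
A first boxes-and-arrows diagram can be generated directly from observed network connections. The sketch below assumes the traffic analysis has already been reduced to `(source, destination, port)` tuples and emits Graphviz DOT text; the function name `connections_to_dot` and the host names are hypothetical, and DOT is just one convenient output format.

```python
def connections_to_dot(connections: list[tuple[str, str, int]]) -> str:
    """Render observed (source, destination, port) tuples as a Graphviz DOT digraph."""
    lines = ["digraph legacy_system {", "  rankdir=LR;"]
    # Deduplicate and sort so repeated captures produce a stable diagram.
    for src, dst, port in sorted(set(connections)):
        lines.append(f'  "{src}" -> "{dst}" [label="{port}"];')
    lines.append("}")
    return "\n".join(lines)


# Illustrative observations only.
observed = [
    ("web01", "app01", 8080),
    ("app01", "db01", 5432),
    ("web01", "app01", 8080),  # duplicate capture
]
print(connections_to_dot(observed))
```

The resulting text can be rendered with any DOT-compatible tool and then refined by hand as interviews and code analysis add detail.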

- **Cloud-native CMDB** (containers, serverless, ephemeral resources): Traditional CMDBs struggle with ephemeral resources (a container may live only minutes); use cloud-native inventory services such as AWS Resource Explorer, Azure Resource Graph, and Google Cloud Asset Inventory; manage resources by tag; auto-generate documentation from IaC (for example, with terraform-docs); treat the CI/CD pipeline as a documentation source
  - Strategy: CMDB for stable infrastructure; dynamic inventory for ephemeral resources
  - Tooling: Terraform state as source of truth; CloudFormation exports; Kubernetes API for pod/service inventory
  - Integration: feed cloud inventory into CMDB for relationship mapping
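
Feeding Terraform-managed resources into the CMDB could look roughly like this. The sketch assumes the input is the parsed output of `terraform show -json`, where resources sit under `values.root_module` with nested `child_modules`; the CI field names (`ci_name`, `ci_type`, `tags`) are hypothetical and would need to be mapped to the real CMDB schema.

```python
def terraform_resources(module: dict) -> list[dict]:
    """Recursively collect managed resources from a `terraform show -json` module."""
    found = []
    for res in module.get("resources", []):
        if res.get("mode") == "managed":  # skip data sources
            found.append({
                "ci_name": res["address"],
                "ci_type": res["type"],
                "tags": (res.get("values") or {}).get("tags") or {},
            })
    for child in module.get("child_modules", []):
        found.extend(terraform_resources(child))
    return found


# Illustrative state fragment; real input would come from json.load() on the CLI output.
state = {
    "values": {
        "root_module": {
            "resources": [
                {"address": "aws_s3_bucket.logs", "mode": "managed", "type": "aws_s3_bucket",
                 "values": {"tags": {"owner": "platform"}}},
                {"address": "data.aws_region.current", "mode": "data", "type": "aws_region",
                 "values": {}},
            ],
            "child_modules": [
                {"resources": [
                    {"address": "module.db.aws_db_instance.main", "mode": "managed",
                     "type": "aws_db_instance", "values": {"tags": None}},
                ]},
            ],
        }
    }
}
cis = terraform_resources(state["values"]["root_module"])
print([ci["ci_name"] for ci in cis])
```

Tags carried over from the IaC definition give the CMDB the ownership data that tag-based management depends on.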

- **Documentation as Code** (treating docs like software): Store documentation in Git repositories alongside code; use Markdown or AsciiDoc; CI pipeline builds and deploys documentation; PR review process for documentation changes; version docs with releases; automated link checking; stale content detection
  - Tools: MkDocs, Docusaurus, Sphinx, Jekyll (static site generators)
  - CI/CD: GitHub Actions / GitLab CI builds docs on merge; deploys to documentation site
  - Standards: every PR must update relevant documentation; "no undocumented changes" policy
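
Automated link checking in a docs-as-code pipeline can start very small. The sketch below only checks relative Markdown links against the file system and ignores external URLs and in-page anchors; the function name `broken_relative_links` is hypothetical, and dedicated link checkers cover many more cases.

```python
import re
from pathlib import Path

# Matches inline Markdown links: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_relative_links(docs_root: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative Markdown links that point nowhere."""
    broken = []
    for md_file in sorted(docs_root.rglob("*.md")):
        for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors are out of scope here
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((str(md_file.relative_to(docs_root)), target))
    return broken
```

A CI job could call `broken_relative_links(Path("docs"))` and fail the build when the returned list is not empty.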

- **Multi-language documentation** (global teams, global users): Maintain documentation in multiple languages; use translation management system; designate language owners; establish translation workflow; version docs per language; flag translations for review when source changes
  - Tools: Crowdin, Lokalise, Transifex (translation management)
  - Process: source doc updated → translation team notified → translation completed within 5 business days → review → publish
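
Flagging translations for review when the source changes can be done by recording a hash of the source text at translation time. This is a sketch under that assumption; the names `content_hash` and `translations_needing_review` are hypothetical, and translation management systems typically provide an equivalent feature.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a source document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def translations_needing_review(source_text: str, translated_from: dict[str, str]) -> list[str]:
    """translated_from maps language code -> hash of the source version it was translated from."""
    current = content_hash(source_text)
    return sorted(lang for lang, h in translated_from.items() if h != current)


# Illustrative data only.
old_source = "Restart the service."
new_source = "Restart the service, then verify the health check."
recorded = {
    "de": content_hash(old_source),  # translated before the source changed
    "fr": content_hash(new_source),  # already updated
}
print(translations_needing_review(new_source, recorded))  # ['de']
```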
  - Cost: $0.05–$0.15 per word for professional translation; $5K–$50K/year for enterprise documentation

- **Regulatory documentation** (audit-ready, signed, version-controlled): Documents must be immutable once approved; version control with approval signatures; retention per regulatory requirement (7 years typical); access control (read-only for most, edit only by process owner); audit trail of all changes
  - Tools: DocuSign for signatures; SharePoint with versioning; ServiceNow for policy management
  - Retention: per the applicable regulation (commonly 7 years); archive in immutable storage
  - Access: role-based access control; audit log of all document access
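
One way to make an audit trail tamper-evident is to chain entries by hash, so that editing an earlier entry invalidates everything after it. The sketch below illustrates the idea only: the entry fields and the functions `append_entry` and `verify_chain` are hypothetical, a real log would also record timestamps, and compliance-grade storage normally relies on the platform's own immutability controls.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], actor: str, action: str, document_id: str) -> None:
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "document_id": document_id, "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and check that each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("actor", "action", "document_id", "prev")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative entries only.
audit_log: list[dict] = []
append_entry(audit_log, "alice", "approve", "SOP-001")
append_entry(audit_log, "bob", "view", "SOP-001")
print(verify_chain(audit_log))  # True
```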

- **Documentation adoption** (team doesn't write or read documentation): Make documentation part of the definition of done, backed by a leadership mandate; measure documentation coverage (% of systems with current docs); tie it to performance reviews; make documentation easy (templates, auto-generation); celebrate good documentation; hold documentation quality reviews in sprint retrospectives
  - Metrics: documentation coverage (% systems documented); documentation freshness (last updated date); documentation usage (page views, search queries)
  - Incentives: "documentation champion" recognition; documentation quality in sprint demos; new hire onboarding quality metric
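
The coverage and freshness metrics can be computed from a simple system inventory. The sketch below assumes each system record has a `doc_last_updated` date (or `None` when undocumented); the 180-day freshness threshold and the function name `documentation_metrics` are assumptions for illustration, not values from this guide.

```python
from datetime import date


def documentation_metrics(systems: list[dict], today: date, max_age_days: int = 180) -> dict:
    """Return coverage and freshness as percentages of all systems."""
    total = len(systems)
    if total == 0:
        return {"coverage_pct": 0.0, "fresh_pct": 0.0}
    documented = [s for s in systems if s.get("doc_last_updated") is not None]
    fresh = [s for s in documented if (today - s["doc_last_updated"]).days <= max_age_days]
    return {
        "coverage_pct": round(100 * len(documented) / total, 1),
        "fresh_pct": round(100 * len(fresh) / total, 1),
    }


# Illustrative inventory only.
systems = [
    {"name": "payroll", "doc_last_updated": date(2024, 5, 1)},
    {"name": "crm", "doc_last_updated": date(2023, 1, 1)},
    {"name": "legacy-batch", "doc_last_updated": None},
    {"name": "website", "doc_last_updated": date(2024, 4, 1)},
]
print(documentation_metrics(systems, today=date(2024, 6, 1)))
```

Usage data (page views, search queries) would come from the documentation platform's analytics and is not covered here.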
