IT AI Skill
IT Documentation CMDB
Build and maintain IT documentation systems including Configuration Management Database (CMDB), technical documentation, runbooks, knowledge bases, and system architecture diagrams. Use when setting up CMDB, creating IT documentation standards, building sys...
IT Documentation & CMDB
Maintain comprehensive IT documentation, configuration management, and operational knowledge bases.
Workflow
- Define documentation standards: templates, formats, review cycles, approval processes, and ownership.
- Implement CMDB: discover configuration items (CIs), establish relationships, automate data collection.
- Create system documentation: architecture diagrams, deployment topology, network diagrams, data flow maps.
- Build operational runbooks: step-by-step procedures for common operations, incident response, and maintenance.
- Develop knowledge base: troubleshooting guides, FAQs, known errors, workarounds, best practices.
- Establish documentation governance: review cadence, change triggers, version control, access management.
- Automate documentation where possible: IaC-generated docs, auto-discovered CMDB, API-generated diagrams.
- Train team on documentation practices: when to document, how to write, quality standards.
- Audit documentation quarterly: completeness, accuracy, accessibility, relevance.
- Continuously improve: feedback loops, usage analytics, documentation quality metrics.
CMDB Implementation
CMDB FRAMEWORK
================
Configuration Item (CI) Categories:
1. Hardware CIs:
- Servers (physical): model, serial, location, specs, warranty, owner
- Network devices: router, switch, firewall, load balancer, WAP
- Storage devices: SAN, NAS, tape library, backup appliance
- Endpoints: laptops, desktops, phones, tablets, printers
- Data center: racks, PDUs, UPS, CRAC units
2. Software CIs:
- Operating systems: version, patch level, license
- Applications: version, configuration, dependencies, owner
- Middleware: web server, app server, message queue
- Databases: type, version, size, owner, backup status
- Libraries and frameworks: version, license, known vulnerabilities
3. Virtual CIs:
- Virtual machines: host, specs, OS, applications, snapshots
- Containers: image, version, registry, deployment target
- Kubernetes: clusters, namespaces, deployments, services
- Serverless: functions, triggers, dependencies
4. Cloud CIs:
- Compute: EC2, VMs, instances (type, size, region, AZ)
- Storage: S3 buckets, blobs, disks (size, encryption, access)
- Network: VPCs, subnets, security groups, load balancers
- Database: RDS, DynamoDB, CosmosDB (type, size, Multi-AZ)
- Managed services: Lambda, SQS, SNS, EKS (configuration)
5. Service CIs:
- Business services: payroll, CRM, email, website, API
- IT services: DNS, DHCP, NTP, proxy, VPN
- Sub-services: components that support parent services
- Dependencies: service-to-service relationships
6. Document CIs:
- Policies and procedures
- Contracts and licenses
- Certifications and compliance evidence
- Architecture diagrams and documentation
CI Attributes (standard fields):
Identity:
CI_ID: Unique identifier (auto-generated, format: HW-0001, SW-0001, SVC-0001)
Name: Human-readable name
Type: Category and sub-category
Status: Planned, Staged, In Operation, Retired, Decommissioned
Technical:
Model/Version: Manufacturer and version
Serial Number: Manufacturer serial (for hardware)
IP Address: Primary IP (for networked devices)
Hostname: Network name
Location: Data center, rack, U-position (for hardware)
Specifications: CPU, RAM, storage, network (for compute)
Ownership:
Technical Owner: Person responsible for operation
Business Owner: Person responsible for business value
Support Group: Team that provides L2/L3 support
Cost Center: Budget allocation code
Relationships:
Depends On: List of CIs this CI depends on
Supports: List of services this CI supports
Contains: List of CIs contained within this CI
Connected To: Network connections and dependencies
Lifecycle:
Date Acquired: Purchase or deployment date
Expected End of Life: Planned retirement date
Warranty Expiration: Manufacturer warranty end date
Last Updated: Last modification to CI record
Created By: Person who created the record
Approved By: Person who approved the record
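The standard fields above can be sketched as a small record type. This is a minimal illustration, not a schema for any particular CMDB product: the class names, the `PREFIXES` mapping, and the choice of fields carried over are assumptions, and only the three ID prefixes named in the CI_ID format (HW, SW, SVC) are covered.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

# Prefixes follow the CI_ID format listed above (HW-0001, SW-0001, SVC-0001).
# Prefixes for the other categories are not given in the text, so they are omitted.
PREFIXES = {"hardware": "HW", "software": "SW", "service": "SVC"}
STATUSES = ("Planned", "Staged", "In Operation", "Retired", "Decommissioned")


class CiIdGenerator:
    """Hands out sequential IDs per category, e.g. HW-0001, HW-0002."""

    def __init__(self) -> None:
        self._counters = {category: count(1) for category in PREFIXES}

    def next_id(self, category: str) -> str:
        return f"{PREFIXES[category]}-{next(self._counters[category]):04d}"


@dataclass
class ConfigurationItem:
    ci_id: str
    name: str
    ci_type: str
    status: str = "Planned"
    technical_owner: Optional[str] = None
    depends_on: List[str] = field(default_factory=list)  # CI_IDs this CI depends on
    supports: List[str] = field(default_factory=list)    # service CI_IDs this CI supports

    def __post_init__(self) -> None:
        # Reject statuses outside the lifecycle list above.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


ids = CiIdGenerator()
db_server = ConfigurationItem(ids.next_id("hardware"), "db-server-01",
                              "Hardware/Server", status="In Operation")
payroll = ConfigurationItem(ids.next_id("service"), "Payroll", "Service/Business")

# Record the relationship in both directions.
db_server.supports.append(payroll.ci_id)
payroll.depends_on.append(db_server.ci_id)
```

Keeping relationships as lists of CI_IDs rather than object references makes the records easy to serialize and to compare against discovery output.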
CMDB Accuracy Metrics:
Target: > 95% accuracy for critical CIs; > 85% for all CIs
Measurement methods:
1. Automated discovery reconciliation (quarterly)
- Run discovery tool; compare against CMDB
- Identify: missing CIs, stale CIs, incorrect attributes
- Remediation: update CMDB within 5 business days
2. Audit sample (quarterly)
- Random sample of 50 CIs; verify each attribute
- Accuracy = correct attributes / total attributes checked
- Report results to IT management
3. Change management correlation (ongoing)
- Every change request must reference affected CIs
- Post-change: verify CI attributes updated
- Alert on changes without CMDB update
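The first two measurement methods above reduce to set comparisons and a ratio. The sketch below assumes both the CMDB export and the discovery output are available as dictionaries keyed by CI_ID; the function names and sample data are illustrative only. A CI reported as stale may simply be outside the discovery scope, so the result is a list to investigate rather than a list to delete.

```python
def reconcile(cmdb: dict, discovered: dict) -> dict:
    """Compare CMDB records with discovery output, both keyed by CI_ID."""
    missing = sorted(set(discovered) - set(cmdb))  # discovered, but absent from the CMDB
    stale = sorted(set(cmdb) - set(discovered))    # in the CMDB, but no longer discovered
    incorrect = {}
    for ci_id in set(cmdb) & set(discovered):
        # Map each differing attribute to (CMDB value, discovered value).
        diffs = {
            attr: (cmdb[ci_id].get(attr), value)
            for attr, value in discovered[ci_id].items()
            if cmdb[ci_id].get(attr) != value
        }
        if diffs:
            incorrect[ci_id] = diffs
    return {"missing": missing, "stale": stale, "incorrect": incorrect}


def audit_accuracy(cmdb: dict, verified: dict) -> float:
    """Accuracy = correct attributes / total attributes checked."""
    checked = correct = 0
    for ci_id, truth in verified.items():
        for attr, value in truth.items():
            checked += 1
            if cmdb.get(ci_id, {}).get(attr) == value:
                correct += 1
    return correct / checked if checked else 0.0


cmdb = {
    "HW-0001": {"hostname": "db01", "ip": "10.0.0.5"},
    "HW-0002": {"hostname": "web01", "ip": "10.0.0.6"},
}
discovered = {
    "HW-0001": {"hostname": "db01", "ip": "10.0.0.9"},
    "HW-0003": {"hostname": "web02", "ip": "10.0.0.7"},
}

report = reconcile(cmdb, discovered)
# One audited CI with two attributes, one of which is wrong in the CMDB.
accuracy = audit_accuracy(cmdb, {"HW-0001": {"hostname": "db01", "ip": "10.0.0.9"}})
```

Here `report` lists HW-0003 as missing, HW-0002 as stale, and the IP of HW-0001 as incorrect, and `accuracy` is 0.5, well below the 95% target for critical CIs.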
Documentation Standards
DOCUMENTATION STANDARDS
=========================
Document types and templates:
1. System Architecture Document (SAD):
Purpose: Describe system architecture, components, and interactions
Audience: Engineers, architects, new team members
Format: Markdown or Confluence page
Sections:
- Overview (1 paragraph)
- Architecture diagram (visual)
- Component inventory (table)
- Data flow description
- Integration points (APIs, message queues)
- Security architecture
- Scaling and performance characteristics
- Known limitations and technical debt
Review cycle: Quarterly or on significant change
Owner: System architect or lead engineer
2. Operational Runbook:
Purpose: Step-by-step procedures for routine operations
Audience: On-call engineers, operations team
Format: Markdown with numbered steps; copy-paste commands
Sections:
- Procedure name and description
- Prerequisites (access, tools, approvals needed)
- Estimated time to complete
- Step-by-step procedure (numbered, with commands)
- Expected output for each step
- Verification steps (confirm success)
- Rollback procedure (if something goes wrong)
- Escalation path (when to call for help)
Review cycle: Semi-annually or after each incident
Owner: Operations team lead
3. Incident Response Playbook:
Purpose: Procedures for specific incident types
Audience: Incident responders, on-call team
Format: Markdown; link from alert definitions
Sections:
- Incident type and trigger conditions
- Severity classification
- Immediate actions (first 15 minutes)
- Diagnosis steps (structured investigation)
- Resolution options (prioritized by speed/risk)
- Verification (confirm resolution)
- Post-incident actions (PIR, monitoring updates)
Review cycle: After each incident; minimum annually
Owner: SRE team lead
4. Knowledge Base Article:
Purpose: Troubleshooting guide for common issues
Audience: Support team, end users, engineers
Format: Structured FAQ with search-friendly content
Sections:
- Title (searchable, includes error codes/symptoms)
- Symptoms (what the user sees)
- Root cause (why it happens)
- Resolution (step-by-step fix)
- Prevention (how to avoid recurrence)
- Related articles (links)
Review cycle: Annually or when issue pattern changes
Owner: Subject matter expert
5. Standard Operating Procedure (SOP):
Purpose: Formal process documentation for compliance
Audience: Auditors, management, process owners
Format: Document with version control and approval
Sections:
- Document ID, version, date
- Purpose and scope
- Roles and responsibilities (RACI matrix)
- Procedure (detailed steps)
- References (related policies, regulations)
- Revision history
- Approval signatures
Review cycle: Annually or on regulatory change
Owner: Process owner
6. Project Documentation:
Purpose: Track project scope, decisions, and outcomes
Audience: Project team, stakeholders, auditors
Format: Project folder with structured documents
Sections:
- Project charter (scope, objectives, timeline, budget)
- Design documents (technical specifications)
- Meeting notes and decisions
- Test plans and results
- Go-live checklist
- Post-implementation review
Review cycle: At project closure; archive after 1 year
Owner: Project manager
Documentation quality checklist:
☐ Clear and concise language (avoid jargon where possible)
☐ Step-by-step instructions with expected outcomes
☐ Screenshots or diagrams where helpful
☐ Commands are copy-paste ready (tested and verified)
☐ Version number and last updated date visible
☐ Owner/contact person identified
☐ Links to related documentation
☐ Search-friendly title and tags
☐ Accessible to intended audience (appropriate technical level)
☐ Reviewed and approved by subject matter expert
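The review cycles listed for each document type can be checked mechanically. The following sketch assumes each document carries a type and a last-reviewed date; the type keys and the day counts used for "quarterly" and "semi-annually" are approximations chosen for illustration.

```python
from datetime import date, timedelta

# Review intervals taken from the templates above, approximated in days.
REVIEW_INTERVAL_DAYS = {
    "architecture": 91,   # quarterly
    "runbook": 182,       # semi-annually
    "playbook": 365,      # minimum annually
    "kb_article": 365,    # annually
    "sop": 365,           # annually
}


def overdue_documents(docs: list, today: date) -> list:
    """Return (title, days overdue) pairs, most overdue first."""
    overdue = []
    for doc in docs:
        due = doc["last_reviewed"] + timedelta(days=REVIEW_INTERVAL_DAYS[doc["type"]])
        if due < today:
            overdue.append((doc["title"], (today - due).days))
    return sorted(overdue, key=lambda item: item[1], reverse=True)


docs = [
    {"title": "Payments SAD", "type": "architecture", "last_reviewed": date(2024, 1, 10)},
    {"title": "DB failover runbook", "type": "runbook", "last_reviewed": date(2024, 3, 1)},
]
overdue = overdue_documents(docs, today=date(2024, 6, 1))
```

With these sample dates only the architecture document is overdue; the runbook is not due until late August.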
Runbook Examples
RUNBOOK: Database Failover (PostgreSQL)
=========================================
Trigger: Primary database unresponsive or degraded for > 60 seconds
Severity: SEV-1 (if production)
Prerequisites:
- SSH access to database servers
- psql client installed
- Patroni cluster access
- Notification channels available (Slack, PagerDuty)
Estimated time: 5–15 minutes
Step 1: Verify Failure
Check if primary is responding
psql -h primary-db.internal -U monitoring -c "SELECT 1;"
Check Patroni status
patronictl -c /etc/patroni/patroni.yml list
Check logs
tail -n 100 /var/log/postgresql/postgresql-14-main.log
Expected: Connection refused or timeout; Patroni shows primary as stopped
Step 2: Initiate Failover
Automatic failover (Patroni)
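Patroni should fail over automatically within 30 seconds
If it does not, trigger a failover manually (the positional argument is the Patroni cluster name; patronictl prompts for the candidate unless --candidate is given):
patronictl -c /etc/patroni/patroni.yml failover primary-db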
Verify new primary
patronictl -c /etc/patroni/patroni.yml list
Expected: New primary elected; cluster shows one leader
Step 3: Update Connection Strings
If using virtual IP / service discovery, no action needed
If using static connection strings, update:
- Application configuration
- Connection poolers (PgBouncer)
- ETL jobs and scheduled tasks
Verify application connectivity
curl -s http://app-health-endpoint/health
Step 4: Verify Data Integrity
Connect to new primary
psql -h new-primary.internal -U monitoring
Check replication status (empty while the new primary has no replicas attached; any surviving replicas should be listed)
SELECT * FROM pg_stat_replication;
Check database size (compare with pre-failover)
SELECT pg_database_size('production_db');
Step 5: Restore Old Primary as Replica
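Once the old primary host is recovered, restart its member so it rejoins the cluster (arguments are the cluster name, then the member name; <cluster-name> is a placeholder for your Patroni scope):
patronictl -c /etc/patroni/patroni.yml restart <cluster-name> old-primary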
Verify it joins as replica
patronictl -c /etc/patroni/patroni.yml list
Step 6: Verification
- Application health check passes
- Error rate returned to baseline (< 0.1%)
- Database queries executing normally
- Monitoring alerts cleared
Rollback:
If the failover causes issues and the original primary has rejoined as a healthy replica, switch back to it manually:
patronictl -c /etc/patroni/patroni.yml switchover --leader new-primary --candidate old-primary
Escalation:
If failover has not succeeded within 15 minutes: escalate to DBA team lead
If data loss suspected: escalate to VP Engineering immediately
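The runbook's trigger condition (primary unresponsive or degraded for more than 60 seconds) can be evaluated from a series of health probes. This is a sketch under the assumption that a monitoring job records timestamped probe results; the function names and the sample format are hypothetical and not part of Patroni or PostgreSQL.

```python
from typing import Iterable, Tuple


def unhealthy_for(samples: Iterable[Tuple[float, bool]]) -> float:
    """Seconds covered by the most recent unbroken run of failed probes.

    samples: (unix_timestamp, healthy) pairs in chronological order.
    """
    run_start = None
    last_ts = None
    for ts, healthy in samples:
        if healthy:
            run_start = None      # a healthy probe ends the current failure run
        elif run_start is None:
            run_start = ts        # first failed probe of a new run
        last_ts = ts
    if run_start is None:
        return 0.0
    return last_ts - run_start


def should_start_runbook(samples, threshold_seconds: float = 60.0) -> bool:
    """True when the primary has been failing probes for longer than the threshold."""
    return unhealthy_for(samples) > threshold_seconds


sustained = [(0, True), (10, False), (20, False), (80, False)]
flapping = [(0, False), (30, True), (40, False), (90, False)]
```

The `sustained` series has been failing for 70 seconds and meets the trigger; the `flapping` series recovered once, so its current failure run is only 50 seconds long.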
Integration Points
- ServiceNow CMDB: Enterprise CMDB; automated discovery; reconciliation; service mapping; ITSM integration
- Lansweeper / FlexNet / Snow Software: Software asset management; license compliance; CMDB enrichment
- Confluence / Notion / SharePoint: Documentation platforms; wiki; templates; version control; search
- Atlassian Documentation: Integrated with Jira; API documentation; developer docs
- Diagrams.net (Draw.io) / Lucidchart / Visio: Architecture diagrams; network diagrams; flowcharts
- AWS Config / Azure Resource Graph: Cloud resource inventory; automated CMDB feed for cloud resources
- Wazuh / Nagios / Zabbix: Monitoring and security tools; host and service discovery data as a CI detection feed for the CMDB
- Structurizr / C4 Model: Architecture as code; code-based architecture diagrams; version-controlled docs
- Markdown / Git: Version-controlled documentation; Git repo per system; PR-based review process
Edge Cases
- Legacy system documentation (systems with no documentation, original team departed): Reverse-engineer the architecture through code analysis, log analysis, network traffic analysis, and interviews with remaining team members; prioritize critical systems first; create "living documentation" that the team maintains going forward; budget 2–4 weeks per major undocumented system
- Approach: start with architecture diagram (boxes and arrows), then add detail iteratively
- Tools: code dependency analyzers, network mapping tools, packet captures
- Team effort: 1 senior engineer + 1 junior engineer for 2–4 weeks per system
- Cloud-native CMDB (containers, serverless, ephemeral resources): Traditional CMDBs struggle with ephemeral resources (a container may live only minutes); use cloud-native inventory instead: AWS Resource Explorer, Azure Resource Graph, Google Cloud Asset Inventory; tag-based management; auto-generated documentation from IaC (for example, terraform-docs); CI/CD pipeline as documentation source
- Strategy: CMDB for stable infrastructure; dynamic inventory for ephemeral resources
- Tooling: Terraform state as source of truth; CloudFormation exports; Kubernetes API for pod/service inventory
- Integration: feed cloud inventory into CMDB for relationship mapping
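As one way to feed cloud inventory into the CMDB, the sketch below extracts candidate CIs from a Terraform state document. It assumes the version 4 state layout (a top-level `resources` list whose entries hold `instances` with `attributes`); Terraform treats the raw state file as an internal format and offers `terraform show -json` as the machine-readable interface, so a production integration would more likely parse that output. The output field names are illustrative.

```python
def cis_from_terraform_state(state: dict) -> list:
    """Turn managed resources in a Terraform state document into CI candidates."""
    cis = []
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":  # skip data sources
            continue
        for instance in resource.get("instances", []):
            attrs = instance.get("attributes", {})
            cis.append({
                "name": f'{resource["type"]}.{resource["name"]}',
                "type": resource["type"],
                "cloud_id": attrs.get("id"),
                "tags": attrs.get("tags") or {},  # tags may be null in state
            })
    return cis


# Hand-written sample in the assumed layout, not output from a real workspace.
sample_state = {
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_instance",
            "name": "web",
            "instances": [
                {"attributes": {"id": "i-0abc", "tags": {"Owner": "platform"}}}
            ],
        },
        {
            "mode": "data",
            "type": "aws_ami",
            "name": "ubuntu",
            "instances": [{"attributes": {"id": "ami-0123"}}],
        },
    ],
}

candidates = cis_from_terraform_state(sample_state)
```

Carrying the resource tags along supports the tag-based management mentioned above: an `Owner` tag, for instance, can populate the CI's ownership fields.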
- Documentation as Code (treating docs like software): Store documentation in Git repositories alongside code; use Markdown or AsciiDoc; CI pipeline builds and deploys documentation; PR review process for documentation changes; version docs with releases; automated link checking; stale content detection
- Tools: MkDocs, Docusaurus, Sphinx, Jekyll (static site generators)
- CI/CD: GitHub Actions / GitLab CI builds docs on merge; deploys to documentation site
- Standards: every PR must update relevant documentation; "no undocumented changes" policy
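Automated link checking is one of the simpler checks to add to a documentation pipeline. The sketch below scans Markdown files for relative links whose targets do not exist on disk; the regular expression only handles inline `[text](target)` links, and the function name and demo layout are assumptions for illustration.

```python
import re
import tempfile
from pathlib import Path

# Matches inline Markdown links and captures the target, e.g. [text](target).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_relative_links(docs_root: Path) -> list:
    """Return (file, target) pairs for relative links that point at nothing."""
    broken = []
    for md_file in sorted(docs_root.rglob("*.md")):
        for target in LINK_PATTERN.findall(md_file.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and same-page anchors are out of scope
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((md_file.relative_to(docs_root).as_posix(), target))
    return broken


def run_demo() -> list:
    """Build a throwaway docs tree with one broken link and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "runbooks").mkdir()
        (root / "index.md").write_text(
            "[Runbook](runbooks/failover.md) [Missing](missing.md) "
            "[Site](https://example.com)",
            encoding="utf-8",
        )
        (root / "runbooks" / "failover.md").write_text(
            "[Home](../index.md#top)", encoding="utf-8"
        )
        return broken_relative_links(root)


demo_result = run_demo()
```

In a CI job, a non-empty result would typically fail the build so that broken links are caught in the pull request rather than on the published site.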
- Multi-language documentation (global teams, global users): Maintain documentation in multiple languages; use translation management system; designate language owners; establish translation workflow; version docs per language; flag translations for review when source changes
- Tools: Crowdin, Lokalise, Transifex (translation management)
- Process: source doc updated → translation team notified → translation completed within 5 business days → review → publish
- Cost: $0.05–$0.15 per word for professional translation; $5K–$50K/year for enterprise documentation
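Flagging translations for review when the source changes can be done by storing a digest of the source text each translation was made from. This is a minimal sketch; the `source_digest` field and the function names are hypothetical, and translation management systems generally track this internally.

```python
import hashlib


def digest(text: str) -> str:
    """Stable fingerprint of a source document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def translations_needing_review(source_text: str, translations: dict) -> list:
    """Languages whose translation was made from an older version of the source."""
    current = digest(source_text)
    return sorted(
        lang for lang, meta in translations.items()
        if meta["source_digest"] != current
    )


source_v1 = "Restart the service."
source_v2 = "Restart the service, then verify health."

translations = {
    "de": {"source_digest": digest(source_v1)},  # translated from the old source
    "fr": {"source_digest": digest(source_v2)},  # already up to date
}
needs_review = translations_needing_review(source_v2, translations)
```

Only the German translation is flagged here, which is the point at which the 5-business-day translation workflow described above would start.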
- Regulatory documentation (audit-ready, signed, version-controlled): Documents must be immutable once approved; version control with approval signatures; retention per regulatory requirement (7 years typical); access control (read-only for most, edit only by process owner); audit trail of all changes
- Tools: DocuSign for signatures; SharePoint with versioning; ServiceNow for policy management
- Retention: per the applicable regulation (7 years is typical); archive in immutable storage
- Access: role-based access control; audit log of all document access
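The retention rule can be expressed as a date calculation. The sketch assumes the retention period is counted from the approval date and uses the typical 7-year figure as a default; the actual start date and duration depend on the regulation that applies.

```python
from datetime import date


def retention_end(approved_on: date, years: int = 7) -> date:
    """Last day a document must be retained, counted from its approval date."""
    try:
        return approved_on.replace(year=approved_on.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return approved_on.replace(year=approved_on.year + years, day=28)


def may_purge(approved_on: date, today: date, years: int = 7) -> bool:
    """True only once the retention period has fully elapsed."""
    return today > retention_end(approved_on, years)


standard_case = retention_end(date(2020, 3, 15))
leap_day_case = retention_end(date(2024, 2, 29))
```

A document approved on 15 March 2020 must be kept through 15 March 2027, and one approved on 29 February 2024 through 28 February 2031.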
- Documentation adoption (team doesn't write or read documentation): Leadership mandate: documentation is part of definition of done; measure documentation coverage (% of systems with current docs); tie to performance reviews; make documentation easy (templates, auto-generation); celebrate good documentation; hold documentation quality reviews in sprint retrospectives
- Metrics: documentation coverage (% systems documented); documentation freshness (last updated date); documentation usage (page views, search queries)
- Incentives: "documentation champion" recognition; documentation quality in sprint demos; new hire onboarding quality metric
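The coverage and freshness metrics above can be computed from a simple inventory of systems and the date each one's documentation was last updated. The 180-day freshness threshold, the field names, and the sample systems are assumptions for illustration.

```python
from datetime import date


def documentation_metrics(systems: list, today: date, max_age_days: int = 180) -> dict:
    """Coverage = share of systems with any docs; current = share with fresh docs."""
    documented = [s for s in systems if s["doc_last_updated"] is not None]
    current = [
        s for s in documented
        if (today - s["doc_last_updated"]).days <= max_age_days
    ]
    total = len(systems)
    return {
        "coverage_pct": round(100 * len(documented) / total, 1) if total else 0.0,
        "current_pct": round(100 * len(current) / total, 1) if total else 0.0,
        "stale": sorted(s["name"] for s in documented if s not in current),
    }


systems = [
    {"name": "payroll", "doc_last_updated": date(2024, 5, 1)},
    {"name": "crm", "doc_last_updated": date(2023, 6, 1)},
    {"name": "dns", "doc_last_updated": None},
    {"name": "vpn", "doc_last_updated": date(2024, 4, 15)},
]
metrics = documentation_metrics(systems, today=date(2024, 6, 1))
```

For this sample, three of four systems are documented (75% coverage), two have documentation updated within the threshold (50% current), and the CRM documentation is reported as stale.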