best-practices
incident-management
devops
reliability

How to: Incident Management

Our recommendations to minimize downtime and build more resilient systems.

Warrn Team
January 10, 2024
8 min read

5 Incident Management Best Practices That Actually Work

After analyzing thousands of incidents across hundreds of organizations, we've identified the key practices that separate high-performing teams from those constantly fighting fires. Here are the five most impactful strategies you can implement today.

1. Implement Clear Severity Classifications

The Problem: Teams waste precious minutes during incidents debating whether something is a P1 or P2, while customers are experiencing downtime.

The Solution: Define crystal-clear severity levels with specific criteria:

P1 - Critical

  • Customer Impact: Complete service outage or security breach
  • Response Time: 5 minutes
  • Escalation: Immediate C-level notification
  • Example: Payment processing down, data breach, complete site outage

P2 - High

  • Customer Impact: Major feature degraded, affecting >50% of users
  • Response Time: 15 minutes
  • Escalation: Engineering manager notification
  • Example: Login issues, slow API responses, partial feature outage

P3 - Medium

  • Customer Impact: Minor feature issues, affecting <10% of users
  • Response Time: 1 hour
  • Escalation: Team lead notification
  • Example: UI bugs, non-critical API errors, cosmetic issues

P4 - Low

  • Customer Impact: Internal tools affected, no customer impact
  • Response Time: Next business day
  • Escalation: Standard ticket queue
  • Example: Internal dashboard issues, monitoring alerts, documentation updates

Pro Tip: Print these definitions and post them where your team can see them. During high-stress incidents, even experienced engineers can forget the criteria.
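
If you also want these definitions to drive your paging automation, the matrix can be encoded directly. Here is a minimal sketch; the `SeverityPolicy` fields, escalation labels, and `route_alert` helper are illustrative, not part of any particular tool:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity matrix above."""
    response_minutes: int   # acknowledge within this window
    escalate_to: str        # who gets notified immediately

# Hypothetical encoding of the P1-P4 matrix; adjust to your own criteria.
SEVERITY_MATRIX = {
    "P1": SeverityPolicy(response_minutes=5, escalate_to="c-level"),
    "P2": SeverityPolicy(response_minutes=15, escalate_to="engineering-manager"),
    "P3": SeverityPolicy(response_minutes=60, escalate_to="team-lead"),
    "P4": SeverityPolicy(response_minutes=24 * 60, escalate_to="ticket-queue"),
}

def route_alert(severity: str) -> SeverityPolicy:
    """Fail loudly on unknown severities instead of silently under-paging."""
    try:
        return SEVERITY_MATRIX[severity]
    except KeyError:
        raise ValueError(
            f"Unknown severity {severity!r}; expected one of {sorted(SEVERITY_MATRIX)}"
        )

print(route_alert("P1"))  # SeverityPolicy(response_minutes=5, escalate_to='c-level')
```

Keeping the matrix in one place means the paging tool, the status page, and the printed poster can't drift apart.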

2. Establish Role-Based Response Teams

The Problem: Too many people jumping into incident calls creates chaos and slows resolution.

The Solution: Define specific roles with clear responsibilities:

Incident Commander

  • Responsibility: Coordinate response, make decisions, communicate with stakeholders
  • Skills: Strong communication, decision-making under pressure
  • Not responsible for: Technical troubleshooting

Technical Lead

  • Responsibility: Direct technical investigation and resolution
  • Skills: Deep system knowledge, debugging expertise
  • Not responsible for: External communication

Communications Lead

  • Responsibility: Update status pages, notify customers, coordinate with support
  • Skills: Clear writing, customer empathy
  • Not responsible for: Technical decisions

Subject Matter Expert (SME)

  • Responsibility: Provide domain-specific knowledge
  • Skills: Deep expertise in affected system
  • Rotation: Called in as needed, not always present
```markdown
## Sample Incident Response Structure

**P1 Incident Response Team:**
- 1 Incident Commander
- 1 Technical Lead
- 1 Communications Lead
- 1-2 SMEs (as needed)

**P2 Incident Response Team:**
- 1 Technical Lead (doubles as commander)
- 1 SME
- Communications handled by on-call rotation
```
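
The same structure can feed your paging tool so the right mix of roles is summoned per severity. A rough sketch; the role labels mirror the list above, and the `team_for` helper is purely illustrative:

```python
# Illustrative mapping from severity to the roles that get paged.
RESPONSE_TEAMS = {
    "P1": ["incident-commander", "technical-lead", "communications-lead", "sme"],
    "P2": ["technical-lead", "sme"],  # technical lead doubles as commander
}

def team_for(severity: str) -> list[str]:
    """Lower severities default to a single technical lead."""
    return RESPONSE_TEAMS.get(severity, ["technical-lead"])

print(team_for("P1"))
# ['incident-commander', 'technical-lead', 'communications-lead', 'sme']
```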

3. Create Actionable Runbooks

The Problem: Generic runbooks that say "check the logs" or "restart the service" waste time during incidents.

The Solution: Write runbooks like recipes - specific, step-by-step instructions that anyone can follow.

Good Runbook Example: Database Connection Issues

```bash
# Database Connection Timeout Investigation

## Step 1: Check Connection Pool Status
kubectl exec -it app-pod -- curl http://localhost:8080/health/db
# Expected: {"status": "healthy", "connections": {"active": 5, "idle": 15}}
# If connections > 80% of pool size, proceed to Step 2

## Step 2: Identify Long-Running Queries
kubectl exec -it postgres-pod -- psql -c "
  SELECT pid,
         now() - pg_stat_activity.query_start AS duration,
         query
  FROM pg_stat_activity
  WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';"

## Step 3: Check for Lock Contention
kubectl exec -it postgres-pod -- psql -c "
  SELECT blocked_locks.pid      AS blocked_pid,
         blocking_locks.pid     AS blocking_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.granted
    AND blocking_locks.granted
    AND blocking_locks.pid != blocked_locks.pid;"

## Step 4: Emergency Mitigation
# If queries are stuck for >10 minutes:
kubectl scale deployment app --replicas=10
# This adds capacity while investigating root cause
```

Runbook Checklist

  • ✅ Specific commands with expected outputs
  • ✅ Decision trees for different scenarios
  • ✅ Emergency mitigation steps
  • ✅ When to escalate
  • ✅ Post-incident cleanup tasks
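
The decision points don't have to live only in prose; the branch logic ("if connections > 80% of the pool, go to Step 2") can be scripted so it runs the same way at 3 a.m. as it does in a drill. A sketch, assuming the health endpoint from Step 1 above and a made-up pool size of 20:

```python
import json
import subprocess

POOL_SIZE = 20  # assumed connection-pool size; substitute your real limit

def check_connection_pool() -> dict:
    """Runbook Step 1: ask the app's DB health endpoint for pool stats."""
    result = subprocess.run(
        ["kubectl", "exec", "app-pod", "--",   # no -t flag: we are not in a TTY
         "curl", "-s", "http://localhost:8080/health/db"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def next_step(health: dict) -> str:
    """Encode the runbook's decision tree instead of leaving it to memory."""
    active = health["connections"]["active"]
    if active > 0.8 * POOL_SIZE:
        return "Go to Step 2: identify long-running queries"
    return "Pool looks healthy; widen the investigation"

if __name__ == "__main__":
    print(next_step(check_connection_pool()))
```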

4. Invest in Observability, Not Just Monitoring

The Problem: Traditional monitoring tells you what is broken, but not why it's broken.

The Solution: Build observability that helps you understand system behavior, not just detect failures.

The Three Pillars of Observability

Metrics: The What

```yaml
# Good metrics focus on user experience
- API response time (95th percentile)
- Error rate by endpoint
- Database query duration
- Queue depth and processing time

# Avoid vanity metrics
- CPU usage (unless correlated with user impact)
- Memory usage (unless causing OOM)
- Disk space (unless affecting performance)
```
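
In practice, "API response time by endpoint" is just a labelled histogram. Here is a small sketch using the Prometheus Python client; the metric names, labels, and `/checkout` endpoint are examples, not a required convention:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# User-facing metrics, labelled by endpoint so dashboards and alerts can
# slice by what customers actually hit.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency by endpoint", ["endpoint"]
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total", "Failed requests by endpoint", ["endpoint"]
)

def handle_checkout() -> None:
    """Simulated request handler: record latency and failures."""
    with REQUEST_LATENCY.labels(endpoint="/checkout").time():
        time.sleep(random.uniform(0.01, 0.05))   # pretend to do work
        if random.random() < 0.01:
            REQUEST_ERRORS.labels(endpoint="/checkout").inc()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    while True:
        handle_checkout()
```

The 95th percentile itself comes from a `histogram_quantile` query over these buckets on the Prometheus side, not from the application code.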

Logs: The Why

```json
// Structured logging with context
{
  "timestamp": "2024-01-10T15:30:45Z",
  "level": "ERROR",
  "service": "payment-processor",
  "trace_id": "abc123",
  "user_id": "user456",
  "error": "payment_gateway_timeout",
  "gateway": "stripe",
  "amount": 2999,
  "currency": "USD",
  "retry_count": 2,
  "message": "Payment gateway timeout after 30s"
}
```

Traces: The How

  • Distributed tracing to follow requests across services
  • Correlation IDs to connect related events
  • Span annotations to capture business context
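
As a concrete sketch of this pillar, here is roughly what span annotations look like with the OpenTelemetry Python API. This assumes a TracerProvider and exporter are configured elsewhere, and the attribute names are our own choices, not a required schema:

```python
from opentelemetry import trace

tracer = trace.get_tracer("payment-processor")

def charge(user_id: str, amount_cents: int) -> None:
    # One span per unit of business work; attributes carry the context that
    # lets a trace answer "which user, which gateway, how much?"
    with tracer.start_as_current_span("charge_payment") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("payment.gateway", "stripe")
        span.set_attribute("payment.amount_cents", amount_cents)

        # The trace ID doubles as the correlation ID in structured logs.
        trace_id = format(span.get_span_context().trace_id, "032x")
        print({"trace_id": trace_id, "message": "charging gateway"})
        # ... call the payment gateway here; exceptions raised inside the
        # block are recorded on the span automatically.
```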

Observability Best Practices

  1. Instrument at the business logic level, not just infrastructure
  2. Use consistent naming conventions across all services
  3. Include correlation IDs in all logs and traces
  4. Set up alerts on symptoms, not causes
  5. Create dashboards for each service owner

5. Conduct Blameless Post-Mortems

The Problem: Teams either skip post-mortems entirely or use them to assign blame, missing opportunities to improve.

The Solution: Run structured, blameless post-mortems focused on system improvements.

Post-Mortem Template

Incident Summary

  • Date/Time: When did it occur?
  • Duration: How long did it last?
  • Impact: Who was affected and how?
  • Severity: P1/P2/P3/P4

Timeline

14:32 - First alert received (database connection timeouts)
14:35 - Incident commander assigned, war room created
14:38 - Identified high query volume from new feature deployment
14:42 - Rolled back deployment, connections stabilized
14:45 - Confirmed service recovery
14:50 - Post-incident communication sent

Root Cause Analysis

Use the "5 Whys" technique:

  1. Why did the database connections timeout?
    • Query volume exceeded connection pool capacity
  2. Why did query volume spike?
    • New feature generated N+1 queries
  3. Why weren't the N+1 queries caught?
    • No performance testing on staging environment
  4. Why is staging environment different?
    • Staging uses smaller dataset, doesn't reveal scaling issues
  5. Why don't we have realistic staging data?
    • No process for data sanitization and staging refresh

Action Items

  • Owner: Who's responsible
  • Due Date: When will it be done
  • Priority: P1/P2/P3
  • Tracking: Link to ticket/PR

Action Items from Database Timeout Incident

| Action | Owner | Due Date | Priority | Status |
| --- | --- | --- | --- | --- |
| Implement query performance testing in CI | @sarah | 2024-01-20 | P1 | In Progress |
| Set up staging data refresh automation | @mike | 2024-01-25 | P2 | Not Started |
| Add database connection pool monitoring | @alex | 2024-01-15 | P1 | Complete |
| Create N+1 query detection tooling | @team | 2024-02-01 | P3 | Not Started |

Post-Mortem Best Practices

  • Schedule within 48 hours while details are fresh
  • Include all stakeholders, not just engineers
  • Focus on systems and processes, not individuals
  • Track action items to completion
  • Share learnings across the organization

Measuring Success

How do you know if these practices are working? Track these key metrics:

Incident Response Metrics

  • Mean Time to Detection (MTTD): How quickly do you discover incidents?
  • Mean Time to Resolution (MTTR): How quickly do you fix them?
  • Incident Recurrence Rate: Are you solving root causes?
  • False Positive Rate: Are your alerts actionable?
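
MTTD and MTTR are simple averages over your incident records. A minimal sketch; the field names below are assumptions about what your incident tracker exports, and some teams measure MTTR from detection rather than from the start of impact:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident export: when impact started, when it was detected,
# and when it was resolved.
incidents = [
    {"started": datetime(2024, 1, 10, 14, 25),
     "detected": datetime(2024, 1, 10, 14, 32),
     "resolved": datetime(2024, 1, 10, 14, 45)},
    {"started": datetime(2024, 1, 12, 9, 0),
     "detected": datetime(2024, 1, 12, 9, 20),
     "resolved": datetime(2024, 1, 12, 10, 5)},
]

def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60

mttd = mean(minutes(i["detected"] - i["started"]) for i in incidents)
mttr = mean(minutes(i["resolved"] - i["started"]) for i in incidents)

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```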

Team Health Metrics

  • On-call Burnout Score: Survey your team regularly
  • Post-Mortem Action Item Completion: Are you actually improving?
  • Runbook Usage: Are they helpful during incidents?
  • Cross-Training Coverage: Can multiple people handle each system?

Getting Started

Don't try to implement everything at once. Here's a practical rollout plan:

Week 1-2: Foundations

  • Define severity classifications
  • Assign incident response roles
  • Create incident communication templates

Week 3-4: Documentation

  • Audit existing runbooks
  • Rewrite top 3 most-used runbooks
  • Create post-mortem template

Week 5-8: Observability

  • Implement structured logging
  • Set up business-level metrics
  • Create service-specific dashboards

Week 9-12: Process

  • Run first blameless post-mortem
  • Track action items to completion
  • Measure and iterate on metrics

Conclusion

Incident management isn't about preventing all incidents - it's about responding to them effectively when they happen. These five practices will help your team:

  • Respond faster with clear processes and roles
  • Resolve issues more efficiently with actionable runbooks
  • Learn from failures through blameless post-mortems
  • Prevent recurring issues with better observability
  • Reduce team stress with predictable, practiced responses

Remember: The best incident management system is the one your team actually uses. Start small, measure your progress, and iterate based on what works for your organization.


Want to see how Warrn can help automate these best practices? Schedule a demo to learn how our AI-powered platform implements these strategies out of the box.
