How To: Incident Management
Our recommendations to minimize downtime and build more resilient systems.
# 5 Incident Management Best Practices That Actually Work
After analyzing thousands of incidents across hundreds of organizations, we've identified the key practices that separate high-performing teams from those constantly fighting fires. Here are the five most impactful strategies you can implement today.
## 1. Implement Clear Severity Classifications
**The Problem:** Teams waste precious minutes during incidents debating whether something is a P1 or P2, while customers are experiencing downtime.

**The Solution:** Define crystal-clear severity levels with specific criteria:
**P1 - Critical**
- Customer Impact: Complete service outage or security breach
- Response Time: 5 minutes
- Escalation: Immediate C-level notification
- Examples: Payment processing down, data breach, complete site outage

**P2 - High**
- Customer Impact: Major feature degraded, affecting >50% of users
- Response Time: 15 minutes
- Escalation: Engineering manager notification
- Examples: Login issues, slow API responses, partial feature outage

**P3 - Medium**
- Customer Impact: Minor feature issues, affecting <10% of users
- Response Time: 1 hour
- Escalation: Team lead notification
- Examples: UI bugs, non-critical API errors, cosmetic issues

**P4 - Low**
- Customer Impact: Internal tools affected, no customer impact
- Response Time: Next business day
- Escalation: Standard ticket queue
- Examples: Internal dashboard issues, monitoring alerts, documentation updates
Pro Tip: Print these definitions and post them where your team can see them. During high-stress incidents, even experienced engineers can forget the criteria.
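These definitions are even more useful when they live in code rather than only on a wall. As a minimal sketch (the `Severity` and `SEVERITY_POLICIES` names are illustrative, not from any particular tool), the criteria above could be encoded so that paging and escalation automation reads the same definitions your team does:

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass(frozen=True)
class SeverityPolicy:
    customer_impact: str
    response_time: timedelta
    escalation: str


# Mirrors the definitions above; adjust the criteria to your own organization.
SEVERITY_POLICIES = {
    Severity.P1: SeverityPolicy(
        customer_impact="Complete service outage or security breach",
        response_time=timedelta(minutes=5),
        escalation="Immediate C-level notification",
    ),
    Severity.P2: SeverityPolicy(
        customer_impact="Major feature degraded, affecting >50% of users",
        response_time=timedelta(minutes=15),
        escalation="Engineering manager notification",
    ),
    Severity.P3: SeverityPolicy(
        customer_impact="Minor feature issues, affecting <10% of users",
        response_time=timedelta(hours=1),
        escalation="Team lead notification",
    ),
    Severity.P4: SeverityPolicy(
        customer_impact="Internal tools affected, no customer impact",
        response_time=timedelta(days=1),  # "next business day", approximated
        escalation="Standard ticket queue",
    ),
}
```

With the policy in one place, an alerting webhook can look up `SEVERITY_POLICIES[severity].response_time` instead of relying on anyone's memory mid-incident.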
## 2. Establish Role-Based Response Teams
**The Problem:** Too many people jumping into incident calls creates chaos and slows resolution.

**The Solution:** Define specific roles with clear responsibilities:
**Incident Commander**
- Responsibility: Coordinate response, make decisions, communicate with stakeholders
- Skills: Strong communication, decision-making under pressure
- Not responsible for: Technical troubleshooting

**Technical Lead**
- Responsibility: Direct technical investigation and resolution
- Skills: Deep system knowledge, debugging expertise
- Not responsible for: External communication

**Communications Lead**
- Responsibility: Update status pages, notify customers, coordinate with support
- Skills: Clear writing, customer empathy
- Not responsible for: Technical decisions

**Subject Matter Expert (SME)**
- Responsibility: Provide domain-specific knowledge
- Skills: Deep expertise in the affected system
- Rotation: Called in as needed, not always present
### Sample Incident Response Structure
**P1 Incident Response Team:**
- 1 Incident Commander
- 1 Technical Lead
- 1 Communications Lead
- 1-2 SMEs (as needed)
**P2 Incident Response Team:**
- 1 Technical Lead (doubles as commander)
- 1 SME
- Communications handled by on-call rotation
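The same idea applies to team composition: encoding which roles each severity requires removes debate about who joins the call. Here is a minimal sketch, continuing the hypothetical Python config from the previous section (the role names and structure are assumptions, not a prescribed schema):

```python
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "incident_commander"
    TECHNICAL_LEAD = "technical_lead"
    COMMUNICATIONS_LEAD = "communications_lead"
    SME = "subject_matter_expert"


# Roles that must be staffed per severity. One person can hold two roles on
# lower-severity incidents (e.g. the technical lead doubles as commander on a P2).
RESPONSE_TEAM = {
    "P1": [Role.INCIDENT_COMMANDER, Role.TECHNICAL_LEAD, Role.COMMUNICATIONS_LEAD, Role.SME],
    "P2": [Role.TECHNICAL_LEAD, Role.SME],
}


def roles_to_page(severity: str) -> list[Role]:
    """Return the roles to page before the incident call starts."""
    return RESPONSE_TEAM.get(severity, [Role.TECHNICAL_LEAD])
```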
## 3. Create Actionable Runbooks
**The Problem:** Generic runbooks that say "check the logs" or "restart the service" waste time during incidents.

**The Solution:** Write runbooks like recipes: specific, step-by-step instructions that anyone can follow.
### Good Runbook Example: Database Connection Issues

```bash
# Database Connection Timeout Investigation

## Step 1: Check Connection Pool Status
kubectl exec -it app-pod -- curl http://localhost:8080/health/db
# Expected: {"status": "healthy", "connections": {"active": 5, "idle": 15}}
# If active connections > 80% of pool size, proceed to Step 2

## Step 2: Identify Long-Running Queries
kubectl exec -it postgres-pod -- psql -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, state, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
  AND state <> 'idle';"

## Step 3: Check for Lock Contention
kubectl exec -it postgres-pod -- psql -c "
SELECT blocked_locks.pid         AS blocked_pid,
       blocking_locks.pid        AS blocking_pid,
       blocked_activity.usename  AS blocked_user,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query    AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;"

## Step 4: Emergency Mitigation
# If queries are stuck for >10 minutes:
kubectl scale deployment app --replicas=10
# This adds capacity while investigating the root cause
```
### Runbook Checklist
- ✅ Specific commands with expected outputs
- ✅ Decision trees for different scenarios
- ✅ Emergency mitigation steps
- ✅ When to escalate
- ✅ Post-incident cleanup tasks
## 4. Invest in Observability, Not Just Monitoring
**The Problem:** Traditional monitoring tells you what is broken, but not why it's broken.

**The Solution:** Build observability that helps you understand system behavior, not just detect failures.
### The Three Pillars of Observability
**Metrics: The What**

Good metrics focus on user experience:
- API response time (95th percentile)
- Error rate by endpoint
- Database query duration
- Queue depth and processing time

Avoid vanity metrics:
- CPU usage (unless correlated with user impact)
- Memory usage (unless causing OOM)
- Disk space (unless affecting performance)
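To make the "good metrics" concrete, here is a minimal sketch of instrumenting a request handler with the Python `prometheus_client` library; the metric names, labels, and the `/checkout` endpoint are illustrative. The 95th percentile itself is computed from the histogram buckets at query time (for example with `histogram_quantile` in Prometheus).

```python
import time

from prometheus_client import Counter, Histogram

# User-facing latency, labelled by endpoint. The 95th percentile is derived
# from these buckets at query time (e.g. histogram_quantile(0.95, ...)).
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

REQUEST_ERRORS = Counter(
    "http_request_errors_total",
    "HTTP requests that ended in an error",
    ["endpoint"],
)


def process_checkout(request):
    ...  # placeholder for the business logic


def handle_checkout(request):
    """Hypothetical request handler instrumented at the business level."""
    start = time.perf_counter()
    try:
        return process_checkout(request)
    except Exception:
        REQUEST_ERRORS.labels(endpoint="/checkout").inc()
        raise
    finally:
        REQUEST_LATENCY.labels(endpoint="/checkout").observe(time.perf_counter() - start)
```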
**Logs: The Why**

Structured logging with context:

```json
{
  "timestamp": "2024-01-10T15:30:45Z",
  "level": "ERROR",
  "service": "payment-processor",
  "trace_id": "abc123",
  "user_id": "user456",
  "error": "payment_gateway_timeout",
  "gateway": "stripe",
  "amount": 2999,
  "currency": "USD",
  "retry_count": 2,
  "message": "Payment gateway timeout after 30s"
}
```
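An entry like this doesn't require a special framework. As a sketch, Python's standard `logging` module with a small JSON formatter can produce it; the field names simply mirror the example above, and the `payment-processor` service name is illustrative.

```python
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("trace_id", "user_id", "error", "gateway", "amount", "currency", "retry_count")


class JsonFormatter(logging.Formatter):
    """Render each log record as one line of JSON, including extra context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-processor",
            "message": record.getMessage(),
        }
        # Anything passed via `extra=` becomes an attribute on the record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-processor")
logger.addHandler(handler)

logger.error(
    "Payment gateway timeout after 30s",
    extra={
        "trace_id": "abc123",
        "user_id": "user456",
        "error": "payment_gateway_timeout",
        "gateway": "stripe",
        "amount": 2999,
        "currency": "USD",
        "retry_count": 2,
    },
)
```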
**Traces: The How**
- Distributed tracing to follow requests across services
- Correlation IDs to connect related events
- Span annotations to capture business context
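As a sketch of what those three bullets look like in code, here is a hypothetical Python handler instrumented with the OpenTelemetry API. The span and attribute names are illustrative, and the snippet assumes a tracer provider and exporter are configured elsewhere in the service.

```python
from opentelemetry import trace

# Assumes a TracerProvider and exporter are set up at service start-up.
tracer = trace.get_tracer("payment-processor")


def call_payment_gateway(order_id: str, amount_cents: int) -> dict:
    """Placeholder for the real gateway call."""
    return {"status": "succeeded"}


def charge_customer(order_id: str, amount_cents: int) -> dict:
    # Each unit of work becomes a span; the propagated trace context (the
    # correlation/trace IDs) stitches spans from downstream services together.
    with tracer.start_as_current_span("charge_customer") as span:
        # Span annotations that capture business context, not just infrastructure.
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        span.set_attribute("payment.gateway", "stripe")
        result = call_payment_gateway(order_id, amount_cents)
        span.set_attribute("payment.status", result["status"])
        return result
```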
### Observability Best Practices
- Instrument at the business logic level, not just infrastructure
- Use consistent naming conventions across all services
- Include correlation IDs in all logs and traces
- Set up alerts on symptoms, not causes
- Create dashboards for each service owner
## 5. Conduct Blameless Post-Mortems
**The Problem:** Teams either skip post-mortems entirely or use them to assign blame, missing opportunities to improve.

**The Solution:** Run structured, blameless post-mortems focused on system improvements.
### Post-Mortem Template

**Incident Summary**
- Date/Time: When did it occur?
- Duration: How long did it last?
- Impact: Who was affected and how?
- Severity: P1/P2/P3/P4
**Timeline**

- 14:32 - First alert received (database connection timeouts)
- 14:35 - Incident commander assigned, war room created
- 14:38 - Identified high query volume from new feature deployment
- 14:42 - Rolled back deployment, connections stabilized
- 14:45 - Confirmed service recovery
- 14:50 - Post-incident communication sent
**Root Cause Analysis**

Use the "5 Whys" technique:
1. **Why did the database connections time out?** Query volume exceeded the connection pool capacity.
2. **Why did query volume spike?** The new feature generated N+1 queries.
3. **Why weren't the N+1 queries caught?** There was no performance testing in the staging environment.
4. **Why is the staging environment different?** Staging uses a smaller dataset, so it doesn't reveal scaling issues.
5. **Why don't we have realistic staging data?** There is no process for data sanitization and staging refreshes.
**Action Items**

Every action item should include:
- Owner: Who's responsible?
- Due Date: When will it be done?
- Priority: P1/P2/P3
- Tracking: Link to the ticket or PR
### Action Items from Database Timeout Incident

| Action | Owner | Due Date | Priority | Status |
|---|---|---|---|---|
| Implement query performance testing in CI | @sarah | 2024-01-20 | P1 | In Progress |
| Set up staging data refresh automation | @mike | 2024-01-25 | P2 | Not Started |
| Add database connection pool monitoring | @alex | 2024-01-15 | P1 | Complete |
| Create N+1 query detection tooling | @team | 2024-02-01 | P3 | Not Started |
### Post-Mortem Best Practices
- Schedule within 48 hours while details are fresh
- Include all stakeholders, not just engineers
- Focus on systems and processes, not individuals
- Track action items to completion
- Share learnings across the organization
## Measuring Success
How do you know if these practices are working? Track these key metrics:
### Incident Response Metrics
- Mean Time to Detection (MTTD): How quickly do you discover incidents?
- Mean Time to Resolution (MTTR): How quickly do you fix them?
- Incident Recurrence Rate: Are you solving root causes?
- False Positive Rate: Are your alerts actionable?
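The first two are straightforward to compute once each incident record carries timestamps for when the fault began, when it was detected, and when it was resolved. A minimal sketch follows; the `Incident` fields are assumptions rather than a specific tool's schema, and some teams measure MTTR from fault start instead of from detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime   # when the fault actually began
    detected_at: datetime  # first alert or customer report
    resolved_at: datetime  # service confirmed healthy again


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time to Detection: how long faults go unnoticed."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Resolution: detection to confirmed recovery."""
    return _mean([i.resolved_at - i.detected_at for i in incidents])
```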
### Team Health Metrics
- On-call Burnout Score: Survey your team regularly
- Post-Mortem Action Item Completion: Are you actually improving?
- Runbook Usage: Are they helpful during incidents?
- Cross-Training Coverage: Can multiple people handle each system?
## Getting Started
Don't try to implement everything at once. Here's a practical rollout plan:
**Week 1-2: Foundations**
- Define severity classifications
- Assign incident response roles
- Create incident communication templates
**Week 3-4: Documentation**
- Audit existing runbooks
- Rewrite top 3 most-used runbooks
- Create post-mortem template
**Week 5-8: Observability**
- Implement structured logging
- Set up business-level metrics
- Create service-specific dashboards
**Week 9-12: Process**
- Run first blameless post-mortem
- Track action items to completion
- Measure and iterate on metrics
## Conclusion
Incident management isn't about preventing all incidents - it's about responding to them effectively when they happen. These five practices will help your team:
- Respond faster with clear processes and roles
- Resolve issues more efficiently with actionable runbooks
- Learn from failures through blameless post-mortems
- Prevent recurring issues with better observability
- Reduce team stress with predictable, practiced responses
Remember: The best incident management system is the one your team actually uses. Start small, measure your progress, and iterate based on what works for your organization.
Want to see how Warrn can help automate these best practices? Schedule a demo to learn how our AI-powered platform implements these strategies out of the box.