How To: Incident Management
Our recommendations to minimize downtime and build more resilient systems.
# 5 Incident Management Best Practices That Actually Work
After analyzing thousands of incidents across hundreds of organizations, we've identified the key practices that separate high-performing teams from those constantly fighting fires. Here are the five most impactful strategies you can implement today.
## 1. Implement Clear Severity Classifications
**The Problem:** Teams waste precious minutes during incidents debating whether something is a P1 or P2, while customers are experiencing downtime.

**The Solution:** Define crystal-clear severity levels with specific criteria:
**P1 - Critical**
- Customer Impact: Complete service outage or security breach
- Response Time: 5 minutes
- Escalation: Immediate C-level notification
- Examples: Payment processing down, data breach, complete site outage

**P2 - High**
- Customer Impact: Major feature degraded, affecting >50% of users
- Response Time: 15 minutes
- Escalation: Engineering manager notification
- Examples: Login issues, slow API responses, partial feature outage

**P3 - Medium**
- Customer Impact: Minor feature issues, affecting <10% of users
- Response Time: 1 hour
- Escalation: Team lead notification
- Examples: UI bugs, non-critical API errors, cosmetic issues

**P4 - Low**
- Customer Impact: Internal tools affected, no customer impact
- Response Time: Next business day
- Escalation: Standard ticket queue
- Examples: Internal dashboard issues, monitoring alerts, documentation updates
Pro Tip: Print these definitions and post them where your team can see them. During high-stress incidents, even experienced engineers can forget the criteria.
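These definitions are even more useful when they live in code rather than only on a wall. As a minimal sketch (the `Severity` and `SEVERITY_POLICIES` names are illustrative, not from any particular tool), the criteria above could be encoded so that paging and escalation automation reads the same definitions your team does:

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass(frozen=True)
class SeverityPolicy:
    customer_impact: str
    response_time: timedelta
    escalation: str


# Mirrors the definitions above; adjust the criteria to your own organization.
SEVERITY_POLICIES = {
    Severity.P1: SeverityPolicy(
        customer_impact="Complete service outage or security breach",
        response_time=timedelta(minutes=5),
        escalation="Immediate C-level notification",
    ),
    Severity.P2: SeverityPolicy(
        customer_impact="Major feature degraded, affecting >50% of users",
        response_time=timedelta(minutes=15),
        escalation="Engineering manager notification",
    ),
    Severity.P3: SeverityPolicy(
        customer_impact="Minor feature issues, affecting <10% of users",
        response_time=timedelta(hours=1),
        escalation="Team lead notification",
    ),
    Severity.P4: SeverityPolicy(
        customer_impact="Internal tools affected, no customer impact",
        response_time=timedelta(days=1),  # "next business day", approximated
        escalation="Standard ticket queue",
    ),
}
```

With the policy in one place, an alerting webhook can look up `SEVERITY_POLICIES[severity].response_time` instead of relying on anyone's memory mid-incident.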
## 2. Establish Role-Based Response Teams
**The Problem:** Too many people jumping into incident calls creates chaos and slows resolution.

**The Solution:** Define specific roles with clear responsibilities:
**Incident Commander**
- Responsibility: Coordinate response, make decisions, communicate with stakeholders
- Skills: Strong communication, decision-making under pressure
- Not responsible for: Technical troubleshooting

**Technical Lead**
- Responsibility: Direct technical investigation and resolution
- Skills: Deep system knowledge, debugging expertise
- Not responsible for: External communication

**Communications Lead**
- Responsibility: Update status pages, notify customers, coordinate with support
- Skills: Clear writing, customer empathy
- Not responsible for: Technical decisions

**Subject Matter Expert (SME)**
- Responsibility: Provide domain-specific knowledge
- Skills: Deep expertise in the affected system
- Rotation: Called in as needed, not always present
### Sample Incident Response Structure
**P1 Incident Response Team:**
- 1 Incident Commander
- 1 Technical Lead
- 1 Communications Lead
- 1-2 SMEs (as needed)
**P2 Incident Response Team:**
- 1 Technical Lead (doubles as commander)
- 1 SME
- Communications handled by on-call rotation
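The same idea applies to team composition: encoding which roles each severity requires removes debate about who joins the call. Here is a minimal sketch, continuing the hypothetical Python config from the previous section (the role names and structure are assumptions, not a prescribed schema):

```python
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "incident_commander"
    TECHNICAL_LEAD = "technical_lead"
    COMMUNICATIONS_LEAD = "communications_lead"
    SME = "subject_matter_expert"


# Roles that must be staffed per severity. One person can hold two roles on
# lower-severity incidents (e.g. the technical lead doubles as commander on a P2).
RESPONSE_TEAM = {
    "P1": [Role.INCIDENT_COMMANDER, Role.TECHNICAL_LEAD, Role.COMMUNICATIONS_LEAD, Role.SME],
    "P2": [Role.TECHNICAL_LEAD, Role.SME],
}


def roles_to_page(severity: str) -> list[Role]:
    """Return the roles to page before the incident call starts."""
    return RESPONSE_TEAM.get(severity, [Role.TECHNICAL_LEAD])
```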
## 3. Create Actionable Runbooks
**The Problem:** Generic runbooks that say "check the logs" or "restart the service" waste time during incidents.

**The Solution:** Write runbooks like recipes: specific, step-by-step instructions that anyone can follow.
### Good Runbook Example: Database Connection Issues

```bash
# Database Connection Timeout Investigation

## Step 1: Check Connection Pool Status
kubectl exec -it app-pod -- curl http://localhost:8080/health/db
# Expected: {"status": "healthy", "connections": {"active": 5, "idle": 15}}
# If active connections > 80% of pool size, proceed to Step 2

## Step 2: Identify Long-Running Queries
kubectl exec -it postgres-pod -- psql -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, state, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
  AND state <> 'idle';"

## Step 3: Check for Lock Contention
kubectl exec -it postgres-pod -- psql -c "
SELECT blocked_locks.pid         AS blocked_pid,
       blocking_locks.pid        AS blocking_pid,
       blocked_activity.usename  AS blocked_user,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query    AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;"

## Step 4: Emergency Mitigation
# If queries are stuck for >10 minutes:
kubectl scale deployment app --replicas=10
# This adds capacity while investigating the root cause
```
### Runbook Checklist
- ✅ Specific commands with expected outputs
- ✅ Decision trees for different scenarios
- ✅ Emergency mitigation steps
- ✅ When to escalate
- ✅ Post-incident cleanup tasks
## 4. Invest in Observability, Not Just Monitoring
**The Problem:** Traditional monitoring tells you what is broken, but not why it's broken.

**The Solution:** Build observability that helps you understand system behavior, not just detect failures.
### The Three Pillars of Observability
**Metrics: The What**

Good metrics focus on user experience:
- API response time (95th percentile)
- Error rate by endpoint
- Database query duration
- Queue depth and processing time

Avoid vanity metrics:
- CPU usage (unless correlated with user impact)
- Memory usage (unless causing OOM)
- Disk space (unless affecting performance)
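To make the "good metrics" concrete, here is a minimal sketch of instrumenting a request handler with the Python `prometheus_client` library; the metric names, labels, and the `/checkout` endpoint are illustrative. The 95th percentile itself is computed from the histogram buckets at query time (for example with `histogram_quantile` in Prometheus).

```python
import time

from prometheus_client import Counter, Histogram

# User-facing latency, labelled by endpoint. The 95th percentile is derived
# from these buckets at query time (e.g. histogram_quantile(0.95, ...)).
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

REQUEST_ERRORS = Counter(
    "http_request_errors_total",
    "HTTP requests that ended in an error",
    ["endpoint"],
)


def process_checkout(request):
    ...  # placeholder for the business logic


def handle_checkout(request):
    """Hypothetical request handler instrumented at the business level."""
    start = time.perf_counter()
    try:
        return process_checkout(request)
    except Exception:
        REQUEST_ERRORS.labels(endpoint="/checkout").inc()
        raise
    finally:
        REQUEST_LATENCY.labels(endpoint="/checkout").observe(time.perf_counter() - start)
```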
**Logs: The Why**

Structured logging with context:

```json
{
  "timestamp": "2024-01-10T15:30:45Z",
  "level": "ERROR",
  "service": "payment-processor",
  "trace_id": "abc123",
  "user_id": "user456",
  "error": "payment_gateway_timeout",
  "gateway": "stripe",
  "amount": 2999,
  "currency": "USD",
  "retry_count": 2,
  "message": "Payment gateway timeout after 30s"
}
```
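An entry like this doesn't require a special framework. As a sketch, Python's standard `logging` module with a small JSON formatter can produce it; the field names simply mirror the example above, and the `payment-processor` service name is illustrative.

```python
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("trace_id", "user_id", "error", "gateway", "amount", "currency", "retry_count")


class JsonFormatter(logging.Formatter):
    """Render each log record as one line of JSON, including extra context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-processor",
            "message": record.getMessage(),
        }
        # Anything passed via `extra=` becomes an attribute on the record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-processor")
logger.addHandler(handler)

logger.error(
    "Payment gateway timeout after 30s",
    extra={
        "trace_id": "abc123",
        "user_id": "user456",
        "error": "payment_gateway_timeout",
        "gateway": "stripe",
        "amount": 2999,
        "currency": "USD",
        "retry_count": 2,
    },
)
```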
**Traces: The How**
- Distributed tracing to follow requests across services
- Correlation IDs to connect related events
- Span annotations to capture business context
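As a sketch of what those three bullets look like in code, here is a hypothetical Python handler instrumented with the OpenTelemetry API. The span and attribute names are illustrative, and the snippet assumes a tracer provider and exporter are configured elsewhere in the service.

```python
from opentelemetry import trace

# Assumes a TracerProvider and exporter are set up at service start-up.
tracer = trace.get_tracer("payment-processor")


def call_payment_gateway(order_id: str, amount_cents: int) -> dict:
    """Placeholder for the real gateway call."""
    return {"status": "succeeded"}


def charge_customer(order_id: str, amount_cents: int) -> dict:
    # Each unit of work becomes a span; the propagated trace context (the
    # correlation/trace IDs) stitches spans from downstream services together.
    with tracer.start_as_current_span("charge_customer") as span:
        # Span annotations that capture business context, not just infrastructure.
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        span.set_attribute("payment.gateway", "stripe")
        result = call_payment_gateway(order_id, amount_cents)
        span.set_attribute("payment.status", result["status"])
        return result
```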
### Observability Best Practices
- Instrument at the business logic level, not just infrastructure
- Use consistent naming conventions across all services
- Include correlation IDs in all logs and traces
- Set up alerts on symptoms, not causes
- Create dashboards for each service owner
## 5. Conduct Blameless Post-Mortems
**The Problem:** Teams either skip post-mortems entirely or use them to assign blame, missing opportunities to improve.

**The Solution:** Run structured, blameless post-mortems focused on system improvements.
### Post-Mortem Template

**Incident Summary**
- Date/Time: When did it occur?
- Duration: How long did it last?
- Impact: Who was affected and how?
- Severity: P1/P2/P3/P4
**Timeline**

- 14:32 - First alert received (database connection timeouts)
- 14:35 - Incident commander assigned, war room created
- 14:38 - Identified high query volume from new feature deployment
- 14:42 - Rolled back deployment, connections stabilized
- 14:45 - Confirmed service recovery
- 14:50 - Post-incident communication sent
**Root Cause Analysis**

Use the "5 Whys" technique:
1. **Why did the database connections time out?** Query volume exceeded the connection pool capacity.
2. **Why did query volume spike?** The new feature generated N+1 queries.
3. **Why weren't the N+1 queries caught?** There was no performance testing in the staging environment.
4. **Why is the staging environment different?** Staging uses a smaller dataset, so it doesn't reveal scaling issues.
5. **Why don't we have realistic staging data?** There is no process for data sanitization and staging refreshes.
**Action Items**

Every action item should include:
- Owner: Who's responsible?
- Due Date: When will it be done?
- Priority: P1/P2/P3
- Tracking: Link to the ticket or PR
### Action Items from Database Timeout Incident

| Action | Owner | Due Date | Priority | Status |
|---|---|---|---|---|
| Implement query performance testing in CI | @sarah | 2024-01-20 | P1 | In Progress |
| Set up staging data refresh automation | @mike | 2024-01-25 | P2 | Not Started |
| Add database connection pool monitoring | @alex | 2024-01-15 | P1 | Complete |
| Create N+1 query detection tooling | @team | 2024-02-01 | P3 | Not Started |
### Post-Mortem Best Practices
- Schedule within 48 hours while details are fresh
- Include all stakeholders, not just engineers
- Focus on systems and processes, not individuals
- Track action items to completion
- Share learnings across the organization
## Measuring Success
How do you know if these practices are working? Track these key metrics:
### Incident Response Metrics
- Mean Time to Detection (MTTD): How quickly do you discover incidents?
- Mean Time to Resolution (MTTR): How quickly do you fix them?
- Incident Recurrence Rate: Are you solving root causes?
- False Positive Rate: Are your alerts actionable?
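The first two are straightforward to compute once each incident record carries timestamps for when the fault began, when it was detected, and when it was resolved. A minimal sketch follows; the `Incident` fields are assumptions rather than a specific tool's schema, and some teams measure MTTR from fault start instead of from detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime   # when the fault actually began
    detected_at: datetime  # first alert or customer report
    resolved_at: datetime  # service confirmed healthy again


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time to Detection: how long faults go unnoticed."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Resolution: detection to confirmed recovery."""
    return _mean([i.resolved_at - i.detected_at for i in incidents])
```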
### Team Health Metrics
- On-call Burnout Score: Survey your team regularly
- Post-Mortem Action Item Completion: Are you actually improving?
- Runbook Usage: Are they helpful during incidents?
- Cross-Training Coverage: Can multiple people handle each system?
## Getting Started
Don't try to implement everything at once. Here's a practical rollout plan:
**Week 1-2: Foundations**
- Define severity classifications
- Assign incident response roles
- Create incident communication templates
**Week 3-4: Documentation**
- Audit existing runbooks
- Rewrite top 3 most-used runbooks
- Create post-mortem template
**Week 5-8: Observability**
- Implement structured logging
- Set up business-level metrics
- Create service-specific dashboards
**Week 9-12: Process**
- Run first blameless post-mortem
- Track action items to completion
- Measure and iterate on metrics
## Conclusion
Incident management isn't about preventing all incidents - it's about responding to them effectively when they happen. These five practices will help your team:
- Respond faster with clear processes and roles
- Resolve issues more efficiently with actionable runbooks
- Learn from failures through blameless post-mortems
- Prevent recurring issues with better observability
- Reduce team stress with predictable, practiced responses
Remember: The best incident management system is the one your team actually uses. Start small, measure your progress, and iterate based on what works for your organization.
Want to see how Warrn can help automate these best practices? Schedule a demo to learn how our AI-powered platform implements these strategies out of the box.