Incident Management

A complete guide to managing incidents in Burdenoff products.

Incident Classification

Severity Levels

Severity | Impact   | Response Time | Examples
---------|----------|---------------|---------
SEV 1    | Critical | < 5 min       | Complete outage, data loss, security breach
SEV 2    | High     | < 15 min      | Major feature down, severe performance issues
SEV 3    | Medium   | < 1 hour      | Minor feature issue, moderate performance degradation
SEV 4    | Low      | Next day      | Cosmetic issues, minor bugs
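
The table maps directly to data that tooling can use to compute response deadlines. Below is a minimal Python sketch (hypothetical, not part of any Burdenoff tooling) that encodes the targets, approximating SEV 4's "next day" as 24 hours:

```python
from datetime import datetime, timedelta

# Response-time targets from the severity table above.
# "Next day" for SEV 4 is approximated as 24 hours.
RESPONSE_SLA = {
    "SEV 1": timedelta(minutes=5),
    "SEV 2": timedelta(minutes=15),
    "SEV 3": timedelta(hours=1),
    "SEV 4": timedelta(hours=24),
}

def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Time by which the initial response is due for this severity."""
    return detected_at + RESPONSE_SLA[severity]

print(response_deadline("SEV 2", datetime(2024, 1, 15, 9, 0)))
# 2024-01-15 09:15:00
```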

Incident Response Process

1. Detection

  • Automated monitoring alerts
  • User reports
  • Internal discovery
  • Security alerts

2. Acknowledgement

Alert triggered → PagerDuty → On-call engineer

Acknowledge within 5 min
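
Acknowledgement normally happens in the PagerDuty app, but it can also be scripted. The sketch below targets PagerDuty's REST API v2 (`PUT /incidents/{id}`); treat the exact payload as an assumption to verify against the current PagerDuty docs, and note that the token, incident ID, and email are placeholders:

```python
import requests

API_TOKEN = "..."                   # placeholder PagerDuty REST API token
FROM_EMAIL = "oncall@example.com"   # PagerDuty requires a From header

def acknowledge(incident_id: str) -> None:
    """Mark a PagerDuty incident as acknowledged."""
    resp = requests.put(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        headers={
            "Authorization": f"Token token={API_TOKEN}",
            "Content-Type": "application/json",
            "From": FROM_EMAIL,
        },
        json={"incident": {"type": "incident_reference",
                           "status": "acknowledged"}},
    )
    resp.raise_for_status()
```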

3. Initial Response

  • Assess severity
  • Notify stakeholders
  • Create incident channel
  • Update status page
  • Begin investigation

4. Investigation

  • Review logs
  • Check metrics
  • Examine recent changes
  • Review related systems
  • Identify root cause

5. Mitigation

  • Apply temporary fix
  • Restore service
  • Verify resolution
  • Monitor stability

6. Resolution

  • Implement permanent fix
  • Deploy solution
  • Verify fix
  • Update documentation
  • Close incident

7. Post-Mortem

  • Document timeline
  • Identify root cause
  • List action items
  • Share learnings
  • Update runbooks

Incident Communication

Internal Communication

Slack Channel: #incident-[date]-[product]
- Regular updates (every 30 minutes)
- Status changes
- Action items
- Resolution updates
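
A small helper can enforce the #incident-[date]-[product] convention so channel names stay consistent. The sketch below is illustrative Python; the commented creation step uses the official slack_sdk client with a placeholder bot token:

```python
from datetime import date

def incident_channel_name(product: str, on: date | None = None) -> str:
    """Build a channel name following #incident-[date]-[product]."""
    on = on or date.today()
    slug = product.lower().replace(" ", "-")  # Slack names must be lowercase
    return f"incident-{on.isoformat()}-{slug}"

print(incident_channel_name("Checkout", date(2024, 1, 15)))
# incident-2024-01-15-checkout

# Optional creation step (requires a Slack bot token with channel scopes):
# from slack_sdk import WebClient
# WebClient(token="xoxb-...").conversations_create(
#     name=incident_channel_name("Checkout"))
```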

External Communication

Status Page: https://status.burdenoff.com
- Initial notification
- Investigation update
- Resolution update
- Post-mortem (if applicable)

Communication Templates

Initial Notification

Title: [Product] Service Disruption

We are currently investigating an issue affecting [product/feature].
Users may experience [impact description].

Our team is actively working to resolve this issue.
Next update in 30 minutes.

Status: Investigating
Start Time: [timestamp]
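
The bracketed fields in these templates lend themselves to programmatic fill-in, for example by the tool that opens the status page incident. A minimal sketch using Python's string.Template; the field names simply mirror the placeholders above:

```python
from string import Template

INITIAL_NOTIFICATION = Template("""\
Title: $product Service Disruption

We are currently investigating an issue affecting $feature.
Users may experience $impact.

Our team is actively working to resolve this issue.
Next update in 30 minutes.

Status: Investigating
Start Time: $start_time
""")

print(INITIAL_NOTIFICATION.substitute(
    product="Checkout",
    feature="payment processing",
    impact="failed card payments",
    start_time="2024-01-15 09:00 UTC",
))
```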

Resolution Update

Title: [Product] Service Restored

The issue affecting [product/feature] has been resolved.
Service has been restored to normal operation.

Root cause: [brief description]
Full post-mortem: [link]

Status: Resolved
Duration: [HH:MM]

Incident Roles

Incident Commander (IC)

  • Coordinates response
  • Makes decisions
  • Manages communication
  • Assigns tasks
  • Declares resolution

Technical Lead

  • Investigates issue
  • Implements fixes
  • Coordinates engineers
  • Provides technical updates

Communications Lead

  • Updates stakeholders
  • Manages status page
  • Coordinates customer support
  • Handles escalations

Scribe

  • Documents timeline
  • Records decisions
  • Tracks action items
  • Maintains incident log

On-Call Procedures

Handoff Checklist

  • Review open incidents
  • Check recent deployments
  • Review pending alerts
  • Update contact info
  • Test paging system

Response Workflow

1. Acknowledge alert (< 5 min)
2. Assess severity
3. Page additional help if needed
4. Begin investigation
5. Communicate status
6. Implement fix
7. Monitor resolution
8. Document incident

Escalation Path

Primary On-Call
↓ (no response after 5 min)
Secondary On-Call
↓ (no response after 5 min)
Team Lead
↓ (SEV 1 incidents)
Engineering Director
↓ (critical escalation)
CTO
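
In effect, the path is a timed retry loop: page one level, wait up to five minutes for an acknowledgement, then move to the next. The sketch below is schematic; page() and acknowledged() stand in for real paging calls and state checks, and it walks the full path unconditionally rather than gating the last levels on severity:

```python
import time

ESCALATION_PATH = [
    "primary-oncall",
    "secondary-oncall",
    "team-lead",
    "engineering-director",
    "cto",
]
ACK_TIMEOUT_S = 5 * 60   # 5 minutes per level
POLL_S = 15              # how often to re-check for an acknowledgement

def page(responder: str) -> None:
    print(f"paging {responder}")   # stand-in for a real paging call

def acknowledged() -> bool:
    return False                   # stand-in for a real state check

def escalate() -> str | None:
    """Walk the escalation path until someone acknowledges."""
    for responder in ESCALATION_PATH:
        page(responder)
        deadline = time.monotonic() + ACK_TIMEOUT_S
        while time.monotonic() < deadline:
            if acknowledged():
                return responder
            time.sleep(POLL_S)
    return None   # nobody acknowledged: treat as a paging failure
```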

Incident Tools

PagerDuty

  • Alert routing
  • On-call scheduling
  • Incident tracking
  • Status updates

Slack

  • #incidents: General incidents
  • #sev1-alerts: Critical alerts
  • #incident-[id]: Specific incident

Status Page

  • Public status updates
  • Subscriber notifications
  • Incident history
  • Uptime metrics

Incident Documentation

  • Internal wiki
  • Post-mortem repository
  • Runbook updates
  • Lessons learned

Common Incident Scenarios

Database Outage

1. Check database health
2. Review connection pool
3. Check disk space
4. Review slow queries
5. Restart if necessary
6. Restore from backup if needed
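
Steps 1 and 3 can be scripted with the standard library alone: a TCP reachability check against the database and a free-space check on its data volume. The host, port, and path below are placeholders:

```python
import shutil
import socket

DB_HOST, DB_PORT = "db.internal", 5432   # placeholder host and port
DATA_PATH = "/var/lib/postgresql"        # placeholder data volume

def db_reachable(timeout: float = 3.0) -> bool:
    """Step 1: can we open a TCP connection to the database?"""
    try:
        with socket.create_connection((DB_HOST, DB_PORT), timeout=timeout):
            return True
    except OSError:
        return False

def disk_free_percent(path: str = DATA_PATH) -> float:
    """Step 3: how much of the data volume is still free?"""
    usage = shutil.disk_usage(path)
    return 100 * usage.free / usage.total

print("db reachable:", db_reachable())
print(f"free disk: {disk_free_percent():.1f}%")
```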

High Error Rate

1. Identify error pattern
2. Check recent deployments
3. Review error logs
4. Rollback if needed
5. Apply fix
6. Monitor error rate
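
Step 1 often reduces to measuring the error rate over a recent window. Here is a stdlib-only sketch that counts 5xx responses in access-log-style lines; the log format and the 5% threshold are assumptions:

```python
# Assumed format: one request per line, HTTP status as the last field.
SAMPLE_LOG = [
    "GET /api/cart 200",
    "POST /api/checkout 500",
    "GET /api/cart 200",
    "POST /api/checkout 503",
]
ERROR_RATE_THRESHOLD = 0.05   # assumed alerting threshold (5%)

def error_rate(lines: list[str]) -> float:
    statuses = [int(line.rsplit(" ", 1)[1]) for line in lines]
    errors = sum(1 for s in statuses if s >= 500)
    return errors / len(statuses) if statuses else 0.0

rate = error_rate(SAMPLE_LOG)
print(f"error rate: {rate:.0%}")
if rate > ERROR_RATE_THRESHOLD:
    print("above threshold: check recent deployments, consider rollback")
```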

Performance Degradation

1. Check resource usage
2. Review recent changes
3. Analyze slow queries
4. Scale resources if needed
5. Optimize bottleneck
6. Monitor performance
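
Step 1 is a quick resource snapshot. The sketch below assumes the third-party psutil package (pip install psutil) and an arbitrary 90% saturation threshold:

```python
import psutil   # third-party: pip install psutil

# Step 1: snapshot CPU, memory, and disk pressure.
cpu = psutil.cpu_percent(interval=1)    # % CPU over a 1-second sample
mem = psutil.virtual_memory().percent   # % of RAM in use
disk = psutil.disk_usage("/").percent   # % of the root volume used

print(f"cpu={cpu}% mem={mem}% disk={disk}%")
for name, value in [("cpu", cpu), ("mem", mem), ("disk", disk)]:
    if value > 90:                      # assumed saturation threshold
        print(f"{name} looks saturated: likely bottleneck")
```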

Security Incident

1. Isolate affected systems
2. Preserve evidence
3. Assess damage
4. Contain breach
5. Notify security team
6. Follow security playbook

Post-Mortem Process

Timeline

  • Within 48 hours of resolution
  • Schedule 1-hour meeting
  • Invite all participants
  • Share draft beforehand

Post-Mortem Template

# Incident Post-Mortem: [Title]

## Summary
Brief description of the incident.

## Impact
- Duration: [time]
- Affected users: [number/percentage]
- Services affected: [list]
- Severity: [SEV level]

## Timeline
- HH:MM - Event description
- HH:MM - Event description
...

## Root Cause
Detailed explanation of what caused the incident.

## Resolution
How the incident was resolved.

## What Went Well
- Things that worked as expected
- Effective responses

## What Went Wrong
- Things that didn't work
- Gaps in procedures

## Action Items
- [ ] Action item 1 (Owner: Name, Due: Date)
- [ ] Action item 2 (Owner: Name, Due: Date)

## Lessons Learned
Key takeaways and improvements.

Follow-up

  • Track action items
  • Update runbooks
  • Improve monitoring
  • Share learnings
  • Prevent recurrence

Metrics and Reporting

Key Metrics

  • MTTD: Mean Time To Detect
  • MTTA: Mean Time To Acknowledge
  • MTTR: Mean Time To Resolve
  • MTBF: Mean Time Between Failures
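
MTTD and MTBF need monitoring and fleet data, but MTTA and MTTR fall directly out of incident timestamps. A sketch over illustrative records, taking MTTR as detection-to-resolution (definitions vary, so adjust to local convention):

```python
from datetime import datetime, timedelta

# Illustrative records: (detected, acknowledged, resolved).
INCIDENTS = [
    (datetime(2024, 1, 1, 9, 0),
     datetime(2024, 1, 1, 9, 4),
     datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 8, 14, 0),
     datetime(2024, 1, 8, 14, 2),
     datetime(2024, 1, 8, 14, 45)),
]

def mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

mtta = mean([ack - det for det, ack, _ in INCIDENTS])
mttr = mean([res - det for det, _, res in INCIDENTS])
print(f"MTTA: {mtta}  MTTR: {mttr}")
```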

Reporting

  • Weekly incident summary
  • Monthly trends analysis
  • Quarterly review
  • Annual report

Prevention

Proactive Measures

  • Chaos engineering
  • Load testing
  • Security audits
  • Code reviews
  • Monitoring improvements

Continuous Improvement

  • Runbook updates
  • Training sessions
  • Tool improvements
  • Process refinement
