# Incident Response Runbook

## Step-by-Step Response Process

### Phase 1: Detection & Acknowledgment (0-5 minutes)
1. Alert arrives in the Rocket.Chat #incidents channel.
2. On-call engineer:
   1. Acknowledge in channel: `@oncall acknowledging - investigating`
   2. Create an incident thread: `/incident create "Brief description of issue"`. This creates a dedicated channel: `#incident-YYYY-MM-DD-NNN`
   3. Determine severity (P0/P1/P2/P3).
   4. If P0, immediately notify:
      - Tag @leadership in the incident channel
      - Escalate to the DevOps lead
      - Consider external communication
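If the incident bot or slash command is unavailable, the acknowledgment can also be posted through Rocket.Chat's REST API. A minimal sketch, assuming a personal access token and user ID exported as environment variables; the channel name matches the one above, everything else is a placeholder:

```bash
# Post an acknowledgment to #incidents via the Rocket.Chat REST API.
# ROCKETCHAT_URL, ROCKETCHAT_TOKEN, and ROCKETCHAT_USER_ID are assumed to be
# set in the environment (e.g. exported from a local secrets file).
curl -s -X POST "${ROCKETCHAT_URL}/api/v1/chat.postMessage" \
  -H "X-Auth-Token: ${ROCKETCHAT_TOKEN}" \
  -H "X-User-Id: ${ROCKETCHAT_USER_ID}" \
  -H "Content-Type: application/json" \
  -d '{"channel": "#incidents", "text": "@oncall acknowledging - investigating"}'
```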
### Phase 2: Initial Investigation (5-15 minutes)

Gather information:

```bash
# Check recent deployments
# In the incident channel, the bot shows recent deploys

# Check monitoring dashboards
# - CPU, memory, disk usage
# - Error rates
# - Response times

# Check logs
# Access via your log aggregation tool

# Check DigitalOcean status
# https://status.digitalocean.com
```
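For a quick first pass from a terminal, the checks above can be scripted. A minimal sketch, assuming SSH access over the Netbird VPN, systemd-managed services, and the hostname/service names used elsewhere in this runbook (adjust to your environment):

```bash
# Recent errors from the API service logs (assumes systemd/journald)
ssh prod-api-01 'journalctl -u api --since "15 minutes ago" --no-pager | grep -iE "error|timeout" | tail -n 50'

# Resource pressure on the host
ssh prod-api-01 'uptime && free -m && df -h /'

# DigitalOcean status (Statuspage-hosted pages expose a JSON summary)
curl -s https://status.digitalocean.com/api/v2/status.json | jq -r '.status.description'
```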
Update the incident channel with findings:

```
10:05 - API error rate spiked to 25% starting 09:58
10:06 - Checking recent changes... deployment at 09:55
10:08 - Logs show database connection timeouts
```
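To put a number on an error-rate spike like the one above, the rate can be computed directly from access logs. A rough sketch, run on the affected host, assuming nginx-style access logs at a placeholder path with the HTTP status code in field 9:

```bash
# Share of 5xx responses among the last 10,000 requests; field 9 is the HTTP
# status code in nginx's default "combined" log format (path is a placeholder).
tail -n 10000 /var/log/nginx/access.log \
  | awk '{ total++; if ($9 ~ /^5/) errors++ }
         END { if (total) printf "%.1f%% errors (%d/%d)\n", 100*errors/total, errors, total }'
```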
### Phase 3: Mitigation (15-30 minutes)

Goal: stop the bleeding and restore service (not necessarily fix the root cause).
Common mitigation playbooks:
```bash
# Rollback recent deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=previous" \
  -i inventory/production

# Restart service
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=api host=prod-api-01" \
  -i inventory/production

# Scale up resources
ansible-playbook playbooks/emergency/scale-up.yml \
  -e "service=api replicas=3" \
  -i inventory/production

# Enable maintenance mode
ansible-playbook playbooks/emergency/maintenance-mode.yml \
  -e "enable=true" \
  -i inventory/production
```
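When the situation allows a minute of caution, the same playbooks can be exercised in check mode first to confirm they target what you expect; `--check`, `--diff`, and `--limit` are standard ansible-playbook flags, and the host pattern below is a placeholder:

```bash
# Dry run: show what the rollback would change on a single host,
# without applying anything.
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=previous" \
  -i inventory/production \
  --limit prod-api-01 \
  --check --diff
```

For a P0 where every minute counts, skip the dry run and go straight to the rollback.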
Update every 10 minutes:
```
10:15 - Initiating rollback to v1.2.3
10:20 - Rollback complete, monitoring error rates
10:25 - Error rate decreasing, now at 5%
10:30 - Service stabilized, error rate normal
```
### Phase 4: Communication
Internal (Rocket.Chat):
- Update incident channel every 10 minutes
- Tag relevant teams: @backend @frontend @devops
- Keep updates brief and factual
External (if customer-facing):
- Update status page (if you have one)
- Notify #customer-success team
- Consider email to affected customers
Template - Internal Update:
```
⏰ 10:15 UPDATE
Status: INVESTIGATING / MITIGATING / RESOLVED
Impact: [what's affected, how many users]
Actions: [what we're doing]
ETA: [estimated resolution time]
Next update: 10:25
```
Template - External Update:
```
We're experiencing issues with [service].
Impact: [what's affected]
Status: [investigating/fixing/monitoring]
Workaround: [if available]
ETA: [if known]
Updates every 15 minutes at status.company.com
```
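If the status page happens to be Atlassian Statuspage, the external update can also be posted via its REST API rather than the web UI. A hedged sketch only: the page ID, API key, and incident wording are placeholders, and the endpoint shown is Statuspage's standard incident-creation call:

```bash
# Create a status page incident (Atlassian Statuspage REST API).
# STATUSPAGE_PAGE_ID and STATUSPAGE_API_KEY are assumed environment variables.
curl -s -X POST "https://api.statuspage.io/v1/pages/${STATUSPAGE_PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${STATUSPAGE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "incident": {
          "name": "Elevated API error rates",
          "status": "investigating",
          "body": "We are investigating elevated error rates on the API. Next update in 15 minutes."
        }
      }'
```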
### Phase 5: Resolution
Service restored:
```
✅ RESOLVED - 10:45

Timeline:
- 09:58: Issue detected
- 10:02: Incident declared
- 10:20: Mitigation applied
- 10:45: Service fully restored

Root cause: Database connection pool exhaustion
Fix applied: Rolled back deployment, increased pool size
Monitoring: Added alerts for pool utilization

Next steps:
- Post-incident review: Tomorrow 2pm
- Action items: Track in GitHub issue
```
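Before posting the RESOLVED update and closing the channel, confirm the service really has recovered rather than merely quieted down. A minimal sketch, assuming a hypothetical `/health` endpoint on the API; swap in whatever health check or dashboard query you actually use:

```bash
# Poll a (hypothetical) health endpoint once a minute for 10 minutes;
# any non-200 response is flagged so a relapse is caught before closing.
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
  echo "$(date +%H:%M:%S) health check returned ${code}"
  [ "${code}" != "200" ] && echo "WARNING: service may not be fully recovered"
  sleep 60
done
```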
Close incident:
```
/incident close "Resolved by rollback + pool scaling"
```
### Phase 6: Post-Incident Review (Within 48 hours)

1. Schedule a meeting with relevant team members
2. Use the template: Post-Incident Review Template
3. Publish in the Docusaurus blog section for team learning
## Communication Matrix
| Severity | Rocket.Chat | Status Page | Customers | Leadership |
|---|---|---|---|---|
| P0 | Immediate | Yes | Yes | Immediately |
| P1 | Immediate | If >1hr | If customer-facing | If >30min |
| P2 | When fixed | No | No | Daily summary |
| P3 | When fixed | No | No | No |
## Decision Trees

### Is This an Incident?

```
Service impaired?    → Yes → Incident
                     → No  → Maybe monitoring issue
Customer impact?     → Yes → P0 or P1
                     → No  → P2 or P3
Security related?    → Yes → Also alert #security
                     → No  → Continue normal process
```
### Should I Rollback?

```
Recent deployment?        → Yes → Rollback
                          → No  → Investigate other causes
Error rate >20%?          → Yes → Rollback
                          → No  → Try other mitigation
Can reproduce in staging? → Yes → Fix in code first
                          → No  → Rollback to be safe
```
### Should I Escalate?

```
Stuck >30 min?        → Yes → Escalate to lead
                      → No  → Continue
Security breach?      → Yes → Escalate immediately
                      → No  → Continue
Data loss risk?       → Yes → Escalate + get approval before action
                      → No  → Continue
Customer outage >1hr? → Yes → Escalate + external comms
                      → No  → Continue
```
## Tools Quick Reference

- Monitoring: [Link to Grafana/monitoring]
- Logs: [Link to log system]
- Recent Deployments: Check CI/CD history
- Server Access: Via Netbird VPN + SSH
- Emergency Playbooks: `infrastructure/playbooks/emergency/`
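Connecting for the first time during an incident can cost precious minutes, so keep the VPN-plus-SSH sequence handy. A minimal sketch using Netbird's standard CLI commands (`netbird up`, `netbird status`); the hostname matches the examples above and is otherwise a placeholder:

```bash
# Bring up the Netbird VPN tunnel (no-op if already connected) and confirm peers.
netbird up
netbird status

# SSH to the affected host once the tunnel is up.
ssh prod-api-01
```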
## Emergency Contacts

Rocket.Chat:
- `@oncall` - Current on-call engineer
- `@devops-lead` - DevOps team lead
- `@security-team` - Security incidents
- `@leadership` - CEO, CTO (P0 only)

Phone: On-call phone number in the #devops channel topic
## Post-Incident Checklist

After resolving an incident:
- Incident marked as resolved
- Timeline documented
- Root cause identified
- Post-incident review scheduled (within 48 hours)
- Action items created in GitHub
- Customers notified (if applicable)
- Monitoring/alerts updated
- Runbook updated if needed
- Team debriefed
## Learning from Incidents
Every incident is a learning opportunity:
- What worked well?
- What didn't work?
- What would you do differently?
- How can we prevent this?
- Document the incident in a postmortem and share it with the team
- Track action items and actually complete them
- Update runbooks based on what you learned