Incident Response
Problem We're Solving
Chaotic incident handling: Unclear procedures, slow response times, no proper documentation, repeated mistakes.
Solution
Clear runbooks + severity levels + blameless postmortems + automated alerting via Rocket.Chat
Severity Levels
| Level | Description | Example | Response | Who |
|---|---|---|---|---|
| P0 | Complete outage, data breach | Site down, security breach | Immediate | On-call + leadership |
| P1 | Major degradation | API errors >50%, DB offline | < 15 min | On-call engineer |
| P2 | Minor issue | Single feature broken | < 2 hours | Assigned engineer |
| P3 | Cosmetic | UI glitch, typo | Next sprint | Backlog |
Quick Response Guide
P0/P1 Incident Flow
Alert (auto) → Acknowledge (< 5 min) → Investigate (< 15 min) → Mitigate (< 30 min) → Communicate (every 10 min) → Resolve (document it) → Review (within 48 hrs)
On-Call Rotation
Current on-call: Posted in Rocket.Chat #devops topic
Schedule: Weekly rotation (Monday to Monday)
Alerts sent to:
- Rocket.Chat @oncall mention
- Phone call (via monitoring system)
- SMS backup
On-call expectations:
- Response time: < 5 minutes
- Available 24/7 during your week
- Have laptop + VPN access ready
- Escalate if needed (don't hero)
Incident Response Process
1. Alert Triggered
Automated monitoring detects an issue:
- Service health check fails
- Error rate > 5%
- Response time > 2s
- Disk/CPU/memory threshold exceeded
Alert sent to Rocket.Chat #incidents channel
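As an illustration of what such a health check might look like, here is a minimal sketch in bash; the URL, timeout, and thresholds are placeholders, not our actual monitoring configuration:

```bash
#!/usr/bin/env bash
# Minimal health-check sketch: fails when the API is unreachable or slow.
# URL and thresholds are illustrative placeholders, not the real monitoring config.
set -euo pipefail

URL="https://api.example.com/healthz"
MAX_SECONDS=2

# %{http_code} and %{time_total} are standard curl --write-out variables.
read -r status elapsed < <(curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time 10 "$URL")

if [[ "$status" != "200" ]]; then
  echo "CRITICAL: health check returned HTTP $status"
  exit 2
elif (( $(echo "$elapsed > $MAX_SECONDS" | bc -l) )); then
  echo "WARNING: response time ${elapsed}s exceeds ${MAX_SECONDS}s"
  exit 1
fi
echo "OK: HTTP $status in ${elapsed}s"
```

In practice the monitoring system runs checks like this on a schedule and fires the Rocket.Chat alert when they fail.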
2. Acknowledge (< 5 minutes)
# In Rocket.Chat #incidents channel
@oncall acknowledging - investigating now
Create incident thread:
/incident create "API response time degraded"
# Bot creates: #incident-2025-11-10-001
3. Investigate & Mitigate (< 15 minutes)
Check dashboards:
- Monitoring (Grafana/similar)
- Recent deployments
- DigitalOcean status
- Error logs
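When dashboards are slow or unavailable, a command-line first pass covers the same ground. A sketch, reusing the example host and service names from the quick fixes below (adjust per incident):

```bash
# Quick CLI triage sketch; host, service name, and paths are examples only.
ssh prod-api-01 'systemctl status api --no-pager'                                      # is the service up?
ssh prod-api-01 'journalctl -u api --since "1 hour ago" -p err --no-pager | tail -50'  # recent errors
ssh prod-api-01 'df -h / && free -m && uptime'                                         # disk, memory, load
git -C infrastructure log --oneline -5                                                 # recent infra changes
```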
Common quick fixes:
# Restart service
ansible-playbook playbooks/emergency/restart-service.yml -e "service=api host=prod-api-01"
# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml -e "app=api version=previous"
# Scale up
ansible-playbook playbooks/emergency/scale-up.yml -e "service=api replicas=3"
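Where time allows, a dry run confirms a playbook targets the right hosts before it changes anything. This assumes the emergency playbooks support Ansible's check mode:

```bash
# Preview the hosts and tasks a playbook would touch without making changes.
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=api host=prod-api-01" --check --diff
```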
4. Communicate (Every 10 minutes)
Internal updates in incident channel:
10:05 - Investigating high error rate on API
10:15 - Root cause: database connection pool exhausted
10:25 - Mitigation: scaled DB connections, monitoring recovery
10:35 - Service restored, error rate back to normal
External (if needed):
- Update status page
- Email affected customers
- Post in #customer-success
5. Resolve
# In #incident-2025-11-10-001
✅ RESOLVED at 10:45
Root cause: Database connection pool too small
Fix: Increased pool size from 20 to 50
Monitoring: Added alerts for pool utilization
Next: Post-incident review tomorrow 2pm
Close incident:
/incident close "Resolved by scaling DB connections"
6. Post-Incident Review (Within 48 hours)
Blameless postmortem - Template below
Detailed Runbooks
- Incident Response Runbook - Step-by-step procedures
- Post-Incident Review Template - Blameless postmortem format
- Communication Templates - Stakeholder updates
- Common Scenarios - covered under Common Incident Types below
Common Incident Types
Application Down
Symptoms: Health checks failing, 503 errors
Quick checks:
- Service running? Check with systemctl status <service>
- Recent deployment? Check CI/CD
- Resource exhaustion? Check CPU/memory/disk
- Database connection? Test DB connectivity
Mitigation playbook: Service Down Runbook
Database Issues
Symptoms: Slow queries, connection errors, timeouts
Quick checks:
- DB server running?
- Connection pool exhausted?
- Slow query causing lock?
- Disk full?
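If the database is PostgreSQL (an assumption; adapt for other engines), the quick checks above map to commands like the following. Hostnames, user, database, and paths are placeholders:

```bash
# PostgreSQL-flavoured checks; host, credentials, and data directory are placeholders.
pg_isready -h prod-db-01 -p 5432                     # is the server accepting connections?
psql -h prod-db-01 -U admin -d appdb -c \
  "SELECT count(*) FROM pg_stat_activity;"           # connection pool pressure
psql -h prod-db-01 -U admin -d appdb -c \
  "SELECT pid, state, now() - query_start AS runtime, left(query, 60)
     FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY runtime DESC LIMIT 5;"                  # slowest active queries / possible locks
ssh prod-db-01 'df -h /var/lib/postgresql'           # disk full?
```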
Mitigation playbook: Database Issues Runbook
Security Incident
Symptoms: Unusual access, failed auth attempts, alerts from security tools
Immediate actions:
- Alert #security channel
- Don't delete evidence
- Isolate affected systems
- Escalate to security team
Mitigation playbook: Security Breach Runbook
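Before isolating a host, capture volatile state; a minimal evidence-collection sketch, assuming Linux hosts with sudo access (log paths vary by distro):

```bash
# Capture volatile state before isolation; do NOT modify or delete anything else on the host.
TS=$(date -u +%Y%m%dT%H%M%SZ)
OUT="/tmp/ir-evidence-$TS"
mkdir -p "$OUT"
w            > "$OUT/logged-in-users.txt"   # who is logged in right now
last -50     > "$OUT/recent-logins.txt"     # recent login history
ss -tnp      > "$OUT/open-connections.txt"  # active network connections
ps auxf      > "$OUT/process-tree.txt"      # running processes
sudo cp -a /var/log/auth.log* "$OUT/" 2>/dev/null || true  # auth logs (path varies by distro)
tar czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
echo "Evidence bundle: $OUT.tar.gz (copy off-host before isolating)"
```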
Tools & Access
Monitoring: (link to your monitoring dashboard)
Logs: (link to log aggregation)
Metrics: (link to metrics dashboard)
Playbooks: infrastructure/playbooks/emergency/
Emergency access: Request via /access emergency in #incidents channel
Escalation Path
On-call Engineer
↓ (if stuck or needs help)
DevOps Lead
↓ (if major outage or security)
CTO + Security Lead
↓ (if critical or PR needed)
CEO
When to escalate:
- You're stuck and can't resolve within 30 min
- Security breach suspected
- Data loss possible
- Customer-facing outage > 1 hour
- Need decision on trade-offs
Metrics
Track incident handling:
- MTTA (Mean Time To Acknowledge): Target < 5 min
- MTTM (Mean Time To Mitigate): Target < 30 min
- MTTR (Mean Time To Resolve): Target < 2 hours for P1
- Incidents per month: Track trend
- Repeat incidents: Should decrease over time
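These can be tracked without extra tooling from a simple log of incident timestamps. A sketch, assuming a hypothetical incidents.csv with epoch-second columns alert,ack,resolve:

```bash
# Compute MTTA and MTTR in minutes from a hypothetical incidents.csv
# with header "alert,ack,resolve" and epoch-second timestamps per row.
awk -F, 'NR > 1 {
  mtta += ($2 - $1); mttr += ($3 - $1); n++
} END {
  if (n) printf "MTTA: %.1f min  MTTR: %.1f min  (n=%d)\n", mtta/n/60, mttr/n/60, n
}' incidents.csv
```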
Prevention > Response
Goal: Reduce incidents through proactive measures
- Monitoring & Alerts: Catch issues before customers do
- Automated remediation: Auto-restart failed services (see the sketch below)
- Chaos engineering: Test failure scenarios in staging
- Postmortem action items: Fix root causes
- Runbook automation: Turn manual steps into playbooks
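As one example of the automated remediation item above, a minimal self-healing sketch that could run from cron or a systemd timer. The service name and webhook URL are placeholders, and a real setup should rate-limit restart attempts:

```bash
#!/usr/bin/env bash
# Watchdog sketch: restart a failed service and announce it, instead of paging a human.
# "api" and the webhook URL are placeholders; a real setup should also cap restart attempts.
SERVICE="api"
WEBHOOK="https://chat.example.com/hooks/REPLACE_ME"   # Rocket.Chat incoming webhook (placeholder)

if ! systemctl is-active --quiet "$SERVICE"; then
  systemctl restart "$SERVICE"
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"Auto-remediation: restarted $SERVICE on $(hostname)\"}" \
    "$WEBHOOK" > /dev/null
fi
```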