Incident Response

Problem We're Solving

Chaotic incident handling: unclear procedures, slow response times, no proper documentation, and repeated mistakes.

Solution

Clear runbooks + severity levels + blameless postmortems + automated alerting via Rocket.Chat

Severity Levels

| Level | Description | Example | Response | Who |
| --- | --- | --- | --- | --- |
| P0 | Complete outage, data breach | Site down, security breach | Immediate | On-call + leadership |
| P1 | Major degradation | API errors >50%, DB offline | < 15 min | On-call engineer |
| P2 | Minor issue | Single feature broken | < 2 hours | Assigned engineer |
| P3 | Cosmetic | UI glitch, typo | Next sprint | Backlog |

Quick Response Guide

P0/P1 Incident Flow

Alert (automated) → Acknowledge (< 5 min) → Investigate (< 15 min) → Mitigate (< 30 min) → Communicate (every 10 min) → Resolve (document) → Review (within 48 hrs)

On-Call Rotation

Current on-call: Posted in Rocket.Chat #devops topic

Schedule: Weekly rotation (Monday to Monday)

Alerts sent to:

  • Rocket.Chat @oncall mention
  • Phone call (via monitoring system)
  • SMS backup

On-call expectations:

  • Response time: < 5 minutes
  • Available 24/7 during your week
  • Have laptop + VPN access ready
  • Escalate if needed (don't try to be a hero)

Incident Response Process

1. Alert Triggered

Automated monitoring detects an issue:

  • Service health check fails
  • Error rate > 5%
  • Response time > 2s
  • Disk/CPU/memory threshold exceeded

Alert sent to Rocket.Chat #incidents channel
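
Most monitoring tools can deliver this alert through a Rocket.Chat incoming webhook. A minimal sketch, assuming an incoming-webhook integration has been created for #incidents (the webhook URL, host name, and alert text are placeholders):

# Post an alert into #incidents via a Rocket.Chat incoming webhook
# (the URL below is a placeholder generated by the webhook integration)
WEBHOOK_URL="https://chat.example.com/hooks/<token>"

curl -sS -X POST "$WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"channel": "#incidents", "text": "@oncall P1: API error rate > 5% for 5 min on prod-api-01"}'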

2. Acknowledge (< 5 minutes)

# In Rocket.Chat #incidents channel
@oncall acknowledging - investigating now

Create incident thread:

/incident create "API response time degraded"
# Bot creates: #incident-2025-11-10-001

3. Investigate & Mitigate (< 15 minutes)

Check dashboards:

  • Monitoring (Grafana/similar)
  • Recent deployments
  • DigitalOcean status
  • Error logs
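
Some of these checks can be run straight from a terminal. A rough sketch; the service unit name, the repo path, and the assumption that the DigitalOcean status page exposes the standard Statuspage JSON endpoint are all placeholders to adjust:

# Recent deployments (assumes the infrastructure repo is checked out locally)
git -C ~/repos/infrastructure log --oneline -5

# Error volume in the last 15 minutes (service unit name is an assumption)
journalctl -u api --since "15 minutes ago" | grep -ci error

# DigitalOcean status (assumes the standard Statuspage JSON endpoint)
curl -s https://status.digitalocean.com/api/v2/status.json | jq -r '.status.description'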

Common quick fixes:

# Restart service
ansible-playbook playbooks/emergency/restart-service.yml -e "service=api host=prod-api-01"

# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml -e "app=api version=previous"

# Scale up
ansible-playbook playbooks/emergency/scale-up.yml -e "service=api replicas=3"
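
If the emergency playbooks are unreachable, an ad-hoc Ansible command can perform roughly the same restart (the host and service names are assumptions and must exist in your inventory):

# Ad-hoc equivalent of the restart playbook (host/service names are assumptions)
ansible prod-api-01 -m ansible.builtin.systemd -a "name=api state=restarted" --become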

4. Communicate (Every 10 minutes)

Internal updates in incident channel:

10:05 - Investigating high error rate on API
10:15 - Root cause: database connection pool exhausted
10:25 - Mitigation: scaled DB connections, monitoring recovery
10:35 - Service restored, error rate back to normal

External (if needed):

  • Update status page
  • Email affected customers
  • Post in #customer-success

5. Resolve

# In #incident-2025-11-10-001
✅ RESOLVED at 10:45
Root cause: Database connection pool too small
Fix: Increased pool size from 20 to 50
Monitoring: Added alerts for pool utilization
Next: Post-incident review tomorrow 2pm

Close incident:

/incident close "Resolved by scaling DB connections"

6. Post-Incident Review (Within 48 hours)

Blameless postmortem - Template below

Detailed Runbooks

Common Incident Types

Application Down

Symptoms: Health checks failing, 503 errors

Quick checks:

  1. Service running? systemctl status <service>
  2. Recent deployment? Check CI/CD
  3. Resource exhaustion? Check CPU/memory/disk
  4. Database connection? Test DB connectivity
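
Checks 1, 3, and 4 as shell commands, as a rough sketch (the unit name and database host/port are assumptions):

# 1. Service running? What did it log before failing?
systemctl status api
journalctl -u api -n 50 --no-pager

# 3. Resource exhaustion?
df -h      # disk
free -m    # memory
uptime     # load average

# 4. Database reachable from the app host? (host/port are assumptions)
nc -zv db.internal 5432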

Mitigation playbook: Service Down Runbook

Database Issues

Symptoms: Slow queries, connection errors, timeouts

Quick checks:

  1. DB server running?
  2. Connection pool exhausted?
  3. Slow query causing lock?
  4. Disk full?
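
Assuming the database is PostgreSQL (adjust for your engine), the checks map roughly to:

# 2. Connection count vs. configured maximum
psql -h db.internal -U postgres -c \
  "SELECT count(*) AS connections, current_setting('max_connections') AS max_conn FROM pg_stat_activity;"

# 3. Long-running queries that may be holding locks
psql -h db.internal -U postgres -c \
  "SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query
   FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 10;"

# 4. Disk on the database host
df -h /var/lib/postgresql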

Mitigation playbook: Database Issues Runbook

Security Incident

Symptoms: Unusual access, failed auth attempts, alerts from security tools

Immediate actions:

  1. Alert #security channel
  2. Don't delete evidence
  3. Isolate affected systems
  4. Escalate to security team
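
For step 3, one way to isolate a host without destroying evidence is to capture volatile state and then cut inbound traffic while leaving the machine running (the bastion IP is an assumption; do not reboot or reimage anything):

# Capture volatile evidence before changing anything
ps auxf  > /root/ir-processes-$(date +%s).txt
ss -tunap > /root/ir-connections-$(date +%s).txt

# Block all inbound traffic except SSH from the admin bastion (IP is an assumption);
# this also severs any active attacker sessions
iptables -I INPUT 1 -p tcp -s 203.0.113.10 --dport 22 -j ACCEPT
iptables -I INPUT 2 -j DROP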

Mitigation playbook: Security Breach Runbook

Tools & Access

  • Monitoring: (link to your monitoring dashboard)
  • Logs: (link to log aggregation)
  • Metrics: (link to metrics dashboard)
  • Playbooks: infrastructure/playbooks/emergency/

Emergency access: Request via /access emergency in #incidents channel

Escalation Path

On-call Engineer
↓ (if stuck or needs help)
DevOps Lead
↓ (if major outage or security)
CTO + Security Lead
↓ (if critical or PR needed)
CEO

When to escalate:

  • You're stuck and can't resolve within 30 min
  • Security breach suspected
  • Data loss possible
  • Customer-facing outage > 1 hour
  • Need decision on trade-offs

Metrics

Track incident handling:

  • MTTA (Mean Time To Acknowledge): Target < 5 min
  • MTTM (Mean Time To Mitigate): Target < 30 min
  • MTTR (Mean Time To Resolve): Target < 2 hours for P1
  • Incidents per month: Track trend
  • Repeat incidents: Should decrease over time
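
If incident timestamps are exported to a CSV (a hypothetical incidents.csv with epoch-second columns alert_ts,ack_ts,resolve_ts), the averages fall out of a short awk pass:

# MTTA and MTTR in minutes from a hypothetical incidents.csv
# with header: alert_ts,ack_ts,resolve_ts (epoch seconds)
awk -F, 'NR > 1 { mtta += $2 - $1; mttr += $3 - $1; n++ }
         END { if (n) printf "MTTA: %.1f min  MTTR: %.1f min\n", mtta/n/60, mttr/n/60 }' incidents.csv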

Prevention > Response

Goal: Reduce incidents through proactive measures

  1. Monitoring & Alerts: Catch issues before customers do
  2. Automated remediation: Auto-restart failed services
  3. Chaos engineering: Test failure scenarios in staging
  4. Postmortem action items: Fix root causes
  5. Runbook automation: Turn manual steps into playbooks
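
For item 2, the simplest automated remediation is letting systemd restart a crashed service on its own. A sketch of a drop-in override (the service name is an assumption):

# Add a restart policy to the api service via a systemd drop-in
sudo mkdir -p /etc/systemd/system/api.service.d
sudo tee /etc/systemd/system/api.service.d/restart.conf > /dev/null <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
sudo systemctl daemon-reload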