
Incident Response Runbook

Step-by-Step Response Process

Phase 1: Detection & Acknowledgment (0-5 minutes)

Alert arrives in Rocket.Chat #incidents channel

On-call engineer:

  1. Acknowledge in channel:

    @oncall acknowledging - investigating
  2. Create incident thread:

    /incident create "Brief description of issue"

    This creates a dedicated channel: #incident-YYYY-MM-DD-NNN

  3. Determine severity (P0/P1/P2/P3)

  4. If P0, immediately notify:

    • Tag @leadership in incident channel
    • Escalate to DevOps lead
    • Consider external communication

Phase 2: Initial Investigation (5-15 minutes)

Gather information:

# Check recent deployments
# In incident channel, bot shows recent deploys

# Check monitoring dashboards
# - CPU, memory, disk usage
# - Error rates
# - Response times

# Check logs
# Access via your log aggregation tool

# Check DigitalOcean status
# https://status.digitalocean.com
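
If the dashboards are inconclusive, a few hands-on checks can fill the gaps. The commands below are a minimal sketch, assuming systemd-managed services and SSH access to the production hosts (prod-api-01 is reused from the mitigation examples; the status URL is DigitalOcean's standard Statuspage JSON endpoint). Adapt service names and hosts to your setup.

# Resource usage on a suspect host
ssh prod-api-01 'uptime && free -m && df -h'

# Recent application logs (assumes the api service runs under systemd)
ssh prod-api-01 'journalctl -u api --since "15 minutes ago" --no-pager | tail -n 100'

# DigitalOcean status summary (Statuspage JSON endpoint)
curl -s https://status.digitalocean.com/api/v2/status.json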

Update incident channel with findings:

10:05 - API error rate spiked to 25% starting 09:58
10:06 - Checking recent changes... deployment at 09:55
10:08 - Logs show database connection timeouts

Phase 3: Mitigation (15-30 minutes)

Goal: stop the bleeding and restore service (not necessarily fix the root cause)

Common mitigation playbooks:

# Rollback recent deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=previous" \
  -i inventory/production

# Restart service
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=api host=prod-api-01" \
  -i inventory/production

# Scale up resources
ansible-playbook playbooks/emergency/scale-up.yml \
  -e "service=api replicas=3" \
  -i inventory/production

# Enable maintenance mode
ansible-playbook playbooks/emergency/maintenance-mode.yml \
  -e "enable=true" \
  -i inventory/production
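
When time allows, a dry run first is a reasonable precaution. The sketch below uses the standard ansible-playbook --check and --diff flags; whether the emergency playbooks fully support check mode depends on the modules they use.

# Preview the rollback without changing anything (check mode)
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=previous" \
  -i inventory/production \
  --check --diff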

Update every 10 minutes:

10:15 - Initiating rollback to v1.2.3
10:20 - Rollback complete, monitoring error rates
10:25 - Error rate decreasing, now at 5%
10:30 - Service stabilized, error rate normal
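
To confirm recovery between updates, a quick loop against a health endpoint is often enough. This is a rough sketch assuming a hypothetical https://api.example.com/health endpoint; substitute whatever your monitoring stack or load balancer actually exposes.

# Poll a health endpoint every 30 seconds and log status code and latency.
# The endpoint below is a placeholder, not a real production URL.
while true; do
  printf '%s ' "$(date +%H:%M:%S)"
  curl -s -o /dev/null -w 'HTTP %{http_code}, %{time_total}s\n' https://api.example.com/health
  sleep 30
done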

Phase 4: Communication

Internal (Rocket.Chat):

  • Update incident channel every 10 minutes
  • Tag relevant teams: @backend @frontend @devops
  • Keep updates brief and factual

External (if customer-facing):

  • Update status page (if you have one)
  • Notify #customer-success team
  • Consider email to affected customers

Template - Internal Update:

⏰ 10:15 UPDATE

Status: INVESTIGATING / MITIGATING / RESOLVED
Impact: [what's affected, how many users]
Actions: [what we're doing]
ETA: [estimated resolution time]
Next update: 10:25

Template - External Update:

We're experiencing issues with [service].

Impact: [what's affected]
Status: [investigating/fixing/monitoring]
Workaround: [if available]
ETA: [if known]

Updates every 15 minutes at status.company.com

Phase 5: Resolution

Service restored:

✅ RESOLVED - 10:45

Timeline:
- 09:58: Issue detected
- 10:02: Incident declared
- 10:20: Mitigation applied
- 10:45: Service fully restored

Root cause: Database connection pool exhaustion
Fix applied: Rolled back deployment, increased pool size
Monitoring: Added alerts for pool utilization

Next steps:
- Post-incident review: Tomorrow 2pm
- Action items: Track in GitHub issue

Close incident:

/incident close "Resolved by rollback + pool scaling"

Phase 6: Post-Incident Review (Within 48 hours)

  • Schedule a meeting with the relevant team members
  • Use the Post-Incident Review Template
  • Publish the review in the Docusaurus blog section for team learning (sketch below)
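
A minimal sketch of the publishing step, assuming the default Docusaurus blog plugin layout (a blog/ directory with Markdown front matter). The file name and date are placeholders.

# Create the review post in the docs repo (name and date are placeholders)
cat > blog/2024-01-15-incident-001-postmortem.md <<'EOF'
---
title: "Post-Incident Review: API error rate spike"
tags: [incident, postmortem]
---

Summary, timeline, root cause, and action items,
following the Post-Incident Review Template.
EOF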

Communication Matrix

Severity | Rocket.Chat | Email   | Status Page        | Leadership
P0       | Immediate   | Yes     | Yes                | Immediately
P1       | Immediate   | If >1hr | If customer-facing | If >30min
P2       | When fixed  | No      | No                 | Daily summary
P3       | When fixed  | No      | No                 | No

Decision Trees

Is This an Incident?

Service impaired? → Yes → Incident
→ No → Possibly a monitoring issue, not an incident

Customer impact? → Yes → P0 or P1
→ No → P2 or P3

Security related? → Yes → Also alert #security
→ No → Continue normal process

Should I Rollback?

Recent deployment? → Yes → Rollback
→ No → Investigate other causes

Error rate >20%? → Yes → Rollback
→ No → Try other mitigation

Can reproduce in staging? → Yes → Fix in code first
→ No → Rollback to be safe

Should I Escalate?

Stuck >30 min? → Yes → Escalate to lead
→ No → Continue

Security breach? → Yes → Escalate immediately
→ No → Continue

Data loss risk? → Yes → Escalate + get approval before action
→ No → Continue

Customer outage >1hr? → Yes → Escalate + external comms
→ No → Continue

Tools Quick Reference

  • Monitoring: [Link to Grafana/monitoring]
  • Logs: [Link to log system]
  • Recent Deployments: Check CI/CD history
  • Server Access: Via Netbird VPN + SSH (see the sketch below)
  • Emergency Playbooks: infrastructure/playbooks/emergency/
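
For server access, a minimal sketch assuming the NetBird client is already installed and enrolled, and that prod-api-01 (from the mitigation examples) resolves over the VPN:

netbird up        # connect to the NetBird network
netbird status    # confirm peers are reachable
ssh prod-api-01   # then SSH to the affected host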

Emergency Contacts

Rocket.Chat:

  • @oncall - Current on-call engineer
  • @devops-lead - DevOps team lead
  • @security-team - Security incidents
  • @leadership - CEO, CTO (P0 only)

Phone: On-call phone number in #devops channel topic

Post-Incident Checklist

After resolving incident:

  • Incident marked as resolved
  • Timeline documented
  • Root cause identified
  • Post-incident review scheduled (within 48 hours)
  • Action items created in GitHub
  • Customers notified (if applicable)
  • Monitoring/alerts updated
  • Runbook updated if needed
  • Team debriefed

Learning from Incidents

Every incident is a learning opportunity:

  • What worked well?
  • What didn't work?
  • What would you do differently?
  • How can we prevent this?

Document findings in the postmortem and share them with the team

Track action items and actually complete them

Update runbooks based on what you learned