# Incident Response Runbook

## Step-by-Step Response Process

### Phase 1: Detection & Acknowledgment (0-5 minutes)
1. Alert arrives in the Rocket.Chat #incidents channel.
2. On-call engineer:
   1. Acknowledge in channel: `@oncall acknowledging - investigating`
   2. Create an incident thread: `/incident create "Brief description of issue"`. This creates a dedicated channel: `#incident-YYYY-MM-DD-NNN`
   3. Determine severity (P0/P1/P2/P3).
   4. If P0, immediately notify:
      - Tag @leadership in the incident channel
      - Escalate to the DevOps lead
      - Consider external communication
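If the incident bot or slash command is unavailable, the acknowledgment can also be posted through Rocket.Chat's REST API. A minimal sketch, assuming a personal access token and user ID exported as environment variables; the channel name matches the one above, everything else is a placeholder:

```bash
# Post an acknowledgment to #incidents via the Rocket.Chat REST API.
# ROCKETCHAT_URL, ROCKETCHAT_TOKEN, and ROCKETCHAT_USER_ID are assumed to be
# set in the environment (e.g. exported from a local secrets file).
curl -s -X POST "${ROCKETCHAT_URL}/api/v1/chat.postMessage" \
  -H "X-Auth-Token: ${ROCKETCHAT_TOKEN}" \
  -H "X-User-Id: ${ROCKETCHAT_USER_ID}" \
  -H "Content-Type: application/json" \
  -d '{"channel": "#incidents", "text": "@oncall acknowledging - investigating"}'
```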
### Phase 2: Initial Investigation (5-15 minutes)

Gather information:

```bash
# Check recent deployments
# In the incident channel, the bot shows recent deploys

# Check monitoring dashboards
# - CPU, memory, disk usage
# - Error rates
# - Response times

# Check logs
# Access via your log aggregation tool

# Check DigitalOcean status
# https://status.digitalocean.com
```
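For a quick first pass from a terminal, the checks above can be scripted. A minimal sketch, assuming SSH access over the Netbird VPN, systemd-managed services, and the hostname/service names used elsewhere in this runbook (adjust to your environment):

```bash
# Recent errors from the API service logs (assumes systemd/journald)
ssh prod-api-01 'journalctl -u api --since "15 minutes ago" --no-pager | grep -iE "error|timeout" | tail -n 50'

# Resource pressure on the host
ssh prod-api-01 'uptime && free -m && df -h /'

# DigitalOcean status (Statuspage-hosted pages expose a JSON summary)
curl -s https://status.digitalocean.com/api/v2/status.json | jq -r '.status.description'
```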
Update the incident channel with findings:

```
10:05 - API error rate spiked to 25% starting 09:58
10:06 - Checking recent changes... deployment at 09:55
10:08 - Logs show database connection timeouts
```
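To put a number on an error-rate spike like the one above, the rate can be computed directly from access logs. A rough sketch, run on the affected host, assuming nginx-style access logs at a placeholder path with the HTTP status code in field 9:

```bash
# Share of 5xx responses among the last 10,000 requests; field 9 is the HTTP
# status code in nginx's default "combined" log format (path is a placeholder).
tail -n 10000 /var/log/nginx/access.log \
  | awk '{ total++; if ($9 ~ /^5/) errors++ }
         END { if (total) printf "%.1f%% errors (%d/%d)\n", 100*errors/total, errors, total }'
```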
### Phase 3: Mitigation (15-30 minutes)

Goal: stop the bleeding and restore service (not necessarily fix the root cause).
Common mitigation playbooks:
```bash
# Rollback recent deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=previous" \
  -i inventory/production

# Restart service
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=api host=prod-api-01" \
  -i inventory/production

# Scale up resources
ansible-playbook playbooks/emergency/scale-up.yml \
  -e "service=api replicas=3" \
  -i inventory/production

# Enable maintenance mode
ansible-playbook playbooks/emergency/maintenance-mode.yml \
  -e "enable=true" \
  -i inventory/production
```
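When the situation allows a minute of caution, the same playbooks can be exercised in check mode first to confirm they target what you expect; `--check`, `--diff`, and `--limit` are standard ansible-playbook flags, and the host pattern below is a placeholder:

```bash
# Dry run: show what the rollback would change on a single host,
# without applying anything.
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=previous" \
  -i inventory/production \
  --limit prod-api-01 \
  --check --diff
```

For a P0 where every minute counts, skip the dry run and go straight to the rollback.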
Update every 10 minutes:
```
10:15 - Initiating rollback to v1.2.3
10:20 - Rollback complete, monitoring error rates
10:25 - Error rate decreasing, now at 5%
10:30 - Service stabilized, error rate normal
```
### Phase 4: Communication
Internal (Rocket.Chat):
- Update incident channel every 10 minutes
- Tag relevant teams: @backend @frontend @devops
- Keep updates brief and factual
External (if customer-facing):
- Update status page (if you have one)
- Notify #customer-success team
- Consider email to affected customers
Template - Internal Update:
```
⏰ 10:15 UPDATE
Status: INVESTIGATING / MITIGATING / RESOLVED
Impact: [what's affected, how many users]
Actions: [what we're doing]
ETA: [estimated resolution time]
Next update: 10:25
```
Template - External Update:
```
We're experiencing issues with [service].
Impact: [what's affected]
Status: [investigating/fixing/monitoring]
Workaround: [if available]
ETA: [if known]
Updates every 15 minutes at status.company.com
```
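If the status page happens to be Atlassian Statuspage, the external update can also be posted via its REST API rather than the web UI. A hedged sketch only: the page ID, API key, and incident wording are placeholders, and the endpoint shown is Statuspage's standard incident-creation call:

```bash
# Create a status page incident (Atlassian Statuspage REST API).
# STATUSPAGE_PAGE_ID and STATUSPAGE_API_KEY are assumed environment variables.
curl -s -X POST "https://api.statuspage.io/v1/pages/${STATUSPAGE_PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${STATUSPAGE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "incident": {
          "name": "Elevated API error rates",
          "status": "investigating",
          "body": "We are investigating elevated error rates on the API. Next update in 15 minutes."
        }
      }'
```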
### Phase 5: Resolution
Service restored:
```
✅ RESOLVED - 10:45

Timeline:
- 09:58: Issue detected
- 10:02: Incident declared
- 10:20: Mitigation applied
- 10:45: Service fully restored

Root cause: Database connection pool exhaustion
Fix applied: Rolled back deployment, increased pool size
Monitoring: Added alerts for pool utilization

Next steps:
- Post-incident review: Tomorrow 2pm
- Action items: Track in GitHub issue
```
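Before posting the RESOLVED update and closing the channel, confirm the service really has recovered rather than merely quieted down. A minimal sketch, assuming a hypothetical `/health` endpoint on the API; swap in whatever health check or dashboard query you actually use:

```bash
# Poll a (hypothetical) health endpoint once a minute for 10 minutes;
# any non-200 response is flagged so a relapse is caught before closing.
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
  echo "$(date +%H:%M:%S) health check returned ${code}"
  [ "${code}" != "200" ] && echo "WARNING: service may not be fully recovered"
  sleep 60
done
```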
Close incident:
```
/incident close "Resolved by rollback + pool scaling"
```
### Phase 6: Post-Incident Review (Within 48 hours)

1. Schedule a meeting with relevant team members
2. Use the template: Post-Incident Review Template
3. Publish in the Docusaurus blog section for team learning
## Communication Matrix
| Severity | Rocket.Chat | Status Page | Customers | Leadership |
|---|---|---|---|---|
| P0 | Immediate | Yes | Yes | Immediately |
| P1 | Immediate | If >1hr | If customer-facing | If >30min |
| P2 | When fixed | No | No | Daily summary |
| P3 | When fixed | No | No | No |
## Decision Trees

### Is This an Incident?

```
Service impaired?    → Yes → Incident
                     → No  → Maybe monitoring issue
Customer impact?     → Yes → P0 or P1
                     → No  → P2 or P3
Security related?    → Yes → Also alert #security
                     → No  → Continue normal process
```
### Should I Rollback?

```
Recent deployment?        → Yes → Rollback
                          → No  → Investigate other causes
Error rate >20%?          → Yes → Rollback
                          → No  → Try other mitigation
Can reproduce in staging? → Yes → Fix in code first
                          → No  → Rollback to be safe
```
### Should I Escalate?

```
Stuck >30 min?        → Yes → Escalate to lead
                      → No  → Continue
Security breach?      → Yes → Escalate immediately
                      → No  → Continue
Data loss risk?       → Yes → Escalate + get approval before action
                      → No  → Continue
Customer outage >1hr? → Yes → Escalate + external comms
                      → No  → Continue
```
## Tools Quick Reference

- Monitoring: [Link to Grafana/monitoring]
- Logs: [Link to log system]
- Recent Deployments: Check CI/CD history
- Server Access: Via Netbird VPN + SSH
- Emergency Playbooks: `infrastructure/playbooks/emergency/`
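Connecting for the first time during an incident can cost precious minutes, so keep the VPN-plus-SSH sequence handy. A minimal sketch using Netbird's standard CLI commands (`netbird up`, `netbird status`); the hostname matches the examples above and is otherwise a placeholder:

```bash
# Bring up the Netbird VPN tunnel (no-op if already connected) and confirm peers.
netbird up
netbird status

# SSH to the affected host once the tunnel is up.
ssh prod-api-01
```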
## Emergency Contacts

Rocket.Chat:
- `@oncall` - Current on-call engineer
- `@devops-lead` - DevOps team lead
- `@security-team` - Security incidents
- `@leadership` - CEO, CTO (P0 only)

Phone: On-call phone number in the #devops channel topic
## Post-Incident Checklist

After resolving an incident:
- Incident marked as resolved
- Timeline documented
- Root cause identified
- Post-incident review scheduled (within 48 hours)
- Action items created in GitHub
- Customers notified (if applicable)
- Monitoring/alerts updated
- Runbook updated if needed
- Team debriefed
## Learning from Incidents
Every incident is a learning opportunity:
- What worked well?
- What didn't work?
- What would you do differently?
- How can we prevent this?
- Document the incident in a postmortem and share it with the team
- Track action items and actually complete them
- Update runbooks based on what you learned