# Post-Incident Review Template
## Incident Information

Incident ID: INC-YYYY-MM-DD-NNN
Date: YYYY-MM-DD
Duration: HH:MM (from detection to resolution)
Severity: P0 / P1 / P2 / P3
Services Affected: List services
Customer Impact: Yes / No - Brief description
## Executive Summary
Brief (2-3 sentences) description of what happened and impact.
Example:
On 2025-11-10 at 09:58, the API service experienced a 47-minute outage affecting 80% of users. The root cause was database connection pool exhaustion, triggered by a deployment that increased concurrent requests. Service was restored by rolling back the deployment and increasing the connection pool size.
## Timeline
All times in your local timezone (e.g., EST)
| Time | Event |
|---|---|
| 09:55 | Deployment of API v1.3.0 to production |
| 09:58 | Monitoring alerts for API error rate >20% |
| 10:02 | Incident declared, @oncall acknowledged |
| 10:05 | Investigation started, checked logs |
| 10:10 | Root cause identified: DB connection pool exhausted |
| 10:15 | Decision made to roll back the deployment |
| 10:20 | Rollback completed |
| 10:25 | Error rates improving |
| 10:35 | DB connection pool size increased |
| 10:45 | Service fully restored, incident closed |
## Root Cause
Technical explanation of what went wrong:
The API v1.3.0 deployment introduced a new feature that issued additional database queries per request. Under normal load, this raised concurrent DB connections from ~15 to ~25, while the connection pool was configured for a maximum of 20. Once the pool was exhausted, new requests failed with timeout errors.
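A minimal sketch of the sizing involved, assuming a SQLAlchemy-style connection pool (the service may use a different client; the connection string and numbers below are illustrative):

```python
from sqlalchemy import create_engine

# Illustrative only: the connection string and numbers are placeholders.
# Before the incident the pool allowed at most 20 connections while the
# new feature pushed steady-state usage to ~25, so checkouts timed out.
engine = create_engine(
    "postgresql://app@db-host/app",
    pool_size=30,      # raised above the observed ~25-connection peak
    max_overflow=10,   # temporary burst headroom beyond pool_size
    pool_timeout=30,   # seconds a request waits for a connection before failing
)
```

The key point is that pool capacity has to sit above observed peak concurrency with some headroom, and any feature that adds per-request queries shifts that peak.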
Contributing factors:
- Connection pool size not adjusted for new feature
- Load testing didn't simulate realistic concurrent users
- No alerting on DB connection pool utilization
## Impact
Users affected: ~80% of active users
Duration: 47 minutes (09:58 - 10:45)
Business impact:
- 1,200 failed requests
- 15 customer support tickets
- Estimated revenue impact: $XXX (if applicable)
Data impact: No data loss or corruption
## What Went Well
- Monitoring detected issue within 3 minutes
- On-call engineer acknowledged quickly (4 min)
- Rollback playbook worked smoothly
- Team communication was clear and frequent
- No data loss occurred
## What Didn't Go Well
- Load testing didn't catch the issue
- No monitoring on DB connection pool usage
- Deployment during peak hours
- Root cause took 12 minutes to identify (first alert 09:58, identified 10:10)
## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Add DB connection pool monitoring + alerts (see sketch below) | @devops-john | 2025-11-15 | ⏳ In Progress |
| Update load testing to simulate concurrent users | @qa-jane | 2025-11-20 | 📋 To Do |
| Create runbook for DB connection issues | @backend-tom | 2025-11-17 | 📋 To Do |
| Review deployment schedule (avoid peak hours) | @devops-lead | 2025-11-12 | 📋 To Do |
| Increase DB connection pool size for all envs | @devops-john | 2025-11-10 | ✅ Done |
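As a starting point for the monitoring action item above, a hedged sketch of a pool-utilization gauge, assuming a SQLAlchemy engine and the prometheus_client library (metric names and thresholds are placeholders, not our actual conventions):

```python
from prometheus_client import Gauge

# Hypothetical metric names; align with the team's existing naming scheme.
POOL_IN_USE = Gauge("db_pool_connections_in_use", "DB connections currently checked out")
POOL_MAX = Gauge("db_pool_connections_max", "Configured pool size (excluding overflow)")

def export_pool_metrics(engine):
    """Sample the SQLAlchemy pool; call periodically or from a scrape hook."""
    pool = engine.pool
    POOL_IN_USE.set(pool.checkedout())
    POOL_MAX.set(pool.size())
```

An alert could then fire when in-use connections stay above, say, 80% of the configured maximum for several minutes, giving warning before checkouts start timing out.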
## Lessons Learned
- Connection pool sizing should be reviewed when adding features that increase DB usage
- Load testing must include concurrent user simulation, not just throughput (see the sketch after this list)
- Monitoring gaps exist - we need visibility into DB connection pool usage
- Rollback playbooks work well - continue investing in emergency automation
- Deployment timing matters - consider restricting deployments during peak hours
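To make the load-testing lesson concrete, a minimal sketch of concurrent user simulation, assuming an HTTP API, the requests library, and a placeholder staging URL and endpoint (a dedicated tool such as Locust or k6 would normally do this job):

```python
import concurrent.futures

import requests

BASE_URL = "https://staging.example.com"  # placeholder; point at a non-production environment
CONCURRENT_USERS = 50                     # requests held in flight simultaneously
TOTAL_REQUESTS = 500

def one_request(_):
    # Hypothetical endpoint; swap in the routes the new feature touches.
    try:
        resp = requests.get(f"{BASE_URL}/api/example", timeout=10)
        return resp.status_code
    except requests.RequestException:
        return 599  # count timeouts and connection errors as failures

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    statuses = list(pool.map(one_request, range(TOTAL_REQUESTS)))

errors = sum(1 for status in statuses if status >= 500)
print(f"{errors}/{len(statuses)} requests failed")
```

Holding many requests in flight at once is what exercises the connection pool; a sequential test at the same total request count would not have reproduced the exhaustion.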
## Process Improvements
What we'll change:
- Add "DB impact review" to PR checklist for backend changes
- Update load testing methodology
- Add connection pool monitoring to all environments
- Document deployment windows (safe vs. risky times)
## Related Incidents
Have we seen this before?
- Similar issue in 2024-09: DB timeouts (different root cause)
- Connection pool has been an issue before (see INC-2024-09-15)
Trend: DB connection management is a recurring theme
## Appendix
Relevant links:
- Incident channel: #incident-2025-11-10-001
- GitHub PR that caused issue: #123
- Monitoring dashboard: [link]
- Deployment log: [link]
Additional notes:
- Thanks to @backend-tom for quick debugging
- Customer communication handled well by @support-lead
Review completed: YYYY-MM-DD
Attendees: List names
Published: Link to blog post (for team visibility)
## Template Usage Notes
When to complete: Within 48 hours of incident resolution
Who leads: Incident commander (on-call engineer who handled it)
Who attends:
- Incident commander
- Engineers involved in resolution
- Relevant team leads
- Optional: Product/business stakeholders
Format:
- Schedule 30-60 min meeting
- Use this template as agenda
- Focus on learning, not blaming
- Document in writing, share with team
Blameless culture:
- No finger-pointing
- Focus on systems and processes
- "What went wrong" not "who messed up"
- Assume good intent
- Learn and improve
Publishing:
- Share in Docusaurus blog section
- Tag with category: "Incident Reports"
- Make it searchable for future reference
- Celebrate quick recovery and learning