Post-Incident Review Template

Incident Information

Incident ID: INC-YYYY-MM-DD-NNN
Date: YYYY-MM-DD
Duration: HH:MM (from detection to resolution)
Severity: P0 / P1 / P2 / P3
Services Affected: List services
Customer Impact: Yes / No - Brief description

Executive Summary

A brief (2-3 sentence) description of what happened and its impact.

Example:

On 2025-11-10 at 09:58, the API service experienced a 47-minute outage affecting 80% of users. The root cause was database connection pool exhaustion triggered by a deployment that increased concurrent requests. Service was restored by rolling back the deployment and increasing the connection pool size.

Timeline

All times in one consistent, clearly stated timezone (e.g., EST)

| Time  | Event |
| ----- | ----- |
| 09:55 | Deployment of API v1.3.0 to production |
| 09:58 | Monitoring alerts for API error rate >20% |
| 10:02 | Incident declared, @oncall acknowledged |
| 10:05 | Investigation started, checked logs |
| 10:10 | Root cause identified: DB connection pool exhausted |
| 10:15 | Decision to rollback deployment |
| 10:20 | Rollback completed |
| 10:25 | Error rates improving |
| 10:35 | DB connection pool size increased |
| 10:45 | Service fully restored, incident closed |

Root Cause

Technical explanation of what went wrong:

The API v1.3.0 deployment introduced a new feature that made additional database queries per request. Under normal load, this increased concurrent DB connections from ~15 to ~25. The connection pool was configured for max 20 connections. When connections were exhausted, new requests failed with timeout errors.
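
For readers unfamiliar with the failure mode, here is a minimal sketch of the pool limits involved, assuming a SQLAlchemy-backed service (the DSN, parameter names, and values are illustrative; the incident's actual stack may differ):

```python
from sqlalchemy import create_engine

# Pre-incident configuration (hypothetical values mirroring the numbers
# above): a hard cap of 20 pooled connections with no overflow. Once
# v1.3.0 pushed steady-state demand to ~25 concurrent connections,
# requests queued on the pool and failed after pool_timeout seconds.
engine = create_engine(
    "postgresql://user:pass@db-host/app",  # placeholder DSN
    pool_size=20,       # the exhausted limit
    max_overflow=0,     # no burst connections beyond pool_size
    pool_timeout=30,    # seconds a request waits before timing out
)

# Remediation applied during the incident: raise the cap above the
# observed peak (~25) and allow temporary overflow for bursts.
engine = create_engine(
    "postgresql://user:pass@db-host/app",
    pool_size=30,
    max_overflow=10,
    pool_timeout=30,
)
```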

Contributing factors:

  • Connection pool size not adjusted for new feature
  • Load testing didn't simulate realistic concurrent users
  • No alerting on DB connection pool utilization (a monitoring sketch follows this list)
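
The third gap is the cheapest to close. Below is a sketch of exporting pool utilization as a scrapeable metric, assuming prometheus_client and a SQLAlchemy QueuePool (both are assumptions, not the service's confirmed stack):

```python
from prometheus_client import Gauge, start_http_server
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@db-host/app", pool_size=20)

# Export the number of checked-out connections so an alert can fire
# before exhaustion (e.g., utilization > 80% of pool_size for 5 minutes).
pool_in_use = Gauge(
    "db_pool_connections_in_use",
    "Connections currently checked out of the pool",
)
pool_in_use.set_function(lambda: engine.pool.checkedout())

start_http_server(9100)  # expose /metrics as a Prometheus scrape target
```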

Impact

Users affected: ~80% of active users

Duration: 47 minutes (09:58 - 10:45)

Business impact:

  • 1,200 failed requests
  • 15 customer support tickets
  • Estimated revenue impact: $XXX (if applicable)

Data impact: No data loss or corruption

What Went Well

  • Monitoring detected issue within 3 minutes
  • On-call engineer acknowledged quickly (4 min)
  • Rollback playbook worked smoothly
  • Team communication was clear and frequent
  • No data loss occurred

What Didn't Go Well

  • Load testing didn't catch the issue
  • No monitoring on DB connection pool usage
  • Deployment during peak hours
  • Root cause took 12 minutes to identify (09:58 detection to 10:10 diagnosis)

Action Items

| Action | Owner | Due Date | Status |
| ------ | ----- | -------- | ------ |
| Add DB connection pool monitoring + alerts | @devops-john | 2025-11-15 | ⏳ In Progress |
| Update load testing to simulate concurrent users | @qa-jane | 2025-11-20 | 📋 To Do |
| Create runbook for DB connection issues | @backend-tom | 2025-11-17 | 📋 To Do |
| Review deployment schedule (avoid peak hours) | @devops-lead | 2025-11-12 | 📋 To Do |
| Increase DB connection pool size for all envs | @devops-john | 2025-11-10 | ✅ Done |

Lessons Learned

  1. Connection pool sizing should be reviewed when adding features that increase DB usage
  2. Load testing must include concurrent user simulation, not just throughput (see the sketch after this list)
  3. Monitoring gaps exist - we need visibility into DB connection pool usage
  4. Rollback playbooks work well - continue investing in emergency automation
  5. Deployment timing matters - consider restricting deployments during peak hours
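
On lesson 2, here is a minimal sketch of what concurrent user simulation could look like, assuming Locust as the load-testing tool (the endpoint, host, and user counts are illustrative):

```python
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Each simulated user holds its own session and pauses between
    # requests, which exercises concurrent DB connections the way real
    # traffic does, unlike a single-threaded throughput test.
    wait_time = between(1, 3)

    @task
    def browse(self):
        self.client.get("/api/items")  # hypothetical endpoint

# Run with enough users to exceed the pool limit seen in the incident:
#   locust -f loadtest.py --headless -u 50 -r 5 --host https://staging.example.com
```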

Process Improvements

What we'll change:

  • Add "DB impact review" to PR checklist for backend changes
  • Update load testing methodology
  • Add connection pool monitoring to all environments
  • Document deployment windows (safe vs. risky times)

Have we seen this before?

  • Similar issue in 2024-09: DB timeouts (different root cause)
  • Connection pool has been an issue before (see INC-2024-09-15)

Trend: DB connection management is a recurring theme

Appendix

Relevant links:

  • Incident channel: #incident-2025-11-10-001
  • GitHub PR that caused issue: #123
  • Monitoring dashboard: [link]
  • Deployment log: [link]

Additional notes:

  • Thanks to @backend-tom for quick debugging
  • Customer communication handled well by @support-lead

Review completed: YYYY-MM-DD
Attendees: List names
Published: Link to blog post (for team visibility)

Template Usage Notes

When to complete: Within 48 hours of incident resolution

Who leads: Incident commander (on-call engineer who handled it)

Who attends:

  • Incident commander
  • Engineers involved in resolution
  • Relevant team leads
  • Optional: Product/business stakeholders

Format:

  • Schedule 30-60 min meeting
  • Use this template as agenda
  • Focus on learning, not blaming
  • Document in writing, share with team

Blameless culture:

  • No finger-pointing
  • Focus on systems and processes
  • "What went wrong" not "who messed up"
  • Assume good intent
  • Learn and improve

Publishing:

  • Share in Docusaurus blog section
  • Tag with category: "Incident Reports"
  • Make it searchable for future reference
  • Celebrate quick recovery and learning