Post-Incident Review Template

Incident Information

Incident ID: INC-YYYY-MM-DD-NNN
Date: YYYY-MM-DD
Duration: HH:MM (from detection to resolution)
Severity: P0 / P1 / P2 / P3
Services Affected: List services
Customer Impact: Yes / No - Brief description

Executive Summary

A brief (2-3 sentence) description of what happened and its impact.

Example:

On 2025-11-10 at 09:58, the API service experienced a 47-minute outage affecting 80% of users. The root cause was database connection pool exhaustion triggered by a deployment that increased concurrent requests. Service was restored by rolling back the deployment and increasing the connection pool size.

Timeline

All times in one consistent, clearly stated timezone (e.g., EST)

| Time  | Event |
| ----- | ----- |
| 09:55 | Deployment of API v1.3.0 to production |
| 09:58 | Monitoring alerts for API error rate >20% |
| 10:02 | Incident declared, @oncall acknowledged |
| 10:05 | Investigation started, checked logs |
| 10:10 | Root cause identified: DB connection pool exhausted |
| 10:15 | Decision to rollback deployment |
| 10:20 | Rollback completed |
| 10:25 | Error rates improving |
| 10:35 | DB connection pool size increased |
| 10:45 | Service fully restored, incident closed |

Root Cause

Technical explanation of what went wrong:

The API v1.3.0 deployment introduced a new feature that made additional database queries per request. Under normal load, this increased concurrent DB connections from ~15 to ~25. The connection pool was configured for max 20 connections. When connections were exhausted, new requests failed with timeout errors.
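
For readers unfamiliar with the failure mode, here is a minimal sketch of the pool limits involved, assuming a SQLAlchemy-backed service (the DSN, parameter names, and values are illustrative; the incident's actual stack may differ):

```python
from sqlalchemy import create_engine

# Pre-incident configuration (hypothetical values mirroring the numbers
# above): a hard cap of 20 pooled connections with no overflow. Once
# v1.3.0 pushed steady-state demand to ~25 concurrent connections,
# requests queued on the pool and failed after pool_timeout seconds.
engine = create_engine(
    "postgresql://user:pass@db-host/app",  # placeholder DSN
    pool_size=20,       # the exhausted limit
    max_overflow=0,     # no burst connections beyond pool_size
    pool_timeout=30,    # seconds a request waits before timing out
)

# Remediation applied during the incident: raise the cap above the
# observed peak (~25) and allow temporary overflow for bursts.
engine = create_engine(
    "postgresql://user:pass@db-host/app",
    pool_size=30,
    max_overflow=10,
    pool_timeout=30,
)
```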

Contributing factors:

  • Connection pool size not adjusted for new feature
  • Load testing didn't simulate realistic concurrent users
  • No alerting on DB connection pool utilization (a monitoring sketch follows this list)
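
The third gap is the cheapest to close. Below is a sketch of exporting pool utilization as a scrapeable metric, assuming prometheus_client and a SQLAlchemy QueuePool (both are assumptions, not the service's confirmed stack):

```python
from prometheus_client import Gauge, start_http_server
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@db-host/app", pool_size=20)

# Export the number of checked-out connections so an alert can fire
# before exhaustion (e.g., utilization > 80% of pool_size for 5 minutes).
pool_in_use = Gauge(
    "db_pool_connections_in_use",
    "Connections currently checked out of the pool",
)
pool_in_use.set_function(lambda: engine.pool.checkedout())

start_http_server(9100)  # expose /metrics as a Prometheus scrape target
```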

Impact

Users affected: ~80% of active users

Duration: 47 minutes (09:58 - 10:45)

Business impact:

  • 1,200 failed requests
  • 15 customer support tickets
  • Estimated revenue impact: $XXX (if applicable)

Data impact: No data loss or corruption

What Went Well

  • Monitoring detected issue within 3 minutes
  • On-call engineer acknowledged quickly (4 min)
  • Rollback playbook worked smoothly
  • Team communication was clear and frequent
  • No data loss occurred

What Didn't Go Well

  • Load testing didn't catch the issue
  • No monitoring on DB connection pool usage
  • Deployment during peak hours
  • Root cause took 12 minutes to identify (09:58 detection to 10:10 diagnosis)

Action Items

| Action | Owner | Due Date | Status |
| ------ | ----- | -------- | ------ |
| Add DB connection pool monitoring + alerts | @devops-john | 2025-11-15 | ⏳ In Progress |
| Update load testing to simulate concurrent users | @qa-jane | 2025-11-20 | 📋 To Do |
| Create runbook for DB connection issues | @backend-tom | 2025-11-17 | 📋 To Do |
| Review deployment schedule (avoid peak hours) | @devops-lead | 2025-11-12 | 📋 To Do |
| Increase DB connection pool size for all envs | @devops-john | 2025-11-10 | ✅ Done |

Lessons Learned

  1. Connection pool sizing should be reviewed when adding features that increase DB usage
  2. Load testing must include concurrent user simulation, not just throughput (see the sketch after this list)
  3. Monitoring gaps exist - we need visibility into DB connection pool usage
  4. Rollback playbooks work well - continue investing in emergency automation
  5. Deployment timing matters - consider restricting deployments during peak hours
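
On lesson 2, here is a minimal sketch of what concurrent user simulation could look like, assuming Locust as the load-testing tool (the endpoint, host, and user counts are illustrative):

```python
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Each simulated user holds its own session and pauses between
    # requests, which exercises concurrent DB connections the way real
    # traffic does, unlike a single-threaded throughput test.
    wait_time = between(1, 3)

    @task
    def browse(self):
        self.client.get("/api/items")  # hypothetical endpoint

# Run with enough users to exceed the pool limit seen in the incident:
#   locust -f loadtest.py --headless -u 50 -r 5 --host https://staging.example.com
```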

Process Improvements

What we'll change:

  • Add "DB impact review" to PR checklist for backend changes
  • Update load testing methodology
  • Add connection pool monitoring to all environments
  • Document deployment windows (safe vs. risky times)

Have we seen this before?

  • Similar issue in 2024-09: DB timeouts (different root cause)
  • Connection pool has been an issue before (see INC-2024-09-15)

Trend: DB connection management is a recurring theme

Appendix

Relevant links:

  • Incident channel: #incident-2025-11-10-001
  • GitHub PR that caused issue: #123
  • Monitoring dashboard: [link]
  • Deployment log: [link]

Additional notes:

  • Thanks to @backend-tom for quick debugging
  • Customer communication handled well by @support-lead

Review completed: YYYY-MM-DD
Attendees: List names
Published: Link to blog post (for team visibility)

Template Usage Notes

When to complete: Within 48 hours of incident resolution

Who leads: Incident commander (on-call engineer who handled it)

Who attends:

  • Incident commander
  • Engineers involved in resolution
  • Relevant team leads
  • Optional: Product/business stakeholders

Format:

  • Schedule 30-60 min meeting
  • Use this template as agenda
  • Focus on learning, not blaming
  • Document in writing, share with team

Blameless culture:

  • No finger-pointing
  • Focus on systems and processes
  • "What went wrong" not "who messed up"
  • Assume good intent
  • Learn and improve

Publishing:

  • Share in Docusaurus blog section
  • Tag with category: "Incident Reports"
  • Make it searchable for future reference
  • Celebrate quick recovery and learning