Incident management is the disciplined process for detecting, responding to, and learning from production-impacting events. Organizational readiness — clear severity, roles, communication, and follow-through — matters as…
1. Overview: lifecycle and readiness
Effective response depends on preparedness (runbooks, dashboards, escalation paths) and learning (blameless postmortems, tracked actions). Incidents are normal at scale; a practiced response system is what shortens their impact and prevents recurrence.
Cause summary (preliminary OK), customer impact duration, follow-up
Postmortem invite
Schedule learning
Link to doc, attendees, no-blame framing
6. Postmortem / retrospective structure

| Section | Intent |
| --- | --- |
| Summary | What broke, for whom, how long |
| Blameless timeline | Factual sequence; no individual blame |
| Contributing factors | Multiple factors (not a single “root cause”): people, process, tech, external |
| What went well / poorly | Honest assessment |
| Action items | Owner, due date, tracking link (ticket) |
| Follow-through | Review completion in operational forums |

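
The action-item and follow-through rows above lend themselves to a simple automated check. The following is a minimal sketch, not a prescribed tool: the `ActionItem` class, its field names, and the `TICKET-1` placeholder are all hypothetical and would need to be mapped to whatever ticketing system is in use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    # Field names are hypothetical; adapt them to your ticketing system.
    title: str
    owner: str
    due: Optional[date]
    ticket_url: str
    closed: bool = False


def missing_fields(item: ActionItem) -> list[str]:
    """Return the names of required fields that are empty."""
    missing = []
    if not item.owner.strip():
        missing.append("owner")
    if item.due is None:
        missing.append("due")
    if not item.ticket_url.strip():
        missing.append("ticket_url")
    return missing


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has already passed."""
    return [i for i in items if not i.closed and i.due is not None and i.due < today]


items = [
    ActionItem("Add alert on queue depth", "team-payments", date(2024, 3, 1), "TICKET-1"),
    ActionItem("Update failover runbook", "", None, ""),
]
for item in items:
    print(item.title, "missing:", missing_fields(item))
print("overdue:", [i.title for i in overdue(items, date(2024, 3, 15))])
```

Running a check like this before the operational review surfaces items that cannot be followed up on (no owner, no date, no ticket) separately from items that are simply late.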
7. SEV1 response sequence (illustrative)
8. Chaos engineering integration

| Activity | Goal |
| --- | --- |
| Game days | Rehearse incident roles and tooling with controlled scenarios |
| Fault injection | Validate detection, runbooks, and graceful degradation |
| Hypothesis-driven experiments | “If we kill X, latency stays within SLO”; ties to SRE error budgets |

Chaos engineering is not random breakage in production without safeguards: each experiment starts from a steady-state hypothesis and runs within a blast-radius limit (see also SRE and observability (blueprint)).
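
To make the safeguards concrete, here is a minimal sketch of a hypothesis-driven experiment wrapper. It assumes the caller supplies its own measurement, fault-injection, and rollback functions; every name (`Experiment`, `run`, `max_blast_radius_pct`) and the 5% default limit are hypothetical and do not correspond to any particular chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    # Every name here is hypothetical and stands in for your own tooling.
    hypothesis: str
    measure_p99_ms: Callable[[], float]  # reads the steady-state signal
    inject_fault: Callable[[], None]     # e.g. stop one instance of X
    rollback: Callable[[], None]         # undoes the fault
    slo_p99_ms: float                    # latency threshold taken from the SLO
    blast_radius_pct: float              # share of traffic or hosts affected
    max_blast_radius_pct: float = 5.0    # safeguard; the value is an assumption


def run(exp: Experiment) -> str:
    """Run one experiment; return 'skipped', 'aborted', 'held', or 'violated'."""
    if exp.blast_radius_pct > exp.max_blast_radius_pct:
        return "skipped"  # blast radius exceeds the agreed limit
    if exp.measure_p99_ms() > exp.slo_p99_ms:
        return "aborted"  # not in steady state, so do not add faults
    exp.inject_fault()
    try:
        observed = exp.measure_p99_ms()
    finally:
        exp.rollback()  # always undo the fault, even if measurement fails
    return "held" if observed <= exp.slo_p99_ms else "violated"


# Fake system for illustration: latency rises while the fault is active.
state = {"fault": False}
exp = Experiment(
    hypothesis="If we kill X, latency stays within SLO",
    measure_p99_ms=lambda: 180.0 if state["fault"] else 120.0,
    inject_fault=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
    slo_p99_ms=200.0,
    blast_radius_pct=2.0,
)
result = run(exp)
print(exp.hypothesis, "->", result)
```

The ordering is the point of the sketch: the blast-radius and steady-state checks run before any fault is injected, and rollback sits in a `finally` block so the fault is removed regardless of the outcome.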
9. Metrics

| Metric | Definition | Use |
| --- | --- | --- |
| MTTA | Mean time to acknowledge an alert | On-call health, routing quality |
| MTTD | Mean time to detect an incident | Monitoring and SLO coverage |
| MTTR | Mean time to resolve / restore | Operational effectiveness |
| MTBF | Mean time between failures | Reliability engineering input |
| Incidents by severity | Count over a time window | Trend risk and investment |
| Postmortem completion rate | % of incidents with closed actions | Learning culture indicator |

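
As a worked illustration of the time-based metrics, the sketch below computes MTTD, MTTA, and MTTR from incident timestamps. The `Incident` field names are hypothetical, and the interval boundaries are assumptions: here MTTD runs from start of impact to alert, MTTA from alert to acknowledgement, and MTTR from start of impact to restoration. Teams define these differently, so the boundaries should be agreed before comparing numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # All field names are hypothetical; map them to your incident tracker.
    started_at: datetime       # when customer impact began
    detected_at: datetime      # when the alert fired
    acknowledged_at: datetime  # when on-call acknowledged the alert
    resolved_at: datetime      # when service was restored


def _mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def time_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Return MTTD, MTTA, and MTTR in minutes for a non-empty incident list."""
    if not incidents:
        raise ValueError("need at least one incident")
    return {
        "MTTD": _mean_minutes([i.detected_at - i.started_at for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged_at - i.detected_at for i in incidents]),
        "MTTR": _mean_minutes([i.resolved_at - i.started_at for i in incidents]),
    }


# Two made-up incidents, purely for illustration.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=6), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=10), t0 + timedelta(minutes=80)),
]
metrics = time_metrics(sample)
print(metrics)
```

Means are sensitive to a single long incident, so it can be worth reporting a median or percentile alongside them when the sample is small.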
10. Anti-patterns

| Anti-pattern | Effect |
| --- | --- |
| Blame culture | Hides facts; repeats failures |
| Hero-dependent response | Bus factor; inconsistent outcomes |
| No postmortems | Same outages recur |
| Alert fatigue | Real incidents missed; see observability guide for alert design |

11. Readiness checklist (before the pager fires)

| Area | Question |
| --- | --- |
| Runbooks | Is there a first-response doc for top alert types? |
| Dashboards | Can on-call see golden signals and deploy correlation in one place? |
| Ownership | Is every critical path service mapped to a team and escalation? |

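
Parts of this checklist can be checked mechanically against a service catalog. The sketch below covers only the ownership and runbook questions, at the service level, and assumes a hypothetical `Service` record; the field names and sample entries are illustrative and not taken from any real catalog.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical catalog entry; field names are illustrative only.
    name: str
    critical: bool
    team: str = ""
    escalation: str = ""
    runbook: str = ""


def readiness_gaps(services: list[Service]) -> dict[str, list[str]]:
    """Map each critical service to the readiness fields it is missing."""
    gaps = {}
    for svc in services:
        if not svc.critical:
            continue  # only critical-path services are checked
        missing = [f for f in ("team", "escalation", "runbook") if not getattr(svc, f)]
        if missing:
            gaps[svc.name] = missing
    return gaps


catalog = [
    Service("checkout", True, team="payments", escalation="payments-oncall", runbook="runbooks/checkout"),
    Service("search", True, team="discovery"),
    Service("internal-wiki", False),
]
gaps = readiness_gaps(catalog)
print(gaps)
```

A report like this answers the ownership question for every service at once, which is easier to keep current than a manual review.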