Mean Time to Recovery (MTTR)
MTTR (Mean Time to Recovery)
MTTR measures how quickly a team restores service after an incident or degradation.
Reducing MTTR requires clarity (triage), speed (containment), and safe reversal (rollback readiness).
Reduce MTTR
- Instrument signals that speed diagnosis (logs/metrics/traces).
- Keep incident response runbooks current.
- Define rollback triggers and practice rollbacks.
- Use postmortems to improve controls and prevent repeats.
See also
Incident Response Runbook Rollback Runbook Observability Rollback Readiness Postmortem TemplateFAQ
What is MTTR exactly?
The average time to restore service after an incident or degradation, measured from detection to recovery.
What reduces MTTR the most?
Clear triage, strong observability, and practiced rollback paths with defined triggers.
How do runbooks help?
They reduce cognitive load under pressure and ensure consistent steps, verification, and evidence capture.
How do we measure MTTR fairly?
Use consistent incident severity definitions and timestamps (detect, mitigate, resolve). Avoid mixing unrelated categories.
What’s the fastest improvement?
Improve alert quality + define a simple triage flow + add one safe rollback path.