
Mean Time to Recovery

A DORA metric measuring the average time to restore service after a production incident or failure, indicating an organization's incident response capability.

What Is Mean Time to Recovery?

Mean time to recovery (MTTR) is a key DORA (DevOps Research and Assessment) metric that measures the average duration between a production service disruption and full service restoration. It captures how quickly an engineering organization can detect, diagnose, and resolve incidents that affect users. MTTR is one of four metrics identified by DORA as critical indicators of software delivery performance, alongside deployment frequency, lead time for changes, and change failure rate.

The metric is sometimes referred to as “mean time to restore” or “mean time to remediate,” reflecting the fact that recovery does not always mean finding and fixing the root cause — it may mean rolling back, switching to a backup, or deploying a temporary workaround that restores service while the root cause is investigated separately.

DORA’s research classifies organizations by their MTTR performance. Elite performers recover in less than one hour. High performers recover in less than one day. Medium performers recover in less than one week. Low performers take more than six months to recover from failures. The gap between elite and low performers is enormous, and MTTR is one of the metrics most strongly correlated with overall organizational performance.
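
As a rough illustration of these buckets, the classification can be written as a simple threshold lookup. This is only a sketch of the published boundaries (DORA's surveys bucket self-reported answers rather than deriving tiers from raw incident data), and it treats anything slower than a week as low-performing even though the published low bucket starts at six months.

# Example: mapping an MTTR value onto the DORA tiers described above
# (a sketch of the published boundaries, not an official calculation)
def dora_tier(mttr_hours):
    if mttr_hours < 1:
        return "Elite"     # less than one hour
    if mttr_hours < 24:
        return "High"      # less than one day
    if mttr_hours < 24 * 7:
        return "Medium"    # less than one week
    return "Low"           # slower than that (published bucket: more than six months)

print(dora_tier(0.5))  # Elite
print(dora_tier(40))   # Medium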

How It Works

MTTR is calculated by averaging the recovery duration across all incidents in a given time period:

MTTR = Sum of (Recovery Time for Each Incident) / Number of Incidents

Example:
  Incident 1: Detected at 2:00 PM, resolved at 2:45 PM → 45 minutes
  Incident 2: Detected at 9:00 AM, resolved at 9:20 AM → 20 minutes
  Incident 3: Detected at 11:00 PM, resolved at 1:30 AM → 150 minutes

  MTTR = (45 + 20 + 150) / 3 = 71.7 minutes

The recovery timeline consists of several distinct phases, each representing an opportunity for improvement:

Incident Timeline:
┌──────────┬───────────┬──────────────┬─────────────┬──────────┐
│ Failure  │ Detection │  Diagnosis   │   Repair    │ Recovery │
│ occurs   │   time    │    time      │    time     │ verified │
└──────────┴───────────┴──────────────┴─────────────┴──────────┘
           |←── Time to Detect ──→|
                      |←─── Time to Diagnose ──→|
                                    |←── Time to Repair ──→|
|←──────────────── MTTR ──────────────────────────────────→|

Time to detect (TTD). How long before the team knows there is a problem. This is influenced by monitoring coverage, alert configuration, and observability tooling. Some teams rely on customer reports, which dramatically increases TTD.

Time to diagnose (TTDx). How long to identify the root cause or determine an effective remediation. This depends on log quality, tracing infrastructure, runbook availability, and on-call engineer experience.

Time to repair (TTR). How long to implement and deploy the fix. This is influenced by CI/CD pipeline speed, deployment mechanisms (rollback capability, feature flags), and approval processes.
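
Some teams instrument this breakdown directly by recording a timestamp at each phase boundary. The sketch below shows one way to compute it; the field names (failed_at, detected_at, diagnosed_at, resolved_at) are hypothetical rather than taken from any particular incident platform.

# Example: breaking one incident's recovery time into phases
# (timestamp field names are hypothetical, not from a specific platform)
from datetime import datetime

incident = {
    "failed_at":    "2026-03-10T23:00:00+00:00",
    "detected_at":  "2026-03-10T23:12:00+00:00",
    "diagnosed_at": "2026-03-11T00:40:00+00:00",
    "resolved_at":  "2026-03-11T01:30:00+00:00",
}

def minutes_between(record, earlier, later):
    start = datetime.fromisoformat(record[earlier])
    end = datetime.fromisoformat(record[later])
    return (end - start).total_seconds() / 60

print("Time to detect:  ", minutes_between(incident, "failed_at", "detected_at"))     # 12.0
print("Time to diagnose:", minutes_between(incident, "detected_at", "diagnosed_at"))  # 88.0
print("Time to repair:  ", minutes_between(incident, "diagnosed_at", "resolved_at"))  # 50.0
print("Total recovery:  ", minutes_between(incident, "failed_at", "resolved_at"))     # 150.0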

Teams track MTTR using incident management platforms like PagerDuty, Opsgenie, Grafana OnCall, or custom integrations that record incident creation and resolution timestamps.

# Example: Calculating MTTR from incident records
from datetime import datetime

incidents = [
    {"start": "2026-03-01T14:00:00Z", "end": "2026-03-01T14:45:00Z"},
    {"start": "2026-03-05T09:00:00Z", "end": "2026-03-05T09:20:00Z"},
    {"start": "2026-03-10T23:00:00Z", "end": "2026-03-11T01:30:00Z"},
]

def calculate_mttr(incidents):
    """Return the mean recovery time in minutes across all incidents."""
    total_minutes = 0
    for incident in incidents:
        # fromisoformat() on Python versions before 3.11 does not accept a
        # trailing "Z", so normalize it to an explicit UTC offset first.
        start = datetime.fromisoformat(incident["start"].replace("Z", "+00:00"))
        end = datetime.fromisoformat(incident["end"].replace("Z", "+00:00"))
        duration = (end - start).total_seconds() / 60
        total_minutes += duration
    return total_minutes / len(incidents)

mttr = calculate_mttr(incidents)  # 71.67 minutes

Why It Matters

MTTR is a direct measure of organizational resilience — the ability to absorb and recover from failures quickly.

User impact minimization. Every minute of downtime or degraded service affects users. For revenue-generating services, the cost can be quantified directly: a 30-minute outage for a service processing $10,000 per hour costs $5,000 in direct revenue, plus the harder-to-quantify costs of user frustration, trust erosion, and churn.

Deployment confidence. Teams with low MTTR can deploy more frequently because they know that any issues will be quickly resolved. This creates a virtuous cycle: frequent, small deployments are easier to diagnose when they fail, which further reduces MTTR. Teams with high MTTR deploy cautiously and infrequently, leading to larger, riskier changes.

Engineering culture indicator. MTTR reflects the quality of monitoring, documentation, tooling, and incident response processes. Organizations with low MTTR have invested in observability, automated rollbacks, clear runbooks, and well-practiced on-call rotations. High MTTR often indicates underinvestment in operational readiness.

Predictability for stakeholders. When leadership asks “how quickly can we recover if something goes wrong,” MTTR provides a data-driven answer. This predictability is essential for SLA commitments, customer trust, and capacity planning.

Best Practices

  • Invest in detection speed first. The fastest repair in the world does not help if it takes two hours to realize there is a problem. Implement comprehensive monitoring, set up meaningful alerts (not just CPU utilization, but user-facing metrics like error rates and latency percentiles), and use synthetic monitoring to catch issues before users report them.

  • Build rollback capability into every deployment. The fastest way to recover from a bad deployment is to roll back to the previous version. Ensure your deployment pipeline supports one-click rollbacks, and practice using them regularly so the team is confident in the process during real incidents.

  • Maintain actionable runbooks. Document diagnosis and remediation steps for common failure modes. A well-written runbook can reduce diagnosis time from hours to minutes, especially when the on-call engineer is not the expert for the failing system.

  • Conduct regular incident retrospectives. After every incident, review the timeline and identify where the most time was spent. Was detection slow? Diagnosis difficult? Deployment pipeline blocking the fix? Target improvements at the phase that contributed the most to recovery time.

  • Use feature flags for progressive rollouts. Deploying behind feature flags allows you to expose changes to a subset of users and quickly disable them if issues emerge. This transforms a full-service failure into a minor, instantly recoverable event.
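
A minimal sketch of the feature-flag approach from the last bullet: guard the new code path behind a kill switch so that recovery becomes a configuration change rather than a redeploy. The in-memory flag store and pricing functions below are illustrative stand-ins; real systems read flags from a flag service or dynamic configuration.

# Example: guarding a new code path behind a feature-flag kill switch
# (flag store and pricing functions are illustrative stand-ins)
FLAGS = {"new_pricing_engine": True}

def legacy_pricing_engine(order):
    return order["quantity"] * order["unit_price"]

def new_pricing_engine(order):
    return order["quantity"] * order["unit_price"] * 0.95  # new discount logic

def calculate_price(order):
    # Flipping the flag off restores known-good behavior without a rollback.
    if FLAGS.get("new_pricing_engine", False):
        return new_pricing_engine(order)
    return legacy_pricing_engine(order)

order = {"quantity": 3, "unit_price": 10.0}
print(calculate_price(order))        # 28.5 (new path)
FLAGS["new_pricing_engine"] = False  # "recovery" during an incident
print(calculate_price(order))        # 30.0 (back to the known-good path)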

Common Mistakes

  • Reporting only the mean and ignoring distribution. An average MTTR of 30 minutes might include twenty incidents resolved in 5 minutes and one incident that took 10 hours. The mean hides the outlier that caused the most user impact. Track the 90th and 95th percentiles alongside the mean to understand worst-case performance (see the sketch after this list).

  • Conflating MTTR with mean time between failures (MTBF). MTTR measures recovery speed. MTBF measures reliability — how often failures occur. Both are important, but improving one does not automatically improve the other. A team can recover quickly (low MTTR) while failing frequently (low MTBF), or vice versa.

  • Measuring MTTR without standardizing incident severity. Averaging the recovery time for a minor logging issue and a total service outage produces a misleading number. Segment MTTR by severity level to get actionable insights. Compare your P1 MTTR to benchmarks for P1 incidents, not your overall average.
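
Tracking percentiles and segmenting by severity (the first and last points above) is straightforward once incident durations are recorded. A minimal sketch, using illustrative severity labels and durations in minutes:

# Example: segmenting MTTR by severity and reporting percentiles, not just the mean
# (severity labels and durations are illustrative)
import math
from statistics import mean

durations_by_severity = {
    "P1": [45, 20, 150],      # full outages
    "P2": [5, 8, 12, 600],    # minor issues, including one ten-hour outlier
}

def percentile(values, pct):
    # Nearest-rank percentile: adequate for small incident counts.
    ordered = sorted(values)
    k = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[max(k, 0)]

for severity, durations in durations_by_severity.items():
    print(f"{severity}: mean={mean(durations):.1f} min, "
          f"p90={percentile(durations, 90)} min, "
          f"p95={percentile(durations, 95)} min")
# P1: mean=71.7 min, p90=150 min, p95=150 min
# P2: mean=156.2 min, p90=600 min, p95=600 min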
