Chapter 6 of 10

Code Review Metrics That Matter

Learn which code review metrics to track, how to measure review effectiveness, and which vanity metrics to avoid, with benchmarks from top teams.

17 min read

Why Measure Code Review?

You cannot improve what you do not measure. This principle applies to code review just as it applies to application performance, test coverage, and deployment frequency. Yet most engineering teams have detailed dashboards for their CI pipeline, their error rates, and their infrastructure costs, while having almost no visibility into their code review process.

This blind spot is costly. Code review is one of the largest contributors to lead time for changes. A PR that takes four hours to write but sits in the review queue for two days spends more than 90% of its lead time waiting, not being worked on. Multiply that across every PR every engineer opens, and review latency becomes one of the biggest drags on team velocity.

Measurement serves three purposes in code review.

Identifying bottlenecks. Metrics reveal where the process breaks down. Is the problem that reviewers are slow to pick up PRs? That PRs are too large to review efficiently? That a single senior engineer is a bottleneck reviewer for the entire team? Without data, teams rely on anecdotes and gut feelings, which are often wrong.

Tracking improvement. When you make process changes (adopting AI review tools, setting PR size guidelines, redistributing review assignments), metrics tell you whether the changes actually worked. “It feels faster” is not a reliable signal. “Median time to first review dropped from 18 hours to 6 hours after we adopted CodeRabbit” is.

Aligning incentives. When a metric is visible, people pay attention to it. Making review cycle time visible on a team dashboard creates gentle social pressure to respond to reviews promptly, not because anyone is being punished, but because the number is right there. This is the positive side of measurement, and it works remarkably well when applied thoughtfully.

The danger, of course, is measuring the wrong things or using metrics punitively. We will cover both of those failure modes in detail later in this chapter. But first, let us establish which metrics actually matter.

Key Metrics to Track

Figure: Six key code review metrics (cycle time, time to first review, throughput, PR size, defect density, review depth)

Review Cycle Time

Review cycle time measures the elapsed time from when a PR is opened (or marked ready for review) to when it receives its final approval. This is the single most important code review metric because it directly impacts how fast your team can ship.

What it tells you. High cycle time means code is sitting idle waiting for feedback. This delays deployments, increases the risk of merge conflicts, and forces authors to context-switch back to old work. Low cycle time means the feedback loop is tight and developers can iterate quickly.

How to measure it. Most Git hosting platforms provide the raw data (PR opened timestamp, review submitted timestamp, PR merged timestamp). Tools like LinearB and Graphite calculate this automatically and provide trend analysis. You can also extract it from the GitHub or GitLab API with a simple script.

What to watch for. Track the median, not the average. A few outlier PRs (holiday weeks, complex architectural changes) can skew the average dramatically. Also track the 90th percentile. If your median is 8 hours but your p90 is 72 hours, you have a long tail problem that the median alone would obscure.
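As a concrete example, here is a minimal sketch of that percentile math, assuming you have already exported opened and approved timestamps for merged PRs (the timestamps below are made up; the real ones would come from your platform's API):

```python
from datetime import datetime
from statistics import median, quantiles

# Hypothetical export: (opened_at, approved_at) ISO timestamps for merged PRs,
# pulled from the GitHub or GitLab API or exported from a tool like LinearB.
prs = [
    ("2025-01-06T09:15:00+00:00", "2025-01-06T14:40:00+00:00"),
    ("2025-01-06T11:00:00+00:00", "2025-01-08T10:05:00+00:00"),
    ("2025-01-07T16:30:00+00:00", "2025-01-07T17:10:00+00:00"),
    ("2025-01-08T10:00:00+00:00", "2025-01-09T09:30:00+00:00"),
]

def hours_between(opened: str, approved: str) -> float:
    delta = datetime.fromisoformat(approved) - datetime.fromisoformat(opened)
    return delta.total_seconds() / 3600

cycle_times = [hours_between(opened, approved) for opened, approved in prs]

# Report the median and the 90th percentile, not the mean: a single slow PR
# (holiday week, large refactor) should not dominate the number you track.
print(f"median cycle time: {median(cycle_times):.1f}h")
print(f"p90 cycle time:    {quantiles(cycle_times, n=10)[-1]:.1f}h")
```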

Time to First Review

Time to first review measures how long it takes for the first reviewer to leave substantive feedback after a PR is opened. This is distinct from cycle time because it captures the responsiveness of the review process rather than the overall duration.

What it tells you. A long time to first review signals that PRs are sitting in a queue unnoticed. This is the most frustrating part of the review process for authors: not knowing when anyone will look at their work makes it hard to plan around the wait and drains momentum. Even if the review itself takes multiple rounds, a fast first response keeps the author engaged and productive.

Benchmarks. Google’s engineering practices recommend responding to code review requests within one business day, with the aspiration of responding within a few hours. Teams using Graphite report that stacked PRs and automated review assignment can push time to first review under 2 hours. See our analysis of how to reduce code review time for specific tactics.

Review Throughput

Review throughput measures how many PRs each reviewer handles per week or per sprint. This metric helps you understand the distribution of review work across the team.

What it tells you. Uneven distribution of review work is one of the most common causes of review bottlenecks. If one senior engineer reviews 40% of all PRs while three other qualified reviewers handle 20% each, the senior engineer is both overloaded and a single point of failure. When they take vacation, review cycle time spikes.

How to use it. Plot review throughput per team member over time. If the distribution is heavily skewed, consider implementing round-robin review assignment or using CODEOWNERS files to distribute reviews based on file ownership rather than defaulting to the same trusted reviewers.
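As a rough illustration (the reviewer names and counts below are made up), a few lines of Python are enough to surface a skewed distribution from exported review events:

```python
from collections import Counter

# Hypothetical export: one (reviewer, ISO week) pair per submitted PR review,
# pulled from your Git hosting platform's API.
reviews = [
    ("alice", "2025-W02"), ("alice", "2025-W02"), ("alice", "2025-W02"),
    ("bob", "2025-W02"), ("carol", "2025-W02"),
    ("alice", "2025-W03"), ("alice", "2025-W03"), ("bob", "2025-W03"),
    ("carol", "2025-W03"), ("dana", "2025-W03"),
]

per_reviewer = Counter(reviewer for reviewer, _week in reviews)
total = sum(per_reviewer.values())

# One reviewer handling 40%+ of all PRs is the signal to rebalance with
# round-robin assignment or CODEOWNERS-based routing.
for reviewer, count in per_reviewer.most_common():
    print(f"{reviewer:<8} {count:>3} reviews  ({count / total:.0%} of total)")
```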

Caveats. Raw throughput without context is misleading. Reviewing a 20-line configuration change is not the same as reviewing a 500-line architectural refactor. Pair throughput with PR size and review depth metrics for a complete picture.

PR Size

PR size measures the number of lines changed (additions + deletions) in each pull request. This is a leading indicator that predicts review quality rather than measuring it directly.

What it tells you. SmartBear’s research and Google’s internal data consistently show that review quality degrades sharply as PR size increases. Reviewers examining PRs under 200 lines find defects at a rate of 80-90%. For PRs over 1,000 lines, the defect detection rate drops below 50%. Large PRs get rubber-stamped because reviewers cannot maintain focus across a massive diff.

Benchmarks. The widely cited threshold is 400 lines changed, based on convergent findings from Google, Microsoft, and SmartBear. Elite teams aim for under 200 lines. Very few teams maintain review quality above 1,000 lines.

How to improve it. Teach authors to break large changes into stacked PRs or sequential PRs that can be reviewed independently. Feature flags enable merging partial implementations safely. Tools like Graphite are specifically designed to make stacked PRs practical. For more on this, see code review best practices.
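One lightweight way to make the guideline visible is a CI step that warns, but never blocks, when a diff exceeds the threshold. The sketch below assumes a GitHub Actions-style runner with the base branch available as origin/main; adjust both for your setup:

```python
import subprocess
import sys

SOFT_LIMIT = 400  # lines changed; a guideline, not a gate

# Total additions + deletions between the PR branch and its base.
diff = subprocess.run(
    ["git", "diff", "--numstat", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
)

changed = 0
for line in diff.stdout.splitlines():
    added, deleted, _path = line.split("\t", 2)
    if added != "-":  # binary files report "-" instead of line counts
        changed += int(added) + int(deleted)

if changed > SOFT_LIMIT:
    # "::warning::" renders as an annotation in GitHub Actions.
    print(f"::warning::This PR changes {changed} lines (guideline: {SOFT_LIMIT}). "
          f"Consider splitting it into stacked PRs.")
else:
    print(f"PR size: {changed} changed lines.")

sys.exit(0)  # never fail the build on size alone
```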

Defect Density

Defect density measures the number of bugs or issues caught during code review, normalized by PR size or by time period. This is the most direct measure of review effectiveness, and also the most nuanced.

What it tells you. A consistently high defect density in review suggests that upstream quality processes (design review, testing, static analysis) may have gaps. A very low defect density might mean reviews are catching nothing, or it might mean the code arriving for review is already high quality because of strong upstream practices.

How to track it. Tag review comments that identify genuine defects (as opposed to style suggestions or nitpicks). Over time, categorize defects by type: correctness bugs, security vulnerabilities, performance issues, missing error handling. This categorization tells you where your review process is strong and where it has blind spots.
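One way to make this countable is a tagging convention that a script can parse. The sketch below assumes a hypothetical convention in which reviewers prefix defect-identifying comments with defect(category):

```python
import re
from collections import Counter

# Hypothetical convention: reviewers prefix genuine defects with a tag such as
# "defect(security): ..." or "defect(correctness): ...". Style comments,
# questions, and nitpicks carry no tag and are excluded from this metric.
DEFECT_TAG = re.compile(r"^defect\((\w+)\):", re.IGNORECASE)

comments = [
    "defect(correctness): off-by-one in the pagination loop",
    "nit: rename `tmp` to something descriptive",
    "defect(security): user input is interpolated into the SQL string",
    "question: why not reuse the existing retry helper?",
]
lines_changed = 350  # additions + deletions for the PR under review

defects = Counter(
    match.group(1).lower()
    for comment in comments
    if (match := DEFECT_TAG.match(comment))
)

print(f"defects per 100 changed lines: {sum(defects.values()) / lines_changed * 100:.1f}")
print(f"by category: {dict(defects)}")
```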

Caveats. Never use defect density as a productivity metric for individual developers. “Alice’s PRs had 12 review-caught bugs last quarter while Bob’s had 3” could mean Alice is writing buggier code, or it could mean Alice is tackling harder problems, working in a less-tested part of the codebase, or submitting more PRs. Using this metric to compare individuals will incentivize gaming: authors will submit trivial PRs to keep their defect count low.

Review Depth

Review depth measures the number and substance of review comments per PR. It is a proxy for how thoroughly reviewers are actually examining the code.

What it tells you. PRs that are approved with zero comments are a red flag for rubber-stamping, unless the PR is genuinely trivial (a one-line config change, a documentation fix). A healthy review process generates comments on a meaningful percentage of PRs: questions, suggestions, discussions about tradeoffs.

How to measure it. Count the number of review comments per PR, excluding bot comments and CI status updates. A more sophisticated approach categorizes comments by type (blocking, suggestion, nitpick, question, praise) to distinguish between substantive and superficial engagement.
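A minimal sketch of the basic count, assuming each exported PR record carries its review comments along with the commenting account's type (GitHub's API reports "Bot" for app accounts such as CI and AI reviewers):

```python
# Hypothetical export: PR records with their review comments and each
# commenter's account type, as returned by the Git hosting platform's API.
prs = [
    {"number": 101, "comments": [
        {"author": "alice", "author_type": "User",
         "body": "Can this race with the cache refresh?"},
        {"author": "some-review-bot", "author_type": "Bot",
         "body": "Automated summary..."},
    ]},
    {"number": 102, "comments": []},
]

for pr in prs:
    human = [c for c in pr["comments"] if c["author_type"] != "Bot"]
    print(f"PR #{pr['number']}: {len(human)} human review comments")
```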

Benchmarks. There is no universal benchmark for comments per PR because it depends heavily on PR size, team maturity, and code complexity. What matters is the trend. If your team's average comments per PR drops from 4 to 0.5 over several months, something has changed, and probably not for the better.

Benchmarks from High-Performing Teams

Figure: Code review benchmarks from Google, Microsoft, and DORA research

Benchmarks from industry leaders provide useful reference points, but they should be interpreted with caution. Your team’s context (size, domain, tech stack, remote vs. co-located) affects what is achievable.

Google. Google’s published engineering practices state that code reviews should be responded to within one business day. Internal data (shared in various conference talks and publications) shows that Google’s median review cycle time is under 4 hours for changes under 100 lines. Google mandates at least one reviewer approval for every change, and their readability review process adds a second layer of review for engineers who have not yet earned readability certification in a given language.

Microsoft. Research published by Microsoft’s empirical software engineering group found that their most effective teams had review turnaround times under 24 hours and PR sizes averaging 200-300 lines changed. They also found that review quality (measured by post-merge defects) correlated more strongly with PR size than with reviewer experience, which reinforces the importance of small PRs.

DORA research. The DORA (DevOps Research and Assessment) program, now part of Google Cloud, has collected data from tens of thousands of engineering teams. Their research categorizes teams into elite, high, medium, and low performers. Elite teams have a lead time for changes of less than one day, which implies that code review (a component of lead time) takes hours rather than days. The 2023 State of DevOps report found that elite teams are 2.5x more likely to have a well-defined code review process than low performers.

Industry surveys. The State of AI Code Review 2026 report found that teams using AI-assisted review tools reduced their median review cycle time by 35-50% compared to purely manual review. The improvement came primarily from faster time to first review (AI provides instant feedback) and fewer review rounds (AI catches issues that would otherwise require a revision cycle).

Vanity Metrics to Avoid

Not every measurable number is a useful metric. Some commonly tracked metrics are actively harmful because they incentivize the wrong behaviors.

Number of PRs merged per developer. This metric sounds like a productivity measure, but it incentivizes splitting work into artificially small PRs, avoiding complex tasks that take longer, and skipping refactoring that does not produce a PR at all.

Lines of code reviewed per reviewer. More lines reviewed is not better. It usually means the reviewer is handling larger PRs, which research shows leads to worse review quality. This metric rewards the exact behavior you should be discouraging.

Approval speed (time from opening PR to approval, without considering rounds). If you measure how fast PRs get approved without accounting for the number of revision rounds or the quality of the review, you incentivize rubber-stamping. A PR that is approved in 30 minutes with zero comments looks faster than one that takes 4 hours, but if the longer review produces genuinely better code through thoughtful feedback, the speed gain is an illusion.

Review rejection rate. Tracking how often reviewers request changes (versus approving on first pass) is not a useful quality signal. A high request-changes rate might indicate thorough review or it might indicate unnecessarily picky reviewers. A low rate might indicate great code quality or rubber-stamping. Without context, the number is meaningless.

How to Collect and Visualize Review Metrics

Getting review metrics does not require an enterprise platform, though dedicated tools make it significantly easier.

DIY approach. GitHub and GitLab expose PR data through their APIs. A weekly cron job that queries the API for merged PRs, extracts timestamps (opened, first review, approved, merged) and size metrics (additions, deletions), and dumps them into a spreadsheet or dashboard gives you the basics. The GitHub GraphQL API is particularly well-suited to this because you can fetch PR data, review data, and comment data in a single query.
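As an illustration of that single-query approach, here is a minimal sketch against the GitHub GraphQL API (the repository name and token handling are placeholders to adapt):

```python
import requests

# One GraphQL query returns timestamps, size, review, and comment data for
# the most recently merged PRs; run it from a weekly cron job and append the
# rows to a spreadsheet or dashboard.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    pullRequests(states: MERGED, last: 50) {
      nodes {
        number
        createdAt
        mergedAt
        additions
        deletions
        comments { totalCount }
        reviews(first: 1) { nodes { submittedAt } }
      }
    }
  }
}
"""

def fetch_merged_prs(owner: str, name: str, token: str) -> list:
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"owner": owner, "name": name}},
        headers={"Authorization": f"bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["repository"]["pullRequests"]["nodes"]

# Example usage (placeholder repository and token):
# for pr in fetch_merged_prs("your-org", "your-repo", token="ghp_..."):
#     print(pr["number"], pr["createdAt"], pr["mergedAt"],
#           pr["additions"] + pr["deletions"])
```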

Dedicated platforms. LinearB provides a comprehensive engineering metrics dashboard that includes review cycle time, PR size distribution, review throughput per developer, and DORA metrics, all calculated automatically from your Git data. CodeScene takes a different approach, using code-level analysis to identify hotspots (files that change frequently and have high complexity) and correlating them with review metrics to highlight where review effort should be concentrated. Graphite focuses on the developer workflow side, providing stacked PR support and merge queue optimization alongside review metrics.

Visualization best practices. Display metrics as trends over time, not snapshots. A review cycle time of 12 hours is meaningless without context. Is it improving or getting worse? Use percentiles (median, p75, p90) rather than averages. Show distributions, not just central tendencies. And always display metrics at the team level, not the individual level, to avoid creating a surveillance dynamic.

Using Metrics to Improve (Not Punish)

The most important principle of code review metrics is that they exist to improve processes, not to evaluate individuals. When metrics are used punitively, people game them, and gamed metrics are worse than no metrics at all.

Focus on team-level trends. “Our team’s median review cycle time increased from 8 hours to 14 hours over the last month” is a process observation that invites collaborative problem-solving. “Alice’s average review time is 22 hours while the team average is 10 hours” is a personnel evaluation that invites defensiveness, shame, and gaming.

Use metrics to start conversations, not to end them. A spike in review cycle time should prompt questions: Did we have more complex PRs this sprint? Did we lose a reviewer to another project? Is our PR size creeping up? The metric identifies that something changed; the conversation identifies what and why.

Pair lagging indicators with leading indicators. Cycle time is a lagging indicator, so by the time you see it spike, the damage is done. PR size is a leading indicator: if PRs start getting larger, you can predict that cycle time will increase and intervene before it does.

Celebrate improvements. When the team’s metrics improve, acknowledge it. “Our p90 review cycle time dropped from 48 hours to 18 hours this quarter” is worth celebrating and attributing to the specific changes (new review assignment process, PR size guidelines, AI review tools) that caused it.

Set targets, not mandates. “We’d like to get our median cycle time under 8 hours” is a target the team can work toward together. “No PR may be unreviewed for more than 8 hours” is a mandate that will lead to rushed, low-quality reviews at the 7-hour mark.

The Relationship Between Review Metrics and DORA Metrics

DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery) are the gold standard for measuring engineering team performance. Code review metrics connect to DORA metrics in specific, measurable ways.

Lead Time for Changes. This DORA metric measures the time from code commit to production deployment. Code review is typically the largest human-in-the-loop component of lead time. If your CI/CD pipeline takes 15 minutes and your code review takes 24 hours, review accounts for 99% of your lead time. Reducing review cycle time has a direct, often dramatic impact on lead time for changes. Teams that have cut review cycle time by adopting the strategies in our guide on reducing code review time typically see proportional improvements in lead time.

Change Failure Rate. This metric tracks the percentage of deployments that cause a failure in production. Effective code review should reduce change failure rate by catching defects before they ship. Track the correlation between review depth metrics and change failure rate over time. If you see that weeks with more thorough reviews (more comments per PR, more review rounds) also have lower change failure rates, that is evidence that your review process is adding genuine value.
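A rough sketch of that correlation check, with made-up weekly aggregates (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Made-up weekly aggregates: average review comments per PR and the share of
# deployments that week which caused a production failure.
comments_per_pr = [4.1, 3.8, 2.2, 1.5, 3.9, 4.4, 1.1, 2.8]
change_failure_rate = [0.04, 0.05, 0.09, 0.12, 0.05, 0.03, 0.15, 0.08]

# A clearly negative coefficient (more review depth, fewer failed changes)
# is evidence that review effort is paying off downstream.
print(f"Pearson r = {correlation(comments_per_pr, change_failure_rate):+.2f}")
```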

Deployment Frequency. Slow reviews reduce deployment frequency because changes sit in queue rather than shipping. But the relationship is not always straightforward. Some teams deploy less frequently because they batch changes, not because review is slow. Look at whether individual PRs are blocked on review as the direct measure.

Mean Time to Recovery. This metric measures how quickly you recover from production failures. Code review does not directly affect MTTR, but teams with strong review cultures tend to have better-understood codebases (because reviews spread knowledge) and better rollback procedures (because reviews catch risky changes and flag them for safer deployment strategies).

The DORA research consistently shows that elite teams achieve both speed and stability. Fast reviews do not mean sloppy reviews. Elite teams review quickly because they have invested in the practices that make fast, effective review possible: small PRs, clear coding standards, automated mechanical checks, and skilled reviewers who know what to look for.

Metrics Anti-Patterns (Goodhart’s Law in Code Review)

Figure: Code review metrics anti-patterns (Goodhart's Law in action)

Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.” This law applies to code review metrics with remarkable predictability.

Optimizing for cycle time at the expense of quality. If you set a hard target of “all PRs reviewed within 4 hours,” you will see reviewers rubber-stamp complex PRs to meet the deadline. The cycle time metric will look great. The code quality will decline. Track cycle time alongside review depth (comments per PR) and downstream quality (change failure rate) to catch this failure mode.

Optimizing for PR size by splitting artificially. Setting a maximum PR size (say, 400 lines) without providing the tooling and practices for meaningful decomposition leads to PRs that are technically small but contextually incomplete. A 300-line PR that implements half a feature and cannot be understood without the other half is worse than a 600-line PR that implements the complete feature coherently. Measure PR size as a guideline, not a gate.

Counting comments as a measure of thoroughness. If you track comments per PR as a review quality metric, you will get more comments, but not necessarily better review. Reviewers will leave trivial nitpicks to inflate their comment count. Instead of raw comment count, periodically audit a sample of review comments for substantive content.

Individual leaderboards. Publishing per-developer metrics (fastest reviewer, most PRs reviewed, fewest bugs in submitted code) creates competition where you want collaboration. The fastest reviewer is not the best reviewer. The developer with the fewest review-caught bugs might be avoiding complex work. Keep metrics at the team level.

Measuring without acting. The opposite of over-optimization is under-utilization. Collecting metrics and displaying them on a dashboard that nobody checks is a waste of engineering effort. If you are going to measure, commit to reviewing the metrics regularly (monthly is a good cadence), discussing what they show, and making process changes based on what you learn.

The right approach to metrics is deliberate and balanced. Choose a small number of metrics (cycle time, PR size, and review depth are a strong starting set). Display them at the team level. Review them monthly. Use them to identify problems and track the impact of solutions. And always remember that the metrics are proxies for what you actually care about: shipping high-quality software sustainably. When the proxy and the goal diverge, trust the goal.

For practical strategies to improve your review metrics, see how to reduce code review time. For industry-wide data on how AI tools are affecting review performance, see the State of AI Code Review 2026.

Frequently Asked Questions

What is a good code review cycle time?

Elite teams aim for a median review cycle time of under 4 hours, and most high-performing teams keep it under 24 hours. If your median exceeds 48 hours, it's a significant bottleneck worth addressing. Tools like LinearB and Graphite can help you measure and optimize this.

Should I track how many bugs code review catches?

Tracking review-caught defects can be useful directionally, but don't optimize for this metric alone. A team that catches zero bugs in review might have excellent upstream practices (strong typing, comprehensive tests) rather than ineffective reviews.

What are DORA metrics and how do they relate to code review?

DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Recovery) measure overall delivery performance. Code review directly impacts Lead Time for Changes — slow reviews increase lead time. DORA research shows elite teams have both fast reviews and low change failure rates.
