In The .930 Mirage, we showed that 89% of save percentage variance is non-repeatable. The year-to-year correlation for NHL goalies with 30 or more games played is r = 0.33. By psychometric standards, that makes save percentage an unreliable measure of individual ability.
The obvious follow-up question is whether a better metric fixes this.
Goals Saved Above Expected is the analytics community’s preferred goaltending metric. It takes every shot a goalie faces, assigns it a probability of becoming a goal based on location, shot type, and context, then compares the actual goals allowed to the expected total. A goalie who allows fewer goals than the model predicts gets a positive GSAx. One who allows more gets a negative GSAx.
The xG models that power GSAx are typically built with gradient-boosted trees, a machine learning method that learns which shot features predict goals. MoneyPuck, Evolving Hockey, and Natural Stat Trick all publish their own versions. GSAx is widely considered the best publicly available tool for goaltender evaluation because it controls for shot quality. A goalie facing 25 slot chances deserves a different baseline than one facing 25 point shots.
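For concreteness, here is a minimal sketch of the GSAx arithmetic on shot-level data. The toy data and column names (goalie, xg, goal) are invented for illustration, not MoneyPuck’s actual schema, and the hard part, the xG model itself, is assumed away:

```python
# Minimal sketch of the GSAx calculation from shot-level data.
# The toy data and column names are illustrative, not MoneyPuck's schema.
import pandas as pd

shots = pd.DataFrame({
    "goalie": ["A", "A", "B", "B"],
    "xg":     [0.18, 0.05, 0.30, 0.07],  # model-assigned goal probability
    "goal":   [0, 0, 1, 0],              # actual outcome
})

per_goalie = shots.groupby("goalie").agg(
    expected_goals=("xg", "sum"),
    actual_goals=("goal", "sum"),
)
# Positive GSAx: fewer goals allowed than the model expected.
per_goalie["gsax"] = per_goalie["expected_goals"] - per_goalie["actual_goals"]
print(per_goalie)
```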
The logic for testing GSAx is straightforward. Save percentage treats all shots equally. GSAx does not. If the goal is to isolate what the goalie actually contributed, GSAx should be more informative and more stable from year to year.
So we ran the same analysis on GSAx.
We pulled every goalie season from MoneyPuck across the same 18-year window (2007-08 through 2024-25). Same 30-game minimum. Same consecutive-season pair design. 550 pairs total, 160 goalies.
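For readers who want to replicate the design, here is a sketch of the pair construction. It assumes a tidy table with one row per goalie season; the column names (goalie, season, gp) and the season-as-starting-year convention are our assumptions, not any source’s actual format:

```python
# Sketch of the consecutive-season pair design. Assumes one row per
# goalie season, with "season" stored as its starting year (2007 for
# 2007-08); column names are illustrative.
import pandas as pd

def year_to_year_r(seasons: pd.DataFrame, metric: str, min_gp: int = 30) -> float:
    """Pearson r between a goalie's metric in season t and season t+1."""
    df = seasons[seasons["gp"] >= min_gp].copy()
    df["next_season"] = df["season"] + 1
    # Join each qualifying season to the same goalie's following season.
    pairs = df.merge(
        df,
        left_on=["goalie", "next_season"],
        right_on=["goalie", "season"],
        suffixes=("_y1", "_y2"),
    )
    return pairs[f"{metric}_y1"].corr(pairs[f"{metric}_y2"])
```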
The result was not what we expected.
The Reliability Ladder
Here is the year-to-year correlation for four goaltending metrics, all measured on the same 550 pairs:
Expected save percentage, which reflects the quality of shots a goalie faces (essentially a measure of team defense), is the most stable metric at r = 0.63. That is consistent with what we know: defensive systems are coached, structured, and relatively stable from year to year.
Raw save percentage comes next at r = 0.33. This is the number from The .930 Mirage. It is moderately stable because it blends two things: the goalie’s individual performance and the team defense in front of them.
GSAx per 60 minutes sits at r = 0.14.
Delta save percentage, which is simply save percentage minus expected save percentage, also sits at r = 0.14. This is not a coincidence. GSAx/60 and delta SV% are measuring the same thing: the goalie’s contribution after removing the team effect. They agree almost perfectly on how unreliable that contribution is.
The pattern holds at every games-played threshold we tested (25, 30, 40) and in 5-on-5 data only, where the goalie-isolated metrics drop even further to r = 0.08.
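Given the pairing function sketched above, the ladder is just one call per metric column (names again hypothetical):

```python
# Hypothetical usage over the same pairs described in the text.
for metric in ["xsv_pct", "sv_pct", "gsax_per_60", "delta_sv_pct"]:
    print(metric, round(year_to_year_r(seasons, metric), 2))
```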
Why the “Better” Metric Is Less Reliable
This is the paradox. GSAx is a better measure of what a goalie did this season. But it is a worse predictor of what that goalie will do next season.
The explanation is straightforward once you see the variance decomposition.
Save percentage is the sum of two components: team defense (measured by expected save percentage) and the goalie’s individual value-add (measured by delta save percentage). The team component is stable (r = 0.63). The goalie component is weak (r = 0.14).
When you look at raw save percentage, you are looking at both components blended together. The stable team effect pulls the year-to-year correlation up. The noisy goalie effect pulls it down. The net result is r = 0.33. Not great, but not terrible.
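A toy simulation makes the blending effect concrete. The component reliabilities (0.63 and 0.14) come from the ladder above; the variance split between components is an assumption chosen so the blend lands near the observed r = 0.33, not a fitted decomposition:

```python
# Toy simulation: blending a stable team component with a noisy goalie
# component produces an intermediate year-to-year correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def two_year_trait(r: float, size: int):
    """Standardized Year 1 / Year 2 values with correlation r."""
    y1 = rng.standard_normal(size)
    y2 = r * y1 + np.sqrt(1 - r**2) * rng.standard_normal(size)
    return y1, y2

team1, team2 = two_year_trait(0.63, n)      # expected SV% component
goalie1, goalie2 = two_year_trait(0.14, n)  # delta SV% component

# Illustrative variance weights, chosen to roughly reproduce r = 0.33.
w_team, w_goalie = 0.4, 0.6
sv1 = np.sqrt(w_team) * team1 + np.sqrt(w_goalie) * goalie1
sv2 = np.sqrt(w_team) * team2 + np.sqrt(w_goalie) * goalie2
print(round(float(np.corrcoef(sv1, sv2)[0, 1]), 2))  # ~0.33
```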
GSAx strips out the team effect. That is exactly what it is designed to do, and it does it well. But the thing it isolates, the goalie’s personal contribution above or below expectations, turns out to be the part of goaltending that barely persists from year to year.
The “better” metric is less reliable because it is doing its job. And the thing it reveals is that goaltender-specific performance, the part that is truly about the goalie and not the defense, has weak year-to-year persistence.
The Cloud
Each dot is one goalie measured in consecutive seasons. If GSAx were a reliable individual trait, the dots would cluster along the diagonal. A goalie who posted +30 this year should post something in that neighborhood next year.
Instead, the cloud is nearly circular. A goalie who saved 30 goals above expected this season is roughly as likely to post a negative GSAx next year as to repeat the performance.
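A minimal version of that scatter, reusing the hypothetical pairs table from the earlier sketch (the GSAx column names are assumptions):

```python
# Minimal consecutive-season scatter; "pairs" and its GSAx columns
# come from the earlier sketch and are assumptions, not a real feed.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(pairs["gsax_y1"], pairs["gsax_y2"], alpha=0.4)
lims = [pairs["gsax_y1"].min(), pairs["gsax_y1"].max()]
ax.plot(lims, lims, linestyle="--")  # the diagonal a reliable trait would hug
ax.set_xlabel("GSAx, Year 1")
ax.set_ylabel("GSAx, Year 2")
plt.show()
```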
What Happened to the Best
Of the 13 best GSAx seasons in our dataset that had a follow-up year, 5 were followed by a negative GSAx. The median follow-up was +7.4, about one-fifth of the original performance. Only two goalies (Shesterkin and Hellebuyck) showed anything resembling persistence at elite levels, and both still regressed substantially.
Juuse Saros posted the highest single-season GSAx in our 18-year dataset (+46.7 in 2021-22). The next year: -3.0. Tim Thomas posted +39.6 in 2009-10. The next year: -2.7. Braden Holtby posted +35.3 in 2015-16. The next year: -4.4.
These are not injuries. These are not aging curves. These are the natural consequences of a metric with r = 0.14. Most of the extreme values are driven by factors that do not repeat.
The Prediction Contest
We ran a simple competition. Given a goalie’s Year 1 numbers, predict his Year 2 save percentage. Four methods. One winner.
Guessing the league mean for every goalie (no individual information at all) produced MAE = 0.00932.
Using raw Year 1 save percentage was actually worse: MAE = 0.01048. Knowing a goalie’s save percentage and using it at face value makes your prediction 12.5% less accurate than ignoring it entirely. This is The .930 Mirage in one number.
Shrinking Year 1 save percentage 67% toward the league mean improved prediction by 6.5% over the baseline. This is the empirical Bayes approach from The .930 Mirage, and it works.
Using Year 1 GSAx/60 was barely better than guessing: 0.5% improvement. Despite being the “better” metric, it contains almost no usable signal for next-year prediction.
The best single predictor of a goalie’s next-year save percentage was the expected save percentage of the defense in front of him: 7.1% improvement. Knowing who the goalie plays behind tells you more about his future save percentage than knowing anything about his individual performance.
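For transparency, here is a compact sketch of the contest: the league-mean baseline plus the four methods. The pairs table and its column names (sv_y1, sv_y2, gsax60_y1, xsv_y1) are assumptions, the regression entries are fit in-sample for brevity, and the 67% shrinkage factor is taken as given from Part 1:

```python
# Sketch of the prediction contest. Column names are illustrative;
# regressions are fit in-sample for brevity.
import numpy as np
import pandas as pd

def mae(pred, actual) -> float:
    return float(np.mean(np.abs(pred - actual)))

def contest(pairs: pd.DataFrame) -> pd.Series:
    actual = pairs["sv_y2"]
    league_mean = pairs["sv_y1"].mean()

    preds = {
        "league mean (baseline)": np.full(len(pairs), league_mean),
        "raw Year 1 SV%": pairs["sv_y1"],
        # Empirical Bayes: shrink 67% toward the mean, i.e. keep 33%.
        "shrunk SV%": league_mean + 0.33 * (pairs["sv_y1"] - league_mean),
    }
    # Simple linear predictors from Year 1 GSAx/60 and Year 1 xSV%.
    for name, col in [("Year 1 GSAx/60", "gsax60_y1"),
                      ("Year 1 expected SV%", "xsv_y1")]:
        slope, intercept = np.polyfit(pairs[col], actual, 1)
        preds[name] = intercept + slope * pairs[col]

    return pd.Series({name: mae(p, actual) for name, p in preds.items()})
```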
What We Can and Cannot Say
The finding is narrow and specific. We are not saying GSAx is a bad metric. It is the best available tool for evaluating what a goalie did in a given season. If you want to know who the best goalie was in 2023-24, GSAx is the right place to look.
We are saying that GSAx, precisely because it isolates individual goaltending performance, reveals how volatile that performance is. The year-to-year signal in goaltender-specific play is extremely weak. This has implications for contract decisions, trade valuations, and playoff expectations built on one great regular season.
We are also not saying goalies are interchangeable. Some goalies, notably Shesterkin and Hellebuyck, show evidence of sustained above-average GSAx across multiple seasons. True talent differences exist. They are just much smaller than single-season GSAx numbers suggest, and they require multiple seasons of data to separate from noise.
Three caveats matter:
- MoneyPuck’s xG model has its own limitations. If the model systematically misattributes team defense as goalie skill (or vice versa), our decomposition inherits that error. Different xG models would produce slightly different reliability estimates.
- Survivorship bias inflates our estimates. Goalies who post terrible numbers get replaced. The ones who remain in our consecutive-season pairs were good enough to keep their jobs, which means the true population reliability is likely even lower than r = 0.14.
- Goaltending may be genuinely more volatile than skating. The position faces fewer events (1,500-2,000 shots per season versus thousands of shift-level observations for skaters), amplifying the role of randomness. Some of what we are calling “low persistence” may simply be an irreducible feature of the position.
Two additional notes for precision. When we filter to goalies who stayed on the same team across consecutive seasons, GSAx reliability rises to r = 0.23. Goalies who changed teams show essentially zero correlation (r = -0.10). Some of the volatility we measure reflects system changes, not pure performance instability. We also tested only single-season GSAx. Multi-year rolling averages would likely show higher reliability, and some analysts use two-to-three-year windows specifically for this reason. Our finding applies to the single-season number, which is how most contract and trade discussions frame the metric.
The Bottom Line
Everyone says “use GSAx, not save percentage.” They are right about measuring the present. But the metric they prefer is worse at predicting the future, because the predictable part of goaltending is not the goalie. It is the defense.
The .930 mirage is not just about save percentage. It extends to the metric that was supposed to fix it.
Data: MoneyPuck goalie data, 2007-08 through 2024-25 (18 seasons). 550 consecutive-season pairs, 30+ games played minimum. Analysis and interactive figures by BTM Analytics.
This is Part 2 of the goaltender reliability series. Part 1: The .930 Mirage.