Every sport has a jargon problem. Not because the terms are complicated - most of them describe simple ideas - but because the ecosystems grew fast, the names are opaque, and nobody bothers to tell you when a metric stops being useful.

This page is a reference. It covers the major metrics used in modern sports analytics, organized by sport and by the question each metric answers. For every metric, you'll get four things: what it measures, how it's calculated, why it matters, and, critically, where it falls short. Every metric has blind spots. Knowing them is the difference between analysis and superstition.

The entries are written for someone who watches sport and wants to understand what the numbers mean, not for someone who already builds models. The reference starts with hockey, which has the deepest public analytics ecosystem outside of baseball, and it will expand to other sports as Beyond the Metric covers them.



Hockey

20 metrics across 6 tiers - from Corsi to playoff probability models

Tier 1 - Possession and Shot Volume

Who is controlling play?

These are the foundational metrics. They count shot attempts - not goals, not wins, just how often each team is putting pucks toward the net. The premise is simple: teams that consistently generate more shot attempts than they allow are, on average, better. It's not a perfect proxy for quality, but it's a remarkably durable one.


Corsi (CF, CA, CF%)

The broadest measure of puck possession - and the starting point for almost everything else in hockey analytics.

What it answers: Who is controlling play? Data source: Shot-level play-by-play (NHL, MoneyPuck, Natural Stat Trick)

Corsi counts all shot attempts directed at the net. That includes goals, shots on goal (saves), missed shots (wide or high), and blocked shots. If a player or team directs a puck toward the opposing net with intent to score, it's a Corsi event.

The name comes from Jim Corsi, a former NHL goaltending coach. He didn't invent the statistic - that credit goes to the blogging community in the late 2000s, particularly Tim Barnes (writing as Vic Ferrari) and the analysts at Hockey Prospectus [1] - but his name stuck as a label for the concept.

CF (Corsi For): Shot attempts by your team while you're on the ice (or, at team level, while the team is playing). CA (Corsi Against): Shot attempts by the opposing team. CF% (Corsi For Percentage): CF / (CF + CA) × 100. Above 50% means you're generating more shot attempts than you're allowing. C+/- (Corsi Plus-Minus): CF minus CA. The raw differential rather than the ratio.

At the team level, CF% is typically calculated at 5-on-5 to strip out special teams effects. Most public data sources default to 5v5 Corsi unless stated otherwise.

How it's calculated: For an individual player at 5v5, you identify every shift that player is on the ice, then count all shot attempts (for and against) that occur during those shifts. You sum across all shifts to get CF and CA, then compute CF%. At the team level, you simply count all 5v5 shot attempts for and against across the full game or season.
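Here's a minimal sketch of that calculation in Python, assuming you already have a list of 5v5 shot-attempt events tagged with the attempting team and the skaters on the ice. The field names are illustrative, not any site's actual schema.

```python
# Minimal sketch: on-ice Corsi for one player at 5v5.
# Each event is a shot attempt (goal, save, miss, or block); field names are illustrative.

def on_ice_corsi(events, player, team):
    """Count shot attempts for/against while `player` is on the ice."""
    cf = ca = 0
    for e in events:
        if player not in e["skaters_on_ice"]:
            continue  # not an on-ice event for this player
        if e["attempting_team"] == team:
            cf += 1   # Corsi For: any attempt by the player's team
        else:
            ca += 1   # Corsi Against: any attempt by the opponent
    cf_pct = 100 * cf / (cf + ca) if (cf + ca) else None
    return cf, ca, cf_pct

events = [
    {"attempting_team": "EDM", "skaters_on_ice": {"McDavid", "Draisaitl"}},
    {"attempting_team": "CGY", "skaters_on_ice": {"McDavid", "Nurse"}},
    {"attempting_team": "EDM", "skaters_on_ice": {"Draisaitl"}},  # McDavid on the bench
]
print(on_ice_corsi(events, "McDavid", "EDM"))  # (1, 1, 50.0)
```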

The critical thing to understand is that CF is not the player's shot attempts - it's the team's shot attempts while that player is playing. This is an on-ice metric, not an individual metric. A player can have a high CF% without personally shooting much, because their linemates are generating chances while they're on the ice together.

Why it matters: Corsi works because shot attempt differential is the most stable and predictive team-level metric available from standard play-by-play data [2]. Teams that sustain high CF% over a season tend to win more games. The correlation isn't perfect, but it's stronger than goal differential over small samples because goals are rare and noisy. A team can get lucky or unlucky over 10 games, but it's very hard to consistently generate more shot attempts than you allow without being genuinely better at controlling play.

At the player level, CF% is a useful first screen. Players who consistently drive positive shot attempt differentials are, on average, contributing to team success. It's not the whole story, but it's a reasonable place to start.

Where it falls short: Corsi treats all shot attempts as equal. A one-timer from the slot and a wrist shot from the blue line both count as one Corsi event, but they are obviously not the same thing. This is the fundamental limitation: Corsi measures volume, not quality. A team can have a great CF% by generating a lot of low-danger shots from the perimeter while allowing fewer but more dangerous chances against. This is why Tier 2 metrics (expected goals, scoring chances) exist - to add a quality filter that Corsi lacks.

Corsi is also confounded by deployment. A player who starts 70% of their shifts in the offensive zone will naturally accumulate more CF than a player starting 70% in the defensive zone, regardless of talent. Score effects matter too: teams protecting a lead tend to sit back and allow more shot attempts, depressing their CF% without necessarily playing badly.

Finally, Corsi tells you nothing about what's happening away from the shot. Forechecking pressure, neutral zone structure, defensive positioning - none of these register unless they eventually produce or prevent a shot attempt.

Related: Fenwick (Corsi minus blocked shots), Relative metrics (comparison to teammates), Score and venue adjustments (correcting for context).


Fenwick (FF, FA, FF%)

Corsi with blocked shots removed - a slightly filtered alternative.

What it answers: Who is controlling play (excluding blocks)? Data source: Shot-level play-by-play

Fenwick counts unblocked shot attempts: goals, saves, and missed shots. It excludes blocked shots from the count entirely. The metric is named after Matt Fenwick, a blogger who proposed it as a refinement of Corsi.

FF% (Fenwick For Percentage): FF / (FF + FA) × 100. The interpretation is the same as CF% - above 50% means you're generating more unblocked attempts than you're allowing.

Why it exists: The original argument was that blocked shots are partly a function of the blocking team's effort and positioning rather than the shooting team's offensive quality. If a defenseman blocks a point shot, Corsi counts that as a shot attempt for the offensive team, but the shot never actually challenged the goaltender. Fenwick removes those events to get a "cleaner" measure of shots that at least had a chance to reach the net.

Where it falls short: In practice, Fenwick and Corsi are almost identical. The correlation between CF% and FF% at the team level is extremely high (r > 0.95 across most samples) [3]. The blocked shots Fenwick removes represent a small fraction of total events, and removing them doesn't meaningfully change team or player rankings in most cases.

The theoretical argument for Fenwick - that blocks reflect the defending team's skill rather than the attacking team's quality - is debatable. Teams that generate lots of blocked shot attempts are often generating them because they're controlling possession and cycling in the offensive zone. Whether you include or exclude the blocks, the underlying signal is similar.

Most modern analysts default to Corsi rather than Fenwick simply because the larger sample size (more events per game) provides slightly more stability. Fenwick isn't wrong, but it doesn't add enough to justify its own existence for most use cases.

Related: Corsi (the broader count including blocks), Expected Goals (quality-weighting rather than filtering).


Relative Metrics (CF% Rel, FF% Rel)

How does this player compare to the rest of their team?

What it answers: Is the team better or worse when this player is on the ice? Data source: Shot-level play-by-play with shift data

Relative metrics compare a player's on-ice numbers to the team's numbers when that player is off the ice. The most common is CF% Rel (Corsi For Percentage Relative):

CF% Rel = Player's on-ice CF% − Team's CF% with the player off the ice.

A CF% Rel of +3.0 means the team's shot attempt share is three percentage points higher when this player is playing than when they're sitting on the bench. A CF% Rel of −2.0 means the team is worse with them on the ice.
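As a quick sketch, the arithmetic looks like this (the totals below are made up to produce a +3.0 example):

```python
# Illustrative CF% Rel: player's on-ice CF% minus the team's CF% without them.
def cf_pct(cf, ca):
    return 100 * cf / (cf + ca)

def cf_rel(on_ice_cf, on_ice_ca, team_cf, team_ca):
    """team_cf/team_ca are the team's season totals; subtracting the player's
    on-ice events leaves the team's numbers with the player on the bench."""
    off_cf = team_cf - on_ice_cf
    off_ca = team_ca - on_ice_ca
    return cf_pct(on_ice_cf, on_ice_ca) - cf_pct(off_cf, off_ca)

# A player at 52.0% on a team that runs 49.0% without them: CF% Rel = +3.0
print(round(cf_rel(520, 480, 2480, 2520), 1))  # 3.0
```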

Why it matters: Relative metrics are an attempt to control for team quality. A player on a dominant team might have a CF% of 55% - impressive in a vacuum - but if the team runs a 54% CF% without them, that player's contribution is only +1.0. Conversely, a player with a 48% CF% on a bad team might have a CF% Rel of +4.0, meaning they're a bright spot on an otherwise struggling roster.

This is particularly useful for comparing players across teams. Raw CF% is heavily influenced by team context; relative metrics partially strip that context away.

Where it falls short: Relative metrics have a serious structural problem called the "Gretzky problem" or "quality of teammates" confound. When a star player is on the bench, who replaces them? Fourth-line players. So the "team without" baseline is partly measuring the absence of the star and partly measuring the presence of weaker replacements. The better a player is, the worse their "off-ice" number tends to be, which inflates their relative metric.

The reverse is also true: a player who shares the ice with elite linemates will have inflated on-ice numbers, but those same elite linemates will boost the team's off-ice numbers when the player in question is sitting (because the stars are still playing). This can compress or distort relative metrics in unintuitive ways.

Usage effects also distort relative metrics. If a player takes the majority of offensive zone starts, their on-ice CF% is inflated by deployment, not talent. The "off-ice" sample may have a very different zone-start distribution, making the comparison misleading.

Relative metrics are a useful directional tool, but they shouldn't be treated as a clean isolation of individual contribution. More sophisticated methods like RAPM (Tier 3) are designed to address these confounds.

Related: Corsi (the base metric), RAPM (a regression-based approach to isolating individual impact).


Score and Venue Adjustments

Why raw shot counts lie, and how to correct for game context.

What it answers: What would these numbers look like if we removed the distortions caused by score state and rink location? Data source: Shot-level play-by-play with game state tags

Raw shot attempt counts are systematically distorted by two well-documented effects:

Score effects: Teams behave differently depending on the score. A team with a two-goal lead in the third period tends to sit back, cede territorial control, and protect the lead. Their CF% drops - not because they've gotten worse, but because their strategy has changed. A trailing team does the opposite: pushes aggressively, generates more shot attempts, and takes risks. If you don't account for this, you'll penalize teams that hold leads and reward teams that spend time chasing the game.

The adjustment works by weighting each shot attempt based on the game state in which it occurred. Shot attempts generated while tied get a weight of 1.0 (the reference state). Shot attempts generated while leading get an upward adjustment (they were harder to generate given the defensive posture), and shot attempts while trailing get a downward adjustment. The specific weights are derived empirically from league-wide data.
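A rough sketch of the weighting step, with placeholder weights rather than the empirically derived league values:

```python
# Sketch of score adjustment: weight each attempt by the score state it occurred in.
# The weights here are illustrative placeholders, not the published league values.
SCORE_WEIGHTS = {  # weight applied to attempts BY the team in this state
    "trailing_2+": 0.90, "trailing_1": 0.95, "tied": 1.00,
    "leading_1": 1.05, "leading_2+": 1.10,
}

def adjusted_cf(attempts):
    """attempts: list of (score_state,) tuples for one team's shot attempts."""
    return sum(SCORE_WEIGHTS[state] for (state,) in attempts)

team_attempts = [("tied",), ("leading_1",), ("trailing_1",), ("leading_2+",)]
print(round(adjusted_cf(team_attempts), 2))  # 1.00 + 1.05 + 0.95 + 1.10 = 4.10
```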

Venue effects (rink bias): Not all NHL arenas record shot attempts the same way. The official scorers who log shots, misses, and blocks are humans, and they vary systematically by building [3]. Some arenas record significantly more total shot events per game than others. This means a team's home CF might be inflated simply because their arena scorer is generous, or deflated because an away arena scorer is stingy.

The venue adjustment uses a similar weighting approach: expected shot counts per venue are estimated from multi-year averages, and each event is weighted to account for the building it occurred in.

Why it matters: Without these adjustments, you're comparing apples to oranges. A team that protects a lot of leads will look worse in raw CF% than they actually are. A player who logs heavy minutes while trailing will look better. Adjusted metrics remove these distortions and give you a cleaner picture of underlying quality.

Where it falls short: Score adjustments assume that teams respond to game state in uniform ways. In reality, coaching philosophies differ - some coaches maintain an aggressive posture with a lead, others turtle. The adjustment applies a league-average correction, which may over- or under-correct for specific teams.

Venue adjustments are also approximations. Scorer tendencies can shift over time (new personnel, new instructions), and the corrections are based on multi-year estimates that may lag behind changes. The adjustment improves accuracy on average but can introduce noise for individual games.

Most modern public analytics sites (Natural Stat Trick, Evolving Hockey) offer both raw and adjusted versions. Use the adjusted numbers for comparing teams and players across contexts. Use raw numbers only when you're looking at a specific, controlled situation (e.g., a single game at 5v5 while tied).

Related: Corsi (the raw metric being adjusted), Expected Goals (which implicitly handles some of these issues through its model features).


Tier 2 - Shot Quality

Are those shots actually dangerous?

Shot volume tells you who's controlling play. Shot quality tells you whether that control is translating into danger. A slot one-timer and a point shot from the blue line both count as one Corsi event, but they are not the same thing. These metrics try to weight shots by how likely they are to go in.


Expected Goals (xG)

The probability a shot becomes a goal, based on location, type, and context.

What it answers: How dangerous was that shot - and how dangerous is this team's overall shot profile? Data source: Shot-level play-by-play with location and contextual features (MoneyPuck, Evolving Hockey)

Expected goals assigns every shot attempt a probability of scoring based on the characteristics of the shot. A wrist shot from the high slot might be worth 0.06 xG (a 6% chance of scoring). A one-timer from the low slot off a cross-ice pass might be worth 0.35 xG. A shot from beyond the blue line might be 0.01 xG. Sum up all the xG values for a team's shots and you get their total expected goals - a measure of how much offensive danger they created, independent of whether the puck actually went in.

How it's calculated: xG models are built by training a statistical model (typically logistic regression or gradient-boosted trees) on hundreds of thousands of historical shots [5, 6]. The model learns which features predict goal scoring and assigns probabilities to new shots based on those features. Common features include shot distance from the net, shot angle, shot type (wrist, slap, backhand, tip, deflection), whether the shot was a rebound (a shot following another shot within a short window), whether the shot was off a rush (an odd-man attack or fast-break), and game state (even strength, power play, shorthanded).
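To make the idea concrete, here's a toy logistic-regression version in Python. The features and the tiny synthetic dataset are illustrative only - this is not MoneyPuck's or anyone else's actual model.

```python
# Toy xG sketch: logistic regression on shot features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per shot: [distance_ft, angle_deg, is_rebound, is_rush] (illustrative)
X = np.array([
    [10, 5, 1, 0], [12, 10, 0, 1], [15, 20, 0, 0], [30, 30, 0, 0],
    [45, 40, 0, 0], [55, 45, 0, 0], [8, 0, 1, 1], [20, 25, 0, 0],
])
y = np.array([1, 1, 0, 0, 0, 0, 1, 0])  # 1 = goal

model = LogisticRegression().fit(X, y)

# xG for a new shot is the predicted probability of the "goal" class.
new_shot = np.array([[14, 15, 0, 1]])  # 14 ft out, slight angle, off the rush
print(round(model.predict_proba(new_shot)[0, 1], 3))
```

A real model swaps the toy rows for hundreds of thousands of historical shots and often trades the logistic regression for gradient-boosted trees, but the output is the same: a probability per shot.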

More advanced models (see the next entry on pre-shot movement) add features like passing sequences before the shot, lateral puck movement across the slot, and time since the last event. But the core idea is the same: use shot characteristics to estimate the probability of a goal.

Why it matters: xG is the most important single advancement in hockey analytics since Corsi. It solves the fundamental problem of shot volume metrics - that not all shots are equal - by weighting each shot by its actual danger. This allows you to separate shot quality from shot volume and to identify teams and players who are generating (or allowing) genuinely dangerous chances versus those who are padding their numbers with low-quality attempts.

At the team level, xGF% (the share of total expected goals belonging to your team) is a stronger predictor of future success than CF% [7]. Teams that sustain high xGF% are creating more quality offense and conceding fewer quality chances against, which translates to real goal scoring more reliably than raw shot attempt volume.

At the player level, individual xG (ixG) tells you how dangerous a player's own shot attempts are, and on-ice xGF% tells you how the overall quality of play changes when they're on the ice.

Where it falls short: xG models are only as good as their inputs. Shot location is the dominant feature, so in practice, xG is largely a proxy for "how close to the net was the shot." It captures the most important factor in goal scoring but misses contextual elements that aren't in the data - the quality of the passing play, the goaltender's positioning, whether the shooter was screened, the defensive structure at the moment of the shot.

Small-sample xG is unreliable. In a single game, the "better" team by xG can easily lose because a few bounces went the wrong way. xG is a probabilistic expectation, not a guarantee. Over a full season, the noise washes out. But in small samples - a playoff series, a single month - actual results can deviate substantially.

Different models give different answers. There is no single "correct" xG model, so two analysts using different models can reach different conclusions about the same player or team. The divergence is usually small for clear cases (elite finishers, poor shooters) but can be meaningful for players in the middle of the distribution.

Key derived metrics: xGF% (expected goals share), xG differential (xGF − xGA), individual xG (ixG, a player's own shot quality), xG/60 (rate-adjusted), and Goals − ixG (shooting talent - the gap between actual goals and expected goals, measuring finishing ability).

Related: Corsi (shot volume without quality weighting), Scoring Chances (simpler zone-based approach), GSAx (xG applied to goaltender evaluation).


Scoring Chances and High-Danger Chances

Zone-based shot classification - a simpler approach to "which shots matter."

What it answers: How many shots came from dangerous areas of the ice? Data source: Shot-level play-by-play with location data (Natural Stat Trick)

Scoring chances classify shot attempts into tiers based on where on the ice they originate. Instead of assigning every shot a precise probability (like xG), you draw zones on the ice and label shots from certain areas as "scoring chances" or "high-danger chances."

The most widely used classification comes from Natural Stat Trick [19]: High-danger chances (HDC) originate from the inner slot area - roughly the region between the faceoff dots and within 15-20 feet of the net. Medium-danger chances come from the mid-slot and areas just outside the inner zone. Low-danger chances are shots from the point, below the goal line, or wide angles.

HDCF% (High-Danger Chances For Percentage) works the same way as CF%: HDCF / (HDCF + HDCA) × 100. It tells you who's generating more high-quality chances.
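A rough sketch of the idea, using distance-only cutoffs as a stand-in for the real drawn zones (the numbers below are illustrative, not Natural Stat Trick's actual boundaries):

```python
# Rough sketch of zone-based shot classification. Real definitions use drawn
# zones on the ice; these distance-only cutoffs are illustrative stand-ins.
def danger_tier(distance_ft):
    if distance_ft <= 20:
        return "high"
    if distance_ft <= 35:
        return "medium"
    return "low"

shot_distances_against = [12, 18, 28, 45, 60, 15]
hd = sum(1 for d in shot_distances_against if danger_tier(d) == "high")
print(hd)  # 3 high-danger shots in this sample
```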

Why it matters: Scoring chances predate xG and offer a simpler, more transparent alternative. You don't need to trust a model's feature weights - you just need to agree on the zone definitions. This makes scoring chances easier to explain to coaches and general audiences. A coach who doesn't trust "expected goals" will often respond to "they're generating 55% of the high-danger chances" because it maps directly to what they see on film: slot shots, net-front tips, cross-crease plays.

HDCF% is also a strong predictor of team success. Teams that dominate high-danger chance share tend to be very good, because goals come overwhelmingly from the inner slot [5]. The relationship between HDC share and goal share is tighter than the relationship between raw shot share and goal share.

Where it falls short: Zone-based classification is a blunt instrument compared to xG. All shots from the high-danger zone are treated as equally dangerous, but a screened one-timer off a cross-ice pass and an unscreened wrister from the same spot are very different. xG handles this by incorporating shot type, rush context, and passing history. Scoring chances flatten those distinctions.

The zone boundaries are also arbitrary. Where exactly does "high danger" end and "medium danger" begin? Different definitions produce different results. Natural Stat Trick's zones are the most widely used, but they're a convention, not a ground truth.

Finally, scoring chances share the same fundamental limitation as all shot-based metrics: they only capture events at the moment of the shot. The passing sequences, off-puck movement, and defensive breakdowns that create the chance are invisible.

Related: Expected Goals (a model-based approach to the same problem), Corsi (volume without quality filtering).


Pre-Shot Movement and Modern xG

What separates the best expected goals models from the rest.

What it answers: How do passing and puck movement before the shot affect danger? Data source: Enhanced shot-level data with passing sequences (MoneyPuck)

Early xG models were built almost entirely on shot location and type. Modern xG models add features that capture what happened in the seconds before the shot - and this is where the real separation in model quality occurs.

The most important pre-shot features are: cross-ice passes (lateral passes across the slot that force the goaltender to move laterally, dramatically increasing the chance of a goal) [9], rebound shots (shots following another shot within 2-3 seconds, where the goaltender hasn't recovered), rush chances (shots on odd-man attacks or breakaways), and passing sequences (the number and type of passes in the buildup, which correlate with defensive disorganization).
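To show what one of these features looks like in practice, here's a sketch of a rebound flag - a shot within a few seconds of a prior attempt by the same team. The field names and the three-second window are illustrative.

```python
# Sketch of one pre-shot feature: flag a shot as a rebound if the same team
# attempted another shot within the preceding 3 seconds.
def flag_rebounds(shots, window_s=3.0):
    """shots: list of dicts with 'team' and 'game_seconds', in time order."""
    flagged = []
    for i, s in enumerate(shots):
        is_rebound = any(
            prev["team"] == s["team"]
            and 0 < s["game_seconds"] - prev["game_seconds"] <= window_s
            for prev in shots[max(0, i - 5):i]  # look back a few events
        )
        flagged.append({**s, "is_rebound": is_rebound})
    return flagged

shots = [
    {"team": "TOR", "game_seconds": 100.0},
    {"team": "TOR", "game_seconds": 101.5},  # 1.5 s after a TOR shot -> rebound
    {"team": "BOS", "game_seconds": 140.0},
]
print([s["is_rebound"] for s in flag_rebounds(shots)])  # [False, True, False]
```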

MoneyPuck's xG model, which is the most widely used public model, incorporates many of these features. The result is a model that captures not just where the shot came from but how it was created - and shots created through east-west passing, quick rebounds, or odd-man rushes are assigned substantially higher xG values than shots of the same type from the same location without those features.

Why it matters: Pre-shot movement features are what elevated xG from "shot location heatmap" to "actual measure of offensive danger." A team that generates the same number of slot shots as another team but does so through cross-ice passing plays rather than straight-on wrist shots is creating meaningfully more danger. Modern xG captures this; early models didn't.

This also matters for player evaluation. Playmakers who create high-xG chances through passing show up in modern xG models in ways they didn't in older ones. A player who doesn't shoot much but consistently generates cross-ice feeds to the slot is creating real offensive value that older metrics would miss.

Where it falls short: Even the best xG models still miss context that a human observer can see. Goaltender positioning at the moment of the shot, the presence or absence of screens, the speed of the play, defensive pressure on the shooter - these factors affect goal probability but are difficult or impossible to extract from play-by-play data. Tracking data (player and puck location at high frequency) could eventually address this, but public tracking data remains limited.

Pre-shot movement features also create model complexity that can reduce interpretability. When a model has 30+ features, it becomes harder to explain why a particular shot was rated at 0.22 xG versus 0.18 xG. This matters less for team-level analysis but matters more when you're trying to communicate findings to non-technical audiences.

Related: Expected Goals (the base concept), Scoring Chances (the simpler alternative that skips pre-shot context entirely).


Tier 3 - Individual Contributions

Who's actually driving this?

Tiers 1 and 2 describe what's happening - shot volume, shot quality, territorial control. Tier 3 asks: which players are responsible? This is where hockey analytics gets genuinely difficult, because hockey is a fluid, continuous, five-on-five sport where individual contributions are deeply entangled with linemate and opponent effects. The metrics here attempt to isolate individual impact, with varying degrees of success.


On-Ice vs. Individual Metrics

The most important distinction in hockey analytics - and the one most often ignored.

What it answers: Did this player do something, or did something happen while they were watching? Data source: Shot-level play-by-play with on-ice player tracking

Every hockey metric applied to a player falls into one of two categories, and confusing them is the single most common mistake in hockey analytics discussion.

On-ice metrics measure what happens when a player is on the ice, regardless of who actually does it. A player's on-ice CF% includes every shot attempt by both teams during their shifts - their linemates' shots, their opponents' shots, shots they had nothing to do with. If McDavid's linemate fires a one-timer while McDavid is at the far blue line, that counts in McDavid's on-ice CF.

Individual metrics measure events the player personally generates. Individual Corsi (iCF) counts only the player's own shot attempts. Individual expected goals (ixG) sums the xG values only on the player's own shots. Individual hits, individual giveaways - these are all personal actions attributed directly to one player.
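A small sketch of the difference, using Corsi (field names illustrative):

```python
# Illustrating the on-ice vs. individual distinction with Corsi.
def split_corsi(events, player, team):
    on_ice_cf = i_cf = 0
    for e in events:
        if e["attempting_team"] != team:
            continue
        if player in e["skaters_on_ice"]:
            on_ice_cf += 1            # counted whenever the player is on the ice
            if e["shooter"] == player:
                i_cf += 1             # counted only for the player's own attempts
    return on_ice_cf, i_cf

events = [
    {"attempting_team": "EDM", "shooter": "Draisaitl", "skaters_on_ice": {"McDavid", "Draisaitl"}},
    {"attempting_team": "EDM", "shooter": "McDavid", "skaters_on_ice": {"McDavid", "Draisaitl"}},
]
print(split_corsi(events, "McDavid", "EDM"))  # (2, 1): two on-ice attempts, one individual
```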

The distinction matters because hockey is a team sport played on a shared ice surface with constant interaction. A player with elite on-ice numbers might be personally excellent, or they might be riding excellent linemates. A player with poor on-ice numbers might be personally fine but deployed in impossible situations. You can't tell the difference without understanding which type of metric you're looking at.

The practical test: If someone quotes you a stat about a player, ask yourself: does this measure what the player did, or what happened while the player was playing? On-ice CF% is the latter. Individual points are the former. Both are useful information, but they answer fundamentally different questions, and treating them as interchangeable is where most bad hockey analysis starts.

Related: RAPM (the regression-based attempt to isolate individual contribution from on-ice data), Relative metrics (a cruder version of the same idea).


Regularized Adjusted Plus-Minus (RAPM)

The best available attempt to isolate what a single player contributes - and why it's still imperfect.

What it answers: Controlling for teammates, opponents, and context, how much does this player move the needle? Data source: Shift-level play-by-play with all 10 skaters identified per shift

RAPM is a regression-based method for estimating each player's individual contribution to team performance [10]. The core idea: for every shift in a season, you know which five skaters were on the ice for each team, and you know the outcome (goals for/against, shot attempts for/against, or expected goals for/against). By running a regression across thousands of shifts with every player as a variable, you can estimate each player's individual effect while statistically controlling for who they played with and against.

How it works (simplified): Imagine a giant spreadsheet where each row is a shift. The columns are every player in the league, coded as +1 (on the ice for the home team), −1 (on the ice for the away team), or 0 (not on the ice). The dependent variable is the outcome per shift (e.g., goal differential). The regression finds the coefficient for each player - their estimated marginal contribution per shift, controlling for every other player on the ice.

The "regularized" part is critical. With 700+ players and complex collinearity (linemates who almost always play together), an unregularized regression produces wild, unstable estimates. Ridge regression (the most common regularization) shrinks coefficients toward zero [11], producing more conservative but more stable estimates. The tradeoff is that RAPM is biased toward the mean - true stars are slightly underrated, true liabilities are slightly underrated in the other direction.

Why it matters: RAPM is the most principled publicly available method for isolating individual player value. It addresses the biggest limitation of on-ice metrics: teammate and opponent confounding. A player who looks great in raw CF% because they play with elite linemates will have a more modest RAPM estimate because the model attributes some of that performance to the linemates. A player who looks bad because they face top competition will get credit for the difficulty of their assignment.

Where it falls short: RAPM requires large sample sizes to produce stable estimates. A single season (~1,200 minutes for a regular player) is marginal. Two or three seasons is better. For young players with limited data, RAPM estimates are heavily shrunk toward zero and tell you very little.

The model is also sensitive to specification choices. Which outcome variable do you use - goals, shots, expected goals? How do you handle score state and manpower? What regularization strength? Different choices produce different player rankings, particularly in the middle of the distribution where signal is weakest.

Most importantly, RAPM isolates correlation, not causation. It tells you that outcomes were better when Player X was on the ice, controlling for teammates and opponents. It doesn't tell you why. A player could have a high RAPM because they're personally excellent, or because they're deployed in favorable situations that the model doesn't fully capture (zone starts, quality of competition beyond the opponent coefficients).

Related: GAR/WAR (which builds on RAPM), On-ice vs. individual metrics (the conceptual foundation), QoC/QoT (the context RAPM tries to control for).


Goals Above Replacement (GAR) / Wins Above Replacement (WAR)

The all-in-one number that tries to capture everything a player does.

What it answers: How many goals (or wins) is this player worth compared to a replacement-level player? Data source: Multiple input models combined (Evolving Hockey is the primary public source)

GAR and WAR are composite metrics that attempt to roll up a player's entire contribution - offense, defense, shooting, penalties, special teams - into a single number. WAR converts GAR into wins using the approximate relationship between marginal goals and marginal wins (~5.5 goals per win).

The best-known public implementation is Evolving Hockey's model [12], which combines: even-strength offense (RAPM-based, measuring contribution to team xGF), even-strength defense (RAPM-based, measuring contribution to team xGA), shooting talent (goals above expected from individual shots), penalties drawn minus taken, and power play and penalty kill contributions.

Each component estimates the player's value in goals above a "replacement level" baseline - the performance you'd expect from a freely available minor-league or waiver-wire player filling that roster spot. A forward with 15 GAR contributed roughly 15 more goals than a replacement-level forward would have in the same role.
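The roll-up itself is simple arithmetic. The component values below are made up, and the 5.5 goals-per-win factor is the approximate conversion mentioned above:

```python
# Sketch of the roll-up: sum component values (all in goals above replacement)
# and convert to wins. Component numbers are illustrative.
components = {
    "ev_offense": 8.0,
    "ev_defense": 3.5,
    "shooting": 2.0,
    "penalties": 1.0,
    "power_play": 2.5,
    "penalty_kill": 0.0,
}
GOALS_PER_WIN = 5.5  # approximate marginal goals per marginal win

gar = sum(components.values())
war = gar / GOALS_PER_WIN
print(round(gar, 1), round(war, 1))  # 17.0 GAR, ~3.1 WAR
```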

Why it matters: GAR/WAR provides a common currency for comparing players across positions and skill sets. A defenseman who contributes primarily through defensive suppression can be compared to a forward who contributes primarily through goal scoring, because both are expressed in the same units. This is invaluable for roster construction and contract evaluation: if Player A costs $6M and provides 12 GAR, and Player B costs $4M and provides 10 GAR, Player B is the better value per dollar.

Where it falls short: GAR/WAR inherits every limitation of its component models. The even-strength components are RAPM-derived, so they carry RAPM's sample size and stability issues. The shooting component captures finishing talent but is extremely noisy - it takes 3-5 seasons to reliably identify an above-average finisher [8]. The special teams components are measured with less precision because power play and penalty kill time are smaller samples.

The "replacement level" baseline is an estimate, not a measurement, and it varies by position and implementation. Small changes in how you define replacement level shift every player's GAR value.

Most critically, GAR/WAR presents a false precision. A player listed at 8.2 GAR versus 7.4 GAR is not meaningfully different - the confidence intervals around those estimates are wide enough to overlap substantially. But because GAR produces a single number with a decimal point, it invites comparison at a precision the model cannot support. The correct use of GAR is to identify tiers (elite, above-average, average, below-average, replacement) rather than to rank players to the tenth of a goal.

Related: RAPM (the engine under the hood for the EV components), xG (the basis for expected goals components), QoC/QoT (context that affects how you interpret the results).


Quality of Competition / Quality of Teammates (QoC / QoT)

The context behind the numbers - who are you playing with and against?

What it answers: Is this player facing tough opponents and carrying weak linemates, or coasting against soft competition with elite support? Data source: TOI-weighted teammate and opponent data from play-by-play

QoC and QoT measure the average quality of the players someone shares the ice with (teammates) and plays against (opponents), typically weighted by shared ice time.

The most common implementations use TOI-weighted average opponent or teammate CF%, xGF%, or GAR. If a forward's most common opponents are other teams' top lines, their QoC will be high. If they're primarily matched against bottom-six forwards, their QoC will be low.
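The weighting is straightforward - a sketch with illustrative numbers:

```python
# Sketch of a TOI-weighted quality-of-competition measure: average opponent
# CF%, weighted by shared ice time. Numbers are illustrative.
def toi_weighted_qoc(matchups):
    """matchups: list of (shared_toi_minutes, opponent_cf_pct)."""
    total_toi = sum(toi for toi, _ in matchups)
    return sum(toi * cf for toi, cf in matchups) / total_toi

matchups = [(120, 53.0), (80, 50.0), (40, 47.0)]  # heavy minutes against a top line
print(round(toi_weighted_qoc(matchups), 1))  # 51.0
```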

Why it matters: QoC and QoT are context metrics. They don't measure player quality directly - they measure the environment in which a player's results were produced. A player with strong on-ice results despite facing elite competition is more impressive than a player with the same results against weaker opponents. A player with weak results who plays exclusively with poor linemates deserves more slack than a player who struggles despite elite support.

This is particularly important in the NHL, where coaches exercise significant control over matchups. A first-line center deployed primarily against opposing top lines has a very different job than a third-line center sheltered from tough matchups. Ignoring this context when evaluating their numbers is comparing different jobs and calling it a talent comparison.

Where it falls short: QoC effects are smaller than most people think. Research has consistently shown that the spread of QoC across players within a team is relatively narrow [13] - even "tough" matchups involve opponents who are, at most, slightly above average, while "easy" matchups involve opponents who are slightly below average. The difference in expected outcomes between facing a team's first line versus their third line is smaller than the narrative suggests, because talent differences between NHL lines (within the same team) are modest in a league with a salary cap and parity.

QoT effects are larger and more important. Who you play with matters more than who you play against, because you share the ice with your linemates for entire shifts, while opponents rotate through. But QoT is also hard to use cleanly because of the chicken-and-egg problem: a player with high QoT might look good because of their teammates, or their teammates might have high QoT partly because of them.

The measurement of QoC and QoT is also imprecise. Using CF% or xGF% to represent opponent or teammate quality introduces circular reasoning - you're using a team-influenced metric to characterize the very context you're trying to control for.

Related: RAPM (which statistically controls for QoC and QoT rather than measuring them separately), Relative metrics (a cruder attempt at context adjustment).


Tier 4 - Goaltending

How do we measure the last line?

Goaltender evaluation is the hardest problem in hockey analytics. The sample sizes are smaller (one goaltender per team, ~1,500–2,000 shots per season), the variance is higher (goaltending performance fluctuates dramatically), and the confounding is severe (a goaltender's numbers are heavily influenced by the defense in front of them). These metrics represent the best available tools, but they're all noisier than their equivalents for skaters.


Save Percentage

The most familiar goaltending stat - and one of the most misleading.

What it answers: What fraction of shots on goal did the goaltender stop? Data source: Standard box score (any source)

Save percentage (SV%) is simple: saves divided by shots on goal. A goaltender who faces 30 shots and stops 27 has a .900 SV%. One who stops 28 has a .933 SV%.

Why it's used: SV% is universally available, instantly understood, and tracks the most important thing a goaltender does - stop pucks. Over a full career, SV% does separate elite goaltenders from average ones. The all-time greats tend to have career SV% above .920 in the modern era; below .905 is typically replacement level.

Where it falls short: SV% treats all shots as equal, which is its fatal flaw. A goaltender facing 35 shots from the perimeter will post a higher SV% than one facing 25 shots from the slot, even if the second goaltender is actually better. Since teams differ dramatically in the volume and quality of shots they allow, SV% is heavily contaminated by team defense.

The problem is severe enough that SV% is one of the least stable goaltending metrics from season to season. The year-over-year correlation for starting goaltenders' SV% is surprisingly low [14] - much lower than you'd expect for a metric measuring individual skill. A significant fraction of the variance in SV% is driven by shot quality, which is a team defense property, not a goaltender property.

This doesn't mean SV% is useless. Over very large samples (multiple seasons, thousands of shots), the team defense effects average out somewhat and true skill differences emerge. But in small samples - a single season, a playoff series - SV% is as much a report card on the defense as it is on the goaltender.

Related: GSAx (expected goals saved above expected - the quality-adjusted version), HDSV% (save percentage on high-danger shots only).


Goals Saved Above Expected (GSAx)

Expected goals applied to goaltenders - the best available quality-adjusted measure.

What it answers: How many goals did this goaltender save (or allow) relative to what an average goaltender would have done facing the same shots? Data source: Shot-level xG data (MoneyPuck, Evolving Hockey)

GSAx compares the goals a goaltender actually allows to the expected goals generated by the shots they faced. If a goaltender faces shots totaling 150 xG over a season and allows only 130 goals, their GSAx is +20 - they saved 20 more goals than an average goaltender would have on the same workload. If they allowed 160 goals, their GSAx is −10.

How it works: For each shot the goaltender faces, the xG model assigns a probability. Sum those probabilities and you get the total expected goals against - what an average NHL goaltender would be expected to allow facing those exact shots. Subtract the actual goals allowed from the expected goals, and you get GSAx.
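In code, the whole calculation is one subtraction; the workload below is illustrative and mirrors the example above:

```python
# GSAx sketch: sum the xG on every shot the goaltender faced, then subtract
# actual goals allowed.
def gsax(shot_xg_values, goals_allowed):
    return sum(shot_xg_values) - goals_allowed

shots_faced = [0.06] * 2500   # illustrative workload: 2,500 unblocked attempts averaging 0.06 xG
print(round(gsax(shots_faced, 130), 1))  # 150 xG faced, 130 allowed -> +20.0
```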

Why it matters: GSAx is the best publicly available metric for goaltender evaluation because it controls for shot quality. A goaltender who faces predominantly high-danger chances will have a lower SV% than one facing point shots, even if they're equally skilled. GSAx accounts for this by judging each goaltender against the quality of their specific workload.

This makes GSAx far more useful than SV% for comparing goaltenders across teams. A goaltender with a .915 SV% behind a porous defense might have a higher GSAx than one with a .925 SV% behind an elite defensive corps, because the first goaltender was facing harder shots.

Where it falls short: GSAx inherits the limitations of the xG model it's built on. If the xG model doesn't account for screens, deflections, or goaltender positioning, then the "expected" baseline is imperfect, and some of what GSAx attributes to the goaltender is actually shot context the model missed.

GSAx is also noisy. Goaltenders face fewer shots than skaters take, and the goal/no-goal outcome on each shot is high-variance. A single season of GSAx has wide confidence intervals. Goaltender performance legitimately fluctuates more than skater performance year-over-year, and GSAx reflects this. It takes 2-3 seasons of data to form stable estimates.

The "average goaltender" baseline also matters. If the model's baseline goaltender is calibrated incorrectly - say, it overestimates the average save rate on high-danger chances - then GSAx will systematically mis-rate goaltenders who face lots of high-danger shots. Different xG models produce different GSAx rankings for the same goaltenders.

Related: Save Percentage (the raw version), xG (the model underlying GSAx), High-Danger Save % (a simpler quality-filtered approach).


High-Danger Save Percentage (HDSV%)

Save percentage on the shots that matter most.

What it answers: How well does this goaltender stop shots from the inner slot? Data source: Shot-level data with location classification (Natural Stat Trick)

HDSV% filters save percentage to include only shots from the high-danger zone - the inner slot area where most goals are scored. Instead of measuring performance across all shots (many of which are low-probability), HDSV% focuses on the shots where the goaltender's skill matters most.

Why it matters: The argument for HDSV% is that goaltenders are essentially interchangeable on low-danger shots. A screened one-timer from the slot tests a goaltender's positioning, reflexes, and read; a wrist shot from the point does not. By isolating the high-danger shots, HDSV% should provide a cleaner signal of actual goaltending skill.

HDSV% is also more intuitive than GSAx for a non-technical audience. A coach or scout can understand "he stops 83% of high-danger chances" more easily than "his GSAx is +12.4."

Where it falls short: HDSV% shares the zone-classification limitations of scoring chances - all shots from the high-danger zone are treated equally, even though a cross-crease one-timer and an unscreened wrister from the same spot are very different challenges.

The sample size problem is more severe than for regular SV% because you're filtering to a subset of shots. A starting goaltender might face 500–700 high-danger shots per season. At that sample size, random variation is substantial - a few lucky bounces or unlucky deflections can swing HDSV% by several points.

Research has found that HDSV% is only modestly more stable year-over-year than overall SV% [14, 15]. It's better, but not dramatically better. Goaltending skill on high-danger shots is real, but the metric is noisy enough that single-season HDSV% should be interpreted cautiously.

Related: GSAx (the model-based alternative), Save Percentage (the unfiltered version), Scoring Chances (the zone definitions used to classify "high-danger").


Tier 5 - Special Teams and Situations

Where does context matter?

Special teams (power play and penalty kill) and situational metrics address specific game contexts that play by different rules than 5v5 hockey. These metrics are more volatile and noisier than their even-strength counterparts, but they matter because special teams can be decisive - a power play that converts at 25% versus 15% is a massive competitive advantage.


Power Play and Penalty Kill Metrics

Measuring performance in hockey's most structured situations.

What it answers: How effective is this team (or player) with a man advantage or disadvantage? Data source: Standard play-by-play with manpower state tags

PP% (Power Play Percentage): Goals scored on the power play divided by total power play opportunities. League average is typically 20–22%. PK% (Penalty Kill Percentage): 1 minus (power play goals allowed / times shorthanded). League average mirrors PP% at 78–80%.
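Both formulas in a quick sketch, with made-up season totals:

```python
# Illustrative PP% and PK% from season totals.
pp_goals, pp_opportunities = 45, 210
pp_goals_allowed, times_shorthanded = 38, 200

pp_pct = 100 * pp_goals / pp_opportunities
pk_pct = 100 * (1 - pp_goals_allowed / times_shorthanded)
print(round(pp_pct, 1), round(pk_pct, 1))  # 21.4 81.0
```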

At the player level, PP points/60 and PK metrics (CA/60, xGA/60 while shorthanded) measure individual contributions in these contexts.

Why it matters: Special teams are high-leverage situations. A team that converts 5% above league average on the power play and kills 5% above average on the penalty kill gains roughly 15-20 goals per season from special teams alone - enough to be the difference between a playoff team and a lottery team. Special teams success is also more coachable than even-strength play, making it a legitimate evaluation of coaching effectiveness.

Where it falls short: Special teams metrics are extremely volatile. PP% is heavily influenced by shooting percentage, which is high-variance in small samples. A team can have an elite power play setup - good entries, clean formations, dangerous shot generation - and still convert at 15% for a month because the puck isn't going in. Conversely, a mediocre power play can ride a hot streak to 30%+ for a stretch.

The sample sizes are small. A team gets roughly 200–250 power play minutes per season, compared to ~3,000+ even-strength minutes. This makes power play metrics much noisier and less predictive than even-strength metrics. PP xGF/60 (expected goals generated per 60 minutes on the power play) is more stable than PP% because it removes the shooting percentage variance, but even it is limited by sample size.

For individual players, power play deployment is heavily concentrated - the same 5–7 players get the majority of PP time. This means the players with the largest PP samples are a self-selected group (the best offensive players), making it difficult to evaluate power play "skill" separately from general offensive skill.

Related: Expected Goals (quality-adjusting special teams shot data), Score Effects (how game state interacts with special teams usage).


Score Effects

Why the score changes how teams play - and why it matters for evaluation.

What it answers: How does leading, trailing, or being tied change a team's behavior and statistics? Data source: Play-by-play with game state tags

Score effects describe the systematic changes in team behavior based on the score. They are not a "metric" per se but a phenomenon that affects the interpretation of every other metric.

The pattern is well-documented and consistent [4]: Leading teams reduce aggression, protect structure, and allow more shot attempts against. Their CF% drops. This is strategic, not a sign of collapse. Trailing teams push forward, take more risks, and generate more shot attempts. Their CF% rises. This is desperation, not a sign of dominance. Tied score is the neutral state where both teams play their normal game. Most analysts consider tied-score play the best reflection of true team quality.

The magnitude of score effects is substantial. A team with a two-goal lead in the third period will see their CF% drop by 5-8 percentage points compared to tied play. This means a team that builds lots of leads (i.e., a good team) will have a systematically depressed raw CF% compared to their true quality.

Why it matters: If you ignore score effects, you will systematically misjudge teams. Good teams that frequently lead look worse in raw CF% than they actually are. Bad teams that frequently trail look better than they are. Score-adjusted metrics (see Tier 1) correct for this, and any serious analysis should use them.

Where it falls short: Score adjustments assume uniform behavioral responses to game state. In reality, coaching philosophy varies. Some teams maintain an aggressive posture with a lead; others turtle. The adjustment applies a league-average correction that may overcorrect for some teams and undercorrect for others.

Related: Score and Venue Adjustments (the Tier 1 corrections that account for this phenomenon).


Zone Entries and Exits

Tracking how teams move the puck through the neutral zone - the most labor-intensive metric in public analytics.

What it answers: How effectively does a team or player transition through the neutral zone? Data source: Manual tracking (typically by analysts charting games from video)

Zone entries measure how a team brings the puck into the offensive zone. The three categories are: carry-ins (a player skates the puck across the blue line with possession), dump-ins (the puck is shot into the zone without possession), and passes (the puck is passed across the blue line to a teammate). Zone exits work the same way in the defensive zone.

The consistent research finding is that carry-in entries produce significantly more shot attempts and expected goals per entry than dump-ins [16]. A team that enters the zone with possession 55% of the time will generate more offense than a team that dumps and chases 70% of the time, even if the second team enters the zone more often.
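A sketch of how tracked entries get summarized - shots generated per entry, grouped by entry type, with illustrative rows:

```python
# Sketch: shot attempts generated per entry, grouped by entry type.
# Entry records stand in for manually tracked rows.
from collections import defaultdict

entries = [
    {"type": "carry", "shots_after": 2}, {"type": "carry", "shots_after": 1},
    {"type": "dump", "shots_after": 0}, {"type": "dump", "shots_after": 1},
    {"type": "carry", "shots_after": 0}, {"type": "dump", "shots_after": 1},
]

totals = defaultdict(lambda: [0, 0])  # type -> [entry count, shots generated]
for e in entries:
    totals[e["type"]][0] += 1
    totals[e["type"]][1] += e["shots_after"]

for entry_type, (n, shots) in totals.items():
    print(entry_type, round(shots / n, 2))  # shots generated per entry
```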

Why it matters: Zone entries capture an aspect of play - neutral zone transition - that standard shot-based metrics miss entirely. CF% tells you the result (who had more shot attempts) but not the process (how the puck got there). A team can have a mediocre CF% despite excellent zone entries if their in-zone execution is poor, or vice versa. Zone entry data adds a layer of process information that helps diagnose why a team's results look the way they do.

At the player level, zone entry and exit rates are among the best available measures of individual transition ability. A forward who carries the puck into the zone at a high rate is providing real value that doesn't show up in shot or goal metrics.

Where it falls short: Zone entries are typically tracked manually, which means they are labor-intensive, inconsistent across trackers, and available for only a subset of games. The NHL's tracking data could eventually automate this, but public zone entry data remains spotty and tracker-dependent.

The metric is also incomplete on its own. Knowing that a player carries the puck in doesn't tell you what happens next. A carry-in that leads to a cycle and three shots is more valuable than one that leads to an immediate turnover, but the entry itself is counted the same way.

Related: CF% (the outcome that zone entries help explain), Expected Goals (which captures what happens after the entry).


Tier 6 - Team-Level Models

How does it all add up?

These metrics zoom out from individual events to estimate team strength, predict standings, and model playoff outcomes. They're less about measuring what happened and more about modeling what should have happened - or what's likely to happen next.


Pythagorean Expectation

Estimating wins from goal differential - borrowed from baseball.

What it answers: Based on how many goals a team scored and allowed, how many games should they have won? Data source: Seasonal goal totals (any source)

Pythagorean expectation estimates a team's "expected" winning percentage using only goals for and goals against:

Win% = GF² / (GF² + GA²)

The exponent (2 in the original formula; closer to 2.05-2.15 in hockey-specific calibrations) can be tuned to fit observed results. The idea comes from Bill James's baseball work [17] and was adapted for hockey [18].

Why it matters: Pythagorean expectation identifies teams whose actual record deviates from their goal-scoring profile. A team with 240 goals for and 210 goals against "should" win about 56% of their games. If they actually won 60%, they may have been lucky in close games (strong in overtime, good in one-goal games). If they won only 50%, they may have been unlucky.
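The formula is easy to check directly; here's a quick sketch using the example above, with the exponent left as a tunable parameter:

```python
# Pythagorean expectation with a tunable exponent (2.0 in the original
# baseball form; hockey calibrations land slightly higher).
def pythagorean_win_pct(gf, ga, exponent=2.0):
    return gf**exponent / (gf**exponent + ga**exponent)

print(round(pythagorean_win_pct(240, 210), 3))        # 0.566
print(round(pythagorean_win_pct(240, 210, 2.1), 3))   # slightly higher with a hockey-range exponent
```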

This is useful for prediction. Teams that significantly exceed their Pythagorean expectation tend to regress the following season [18] - they won more close games than expected, and that tends to normalize. Teams that underperformed their Pythagorean expectation often improve.

Where it falls short: The model is deliberately simple. It uses only goals and ignores shot quality, special teams efficiency, goaltending, roster changes, and every other factor that might explain why a team wins or loses. It's a baseline, not a complete model.

It also assumes all goals are equal, which they're not. A team that wins a lot of 5-4 games has a different profile than one that wins 2-1 games. The Pythagorean formula doesn't distinguish between these - it only sees totals.

Related: Standings points models (a more sophisticated version of the same idea), Expected Goals (the quality-adjusted input that could replace raw goals in the formula).


Standings Points Models

Why actual standings points are noisy and how to smooth them.

What it answers: How good is this team, really, once you strip away the noise in the standings? Data source: Game-level results combined with underlying performance metrics

Standings points models estimate a team's "true" quality by combining results (wins, losses, overtime losses) with process metrics (CF%, xGF%, goal differential) and adjusting for schedule strength, injuries, and random variance.

The premise is that the NHL standings are noisy. The three-point game problem (games that go to overtime produce three total standings points instead of two) distorts the standings. Overtime and shootout results are high-variance. Teams that are strong in one-goal games may be genuinely clutch or may simply be lucky. Standings points models try to see through this noise to estimate the team's underlying quality.

Why it matters: These models are the best available tool for answering "is this team actually good?" - especially during the season when sample sizes are still growing. A team that's 15-10-5 might look mediocre in the standings but have elite underlying numbers (high xGF%, strong CF%, dominant high-danger chance share). A standings points model would flag this team as better than their record suggests - a likely regression upward. Conversely, a team riding unsustainable shooting percentage or goaltending might have a glossy record that the model sees through.

Where it falls short: These models are only as good as the metrics they use as inputs. If the underlying process metrics are wrong (or if they miss an important dimension of team quality), the model's adjustments will be wrong too. And "standings points model says Team X is better than their record" is a probabilistic statement, not a guarantee - sometimes teams really are as bad as they look, even if the underlying numbers disagree.

Related: Pythagorean Expectation (the simplest version of this idea), Expected Goals (a key input to most modern models).


Playoff Probability Models

What goes into projecting postseason outcomes.

What it answers: What are the odds this team makes the playoffs - and how far can they go? Data source: Current standings, remaining schedule, team strength estimates, and simulation

Playoff probability models simulate the remainder of a season (or a playoff bracket) thousands of times, using team strength estimates and the remaining schedule to project outcomes. Each simulation randomly assigns game results weighted by the estimated probability of each team winning, then tracks who makes the playoffs, wins each round, and wins the championship. The final output is a probability for each outcome: "Team X has a 73% chance of making the playoffs, a 12% chance of reaching the conference final, and a 4% chance of winning the Cup."
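Here's a toy version of that simulation loop. The team strengths, remaining schedule, overtime-loss rate, and points cutoff are all illustrative assumptions, not any published model's inputs:

```python
# Toy Monte Carlo sketch of a playoff probability model: simulate the
# remaining schedule many times using assumed win probabilities.
import random

def playoff_prob(current_pts, schedule, win_prob, points_cutoff, n_sims=10_000):
    """schedule: list of remaining opponents; win_prob: P(win) vs each opponent."""
    hits = 0
    for _ in range(n_sims):
        pts = current_pts
        for opp in schedule:
            if random.random() < win_prob[opp]:
                pts += 2                      # win
            elif random.random() < 0.25:
                pts += 1                      # assumed overtime/shootout loss rate
        hits += pts >= points_cutoff          # crude stand-in for clinching a spot
    return hits / n_sims

random.seed(1)
win_prob = {"BOS": 0.45, "OTT": 0.60, "MTL": 0.65}
schedule = ["BOS", "OTT", "MTL"] * 6          # 18 games remaining
print(playoff_prob(82, schedule, win_prob, points_cutoff=96))
```

A real model replaces the fixed points cutoff with a full league simulation (every team's remaining games), and replaces the hand-set win probabilities with team strength estimates like the ones described in the previous entry.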

The best public models (MoneyPuck, Dom Luszczyszyn's model at The Athletic, Evolving Hockey) use different inputs - some emphasize underlying metrics like xGF%, others lean more on actual results, and some incorporate roster-level projections - but the simulation framework is similar.

Why it matters: Playoff probability models are useful for answering "how alive is this team?" during the season and "who's favored?" entering the playoffs. They're also a tool for evaluating front office decisions: if a team trades a first-round pick for a rental player, how much does that actually move their Cup probability? If the answer is "from 5% to 7%," that informs whether the trade was worth it.

Where it falls short: All playoff probability models are sensitive to their input assumptions. If the model overestimates a team's quality (or a goaltender's performance), the projections will be systematically optimistic. Injuries, which are inherently unpredictable, can invalidate projections overnight.

The models also tend to underestimate variance in the playoffs. The NHL postseason is a best-of-seven format where goaltending can swing any series. Models that focus on expected goals and shot metrics may underweight the impact of a hot goaltender - a known source of model error in playoff prediction [20].

Finally, these models are better at estimating ranges than point predictions. "Team X has a 60% chance of winning this series" is meaningful. "Team X will win in 6 games" is not something the model is designed to tell you.

Related: Pythagorean Expectation (a simpler win estimation method), Standings Points Models (the team strength estimates that feed into playoff projections).


This is a living document. New sports, new entries, and updates to existing ones will be added as Beyond the Metric expands. If something is missing or wrong, reach out at info@beyondthemetric.ca.

A note on sources: The metric definitions here are drawn from the public analytics communities in each sport, including the work of the people who originally developed most of them. Where there are meaningful disagreements about methodology or interpretation, I've tried to note them. This is a reference, not an argument.

References

[1] Barnes, T. (Vic Ferrari). (2007-2009). Timeonice.com. The original public source for on-ice shot attempt data, instrumental in popularizing Corsi and Fenwick metrics in the hockey analytics blogging community.

[2] Vollman, R. (2016). Stat Shot: The Ultimate Guide to Hockey Analytics. ECW Press. Comprehensive reference covering the predictive validity of shot attempt metrics and their relationship to winning.

[3] Vollman, R. (2014). Hockey Abstract. Self-published. Includes systematic analysis of rink bias / venue effects and the high correlation between Corsi and Fenwick variants.

[4] Tulsky, E. (2013). Score effects and how to adjust for them. Broad Street Hockey / SB Nation. One of several systematic treatments of score-state bias in shot metrics.

[5] Macdonald, B. (2012). An expected goals model for evaluating NHL teams and players. Proceedings of the MIT Sloan Sports Analytics Conference. Early formal expected goals model using shot location and type features in a logistic regression framework.

[6] MoneyPuck.com. Expected goals model documentation. The most widely used public xG model, incorporating pre-shot movement, passing sequences, rebound and rush indicators, and shot type/location features using gradient-boosted methods.

[7] Corsica Hockey (Perry, E.). (2015-2018). Among the first public analytics sites to demonstrate that xG-based metrics outperform shot volume metrics in predicting future goal differential and team success.

[8] Luce, T., & Fischer, B. (2017). Stabilization rates for on-ice and individual shooting metrics. Analysis demonstrating that individual shooting percentage (finishing talent) requires 3-5 seasons of data to reliably separate signal from noise, and that the between-player variance in finishing skill is modest relative to shot generation.

[9] Ryder, A. (2004). Shot quality. Hockey Analytics. Early systematic analysis of shot danger based on location, type, and pre-shot context, including the elevated goal probability associated with lateral passes across the slot.

[10] Macdonald, B. (2011). A regression-based adjusted plus-minus statistic for NHL players. Journal of Quantitative Analysis in Sports, 7(3). The foundational paper adapting adjusted plus-minus regression methods to hockey, establishing the framework used by subsequent public RAPM models.

[11] Sill, J. (2010). Improved NBA adjusted +/- using regularization and out-of-sample testing. Proceedings of the MIT Sloan Sports Analytics Conference. Demonstrated that ridge regression (L2 regularization) substantially improves the stability and out-of-sample predictive accuracy of adjusted plus-minus estimates. The approach was subsequently adopted by hockey RAPM implementations.

[12] Younggren, J., Younggren, E., & Shuckers, M. (2019-present). Evolving Hockey (evolving-hockey.com). The primary public source for Goals Above Replacement (GAR) and Wins Above Replacement (WAR) in hockey, combining RAPM-based even-strength components with shooting, penalty, and special teams contributions.

[13] Tulsky, E. (2013). Is quality of competition really that important? Broad Street Hockey / SB Nation. Demonstrated that the spread of quality of competition faced by players within a team is narrower than commonly assumed, and that quality of teammates has a larger measurable effect on player results than quality of opponents.

[14] Vollman, R. (2016). Stat Shot. Includes analysis of year-over-year goaltender save percentage stability, documenting the low persistence of SV% relative to what would be expected if the metric primarily reflected individual skill.

[15] Thomas, A.C. (2006). The impact of puck possession and location on ice hockey strategy. Journal of Quantitative Analysis in Sports, 2(1). Early quantitative analysis of shot location value and its implications for goaltender evaluation, contributing to the framework that would underpin quality-adjusted goaltending metrics.

[16] Tulsky, E., Sturt, G., Smyth, D., & Zona, J. (2013). Zone entries, shot generation, and the impact of individual players. Proceedings of the MIT Sloan Sports Analytics Conference. Demonstrated that controlled zone entries (carry-ins) produce approximately twice the shot attempts and expected goals per entry compared to dump-ins, establishing zone entry tracking as a meaningful individual player evaluation tool.

[17] James, B. (1983). The Bill James Baseball Abstract. Self-published. Introduced the Pythagorean expectation formula for estimating team win percentage from runs scored and allowed, later adapted across multiple professional sports.

[18] Dayaratna, K., & Miller, S. (2013). The Pythagorean won-loss formula and hockey: A statistical justification for using the classic baseball formula as an evaluative tool in hockey. The Hockey Research Journal, 1(1). Formal statistical validation of the Pythagorean expectation applied to hockey, calibrating the optimal exponent and documenting the regression-to-the-mean pattern for teams that exceed or fall short of their expected win total.

[19] Natural Stat Trick (naturalstattrick.com). The primary public source for scoring chance classifications in hockey, using a zone-based system that categorizes shots into low-, medium-, and high-danger tiers based on shot location.

[20] Sznajder, M. (2019). Hockey Analytics: The Definitive Guide. Independently published. Includes analysis of goaltending variance in playoff contexts and the limitations of shot-based models in predicting short-series outcomes.