Statistical modeling and visualization applied to NHL data — goal probability heatmaps via GAM, cross-era Avalanche championship comparisons using z-score normalization, and Corsi % player trajectory curves for McDavid vs. MacKinnon.
Scraped 2021 Colorado Avalanche season data via nhlapi · Grade: 100%
All shot events for the 2021 Avalanche season were pulled from the NHL API — including shots, missed shots, blocked shots, and goals. These were fed into a Generalized Additive Model (GAM) with a binomial family to predict goal probability by ice coordinate. The resulting probability surface was overlaid on a rink diagram using the sportyR package.
## Pull all Avalanche game IDs for 2021
games <- nhlapi::nhl_schedule_seasons(seasons = 2021, teamIds = 21)[[1]]$dates
game_ids <- dplyr::bind_rows(games$games) %>% dplyr::pull(gamePk)
all_game_feeds <- lapply(game_ids, nhlapi::nhl_games_feed)
## Flatten plays, filter to shot events
plays <- purrr::map_df(all_game_feeds, function(x) {
return(x[[1]][["liveData"]][["plays"]][["allPlays"]])
}) %>%
dplyr::filter(result.event %in% c("Shot", "Missed Shot", "Blocked Shot", "Goal")) %>%
dplyr::mutate(goal = ifelse(result.event == "Goal", T, F),
x.coordinate = -abs(coordinates.x)) # mirror to offensive zone
## Fit GAM (binomial — predicts goal probability)
goals_gam <- gam(goal ~ s(x.coordinate) + s(coordinates.y),
family = binomial, data = plays)
## Build prediction surface and plot on rink
y <- seq(-42, 42, length = 50)
x <- seq(-98, 0, length = 50)
plays_predict_data <- expand.grid(x.coordinate = x, coordinates.y = y) %>%
mutate(goal_prob = predict(goals_gam, ., type = "response"))
p <- geom_hockey('nhl', display_range = "offense")
p + geom_tile(data = plays_predict_data,
aes(x = x.coordinate, y = coordinates.y, fill = goal_prob), alpha = .5) +
scale_fill_distiller("prob", palette = "Spectral", direction = -1, limit = c(0, .125)) +
coord_fixed() + labs(title = "Goal Probability Heatmap — Colorado Avalanche 2021")
Z-score normalization to compare the 1995, 2000, and 2021 Avalanche against their respective league averages
To fairly compare championship-caliber Avalanche teams across different eras, each season's stats were normalized against that year's league average using z-scores. This controls for how the pace and style of hockey has evolved over 25+ years.
| Season | Points (z-score) | Goals/Game (z-score) | GA/Game (z-score) |
|---|---|---|---|
| 1995 | +1.16 | +1.83 | −0.45 |
| 2000 | +1.81 | +1.40 | −1.13 |
| 2021 | +1.38 | +1.51 | −0.70 |
The 2000 team showed the strongest point differential (+1.81σ above league average) while being the most defensively dominant (−1.13σ in goals-against). The 2021 team sat between both historic champions on all three dimensions — a strong indication of championship caliber even before the playoffs.
NBA, NFL, and NHL datasets · Grade: 98%
🏀 NBA — Offensive Rating by Team
Computed average offensive rating for each NBA team using grouped summaries on a professor-provided dataset from NBA.com/stats.
nba_stats <- readRDS("nba_stats.rds")
average_team_offrating <- nba_stats %>%
dplyr::group_by(team_name) %>%
dplyr::mutate(off_rating = as.numeric(off_rating)) %>%
dplyr::summarize(mean(off_rating))
🏈 NFL — Starting Field Position After Kickoff
Isolated all plays immediately following a kickoff, then built a density plot to identify the modal starting field position for NFL drives.
kickoffs <- which(nfl_stats$play_type == "kickoff")
kickoff_df <- nfl_stats[kickoffs + 1, ]
ggplot(data = kickoff_df, aes(x = yardline_100)) + geom_density()
🏒 NHL — McDavid vs. MacKinnon Corsi % Trajectories
Built a custom scraper to pull advanced stats from Hockey Reference for both players, then modeled Corsi % as a smooth function of age — enabling a direct generational comparison of defensive zone possession dominance.
## Custom scraper (scrapes Hockey Reference career advanced stats)
scrape_nhl_player_stats <- function(player_link, type = "basic", career = FALSE) { ... }
MacKinnon_stats <- scrape_nhl_player_stats(MacKinnon_url, type = "advanced")
McDavid_stats <- scrape_nhl_player_stats(McDavid_url, type = "advanced")
df_advanced <- bind_rows(MacKinnon_stats, McDavid_stats)
## Plot Corsi % trajectory by age for each player
ggplot(data = df_advanced,
aes(x = as.numeric(age), y = as.numeric(cf_percent), col = team)) +
geom_smooth() +
scale_color_brewer("", palette = "Set1") +
labs(x = "Age", y = "Corsi Percentage")
The GAM heatmap revealed that slot shots and sharp-angle plays close to the crease show dramatically higher goal probability than perimeter attempts.
Z-score normalization showed the 2000 Avalanche were the most dominant relative to their era despite lower raw numbers — they outpaced peers by +1.81σ in points.
Corsi % curves allowed a direct comparison between two of the greatest active players despite different systems, linemates, and seasons.
Final Project: 100% · Final Exam: 98% — Evaluated across NBA, NFL, and NHL datasets demonstrating breadth across web scraping, statistical modeling, GAM, and multi-sport visualization.