About Skills Projects Resume Contact
← Back to Projects
02 — Sports Analytics

NHL Sports Analytics
& Modeling

R / RStudio ggplot2 GAM Modeling nhlapi Web Scraping Final: 100% Exam: 98%

Statistical modeling and visualization applied to NHL data — goal probability heatmaps via GAM, cross-era Avalanche championship comparisons using z-score normalization, and Corsi % player trajectory curves for McDavid vs. MacKinnon.

Final Project — Goal Probability Heatmap

Scraped 2021 Colorado Avalanche season data via nhlapi · Grade: 100%

All shot events for the 2021 Avalanche season were pulled from the NHL API — including shots, missed shots, blocked shots, and goals. These were fed into a Generalized Additive Model (GAM) with a binomial family to predict goal probability by ice coordinate. The resulting probability surface was overlaid on a rink diagram using the sportyR package.

## Pull all Avalanche game IDs for 2021
games    <- nhlapi::nhl_schedule_seasons(seasons = 2021, teamIds = 21)[[1]]$dates
game_ids <- dplyr::bind_rows(games$games) %>% dplyr::pull(gamePk)
all_game_feeds <- lapply(game_ids, nhlapi::nhl_games_feed)

## Flatten plays, filter to shot events
plays <- purrr::map_df(all_game_feeds, function(x) {
    return(x[[1]][["liveData"]][["plays"]][["allPlays"]])
}) %>%
  dplyr::filter(result.event %in% c("Shot", "Missed Shot", "Blocked Shot", "Goal")) %>%
  dplyr::mutate(goal         = ifelse(result.event == "Goal", T, F),
                x.coordinate = -abs(coordinates.x))   # mirror to offensive zone

## Fit GAM (binomial — predicts goal probability)
goals_gam <- gam(goal ~ s(x.coordinate) + s(coordinates.y),
                 family = binomial, data = plays)

## Build prediction surface and plot on rink
y <- seq(-42, 42, length = 50)
x <- seq(-98,  0, length = 50)
plays_predict_data <- expand.grid(x.coordinate = x, coordinates.y = y) %>%
  mutate(goal_prob = predict(goals_gam, ., type = "response"))

p <- geom_hockey('nhl', display_range = "offense")
p + geom_tile(data  = plays_predict_data,
              aes(x = x.coordinate, y = coordinates.y, fill = goal_prob), alpha = .5) +
  scale_fill_distiller("prob", palette = "Spectral", direction = -1, limit = c(0, .125)) +
  coord_fixed() + labs(title = "Goal Probability Heatmap — Colorado Avalanche 2021")

Final Project — Cross-Era Championship Comparison

Z-score normalization to compare the 1995, 2000, and 2021 Avalanche against their respective league averages

To fairly compare championship-caliber Avalanche teams across different eras, each season's stats were normalized against that year's league average using z-scores. This controls for how the pace and style of hockey has evolved over 25+ years.

SeasonPoints (z-score)Goals/Game (z-score)GA/Game (z-score)
1995+1.16+1.83−0.45
2000+1.81+1.40−1.13
2021+1.38+1.51−0.70

The 2000 team showed the strongest point differential (+1.81σ above league average) while being the most defensively dominant (−1.13σ in goals-against). The 2021 team sat between both historic champions on all three dimensions — a strong indication of championship caliber even before the playoffs.

Final Exam — Multi-Sport Analysis

NBA, NFL, and NHL datasets · Grade: 98%

🏀 NBA — Offensive Rating by Team

Computed average offensive rating for each NBA team using grouped summaries on a professor-provided dataset from NBA.com/stats.

nba_stats <- readRDS("nba_stats.rds")
average_team_offrating <- nba_stats %>%
  dplyr::group_by(team_name) %>%
  dplyr::mutate(off_rating = as.numeric(off_rating)) %>%
  dplyr::summarize(mean(off_rating))

🏈 NFL — Starting Field Position After Kickoff

Isolated all plays immediately following a kickoff, then built a density plot to identify the modal starting field position for NFL drives.

kickoffs   <- which(nfl_stats$play_type == "kickoff")
kickoff_df <- nfl_stats[kickoffs + 1, ]
ggplot(data = kickoff_df, aes(x = yardline_100)) + geom_density()

🏒 NHL — McDavid vs. MacKinnon Corsi % Trajectories

Built a custom scraper to pull advanced stats from Hockey Reference for both players, then modeled Corsi % as a smooth function of age — enabling a direct generational comparison of defensive zone possession dominance.

## Custom scraper (scrapes Hockey Reference career advanced stats)
scrape_nhl_player_stats <- function(player_link, type = "basic", career = FALSE) { ... }

MacKinnon_stats <- scrape_nhl_player_stats(MacKinnon_url, type = "advanced")
McDavid_stats   <- scrape_nhl_player_stats(McDavid_url,   type = "advanced")
df_advanced     <- bind_rows(MacKinnon_stats, McDavid_stats)

## Plot Corsi % trajectory by age for each player
ggplot(data = df_advanced,
       aes(x = as.numeric(age), y = as.numeric(cf_percent), col = team)) +
  geom_smooth() +
  scale_color_brewer("", palette = "Set1") +
  labs(x = "Age", y = "Corsi Percentage")

Key Findings

🔥

High-Danger Zones

The GAM heatmap revealed that slot shots and sharp-angle plays close to the crease show dramatically higher goal probability than perimeter attempts.

📅

2000 Era Dominance

Z-score normalization showed the 2000 Avalanche were the most dominant relative to their era despite lower raw numbers — they outpaced peers by +1.81σ in points.

📈

Generational Comparison

Corsi % curves allowed a direct comparison between two of the greatest active players despite different systems, linemates, and seasons.

Technology Stack

R 4.x RStudio ggplot2 dplyr purrr mgcv (GAM) nhlapi sportyR rvest xml2 plotly janitor

Results

Final Project: 100%  ·  Final Exam: 98% — Evaluated across NBA, NFL, and NHL datasets demonstrating breadth across web scraping, statistical modeling, GAM, and multi-sport visualization.

Interested in this project or want to connect?

[email protected] LinkedIn ← All Projects