Session 1: Korey Stringer organized session

Cumulative Impact Load, Injury and Cartilage Metabolism in Collegiate Women’s Basketball Athletes

Julie Burland, University of Connecticut

Abstract: Purpose: Impact loading alterations have been associated with cartilage metabolism changes before and after joint injury; however, the relationship and variation in these metrics over an athletic season are unclear. We aimed to establish whether cumulative impact load and serum biomarkers are related to lower extremity injury and evaluate any impact load and cartilage biomarker relationships. Methods: Eleven collegiate women’s basketball athletes (height: 1.86 meters, mass: 82.0 kilograms, age: 20.54 years) participated in a prospective case-series evaluating lower extremity impact load, serum cartilage biomarkers, and injury incidence. Cartilage synthesis (CPII, CS846) and degradation (C2C) biomarkers were evaluated at 6 season timepoints. Impact load metrics (cumulative bone stimulus, impact intensity) were collected during practices using inertial measurement units secured to the distal medial tibiae. Injury was defined as restriction of participation for 1 or more days beyond the day of initial injury. Cumulative impact load metrics were calculated over the week prior to any documented injury and blood draws for analysis. Point biserial and Pearson product moment correlations were utilized to determine the relationship between impact load metrics, serum biomarkers, and injury. Results: Greater medium range (6-20g’s) cumulative impact intensities during week 5 for both limbs (r=0.674, p=0.023) and high range (20-200g’s) during week 8 for both limbs (r=0.672, p=0.024) were associated with injury. Greater cumulative bone stimulus was associated with increased CPII levels in the post-season for right (r=0.694, p=0.026) and left (r=0.747, p=0.013) limbs. Greater CS846 levels at off-season-1 (r=0.729, p=0.017), and at the beginning of the competitive season (r=0.645, p=0.044) were associated with season long injury incidence. Conclusions: Higher impact intensities (6-200g’s) early in the season were associated with subsequent injury. Increased cartilage synthesis at various timepoints was related to increased cumulative bone stimulus metrics and season-long injury incidence. These relationships should be further evaluated to determine predictive capabilities in larger cohorts.
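
For readers unfamiliar with the point-biserial correlation named in the Methods, it is Pearson's r computed with one variable coded 0/1 (here, injured or not). The sketch below illustrates that equivalence in Python; the eleven load values and injury flags are made up for illustration and are not study data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial_r(binary, values):
    """Point-biserial correlation from group means and the population SD."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_all = sum(values) / n
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / sd * sqrt(len(group1) * len(group0) / n ** 2)

# Hypothetical data: 1 = injured, 0 = not injured; loads in arbitrary units.
injured = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
load = [310, 295, 420, 330, 390, 450, 300, 345, 405, 320, 315]

r_pb = point_biserial_r(injured, load)
r_pearson = pearson_r(injured, load)
print(round(r_pb, 3), round(r_pearson, 3))
```

Both functions return the same value, which is why the two correlation types can be reported side by side in one analysis.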

Session 2: Olympic Sports

Four (three) years later: A look back at the sport climbing competition format and scoring system at Tokyo 2020 (2021)

Gregory J. Matthews, Department of Mathematics and Statistics, Loyola University Chicago

Abstract: Sport climbing, which made its Olympic debut at the 2020 Summer Games in Tokyo, generally consists of three separate disciplines: speed climbing, bouldering, and lead climbing. However, the International Olympic Committee only allowed one set of medals each for men and women in sport climbing at Tokyo 2020. As a result, the governing body of sport climbing, rather than choosing only one of the three disciplines for the 2020 Summer Olympics, decided to create a competition combining all three disciplines. In order to determine a winner, a combined scoring system was created using the product of the ranks across the three disciplines to determine an overall score for each climber. In this work, the rank-product scoring system of sport climbing is evaluated through simulation to investigate its general features, specifically, the advancement probabilities and scores for climbers given certain placements. Additionally, analyses of historical climbing results are presented and real examples of violations of the independence of irrelevant alternatives are illustrated. Finally, this work finds evidence that the competition format at Tokyo 2020 is putting speed climbers at a disadvantage.
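
The rank-product rule and the kind of independence-of-irrelevant-alternatives violation mentioned above can be illustrated with a small Python sketch. The four climbers and their finishing orders are hypothetical, not historical results; the only rule taken from the abstract is that a climber's score is the product of their ranks across the three disciplines, with lower scores being better.

```python
from math import prod

def rank_product_scores(finish_orders):
    """Multiply each climber's rank (1 = best) across disciplines; lower is better."""
    climbers = finish_orders[0]
    return {c: prod(order.index(c) + 1 for order in finish_orders) for c in climbers}

# Hypothetical finishing orders (best first) in speed, bouldering, and lead.
speed = ["A", "B", "C", "D"]
lead = ["C", "D", "B", "A"]

# Scenario 1: C and D finish ahead of B and A in bouldering.
scores_1 = rank_product_scores([speed, ["C", "D", "B", "A"], lead])

# Scenario 2: only C and D change; they now finish behind B and A in bouldering.
scores_2 = rank_product_scores([speed, ["B", "A", "C", "D"], lead])

print(scores_1)  # A scores 16 and B scores 18, so A places ahead of B
print(scores_2)  # A scores 8 and B scores 6, so B places ahead of A
```

In both scenarios A beats B in speed and B beats A in bouldering and lead, yet their overall order flips because of where C and D finish.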

Athlete Recruitment and the Myth of the Sophomore Peak

Monnie McGee, Southern Methodist University

Abstract: Conventional wisdom shared by fans and coaches in the stands at almost any high school track meet suggests that female athletes, particularly distance runners, typically peak around 10th grade (15 years of age) or earlier, whereas male athletes improve continuously. Given that universities in the United States typically recruit track and field athletes from high school teams, it is important to understand the age of peak performance at the high school level. Athletes are often recruited starting in their sophomore year of high school and individuals develop at different rates during adolescence; however, the individual development factor is usually not taken into account during recruitment. In this study, we curate data on event times for high school track and field athletes from the years 2011 to 2019 to determine the trajectory of fastest times for male and female athletes in the 200m, 400m, 800m, and 1600m races. We show, through visualizations and models, that, for most athletes, the sophomore peak is a myth. Performance is mostly dependent on the individual athlete. That said, the trajectories cluster into four or five types, depending on the race distance. We explain the significance of the types for future recruitment.

Optimizing Team Selection & Strategy for the Swimming Mixed Medley Relay

Lulu Zheng, Yale University

Abstract: Despite having some of the world’s best swimmers, the US placed 5th in the 4x100 mixed medley relay at the 2021 Tokyo Summer Olympics. Articles came out after the event criticizing the United States’ strategy and lineup for the relay, so this project aims to pin down the optimal relay combination for the 4x100 mixed medley relay for the United States at the 2024 Paris Summer Olympics. Coming directly from the US Olympic Committee, the data used in this project include individual swim data for the 100 back, 100 breast, 100 free, and 100 fly events for men and women from 2016 to 2022. The first portion of the analysis fits a multiple linear regression model for each of the eight swim events, with the predictors in each model consisting of the swimmer age, session description, meet type, relay indicator, swim meet year, and unique swimmer ID, and the outcome variable being swim time in seconds. The results of the models confirm an age curve for swimmers and indicate that swimmers tend to swim faster in finals and Olympic Games compared to prelims and Olympic Trials. The second portion of the analysis uses the models to predict swim times for the 2021 and 2024 Olympics, constructs normal distributions for the swimmers based on their predicted swim times, and uses the normal distributions to simulate individual swim times and relay times for different combinations of the fastest swimmers in each country. The simulation results reveal that the US used its 61st-fastest predicted relay combination for the mixed medley, potentially explaining its mediocre medley ranking in the 2021 Tokyo Olympics. Additionally, based on the simulation results, our recommended fastest relay combination for the US would have Regan Smith swimming back, Michael Andrew swimming breast, Caeleb Dressel swimming fly, and Simone Manuel swimming free in the 2024 Paris Olympics, assuming the aforementioned swimmers are able to participate. These results are further summarized in a web application designed to help coaches select their optimal relay team.
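
The simulation step can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the author's implementation: the predicted means, the common standard deviation, and the use of one male and one female candidate per stroke are all hypothetical, and lineups are ranked here by mean simulated relay time.

```python
import random
from itertools import combinations

STROKES = ["back", "breast", "fly", "free"]

# Hypothetical predicted mean 100 m times in seconds (not real swimmer data).
PREDICTED = {
    ("back", "M"): 52.2, ("back", "W"): 58.0,
    ("breast", "M"): 58.5, ("breast", "W"): 65.0,
    ("fly", "M"): 50.5, ("fly", "W"): 56.0,
    ("free", "M"): 47.5, ("free", "W"): 52.5,
}
SD = 0.4  # assumed standard deviation of every leg, in seconds

def simulate_lineups(n_sims=5000, seed=0):
    """Return the mean simulated relay time for every two-men, two-women lineup."""
    rng = random.Random(seed)
    results = {}
    for men_strokes in combinations(STROKES, 2):
        lineup = tuple("M" if s in men_strokes else "W" for s in STROKES)
        total = 0.0
        for _ in range(n_sims):
            # One simulated relay: sum of independent normal draws, one per leg.
            total += sum(rng.gauss(PREDICTED[(s, g)], SD)
                         for s, g in zip(STROKES, lineup))
        results[lineup] = total / n_sims
    return results

results = simulate_lineups()
best = min(results, key=results.get)
print(dict(zip(STROKES, best)), round(results[best], 2))
```

With more than one candidate per stroke and sex, the same loop would run over every eligible combination, which is how a large ranked list of relay lineups can arise.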

Session 3: Big Data Bowl Finalists

Momentum-based fractional tackles

Quang Nguyen, Department of Statistics & Data Science, Carnegie Mellon University

Abstract: Tackling is a fundamental defensive action in American football, with the main purpose of stopping the forward progress of the ball carrier. However, current tackling metrics are mostly observed outcomes at the play level and are hence inherently flawed due to their discrete nature. Using player tracking data, we present a model-free framework for assessing tackling contribution throughout a play. Our approach first identifies when a defender is in a contact window of the ball carrier during a play, before assigning values to each window and the players involved. Importantly, we devise a novel statistic, called momentum-based fractional tackles, which properly credits defenders for halting the ball carrier’s momentum in the end zone direction. This overcomes the shortcomings of previous metrics like tackles and assists, which overstate a defender’s tackling contribution as suggested by our results. Furthermore, we analyze the variation of our statistic across different positions and over time. We also display the top NFL defenders based on our metric for the first nine weeks of the 2022 regular season.

Session 4: Sports Analytics Beyond the Field

A New Framework to Estimate Return on Investment for Player Salaries in the National Basketball Association

Jackson Lautier, Bentley University

Abstract: The evaluation of financial investment decisions always begins with a measurement of realized returns. Difficulties arise when returns are nonmonetary, however. One such example is offering a monetary salary in exchange for playing basketball, such as in the National Basketball Association. Despite many proposals to measure the on-court performance of basketball players, there are very few studies that consider both salary and on-court performance simultaneously. We thus present a novel five-part framework to translate a basketball player’s on-court performance into a series of cash flows for the purpose of estimating a contractual return on investment (ROI). Our framework relies on a novel performance-based per-game wealth redistribution measurement that is calibrated with logistic regression on player tracking data against team wins, a WinLogit. The cumulative nature of player statistics into team statistics directly leads to pleasing statistical properties, such as the sum of player game logits equating to the team’s game logit. We also find a maximum likelihood estimate for the WinLogit. We further demonstrate that centered player statistics allow for player calculations to be directly normalized to an average or replacement player, often desirable in sports analysis. We find WinLogit is more accurate than a per-game version of Win Score and Game Score at estimating team total wins and relative team rankings. We present ROI calculations based on WinLogit, Win Score, and Game Score. All results are presented for the 2022-2023 NBA regular season, including by position. For the purposes of replication, all data and code have been made available on a public repository.
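
The additivity property mentioned above follows from the linearity of the logit. As a sketch, assume a logistic model without an intercept whose team-level covariates are the sums of the corresponding player-level covariates; the symbols below are illustrative and are not the author's notation.

```latex
\[
\operatorname{logit} P(\text{win})
  = \boldsymbol{\beta}^{\top}\mathbf{x}_{\text{team}}
  = \boldsymbol{\beta}^{\top}\sum_{i=1}^{n}\mathbf{x}_{i}
  = \sum_{i=1}^{n}\boldsymbol{\beta}^{\top}\mathbf{x}_{i}
\]
```

Here \(\mathbf{x}_{i}\) is the game statistic vector of player \(i\), so each term \(\boldsymbol{\beta}^{\top}\mathbf{x}_{i}\) can be read as that player's share of the team's game logit.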

Estimating the age-conditioned average treatment effects curves: An application for assessing load-management strategies in the NBA

Shinpei Nakamura Sakai, Yale University

Abstract: In the realm of competitive sports, athletes' performance trajectories are typically characterized by an "age curve," exhibiting progression, peak performance, and eventual decline. These curves are anticipated to exhibit variability based on the unique attributes of each athlete. Despite its critical importance, there has been scant research on the variability of treatment effects across different ages. Addressing this gap, our study introduces a novel analytical framework that quantifies the treatment effect for individual ages, thereby delineating the age-specific treatment effect curve.

Our research makes three pivotal contributions. Firstly, we establish a Conditional Expectation Function (CEF) that enables a comparative analysis of age curves across diverse covariates and treatment modalities. Secondly, we advance a causal inference methodology that facilitates the construction of an Age-Conditioned Treatment Effect (ACTE) for any given intervention, allowing for the testing of causal hypotheses specific to each age regarding the treatment and its consequent outcomes. Lastly, we implement our method to evaluate the impact of rest days between games on various performance indicators, taking into account the age of the athletes.
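
One natural reading of an age-conditioned treatment effect, written in standard potential-outcomes notation, is given below. This is an assumption made for illustration and may differ from the estimand defined in the talk.

```latex
\[
\tau(a) = \mathbb{E}\bigl[\, Y(1) - Y(0) \mid A = a \,\bigr]
\]
```

Here \(Y(1)\) and \(Y(0)\) denote a player's performance with and without the treatment (for example, an additional rest day), and \(A\) is the player's age. Plotting \(\tau(a)\) against \(a\) gives an age-specific treatment effect curve.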

Empirical Determination of Baseball Eras: Multivariate Changepoint Analysis in Major League Baseball

Mena Whalen, Loyola University Chicago

Abstract: Multivariate change point analysis methods are used to identify not only mean shifts but also changes in variance across a wide array of statistical time series in Major League Baseball. The primary objective is to discern distinct eras in the evolution of baseball, shedding light on significant transformations in team performance and management strategies. Results confirm previous research, pinpointing well-known baseball eras, such as the Dead Ball Era, Integration Era, Steroid Era, and Post-Steroid Era. Moreover, the study delves into the detection of substantial changes in team performance, effectively identifying periods of both dynasties and collapses within a team's history. The multivariate change point analysis proves to be a valuable tool for understanding the intricate dynamics of baseball's evolution. The method offers a data-driven approach to unveil structural shifts in the sport's historical landscape, providing fresh insights into the impact of rule changes, player strategies, and external factors on baseball's evolution.
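
To give a feel for change point detection, the Python sketch below solves a much simpler problem than the multivariate method described above: it finds a single shift in the mean of one series by minimizing the within-segment squared error. The series is made up and the function name is hypothetical.

```python
def best_mean_shift(series, min_size=2):
    """Return (split_index, cost) minimizing the within-segment squared error."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best = None
    for k in range(min_size, len(series) - min_size + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

# Hypothetical league-wide runs per game over ten seasons (not real data).
runs_per_game = [3.9, 4.0, 4.1, 3.8, 4.0, 4.9, 5.1, 5.0, 4.8, 5.2]

split, cost = best_mean_shift(runs_per_game)
print(split, round(cost, 3))  # the split falls between the fifth and sixth seasons
```

Multivariate methods extend this idea to several series at once and to changes in variance as well as mean.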

Parametric outline fitting to track the evolution of the called strike zone in Major League Baseball from 2008-2023

Dale Zimmerman, University of Iowa

Abstract: Although the strike zone is rigorously defined by Major League Baseball’s official rules, umpires make mistakes in calling pitches as strikes (and balls) and may even adhere to a strike zone somewhat different than that prescribed by the rule book. The availability of data from the automated pitch tracking systems known as PITCHf/x (2008–2016) and TrackMan (2017–2023) makes it possible to answer interesting questions about this “called strike zone” (CSZ). I develop methods for using these data to make inferences about the evolution of geometric attributes (centroid, dimensions, orientation and shape) of the CSZ over time. The methodology consists of first using kernel discriminant analysis to determine a noisy outline representing the CSZ for each year, then fitting generalized superelliptic (GSE) models for closed curves to that outline, and finally applying Mann-Kendall tests for monotonic trend to the estimated parameters of the GSE models. I apply these methods to PITCHf/x and TrackMan data comprising about five million called pitches from the 2008–2023 Major League Baseball seasons to study how the CSZ evolved over this time period. I find that most, but not all, geometric attributes of the CSZ became significantly more like those of the rule-book strike zone from 2008–2023.
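
The final step, a Mann-Kendall test for monotonic trend, can be sketched in Python as below. The sketch uses the normal approximation with a continuity correction and no correction for ties; the sixteen yearly values are hypothetical stand-ins for one estimated GSE parameter, not estimates from the talk.

```python
from math import erf, sqrt

def mann_kendall(values):
    """Return (S, z, two-sided p) for the Mann-Kendall trend test, assuming no ties."""
    n = len(values)
    # S counts later-minus-earlier increases minus decreases over all pairs.
    s = sum((values[j] > values[i]) - (values[j] < values[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return s, z, 2 * (1 - normal_cdf)

# Hypothetical yearly estimates of one zone parameter for 2008 through 2023.
yearly_estimates = [20.0 - 0.1 * k for k in range(16)]

s_stat, z_stat, p_value = mann_kendall(yearly_estimates)
print(s_stat, round(z_stat, 2), p_value)
```

A strictly decreasing series of 16 values gives S = -120, the most negative value possible for that length, and a very small p-value.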

Poster Session

Predicting Injuries through Pose Estimation: A Biomechanical Analysis of Fatigue in Track and Field Athletes

Daniel Griffiths, Syracuse University

Abstract: This study introduces a pose estimation model, leveraging artificial intelligence and computer vision, to analyze biomechanical changes in track and field athletes due to fatigue. The primary aim was to investigate how fatigue influences athletes' movement patterns, potentially leading to injuries and performance degradation. The method involved capturing videos from a lateral (side) view and an anterior (front) view as athletes ran five laps (1.25 miles) at a mile race pace (around 5:40 per mile for this athlete) on a standard Olympic track. The hypothesis was that athletes would exhibit signs of fatigue by the final lap, impacting their biomechanical efficiency. Using the OpenPose model, the research extracted x and y coordinates from athletes' movements during different laps, facilitating a detailed kinematic analysis. This included examining joint correlations, ankle positions, cyclical asymmetry, posture analysis, and stride lengths across laps. The results showed notable shifts in joint correlations, particularly between the ankles and shoulders, and the knees and shoulders, indicating compensatory movements due to fatigue. Ankle positions demonstrated increasing consistency, suggesting an adaptation in running technique to counteract fatigue. Additionally, variations in the cyclical asymmetry index and forward lean angles further pointed towards fatigue-induced changes in knee movement and posture. These findings were complemented by a decrease in average step length across laps, highlighting a shift to more sustainable running patterns as fatigue set in. Temporal consistency scores improved over time, implying that athletes maintained stable movement patterns despite fatigue. This suggests a high level of adaptability in the athletes’ biomechanical strategies to cope with the increasing physical demands of the task. 
This research highlights the potential of AI-driven pose estimation in sports science, offering valuable insights for enhancing training plans, preventing overtraining, optimizing peaking for later competitions, and reducing injury risks in athletics. It underscores the importance of continuous monitoring of athletes' biomechanics, not just in competition but also during training. By integrating AI and computer vision into sports science, coaches and athletes gain access to a powerful tool for real-time biomechanical analysis, enabling more informed decisions to enhance performance and prevent injuries. This study paves the way for more sophisticated and personalized training programs, tailored to the individual biomechanical profiles and fatigue responses of athletes.
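
As an illustration of the kind of kinematic quantity that can be derived from 2D keypoints, the Python sketch below computes a forward lean angle from a hip point and a neck point in the lateral view. The function name, the choice of keypoints, and the coordinate conventions (image coordinates with y increasing downward, runner moving in the positive x direction) are assumptions, not details taken from the study.

```python
from math import atan2, degrees

def forward_lean_deg(hip, neck):
    """Angle of the hip-to-neck segment from vertical, in degrees.

    Points are (x, y) pixel coordinates with y increasing downward.
    Positive angles mean the trunk leans toward positive x.
    """
    dx = neck[0] - hip[0]
    dy = hip[1] - neck[1]  # flip the sign so that "up" is positive
    return degrees(atan2(dx, dy))

# Hypothetical keypoints for one video frame (pixels).
upright = forward_lean_deg(hip=(100, 200), neck=(100, 100))
leaning = forward_lean_deg(hip=(100, 200), neck=(200, 100))
print(upright, leaning)
```

Computing this angle frame by frame and averaging within each lap would give one way to compare posture across laps.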

Quantitative Analysis of Mental Resilience and Scoring in Tennis

Shihao Li, Yale University

Abstract: This paper aims to investigate whether mental pressure has an impact on point win probability and performance metrics at the highest level of tennis. Tennis is structured such that certain points are more impactful to the match outcome than other points, providing the perfect setting to investigate the impact of mental resiliency in sports. We use point-by-point data from the 2019 Grand Slam tournaments provided by Jeff Sackmann, which includes metrics like serving performance, rally performance, scores, and outcomes, to identify high-pressure points. We then trained linear and logistic mixed-effects regression models of win probability and key metrics to evaluate the effect of mental pressure while controlling for other factors in the match. We additionally compared model accuracy depending on the use of player random slopes with respect to point pressure, allowing us to evaluate whether differences in individual ability to play under pressure are meaningful in determining match outcomes between two players. We find that, on average, mental pressure has a small but directionally negative effect on win probability and performance metrics. There is, however, little to no benefit to model accuracy from including player random slopes with pressure, suggesting that pressure does not affect one player in the data particularly more than others. As such, professional tennis match outcomes are not significantly influenced by a difference in mental resiliency between the players of that match.

Evaluating Player Performance Within the 50 Largest Contracts in MLB History

Stephen Parziale, Yale University

Abstract: Major League Baseball (MLB) is heavily populated with player contracts exceeding $100,000,000. The purpose of this project is to analyze player performance within the 50 most valuable contracts signed in MLB history. For each contract, overall player performance in the year prior to signing will be compared with performance in the first year played under the new contract. We hypothesize that, due to the strong monetary incentive in the final year before free agency, player performance in the year prior to the new contract will be stronger than the first-year performance under the new contract. In this project, average WAR in Contract Years and First Years will be estimated, and we will test whether there is a significant difference in these values while adjusting for confounding variables. We will also analyze secondary measures of interest.

PAVing the Way for the Future - A Model That Determines Player Value and Evaluates Trades in the National Football League

Atul Venkatesh, Dartmouth College

Abstract: Evaluating player performance is one of the most critical concepts in pro football. One method to do so is Approximate Value (AV), where players are given a seasonal score based on accolades and statistics that accumulate throughout their careers. While AV reflects a player's performance historically, it lacks insight into their current worth. To address this issue, I have modified AV to create the Present AV (PAV) statistic, which shows the current value of over 1800 players through the 2022 NFL season. I have also created a draft value chart which gives the expected PAV of every draft pick. With both of these metrics, it is now possible to evaluate past trades, determine the appropriate draft compensation for a player, and use a statistics-based method to determine the most valuable players in the league.

NFL QB Air Yards Over Expected Model

Amrit Vignesh, Seminole High School

Abstract: The amount of yardage generated by a quarterback has two components: air yards and YAC (yards after catch). Air yards describe the number of yards the football travels from the line of scrimmage to the point of completion (if completed) or to where it is deemed incomplete. Air yards can represent different qualities of a quarterback, such as arm talent and aggressiveness. In order to quantify aggressiveness, an AYOE (Air Yards Over Expected) Model is created using XGBoost by comparing average air yards per attempt to the predicted average air yards per attempt. The predicted value is based on a variety of situational factors such as yard line, down, etc., but also includes a factor of trust in the surrounding offense by incorporating the cumulative receiving EPA of the targeted receiver within that season before the game in which the pass attempt took place. The model generates AYOE values for qualified quarterbacks during the 2023 NFL season (regular season and playoffs) between -3 and 3 yards, with larger values indicating an aggressive playstyle and smaller values indicating a conservative playstyle.
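
The comparison at the heart of AYOE can be sketched in Python as below: for each quarterback, average actual air yards per attempt minus average expected air yards per attempt. The expected values here are placeholder numbers standing in for model predictions, and the quarterback labels and attempts are hypothetical.

```python
from collections import defaultdict

def ayoe_by_qb(attempts):
    """Return {qb: mean actual air yards - mean expected air yards}.

    Each attempt is a (qb, actual_air_yards, expected_air_yards) tuple.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # actual sum, expected sum, count
    for qb, actual, expected in attempts:
        totals[qb][0] += actual
        totals[qb][1] += expected
        totals[qb][2] += 1
    return {qb: (a - e) / n for qb, (a, e, n) in totals.items()}

# Hypothetical pass attempts; expected values stand in for model output.
attempts = [
    ("QB_A", 12.0, 8.0),
    ("QB_A", 6.0, 7.0),
    ("QB_B", 3.0, 6.5),
    ("QB_B", 5.0, 5.5),
]

ayoe = ayoe_by_qb(attempts)
print(ayoe)  # QB_A is at +1.5 yards and QB_B at -2.0 yards
```

A positive value means the quarterback threw deeper than the situation predicted, and a negative value means shorter.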