Abstract: Purpose: Impact loading alterations have been associated with cartilage metabolism changes before and after joint injury; however, the relationship and variation in these metrics over an athletic season are unclear. We aimed to establish whether cumulative impact load and serum biomarkers are related to lower extremity injury and evaluate any impact load and cartilage biomarker relationships. Methods: Eleven collegiate women’s basketball athletes (height: 1.86 meters, mass: 82.0 kilograms, age: 20.54 years) participated in a prospective case-series evaluating lower extremity impact load, serum cartilage biomarkers, and injury incidence. Cartilage synthesis (CPII, CS846) and degradation (C2C) biomarkers were evaluated at 6 season timepoints. Impact load metrics (cumulative bone stimulus, impact intensity) were collected during practices using inertial measurement units secured to the distal medial tibiae. Injury was defined as restriction of participation for 1 or more days beyond day of initial injury. Cumulative impact load metrics were calculated over the week prior to any documented injury and prior to each blood draw. Point biserial and Pearson product moment correlations were utilized to determine the relationship between impact load metrics, serum biomarkers, and injury. Results: Greater medium range (6-20g’s) cumulative impact intensities during week 5 for both limbs (r=0.674, p=0.023) and high range (20-200g’s) during week 8 for both limbs (r=0.672, p=0.024) were associated with injury. Greater cumulative bone stimulus was associated with increased CPII levels in the post-season for right (r=0.694, p=0.026) and left (r=0.747, p=0.013) limbs. Greater CS846 levels at off-season-1 (r=0.729, p=0.017) and at the beginning of the competitive season (r=0.645, p=0.044) were associated with season-long injury incidence. Conclusions: Higher impact intensities (6-200g’s) early in the season were associated with subsequent injury. Increased cartilage synthesis at various timepoints was related to increased cumulative bone stimulus metrics and season-long injury incidence. These relationships should be further evaluated to determine predictive capabilities in larger cohorts.
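The correlation analysis named in this abstract can be illustrated with a minimal sketch. A point-biserial correlation is a Pearson correlation in which one variable is binary (here, injured or not). The function names and the sample values below are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial_r(binary, values):
    """Point-biserial correlation: Pearson r with a 0/1 coded variable."""
    if set(binary) - {0, 1}:
        raise ValueError("binary must contain only 0 and 1")
    return pearson_r([float(b) for b in binary], values)


# Hypothetical data: injury status (1 = injured) and weekly cumulative impact load.
injured = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
weekly_load = [310, 295, 420, 330, 455, 300, 390, 340, 315, 430, 325]
print(round(point_biserial_r(injured, weekly_load), 3))
```

With only eleven athletes, as in the abstract, a single observation can move such a coefficient substantially, which is one reason the authors call for evaluation in larger cohorts.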
Abstract: The integration of technology in sport has significantly advanced over the last decade. Technology and player monitoring have allowed us to make more informed decisions as they pertain to health and physical performance. Monitoring is used to optimize physical performance and stress with the intention of prescribing training load, reducing risks, and identifying fatigue. Widespread understanding, adoption, and application of these practices are severely lacking in elite collegiate women’s basketball. The technologies highlighted in this presentation include player load management using the Catapult System and neuromuscular responses using force plate technology by Sparta Science. The purpose of this session is to describe the technology we use and how we monitor workload and neuromuscular response. Furthermore, this presentation will share experiences related to technology integration and player monitoring with the intention of optimizing performance. Examples of unique cases where technology was able to detect subtle changes in these metrics will be examined in depth. The application of this data in the future for women’s basketball has endless potential.
Abstract: Ultra-marathon open water swimming (OWS) events are among the toughest endurance challenges in the world. The sport has gained notoriety with athletes swimming across the English Channel, Diana Nyad swimming from Cuba to Florida, and the 5 and 10km OWS in the Olympic schedule. The athletes who participate are exposed to dangerous conditions and to risks inherent to the sport. The optimal time to prepare for an emergency is before it happens. The aim of this session is to present the emergency action plan (EAP) designed for the “Swim Tuff” event, a record-breaking ultramarathon swim that took place in Rhode Island, USA. We will also present physiological data recorded from the event. This session provides an overview of Swim Tuff, the challenges experienced, and how the team designed and implemented risk mitigation strategies.
Abstract: The demands of intense training and a rigorous game schedule may result in decreased performance, increased fatigue, and increased injury risk in women’s NCAA Division I basketball. Wearable technology provides quantifiable insights on a team’s training load that can be used to optimize exercise stress, response, and adaptation to optimize performance. PURPOSE: To investigate the fluctuations in measures of external load between the four essential periods (Pre-Season, Non-Conference, Conference, and Post-Season) in a women’s basketball team. METHODS: Fifteen elite NCAA women’s basketball players were monitored via Catapult's S7 inertial measurement units (IMU) during practice and game days. Weekly means of total PlayerLoad™ (PL), PL·min⁻¹ (PLM), and total jumps (TJ) were collected and exported using OpenField Software (Catapult) then transferred to Tableau. A one-way ANOVA was used to detect significant differences (p < 0.05) in measures of external load across periods. Additionally, we conducted Tukey’s HSD post-hoc tests to identify pairwise differences between periods. RESULTS: We observed a significant difference in PL between the Pre-Season and Conference periods (MD -186.85, 95% CI [-353.21, -20.49], p = 0.02). There were no significant differences in PLM and TJ between the four periods (p > 0.05); however, Pre-Season averaged the highest total PL, PLM, and TJ compared to all other periods. PLM was highest in the Pre-Season compared to all other periods: vs. Non-Conference (MD -0.66, 95% CI [-1.76, 0.44]), vs. Conference (MD -0.50, 95% CI [-1.63, 0.63]), and vs. Post-Season (MD -0.60, 95% CI [-1.74, 0.54]). TJ was also highest in the Pre-Season compared to all other periods: vs. Non-Conference (MD -8.29, 95% CI [-30.26, 13.67]), vs. Conference (MD -10.59, 95% CI [-33.34, 12.14]), and vs. Post-Season (MD -5.29, 95% CI [-28.03, 17.45]).
CONCLUSION: While our study is ongoing, the findings for the 2022-2023 season suggest PL demands are highest during the Pre-Season period of the D1 NCAA women’s basketball season. Our findings suggest that modifications to pre-season are warranted. Periodization of training load through the modification of acute program variables would allow for optimal adaptation and performance prior to the start of the non-conference period. With optimal management, physical performance may be optimized and injury risk reduced within this NCAA women’s basketball team.
Abstract: Sport climbing, which made its Olympic debut at the 2020 Summer Games in Tokyo, generally consists of three separate disciplines: speed climbing, bouldering, and lead climbing. However, the International Olympic Committee only allowed one set of medals each for men and women in sport climbing at Tokyo 2020. As a result, the governing body of sport climbing, rather than choosing only one of the three disciplines for the 2020 Summer Olympics, decided to create a competition combining all three disciplines. In order to determine a winner, a combined scoring system was created using the product of the ranks across the three disciplines to determine an overall score for each climber. In this work, the rank-product scoring system of sport climbing is evaluated through simulation to investigate its general features, specifically, the advancement probabilities and scores for climbers given certain placements. Additionally, analyses of historical climbing results are presented and real examples of violations of the independence of irrelevant alternatives are illustrated. Finally, this work finds evidence that the competition format at Tokyo 2020 is putting speed climbers at a disadvantage.
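The rank-product rule and the independence-of-irrelevant-alternatives issue described in this abstract can be made concrete with a small sketch. The climbers "A", "B", and "C" and their placements are hypothetical, chosen only to show how removing one climber can flip the relative order of the other two; they are not historical results.

```python
from math import prod


def rank_product_scores(orderings):
    """Score each climber by the product of their ranks across disciplines.

    orderings: one list per discipline, each listing all climbers best-first.
    A lower score is better.
    """
    climbers = orderings[0]
    return {c: prod(order.index(c) + 1 for order in orderings) for c in climbers}


def remove_climber(orderings, name):
    """Drop one climber and let everyone behind them move up a place."""
    return [[c for c in order if c != name] for order in orderings]


# Hypothetical finishing orders (best first) in speed, bouldering, and lead.
orderings = [
    ["C", "A", "B"],  # speed
    ["C", "A", "B"],  # bouldering
    ["B", "C", "A"],  # lead
]

with_c = rank_product_scores(orderings)
without_c = rank_product_scores(remove_climber(orderings, "C"))
print(with_c)     # B (9) finishes ahead of A (12)
print(without_c)  # A (2) finishes ahead of B (4)
```

Nothing about A's or B's own performances changes between the two calls; only C's presence does, which is the kind of violation the abstract refers to.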
Abstract: Conventional wisdom dispersed by fans and coaches in the stands at almost any high school track meet suggests female athletes typically peak around 10th grade or earlier (15 years of age), particularly for distance runners, and male athletes continuously improve. Given that universities in the United States typically recruit track and field athletes from high school teams, it is important to understand the age of peak performance at the high school level. Athletes are often recruited starting in their sophomore year of high school and individuals develop at different rates during adolescence; however, the individual development factor is usually not taken into account during recruitment. In this study, we curate data on event times for high school track and field athletes from the years 2011 to 2019 to determine the trajectory of fastest times for male and female athletes in the 200m, 400m, 800m, and 1600m races. We show, through visualizations and models, that, for most athletes, the sophomore peak is a myth. Performance is mostly dependent on the individual athlete. That said, the trajectories cluster into four or five types, depending on the race distance. We explain the significance of the types for future recruitment.
Abstract: Since the United States is one of the only countries whose Olympic and Paralympic Committee is not funded by the federal government, the USOPC Performance Innovation team takes a utilitarian approach to using data and analytics to support athlete performance. Learn how our small but mighty team innovates to overcome challenges, with examples of projects completed for USA Gymnastics.
Abstract: Despite having some of the world’s best swimmers, the US placed 5th in the 4x100 mixed medley relay at the 2021 Tokyo Summer Olympics. Articles came out after the event criticizing the United States’ strategy and lineup for the relay, so this project aims to pin down the optimal relay combination for the 4x100 mixed medley relay for the United States at the 2024 Paris Summer Olympics. Coming directly from the US Olympic Committee, the data used in this project include individual swim data for the 100 back, 100 breast, 100 free, and 100 fly events for men and women from 2016 to 2022. The first portion of the analysis fits a multiple linear regression model for each of the eight swim events, with the predictors in each model consisting of the swimmer age, session description, meet type, relay indicator, swim meet year, and unique swimmer ID, and the outcome variable being swim time in seconds. The results of the models confirm an age curve for swimmers and indicate that swimmers tend to swim faster in finals and Olympic Games compared to prelims and Olympic Trials. The second portion of the analysis uses the models to predict swim times for the 2021 and 2024 Olympics, constructs normal distributions for the swimmers based on their predicted swim times, and uses the normal distributions to simulate individual swim times & relay times for different combinations of the fastest swimmers in each country. The simulation results reveal that the US used their 61st predicted fastest relay combination for the mixed medley, potentially explaining their mediocre medley ranking in the 2021 Tokyo Olympics. Additionally, based on the simulation results, our recommended fastest relay combination for the US would have Regan Smith swimming back, Michael Andrew swimming breast, Caeleb Dressel swimming fly, and Simone Manuel swimming free in the 2024 Paris Olympics, assuming the aforementioned swimmers are able to participate. 
These results are further summarized in a web application designed to help coaches select their optimal relay team.
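The simulation step described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes one candidate swimmer per stroke and sex, hypothetical predicted means and standard deviations (in seconds), and the mixed medley constraint of two men and two women. The abstract's analysis additionally varies which swimmer fills each leg.

```python
import itertools
import random

LEGS = ["back", "breast", "fly", "free"]

# Hypothetical predicted (mean, sd) split in seconds for the best swimmer
# of each sex on each leg; real values would come from the fitted models.
PREDICTED = {
    ("back", "M"): (52.0, 0.4), ("back", "W"): (57.5, 0.4),
    ("breast", "M"): (58.0, 0.5), ("breast", "W"): (64.5, 0.5),
    ("fly", "M"): (50.0, 0.4), ("fly", "W"): (55.8, 0.4),
    ("free", "M"): (47.0, 0.3), ("free", "W"): (52.5, 0.3),
}


def lineups():
    """Yield every assignment of sexes to legs with exactly two men."""
    for men in itertools.combinations(LEGS, 2):
        yield tuple("M" if leg in men else "W" for leg in LEGS)


def mean_relay_time(lineup, n_sims, rng):
    """Average simulated relay time, drawing each leg from a normal distribution."""
    total = 0.0
    for _ in range(n_sims):
        total += sum(rng.gauss(*PREDICTED[(leg, sex)])
                     for leg, sex in zip(LEGS, lineup))
    return total / n_sims


def rank_lineups(n_sims=2000, seed=0):
    """Return (mean time, lineup) pairs sorted from fastest to slowest."""
    rng = random.Random(seed)
    return sorted((mean_relay_time(l, n_sims, rng), l) for l in lineups())


for mean_time, lineup in rank_lineups():
    print(dict(zip(LEGS, lineup)), round(mean_time, 2))
```

Under these made-up inputs the fastest lineup places the men on the legs where the gap between the men's and women's predicted times is largest, which is the intuition behind ranking combinations by simulated relay time.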
Abstract: In American football, predicting end-of-play location is crucial to the creation of commonly referenced player-evaluation metrics such as Yards Over Expected (YOE). However, most existing models in this domain primarily focus on predicting only point estimates for gained downfield yardage. While these models are ideal for metric creation and simplicity, they fail to estimate the probability distributions behind outcomes, thereby neglecting to explore the potential for multi-modal end-of-play possibilities. Moreover, while estimating only downfield yardage is sufficient for yardage-gained metrics, it fails to provide insight into how players influence play outcomes across the entire field.
I propose a method that leverages an ensemble of Deep Convolutional Gaussian Mixture Models (DCGMM) to address these limitations. This approach (1) estimates the end-of-play location as a two-dimensional mixture distribution, encompassing outcomes in both downfield and lateral field directions, and (2) allows for the employment of the subtraction method to analyze how estimated end-of-play distributions change with and without certain players. Through such modeling, I also estimate appropriate model error with bootstrap ensembling to account for the limitations of tracking data sample sizes. This research enables a more comprehensive and visual understanding of how player positioning affects influence over the field.
Abstract: Defensive plays on the football field require the effort of 11 players. However, defensive statistics only credit one or two defenders for the play’s end result (sack, tackle, interception, etc.). Our project focused on the off-ball contributions of the oft-ignored nine or ten others. In particular, we focused on a vital skill to stopping the run — setting the edge. SET: Spatial Edge Technique measures a defender’s ability to set the outside edge and force the ball carrier inside. Our goal is to enrich the statistical narrative by spotlighting the vital role of setting the edge—an aspect highlighted in film rooms but frequently missing from the raw data and numbers.
Abstract: Tackling is a fundamental defensive move in American football, with the main purpose of stopping the forward progress of the ball carrier. However, current tackling metrics are mostly observed outcomes at the play level and are hence inherently flawed due to their discrete nature. Using player tracking data, we present a model-free framework for assessing tackling contribution throughout a play. Our approach first identifies when a defender is in a contact window of the ball carrier during a play, before assigning values to each window and the players involved. Importantly, we devise a novel statistic, called fractional tackles, which properly credits defenders for halting the ball carrier’s momentum in the end zone direction. This overcomes the shortcomings of previous metrics like tackles and assists, which overstate a defender’s tackling contribution as suggested by our results. Furthermore, we analyze the variation of our statistic across different positions and over time. We also display the top NFL defenders based on our metric for the first nine weeks of the 2022 regular season.
Abstract: The evaluation of financial investment decisions always begins with a measurement of realized returns. Difficulties arise when returns are nonmonetary, however. One such example is offering a monetary salary in exchange for playing basketball, such as in the National Basketball Association. Despite many proposals to measure the on-court performance of basketball players, there are very few studies that consider both salary and on-court performance simultaneously. We thus present a novel five-part framework to translate a basketball player’s on-court performance into a series of cash flows for the purpose of estimating a contractual return on investment (ROI). Our framework relies on a novel performance-based per-game wealth redistribution measurement that is calibrated with logistic regression on player tracking data against team wins, a WinLogit. The cumulative nature of player statistics into team statistics directly leads to pleasing statistical properties, such as the sum of player game logits equating to the team’s game logit. We also find a maximum likelihood estimate for the WinLogit. We further demonstrate that centered player statistics allow for player calculations to be directly normalized to an average or replacement player, often desirable in sports analysis. We find WinLogit is more accurate than a per-game version of Win Score and Game Score at estimating team total wins and relative team rankings. We present ROI calculations based on WinLogit, Win Score, and Game Score. All results are presented for the 2022-2023 NBA regular season, including by position. For the purposes of replication, all data and code have been made available on a public repository.
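The additivity property mentioned in this abstract (player game logits summing to the team game logit) follows whenever the logit is a linear function of statistics that themselves add up from players to team. The sketch below illustrates that property only; the weights, the statistic names, and the omission of an intercept are assumptions made for the illustration and are not the authors' WinLogit specification.

```python
from math import exp, isclose

# Hypothetical regression weights on box-score style statistics.
WEIGHTS = {"pts": 0.05, "reb": 0.03, "ast": 0.04, "tov": -0.06}


def logit(stats):
    """Linear predictor (log-odds contribution) for a dict of statistics."""
    return sum(WEIGHTS[name] * stats[name] for name in WEIGHTS)


def win_probability(z):
    """Logistic transform of a log-odds value."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical centered single-game statistics for three players.
players = [
    {"pts": 6.0, "reb": 1.5, "ast": 2.0, "tov": -0.5},
    {"pts": -2.0, "reb": 3.0, "ast": -1.0, "tov": 1.0},
    {"pts": 1.0, "reb": -0.5, "ast": 0.5, "tov": 0.0},
]

# Team statistics are the sums of the player statistics.
team = {name: sum(p[name] for p in players) for name in WEIGHTS}

player_total = sum(logit(p) for p in players)
print(isclose(player_total, logit(team)))
print(round(win_probability(logit(team)), 3))
```

Because the sum of linear terms equals the linear term of the sum, each player's logit can be read as that player's share of the team's game logit under these assumptions.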
Abstract: In the realm of competitive sports, athletes' performance trajectories are typically characterized by an "age curve," exhibiting progression, peak performance, and eventual decline. These curves are anticipated to exhibit variability based on the unique attributes of each athlete. Despite its critical importance, there has been scant research on the variability of treatment effects across different ages. Addressing this gap, our study introduces a novel analytical framework that quantifies the treatment effect for individual ages, thereby delineating the age-specific treatment effect curve.
Our research makes three pivotal contributions. Firstly, we establish a Conditional Expectation Function (CEF) that enables a comparative analysis of age curves across diverse covariates and treatment modalities. Secondly, we advance a causal inference methodology that facilitates the construction of an Age-Conditioned Treatment Effect (ACTE) for any given intervention, allowing for the testing of causal hypotheses specific to each age regarding the treatment and its consequent outcomes. Lastly, we implement our method to evaluate the impact of rest days between games on various performance indicators, taking into account the age of the athletes.
Abstract: Multivariate change point analysis methods are used to identify not only mean shifts but also changes in variance across a wide array of statistical time series in Major League Baseball. The primary objective is to discern distinct eras in the evolution of baseball, shedding light on significant transformations in team performance and management strategies. Results confirm previous research, pinpointing well-known baseball eras, such as the Dead Ball Era, Integration Era, Steroid Era, and Post-Steroid Era. Moreover, the study delves into the detection of substantial changes in team performance, effectively identifying periods of both dynasties and collapses within a team's history. The multivariate change point analysis proves to be a valuable tool for understanding the intricate dynamics of baseball's evolution. The method offers a data-driven approach to unveil structural shifts in the sport's historical landscape, providing fresh insights into the impact of rule changes, player strategies, and external factors on baseball's evolution.
Abstract: Although the strike zone is rigorously defined by Major League Baseball’s official rules, umpires make mistakes in calling pitches as strikes (and balls) and may even adhere to a strike zone somewhat different than that prescribed by the rule book. The availability of data from the automated pitch tracking systems known as PITCHf/x (2008—2016) and TrackMan (2017—2023) makes it possible to answer interesting questions about this “called strike zone” (CSZ). I develop methods for using these data to make inferences about the evolution of geometric attributes (centroid, dimensions, orientation and shape) of the CSZ over time. The methodology consists of first using kernel discriminant analysis to determine a noisy outline representing the CSZ for each year, then fitting generalized superelliptic (GSE) models for closed curves to that outline, and finally applying Mann-Kendall tests for monotonic trend to the estimated parameters of the GSE models. I apply these methods to PITCHf/x and TrackMan data comprising about five million called pitches from the 2008–2023 Major League Baseball seasons to study how the CSZ evolved over this time period. I find that most, but not all, geometric attributes of the CSZ became significantly more like those of the rule-book strike zone from 2008–2023.
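The final step of the methodology in this abstract, a Mann-Kendall test for monotonic trend, can be sketched in a few lines. This version uses the normal approximation with a continuity correction and assumes no tied values; the yearly parameter estimates are hypothetical and stand in for a fitted GSE parameter such as zone width.

```python
from math import erfc, sqrt


def mann_kendall(series):
    """Mann-Kendall trend test (no-ties variance, normal approximation).

    Returns (S, z, two-sided p-value).
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the later-minus-earlier difference
    var_s = n * (n - 1) * (2 * n + 5) / 18  # valid only when there are no ties
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value under the standard normal
    return s, z, p_value


# Hypothetical yearly estimates of one CSZ parameter (arbitrary units), 2008-2023.
estimates = [1.92, 1.90, 1.91, 1.88, 1.87, 1.85, 1.86, 1.83,
             1.82, 1.80, 1.81, 1.78, 1.77, 1.76, 1.74, 1.73]
s, z, p = mann_kendall(estimates)
print(s, round(z, 2), p)
```

A strongly negative S with a small p-value would indicate a monotonic decrease in the parameter over the period; with only 16 yearly values, the normal approximation is a convenience rather than an exact result.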
Abstract: With domestic league Twenty20 (T20) cricket growing at a seemingly exponential rate alongside commensurate revenue growth, being able to determine whether players have provided appropriate value to their respective teams is of vital importance. Furthermore, being able to accurately predict a player’s salary from their performance metrics is equally valuable. This study aims to tackle both of these questions using a combination of datasets on the largest domestic cricket league, the Indian Premier League (IPL). We use a dataset of IPL games for all available seasons, which includes granular data at the ball-by-ball level. This is then matched to a complete list of player salaries by season. The data used by our models was derived from the ball-by-ball data such that each observation is a player, the outcome is the salary, and the predictors are derived statistics including, among other things, the total number of runs, the total number of wickets, and whether the player was international or domestic. Our primary analysis consisted of two related but distinct components, a regression analysis and a classification analysis, both based on the aforementioned setup. First, we aimed to predict a player’s salary in the subsequent season based on their performance in a given season; second, we built a classification model to predict whether or not a given player would be ‘retained’ for the upcoming season. A number of models were used and compared: linear regressions and random forests (including with boosting) for the regression component, and logistic models alongside random forest classifiers for the classification component. We found that, in general, random forests were better able to predict player salaries, which we theorized was due to their ability to better handle multicollinearity in the features.
Additionally, classification proved difficult regardless of model: while models were able to accurately determine when a player would not be retained, they were much less successful in determining when players were retained. We hypothesize that this was due to unmeasured variables unrelated to player performance.
Abstract: What is playmaking? Playmaking can be defined as the ability to make plays that facilitate scoring for one’s team. The process of playmaking can be broken down into three steps. Firstly, great playmakers are aware of where 10 players are located and what they are doing all the time (Court Vision). Secondly, based on their understanding of the game, they make decisions about which play to run or simply where to pass (Decision Making). In other words, they find the best possible play to increase the chance of scoring in a given situation. Lastly, they execute the play with precision (Execution). For instance, if they pass the ball, they should complete the pass without it being intercepted.
How can we measure a player’s playmaking ability? One approach is to use proxy measures. For example, we can look at how often a player passes the ball to the teammate who has the highest chance of scoring or who has an open look. If a player passes to an open teammate whenever one is available, it indicates that he has excellent court vision and decision-making skills; he can thus be considered a good playmaker.
Abstract: The punter may be the most overlooked position in football, but punters can have a large impact on game outcomes. This project aims to better quantify the value of each punter using data from the 2019 Big Data Bowl and nflweather.com. Both traditional data (such as yard line, snap quality, and weather) and tracking data were used. With this data, we created a multiple regression model for predicting punt distance and a neural network to predict expected return lengths to compute an expected net punt yardage. This allows us to better understand how many additional yards of field position each punter gains compared to the others, enabling us to better rank and compare punters and to determine both quality and value more accurately.
Abstract: Purpose: Breakdance, a form of dance interconnected with Hip-Hop, is currently rising as a sport, and has worldwide participation. Breakdance will be an event for the first time at the 2024 Olympics. Due to its artistic nature, there are many challenges associated with scoring breakdance competitions and battles. The development of analytic methods should measure and quantify a dancer’s movement and performance in breakdance. Computer vision and machine learning can be used to determine how and where a dancer is moving across a dance floor.
Methods: My work uses pose tracking and a random forest classifier to evaluate three North Carolina breakdancers. Videos of these three dancers are from competitions hosted by the Raleigh Rockers and Carolinas Dance Committee. A centertrack-based deep learning pose tracking model is used to track each dancer’s pose across each frame. Within every frame of a video, each breakdancer’s moves are manually categorized as footwork, toprock, freezes, or power moves. These four types of movement are fundamental to breakdance. The resulting analysis of each breakdancer’s movement becomes a matrix where one’s pose and type of movement are represented in each frame. Each dancer’s matrix is then used for classification in a random forest.
Results: It was found that all three breakdancers use the fundamental moves within breakdance at different rates and frequencies. Furthermore, the majority of dances would start with toprock but would have much more varied movements over time. Within different rounds or one-minute intervals by the same dancer, the same sequence of moves was oftentimes used. A random forest was able to differentiate between unique dancers and classify the four fundamental movements of breakdance.
Conclusions: These metrics can identify a particular style and showcase areas where a dancer needs improvement and may lead to more consistent and reliable scoring in competitions. Pose prediction and machine learning can have applications to other avenues within sports analytics.
Abstract: Rule changes in Major League Baseball (MLB) are uncommon compared to other sports like basketball and football, especially ones which completely change a certain aspect of the game. At the start of the 2023 season, MLB instituted a set of three significant rule changes: the introduction of a pitch clock with a corresponding limit on the number of pickoff attempts per plate appearance, an increase in the size of the bases, and a ban on certain defensive alignments (banning the shift). This project aims to determine how much of an impact these three rule changes actually had on the various aspects of gameplay and associated outcomes. Other than reducing the length of games, did the pitch clock change league-wide trends in statistics like ERA, or did it only help or hurt certain pitchers? How much did base stealing change throughout the year, and did pitchers eventually adjust to better control the running game? Initial results suggest that although the pitch clock may not have had a significant impact across the league, certain pitchers who took longer between pitches during the 2022 MLB season fared worse once the pitch clock was introduced. Furthermore, findings suggest that the 2023 season marked a significant change in league-wide trends across various baserunning outcomes. Though a three-inch change in the size of the bases may seem insignificant, in tandem with the pitch clock and pickoff rules, the new rule changes shifted how MLB teams approach stealing bases and their success in doing so.
Abstract: Devon Allen was disqualified in the men's 110-meter hurdles final of the 2022 World Track and Field Championships after registering a reaction time of 0.099 seconds, 0.001 seconds faster than what is allowed. Following the championships, bloggers on the running website LetsRun concluded that the reaction times from the 2022 World Championships seemed to be generally faster compared to the other reaction times they considered, but they did not perform any formal statistical analysis. This paper asks whether athletes who competed at the 2022 World Track and Field Championships had similar reaction times at other competitions and whether 0.1 seconds is a reasonable disqualification barrier. To do so, we employ a signed-rank test for clustered data to compare reaction times for the same athletes at different competitions and a generalized linear mixed model with a random venue (year) effect and a random heat effect within venue to model the reaction time data. This matter needs to be addressed because disqualification based on allowable reaction time will continue to be an issue in future world championships.
Abstract: Sports analytics have revolutionized the global sports landscape. Recognizing the limitations of statistics in capturing the complexities of team sports, this research project focuses on developing non-box-score player rating systems for Ultimate Frisbee. Utilizing data from the UFA (formerly the AUDL), the project employs models inspired by on/off ratings in basketball and mixed effects linear models in baseball. The results reveal intriguing patterns, challenging preconceived notions of top players, prompting questions about system oversights, and illuminating player insights. This research offers a fresh perspective on player evaluation in Ultimate Frisbee and sparks broader considerations for applications in other sports.
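The on/off idea borrowed from basketball can be sketched for Ultimate as the difference between a team's average point outcome with a player on the line and with that player off it. The data layout (a set of players per point plus a +1/-1 outcome), the player names, and the function name are hypothetical; the abstract does not specify the authors' exact formulation.

```python
def on_off_rating(points, player):
    """Average point outcome with the player on the line minus with them off it.

    points: list of (line, outcome) pairs, where line is a set of player names
    and outcome is +1 if the team scored the point, -1 if it was scored on.
    """
    on = [outcome for line, outcome in points if player in line]
    off = [outcome for line, outcome in points if player not in line]
    if not on or not off:
        raise ValueError("player needs at least one point on and one point off")
    return sum(on) / len(on) - sum(off) / len(off)


# Hypothetical points from one game (lines shortened to two players for brevity).
points = [
    ({"a", "b"}, 1),
    ({"a", "c"}, 1),
    ({"b", "c"}, -1),
    ({"b", "c"}, 1),
]

for name in ("a", "b", "c"):
    print(name, on_off_rating(points, name))
```

A raw on/off difference like this is confounded by who shares the line and who the opponent is, which is one motivation for the mixed effects models the abstract also mentions.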
Abstract: This study introduces a pose estimation model, leveraging artificial intelligence and computer vision, to analyze biomechanical changes in track and field athletes due to fatigue. The primary aim was to investigate how fatigue influences athletes' movement patterns, potentially leading to injuries and performance degradation. The method involved capturing videos from a lateral (side) view and an anterior (front) view as athletes ran five laps (1.25 miles) at a mile race pace (around 5:40 per mile for this athlete) on a standard Olympic track. The hypothesis was that athletes would exhibit signs of fatigue by the final lap, impacting their biomechanical efficiency. Using the OpenPose model, the research extracted x and y coordinates from athletes' movements during different laps, facilitating a detailed kinematic analysis. This included examining joint correlations, ankle positions, cyclical asymmetry, posture analysis, and stride lengths across laps. The results showed notable shifts in joint correlations, particularly between the ankles and shoulders, and the knees and shoulders, indicating compensatory movements due to fatigue. Ankle positions demonstrated increasing consistency, suggesting an adaptation in running technique to counteract fatigue. Additionally, variations in the cyclical asymmetry index and forward lean angles further pointed towards fatigue-induced changes in knee movement and posture. These findings were complemented by a decrease in average step length across laps, highlighting a shift to more sustainable running patterns as fatigue set in. Temporal consistency scores improved over time, implying that athletes maintained stable movement patterns despite fatigue. This suggests a high level of adaptability in the athletes’ biomechanical strategies to cope with the increasing physical demands of the task. 
This research highlights the potential of AI-driven pose estimation in sports science, offering valuable insights for enhancing training plans, preventing overtraining, optimizing peaking for later competitions, and reducing injury risks in athletics. It underscores the importance of continuous monitoring of athletes' biomechanics, not just in competition but also during training. By integrating AI and computer vision into sports science, coaches and athletes gain access to a powerful tool for real-time biomechanical analysis, enabling more informed decisions to enhance performance and prevent injuries. This study paves the way for more sophisticated and personalized training programs, tailored to the individual biomechanical profiles and fatigue responses of athletes.
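Two of the kinematic measures described in this abstract, forward lean angle and step length, can be computed directly from 2D keypoint coordinates. The sketch below assumes lateral-view pixel coordinates with the image y axis pointing down and uses hypothetical keypoint values; it does not call OpenPose and is not the study's processing pipeline.

```python
from math import atan2, degrees, hypot


def forward_lean_deg(hip, shoulder):
    """Trunk angle from vertical in degrees (positive = leaning toward +x).

    hip and shoulder are (x, y) pixel coordinates; image y grows downward.
    """
    dx = shoulder[0] - hip[0]
    dy = hip[1] - shoulder[1]  # flip so that "up" is positive
    return degrees(atan2(dx, dy))


def step_length(ankle_a, ankle_b):
    """Euclidean distance in pixels between the two ankle keypoints."""
    return hypot(ankle_a[0] - ankle_b[0], ankle_a[1] - ankle_b[1])


# Hypothetical keypoints for one frame in an early lap and one in the final lap.
lap1 = {"hip": (400, 520), "shoulder": (418, 330),
        "l_ankle": (330, 860), "r_ankle": (505, 850)}
lap5 = {"hip": (400, 520), "shoulder": (432, 335),
        "l_ankle": (345, 858), "r_ankle": (490, 852)}

for label, frame in (("lap 1", lap1), ("lap 5", lap5)):
    lean = forward_lean_deg(frame["hip"], frame["shoulder"])
    step = step_length(frame["l_ankle"], frame["r_ankle"])
    print(label, round(lean, 1), round(step, 1))
```

Pixel distances depend on camera placement and zoom, so comparing laps this way assumes a fixed camera; converting to meters would require a known reference length in the frame.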
Abstract: This presentation delves into the analysis of first-lap incidents in Formula 1 racing during the new 'ground effect' era (2022 onwards), aiming to understand the factors influencing their occurrence and severity. My goal was to quantitatively examine how variables such as starting grid position, corner, race format (sprint or full-length), rain conditions, and whether the track is a street circuit or purpose-built impact the frequency and severity of first-lap incidents. I extracted data for 120 incidents from race reports and Fédération Internationale de l'Automobile (FIA) debriefs; the severity of an incident was operationalized through a proprietary 0-4 scale metric called "Degree of Impact" (DoI). Through preliminary data analysis, regression modeling, and multivariable hypothesis testing, this presentation seeks to identify which factors most influence crash frequency and DoI. The findings may provide valuable insights for F1 teams to strategize and mitigate risks during crucial first-lap moments, contributing to safer and more competitive racing.
Abstract: This study investigates whether USA Fencing referees exhibit favoritism towards competitors from the same designated regions, as they frequently officiate bouts involving fencers in their respective areas. This regular pairing establishes familiarity, which may lead to unconsciously biased officiating as suggested by prior research. This is particularly relevant in foil and saber, where simultaneous hits require subjective judgement calls on "right-of-way" to determine who scores the point. Furthermore, penalty decisions also hold potential for bias in epee. Unlike other sports with subjective scoring, fencing relies solely on a single referee, granting them substantial control over scoring. Analyzing a seven-year dataset (2012-2019) of 33,555 Division 1 pool bouts and using multiple linear regression to control for fencer skill and region, this study found no significant evidence of referees favoring fencers from their own region. The p-values for foil, epee, and saber were 0.437, 0.831, and 0.404 respectively. The lack of compelling evidence for regional favoritism in referees' decisions, despite the inherent subjectivity in officiating fencing bouts, might encourage a re-assessment of similar assumptions in other sports. These findings challenge the common assumption of regional bias in sports officiating and highlight the need for further research into the factors influencing referee decision-making.
Abstract: This research project aims to apply a new machine learning technique in the context of predicting MLB batting averages. Probabilistic Bayesian neural networks are a new machine learning technique that attempts to account for both epistemic and aleatoric uncertainty by using Bayesian methods. Epistemic uncertainty is addressed with a Bayesian neural network by choosing weights for a neural network from a distribution, allowing there to be different outputs for the same input. Aleatoric uncertainty is accounted for with a probabilistic Bayesian neural network - which outputs a distribution of values. Confidence intervals can be constructed from this distribution to provide a range of values for the response variable. In the context of this application, we can obtain a confidence interval for a player’s batting average in the following season. These three methods (standard neural network, Bayesian neural network, and probabilistic Bayesian neural network) will all be compared to evaluate their effectiveness and usefulness when predicting a player’s batting average for the following year.
Abstract: In this research endeavor, the capabilities of neural networks are utilized to discern regional patterns within a diverse populace via marathon finishing times. Leveraging two decades of race data from both the TCS New York Marathon along with the prestigious Boston Marathon, the study aims to develop a predictive model capable of identifying individuals' geographic origins based on their previous athletic performances. This exploration holds promise in shedding light on the intersection of environmental variables and ultimate race achievements in the realm of marathon running.
Abstract: This project aims to answer the question of team selection for USA Gymnastics for the Summer Olympic Games. By simulating athletes' scores based on their historic performance and incorporating a system of weights to allow for effective and robust comparisons between different medal objectives, we introduce an interactive tool which provides a framework to select teams, evaluate performances and make data-driven decisions in selecting Team USA.
Abstract: In preparation for the Summer Olympics, USA Gymnastics Selection Committees are tasked with assembling men’s and women’s teams of five gymnasts each. Considering that Olympic competition encompasses both team and individual events across a range of apparatuses, determining the optimal team presents a complex challenge. This project, as part of the UCSAS 2024 USOPC Data Challenge, seeks to determine the “best” possible men’s and women’s gymnastics teams for the United States at the 2024 Summer Olympics. In our analysis, we use score data from gymnastics competitions held in the 2022 and 2023 seasons to fit a mixed-effects model for each apparatus (four total for women, six total for men) that predicts competition score using gymnast, competition round, and competition location as predictors. We use the results of these models to simulate Olympic competition results for a selection of potential U.S. gymnastics teams. To do this, we first build normal distributions for each athlete and apparatus using the predicted scores from our mixed effects model. We then sample from these normal distributions in simulations of full Olympic competitions, which closely follow the Olympic rules and multi-stage competition format. Analyzing our simulation results, we find that a few select gymnasts are present on all the highest-performing teams: Brody Malone, Paul Juda, Yul Moldauer, and Khoi Young on the men’s team, and Simone Biles, Shilese Jones, and Jade Carey on the women’s team. However, the exact composition of the ideal team depends on what the U.S. Olympic Committee’s medal-winning priorities are — for example, whether the committee finds maximizing the chances of a gold medal on a specific event or winning the most medals of any color across all events to be more important. 
We present simulation results for hundreds of top teams in an R Shiny web application, which for each team displays an average overall weighted medal count and a detailed view of the probabilities of individual athletes winning specific medals. The app features adjustable medal weights and options to filter teams by specific gymnasts, allowing selection committees to interactively judge teams according to different priorities.
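The simulation idea can be sketched in a few lines. The example below is a heavily simplified, single-apparatus version: it samples one score per athlete from a normal distribution, awards medals to the top three, and averages a weighted medal count for the team. The function name, the default weights, and all athlete names and score parameters are hypothetical; the project itself follows the full multi-stage Olympic format, which is not reproduced here.

```python
import random

def expected_weighted_medals(team, field, weights=(3.0, 2.0, 1.0),
                             n_sims=2000, seed=0):
    """Estimate a team's average weighted medal count on one apparatus.

    team, field: dicts mapping athlete name -> (mean score, standard deviation).
    weights: value assigned to gold, silver, and bronze (adjustable priorities).
    """
    rng = random.Random(seed)
    everyone = {**field, **team}
    total = 0.0
    for _ in range(n_sims):
        # Draw one score per athlete from that athlete's normal distribution
        scores = {name: rng.gauss(mu, sd) for name, (mu, sd) in everyone.items()}
        podium = sorted(scores, key=scores.get, reverse=True)[:3]
        # Add the weight of every podium place taken by a team member
        total += sum(w for name, w in zip(podium, weights) if name in team)
    return total / n_sims

# Hypothetical score distributions (mean, sd)
team = {"Gymnast A": (14.6, 0.3), "Gymnast B": (14.1, 0.4)}
field = {"Rival 1": (14.5, 0.3), "Rival 2": (14.3, 0.5), "Rival 3": (13.9, 0.4)}
print(expected_weighted_medals(team, field))
```

Changing `weights` mimics the adjustable medal priorities described above, for example `(1, 0, 0)` to count only gold medals.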
Abstract: In the middle of a marathon, a runner’s expected finish time is commonly estimated by extrapolating the average pace covered so far, assuming it to be constant for the rest of the race. These predictions have two key issues: the estimates do not consider the in-race context that can determine if a runner is likely to finish faster or slower than expected, and the prediction is a single point estimate with no information about uncertainty. We implement two approaches to address these issues: Bayesian linear regression and quantile regression. Both methods incorporate information from all splits in the race and allow us to quantify uncertainty around the predicted finish times. We utilized 15 years of Boston Marathon data (312,805 runners total) to evaluate and compare both approaches. Finally, we developed an app for runners to visualize their estimated finish distribution in real time.
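For reference, the constant-pace baseline criticized above reduces to a single proportion. The sketch below is only that baseline, not the Bayesian or quantile regression approaches; the function name and the example split are assumptions for illustration.

```python
MARATHON_KM = 42.195

def constant_pace_finish(elapsed_s, distance_km, race_km=MARATHON_KM):
    """Project a finish time (seconds) assuming the average pace so far holds."""
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    return elapsed_s * race_km / distance_km

# Hypothetical runner: 30 km covered in 2 h 30 min
projected = constant_pace_finish(2.5 * 3600, 30.0)
print(f"{projected / 3600:.2f} hours")
```

The output is a single point estimate with no uncertainty attached, which is exactly the limitation the two regression approaches are meant to address.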
Abstract: Major League Baseball (MLB) is heavily populated with player contracts exceeding $100,000,000. The purpose of this project is to analyze player performance within the 50 most valuable contracts signed in MLB history. For each contract, overall player performance in the year prior to signing will be compared with performance in the first year played under the new contract. We hypothesize that, due to the strong monetary incentive in the final year before free agency, player performance in the year prior to the new contract will be stronger than first-year performance under the new contract. In this project, average WAR in Contract Years and First Years will be estimated, and we will test whether there is a significant difference in these values while adjusting for confounding variables. We will also analyze secondary measures of interest.
Abstract: We seek to provide the United States Olympic and Paralympic Committee with a clear analysis framework for selecting the best men’s and women’s Artistic Gymnastics teams at the Paris 2024 Olympics. We consider all gymnasts that participated in at least three competitions between the Tokyo Olympics and the 2023 World Artistic Gymnastics Championships, inclusive, with two apparatuses in those competitions and at least four scores in each apparatus to be contenders for Paris 2024. There are 40 male gymnasts and 29 female gymnasts that meet these criteria, and we create time-series forecasts for the scores in every apparatus each of these 69 gymnasts competed in since the Tokyo Olympics. From prior score observations in the time-series analysis, we compute 95% prediction intervals for the apparatus scores of all contenders. Using these score intervals, we program a Monte Carlo machine that calculates the expected medal count of teams of gymnasts, comparing present-day gymnasts’ forecasted scores with the 2012 and 2016 results. It is important to note that many mathematically possible teams were filtered out from thresholds given by regression trees. According to Olympedia data from every Olympic year between 1956 and 2016, excluding the strike year 1980, men's teams with mean height <166.5cm and women's teams with mean height <155.5 cm, mean weight <48.75, and a mean age between 18 and 19 years have ranked much higher than teams that did not meet those thresholds. As such, we only include teams of gymnasts whose averages were inside these ranges in the Monte Carlo machine. The machine, which runs 8 full simulations of estimated team performances at the Paris 2024 Olympics, outputs Simone Biles, Jade Carey, Shilese Jones, Evelynne Lowe, and Zoe Miller as the highest-ranking team with an expected Olympic score of 9.65 and a standard deviation of 2.3. 
It considers the team of Alex Diab, Caleb Melton, Noah Newfeld, Blake Sun, and Khoi Young as the highest-ranking men’s team, with an expected Olympic score of 5.75 and a standard deviation of 0.87.
Abstract: Selecting the optimal gymnastics team for the Olympics is a complex task that requires careful consideration of individual athlete performance across various apparatuses. Our study presents a data-driven approach to select the most promising US men's and women's artistic gymnastics teams for the 2024 Paris Olympics, aiming to maximize their overall medal performance. We integrate multiple data-driven methods, including Kernel Density Estimation (KDE), Gaussian Distribution, K-Nearest Neighbors (KNN), and a genetic algorithm for simulation. Our approach involves constructing probability densities for individual athletes' scores on each apparatus using historical data from 2017-2023 from different countries' athletes and inputting these into a simulation to predict the expected medal performance for various team configurations. The genetic algorithm efficiently explores a vast number of possible team combinations to identify the highest-scoring team, while parallel processing is employed to significantly accelerate the large-scale simulations, making the process computationally feasible. We use the same distribution estimation model for all athletes to ensure fairness, focusing on relative rankings rather than absolute scores. This approach allows for reliable comparisons that reflect actual performance and avoids bias from limited score records. This new approach to selecting gymnastics teams introduces interesting methods for sports analytics.
Abstract: This project builds a model for NHL "rebuilds". We define a rebuild in the NHL as the time it takes for a team to go from being successful, to struggling for a few years, and then back to being successful. The most successful franchises are those that can limit the time in the rebuilding stage and constantly be contenders for the Stanley Cup. The goal of this project is to estimate the average time it takes for a rebuild to occur. The response variable of interest in this analysis is Points_perc, which represents the points percentage of a team for a season. This variable is calculated as the number of points the team ended the season with divided by the total possible points. The covariate is NHL season centered at the year in which each franchise had its highest Points_perc. For our model we chose a random effects cyclic model centered at the highest Points Percentage for each franchise. The data for this project comes from Wikipedia, specifically each NHL team's season-by-season results. Franchises that have moved have been combined into one franchise; for example, the New Jersey Devils franchise comprises the Kansas City Scouts, the Colorado Rockies, and the New Jersey Devils. For a long time the NHL had a small number of teams, so only the years after 1980 are included. This is the year the NHL reached 21 teams, which we take as the beginning of the modern NHL.
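The response variable and the centering step can be illustrated with a short sketch. It assumes two points are available per game (the NHL maximum for a win); the function names and the season values are hypothetical and are not drawn from the project's data.

```python
def points_percentage(points, games_played, points_per_game=2):
    """Points earned divided by the total points available in the season."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return points / (games_played * points_per_game)

def center_at_peak(seasons):
    """Re-index seasons relative to the franchise's best season.

    seasons: list of (year, points_perc) tuples.
    Returns a list of (year - peak_year, points_perc) tuples.
    """
    peak_year = max(seasons, key=lambda s: s[1])[0]
    return [(year - peak_year, pp) for year, pp in seasons]

# Hypothetical franchise history: (season, points, games played)
raw = [(2001, 70, 82), (2002, 95, 82), (2003, 110, 82), (2004, 88, 82)]
seasons = [(year, points_percentage(pts, gp)) for year, pts, gp in raw]
print(center_at_peak(seasons))
```

After centering, year 0 is each franchise's peak season, so franchises can be compared on a common time axis.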
Abstract: If you watched a football game on ESPN in 2023, at some point a team had the ball on its opponent's side of the field on fourth down and the announcer said something like “ESPN analytics says they should go-for-it at this point”. Beyond attributing the recommendation to ESPN analytics, the broadcast gives no information on why or how it was reached. In this paper, we use XGBoost to determine when a team should go-for-it. Go-for-it is a football term for a team on fourth down attempting to gain a first down instead of attempting a field goal or punting the ball. This is currently seen as a risky choice, and coaches who follow these analytics are seen as gamblers or unconventional; however, we will show why it should be the standard in many situations. The remainder of this paper is laid out as follows: Section 1 presents a literature review of previous work on this and similar topics; Section 2 discusses the data used for this paper; Section 3 discusses the methodology used to build the model that determines when a team should try for a first down; Section 4 provides the results of our model, including specific examples from our data where a team should have gone for it and a full discussion of why; finally, the conclusions and limitations can be found in Section 5.
Abstract: The UCSAS 2024 USOPC Data Challenge provided a unique opportunity to address the intricate task of selecting the most suitable artistic gymnasts for Team USA at the upcoming Paris 2024 Olympics. The dynamic nature of gymnastics scoring underscores the complexity of this selection process. Our project aims to revolutionize athlete selection by developing an interactive tool that harnesses historical performance data to provide actionable insights for the US Olympic Committee. The pressing issue in the gymnastics community of optimizing athlete selection for maximum team performance at the Paris 2024 Olympics is tackled by leveraging advanced data analytics. The aim is to streamline the decision-making process, ensuring that only the most deserving athletes represent Team USA on the global stage. Our approach introduces novel methodologies in athlete selection, departing from traditional subjective assessments to data-driven insights. Through comprehensive simulations and statistical analysis, we identify patterns of athlete performance, enabling informed decisions on team composition. The incorporation of a user-friendly R Shiny app empowers stakeholders to explore various team combinations and simulate outcomes, marking a significant departure from conventional selection practices. The actionable results elucidate the performance trajectories of individual athletes and team compositions. The utilization of Gibbs Sampling techniques and rigorous simulation methodologies ensures the robustness of our findings. By validating our results against historical data and real-world scenarios, we instill confidence in the efficacy of our approach, thereby enhancing state-of-the-art athlete selection strategies.
We leave behind a powerful tool for the gymnastics community, offering not only insights into optimal team compositions but also a platform for ongoing analysis and refinement. The R Shiny app serves as a valuable resource for coaches, analysts, and decision-makers, fostering continued engagement and innovation in athlete selection practices. Our project sets a precedent for the integration of data science in sports decision-making, paving the way for future advancements in Olympic athlete selection processes. In conclusion, our project demonstrates the transformative potential of data-driven approaches in shaping the future of Olympic sports. Through meticulous analysis and innovative solutions, we strive to elevate the performance of Team USA and inspire excellence in the global gymnastics community.
Abstract: Our research responds to the growing need for a statistically robust approach to estimate the called strike zones of MLB umpires, and introduces a novel metric towards evaluating umpire accuracy. Previous research on this topic does not provide both the explicit representation of a flexible estimated strike zone and uncertainty quantification for the estimation. To advance beyond existing methodologies, we develop a period-constrained random weight neural network to learn the probability of a called strike on any location inside and around the rectangular MLB strike zone. Our model provides competitive out-of-sample prediction error when compared with previously used models. Bayesian inference is used in model fitting, which permits uncertainty quantification for the contour line. By utilizing polar coordinates, we derive an explicit form of the contour line, which facilitates the measuring of umpire accuracy through comparison with the MLB strike zone. By employing novel metrics to assess umpire accuracy, we can evaluate and compare umpire performance league wide, inform decisions about crucial game assignments and contribute to the ongoing dialogue on fair play in baseball.
Abstract: Despite the recent growth of the fitness tracker market, there is still a lack of trackers that can directly monitor electrical activity of the heart and muscles continuously over long recording periods. These types of direct measurements are necessary for the collection of medically viable information that can detect and prevent life-threatening conditions and injuries. Form factors of health-monitoring wearables must be reimagined in order to facilitate this type of recording. Contemporary wearable makers have explored smart-clothing as a new form factor for direct electrocardiogram (ECG) recording, but the scalable integration of comfortable and convenient electrodes into garments remains a challenge. Toribio Labs is leveraging a novel method of printing conductive elements directly onto fabric to create a comprehensive and continuous monitoring system capable of recording ECG, electromyography (EMG), and electrodermal activity (EDA) for use in athletics, defense, and healthcare.
Abstract: The Olympics brings about a sense of national pride felt by athletes and viewers alike, with gymnastics long being a source of international recognition for the United States, with names like Simone Biles, Gabby Douglas, and Nastia Liukin rising to global fame. It is thus vital that the US finds continued success in artistic gymnastics at the 2024 Olympics. Our goal is to build a tool that will allow us to recommend a team of five women and five men for the Olympic Games in Paris 2024 by providing an expected individual medal count for each team as well as an expected all-around team medal count. We accomplish this through the creation of score distributions for each US gymnast across all events, weighting for “clutchness” and recent performance, then using Monte Carlo methods to find their predicted medal contribution via simulation. Lastly, we built a web application that allows users to customize lineups and view the expected differences in total medals based on the selected gymnasts as well as optimize for the best five person lineups. By simulating entire distributions, users may specify the aspects of the predicted distributions that they wish to focus on.
Abstract: Evaluating player performance is one of the most critical concepts in pro football. One method to do so is Approximate Value (AV), where players are given a seasonal score based on accolades and statistics that accumulate throughout their careers. While AV reflects a player's performance historically, it lacks insight into their current worth. To address this issue, AV has been modified to create the Present AV (PAV) statistic, which shows the current value of over 2000 players through the 2023 NFL season. Along with PAV, there is also a draft value chart that gives the expected PAV for every draft pick. With both of these metrics, it is now possible to evaluate past trades, determine the appropriate draft compensation for a player, and use a statistics-based method to determine the most valuable players in the league. While the original version of this paper contained player value grades through the 2022 season, this poster presentation includes an updated version through the 2023 season and a new ordinary least squares regression model to account for age. Was the Stefon Diggs/Justin Jefferson trade really a win/win? How much value would it take to trade for Patrick Mahomes? This presentation seeks to answer both questions and more.
Abstract: Expected goals (xG) measure the quality of a shot as a scoring opportunity based on information from past shots. Using play-by-play data from the Professional Women's Hockey Players Association (PWHPA) with rebound locations, we model the probability of a rebound given the initial shot’s location in hockey, finding that shots from the inner slot are better in terms of the value of the rebound that may be generated from that shot. We then create a model which gives credit to initial shots for the rebound value generated. Comparing this to typical xG models, the largest difference occurs in the inner slot where more frequent and higher value rebounds occur.
Visualizing the location of passes leading directly to shot attempts shows that passes leading to high danger shots in terms of xG primarily originate from near the goal line, while those leading to low danger shots primarily originate from the point. As a next step, previous pass location may be a meaningful predictor to add to xG modeling.
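One simple way to express the idea of crediting an initial shot with the rebound value it generates is shown below. The formula is an assumed form chosen for illustration, not the model fitted in the study, and the input values are hypothetical.

```python
def rebound_credited_xg(shot_xg, p_rebound, rebound_xg):
    """Value of an initial shot including credit for a possible rebound.

    Assumed form (not taken from the study):
        credited = shot_xg + (1 - shot_xg) * p_rebound * rebound_xg
    That is, if the shot does not score, a rebound occurs with probability
    p_rebound and that rebound chance is worth rebound_xg.
    """
    for p in (shot_xg, p_rebound, rebound_xg):
        if not 0.0 <= p <= 1.0:
            raise ValueError("inputs must be probabilities in [0, 1]")
    return shot_xg + (1.0 - shot_xg) * p_rebound * rebound_xg

# Hypothetical inputs for an inner-slot shot and a shot from the point
print(rebound_credited_xg(0.20, 0.15, 0.30))
print(rebound_credited_xg(0.03, 0.05, 0.10))
```

Under this form, the gap between credited value and plain xG grows where rebounds are both more likely and more valuable, which matches the inner-slot pattern described above.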
Abstract: Every season since 1911 Major League Baseball has handed out two Most Valuable Player Awards, and since 1931 the award has been voted on by members of the Baseball Writers Association of America (BBWAA). The definition of "Most Valuable" the BBWAA gives isn't specific, but it is generally agreed that it's given to the best player. There has long been a sentiment among baseball fans that many of the earliest winners of this award might have been undeserving when held under the scrutiny of modern statistics and perspectives. My project uses statistical tools to determine whether the members of the BBWAA are getting better at selecting the most valuable player in both the American and National Leagues every season. I did this by dividing the game into three phases: offense, defense, and base running. Using selected stats from each phase, each player receives a value for that phase, and these values are added together to get the player’s overall value score. I then compare the player with the highest value score to the actual winner of the award. Pitchers were not included in this study because the way they impact the game is different from position players. When a pitcher won MVP, the results are compared to the highest-placing position player. Results suggest that voters tend to choose a player with a higher value score more often as years pass.
Abstract: With the 2024 Paris Olympics approaching, the United States (US) is a strong contender in Artistic Gymnastics. We therefore determine the five-member Men’s and Women’s gymnastics teams that optimize the US's performance at the 2024 Paris Olympics. To accomplish this, we built three different teams based on different metrics. We first defined the most successful team as the team with the most individual event medals. For this, we used the Type II error in unpaired t-tests to approximate the probability that an upset will occur in the medal rankings. These probabilities took into account both a competitor’s consistency and their overall performance. We then defined the most successful team as the team with the highest team all-around score, determined through a linear assignment model. Finally, we introduce an additive power model to identify team members through an “all-rounder”-based strategy. We then compared the proposed teams from our three methods to determine our final suggested lineup.
Abstract: As swimmers look into colleges, they may wonder how much they will improve and which division they will improve the most in. A common assumption is that Division I attracts the fastest swimmers. However, many swimmers may still swim faster or improve more in Division II and III than in Division I. In this project, we collect the 50 freestyle times of 611 California high school swimmers from 2014 to 2022 and their college times from 2015 to 2023. We investigate the longitudinal best times of these swimmers and compare them across genders and divisions to address the potential concerns of a high school swimmer. Our preliminary result shows that the faster swimmers always drop less time than slower swimmers. We will further study the different time-dropping rates in high school and different college divisions.