For this data challenge, your goal is to use new baseball data on bat speed and swing length to analyze some aspect of the pitcher/batter interaction. We provide pitch-level data from Baseball Savant for 346,250 Major League Baseball plate appearances from 4/2/2024 to 6/30/2024, including relevant Statcast data along with bat speed and swing length on pitches with a swing tracked. Data from the second half of the season will be added after the conclusion of the regular season. Your analysis should involve bat speed and swing length to study any topic related to the batter, pitcher, or batter-pitcher interaction during an at bat.
Since these data are new, there are a variety of topics that have not previously been studied. Below are a few example topics. However, we note that this list is far from exhaustive. Participants should feel free to study any aspect of the batter, pitcher, or batter-pitcher interaction that interests them, provided that bat speed and swing length are used in the analysis in some meaningful way.
The data is in this shared folder. We recently added a file that has data for the whole season (full regular season and playoffs), and has the column arm_angle included in the data along with some additional contextual variables and other information. The new columns were added to the right of the data frame. The new columns are estimated_slg_using_speedangle through arm_angle. See the file statcast_pitch_swing_data_20240402_20241030_with_arm_angle.csv.
This has the same format as the original data file we provided in the file statcast_pitch_swing_data_20240402_20240630.csv. The only differences are that (1) the new file contains all data from the 2024 season, not just the first half of the season through June 30, 2024, (2) it has the additional columns estimated_slg_using_speedangle through arm_angle, and (3) the redundant columns pitcher_1 and fielder_2_1, which were identical to pitcher and fielder_2, were removed.
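If you have code written against the original file, one way to confirm these differences is to compare the header rows of the two files. The following is a minimal sketch using only the Python standard library. The helper names `read_header` and `compare_headers` are illustrative and are not part of any provided code.

```python
import csv


def read_header(path):
    """Return the list of column names from the first row of a CSV file."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


def compare_headers(old_header, new_header):
    """Report which columns appear in only one of the two headers.

    Columns only in the new file are listed in file order, so the first and
    last entries show the range of added columns.
    """
    old, new = set(old_header), set(new_header)
    return {
        "only_in_old": sorted(old - new),
        "only_in_new": [col for col in new_header if col not in old],
    }


# Example usage, assuming both files were downloaded to the working directory:
# old = read_header("statcast_pitch_swing_data_20240402_20240630.csv")
# new = read_header("statcast_pitch_swing_data_20240402_20241030_with_arm_angle.csv")
# print(compare_headers(old, new))
```

Based on the description above, `only_in_old` should contain pitcher_1 and fielder_2_1, and `only_in_new` should run from estimated_slg_using_speedangle to arm_angle.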
We left that original file statcast_pitch_swing_data_20240402_20240630 in that folder as well, in case anyone wants it for existing code. However, you only need to use the new statcast_pitch_swing_data_20240402_20241030_with_arm_angle.csv file, since that has all of the data from the original file, and more.
The csv file is large and may not open in the online version of Excel in your web browser. You'll have to click Download and create a local copy on your computer. The file contains data up through June 30, 2024. Data from the second half of the season will be added at the end of the regular season.
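Because the file is large, it can help to stream it row by row instead of opening it all at once. The sketch below uses Python's standard csv module and keeps only pitches where a swing was tracked. The column names `bat_speed` and `swing_length`, and the set of strings treated as missing values, are assumptions. Check them against the header of your downloaded copy.

```python
import csv

# Strings treated as missing values (an assumption; adjust to match the file).
MISSING = {"", "NA", "NaN", "nan", "null"}


def iter_tracked_swings(lines, speed_col="bat_speed", length_col="swing_length"):
    """Yield one dict per pitch that has both bat speed and swing length.

    `lines` is any iterable of CSV text lines, such as an open file.
    The two tracking columns are converted to floats; other values stay strings.
    """
    for row in csv.DictReader(lines):
        speed, length = row.get(speed_col), row.get(length_col)
        if speed is None or length is None:
            continue  # column absent or row shorter than the header
        if speed in MISSING or length in MISSING:
            continue  # no swing tracked on this pitch
        yield {**row, speed_col: float(speed), length_col: float(length)}


# Example usage with a local copy of the file:
# path = "statcast_pitch_swing_data_20240402_20241030_with_arm_angle.csv"
# with open(path, newline="") as f:
#     n_tracked = sum(1 for _ in iter_tracked_swings(f))
```

Streaming this way holds only one row in memory at a time, which may be enough for simple counts or filters before moving to a data frame library.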
The data is also available in Apache Arrow format, an efficient, columnar, in-memory format designed to facilitate fast data analytics and interoperability between different systems. Example code to read the data into Julia, Python, and R is on GitHub here or on Kaggle here.
Please see here for a glossary of terms and here for background on how Statcast bat tracking data works.
Given the challenge's popularity, we would like to know the number of participating teams to adequately prepare and recruit the necessary judges. Please register for the data challenge by December 1, 2024.
Data challenge registration form.
This registration is for the data challenge only. Registration for the conference will be available at a later date.
Submit here: coming soon!
Students must submit a zip file containing:
Note: Students can include other files including any app code or supporting documents. Any code or apps included need to be self-contained and able to run on reviewers’ computers without modification.
The CSAS 2025 Data Challenge is open to students only. You must be enrolled as a high school, undergraduate, or graduate student at some point during the 2024-25 academic year. Participants must register using their school email address.
Teams must enter one of the following two tracks:
To be eligible for the High School / Undergraduate track, ALL members of the team must be high school and/or undergraduate students. Each team can have up to 3 members. The team captain, if 18 years old or over, should fill out the registration form for the entire team using their school email addresses. If all team members are under 18, a faculty advisor needs to be the point of contact and register for the team.
A panel of judges from across academia and the sports industry will judge your submissions based on the following:
Finalists (six teams: three high school / undergraduate and three graduate) will be invited to present their work at CSAS 2025 in New Haven, CT, and will receive some travel support and have their registration fees waived. Winning teams will have the opportunity to showcase their work to data scientists in the baseball industry. The winning teams will also receive a cash prize and a plaque.
An introductory workshop led by members of the CSAS organizing committee and members of the baseball industry is planned for early October. We will give a short introduction and spend most of the time on Q&A.
Another webinar may be arranged depending on demand.