Training Workshops

Each workshop runs two hours. The workshops are scheduled in parallel on the afternoon of Saturday, Oct. 5, 2019.

Introductory Level
Intermediate Level
Advanced Level

Introduction to R

Outline

"R is a language and environment for statistical computing and graphics." It is one of the most popular languages in data science thanks to its flexibility, extensibility, strong community support, open-source nature, and closeness to cutting-edge methodological developments. This workshop offers a jump start with R, assuming no prior exposure. The topics include: 1) an overview of R basics; 2) R data structures; 3) data management in R, with practice; 4) a brief introduction to some useful R packages (e.g., dplyr). The workshop will be interactive, and participants will work with a real-life sports dataset.

Instructor

Tuhin Sheikh is a Ph.D. student and a teaching assistant in the Department of Statistics, University of Connecticut, and a lecturer of Applied Statistics (on leave) at the University of Dhaka, Dhaka, Bangladesh. His current research focuses on survival data analysis with an informative dropout process. He has also been a research assistant with the UConn School of Nursing on microbiome data analysis. He is an instructor for an undergraduate-level Mathematical Statistics course. Mr. Sheikh is a proficient R user. He is enthusiastic about developing statistical methodologies for emerging research problems in machine learning and deep learning. In the past, he has led research projects and published research articles on public health problems. He has received awards for academic excellence in teaching and research. A full biography of Mr. Sheikh can be found at https://www.isrt.ac.bd/people/tsheikh/.

Prerequisites

Laptop with R/RStudio preinstalled; previous experience using R is NOT required; basic programming knowledge would be helpful but NOT required.

Training Materials

On GitHub


Introduction to Python

Outline

As a popular high-level language, Python has many features that data scientists like: it is easy to learn, object-oriented, cross-platform, and open source, and it has many extensions for machine learning. It is widely used in data science tasks from the front end to the back end. Good Python programming makes it easier to analyze sports data. After a brief introduction to Python (more details about installation and introduction are attached in “Starting Guidance”), the workshop will cover four steps in a full sports data analysis project: 1) data acquisition; 2) modeling (with modules); 3) algorithm optimization (with modules); 4) result validation. Two case studies will be demonstrated, through which participants can quickly pick up frequently used Python code.

Instructor

Jun Jin is a Ph.D. student in the Department of Statistics, University of Connecticut, working as a research assistant. He has four years of programming experience in R and Python. His research interests focus on machine learning, web crawling, text mining, distributed computing, and PySpark. He previously worked at the Digital Experience Center of PricewaterhouseCoopers, where he created machine-learning solutions for consulting projects.

Prerequisites

A laptop with Python and certain modules (pandas, matplotlib, numpy, sklearn, requests, tensorflow, keras, redis, json, logging, basketball_reference_web_scraper) preinstalled. Installation instructions are here.


Data Visualization with R

Outline

Data visualization plays an essential role in sports analytics, with a wide range of applications from data exploration to result presentation. An accurate and attractive graph or animation can describe complex data in an easy and understandable way, while a poorly designed one may deliver wrong information and confuse readers. This workshop will give a clear picture of how to generate an excellent statistical plot in data analysis and introduce a powerful graphics package in R, ggplot2. The contents include: 1) a general introduction to data visualization; 2) the grammar of graphics in ggplot2; 3) detailed instructions for generating advanced figures with ggplot2; 4) hands-on practice using data sets from basketball games; 5) extensions of ggplot2 and a brief introduction to some other graphics packages for R.

Instructor

Yiming Zhang is a Ph.D. student in the Department of Statistics, University of Connecticut. Working as a research assistant in the School of Nursing, he leads data analysis using healthcare and randomized trial data. In the past year, he also worked as a consultant in the Statistical Consulting Service in the Department of Statistics, providing statistical support for projects from a variety of research fields. With a great passion for data analysis, Yiming always tries to implement and develop statistical methodologies to solve real-world problems.

Prerequisites

Interest in statistical graphs and visualization; basic programming experience with R; a laptop with R/RStudio preinstalled.

Training Materials

On GitHub


Baseball Analytics with R

Outline

Who is the fastest pitcher in baseball currently? What is the difference between the most pitcher-friendly and the most hitter-friendly umpires? In which months are home runs more likely to occur? To answer these questions in a more efficient and scientific way, people weave statistical models and machine learning into sabermetrics. The most common purposes of sabermetrics are evaluating past performance and predicting future performance to determine a player's contributions to his team. In this 2-hour workshop, we will: 1) access baseball data; 2) create traditional graphs in R; 3) explore the relation between runs and wins; 4) model a player's career trajectory. Example code will be provided.

Instructor

Zhe Wang is a Ph.D. candidate in the Department of Statistics, University of Connecticut. She is an experienced instructor of Mathematical Statistics and Statistical Methods. Her current research interests include sequential analysis, change point detection, and sample size determination. She is also a research assistant in the Connecticut Convergence Institute. Working on large population-based public health datasets, she conducts secondary analyses to estimate the prevalence of and associations between risk factors, behaviors, disease states, and other health-related outcomes.

Prerequisites

Passion for statistics and/or baseball; experience reading and writing data in R; a laptop with R and RStudio preinstalled.

Training Materials

On GitHub


Web Scraping for Sports Data

Outline

Ever wanted to get involved in the sports analytics community? “Where can I find data?” is usually the first question. It is an important question, and learning to collect data from trusted open online sources is critical. Web scraping is a process that automates the extraction of data in an efficient way. This workshop aims to show beginners how to scrape useful information from both static and dynamic web pages, and how to convert unstructured data from websites into structured data for further analysis in R. Topics covered include a high-level overview of different web scraping techniques, exposure to several related R packages (rvest, RSelenium, deuce), and exercises on creating scraping functions. A full case study on scraping a list of sports web links, along with further data analysis, will be demonstrated.

Instructor

Wanwan Xu is a Ph.D. student in Statistics at the University of Connecticut. Her research interests include variable selection in high-dimensional data and multimodal data fusion. She is currently working as a research assistant at the UConn Health Center, providing technical support for research based on medical claims data and applying statistical methods to improve the identification of patients with suicide risk. She was an instructor and teaching assistant for several undergraduate- and graduate-level Statistics courses, with experience in coaching R and tutoring data analysis techniques.

Prerequisites

Previous experience with R; a laptop with R and RStudio preinstalled.

Training Materials

On GitHub