Applied Machine Learning in R

This course was developed by Jeffrey Girard and Shirley Wang for the Pittsburgh Summer Methodology Series (June 26–29, 2023).


Jeffrey Girard, University of Kansas
Shirley Wang, Harvard University

Whereas statistical methods traditionally used in the social and behavioral sciences emphasize interpretability and quantification of uncertainty, machine learning methods emphasize complexity and accuracy of predictions. Machine learning methods are thus particularly well-suited for applications where (1) there are nonlinear and complex relationships among a large number of predictor variables and (2) accurately predicting the outcome variable is more important than fully understanding the relationships between variables.

This workshop will provide a hands-on introduction to the application of machine learning techniques in R using the tidymodels packages. It will emphasize practical knowledge and conceptual intuitions (e.g., teaching you how to drive a car) rather than technical and theoretical mastery (e.g., teaching you how to build a car). In addition, rather than briefly surveying the full breadth of available machine learning techniques, this workshop will provide a deep dive into three supervised learning methods with broad applicability in the social and behavioral sciences: regularized regression models (e.g., GLMNET), random forest ensembles, and support vector machines (SVM).

Taken together, this practical focus means attendees will learn how to formulate a good research question, prepare data for analysis, set up a rigorous cross-validation procedure, evaluate predictive performance, and interpret and report results for a scientific audience.
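
To give a rough sense of what this workflow can look like in tidymodels, here is a minimal sketch. It is not taken from the course materials: the built-in mtcars dataset, the outcome (mpg), the 80/20 split, the 5 folds, and the fixed mixture value are all illustrative assumptions, and it assumes the tidymodels and glmnet packages are installed.

```r
library(tidymodels)

set.seed(2023)

# Hold out a test set (mtcars is a stand-in dataset, not course data)
data_split <- initial_split(mtcars, prop = 0.8)
train <- training(data_split)

# Prepare the data: put all numeric predictors on a common scale
rec <- recipe(mpg ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# Regularized regression (elastic net) with the penalty left to be tuned
spec <- linear_reg(penalty = tune(), mixture = 0.5) %>%
  set_engine("glmnet")

# Bundle preprocessing and model so both are re-estimated within each fold
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# Cross-validate on the training set only
folds <- vfold_cv(train, v = 5)
tuned <- tune_grid(wf, resamples = folds, grid = 10,
                   metrics = metric_set(rmse))

# Pick the best penalty, refit on all training data, evaluate once on the test set
best <- select_best(tuned, metric = "rmse")
final_fit <- finalize_workflow(wf, best) %>%
  last_fit(data_split)

collect_metrics(final_fit)
```

Swapping in a different model type (for example, rand_forest() or svm_rbf() from parsnip, in place of linear_reg()) leaves the rest of the pipeline largely unchanged, which is one reason a workflow-based setup is convenient.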

Although attendees of all backgrounds are welcome and the skills taught will be broadly applicable, example datasets and advice will be tailored specifically to the social and behavioral sciences (e.g., psychology, medicine, education, and related fields). Workshop attendees are not expected to have any background knowledge of machine learning. Some proficiency with R (e.g., knowing how to import data and manipulate data frames) will be assumed, and some familiarity with statistical modeling (e.g., multiple regression and generalized linear models) will be helpful. Attendees who are also new to R are encouraged to enroll in the Introduction to R for Social Scientists workshop, which takes place several weeks before this one.

These materials are made freely available and may be re-used according to the CC-BY License.