Getting Started with Machine Learning at the D-Lab

January 8, 2017

Evan Muzzall, Chris Hench, Chris Kennedy

 

Machine learning is a high-value concept for the social sciences and humanities. Classification and regression models are becoming more common in these disciplines, and student interest is growing rapidly. However, machine learning has a steep (if not frightening) learning curve that dissuades many from pursuing these interests.

 

Fall 2016 marked the latest instance of the UC Berkeley D-Lab Machine Learning Working Group. It had a couple of predecessors: "http://dlab.berkeley.edu/working-groups/berkeley-machine-learning-group" in the summer of 2014 and "http://dlab.berkeley.edu/working-groups/neural-networks-machine-learning" in the fall of 2014. We met on alternating Fridays in an informal lunch-hour setting to introduce participants from all backgrounds to common machine learning algorithms (decision trees, random forests, gradient boosted machines, and elastic net regularization) in R and Python, with the aim of teaching responsible application.

 

Participants' motivations varied widely, as students came from departments across campus. Our principal challenge as instructors was to condense complex topics into hour-long sessions so that participants could grasp the basic concepts for their individual applications. Because coding walkthroughs took precedence, in order to demonstrate how these algorithms function in multiple programming languages, theoretical background was discussed only superficially, and the caveats of each method did not receive proper attention due to time constraints.

 

The major challenge for Spring 2017 is to integrate adequate theoretical background with the corresponding coding walkthroughs. Feedback indicates that we should increase session lengths from sixty to ninety minutes so that they can be divided into thirds: the first thirty minutes for theoretical overviews, the middle thirty for R coding walkthroughs, and the final thirty for Python walkthroughs. More time will allow for higher-quality discussions about critical topics such as the utility of ensemble methods, identifying inappropriate models, and overfitting.

 

Despite the depth and complexity of machine learning, we are confident that it is possible to reach a broad audience with diverse interests even within a relatively short period of time. However, we always encourage participants to explore relevant math, statistics, computer science, and data science resources on campus, such as the UC Berkeley Department of Statistics, Interdepartmental Group in Biostatistics, and EECS.

 

We are happy with our short-term goal of providing the UC Berkeley community with space, support, and resources for an introduction to machine learning applications. Integrating student feedback will help us pursue our long-term goal of developing the machine learning community through collaborations with other D-Lab working groups and campus departments. Please check out the D-Lab introductory R and Python series if you want to get started.

 

We are looking for student presentations in Spring 2017. If you are interested, please contact Evan Muzzall at evan.muzzall@berkeley.edu. All skill levels and backgrounds are welcome.

Author: 

Evan Muzzall

Evan earned his PhD in Biological Anthropology from Southern Illinois University Carbondale, where he focused on spatial patterns of skeletal and dental variation in two large necropoles of Iron Age Central Italy (1st millennium BCE).