When & Where
Date: Tue, April 5, 2016 - 1:00 PM to 3:00 PM
Location: D-Lab: Convening Room (356 Barrows Hall)
Description
Type: 

This workshop covers clustering and topic modeling in Python, primarily using scikit-learn and gensim. We will first read in a corpus, prepare the data, create a TF-IDF matrix, and cluster the documents using k-means. We will then compare the results to two topic modeling approaches: latent semantic indexing (LSI) and latent Dirichlet allocation (LDA).

Prerequisites: Attendees should either already have a thorough knowledge of Python, or have attended the Python for Everything series.

Please install the following packages ahead of the workshop:

Python 3 (https://www.continuum.io/downloads)

Packages:

  • NLTK ($ pip install nltk)
  • scikit-learn ($ pip install scikit-learn)
  • pandas ($ pip install pandas)
  • matplotlib ($ pip install matplotlib)
  • gensim ($ pip install gensim)
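One way to confirm the installs before the workshop is a short check like the one below. It is only a sketch using the standard library; the `REQUIRED` mapping and the `check_packages` helper are names invented here, and the one non-obvious entry is that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Display name -> import name for the packages listed above.
# Note that scikit-learn is imported under the name "sklearn".
REQUIRED = {
    "NLTK": "nltk",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "gensim": "gensim",
}


def check_packages(required=REQUIRED):
    """Return {display name: True/False} for whether each module can be located."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }


if __name__ == "__main__":
    for name, ok in check_packages().items():
        print(f"{name}: {'installed' if ok else 'MISSING'}")
```

Any package reported as missing can be installed with the matching `pip install` command from the list.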

Dataset: http://www.cs.cmu.edu/~dbamman/booksummaries.html

Details
Training Host: 
D-Lab Facilitator: Patty Frontiera
Format Detail: Hands-on, follow-along