Clustering and Topic Modeling in Python

When & Where

Date:

Tue, April 5, 2016 - 1:00 PM to 3:00 PM

Location:

D-Lab: Convening Room (356 Barrows Hall)

Description

Type:

This workshop covers clustering and topic modeling in Python, primarily using scikit-learn and gensim. We will first read in a corpus, prepare the data, build a TF-IDF matrix, and cluster the documents with k-means. We will then compare the results to the LSI and LDA topic modeling approaches.
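As a rough preview of the first half of that workflow, the sketch below builds a TF-IDF matrix with scikit-learn and clusters it with k-means. It is not the workshop's own material: the four-sentence corpus, the choice of two clusters, and the variable names are assumptions made for illustration, and the LSI and LDA comparison is not shown.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus (hypothetical); the workshop uses the CMU book summaries.
docs = [
    "The detective investigates a murder in London.",
    "A detective hunts the killer after a second murder.",
    "The starship crew explores a distant alien planet.",
    "Colonists on an alien planet repair their starship.",
]

# Build the TF-IDF matrix, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Cluster the documents; k=2 is picked by hand for this toy corpus.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf)

# Recover the vocabulary in column order, then list the
# highest-weighted terms of each cluster centroid.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for cluster_id, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(cluster_id, [terms[i] for i in top])
```

On a real corpus the number of clusters usually needs some experimentation, and inspecting the top centroid terms is one simple way to judge whether a given k produces interpretable groups.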

Prerequisites: Attendees should either already have a thorough knowledge of Python, or have attended the Python for Everything series.

Please install the following packages ahead of the workshop:

Python 3 (https://www.continuum.io/downloads)

Packages:

NLTK ($ pip install nltk)
scikit-learn ($ pip install scikit-learn)
pandas ($ pip install pandas)
matplotlib ($ pip install matplotlib)
gensim ($ pip install gensim)
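One possible way to confirm the installs before the workshop is a short check like the one below. It is only a sketch: the helper name `missing_packages` is hypothetical, and the mapping assumes the usual import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Map each pip package name to the module name used in "import ...".
REQUIRED = {
    "nltk": "nltk",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "gensim": "gensim",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All workshop packages were found.")
```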

Dataset: http://www.cs.cmu.edu/~dbamman/booksummaries.html

Keyword:

Python, Simulating and Modeling

Details

Training Host:

D-Lab

D-Lab Facilitator:

Patty Frontiera

Format Detail:

Hands-on, follow-along
