Python Text Analysis Fundamentals: Parts 1-3

Instructors:

Donald Max Ziff

Donald "Max" Ziff is a Master's student in the School of Information and has a Ph.D. in Computer Science from the University of Chicago. He is currently Senior Data Engineer at Tripit (part of SAP Concur), with previous experience at Altiscale, Google and Documentum. He has deep experience in data management, processing and cleaning, as well as Natural Language Processing, Machine Learning and Computers in the Humanities.


Aniket Kesari

Aniket is a postdoctoral scholar at the D-Lab. He earned his PhD from Berkeley Law, where he specialized in Law & Economics. He also holds a BA from Rutgers University – New Brunswick in Political Science and History, and is a JD candidate at Yale University. His research focuses on privacy and cybersecurity law, and he is generally interested in using data science to tackle public policy problems. During his graduate career, he was a Google Public Policy Fellow, a Data Science for Social Good (DSSG) Fellow at the University of Chicago, and a Technology Policy Analyst Intern at GitHub.


Renata Barreto

Renata is a JD / PhD candidate at Berkeley, where her research focuses on the harms that machine learning models cause to marginalized groups. She is trained in computational social science and has interned at Twitter and Facebook. She enjoys learning both programming and human languages.


When & Where

Date:

Mon, April 12, 2021 - 9:00 AM to 12:00 PM

Wed, April 14, 2021 - 9:00 AM to 12:00 PM

Fri, April 16, 2021 - 9:00 AM to 12:00 PM

Location:

Remote (Zoom link below)

Description

Type:

Workshop

Overview

This three-part workshop series prepares participants to move forward with text analysis research, with a special focus on humanities and social science applications.

Part 1: Basic Tools and Techniques
Part 2: Unsupervised Approaches 
Part 3: Supervised Methods

Part 1: This hands-on workshop goes through the common “preprocessing recipe” that serves as the foundation for a variety of other applications, along with some basic natural language processing techniques. These include: a) removal of stopwords, numbers, and punctuation, b) tokenization, c) calculation of word frequencies and proportions, and d) part-of-speech tagging.
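As a rough illustration of steps a) through c), the sketch below uses only the Python standard library. The stopword list, the sample sentence, and the function names are hypothetical and chosen for this example; they are not the workshop's materials. Part-of-speech tagging is left out because it needs a trained tagger from a library such as NLTK or spaCy.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
# Real projects usually rely on a fuller list from an NLP library.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def preprocess(text):
    """Lowercase, drop punctuation and numbers, tokenize, remove stopwords."""
    text = text.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    # Note: this simple pattern also drops accented and non-Latin letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


def word_proportions(tokens):
    """Return each word's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


sample = "The cat sat on the mat. The cat saw 2 dogs!"
tokens = preprocess(sample)
counts = Counter(tokens)
proportions = word_proportions(tokens)

print(tokens)
print(counts.most_common(3))
```

Whitespace splitting is the simplest possible tokenizer; library tokenizers handle contractions, hyphenation, and non-English text more carefully.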

Part 2: This hands-on workshop builds on Part 1 by introducing the basics of Python's scikit-learn package to implement unsupervised text analysis methods. This workshop will cover a) vectorization and document-term matrices, b) weighting (tf-idf), and c) uncovering patterns using topic modeling.
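A minimal sketch of these three steps with scikit-learn might look like the following. The four example documents are made up for illustration. Latent Dirichlet Allocation is used here as one common topic model; that choice is an assumption, since the description does not say which topic modeling method the workshop uses.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: two documents about politics, two about sports.
docs = [
    "The senate passed the budget bill after a long debate",
    "Voters debate the new tax bill before the election",
    "The team won the match with a late goal",
    "Fans cheered as the team scored another goal",
]

# a) Vectorization: build a document-term matrix of raw counts.
count_vec = CountVectorizer(stop_words="english")
dtm = count_vec.fit_transform(docs)  # sparse matrix, one row per document
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# b) Weighting: tf-idf down-weights terms that appear in many documents.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)

# c) Topic modeling: LDA is fit on raw counts rather than tf-idf weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)
doc_topics = lda.transform(dtm)  # per-document topic proportions

print(dtm.shape, tfidf.shape)
print(doc_topics.round(2))
```

With a corpus this small the topics are not meaningful; the point is only the shape of the workflow, from raw text to a matrix to a fitted model.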

Part 3: In this workshop we will cover the most common computational text analysis (CTA) task: supervised classification. Using the Python library scikit-learn, we will implement Logistic Regression and Random Forest methods to perform sentiment analysis. Optional: introduction to word vector representations with Word2Vec.
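The sketch below shows one plausible shape for such a classifier in scikit-learn: tf-idf features feeding either a logistic regression or a random forest. The labeled sentences are invented for this example, and the pipeline layout and hyperparameters are assumptions rather than the workshop's actual code. Word2Vec is not shown.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples; a real study needs far more data
# and a held-out test set for evaluation.
train_texts = [
    "I loved this film, it was wonderful",
    "What a great and moving story",
    "An excellent cast and a wonderful script",
    "Great acting, I loved every minute",
    "I hated this film, it was terrible",
    "What a boring and awful story",
    "A terrible cast and an awful script",
    "Boring acting, I hated every minute",
]
train_labels = ["positive"] * 4 + ["negative"] * 4

new_texts = ["A wonderful, great film", "An awful, boring film"]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

predictions = {}
for name, classifier in models.items():
    # Each pipeline vectorizes the text, then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_texts, train_labels)
    predictions[name] = list(pipeline.predict(new_texts))

print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary learned from the training texts from leaking in information from new texts at prediction time.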

Prior knowledge: D-Lab’s Python Fundamentals or equivalent knowledge.

NOTE: D-Lab workshops normally start 10 minutes after the scheduled start time (“Berkeley Time”). We recommend that you log on at the scheduled start time to join the waiting room, where hosts will send you further information.

Link:

Keyword:

Python, Text Analysis

Training Keywords:

Computational Text Analysis, Natural Language Processing

Primary Tool:

Python

Details