Python Text Analysis Fundamentals: Parts 1-3

Instructors:

Emily Grabowski

I am a PhD student in Linguistics. My research interests include understanding how our speech production and speech perception systems constrain linguistic variation, especially as it applies to the larynx. I am also interested in integrating theoretical representations of language with speech. I approach this using a broad variety of tools/methodologies, including theoretical work, experiments, and modeling. Current projects include developing a computational tool to expedite analysis of pitch and an online perception experiment on the relationship between pitch and perceived duration.


Renata Barreto

Renata is a JD / PhD candidate at Berkeley, where her research focuses on the harms caused by machine learning models on marginalized groups. She is trained in computational social science and has interned at Twitter and Facebook. She enjoys learning both programming and human languages.


Brooks Jessup

Brooks is a Data Science Fellow at D-Lab and a Research Analyst at the Urban Displacement Project. He received his PhD from the History Department at Berkeley and was trained in Data Science at General Assembly. His work applies computational tools and methods to the study of modern cities and urban issues. At D-Lab he teaches workshops and provides consulting on geospatial analysis, machine learning, and data analytics with Python, R, SQL, and QGIS.


Daphne Yang

Daphne is a fifth-year graduate student at the School of Information with a keen interest in the intersection of healthcare and data science. She has prior work experience in public health, consulting, and research. Currently, she is a data science research intern at a DC consumer experience startup. She is particularly interested in how data can be used to power insights and help move society towards a more equitable future.


When & Where

Date:

Tue, February 16, 2021 - 2:00 PM to 5:00 PM

Wed, February 17, 2021 - 2:00 PM to 5:00 PM

Fri, February 19, 2021 - 2:00 PM to 5:00 PM

Location:

Remote (Zoom link below)

Description

Type:

Workshop

*NOTE: Due to limited resources and staff, we are only able to offer workshops to UC Berkeley affiliates, partners (LBL, UCSF), and invited guests. We respectfully ask you not to register if you are not affiliated with UCB, LBL, or UCSF.*

Part 1: This hands-on workshop walks through the common “preprocessing recipe” that serves as the foundation for many other applications, along with some basic natural language processing techniques. These include: a) removal of stopwords, numbers, and punctuation, b) tokenization, c) calculation of word frequencies and proportions, and d) part-of-speech tagging.
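As a rough illustration of steps a) through c), here is a minimal sketch that uses only the Python standard library. The stopword list, the regular expression, and the function names `preprocess` and `word_proportions` are assumptions made for this example and are not taken from the workshop materials; the workshop itself uses NLTK, which ships a much fuller stopword list and also handles part-of-speech tagging (not shown here).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK provides a fuller one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def preprocess(text):
    """Lowercase, drop numbers and punctuation, tokenize, remove stopwords."""
    text = text.lower()
    # Keep only runs of letters (with an optional internal apostrophe),
    # which discards digits and punctuation in the same pass.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [tok for tok in tokens if tok not in STOPWORDS]


def word_proportions(tokens):
    """Return each word's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


sample = "The cat sat in the hat. The hat was 2 sizes too big!"
tokens = preprocess(sample)
frequencies = Counter(tokens)
proportions = word_proportions(tokens)

print(tokens)
print(frequencies.most_common(3))
```

Doing the lowercasing before the stopword filter matters: otherwise "The" at the start of a sentence would survive the filter.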

Part 2: This hands-on workshop builds on Part 1 by introducing the basics of Python's scikit-learn package to implement unsupervised text analysis methods. It covers a) vectorization and document-term matrices, b) weighting (tf-idf), and c) uncovering patterns using topic modeling.
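To make the ideas of a document-term matrix and tf-idf weighting concrete, the sketch below builds both by hand with the standard library. This is a deliberately simplified stand-in, not the scikit-learn implementation taught in the workshop: it uses the plain formula tf × log(N / df), whereas scikit-learn's vectorizers apply smoothing and normalization by default, so the numbers will differ. The toy documents are invented for illustration, and topic modeling is not shown.

```python
import math
from collections import Counter

# Toy corpus (invented for this example).
docs = [
    "cats chase mice",
    "dogs chase cats",
    "mice eat cheese",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding the raw count of that term in that document.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

# Inverse document frequency: terms that appear in fewer documents score higher.
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}

# tf-idf: raw count weighted by how distinctive the term is across the corpus.
tfidf = [[count * idf[term] for count, term in zip(row, vocab)] for row in dtm]

for doc, row in zip(docs, tfidf):
    print(doc, [round(weight, 2) for weight in row])
```

In the third document, "cheese" and "eat" receive a higher weight than "mice" because they appear in only one document, which is the behavior tf-idf is meant to capture.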

Part 3: In this workshop we will cover the most common computational text analysis (CTA) task: supervised classification. Using the Python library scikit-learn, we will implement Logistic Regression and Random Forest methods to perform sentiment analysis. Optional: introduction to word vector representations with Word2Vec.
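The following is a minimal sketch of what supervised sentiment classification with scikit-learn can look like, assuming scikit-learn is installed. The four training sentences and their labels are invented for illustration and are far too few for real use; the workshop's actual data, features, and model settings may differ. Only Logistic Regression is shown; a Random Forest could be swapped in by replacing the final pipeline step with `RandomForestClassifier` from `sklearn.ensemble`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (invented for this example).
train_texts = [
    "great wonderful film",
    "loved this great movie",
    "awful boring film",
    "hated this awful movie",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Chain vectorization and classification so raw text goes in and labels come out.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Words unseen during training (here "a", "and", "story") are simply ignored.
prediction = model.predict(["a great and wonderful story"])[0]
print(prediction)
```

Wrapping both steps in one pipeline keeps the vectorizer's vocabulary tied to the training data, which avoids accidentally fitting it on test documents.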

Prior knowledge: We will be using the NLTK Python package, so basic familiarity with Python is required if you wish to follow along with the tutorial. Completion of D-Lab's Python FUN!damentals workshop series will be sufficient.

This workshop is one of a three-part series that will prepare participants to move forward with text analysis research, with a special focus on humanities and social science applications.

Text Analysis Fundamentals: Basic Tools and Techniques (Part 1)
Text Analysis Fundamentals: Unsupervised Approaches (Part 2)
Text Analysis Fundamentals: Supervised Methods (Part 3)

Getting started & software prerequisites:

We will learn how to implement text analysis methods with Jupyter Notebooks.

To run the code on your computer, you will need to have Python 3 installed as well as some additional libraries. Anaconda is a free product that makes the installation process easy. It bundles together the Python language and a whole bunch of additional packages that we often rely on in our workshops. This way, you only have to download and install one thing. To use this method, visit this site and follow the instructions for your operating system to download the Python 3.x version (it might be 3.6, or 3.7, or higher). Please, please, please download the 3.x version, not the Python 2.x version. You may have a choice between using the graphical installer or the command line installer. Use whichever you're comfortable with, but the graphical one is easier.
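Once installation is finished, a quick check like the one below can confirm that Python 3 is active and that the libraries named in the description can be found. This is only a sketch: the function name `check_environment` is hypothetical, and the package list (import names `nltk` and `sklearn`) is inferred from the workshop description rather than from an official requirements file.

```python
import sys
from importlib.util import find_spec

# Import names assumed from the workshop description (NLTK and scikit-learn).
REQUIRED = ["nltk", "sklearn"]


def check_environment(packages=REQUIRED):
    """Report whether Python 3 is running and each package can be located."""
    report = {"python3": sys.version_info.major == 3}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = find_spec(name) is not None
    return report


report = check_environment()
for item, ok in report.items():
    print(f"{item}: {'ok' if ok else 'MISSING'}")
```

Running this inside a Jupyter Notebook cell checks the same environment the notebook kernel uses, which may differ from the one in your terminal.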

IMPORTANT: Please download the material for day 1 using the link below and save the folder on your desktop. The content may change between workshops, so make sure you have downloaded the most recent version before each workshop.

Link:

Keyword:

Software Tools, Python, Text Analysis

Primary Tool:

Python

Details

Training Learner Level:

Basic Competency