This is the fifth of five posts on History 100S: Text Analysis for Digital Humanists and Social Scientists, a course I taught in Spring 2017 that exposed UC Berkeley students to cutting-edge computational text analysis techniques. In this post we introduce a Python notebook that describes text encoding and decoding and explains why they matter for text analysis.

If you've done any sort of computer-assisted text analysis, using Python or R or another programming language, you have likely run into an encoding error. If you haven't seen an encoding error, it's likely that you have seen weird or unexpected characters in your text or output. If you haven't seen either of these problems yet, you will.
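To make those two symptoms concrete, here is a minimal Python 3 sketch. It is an illustration written for this post rather than an excerpt from the notebook, and the sample word and the choice of codecs are assumptions. It shows the same bytes producing either an error or garbled characters, depending on which encoding is used to read them.

```python
# A sample string containing one non-ASCII character (illustrative only).
text = "café"

# Encoding turns a string into bytes; here we assume UTF-8.
utf8_bytes = text.encode("utf-8")

# Symptom 1: decoding with a codec that cannot represent the bytes
# raises an encoding error.
try:
    utf8_bytes.decode("ascii")
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)

# Symptom 2: decoding with the wrong (but permissive) codec succeeds
# silently and produces unexpected characters.
garbled = utf8_bytes.decode("latin-1")

print(error_message)
print(garbled)
```

The first failure is loud and easy to notice. The second is quiet, which is why odd characters often go undetected until they show up in word counts or output.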

Encoding and decoding text can be confusing and frustrating, but it's really important for anyone doing computer-assisted text analysis to understand the fundamentals. Although a lot has been written on encoding and decoding text, I couldn't find anything suitable for teaching these issues to non-computer science majors. Consequently, my wonderful teaching assistant, Leon Liang, and I created a Python notebook to introduce the fundamentals. The notebook is based on a lecture that Leon Liang wrote for my course, which I have adapted for public use here.

This notebook describes what encoding and decoding do, why they are necessary, and how to troubleshoot your own text if you run into issues. This notebook is for those doing text analysis, but also for those teaching text analysis. We include exercises throughout the notebook for use in the classroom, or for you to test your understanding of these issues yourself!
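As one example of what troubleshooting can look like, the sketch below tries a short list of candidate encodings until one decodes without error. The helper name `decode_with_fallback` and the list of candidate encodings are hypothetical choices made for this illustration; they are not taken from the notebook.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used).

    Hypothetical helper for illustration. The candidate list is an
    assumption and should be adapted to where your texts came from.
    """
    for name in encodings:
        try:
            return raw_bytes.decode(name), name
        except UnicodeDecodeError:
            continue  # this codec cannot read the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes as they might arrive from a file saved on an older Windows machine.
sample = "naïve".encode("cp1252")
decoded_text, used_encoding = decode_with_fallback(sample)
print(decoded_text, used_encoding)
```

A successful decode only means the bytes were valid in that encoding, not that the encoding was the right one, so it is still worth reading a sample of the result by eye. Because `latin-1` accepts any byte sequence, it works as a last resort but can return garbled text.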

Please use and modify this notebook to best fit your needs. You can do so by cloning this GitHub repository, which includes the notebook in the scripts folder, images in the images folder, and sample texts in the data folder.

 

Posts in this series:

Text Analysis for Digital Humanists and Social Scientists, Part 1: Introduction

Text Analysis for Digital Humanists and Social Scientists, Part 2: Looking Through Legacies: the Role of Identity and Profession in Biographies

Text Analysis for Digital Humanists and Social Scientists, Part 3: The Evolution of Modern Hip Hop

Text Analysis for Digital Humanists and Social Scientists, Part 4: An Exploratory Topical Analysis of Obama's Speeches

Text Analysis for Digital Humanists and Social Scientists, Part 5: Text Encoding and Decoding

Author: 

Laura Nelson

Laura Nelson is an Assistant Professor of Sociology at Northeastern University, the author of “Computational Grounded Theory: A Methodological Framework,” and a contributor to various blog forums, most recently the orgtheory.net forum on data analytics and inclusivity.