I use data science to study political learning, organization, and mobilization among marginalized populations. I have always loved programming, and I want to serve people who lack voice and representation in society. I am blessed to have found and chosen computational social science, a field situated between social science and data science, as my main research area.

I also love teaching people how to code, especially social scientists, and I take that mission seriously. I have taught computational tools and techniques at both the graduate and undergraduate levels, in semester-long courses and short workshops. I have served as a Data Science Education Program Fellow at UC Berkeley and have advised more than 40 applied data science projects, working with community partners and undergraduate students. I also co-organized the Summer Institute in Computational Social Science in the Bay Area, which has a thematic focus on using computational social science for social good. This semester, I am co-teaching a short graduate-level workshop on digital data collection in the political science department at UC Berkeley. I also teach several original workshops at D-Lab on topics ranging from SQL to functional programming, as well as R fundamentals and machine learning in R.

If social scientists want to work smart and not just hard, they need to take full advantage of the power of modern programming languages, and that power is automation. Social scientists are trained to work hard. My colleagues are some of the bravest and most curious people I have ever met. They work on almost every social problem you can imagine and are willing to go the extra mile to conduct the best research they can.

Nevertheless, I have observed that the way these brilliant people do their work is often inefficient. For instance, suppose a social scientist needs to collect data on civic organizations in the United States from websites, Internal Revenue Service reports, and social media posts. Because the number of these organizations is large, the researcher cannot collect such a large volume of data from diverse sources alone, so they hire undergraduates and distribute the tasks among them. This is a typical data collection plan in social science research, and it is labor-intensive. Automation is not part of the game plan. Yet it is critical: because the manual process is costly, no one is likely to replicate or update the data collection effort. Put differently, unless the process is made efficient, it is difficult to make it reproducible and scalable.

An alternative is to write computer programs that collect such data automatically, parse them, and store them in interconnected databases. Admittedly, someone may still need to maintain the data infrastructure and validate its quality. Nevertheless, this approach lowers the cost of data collection, thereby substantially increasing its reproducibility and scalability. Furthermore, the researcher can document their code and share it publicly in a GitHub repository, or even gather some of the functions they used and distribute them as open-source libraries.
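To make the idea concrete, here is a minimal Python sketch of a collect, parse, and store pipeline. It is an illustration under stated assumptions, not a description of any particular project: the sample page, the `OrgParser` class, the `parse_orgs` and `store` functions, the table layout, and the use of an EIN-style identifier as the key are all hypothetical, and the organization names are made up. A real pipeline would download pages rather than read a hard-coded string.

```python
import sqlite3
from html.parser import HTMLParser

# Made-up sample page standing in for a real website. A real pipeline would
# download pages instead (for example with urllib.request.urlopen).
SAMPLE_HTML = """
<ul>
  <li class="org" data-ein="00-0000001">Example Neighborhood Association</li>
  <li class="org" data-ein="00-0000002">Example Voter League</li>
</ul>
"""


class OrgParser(HTMLParser):
    """Collect (ein, name) pairs from <li class="org" data-ein="..."> items.

    The markup and the data-ein attribute are assumptions for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._current_ein = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "org":
            self._current_ein = attrs.get("data-ein")

    def handle_data(self, data):
        # Record the first non-empty text found inside an organization item.
        if self._current_ein and data.strip():
            self.records.append((self._current_ein, data.strip()))
            self._current_ein = None

    def handle_endtag(self, tag):
        # Forget the identifier if the item closed without any text.
        if tag == "li":
            self._current_ein = None


def parse_orgs(html):
    """Parse an HTML string into a list of (ein, name) tuples."""
    parser = OrgParser()
    parser.feed(html)
    return parser.records


def store(conn, records, source_url):
    """Save organizations and where each record came from in linked tables."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS organizations (
            ein  TEXT PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS sources (
            id  INTEGER PRIMARY KEY,
            ein TEXT NOT NULL REFERENCES organizations(ein),
            url TEXT NOT NULL
        );
    """)
    for ein, name in records:
        conn.execute(
            "INSERT OR IGNORE INTO organizations (ein, name) VALUES (?, ?)",
            (ein, name),
        )
        conn.execute(
            "INSERT INTO sources (ein, url) VALUES (?, ?)", (ein, source_url)
        )
    conn.commit()


# Run the whole pipeline against an in-memory database.
conn = sqlite3.connect(":memory:")
records = parse_orgs(SAMPLE_HTML)
store(conn, records, "https://example.org/directory")

# Join the two tables to see each organization with its source.
rows = conn.execute(
    "SELECT o.name, s.url FROM organizations o "
    "JOIN sources s ON s.ein = o.ein ORDER BY o.ein"
).fetchall()
```

Keeping the organizations and their sources in separate tables linked by a shared key is one way to read "interconnected databases": the same organization can later be tied to records from other sources without duplicating its entry. Because every step is a function, rerunning or updating the collection costs little more than running the script again.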

 If we want social scientists to conduct research more like tech start-ups by building data infrastructure first and then developing applications (including but not limited to research articles), we also need to teach and train them to write code like professionals. To achieve this level of proficiency, it is not enough for social scientists to know how to write code that works. They also need to be able to write code that is efficient, reproducible, and reusable. Programming is as valuable a skill as writing in social science research. The extent to which a researcher can automate the research process can determine its efficiency, reproducibility, and scalability.

 

Author: Jae Yeon Kim

I am a PhD candidate in Political Science and a D-Lab Senior Data Science Fellow at UC Berkeley, as well as a Visiting Student Fellow at the SNF Agora Institute’s P3 Lab at Johns Hopkins University. I study political learning, organizing, and mobilization among marginalized populations using big data and data science. I also build tools that make social science research more efficient and reproducible.