The Importance of Design Plans for Data Science

Since becoming a Data Fellow at the D-Lab, I have had the opportunity to assist many talented social scientists through the D-Lab’s Consulting service. A regular consulting request is to help with the research design for a new project. These requests are understandable. For empirical researchers, a high-quality research design makes or breaks a research project. In this post, I suggest a few benefits of writing a skeleton design plan before writing any code whatsoever.

One of the exciting aspects of data science is that there are so many options for estimating a result. Quick searches of popular methods books reveal a plethora of methodological choices. Because there are so many choices, it is easy to become overwhelmed and confused about which one is best. Searching Google and Stack Overflow is likely to add to the cognitive overload and lead to more confusion. While having options is exciting, “less is more” often applies most strongly to our research. Researchers and data scientists can likely do better by focusing on what they want to answer before turning to the question of how.

A relatively low-cost way to begin a research project is with a design plan. For me, a design plan is a concise statement of the research question and the proposed design to answer that question. An example question template that I use to structure my own work is available at this GitHub repository. Blair et al. provide a more formalized structure for research design plans. I like to write in an R Markdown file because I can easily incorporate code at later stages if applicable, and the files are easy to save and track. Some members of the D-Lab write entire manuscripts in R Markdown, as explained in this blog post.

The plan is initially no longer than a page of text and does not have to be a complete formalization of every possible specification I might run later. Instead, the primary objective is to convince the writer that they have a question that is answerable and worthwhile in the first place. Often, a well-posed question invites a simple answer and an estimation strategy that we can implement with simple comparisons of summary data. Because the plan is written down on (virtual) paper, it is easier to evaluate what kind of data needs to be collected and to spot any inferential challenges along the way. This helps us “fail faster”: we can discard projects that do not help answer the question of interest early in the process and course-correct. It also provides a principled reason for why we may want to use a difference in means as an estimator instead of training a large-scale convolutional neural network. By focusing from the beginning on the question of interest, we are free to find the best (which often means easiest!) way to answer the question. If we start from the tool instead of the question, we risk the equivalent of using a hammer to install electrical wires.
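To make the "simple comparison of summary data" concrete, here is a minimal sketch of a difference-in-means estimate in Python. The function name, the two groups, and the outcome values are all hypothetical and invented for illustration, and the standard error shown is one common unequal-variance choice rather than the only reasonable one.

```python
from math import sqrt
from statistics import mean, variance


def difference_in_means(treated, control):
    """Return (estimate, standard_error) for mean(treated) - mean(control)."""
    estimate = mean(treated) - mean(control)
    # Unequal-variance standard error built from the two sample variances
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return estimate, se


# Hypothetical outcomes, invented purely for illustration
treated = [4.0, 5.0, 6.0, 5.0]
control = [3.0, 4.0, 3.0, 2.0]

estimate, se = difference_in_means(treated, control)
print(f"difference in means: {estimate:.2f} (SE {se:.2f})")
```

If a design plan points to a comparison like this one, a few lines of code may be all the estimation the project needs.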

Once written, a design plan can serve as the basis of discussion for an idea. We can easily share a design plan with others to get feedback that helps to illuminate further questions as the research progresses. With a clear research goal in mind, our questions about implementation can become sharper as well. Starting a project with a design plan also makes it easier to incorporate open science principles regarding transparency later. For example, in my research discipline of political science, some notable journals now require a pre-analysis plan (PAP) to submit experimental work. While opinions vary on the level of detail required for a PAP, all of them include some description of the question at hand, key variables and conditions, and expected analyses. By starting every project, even a toy project, with a basic design plan, these questions can not only be answered early in a project but also be refined by the time data is ready to be collected. Fully fleshed-out design plans can even become the basis of our final research reports and papers.

To summarize, before worrying about the code to write or the data to collect, it is often most helpful to worry about the question we want to answer. At the beginning of a project, a design plan provides a low-cost way to organize our thoughts and yields multiple benefits down the line. If you have never tried one, I encourage you to experiment with one on your next project.

Author:

Alex Stephenson

I am a PhD Student in the Travers Department of Political Science and a D-Lab Data Science Fellow. My primary research interests are military organizations, policing, the determinants of political violence, and causal inference. I am also interested in creating tools to make software easier to use for non-technical political scientists.

