When I started graduate school, I had the opportunity to join a research group working on the prevention of mother-to-child transmission of HIV. I quickly discovered that being part of a research group requires a lot of independent research, but that independent research is not quite so independent. My main collaborator for code review and continuation of work was myself. It was not, however, myself at the time the analysis was conducted; it was myself two weeks later, trying to build on my previous work.

 

Luckily, as a D-Lab Data Science Fellow I have access to a D-Lab staff mentor, and I brought up this conundrum. He started me on a journey into learning more about version control and Git.

 

 

What are version control and Git?

 

To put it simply, Git is a tool that allows you to keep track of changes you make to files in a project. There are a few great things about using Git for version control in your research workflows:

 

  • If used correctly, you will never lose your code, because every version of your files that you commit is saved.
  • Because all committed versions are saved, it is easy to undo a change, even if the change was made months ago.
  • Not all code needs to be part of your “main” codebase. Instead, it can live alongside it in what is known as a “feature branch”.

 

Git maintains a version of your codebase called the “master branch”, which you can think of as the main or production version of your code. Git tracks the changes you make relative to the last committed version as you work. You can make those changes part of the master branch through a “commit”, or you can create a new branch off of master to work on a complex feature without affecting the code in master. All of this means that it is very easy to revert changes if necessary while still maintaining a functional base version. An illustration of this is seen below.

 

 

Source: Software Carpentry
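
To make the idea of committing and undoing concrete, here is a minimal sketch in Python that drives the real git command line. It assumes git is installed and on your PATH. The file name analysis.R, the model formulas, the commit messages, and the helper names git and demo_undo are all made up for illustration; they are not from my actual project.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside `repo` and return its trimmed output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_undo():
    """Commit two versions of a script, then restore the first one."""
    repo = Path(tempfile.mkdtemp())  # throwaway folder for the demo
    git(repo, "init")
    # Hypothetical identity, set locally so the sketch runs without global config.
    git(repo, "config", "user.name", "Example Researcher")
    git(repo, "config", "user.email", "researcher@example.com")

    script = repo / "analysis.R"  # hypothetical analysis script
    script.write_text("model <- glm(y ~ x, family = binomial)\n")
    git(repo, "add", "analysis.R")
    git(repo, "commit", "-m", "Fit baseline logistic model")
    first_commit = git(repo, "rev-parse", "HEAD")  # remember this version

    # Change the script and commit the new version.
    script.write_text("model <- glm(y ~ x + z, family = binomial)\n")
    git(repo, "commit", "-am", "Add covariate z")

    # Undo: bring back the file exactly as it was in the first commit.
    git(repo, "checkout", first_commit, "--", "analysis.R")
    return script.read_text(), git(repo, "log", "--oneline")


if __name__ == "__main__":
    restored, history = demo_undo()
    print(restored)
    print(history)
```

Both commits stay in the history even after the older version of the file is restored, which is what makes it safe to experiment.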

 

Applications to my research

 

RStudio, a development environment for the R programming language, has built-in Git integration that handles many common Git operations. This integration also lets you push your changes to GitHub, a popular hosting service for Git repositories.

 

After finishing a quick tutorial on this process, I was ready to bring Git into my research workflows. The first thing I did was create a repository (or repo) on GitHub for my research. A repo is essentially a folder that has version control enabled. I then committed the most recent version of my data analysis script to my repo using Git through RStudio.

 

Most of the time when I work on my research, I commit changes directly to the master branch. I’m often just building upon or expanding previous statistical models, so this workflow works best for me. It lets me see the changes I’ve made over time and gives me much better documentation of the intention behind those changes.

The rest of the time I create branches off of master. This is usually because I want to begin a tangential analysis, either a side idea I want to explore or a request from someone in the research group. Because I am not sure whether the results of such an analysis will be fruitful, a branch lets me build code independently from my master or “main code”, without fear of the tangential idea making my code unreadable in the future.
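
The branching workflow can be sketched the same way. This is an illustrative Python example that calls the real git command line, assuming git is installed and on your PATH. The branch name side-analysis, the file names, and the helper names git and demo_branch are hypothetical. The script asks Git for the name of the starting branch instead of hard-coding it, since some setups call it main and not master.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside `repo` and return its trimmed output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_branch():
    """Do a tangential analysis on a branch and leave the main code untouched."""
    repo = Path(tempfile.mkdtemp())  # throwaway folder for the demo
    git(repo, "init")
    # Hypothetical identity, set locally so the sketch runs without global config.
    git(repo, "config", "user.name", "Example Researcher")
    git(repo, "config", "user.email", "researcher@example.com")

    (repo / "analysis.R").write_text("# main analysis\n")
    git(repo, "add", "analysis.R")
    git(repo, "commit", "-m", "Add main analysis")
    # Name of the starting branch ("master" in this post, "main" on some setups).
    main_branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD")

    # Create a branch for the side idea and commit the new script there.
    git(repo, "checkout", "-b", "side-analysis")
    side = repo / "side_analysis.R"
    side.write_text("# exploratory idea\n")
    git(repo, "add", "side_analysis.R")
    git(repo, "commit", "-m", "Explore side idea")

    # Switch back: the side script is not part of the main code.
    git(repo, "checkout", main_branch)
    on_main = side.exists()
    # Switch to the branch again: the side script is still safely stored.
    git(repo, "checkout", "side-analysis")
    on_branch = side.exists()
    return on_main, on_branch


if __name__ == "__main__":
    print(demo_branch())
```

If the side analysis turns out to be useful, the branch can later be merged into the main code; if not, it can simply be left alone.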

 

D-Lab Resources

 

I’ve learned from this experience that version control belongs in every research workflow. If you would like to learn more about how to do this, D-Lab offers a Git Fundamentals training. If you would like help setting up and tailoring Git to your own workflows and needs, sign up for a consultation with the staff at the D-Lab!




 

 

 
Author: 

Ijeamaka Anyene

Ijeamaka is an MPH student in the Epidemiology and Biostatistics program. Her current research involves using regression models and geospatial techniques to understand why women in Zimbabwe travel to multiple health facilities for antenatal care, and how this affects the cascade of care for prevention of mother-to-child transmission of HIV.