The Season for Sharing Data: Working with the newly released Census 2010-2014 ACS 5 year data in R

On December 3, 2015 the U.S. Census Bureau released the 2010-2014 5-year ACS (American Community Survey) data. You can read all about it on the Census website. This fantastic five-year statistical database provides aggregate social and economic characteristics of American individuals and families down to the block group level. A number of online tools provide access to the ACS 2010-2014 data through graphical user interfaces (GUIs). These include the Census American FactFinder tool and Social Explorer. The latter requires a subscription, which UC Berkeley has and which is accessible to anyone with a CalNet login. Programmatic access to the data is possible via the Census API. In this blog post we will use the Census API to explore the ACS 2010-2014 data in the R statistical programming language.

It is not easy to work with Census data because of the size and the breadth of these data products. However, there are a number of packages that make it easier (but not easy!) to fetch, process and visualize Census data in R. These include the acs package by Ezra Glenn and the acs14lite package by Kyle Walker for downloading the ACS tabular data. The tigris package by Kyle Walker and Bob Rudis makes it relatively easy to download TIGER geographic boundary files, needed to map Census data, and link those boundaries to the ACS data. Hadley Wickham's popular and powerful ggplot2 package can be used to make static maps and charts of the data. The ggmap package allows you to create maps of Google Maps data and other online reference maps that can be used as basemaps for our plots. The CartoDB-R package by Virgilio Gómez-Rubio, which extends the work of Andrew Hill and Kyle Walker, makes it super easy to create online interactive maps of these data using the CartoDB.com APIs. The sp, rgdal and rgeos libraries are the workhorses of spatial data operations in R and are libraries upon which other packages may depend.

This tutorial has been tested with R version 3.2.2 (2015-08-14) on MacOSX 10.9.5. You may need to update your R to follow along. Ok, let's get started.

First, open RStudio or a similar R development environment. Install the packages we will use if they are not already on your system. Some of these need to be installed with devtools, which facilitates installing packages that are not in a CRAN repository. You only need to install the packages once.

install.packages(c('dplyr','ggplot2','ggmap','sp','rgdal','rgeos','maptools','devtools'))
devtools::install_github('walkerke/acs14lite')
devtools::install_github('walkerke/tigris')
devtools::install_github('becarioprecario/cartodb-r/CartoDB', dep=TRUE)
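
Since the install only needs to happen once, it can help to check which packages are already present before installing. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and is not part of any package used in this tutorial.

```r
# CRAN packages installed in the step above.
cran_pkgs <- c("dplyr", "ggplot2", "ggmap", "sp", "rgdal", "rgeos",
               "maptools", "devtools")

# Hypothetical helper: return the subset of `pkgs` that is not yet
# installed in any of the current library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(cran_pkgs)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```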

Now load the libraries. A library is the code part of an R package; a package may also include documentation, data, and tests.

library(sp) # for working with spatial data objects
library(rgdal) # for importing and exporting spatial data in various formats
library(acs14lite) # used to fetch ACS data
library(tigris) # used to fetch TIGER data (shapefiles)
library(dplyr) # used to reformat the ACS data
library(maptools) # used by ggplot and base maps
library(ggplot2) # used to make maps of the ACS data
library(ggmap) # for adding Google Maps data to our maps
library(CartoDB) # to create interactive maps in CartoDB.com

Set your working directory on your local computer.

setwd('~/Documents/census') #mac style
#setwd("c:/docs/mydir") #windows os style

Set the Census API key for the acs14lite library. If you don't have an API key go to http://api.census.gov/data/key_signup.html to get one.

my_census_api_key <- "your api key"
set_api_key(my_census_api_key)
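
If you plan to share your script, you may prefer not to paste the key into the code. One possible approach, sketched below, is to read it from an environment variable. The variable name CENSUS_API_KEY and the helper get_census_key are assumptions made for this example, not something the acs14lite package requires.

```r
# Hypothetical helper: read the Census API key from an environment
# variable and stop with a clear message if it has not been set.
get_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}

# In the tutorial the key would then be registered with:
# set_api_key(get_census_key())
```

You can set the variable for the current session with Sys.setenv(CENSUS_API_KEY = "your api key") or define it in your .Renviron file.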

The next step is to identify the variables for the ACS data you are interested in. The ACS variable must be available for your geography of interest, e.g., tract or block group. If it is not, then null values will be returned for the variable. The Census American FactFinder tool is one way to identify these variables.

For this tutorial we will explore B17021: Poverty Status of Individuals in the Past 12 Months by Living Arrangement. These data are available for county, tract and block group aggregations in 2010-2014. The four variables for which we will retrieve data are:

B17021_001E: count of people for whom poverty status has been determined (the sample estimate)
B17021_001M: count of people for whom poverty status has been determined (the margin of error)
B17021_002E: count of those people whose income in the past 12 months is below poverty (estimate)
B17021_002M: count of those people whose income in the past 12 months is below poverty (margin of error)

You can view these data in a web browser by putting the following URL in the address bar. Note that you will need to add your Census API key.

http://api.census.gov/data/2014/acs5?get=NAME,B17021_001E,B17021_001M,B17021_002E,B17021_002M&for=tract:*&in=state:06+county:075&key=YOUR_KEY
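
The request URL above can also be assembled in R, which makes it easier to swap in other variables or counties. This is only a sketch: the function build_acs_url is a hypothetical helper written for this post, and it reproduces the URL pattern shown above rather than covering the full Census API.

```r
# Hypothetical helper: build a tract-level request URL for the
# 2010-2014 ACS 5-year endpoint, following the pattern shown above.
build_acs_url <- function(variables, state_fips, county_fips, key,
                          base = "http://api.census.gov/data/2014/acs5") {
  paste0(base,
         "?get=NAME,", paste(variables, collapse = ","),
         "&for=tract:*",
         "&in=state:", state_fips, "+county:", county_fips,
         "&key=", key)
}

vars <- c("B17021_001E", "B17021_001M", "B17021_002E", "B17021_002M")

# State 06 is California and county 075 is San Francisco.
url <- build_acs_url(vars, "06", "075", "YOUR_KEY")
print(url)
```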

Fetching the ACS 2010 - 2014 Data

So now let's use the acs14lite R library to fetch ACS 2010-2014 poverty data for San Francisco census tracts. Available geographies for exploring an ACS variable with the acs14lite package include: 'us', 'region', 'division', 'state', 'county', 'tract', 'block group'. For package details enter ??acs14lite in the R console.

sf_poverty <- acs14(geography = 'tract', state = 'CA', county = 'San Francisco', 
                 variable = c('B17021_001E', 'B17021_001M', 'B17021_002E', 'B17021_002M'))

head(sf_poverty) # view retrieved data

We can use the dplyr mutate and select functions to convert the counts to percents and create a simple data frame with those values. We will also use the acs14lite function moe_prop to calculate the margin of error for each percentage.

sf_poverty14 <- mutate(sf_poverty,
                    geoid = paste0(state, county, tract),
                    pctpov = round(100 * (B17021_002E / B17021_001E), 1),
                    moepov = round(100 * (moe_prop(B17021_002E, B17021_001E, B17021_002M, B17021_001M)),1))
sf_poverty14 <- select(sf_poverty14, geoid, pctpov, moepov)

head(sf_poverty14) # take a look at the retrieved and reformatted ACS data
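
To see what a margin of error for a percentage involves, here is a base R sketch of the approximation the Census Bureau publishes for derived proportions. It is not the source code of moe_prop, and the function name moe_prop_sketch and the input numbers are hypothetical. The idea is that the margin of error of p = num / denom is sqrt(num_moe^2 - p^2 * denom_moe^2) / denom, and if the value under the square root is negative, the minus sign is replaced with a plus.

```r
# Sketch of the Census Bureau approximation for the margin of error
# of a proportion p = num / denom. Not the acs14lite implementation.
moe_prop_sketch <- function(num, denom, num_moe, denom_moe) {
  p <- num / denom
  radicand <- num_moe^2 - p^2 * denom_moe^2
  # When the radicand is negative, fall back to the ratio formula,
  # which adds the two terms instead of subtracting them.
  radicand <- ifelse(radicand < 0,
                     num_moe^2 + p^2 * denom_moe^2,
                     radicand)
  sqrt(radicand) / denom
}

# Hypothetical tract: 100 of 1000 people below poverty,
# with margins of error of 30 and 100 on the two counts.
pct <- round(100 * (100 / 1000), 1)
moe <- round(100 * moe_prop_sketch(100, 1000, 30, 100), 1)
print(c(pctpov = pct, moepov = moe))
```

For this made-up tract the estimate is 10 percent with a margin of error of about 2.8 percentage points.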

Linking ACS Data to TIGER Census Tracts

To create a map of the data use the tigris package to download TIGER geographic data in the form of ESRI shapefiles. Use ??tigris in R for details about the package and available functions and options. By default tigris downloads the 2014 TIGER data. The tigris functions for retrieving data are named after the types of data that they retrieve, e.g., tracts() or block_groups().

sf_tracts <- tracts('CA', 'San Francisco', cb=TRUE)

The cb=TRUE option will retrieve the more generalized TIGER data which will save time and memory. It's a good idea unless you know you need the more detailed data. The sf_tracts data object is of class SpatialPolygonsDataFrame. For details see ?"SpatialPolygonsDataFrame-class".

Use the tigris function geo_join to join the ACS data (sf_poverty14) to the tracts spatial data (sf_tracts). Both of these objects have a geoid/GEOID column (the column names differ only in capitalization). Then remove any tracts with no data for the ACS variable. We do this for SF mainly because the Farallon Islands are so far off the coast of SF that they mess up the map of the data.

sf_tracts2 <- geo_join(sf_tracts, sf_poverty14, "GEOID", "geoid")
sf_tracts2 <- sf_tracts2[!is.na(sf_tracts2$pctpov),]
# look at the data
class(sf_tracts2)
str(sf_tracts2)
str(sf_tracts2@data)
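
If you want to see the logic of the join without any spatial objects, the sketch below mimics it with two plain data frames and base R's merge. The tract codes and poverty values are made up for illustration, and merge stands in for geo_join here; it is not what geo_join does internally.

```r
# Toy stand-in for the attribute table of the tract polygons.
tracts_attr <- data.frame(
  GEOID = c("06075010100", "06075010200", "06075980401"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the ACS table. Codes are kept as character strings
# so that leading zeros (e.g. state "06") are not lost.
acs <- data.frame(
  state  = "06",
  county = "075",
  tract  = c("010100", "010200"),
  pctpov = c(12.5, 8.1),
  stringsAsFactors = FALSE
)

# Tract GEOID = 2-digit state + 3-digit county + 6-digit tract.
acs$geoid <- paste0(acs$state, acs$county, acs$tract)

# Left join: keep every tract, filling NA where there is no ACS row.
joined <- merge(tracts_attr, acs[, c("geoid", "pctpov")],
                by.x = "GEOID", by.y = "geoid", all.x = TRUE)

# Drop tracts without a value, as done for sf_tracts2 above.
kept <- joined[!is.na(joined$pctpov), ]
print(kept)
```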

Mapping with GGPLOT and GGMAP

We now have a spatial data object that we can map ACS data with! Below is some code to do this with ggplot2 and ggmap.

# First use fortify() to make the spatial data object a data frame that ggplot can map.
ggplotData <- fortify(sf_tracts2, data=sf_tracts2@data, region="geoid")
head(ggplotData) # look at the data frame created with the fortify function

# Join the ACS data to the fortified data frame
ggplotData <- merge(ggplotData, sf_tracts2@data, by.x="id", by.y="geoid")
head(ggplotData) # look at the data

# Plot the data to emphasize the areas of highest poverty
# First use the ggmap get_map function to fetch a Google Map image to use as our basemap
sf_basemap <- get_map('San Francisco', zoom=12)
ggmap(sf_basemap) +
  geom_polygon(data = ggplotData, aes(x = long, y = lat, group = group, fill = pctpov), alpha=0.75) +
  scale_fill_distiller(palette = "Reds") +
  guides(fill = guide_legend(reverse = TRUE)) +
  ggtitle("Percent of Individuals below Poverty Level\n ACS 2010-2014 Data") +
  theme_nothing(legend=TRUE) +
  coord_map()

The code above creates the following map. To provide context, we used ggmap to create a basemap on which the ACS data is displayed. The ggplot and ggmap options are quite customizable and powerful. However, the syntax can get a bit complicated, especially if you are unfamiliar with ggplot. You can use the package documentation to gain a better understanding of each function.
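
One thing to watch for in the merge step: merge() sorts its result by the key column by default, so the vertex rows can come back in a different order than fortify() produced. Since geom_polygon draws vertices in row order, re-sorting on the order column that fortify() creates is a cheap safeguard if your polygons ever look scrambled. The sketch below shows the idea on a made-up, fortify-like data frame.

```r
# Toy stand-in for fortify() output: two triangles, with the `order`
# column recording the original vertex sequence.
fortified <- data.frame(
  id    = c("B", "B", "B", "A", "A", "A"),
  order = 1:6,
  long  = c(0, 1, 1, 2, 3, 3),
  lat   = c(0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Toy attribute table with one hypothetical value per polygon.
attrs <- data.frame(geoid = c("A", "B"), pctpov = c(5, 20),
                    stringsAsFactors = FALSE)

# merge() returns rows sorted by the key column (`id`).
merged <- merge(fortified, attrs, by.x = "id", by.y = "geoid")

# Restore the original vertex sequence before plotting.
merged <- merged[order(merged$order), ]
print(merged)
```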

Mapping in CartoDB

If you don't want to dive into ggplot you can use an online mapping tool like CartoDB to create an interactive map of your ACS data. You first need to create a CartoDB account. When you log in to your CartoDB account you can click on the heart icon to get your account user name and API key. These are used to export your R spatial data object (here sf_tracts2) from R and import it into CartoDB. This process is shown below.

library(CartoDB)
cdb_username <- 'your_username'
cdb_apikey <- 'your_apikey'
cartodb(cdb_username, cdb_apikey)
r2cartodb(sf_tracts2, 'sf_poverty_by_tract')

In the above call to r2cartodb, the first parameter (sf_tracts2) is the name of the spatial object you are sending to CartoDB and the second parameter ('sf_poverty_by_tract') is the name that the dataset will have in CartoDB. Once you run this command you can log in to or refresh your CartoDB dataset dashboard and you will see the dataset. From there, you can style the map any way you like and share it. Explore the map below to get a sense of the possibilities.

Next Steps

For additional practice, redo this tutorial with census block group data. Because 'block group' is two words, you will need to put quotes around it in some functions. You will also need to use the tigris function block_groups instead of tracts.
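
As a hint for the block group exercise, the sketch below shows how a column whose name contains a space can be referenced in R, and how a 12-digit block group GEOID is assembled. It assumes the returned data has a column literally named "block group"; check the names of your own result with names() before relying on that. The data frame is made up for illustration.

```r
# Toy stand-in for block-group results. check.names = FALSE keeps the
# space in the (assumed) column name "block group".
bg <- data.frame(
  state  = "06",
  county = "075",
  tract  = "010100",
  `block group` = c("1", "2"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# A name with a space must be wrapped in backticks after `$`,
# or given as a quoted string inside [[ ]].
bg$geoid <- paste0(bg$state, bg$county, bg$tract, bg$`block group`)
same <- identical(bg$geoid,
                  paste0(bg$state, bg$county, bg$tract, bg[["block group"]]))

# Block group GEOID = state (2) + county (3) + tract (6) + block group (1).
print(bg$geoid)
```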

Another interesting task would be to compare the 2010-2014 ACS with an earlier five year period. The 2005-2009 ACS data would be ideal as these would be non-overlapping years. However, this earlier period is not yet available via the Census API (although it is via FactFinder). Instead you can get the 2006-2010 ACS data. Once you fetch that data you can compute percent change and map that. However, the acs14lite package, used above, does not work with other ACS data periods. You would need to create some custom code or use the acs package.
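
Once you have poverty rates for both periods, the comparison itself is a join and a subtraction. The sketch below uses made-up values for two tracts to show the difference between a percentage point change and a percent change. Keep in mind that 2006-2010 and 2010-2014 share the year 2010, so the two periods are not fully independent.

```r
# Hypothetical poverty rates (percent) for the same tracts in two periods.
pov_0610 <- data.frame(geoid = c("06075010100", "06075010200"),
                       pctpov = c(10.0, 20.0),
                       stringsAsFactors = FALSE)
pov_1014 <- data.frame(geoid = c("06075010100", "06075010200"),
                       pctpov = c(12.5, 15.0),
                       stringsAsFactors = FALSE)

# Join the two periods on the tract identifier.
both <- merge(pov_0610, pov_1014, by = "geoid",
              suffixes = c("_0610", "_1014"))

# Percentage point change: simple difference of the two rates.
both$pt_change <- both$pctpov_1014 - both$pctpov_0610

# Percent change: difference relative to the earlier rate.
both$pct_change <- round(100 * both$pt_change / both$pctpov_0610, 1)

print(both)
```

Either column could then be joined to the tract boundaries and mapped the same way as pctpov above.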

If you are interested in trying those two additional tasks or if you want the code from this tutorial in an R script, check out the D-Lab github repo for this tutorial. Also, check the D-Lab calendar in the Spring Semester for a workshop on this and other topics. Until then, Happy New Year!

Acknowledgements

This tutorial borrows heavily from similar ones, listed below. We highly encourage you to explore these!

Author:

Patty Frontiera

Dr. Patty Frontiera is the D-Lab Data Services Lead and a geospatial data scientist. She is the official campus representative for ICPSR, the Roper Center, and the Census State Data Center network, and serves as the Co-Director of the Berkeley Federal Statistical Research Data Center (FSRDC). Patty also develops the geospatial workshop curriculum, teaches workshops and consults on geospatial topics. Patty has been with the D-Lab since 2014 and served as the Academic Coordinator through Spring 2017. Patty received her Ph.D.
