About this tutorial

Spatial Analysis begins with an exploration of the data. This entails mapping the data to see the locations and distribution of the data and to look for abt obvious insights like patterns or trends that can be further explored.

The next step is asking questions of, or querying, your data. While aspatial queries include basic summary statistics of the attribute data, spatial queries explore the spatial aspects of your data. There are two key types of spatial queries:

  • spatial measurement queries, e.g. area, length, distance
  • spatial relationship queries, e.g. what locations in A are also in B.

Learning Goals

  • Review key geographic data concepts
  • Identify the key R packages for working with spatial data
  • Practice an introductory spatial analysis workflow using a consistent data set, SF Parks
  • Provide sample code and references for students to move forward.

Fundamentals

Geographic Data

Geographic data are observations about locations on or near the surface of the Earth. For example:

  • air or ocean temperature
  • buildings, roads or water fountains
  • elevation, soil type, land use
  • city, county, state boundaries
  • locations of crimes, concerts, protests

Question: can you think of ways to categorize the different types of geographic data listed above?

Solution: continous phenomena, discrete objects, events

Representing Location

When geographic locations are represented as place names, like Berkeley CA or Gourmet Getto, we typically treat those locations as categorical values in our data set. For example,

setwd("~/Documents/Dlab/dlab_workshops/r_spatial_analysis/r_spatial_workshop")
# 2013: Percent of adults who report consuming fruit less than one time daily†
# https://nccd.cdc.gov/NPAO_DTM/IndicatorSummary.aspx?category=21&indicator=37&year=2013&yearId=17
fruit_data <- read.csv("fruit_consumption.csv")
head(fruit_data)
##     Location Value     X95..CI Sample.Size
## 1    Alabama  46.0 (44.1-47.9)        5975
## 2     Alaska  39.7 (37.6-41.8)        4252
## 3    Arizona  39.5 (36.7-42.3)        3866
## 4   Arkansas  50.5 (48.5-52.6)        4794
## 5 California  30.4 (29.2-31.7)        9746
## 6   Colorado  35.7 (34.5-36.8)       12168

Categorical location data makes its hard to explore spatial characteristics in the data like “are there any regional trends?” For that you need a map and to make a map you need to have geometric representations of geographic locations.

You can see above we have States delineated by areas - or polygons.

Spatial Data

Geospatial data are geometric representations of geographic locations. Spatial data is a more generic term that can be used for both geographic and non-geographic data.

There are two main types of spatial data: vector & raster data. Discrete real world geographic entities are typically represented with vector objects. Continuous phenomena are more often represented with raster data.

Vector Data

Vector data represent location as points, lines or polygons.

Raster Data

Raster data represent location as a gridded surface where each grid cell has a value for the phenomena being mapped, eg temperature. Grid cells are typically uniform and square and are rendered digitally as pixels Grid cells have a dimension, eg a 30m raster consists of 30x30m cells whose value is representative of a 900 sq meter area.

Note: This tutorial will focus on vector data

Coordinate Reference Systems

Points, lines, polygons & grids are geometric objects. When those geometries are referenced to the surface of the earth they can represent geographic locations. We do this by defining a coordinate reference system (CRS) for those objects.

Geographic coordinates are expressed as longitude and latitude.

  • latitude tells us how far we are north or south of the equator, the zero degree of latitude. Values range from +90 (north pole) to -90 degrees (south pole).
  • longitude tells us how far east or west we are of a specified prime meridean, currently defined by international standard as the Greenwich Meridian, the zero degree of longitude. Values range from +180 to -180 (which are the same location! approximately the international date line).

For a number of reasons working with longitude & latitude values are tricky. One issue is that folks often treat longitude & latitude as respectively analogous to the x and y coordinates in a euclidean plane. However, geographic coordinates are angular measures on a 3D model of the earth.

A geographic coordinate system or GCS specifies longitude and latitude on reference surface defined by the orgin (zero degrees of latitude and longitude) and a datum. A datum is a model of the shape of the earth (shperical or ellipsoidal/spheriodal) and a set of ground control points that fit the coordinates to specific locations. The key takeaway here is that to properly interpret latitude and longitude values you need to know their GCS. Fortunately, two main GCSs are used and

  • WGS84: the world geographic system (& datum) of 1984
  • NAD83: The North American Datum of 1983 GCS

These can be considered identical for locations within the USA when sub-meter accuracy is not required. Note the terms datum and GCS are often used interchangeably but there are slight differences. More detail is beyond the scope of this tutorial.

Getting Started

It is time to dive into spatial analysis in R. We will walk through an example workflow and discuss technical specifics and concepts as they arise in the context of this work.

First up, prepare your R workspace

#clean environment
rm(list=ls())
objects()
## character(0)
setwd('/Users/patty/Documents/Dlab/dlab_workshops/r_spatial_analysis')

Spatial Data in R

Packages

Below is some useful code that will:

  • identify the R packages, or libraries, we will use,
  • check to see if they are installed,
  • install them if they are not already installed, and
  • load them into your R environment so they are available to use.
required.pkg <- c("sp", "rgdal", "rgeos", "leaflet", "ggmap", "maptools", "RColorBrewer")
pkgs.not.installed <- required.pkg[!sapply(required.pkg, function(p) require(p, character.only=T))]
install.packages(pkgs.not.installed, dependencies=TRUE)

# Load all libraries them all at once.
lapply(required.pkg, library, character.only = TRUE)         

About the Packages

The R packages used in this tutorial provide the following functionality:

  • sp: Classes and methods for spatial data
  • rgdal: for importing and transforming different spatial data types and providing access to proj4 library for working with map projections and coordinate systems
  • rgeos: for spatial operations and queries on geometric objects
  • ggmap: extends ggplot2 to provide access to online basemaps (e.g. Google Maps, OpenStreetMaps) and online services like the Google Geocoder
  • leaflet: for creating interactive web maps
  • RColorBrewer: for selecting predefined color palettes.

The SP Package

The SP package is most commonly used to provide support for vector data objects in R. Other packages that do things with spatial data typically build on these SP objects.

library(sp)
?sp

The three main types of spatial objects that you will use in R are summarized below.

Vector Data Type SP Spatial Class SP Spatial Class with Attributes
Points SpatialPoints SpatialPointsDataFrame
Lines SpatialLines SpatialLinesDataFrame
Polygons SpatialPolygons SpatialPolygonsDataFrame


Loading Spatial Data

These data are a subset of the parks data set found on the SF OpenData Portal (data.sfgov.org). - https://data.sfgov.org/Culture-and-Recreation/Recreation-Park-Department-Park-Info-Dataset/z76i-7s65

parks <- read.csv("sf_neighborhood_parks.csv", stringsAsFactors = FALSE)
head(parks)
##                     ParkName                        ParkType Acreage
## 1    29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground    0.82
## 2           ADAM ROGERS PARK Neighborhood Park or Playground    2.74
## 3               ALAMO SQUARE Neighborhood Park or Playground   12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground    0.84
## 5                ALLYNE PARK Neighborhood Park or Playground    0.75
## 6                 ALTA PLAZA Neighborhood Park or Playground   11.91
##   ParkID                                                       location
## 1    194       Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2     46       Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3    117         Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4    151     Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5    131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6    129       Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
##        lat       lon
## 1 37.74360 -122.4352
## 2 37.73102 -122.3839
## 3 37.77635 -122.4347
## 4 37.80143 -122.4203
## 5 37.79746 -122.4276
## 6 37.79117 -122.4377
class(parks)
## [1] "data.frame"
str(parks)
## 'data.frame':    121 obs. of  7 variables:
##  $ ParkName: chr  "29TH/DIAMOND OPEN SPACE" "ADAM ROGERS PARK" "ALAMO SQUARE" "ALICE MARBLE TENNIS COURTS" ...
##  $ ParkType: chr  "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" ...
##  $ Acreage : num  0.82 2.74 12.7 0.84 0.75 ...
##  $ ParkID  : int  194 46 117 151 131 129 105 65 97 174 ...
##  $ location: chr  "Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)" "Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)" "Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)" "Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)" ...
##  $ lat     : num  37.7 37.7 37.8 37.8 37.8 ...
##  $ lon     : num  -122 -122 -122 -122 -122 ...

Questions

  • What type of data are parks?
  • Are parks spatial data?

Let’s convert the data.frame to a SpatialPointsDataFrame using the SP function coordinates.

?coordinates

coordinates(parks) <- c("lon","lat") # or ~lon+lat
head(parks)
##             coordinates                   ParkName
## 1  (-122.4352, 37.7436)    29TH/DIAMOND OPEN SPACE
## 2 (-122.3839, 37.73102)           ADAM ROGERS PARK
## 3 (-122.4347, 37.77635)               ALAMO SQUARE
## 4 (-122.4203, 37.80143) ALICE MARBLE TENNIS COURTS
## 5 (-122.4276, 37.79746)                ALLYNE PARK
## 6 (-122.4377, 37.79117)                 ALTA PLAZA
##                          ParkType Acreage ParkID
## 1 Neighborhood Park or Playground    0.82    194
## 2 Neighborhood Park or Playground    2.74     46
## 3 Neighborhood Park or Playground   12.70    117
## 4 Neighborhood Park or Playground    0.84    151
## 5 Neighborhood Park or Playground    0.75    131
## 6 Neighborhood Park or Playground   11.91    129
##                                                         location
## 1       Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2       Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3         Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4     Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6       Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
class(parks)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
str(parks)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  121 obs. of  5 variables:
##   .. ..$ ParkName: chr [1:121] "29TH/DIAMOND OPEN SPACE" "ADAM ROGERS PARK" "ALAMO SQUARE" "ALICE MARBLE TENNIS COURTS" ...
##   .. ..$ ParkType: chr [1:121] "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" ...
##   .. ..$ Acreage : num [1:121] 0.82 2.74 12.7 0.84 0.75 ...
##   .. ..$ ParkID  : int [1:121] 194 46 117 151 131 129 105 65 97 174 ...
##   .. ..$ location: chr [1:121] "Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)" "Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)" "Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)" "Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)" ...
##   ..@ coords.nrs : int [1:2] 7 6
##   ..@ coords     : num [1:121, 1:2] -122 -122 -122 -122 -122 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:2] "lon" "lat"
##   ..@ bbox       : num [1:2, 1:2] -122.5 37.7 -122.4 37.8
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "lon" "lat"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. ..@ projargs: chr NA

SpatialPointsDataFrame

You can see from str(parks) that the parks SPDF object is a collection of slots or components. The key ones are:

  • @data: the attribute data describing each location
  • @coords: the coordinates for each location
  • @bbox: the min and max lon(x) and lat(y) coordinates tthat together define the minimum bounding box around the locations
  • @proj4string: the coordinate reference system defintion as a string
?SpatialPointsDataFrame

Explore the SpatialPointsDataFrame

parks@bbox
##           min        max
## lon -122.5110 -122.37372
## lat   37.7124   37.80342
head(parks@coords)
##            lon      lat
## [1,] -122.4352 37.74360
## [2,] -122.3839 37.73102
## [3,] -122.4347 37.77635
## [4,] -122.4203 37.80143
## [5,] -122.4276 37.79746
## [6,] -122.4377 37.79117
head(parks@data)
##                     ParkName                        ParkType Acreage
## 1    29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground    0.82
## 2           ADAM ROGERS PARK Neighborhood Park or Playground    2.74
## 3               ALAMO SQUARE Neighborhood Park or Playground   12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground    0.84
## 5                ALLYNE PARK Neighborhood Park or Playground    0.75
## 6                 ALTA PLAZA Neighborhood Park or Playground   11.91
##   ParkID                                                       location
## 1    194       Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2     46       Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3    117         Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4    151     Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5    131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6    129       Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
head(parks$ParkName)
## [1] "29TH/DIAMOND OPEN SPACE"    "ADAM ROGERS PARK"          
## [3] "ALAMO SQUARE"               "ALICE MARBLE TENNIS COURTS"
## [5] "ALLYNE PARK"                "ALTA PLAZA"

Creating Maps

Create a simple map of the parks using the R base plot method.

plot(parks, pch=21, col="black", bg='palegreen') # all parks  
plot(parks[parks$Acreage > 10,], pch=21, col="black", bg="darkgreen", add=T) # big parks

Plot the Parks on top of a baselayer

What’s a baselayer?

library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4
sf_basemap <- get_map( location = c(lon=mean(parks$lon),lat=mean(parks$lat)), zoom=12 )
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.757504,-122.435636&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(sf_basemap) +
  geom_point(data=parks@data,aes(x = parks$lon, y = parks$lat), size = 3, col="purple")

Create an interactive map with Leaflet

library(leaflet)
leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addCircleMarkers(data = parks, lng = ~parks$lon,
                   lat = ~parks$lat, radius = 5, stroke=F,
                   popup = paste0("<strong>Park:</strong><br>", parks$ParkName),
                   color = "purple", fillOpacity = 1)

Recap

  • We have loaded a csv file containing lat & lon coordinates for SF Parks
  • Converted the data to a SpatialPointsDataFrame
  • Plotted the points on a static map using ggmap and on an interactive map using leaflet
  • QUESTIONS?

Spatial Queries

Our scenario in this tutorial is an exploration of SF neighborhood parks. The first question we have about these data is:

  • ** What neighborhood is each park in?**

To answer this question let’s load some data that delineate SF neighborhoods. These data are in an ESRI Shapefile. This is one of the most, if not the most common spatial vector data file formats. The source of these data are the SF OpenData Portal.

We will use the rgdal library to load the neighborhoods shapefile. The rgdal library is the most commonly used R library for importing and exporting spatial data. For more info on working with rgdal to load different types of spatial data in R see this excellent tutorial by Zev Ross. In fact his blog has a wealth of information about working with spatial data in R.

library(rgdal)
## rgdal: version: 1.0-4, (SVN revision 548)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 1.11.2, released 2015/02/10
##  Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/rgdal/gdal
##  Loaded PROJ.4 runtime: Rel. 4.9.1, 04 March 2015, [PJ_VERSION: 491]
##  Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/rgdal/proj
##  Linking to sp version: 1.1-1
# read the sf_nhoods.shp file from the current working directory
sf_nhoods <- readOGR(dsn=".", layer="sf_nhoods")
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "sf_nhoods"
## with 41 features
## It has 1 fields
class(sf_nhoods)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

The class function tells us that the sf_nhoods data is a SpatialPolygonsDataFrame. We can take a look at it using str and head.

#str(sf_nhoods)
head(sf_nhoods@data)
##                            nhood
## 0          Bayview Hunters Point
## 1                 Bernal Heights
## 2            Castro/Upper Market
## 3                      Chinatown
## 4                      Excelsior
## 5 Financial District/South Beach

We can also make a simple pote of the neighborhoods and with the parks displayed on top

plot(sf_nhoods, col='grey')
points(parks, pch=21, col="black", bg='palegreen')

If we add labels to the neighborhoods and parks on the map can we answer our question?

  • What about if we create an interactive map?
leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addPolygons(
    data=sf_nhoods,
    stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
    color = "red",
    popup = sf_nhoods$nhood
  ) %>%
  addCircleMarkers(data = parks, lng = ~parks$lon,
    lat = ~parks$lat, radius = 5, stroke=F,
    color = "purple", fillOpacity = 1,
    popup = paste0("<strong>Park:</strong><br>", parks$ParkName)
  )

Challenge - Does order matter? Is there any differnce if you add the Circles before the Polygons?

Adding labels to a static map or popups to an interactive map helps but it doesn’t scale up well with more than a handful of features. Moreover, it does not provide any output that you can use for further analysis like a table that links the neighborhoods to the parks. For that we need to join the data from these two different layers. This is called a spatial join. A spatial overlay operation is the method on which a spatial join is based. A spatial overlay operation computes the relationship between two sets of geometries. The results are then processed with data manipulation functions to output the resultant attribute information associated with those geometries.

Spatial overlay

Spatial overlay operations in R are implemented using the over function in the SP and rGEOS libraries. Point-polygon overlay use SP::over while operations that involve SpatialLines objects, or pairs of SpatialPolygons require package rgeos, and use gIntersects. That’s likely more detail than you need right now but the point here is that rgeos is the go-to library for vector geometric processing in R.

Let’s use over to find out:

  • What neighborhood is each park is located in?.
?over
## Help on topic 'over' was found in the following packages:
## 
##   Package               Library
##   grDevices             /Library/Frameworks/R.framework/Versions/3.2/Resources/library
##   sp                    /Library/Frameworks/R.framework/Versions/3.2/Resources/library
## 
## 
## Using the first match ...

As we just said, over is the man SP/rgeos method for spatial overlay operations. In its basic form, over(x,y) you can interperet it as:

  • for each feature in X give me information about the features in Y at the corresponding locations.

The term feature is used to refer to a single unit of geographic data that includes both the location, or geometry, and the descriptive attributes. So, over can tell us the index of the geometries that intersect or the attributes that describe those geometries.

So here goes…

#park_nhoods <- over(parks, sf_nhoods)

Oh snap! That didn’t work. We get the error:

  • Error: identicalCRS(x, y) is not TRUE

The over function, like almost all spatial analysis functions, requires that both data sets be spatial objects (they are) with the same coordinate reference system (CRS). Let’s investigate

# What is the CRS of the parks?
parks@proj4string # or proj4string(parks)
## CRS arguments: NA
# What is the CRS of the sf_nhoods?
sf_nhoods@proj4string
## CRS arguments: +proj=longlat +ellps=WGS84 +no_defs

We can see from the output that parks does not have a CRS and that the CRS forsf_nhoods is:

  • CRS arguments: +proj=longlat +ellps=WGS84 +no_defs

That CRS string refers to the WGS84 geographic coordinate reference system, which is the most commonly used CRS for latitude and longitude coordinates.

Defining a CRS

Ok, so now we know that the CRS for sf_nhoods is WGS84. We also know that our park SpatialPointsDataFrame was created from longitude & latitude coordinates So, it is safe to assume that these are also WGS84. let’s set the CRS of our park points to that of the SF neighborhoods.

Note, we are not doing any transformation on the data. We are just identifying the CRS for the coordinates used by the data. If we define the wrong projection (and we can) we will get errors and erroneous results when we apply spatial operations on the data.

# Set the CRS for parks to be the same as that for sf_nhoods
proj4string(parks) <- CRS(proj4string(sf_nhoods))

# make sure the CRSs are the same
proj4string(parks) == proj4string(sf_nhoods) 
## [1] TRUE

Now let’s try that overlay operation again

# Now try the overlay operation again
# For each feature in parks give me info about the sf_nhoods at the corresponding location
parks_w_nhoods <-over(parks,sf_nhoods)

What is our output? Does it answer our question?

head(parks_w_nhoods) # take a look at the output
##                   nhood
## 1            Noe Valley
## 2 Bayview Hunters Point
## 3          Hayes Valley
## 4          Russian Hill
## 5                Marina
## 6       Pacific Heights
class(parks_w_nhoods)
## [1] "data.frame"
nrow(parks_w_nhoods)
## [1] 121
nrow(parks)
## [1] 121

Our output parks_w_nhoods is a dataframe with the id of each park and the name of the neighborhood in which it is located in the column nhood. So we are close to answering our question. But for the data to be useful we need to link (or join) the name of the parks to the name of the nhoods.

# Now combine the information about neighborhoods with the spatial data 
parks2 <- cbind(parks,parks_w_nhoods)
head(parks2)
##                     ParkName                        ParkType Acreage
## 1    29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground    0.82
## 2           ADAM ROGERS PARK Neighborhood Park or Playground    2.74
## 3               ALAMO SQUARE Neighborhood Park or Playground   12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground    0.84
## 5                ALLYNE PARK Neighborhood Park or Playground    0.75
## 6                 ALTA PLAZA Neighborhood Park or Playground   11.91
##   ParkID                                                       location
## 1    194       Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2     46       Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3    117         Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4    151     Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5    131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6    129       Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
##        lat       lon optional                 nhood
## 1 37.74360 -122.4352     TRUE            Noe Valley
## 2 37.73102 -122.3839     TRUE Bayview Hunters Point
## 3 37.77635 -122.4347     TRUE          Hayes Valley
## 4 37.80143 -122.4203     TRUE          Russian Hill
## 5 37.79746 -122.4276     TRUE                Marina
## 6 37.79117 -122.4377     TRUE       Pacific Heights

Do we have our answer?

  • Closer but not quite. We have the info but it is not attached to the spatial data.

We can combine the information about neighborhoods and parks with the spatial data using the spCbind (spatial column bind) function in the maptools libary.

library(maptools)
## Checking rgeos availability: TRUE
?spCbind
parks3 <- spCbind(parks,parks_w_nhoods)
head(parks3@data)
##                     ParkName                        ParkType Acreage
## 1    29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground    0.82
## 2           ADAM ROGERS PARK Neighborhood Park or Playground    2.74
## 3               ALAMO SQUARE Neighborhood Park or Playground   12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground    0.84
## 5                ALLYNE PARK Neighborhood Park or Playground    0.75
## 6                 ALTA PLAZA Neighborhood Park or Playground   11.91
##   ParkID                                                       location
## 1    194       Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2     46       Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3    117         Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4    151     Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5    131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6    129       Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
##                   nhood
## 1            Noe Valley
## 2 Bayview Hunters Point
## 3          Hayes Valley
## 4          Russian Hill
## 5                Marina
## 6       Pacific Heights
plot(parks3)

Challenge Map the parks3 and sf data in leaflet. Update the code below to do this and change the popup so that you can confirm that the neighborhood results from the overlay operation match the sf_nhoods data.

leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addPolygons(
    data=sf_nhoods,
    stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
    color = "red",
    popup = sf_nhoods$nhood
  ) %>%
  addCircleMarkers(data = parks3, lng = ~parks3$lon,
    lat = ~parks3$lat, radius = 5, stroke=F,
    color = "purple", fillOpacity = 1,
    popup = paste0(parks3$ParkName, "<br>", parks3$nhood)
  )

Recap: Points-in-polygon Spatial Overlay Query

We just did what’s called a type of spatial overlay query called a Point-in-Polygon query. We asked “in what neighborhood is each park located?”. This is a very common operation. For example, you may ask “In what census tracts are my address points located?”. Once you answer that question with a spatial overlay you can link your address data to census data!

The Neighborhood perspective

We now know the neighborhood for each park. Now let’s think about this question from the neighborhood perspective. First, let’s ask the question:

  • What parks are in Noe Valley?

How might we answer that?

# attribute query because we already made the spatial association
parks3[which(parks3$nhood=='Noe Valley'),]$ParkName
## [1] "29TH/DIAMOND OPEN SPACE"     "DOUGLASS PLAYGROUND"        
## [3] "DUNCAN/CASTRO OPEN SPACE"    "NOE VALLEY COURTS"          
## [5] "UPPER NOE RECREATION CENTER"

What about asking this question spatially from the get go using a spatial overlay?

parks_in_noe1 <-over(sf_nhoods[sf_nhoods$nhood=='Noe Valley',],parks)
parks_in_noe1
##                   ParkName                        ParkType Acreage ParkID
## 21 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground    0.82    194
##                                                    location
## 21 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
parks_in_noe2 <- over(sf_nhoods[sf_nhoods$nhood=='Noe Valley',],parks, returnList=T)
parks_in_noe2
## $`21`
##                        ParkName                        ParkType Acreage
## 1       29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground    0.82
## 27          DOUGLASS PLAYGROUND Neighborhood Park or Playground    7.45
## 29     DUNCAN/CASTRO OPEN SPACE Neighborhood Park or Playground    0.55
## 85            NOE VALLEY COURTS Neighborhood Park or Playground    0.93
## 113 UPPER NOE RECREATION CENTER Neighborhood Park or Playground    2.51
##     ParkID                                                 location
## 1      194 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 27       8    26th\nSan Francisco, CA\n(37.74727823, -122.43893283)
## 29     178  Duncan\nSan Francisco, CA\n(37.74594931, -122.43339409)
## 85      72    24th\nSan Francisco, CA\n(37.75140324, -122.43892096)
## 113     22     Day\nSan Francisco, CA\n(37.74243091, -122.42779751)

Questions:

  • How is this over operation different from the first one we did?
  • What is the difference between parks_in_noe1 & parks_in_noe2?
  • Why didn’t we need to use returnList=TRUE in our first over query?

We know what parks are in Noe Valley - but let’s make it spatial so we can plot it!

over.list <- over(parks,sf_nhoods[sf_nhoods$nhood=='Noe Valley',] )
head(over.list,2)
##        nhood
## 1 Noe Valley
## 2       <NA>
noe_parks <- parks[!is.na(over.list[,1]),]

plot(sf_nhoods[sf_nhoods$nhood=='Noe Valley',])
points(noe_parks, pch=21, col='red', bg='black')

Questions

  • What type of data object is parks_in_noe2?
  • What type of data object is noe_parks?

How many parks are in each neighborhood?

We know what parks are in each neighborhood but we may also want to summarize the count of parks by neighborhood. Since our parks3 SPDF has a dataframe with this information we can summarize it aspaitally.

# using the work we have already done - aspatial 
nhood_park_count <- as.data.frame(table(parks3$nhood))
head(nhood_park_count)
##                             Var1 Freq
## 1          Bayview Hunters Point    9
## 2                 Bernal Heights    3
## 3            Castro/Upper Market    5
## 4                      Chinatown    2
## 5                      Excelsior    2
## 6 Financial District/South Beach    2
# relabel the columsn
names(nhood_park_count) <- c('nhood','park_count')

# look at it again
head(nhood_park_count)
##                            nhood park_count
## 1          Bayview Hunters Point          9
## 2                 Bernal Heights          3
## 3            Castro/Upper Market          5
## 4                      Chinatown          2
## 5                      Excelsior          2
## 6 Financial District/South Beach          2
# Attach the data to the sf_nhoods SPDF and save to a new SPDF
# so that we can compare the two data frames 
sf_nhoods2 <- merge(sf_nhoods,nhood_park_count, by.y='nhood', by.x='nhood')
head(sf_nhoods2@data)
##                            nhood park_count
## 1          Bayview Hunters Point          9
## 2                 Bernal Heights          3
## 3            Castro/Upper Market          5
## 4                      Chinatown          2
## 5                      Excelsior          2
## 6 Financial District/South Beach          2
# compare to 
head(sf_nhoods@data)
##                            nhood
## 0          Bayview Hunters Point
## 1                 Bernal Heights
## 2            Castro/Upper Market
## 3                      Chinatown
## 4                      Excelsior
## 5 Financial District/South Beach

Important note*: Always pass the merge command your SpatialDataFrame object. Never merge the data.frame associated with your Spatial DataFrame object directly!

That attribute join works great but what if we didn’t do the earlier operations. How can we answer that query spatially from the get go? We can do it with over.

over.list <- over(sf_nhoods, parks, returnList = TRUE)
head(over.list,1)
## $`0`
##                             ParkName                        ParkType
## 2                   ADAM ROGERS PARK Neighborhood Park or Playground
## 12               BAY VIEW PLAYGROUND Neighborhood Park or Playground
## 43                 GILMAN PLAYGROUND Neighborhood Park or Playground
## 53                      HILLTOP PARK Neighborhood Park or Playground
## 54  Hunter's Point Recreation Center Neighborhood Park or Playground
## 63      JOSEPH LEE RECREATION CENTER Neighborhood Park or Playground
## 87                 PALOU/PHELPS PARK Neighborhood Park or Playground
## 103        SILVER TERRACE PLAYGROUND Neighborhood Park or Playground
## 121    YOUNGBLOOD COLEMAN PLAYGROUND Neighborhood Park or Playground
##     Acreage ParkID
## 2      2.74     46
## 12     3.40     38
## 43     4.40     44
## 53     3.46     47
## 54     0.30    165
## 63     0.78     40
## 87     2.63    188
## 103    5.47     39
## 121    6.13     45
##                                                            location
## 2          Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 12             3rd\nSan Francisco, CA\n(37.72593906, -122.39329436)
## 43      Gilman Ave\nSan Francisco, CA\n(37.71698202, -122.38809021)
## 53        La Salle\nSan Francisco, CA\n(37.73291728, -122.38347479)
## 54   200 Middle Point Rd\nSan Francisco, CA\n(37.73735, -122.37834)
## 63   1395 Mendell St\nSan Francisco, CA\n(37.73453491, -122.389412)
## 87           Palou\nSan Francisco, CA\n(37.73551382, -122.39472336)
## 103       Thornton\nSan Francisco, CA\n(37.73285911, -122.40314958)
## 121 1398 Hudson St\nSan Francisco, CA\n(37.73876792, -122.38450221)
# We can sum the parks by neighborhood (NROW) and add the count to the sf_nhoods2 SPDF:
sf_nhoods2$park_count2 <- sapply(over.list, NROW)
head(sf_nhoods2@data)
##                            nhood park_count park_count2
## 1          Bayview Hunters Point          9           9
## 2                 Bernal Heights          3           3
## 3            Castro/Upper Market          5           5
## 4                      Chinatown          2           2
## 5                      Excelsior          2           2
## 6 Financial District/South Beach          2           2
# we could have done that in one step:
sf_nhoods2$park_count2 <- sapply(over(sf_nhoods2, parks, returnList = TRUE), NROW)
head(sf_nhoods2@data)
##                            nhood park_count park_count2
## 1          Bayview Hunters Point          9           9
## 2                 Bernal Heights          3           3
## 3            Castro/Upper Market          5           5
## 4                      Chinatown          2           2
## 5                      Excelsior          2           2
## 6 Financial District/South Beach          2           2

The aggregate function provids another way to count the number of points by polygons. I mention it here because you will see it used in other tutorials.

# and yet another way
# after: http://robinlovelace.net/r/2014/01/10/spatial-clipping-and-aggregation-in-R.html
parkNhoods <- aggregate(parks, sf_nhoods, length)
#head(parkNhoods)
head(parkNhoods@data)
##   ParkName ParkType Acreage ParkID location
## 0        9        9       9      9        9
## 1        3        3       3      3        3
## 2        5        5       5      5        5
## 3        2        2       2      2        2
## 4        2        2       2      2        2
## 5        2        2       2      2        2
nrow(parkNhoods)
## [1] 41
nrow(parks)
## [1] 121
nrow(sf_nhoods)
## [1] 41
plot(parkNhoods)

Any questions?

Creating New Spatial Data

Often the data we can find online is kind of but not exactly what we want. So being able to subset, merge or combine data sets is super useful. For example, what if we really only care about the parks in the Financial District Neighborhoods and we want to consider them together? We can create a new polygon that represents our area of interest. We can do this with the rgeos gUnaryUnion function.

library(rgeos)
## rgeos version: 0.3-14, (SVN revision 511)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Linking to sp version: 1.2-1 
##  Polygon checking: TRUE
fd_nhoods <- sf_nhoods[sf_nhoods$nhood=='Chinatown' | sf_nhoods$nhood=='North Beach' | sf_nhoods$nhood=='Nob Hill',]
plot(fd_nhoods)

fd_nhoods2 <- gUnaryUnion(fd_nhoods)

plot(fd_nhoods2)

We could now use our financial district polygon to ask spatial questions instead of asking them of all neighborhoods.

Computing Density

So far we have considered the count of parks by neighborhood. Why is that comparison misleading?

Densities, or counts normalized by area, are used frequently in spatial analysis. So knowing how to compute density is key.

We may want to ask the question:

  • What neighborhood has the most parks per square mile?

What is our strategy to approach this problem?

  • does the data contain area info for the nhoods?
  • if not we compute this from the polygons

We can calculate area of a geometry using the rgeos gArea function. Applied to an SPDF it gives us the area of all areas.

?gArea
gArea(sf_nhoods)
## Warning in RGEOSMiscFunc(spgeom, byid, "rgeos_area"): Spatial object is not
## projected; GEOS expects planar coordinates
## [1] 0.0124847

Oh snap! that didn’t work.

Important - All spatial data being compared need to be in the same projected CRS. - Almost all spatial measurement operations in R require data to be in a projected CRS.

Map Projections

Let’s take a brief interlude to explain map projections and projected CRSs, especially, why these are necessary for spatial measures like area, length.

What is a map projection?

Because the world isn’t flat Map Projections are used to convert locations on a 3D surface to a 2d plane. A map projection is a mathematical model that enables this coordinate transformation. Needless to say, this process introduces error sinces it is impossible to convert a 3D surface to 2D plane without introducing distortion.

There are three main types of projections:

  • conformal: preserves shape and direction
  • equal area: preserves area
  • compromise: preserves nothing, tries to minimize distortion

What is a Projected CRS?

A projected CRS defines how a map projection is applied to coordinate data in a GCS. It specifies the map projection, the input GCS of the data, the origin, and any other needed parameters that enabled the transformation.

Good projected CRSs for CA data include:

Here are some useful references:

Let’s transform the nhoods data to an appropriate CRS

# Transfrom the data to a projected CRS using sp package function spTransform (requires rgdal)
?spTransform
## Help on topic 'spTransform' was found in the following packages:
## 
##   Package               Library
##   rgdal                 /Library/Frameworks/R.framework/Versions/3.2/Resources/library
##   sp                    /Library/Frameworks/R.framework/Versions/3.2/Resources/library
## 
## 
## Using the first match ...
sf_nhoods2_utm10 <- spTransform(sf_nhoods2, CRS("+init=epsg:26910"))

#use rgeos function gArea to compute area
?gArea
gArea(sf_nhoods2_utm10)
## [1] 122021458
# conversion factor: sqmeters to sqmiles
sqm_to_sqmiles <- 3.861e-7

#compute area of SF
sf_area_sqmiles <- gArea(sf_nhoods2_utm10) * sqm_to_sqmiles
sf_area_sqmiles
## [1] 47.11248

Did that area calculation work? To find out, check the total area in square miles with that listed on the San Franciciso wikipedia page.

To compute the area of each neighborhood we set the byid argument of the function to TRUE. We can add the result to the sf_nhoods2_utm10 SPDF.

sf_nhoods2_utm10$area_sqm <- gArea(sf_nhoods2_utm10, byid = TRUE)
sf_nhoods2_utm10$area_sqm
##  [1] 13390445.7  2790169.1  2218768.5   581470.3  3603577.8  2908697.5
##  [7]  1730217.7  4463920.9  1441068.8  1269356.7  1925946.0  3685058.8
## [13]   312392.8  7456415.0  1021782.8  1499388.3  2648543.0  1593673.7
## [19]  4875571.0  2104564.4  1053085.4  2526691.1  1292437.0  2726378.8
## [25]  2599206.5  4635167.3  2058387.2  2137432.0  2943798.8  6118982.7
## [31]  1300202.4  1278685.0   551104.6  2290739.4 10945010.5  1016762.8
## [37]  2300826.7  1715247.6  1583637.4  7916524.9  1510120.9
# convert the area to sq miles
sf_nhoods2_utm10$area_sqmiles <- sf_nhoods2_utm10$area_sqm * sqm_to_sqmiles
sf_nhoods2_utm10$area_sqmiles
##  [1] 5.1700511 1.0772843 0.8566665 0.2245057 1.3913414 1.1230481 0.6680371
##  [8] 1.7235199 0.5563967 0.4900986 0.7436077 1.4228012 0.1206149 2.8789218
## [15] 0.3945103 0.5789138 1.0226024 0.6153174 1.8824580 0.8125723 0.4065963
## [22] 0.9755554 0.4990099 1.0526549 1.0035536 1.7896381 0.7947433 0.8252625
## [29] 1.1366007 2.3625392 0.5020081 0.4937003 0.2127815 0.8844545 4.2258686
## [36] 0.3925721 0.8883492 0.6622571 0.6114424 3.0565703 0.5830577
head(sf_nhoods2_utm10@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles
## 1    5.1700511
## 2    1.0772843
## 3    0.8566665
## 4    0.2245057
## 5    1.3913414
## 6    1.1230481
# 5 biggest neighborhoods?
head(sf_nhoods2_utm10@data[order(-sf_nhoods2_utm10$area_sqmiles),], 5)
##                    nhood park_count park_count2 area_sqm area_sqmiles
## 1  Bayview Hunters Point          9           9 13390446     5.170051
## 35       Sunset/Parkside          6           6 10945011     4.225869
## 40    West of Twin Peaks          8           8  7916525     3.056570
## 14             Lakeshore          1           1  7456415     2.878922
## 30              Presidio          1           1  6118983     2.362539

Challenges - Confirm the biggest neighborhood in your leaflet map - Confirm the area of the presidio neighborhood in wikipedia

Always check your output to make sure your calculation method is correct.

Calculate density of parks per sq mile in each Neighborhood

Now that we have the area and count of parks for each neighborhood we can calculate park density.

sf_nhoods2_utm10$park_per_sqmile <- sf_nhoods2_utm10$park_count / sf_nhoods2_utm10$area_sqmiles
head(sf_nhoods2_utm10@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile
## 1    5.1700511        1.740795
## 2    1.0772843        2.784780
## 3    0.8566665        5.836577
## 4    0.2245057        8.908460
## 5    1.3913414        1.437462
## 6    1.1230481        1.780868
# what?sf_nhoods2_utm10[c('nhood','park_count','park_per_sqmile'),]

Question: Which neighborhoods have the lowest park density (bottom 5)? - Show the numbers - Map the neighborhoods by park density

# 5 highest park densities
head(sf_nhoods2_utm10@data[order(-sf_nhoods2_utm10$park_per_sqmile),], 5)
##               nhood park_count park_count2  area_sqm area_sqmiles
## 7         Glen Park          7           7 1730217.7    0.6680371
## 32     Russian Hill          5           5 1278685.0    0.4937003
## 4         Chinatown          2           2  581470.3    0.2245057
## 13        Japantown          1           1  312392.8    0.1206149
## 41 Western Addition          4           4 1510120.9    0.5830577
##    park_per_sqmile
## 7        10.478461
## 32       10.127602
## 4         8.908460
## 13        8.290853
## 41        6.860385
# 5 lowest park densities
head(sf_nhoods2_utm10@data[order(sf_nhoods2_utm10$park_per_sqmile),], 5)
##               nhood park_count park_count2  area_sqm area_sqmiles
## 8  Golden Gate Park          0           0 4463920.9    1.7235199
## 15     Lincoln Park          0           0 1021782.8    0.3945103
## 20      Mission Bay          0           0 2104564.4    0.8125723
## 33         Seacliff          0           0  551104.6    0.2127815
## 37  Treasure Island          0           0 2300826.7    0.8883492
##    park_per_sqmile
## 8                0
## 15               0
## 20               0
## 33               0
## 37               0

Choropleth maps

Choropleth maps are a type of data map - they symbolize areas by data values.

leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addPolygons(
    data=sf_nhoods2_utm10,
    stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
    color = "red"
  ) 

Where are the data? Why aren’t we seeing the neighborhood polygons?

sf_nhoods2_geo<- spTransform(sf_nhoods2_utm10, CRS("+init=epsg:4326"))
# What is another way to do this with spTransform

#http://zevross.com/blog/2015/10/14/manipulating-and-mapping-us-census-data-in-r-using-the-acs-tigris-and-leaflet-packages-3/
leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addPolygons(
    data=sf_nhoods2_geo,
    stroke = T, fillOpacity= 0.7,smoothFactor = 0.5,
    color = "black",
    weight= 0.5,
    fillColor = ~colorQuantile("YlOrRd", sf_nhoods2_geo$area_sqmiles)(area_sqmiles),
    popup=paste0(sf_nhoods2_geo$nhood, " : ", sf_nhoods2_geo$area_sqmiles)
  ) 

Mapping people

Where are the people?

The number or density of parks in a neighborhood is interesting but parks serve people. So it is important to bring in the populations served by these parks. To do that we may wish to know: - bring in some population data for SF - calculate the populate within each neighborhood - calculate a rate of people per park by neighborhood - calculate the pop served within walking distance of each park - calculate the pop within 1/2 mile of each neighborhood park - identify the most/least accessible parks - compute the neighborhood population within walking distance of a neighborhood park.

We can get population information and other demographic data from the US Census web site. The US Census database is perhaps the largest and most well used database in the world. Needless to say, it’s complicated. So I will walk through one method for getting these data. You do not need to follow along, just load the resulting shapefile.

Census tracts geographic data

The census stores its geospatial data in TIGER/Line files. It makes these available in ESRI Shapefile format.

You can use the tigris library to fetch the census tracts for SF.

library(tigris) # used to fetch TIGER data (shapefiles)
sf_tracts <- tracts('CA', 'San Francisco', cb=TRUE) #state abbrev & countie name
class(sf_tracts)
str(sf_tracts)
head(sf_tracts@data)

# Map the tracts
leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addPolygons(
    data=sf_tracts,
    stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
    color = "red"
    #popup = paste0("<strong>Neighborhood:</strong><br>", sf_nhoods2$nhood)
  )

Fetch the census population data

Once you have the TIGER data you can download the census demographic data and link it to the census tracts.

You can figure out what variable you want to fetch by going to the Census API web page at:

You will see that the variable B01003_001E contains total population by census tract.

Below we use the acs14lite library to fetch the census data via the census api. This requires a census api key.

library(acs14lite)
my_census_api_key <- "f2d6f4f743545d3a42a67412b05935dc7712c432"  # get your own key please!
set_api_key(my_census_api_key) 
sf_pop <- acs14(geography = 'tract', state = 'CA', county = 'San Francisco', 
                    variable = c('B01003_001E'))

# Look at the data
head(sf_pop)
tot_sf_pop <- sum(sf_pop$B01003_001E)
tot_sf_pop

Check your result for tot_sf_pop - does it match what Wikipedia says is the population for SF?

Joining Census data to Census tracts

This is never fun! And there are many ways to do it. The basic approach is a table join by a unique identifier, in this case the tract id. With census data the GEOID is typically used . For tract level data, this is a concatenation of the state, county and tract codes.

sf_pop$GEOID <- paste0(sf_pop$state, sf_pop$county, sf_pop$tract)
head(sf_pop)

# let's simplyfy the data so that we just have pop for the tracts
# and the area data
sf_pop2 <- sf_pop[c('GEOID','B01003_001E')]
head(sf_pop2)

# relabel the columsn
names(sf_pop2) <- c('GEOID','pop14')
head(sf_pop2)

# Now join the tabular data to the tracts
# using the tigris function geo_join
pop_by_tracts<- geo_join(sf_tracts, sf_pop2, "GEOID", "GEOID")
head(pop_by_tracts@data)
pop_by_tracts$GEOID.1 <- NULL  # get rid of duplicative column data
head(pop_by_tracts@data)

# save the data to a shapefile so we do not need to do that again!
# http://zevross.com/blog/2016/01/13/tips-for-reading-spatial-files-into-r-with-rgdal/
# write to current directory as sf_pop_by_tracts.shp
writeOGR(pop_by_tracts, dsn=".", layer="sf_pop_by_tracts", driver="ESRI Shapefile")
dir()

Read in the Census Data

Ok - you can just read in the sf_pop_by_tracts.shp shapefile instead of doing all that work!

pop_by_tracts <- readOGR(dsn=".", layer="sf_pop_by_tracts")
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "sf_pop_by_tracts"
## with 196 features
## It has 10 fields
head(pop_by_tracts@data)
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID   NAME LSAD
## 0      06      075  010700 1400000US06075010700 06075010700    107   CT
## 1      06      075  012201 1400000US06075012201 06075012201 122.01   CT
## 2      06      075  013102 1400000US06075013102 06075013102 131.02   CT
## 3      06      075  013500 1400000US06075013500 06075013500    135   CT
## 4      06      075  015500 1400000US06075015500 06075015500    155   CT
## 5      06      075  016300 1400000US06075016300 06075016300    163   CT
##    ALAND AWATER pop14
## 0 183170      0  5311
## 1  92048      0  4576
## 2 237886      0  2692
## 3 235184      0  2592
## 4 311339      0  3918
## 5 245867      0  4748

Ok, so now that we have pop by census tract, how do we compute the population of each neighborhood?

Start by mapping the neighborhoods & the census tracts together. Note - we want to work with the neighborhood data that has the area and density info - sf_nhoods2_utm10.

plot(pop_by_tracts)
plot(sf_nhoods2_utm10, add=T, border="red")

The neighborhoods don’t show on the map - why?

#transform the census data to utm10
pop_by_tracts@proj4string
## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
tracts_utm10 <- spTransform(pop_by_tracts, CRS(proj4string(sf_nhoods2_utm10)))
plot(tracts_utm10)
plot(sf_nhoods2_utm10, border="red", add=T)

# remove census tracts not in our neighborhoods
nhood_tracts_utm10 <-  tracts_utm10[sf_nhoods2_utm10, ]

# map them again
plot(nhood_tracts_utm10)
plot(sf_nhoods2_utm10, border="red", add=T)

We can see that our two data sets overlap. They are both in a projected coordinate system (utm10). We now need to transfer the population data from the census tracts to the neighborhoods. To do this we need to:

  • calculate the proportion of each census tract that is within each neighborhood and then
  • allocate that proportion of the census tract population to the neighborhood and then
  • sum those population counts.

Although this operation is common, tehre are two things to keep in mind:

  • there is no easy way to do it in R
  • it’s very easy to make mistakes so you need to check your work
  • this type of operation is prone to errors in any language or tool (can you think why?)
#These are the data
## sf_nhoods2_utm10 
head(sf_nhoods2_utm10@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile
## 1    5.1700511        1.740795
## 2    1.0772843        2.784780
## 3    0.8566665        5.836577
## 4    0.2245057        8.908460
## 5    1.3913414        1.437462
## 6    1.1230481        1.780868
## nhood_tracts_utm10
head(nhood_tracts_utm10@data)
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID   NAME LSAD
## 0      06      075  010700 1400000US06075010700 06075010700    107   CT
## 1      06      075  012201 1400000US06075012201 06075012201 122.01   CT
## 2      06      075  013102 1400000US06075013102 06075013102 131.02   CT
## 3      06      075  013500 1400000US06075013500 06075013500    135   CT
## 4      06      075  015500 1400000US06075015500 06075015500    155   CT
## 5      06      075  016300 1400000US06075016300 06075016300    163   CT
##    ALAND AWATER pop14
## 0 183170      0  5311
## 1  92048      0  4576
## 2 237886      0  2692
## 3 235184      0  2592
## 4 311339      0  3918
## 5 245867      0  4748
# The polygon objects have arbitray ids. 
# Let set them to GEOID for tracts 
# But nhood is too long a string to use for sf_nhoods
# So let's set it to the row.names
sf_nhoods2_utm10$nhood_id <- row.names(sf_nhoods2_utm10@data)
head(sf_nhoods2_utm10@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile nhood_id
## 1    5.1700511        1.740795        1
## 2    1.0772843        2.784780        2
## 3    0.8566665        5.836577        3
## 4    0.2245057        8.908460        4
## 5    1.3913414        1.437462        5
## 6    1.1230481        1.780868        6
# Buffer around the polygons a distance of zero (no buffer)
# We do this only to specify the column for the ID of each polygon!!!!
sfhoods_buf0 <- gBuffer(sf_nhoods2_utm10, width=0, byid=TRUE, id=sf_nhoods2_utm10$nhood_id)
sftracts_buf0 <-gBuffer(nhood_tracts_utm10, width=0, byid=TRUE, id=nhood_tracts_utm10$GEOID)

head(sfhoods_buf0@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile nhood_id
## 1    5.1700511        1.740795        1
## 2    1.0772843        2.784780        2
## 3    0.8566665        5.836577        3
## 4    0.2245057        8.908460        4
## 5    1.3913414        1.437462        5
## 6    1.1230481        1.780868        6
head(sftracts_buf0@data)
##             STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID
## 06075010700      06      075  010700 1400000US06075010700 06075010700
## 06075012201      06      075  012201 1400000US06075012201 06075012201
## 06075013102      06      075  013102 1400000US06075013102 06075013102
## 06075013500      06      075  013500 1400000US06075013500 06075013500
## 06075015500      06      075  015500 1400000US06075015500 06075015500
## 06075016300      06      075  016300 1400000US06075016300 06075016300
##               NAME LSAD  ALAND AWATER pop14
## 06075010700    107   CT 183170      0  5311
## 06075012201 122.01   CT  92048      0  4576
## 06075013102 131.02   CT 237886      0  2692
## 06075013500    135   CT 235184      0  2592
## 06075015500    155   CT 311339      0  3918
## 06075016300    163   CT 245867      0  4748
plot(sftracts_buf0, col="black")
plot(sfhoods_buf0, col="red", add=T)

# Calculate fraction of polygon that overlaps each point buffer
#After: http://rstudio-pubs-static.s3.amazonaws.com/6577_3b66f8d8f4984fb2807e91224defa854.html
#
# first intersect the two geometries (tracts and neighborhoods)
t2 <- gIntersection(sfhoods_buf0,sftracts_buf0, byid = T)
plot(t2) # take a look

#

head(t2,1) # CHECK OUT THE ID SLOT!!!
## An object of class "SpatialPolygons"
## Slot "polygons":
## [[1]]
## An object of class "Polygons"
## Slot "Polygons":
## [[1]]
## An object of class "Polygon"
## Slot "labpt":
## [1]  554229.2 4176527.4
## 
## Slot "area":
## [1] 489397.3
## 
## Slot "hole":
## [1] FALSE
## 
## Slot "ringDir":
## [1] 1
## 
## Slot "coords":
##              x       y
##  [1,] 553978.1 4176561
##  [2,] 553997.6 4176668
##  [3,] 553877.3 4176753
##  [4,] 554160.7 4177173
##  [5,] 554490.3 4176940
##  [6,] 554271.5 4176871
##  [7,] 554373.1 4176538
##  [8,] 554615.2 4176335
##  [9,] 554517.5 4176486
## [10,] 554637.1 4176415
## [11,] 554652.5 4176128
## [12,] 554506.4 4176103
## [13,] 554448.9 4176109
## [14,] 554331.4 4176008
## [15,] 554167.1 4176124
## [16,] 553835.3 4176358
## [17,] 553978.1 4176561
## 
## 
## 
## Slot "plotOrder":
## [1] 1
## 
## Slot "labpt":
## [1]  554229.2 4176527.4
## 
## Slot "ID":
## [1] "1 06075023102"
## 
## Slot "area":
## [1] 489397.3
## 
## 
## 
## Slot "plotOrder":
## [1] 1
## 
## Slot "bbox":
##         min       max
## x  553835.3  554652.5
## y 4176008.5 4177172.6
## 
## Slot "proj4string":
## CRS arguments:
##  +init=epsg:26910 +proj=utm +zone=10 +datum=NAD83 +units=m +no_defs
## +ellps=GRS80 +towgs84=0,0,0
#
# second, calculate the areas of the output polygons formed by the intersection
# A combo id (tracts-nhoods) is associated with each area
overlap.area <- gArea(t2, byid = TRUE)
head(overlap.area) # take a look
## 1 06075023102 1 06075025701 1 06075061400 1 06075025800 1 06075023200 
##   489397.2910     2361.9119    12785.4676       45.4304   846310.3119 
## 1 06075025702 
##      325.5555
n.overlaps <- length(overlap.area)
n.overlaps #this is the number of resultant overlaps (intersections)
## [1] 553
# NOTE: the type and order of calls to R functions below ensures that
# resulting dataframe retains IDs in character format; and the area of
# each overlap in numeric format
tmp <- as.character(scan(text = names(overlap.area), what="character"))
head(tmp)
## [1] "1"           "06075023102" "1"           "06075025701" "1"          
## [6] "06075061400"
tmp <- matrix(tmp, nrow = n.overlaps, ncol = 2, byrow = TRUE, dimnames = list(c(NULL), c("nhood.id", "tract.id")))
tmp <- as.data.frame(tmp, stringsAsFactors = FALSE)
overlaps <- cbind(tmp, overlap.area)
rownames(overlaps) <- NULL

# We now have a dataframe of nhood id + tract id  + area of overlap for these two geometries
head(overlaps) # Take a look
##   nhood.id    tract.id overlap.area
## 1        1 06075023102  489397.2910
## 2        1 06075025701    2361.9119
## 3        1 06075061400   12785.4676
## 4        1 06075025800      45.4304
## 5        1 06075023200  846310.3119
## 6        1 06075025702     325.5555
# Calculate the Proportion of area of overlap
# First compute the area of each neighborhood polygon
tmp <- gArea(sfhoods_buf0, byid = TRUE)
nhood.area <- data.frame(nhood.id = names(tmp), nhood.area = tmp, stringsAsFactors = FALSE)
head(nhood.area, 1)
##   nhood.id nhood.area
## 1        1   13390446
#  Second add the area of each polygon to the overlaps table
overlaps <- data.frame(overlaps, nhood.area = nhood.area[match(overlaps[, "nhood.id"], 
                                                             nhood.area[, "nhood.id"]), "nhood.area"])

head(overlaps)
##   nhood.id    tract.id overlap.area nhood.area
## 1        1 06075023102  489397.2910   13390446
## 2        1 06075025701    2361.9119   13390446
## 3        1 06075061400   12785.4676   13390446
## 4        1 06075025800      45.4304   13390446
## 5        1 06075023200  846310.3119   13390446
## 6        1 06075025702     325.5555   13390446
# Now compute the area of each sftracts polygon
tmp <- gArea(sftracts_buf0, byid = TRUE)
tract.area <- data.frame(tract.id = names(tmp), tract.area = tmp, stringsAsFactors = FALSE)
head(tract.area, 1)
##                tract.id tract.area
## 06075010700 06075010700   183022.4
#  Then add the area of each polygon to the overlaps table
overlaps <- data.frame(overlaps, tract.area = tract.area[match(overlaps[, "tract.id"], 
                                                               tract.area[, "tract.id"]), "tract.area"])

#check it out
head(overlaps)
##   nhood.id    tract.id overlap.area nhood.area tract.area
## 1        1 06075023102  489397.2910   13390446   489397.3
## 2        1 06075025701    2361.9119   13390446   638558.3
## 3        1 06075061400   12785.4676   13390446   805193.3
## 4        1 06075025800      45.4304   13390446   240480.9
## 5        1 06075023200  846310.3119   13390446   850098.4
## 6        1 06075025702     325.5555   13390446   516727.8
tail(overlaps)
##     nhood.id    tract.id overlap.area nhood.area tract.area
## 548       41 06075016100 3.686220e+05    1510121   368635.6
## 549       41 06075016200 2.059215e-02    1510121   272357.3
## 550       41 06075016400 4.729286e+01    1510121   308717.7
## 551       41 06075012402 4.680288e+02    1510121   359515.4
## 552       41 06075015700 9.575137e+02    1510121   830881.0
## 553       41 06075015802 1.702923e+05    1510121   170306.1

Now that we know the area of overlap and the area of the input polygons, we can compute the proportion of each census tract that falls within a neighborhood. We can then use that proportion to compute the population within the neighborhood.

# Divide the area of overlap by the total area to get the proportion of overlap
overlaps$tract.area.share <- overlaps$overlap.area/overlaps$tract.area
head(overlaps)
##   nhood.id    tract.id overlap.area nhood.area tract.area tract.area.share
## 1        1 06075023102  489397.2910   13390446   489397.3     1.0000000000
## 2        1 06075025701    2361.9119   13390446   638558.3     0.0036988195
## 3        1 06075061400   12785.4676   13390446   805193.3     0.0158787563
## 4        1 06075025800      45.4304   13390446   240480.9     0.0001889148
## 5        1 06075023200  846310.3119   13390446   850098.4     0.9955439123
## 6        1 06075025702     325.5555   13390446   516727.8     0.0006300330

Reapportion the Population Data

We assign to each neighborhood the sum of the proportion of the population within each census tract that is within the neighborhood.

#sfhoods_buf0,sftracts_buf0

sfhoods_buf0$pop14 <- 0

for (i in 1:n.overlaps) {

    sfhoods_buf0@data[overlaps$nhood.id[i], "pop14"] <- sfhoods_buf0@data[overlaps$nhood.id[i], 
        "pop14"] + (overlaps$tract.area.share[i] * sftracts_buf0@data[overlaps$tract.id[i], 
        "pop14"])
}

head(sfhoods_buf0@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile nhood_id    pop14
## 1    5.1700511        1.740795        1 37101.20
## 2    1.0772843        2.784780        2 25775.41
## 3    0.8566665        5.836577        3 20134.59
## 4    0.2245057        8.908460        4 14608.08
## 5    1.3913414        1.437462        5 39449.89
## 6    1.1230481        1.780868        6 13861.97

Did it work? Let’s check.

#tot_sf_pop
#sum(sfhoods_buf0$pop14)
#tot_sf_pop - sum(sfhoods_buf0$pop14)
### 11,323 difference with method A & B - same answer

Ok - so now that we have the population within each Neighborhood we can compute the number of people per park ratio for each neighborhood

head(sfhoods_buf0@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile nhood_id    pop14
## 1    5.1700511        1.740795        1 37101.20
## 2    1.0772843        2.784780        2 25775.41
## 3    0.8566665        5.836577        3 20134.59
## 4    0.2245057        8.908460        4 14608.08
## 5    1.3913414        1.437462        5 39449.89
## 6    1.1230481        1.780868        6 13861.97
sfhoods_buf0$pop_per_park <- ifelse(sfhoods_buf0$park_count==0, 0, sfhoods_buf0$pop14/sfhoods_buf0$park_count)

head(sfhoods_buf0@data)
##                            nhood park_count park_count2   area_sqm
## 1          Bayview Hunters Point          9           9 13390445.7
## 2                 Bernal Heights          3           3  2790169.1
## 3            Castro/Upper Market          5           5  2218768.5
## 4                      Chinatown          2           2   581470.3
## 5                      Excelsior          2           2  3603577.8
## 6 Financial District/South Beach          2           2  2908697.5
##   area_sqmiles park_per_sqmile nhood_id    pop14 pop_per_park
## 1    5.1700511        1.740795        1 37101.20     4122.356
## 2    1.0772843        2.784780        2 25775.41     8591.805
## 3    0.8566665        5.836577        3 20134.59     4026.917
## 4    0.2245057        8.908460        4 14608.08     7304.041
## 5    1.3913414        1.437462        5 39449.89    19724.944
## 6    1.1230481        1.780868        6 13861.97     6930.984
#top 5
head(sfhoods_buf0@data[order(-sfhoods_buf0$pop_per_park),c('nhood','pop_per_park')], 10)
##                nhood pop_per_park
## 21          Nob Hill     25534.38
## 11    Inner Richmond     21322.08
## 5          Excelsior     19724.94
## 16 Lone Mountain/USF     17193.91
## 28           Portola     15893.31
## 36        Tenderloin     13798.50
## 14         Lakeshore     13313.46
## 35   Sunset/Parkside     13276.49
## 25     Outer Mission     11864.77
## 27   Pacific Heights     11799.00
#bottom 5
head(sfhoods_buf0@data[order(sfhoods_buf0$pop_per_park),c('nhood','pop_per_park')], 10)
##               nhood pop_per_park
## 8  Golden Gate Park       0.0000
## 15     Lincoln Park       0.0000
## 20      Mission Bay       0.0000
## 33         Seacliff       0.0000
## 37  Treasure Island       0.0000
## 18     McLaren Park     470.7181
## 7         Glen Park    1209.5866
## 32     Russian Hill    3463.6078
## 38       Twin Peaks    3491.6903
## 30         Presidio    3509.2412

Question If you had the money to fund one new neigbborhood park, in what neighborhood would you suggest it be built and why?

Discussion & Recap

Extras

Coming soon!

Population served within walking distance?

Neighborhood Impervious surface?

Length of roads?