Spatial analysis begins with an exploration of the data. This entails mapping the data to see its locations and distribution, and looking for any obvious insights, like patterns or trends, that can be explored further.
The next step is asking questions of, or querying, your data. While aspatial queries include basic summary statistics of the attribute data, spatial queries explore the spatial aspects of your data. There are two key types of spatial queries:
Geographic data are observations about locations on or near the surface of the Earth. For example:
Question: can you think of ways to categorize the different types of geographic data listed above?
When geographic locations are represented as place names, like Berkeley, CA or the Gourmet Ghetto, we typically treat those locations as categorical values in our data set. For example,
setwd("~/Documents/Dlab/dlab_workshops/r_spatial_analysis/r_spatial_workshop")
# 2013: Percent of adults who report consuming fruit less than one time daily†
# https://nccd.cdc.gov/NPAO_DTM/IndicatorSummary.aspx?category=21&indicator=37&year=2013&yearId=17
fruit_data <- read.csv("fruit_consumption.csv")
head(fruit_data)
## Location Value X95..CI Sample.Size
## 1 Alabama 46.0 (44.1-47.9) 5975
## 2 Alaska 39.7 (37.6-41.8) 4252
## 3 Arizona 39.5 (36.7-42.3) 3866
## 4 Arkansas 50.5 (48.5-52.6) 4794
## 5 California 30.4 (29.2-31.7) 9746
## 6 Colorado 35.7 (34.5-36.8) 12168
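An aspatial query of these data is just a matter of base R summary functions on the attribute columns. Here is a minimal sketch, using a small mock subset of the table above rather than the full CSV (so it runs standalone):

```r
# Mock subset of the fruit consumption table (values copied from the head() output above)
fruit <- data.frame(
  Location = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  Value = c(46.0, 39.7, 39.5, 50.5, 30.4, 35.7),
  stringsAsFactors = FALSE
)

# Basic aspatial summaries of the attribute data
mean(fruit$Value)                       # average percent across these states
range(fruit$Value)                      # min and max
fruit$Location[which.max(fruit$Value)]  # state with the highest value: "Arkansas"
```

Note that nothing here uses location as anything more than a label - that is exactly the limitation discussed next.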
Categorical location data makes it hard to explore spatial questions like "are there any regional trends?" For that you need a map, and to make a map you need geometric representations of geographic locations.
You can see above we have States delineated by areas - or polygons.
Geospatial data are geometric representations of geographic locations. Spatial data is a more generic term that can be used for both geographic and non-geographic data.
There are two main types of spatial data: vector & raster data. Discrete real world geographic entities are typically represented with vector objects. Continuous phenomena are more often represented with raster data.
Vector data represent location as points, lines or polygons.
Raster data represent location as a gridded surface where each grid cell has a value for the phenomenon being mapped, e.g. temperature. Grid cells are typically uniform and square, and are rendered digitally as pixels. Grid cells have a dimension, e.g. a 30m raster consists of 30x30m cells, each of whose value represents a 900 square meter area.
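To make the grid-cell idea concrete, here is a base R sketch that represents a tiny raster as a matrix plus a resolution, and computes the area each cell represents (the values and names are made up for illustration):

```r
# A toy 3x3 raster of temperature values with a 30 m cell resolution
temps <- matrix(c(12.1, 12.3, 12.0,
                  12.4, 12.6, 12.2,
                  12.8, 13.0, 12.7), nrow = 3, byrow = TRUE)
resolution_m <- 30

cell_area_sqm <- resolution_m^2                   # 30 x 30 = 900 sq meters per cell
total_area_sqm <- cell_area_sqm * length(temps)   # 9 cells cover 8100 sq meters
```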
Points, lines, polygons & grids are geometric objects. When those geometries are referenced to the surface of the earth they can represent geographic locations. We do this by defining a coordinate reference system (CRS) for those objects.
Geographic coordinates are expressed as longitude and latitude.
For a number of reasons, working with longitude & latitude values is tricky. One issue is that folks often treat longitude & latitude as analogous to the x and y coordinates of a Euclidean plane. However, geographic coordinates are angular measures on a 3D model of the earth.
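A quick illustration of why treating degrees as planar x/y misleads: one degree of longitude covers less ground the farther you move from the equator. Below is a base R sketch using the haversine formula, a spherical-earth approximation (the function name and radius value are choices for this example, not from the tutorial):

```r
# Haversine great-circle distance in kilometers (spherical earth approximation)
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# One degree of longitude at the equator vs at San Francisco's latitude (~37.77 N)
haversine_km(0, 0, 1, 0)                 # ~111 km
haversine_km(-122, 37.77, -121, 37.77)   # ~88 km - same "x" difference, shorter distance
```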
A geographic coordinate system, or GCS, specifies longitude and latitude on a reference surface defined by the origin (zero degrees of latitude and longitude) and a datum. A datum is a model of the shape of the earth (spherical or ellipsoidal/spheroidal) and a set of ground control points that fit the coordinates to specific locations. The key takeaway here is that to properly interpret latitude and longitude values you need to know their GCS. Fortunately, two main GCSs are in common use: WGS84 and NAD83.
These can be considered identical for locations within the USA when sub-meter accuracy is not required. Note the terms datum and GCS are often used interchangeably but there are slight differences. More detail is beyond the scope of this tutorial.
It is time to dive into spatial analysis in R. We will walk through an example workflow and discuss technical specifics and concepts as they arise in the context of this work.
First up, prepare your R workspace
#clean environment
rm(list=ls())
objects()
## character(0)
setwd('/Users/patty/Documents/Dlab/dlab_workshops/r_spatial_analysis')
Below is some useful code that will check for the required packages, install any that are missing, and load them all.
required.pkg <- c("sp", "rgdal", "rgeos", "leaflet", "ggmap", "maptools", "RColorBrewer")
pkgs.not.installed <- required.pkg[!sapply(required.pkg, function(p) require(p, character.only=T))]
install.packages(pkgs.not.installed, dependencies=TRUE)
# Load all the libraries at once.
lapply(required.pkg, library, character.only = TRUE)
The R packages used in this tutorial provide the following functionality:
SP Package
The SP package is most commonly used to provide support for vector data objects in R. Other packages that do things with spatial data typically build on these SP objects.
library(sp)
?sp
The three main types of spatial objects that you will use in R are summarized below.
Vector Data Type | SP Spatial Class | SP Spatial Class with Attributes |
---|---|---|
Points | SpatialPoints | SpatialPointsDataFrame |
Lines | SpatialLines | SpatialLinesDataFrame |
Polygons | SpatialPolygons | SpatialPolygonsDataFrame |
These data are a subset of the parks data set found on the SF OpenData Portal (data.sfgov.org). - https://data.sfgov.org/Culture-and-Recreation/Recreation-Park-Department-Park-Info-Dataset/z76i-7s65
parks <- read.csv("sf_neighborhood_parks.csv", stringsAsFactors = FALSE)
head(parks)
## ParkName ParkType Acreage
## 1 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground 0.82
## 2 ADAM ROGERS PARK Neighborhood Park or Playground 2.74
## 3 ALAMO SQUARE Neighborhood Park or Playground 12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground 0.84
## 5 ALLYNE PARK Neighborhood Park or Playground 0.75
## 6 ALTA PLAZA Neighborhood Park or Playground 11.91
## ParkID location
## 1 194 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2 46 Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3 117 Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4 151 Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5 131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6 129 Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
## lat lon
## 1 37.74360 -122.4352
## 2 37.73102 -122.3839
## 3 37.77635 -122.4347
## 4 37.80143 -122.4203
## 5 37.79746 -122.4276
## 6 37.79117 -122.4377
class(parks)
## [1] "data.frame"
str(parks)
## 'data.frame': 121 obs. of 7 variables:
## $ ParkName: chr "29TH/DIAMOND OPEN SPACE" "ADAM ROGERS PARK" "ALAMO SQUARE" "ALICE MARBLE TENNIS COURTS" ...
## $ ParkType: chr "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" ...
## $ Acreage : num 0.82 2.74 12.7 0.84 0.75 ...
## $ ParkID : int 194 46 117 151 131 129 105 65 97 174 ...
## $ location: chr "Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)" "Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)" "Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)" "Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)" ...
## $ lat : num 37.7 37.7 37.8 37.8 37.8 ...
## $ lon : num -122 -122 -122 -122 -122 ...
parks
?parks
How do we make these data spatial? Let's convert the data.frame to a SpatialPointsDataFrame using the SP function coordinates.
?coordinates
coordinates(parks) <- c("lon","lat") # or ~lon+lat
head(parks)
## coordinates ParkName
## 1 (-122.4352, 37.7436) 29TH/DIAMOND OPEN SPACE
## 2 (-122.3839, 37.73102) ADAM ROGERS PARK
## 3 (-122.4347, 37.77635) ALAMO SQUARE
## 4 (-122.4203, 37.80143) ALICE MARBLE TENNIS COURTS
## 5 (-122.4276, 37.79746) ALLYNE PARK
## 6 (-122.4377, 37.79117) ALTA PLAZA
## ParkType Acreage ParkID
## 1 Neighborhood Park or Playground 0.82 194
## 2 Neighborhood Park or Playground 2.74 46
## 3 Neighborhood Park or Playground 12.70 117
## 4 Neighborhood Park or Playground 0.84 151
## 5 Neighborhood Park or Playground 0.75 131
## 6 Neighborhood Park or Playground 11.91 129
## location
## 1 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2 Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3 Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4 Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6 Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
class(parks)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
str(parks)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 121 obs. of 5 variables:
## .. ..$ ParkName: chr [1:121] "29TH/DIAMOND OPEN SPACE" "ADAM ROGERS PARK" "ALAMO SQUARE" "ALICE MARBLE TENNIS COURTS" ...
## .. ..$ ParkType: chr [1:121] "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" "Neighborhood Park or Playground" ...
## .. ..$ Acreage : num [1:121] 0.82 2.74 12.7 0.84 0.75 ...
## .. ..$ ParkID : int [1:121] 194 46 117 151 131 129 105 65 97 174 ...
## .. ..$ location: chr [1:121] "Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)" "Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)" "Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)" "Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)" ...
## ..@ coords.nrs : int [1:2] 7 6
## ..@ coords : num [1:121, 1:2] -122 -122 -122 -122 -122 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:2] "lon" "lat"
## ..@ bbox : num [1:2, 1:2] -122.5 37.7 -122.4 37.8
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:2] "lon" "lat"
## .. .. ..$ : chr [1:2] "min" "max"
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. ..@ projargs: chr NA
You can see from str(parks) that the parks SPDF object is a collection of slots or components. The key ones are @data, @coords, @bbox, and @proj4string.
?SpatialPointsDataFrame
Explore the SpatialPointsDataFrame
parks@bbox
## min max
## lon -122.5110 -122.37372
## lat 37.7124 37.80342
head(parks@coords)
## lon lat
## [1,] -122.4352 37.74360
## [2,] -122.3839 37.73102
## [3,] -122.4347 37.77635
## [4,] -122.4203 37.80143
## [5,] -122.4276 37.79746
## [6,] -122.4377 37.79117
head(parks@data)
## ParkName ParkType Acreage
## 1 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground 0.82
## 2 ADAM ROGERS PARK Neighborhood Park or Playground 2.74
## 3 ALAMO SQUARE Neighborhood Park or Playground 12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground 0.84
## 5 ALLYNE PARK Neighborhood Park or Playground 0.75
## 6 ALTA PLAZA Neighborhood Park or Playground 11.91
## ParkID location
## 1 194 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2 46 Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3 117 Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4 151 Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5 131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6 129 Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
head(parks$ParkName)
## [1] "29TH/DIAMOND OPEN SPACE" "ADAM ROGERS PARK"
## [3] "ALAMO SQUARE" "ALICE MARBLE TENNIS COURTS"
## [5] "ALLYNE PARK" "ALTA PLAZA"
Create a simple map of the parks using the R base plot method.
plot(parks, pch=21, col="black", bg='palegreen') # all parks
plot(parks[parks$Acreage > 10,], pch=21, col="black", bg="darkgreen", add=T) # big parks
What’s a baselayer?
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4
sf_basemap <- get_map( location = c(lon=mean(parks$lon),lat=mean(parks$lat)), zoom=12 )
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.757504,-122.435636&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(sf_basemap) +
geom_point(data=parks@data,aes(x = parks$lon, y = parks$lat), size = 3, col="purple")
library(leaflet)
leaflet() %>% addTiles() %>%
setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
addCircleMarkers(data = parks, lng = ~parks$lon,
lat = ~parks$lat, radius = 5, stroke=F,
popup = paste0("<strong>Park:</strong><br>", parks$ParkName),
color = "purple", fillOpacity = 1)
Above we plotted the parks on a static basemap using ggmap, and on an interactive map using leaflet.
Our scenario in this tutorial is an exploration of SF neighborhood parks. The first question we have about these data is:
To answer this question let's load some data that delineate SF neighborhoods. These data are in an ESRI Shapefile, one of the most, if not the most, common spatial vector data file formats. The source of these data is the SF OpenData Portal.
We will use the rgdal library to load the neighborhoods shapefile. The rgdal library is the most commonly used R library for importing and exporting spatial data. For more info on using rgdal to load different types of spatial data in R, see this excellent tutorial by Zev Ross. In fact, his blog has a wealth of information about working with spatial data in R.
library(rgdal)
## rgdal: version: 1.0-4, (SVN revision 548)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.11.2, released 2015/02/10
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/rgdal/gdal
## Loaded PROJ.4 runtime: Rel. 4.9.1, 04 March 2015, [PJ_VERSION: 491]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/rgdal/proj
## Linking to sp version: 1.1-1
# read the sf_nhoods.shp file from the current working directory
sf_nhoods <- readOGR(dsn=".", layer="sf_nhoods")
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "sf_nhoods"
## with 41 features
## It has 1 fields
class(sf_nhoods)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
The class function tells us that the sf_nhoods data is a SpatialPolygonsDataFrame. We can take a look at it using str and head.
#str(sf_nhoods)
head(sf_nhoods@data)
## nhood
## 0 Bayview Hunters Point
## 1 Bernal Heights
## 2 Castro/Upper Market
## 3 Chinatown
## 4 Excelsior
## 5 Financial District/South Beach
We can also make a simple plot of the neighborhoods with the parks displayed on top.
plot(sf_nhoods, col='grey')
points(parks, pch=21, col="black", bg='palegreen')
If we add labels to the neighborhoods and parks on the map can we answer our question?
leaflet() %>% addTiles() %>%
setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
addPolygons(
data=sf_nhoods,
stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
color = "red",
popup = sf_nhoods$nhood
) %>%
addCircleMarkers(data = parks, lng = ~parks$lon,
lat = ~parks$lat, radius = 5, stroke=F,
color = "purple", fillOpacity = 1,
popup = paste0("<strong>Park:</strong><br>", parks$ParkName)
)
Challenge - Does order matter? Is there any difference if you add the circles before the polygons?
Adding labels to a static map or popups to an interactive map helps, but it doesn't scale well beyond a handful of features. Moreover, it does not produce any output that you can use for further analysis, like a table that links the neighborhoods to the parks. For that we need to join the data from these two different layers. This is called a spatial join. A spatial join is based on a spatial overlay operation, which computes the relationship between two sets of geometries. The results are then processed with data manipulation functions to output the resultant attribute information associated with those geometries.
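Under the hood, a point-in-polygon overlay boils down to a geometric test for each point. Here is a minimal base R sketch of the classic ray-casting test, purely to show the idea - `over` uses optimized sp/GEOS routines, not this code:

```r
# Ray-casting point-in-polygon test: cast a horizontal ray from the point and
# count how many polygon edges it crosses; an odd count means "inside"
point_in_polygon <- function(px, py, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n
  for (i in 1:n) {
    crosses <- (poly_y[i] > py) != (poly_y[j] > py)
    if (crosses) {
      # x coordinate where edge (j, i) intersects the horizontal line y = py
      x_at_py <- poly_x[i] + (py - poly_y[i]) / (poly_y[j] - poly_y[i]) * (poly_x[j] - poly_x[i])
      if (px < x_at_py) inside <- !inside
    }
    j <- i
  }
  inside
}

# A unit-square "neighborhood" and two "park" points
point_in_polygon(0.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))  # TRUE - inside
point_in_polygon(1.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))  # FALSE - outside
```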
Spatial overlay operations in R are implemented using the over function in the SP and rgeos libraries. Point-polygon overlays use sp::over, while operations that involve SpatialLines objects, or pairs of SpatialPolygons, require the rgeos package and use gIntersects. That's likely more detail than you need right now, but the point is that rgeos is the go-to library for vector geometric processing in R.
Let's use over to find out:
?over
## Help on topic 'over' was found in the following packages:
##
## Package Library
## grDevices /Library/Frameworks/R.framework/Versions/3.2/Resources/library
## sp /Library/Frameworks/R.framework/Versions/3.2/Resources/library
##
##
## Using the first match ...
As we just said, over is the main SP/rgeos method for spatial overlay operations. In its basic form, over(x, y), you can interpret it as: for each feature in x, give me information about y at the corresponding location.
The term feature is used to refer to a single unit of geographic data that includes both the location, or geometry, and the descriptive attributes. So, over can tell us the index of the geometries that intersect, or the attributes that describe those geometries.
So here goes…
#park_nhoods <- over(parks, sf_nhoods)
Oh snap! That didn’t work. We get the error:
The over function, like almost all spatial analysis functions, requires that both data sets be spatial objects (they are) with the same coordinate reference system (CRS). Let's investigate.
# What is the CRS of the parks?
parks@proj4string # or proj4string(parks)
## CRS arguments: NA
# What is the CRS of the sf_nhoods?
sf_nhoods@proj4string
## CRS arguments: +proj=longlat +ellps=WGS84 +no_defs
We can see from the output that parks does not have a CRS, and that the CRS for sf_nhoods is:
That CRS string refers to the WGS84 geographic coordinate reference system, which is the most commonly used CRS for latitude and longitude coordinates.
Ok, so now we know that the CRS for sf_nhoods is WGS84. We also know that our parks SpatialPointsDataFrame was created from longitude & latitude coordinates, so it is safe to assume that these are also WGS84. Let's set the CRS of our park points to that of the SF neighborhoods.
Note, we are not doing any transformation of the data. We are just identifying the CRS of the coordinates used by the data. If we define the wrong CRS (and we can), we will get errors and erroneous results when we apply spatial operations to the data.
# Set the CRS for parks to be the same as that for sf_nhoods
proj4string(parks) <- CRS(proj4string(sf_nhoods))
# make sure the CRSs are the same
proj4string(parks) == proj4string(sf_nhoods)
## [1] TRUE
Now let’s try that overlay operation again
# Now try the overlay operation again
# For each feature in parks give me info about the sf_nhoods at the corresponding location
parks_w_nhoods <-over(parks,sf_nhoods)
What is our output? Does it answer our question?
head(parks_w_nhoods) # take a look at the output
## nhood
## 1 Noe Valley
## 2 Bayview Hunters Point
## 3 Hayes Valley
## 4 Russian Hill
## 5 Marina
## 6 Pacific Heights
class(parks_w_nhoods)
## [1] "data.frame"
nrow(parks_w_nhoods)
## [1] 121
nrow(parks)
## [1] 121
Our output parks_w_nhoods is a data.frame with, for each park, the name of the neighborhood in which it is located in the column nhood. So we are close to answering our question. But for the data to be useful we need to link (or join) the names of the parks to the names of the nhoods.
# Now combine the information about neighborhoods with the spatial data
parks2 <- cbind(parks,parks_w_nhoods)
head(parks2)
## ParkName ParkType Acreage
## 1 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground 0.82
## 2 ADAM ROGERS PARK Neighborhood Park or Playground 2.74
## 3 ALAMO SQUARE Neighborhood Park or Playground 12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground 0.84
## 5 ALLYNE PARK Neighborhood Park or Playground 0.75
## 6 ALTA PLAZA Neighborhood Park or Playground 11.91
## ParkID location
## 1 194 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2 46 Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3 117 Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4 151 Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5 131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6 129 Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
## lat lon optional nhood
## 1 37.74360 -122.4352 TRUE Noe Valley
## 2 37.73102 -122.3839 TRUE Bayview Hunters Point
## 3 37.77635 -122.4347 TRUE Hayes Valley
## 4 37.80143 -122.4203 TRUE Russian Hill
## 5 37.79746 -122.4276 TRUE Marina
## 6 37.79117 -122.4377 TRUE Pacific Heights
Do we have our answer?
We can combine the information about neighborhoods and parks with the spatial data using the spCbind (spatial column bind) function in the maptools library.
library(maptools)
## Checking rgeos availability: TRUE
?spCbind
parks3 <- spCbind(parks,parks_w_nhoods)
head(parks3@data)
## ParkName ParkType Acreage
## 1 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground 0.82
## 2 ADAM ROGERS PARK Neighborhood Park or Playground 2.74
## 3 ALAMO SQUARE Neighborhood Park or Playground 12.70
## 4 ALICE MARBLE TENNIS COURTS Neighborhood Park or Playground 0.84
## 5 ALLYNE PARK Neighborhood Park or Playground 0.75
## 6 ALTA PLAZA Neighborhood Park or Playground 11.91
## ParkID location
## 1 194 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 2 46 Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 3 117 Hayes\nSan Francisco, CA\n(37.77634875, -122.43467396)
## 4 151 Greenwich\nSan Francisco, CA\n(37.80142776, -122.42034327)
## 5 131 2609 Gough St\nSan Francisco, CA\n(37.79746066, -122.42759992)
## 6 129 Jackson\nSan Francisco, CA\n(37.79117333, -122.43766978)
## nhood
## 1 Noe Valley
## 2 Bayview Hunters Point
## 3 Hayes Valley
## 4 Russian Hill
## 5 Marina
## 6 Pacific Heights
plot(parks3)
Challenge: Map the parks3 and sf_nhoods data in leaflet. Update the code below to do this, and change the popup so that you can confirm that the neighborhood results from the overlay operation match the sf_nhoods data.
leaflet() %>% addTiles() %>%
setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
addPolygons(
data=sf_nhoods,
stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
color = "red",
popup = sf_nhoods$nhood
) %>%
addCircleMarkers(data = parks3, lng = ~parks3$lon,
lat = ~parks3$lat, radius = 5, stroke=F,
color = "purple", fillOpacity = 1,
popup = paste0(parks3$ParkName, "<br>", parks3$nhood)
)
We just did a type of spatial overlay query called a Point-in-Polygon query. We asked "in what neighborhood is each park located?" This is a very common operation. For example, you may ask "in what census tract is each of my address points located?" Once you answer that question with a spatial overlay, you can link your address data to census data!
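Once an overlay has attached a tract (or neighborhood) id to each point, linking to external data is just an ordinary attribute join. A base R sketch with made-up tract ids and census-style values:

```r
# Points that already carry a tract id from a (hypothetical) overlay step
addresses <- data.frame(address_id = 1:4,
                        tract = c("T1", "T2", "T1", "T3"),
                        stringsAsFactors = FALSE)

# Made-up census attributes, one row per tract
census <- data.frame(tract = c("T1", "T2", "T3"),
                     med_income = c(52000, 61000, 48000),
                     stringsAsFactors = FALSE)

# Attribute join: every address now gets its tract's census values
addresses_w_census <- merge(addresses, census, by = "tract")
```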
We now know the neighborhood for each park. Now let’s think about this question from the neighborhood perspective. First, let’s ask the question:
How might we answer that?
# attribute query because we already made the spatial association
parks3[which(parks3$nhood=='Noe Valley'),]$ParkName
## [1] "29TH/DIAMOND OPEN SPACE" "DOUGLASS PLAYGROUND"
## [3] "DUNCAN/CASTRO OPEN SPACE" "NOE VALLEY COURTS"
## [5] "UPPER NOE RECREATION CENTER"
What about asking this question spatially from the get go using a spatial overlay?
parks_in_noe1 <-over(sf_nhoods[sf_nhoods$nhood=='Noe Valley',],parks)
parks_in_noe1
## ParkName ParkType Acreage ParkID
## 21 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground 0.82 194
## location
## 21 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
parks_in_noe2 <- over(sf_nhoods[sf_nhoods$nhood=='Noe Valley',],parks, returnList=T)
parks_in_noe2
## $`21`
## ParkName ParkType Acreage
## 1 29TH/DIAMOND OPEN SPACE Neighborhood Park or Playground 0.82
## 27 DOUGLASS PLAYGROUND Neighborhood Park or Playground 7.45
## 29 DUNCAN/CASTRO OPEN SPACE Neighborhood Park or Playground 0.55
## 85 NOE VALLEY COURTS Neighborhood Park or Playground 0.93
## 113 UPPER NOE RECREATION CENTER Neighborhood Park or Playground 2.51
## ParkID location
## 1 194 Diamond\nSan Francisco, CA\n(37.74360211, -122.43523589)
## 27 8 26th\nSan Francisco, CA\n(37.74727823, -122.43893283)
## 29 178 Duncan\nSan Francisco, CA\n(37.74594931, -122.43339409)
## 85 72 24th\nSan Francisco, CA\n(37.75140324, -122.43892096)
## 113 22 Day\nSan Francisco, CA\n(37.74243091, -122.42779751)
Question: what changed when we set returnList=TRUE compared to our first over query?
We know what parks are in Noe Valley - but let's make it spatial so we can plot it!
over.list <- over(parks,sf_nhoods[sf_nhoods$nhood=='Noe Valley',] )
head(over.list,2)
## nhood
## 1 Noe Valley
## 2 <NA>
noe_parks <- parks[!is.na(over.list[,1]),]
plot(sf_nhoods[sf_nhoods$nhood=='Noe Valley',])
points(noe_parks, pch=21, col='red', bg='black')
We know what parks are in each neighborhood, but we may also want to summarize the count of parks by neighborhood. Since our parks3 SPDF has a data.frame with this information, we can summarize it aspatially.
# using the work we have already done - aspatial
nhood_park_count <- as.data.frame(table(parks3$nhood))
head(nhood_park_count)
## Var1 Freq
## 1 Bayview Hunters Point 9
## 2 Bernal Heights 3
## 3 Castro/Upper Market 5
## 4 Chinatown 2
## 5 Excelsior 2
## 6 Financial District/South Beach 2
# relabel the columns
names(nhood_park_count) <- c('nhood','park_count')
# look at it again
head(nhood_park_count)
## nhood park_count
## 1 Bayview Hunters Point 9
## 2 Bernal Heights 3
## 3 Castro/Upper Market 5
## 4 Chinatown 2
## 5 Excelsior 2
## 6 Financial District/South Beach 2
# Attach the data to the sf_nhoods SPDF and save to a new SPDF
# so that we can compare the two data frames
sf_nhoods2 <- merge(sf_nhoods,nhood_park_count, by.y='nhood', by.x='nhood')
head(sf_nhoods2@data)
## nhood park_count
## 1 Bayview Hunters Point 9
## 2 Bernal Heights 3
## 3 Castro/Upper Market 5
## 4 Chinatown 2
## 5 Excelsior 2
## 6 Financial District/South Beach 2
# compare to
head(sf_nhoods@data)
## nhood
## 0 Bayview Hunters Point
## 1 Bernal Heights
## 2 Castro/Upper Market
## 3 Chinatown
## 4 Excelsior
## 5 Financial District/South Beach
Important note: Always pass the merge command your Spatial DataFrame object. Never merge the data.frame associated with your Spatial DataFrame object directly!
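The reason is that base R merge() sorts its output by the join key (and can drop rows), while the geometry slots keep their original order - so a merged @data would no longer line up with @coords. A base R sketch of the reordering behavior, with made-up values:

```r
# merge() sorts its output by the "by" column, which can reorder rows
left  <- data.frame(nhood = c("Noe Valley", "Chinatown", "Marina"),
                    id = 1:3, stringsAsFactors = FALSE)
right <- data.frame(nhood = c("Marina", "Noe Valley", "Chinatown"),
                    parks = c(4, 5, 2), stringsAsFactors = FALSE)

merged <- merge(left, right, by = "nhood")
merged$nhood  # "Chinatown" "Marina" "Noe Valley" - alphabetical, not the input order
```

If `left` were the @data slot of an SPDF, row 1 of `merged` would now describe a different geometry than row 1 of @coords.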
That attribute join works great, but what if we hadn't done the earlier operations? How can we answer that query spatially from the get go? We can do it with over.
over.list <- over(sf_nhoods, parks, returnList = TRUE)
head(over.list,1)
## $`0`
## ParkName ParkType
## 2 ADAM ROGERS PARK Neighborhood Park or Playground
## 12 BAY VIEW PLAYGROUND Neighborhood Park or Playground
## 43 GILMAN PLAYGROUND Neighborhood Park or Playground
## 53 HILLTOP PARK Neighborhood Park or Playground
## 54 Hunter's Point Recreation Center Neighborhood Park or Playground
## 63 JOSEPH LEE RECREATION CENTER Neighborhood Park or Playground
## 87 PALOU/PHELPS PARK Neighborhood Park or Playground
## 103 SILVER TERRACE PLAYGROUND Neighborhood Park or Playground
## 121 YOUNGBLOOD COLEMAN PLAYGROUND Neighborhood Park or Playground
## Acreage ParkID
## 2 2.74 46
## 12 3.40 38
## 43 4.40 44
## 53 3.46 47
## 54 0.30 165
## 63 0.78 40
## 87 2.63 188
## 103 5.47 39
## 121 6.13 45
## location
## 2 Ingalls\nSan Francisco, CA\n(37.73101645, -122.38385466)
## 12 3rd\nSan Francisco, CA\n(37.72593906, -122.39329436)
## 43 Gilman Ave\nSan Francisco, CA\n(37.71698202, -122.38809021)
## 53 La Salle\nSan Francisco, CA\n(37.73291728, -122.38347479)
## 54 200 Middle Point Rd\nSan Francisco, CA\n(37.73735, -122.37834)
## 63 1395 Mendell St\nSan Francisco, CA\n(37.73453491, -122.389412)
## 87 Palou\nSan Francisco, CA\n(37.73551382, -122.39472336)
## 103 Thornton\nSan Francisco, CA\n(37.73285911, -122.40314958)
## 121 1398 Hudson St\nSan Francisco, CA\n(37.73876792, -122.38450221)
# We can sum the parks by neighborhood (NROW) and add the count to the sf_nhoods2 SPDF:
sf_nhoods2$park_count2 <- sapply(over.list, NROW)
head(sf_nhoods2@data)
## nhood park_count park_count2
## 1 Bayview Hunters Point 9 9
## 2 Bernal Heights 3 3
## 3 Castro/Upper Market 5 5
## 4 Chinatown 2 2
## 5 Excelsior 2 2
## 6 Financial District/South Beach 2 2
# we could have done that in one step:
sf_nhoods2$park_count2 <- sapply(over(sf_nhoods2, parks, returnList = TRUE), NROW)
head(sf_nhoods2@data)
## nhood park_count park_count2
## 1 Bayview Hunters Point 9 9
## 2 Bernal Heights 3 3
## 3 Castro/Upper Market 5 5
## 4 Chinatown 2 2
## 5 Excelsior 2 2
## 6 Financial District/South Beach 2 2
The aggregate function provides yet another way to count the number of points by polygon. I mention it here because you will see it used in other tutorials.
# and yet another way
# after: http://robinlovelace.net/r/2014/01/10/spatial-clipping-and-aggregation-in-R.html
parkNhoods <- aggregate(parks, sf_nhoods, length)
#head(parkNhoods)
head(parkNhoods@data)
## ParkName ParkType Acreage ParkID location
## 0 9 9 9 9 9
## 1 3 3 3 3 3
## 2 5 5 5 5 5
## 3 2 2 2 2 2
## 4 2 2 2 2 2
## 5 2 2 2 2 2
nrow(parkNhoods)
## [1] 41
nrow(parks)
## [1] 121
nrow(sf_nhoods)
## [1] 41
plot(parkNhoods)
Often the data we find online is kind of, but not exactly, what we want. So being able to subset, merge, or combine data sets is super useful. For example, what if we only care about the parks in the neighborhoods around the Financial District and want to consider them together? We can create a new polygon that represents our area of interest with the rgeos gUnaryUnion function.
library(rgeos)
## rgeos version: 0.3-14, (SVN revision 511)
## GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921
## Linking to sp version: 1.2-1
## Polygon checking: TRUE
fd_nhoods <- sf_nhoods[sf_nhoods$nhood=='Chinatown' | sf_nhoods$nhood=='North Beach' | sf_nhoods$nhood=='Nob Hill',]
plot(fd_nhoods)
fd_nhoods2 <- gUnaryUnion(fd_nhoods)
plot(fd_nhoods2)
We could now use our financial district polygon to ask spatial questions instead of asking them of all neighborhoods.
So far we have considered the count of parks by neighborhood. Why is that comparison misleading?
Densities, or counts normalized by area, are used frequently in spatial analysis. So knowing how to compute density is key.
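The arithmetic is simple: density = count / area. A base R sketch using two of the neighborhood figures computed later in this tutorial (rounded for illustration):

```r
# Park density = count normalized by area
nhood_stats <- data.frame(
  nhood = c("Bayview Hunters Point", "Chinatown"),
  park_count = c(9, 2),
  area_sqmiles = c(5.17, 0.22),
  stringsAsFactors = FALSE
)
nhood_stats$parks_per_sqmile <- nhood_stats$park_count / nhood_stats$area_sqmiles
# Chinatown has fewer parks but a much higher density (~9.1 vs ~1.7 per sq mile),
# which is why raw counts alone mislead
```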
We may want to ask the question:
What is our strategy to approach this problem?
We can calculate the area of a geometry using the rgeos gArea function. Applied to an SPDF, it gives us the total area of all of the polygons.
?gArea
gArea(sf_nhoods)
## Warning in RGEOSMiscFunc(spgeom, byid, "rgeos_area"): Spatial object is not
## projected; GEOS expects planar coordinates
## [1] 0.0124847
Oh snap! That didn't work either. Important: all spatial data being compared need to be in the same projected CRS, and almost all spatial measurement operations in R require data to be in a projected CRS.
Let's take a brief interlude to explain map projections and projected CRSs - especially why these are necessary for spatial measures like area and length.
Because the world isn't flat, map projections are used to convert locations on a 3D surface to a 2D plane. A map projection is a mathematical model that enables this coordinate transformation. Needless to say, this process introduces error, since it is impossible to flatten a 3D surface onto a 2D plane without distortion.
There are three main types of projections:
A projected CRS defines how a map projection is applied to coordinate data in a GCS. It specifies the map projection, the input GCS of the data, the origin, and any other parameters needed to enable the transformation.
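To see what a projection does, here is a toy equirectangular projection in base R. It converts degrees to planar meters using a reference latitude - a deliberately crude model for illustration only (the function name and parameters are made up; real projected CRSs like UTM use much more sophisticated math):

```r
# Toy equirectangular projection: degrees -> approximate planar meters
# lat0 is the reference latitude where east-west scale is correct;
# r is an earth radius in meters
project_equirect <- function(lon, lat, lat0 = 37.77, r = 6378137) {
  to_rad <- pi / 180
  x <- r * lon * to_rad * cos(lat0 * to_rad)
  y <- r * lat * to_rad
  c(x = x, y = y)
}

# Two points one degree of longitude apart near San Francisco
p1 <- project_equirect(-122.5, 37.77)
p2 <- project_equirect(-121.5, 37.77)
p2["x"] - p1["x"]  # ~88,000 planar meters - now plain Euclidean math works
```

Once coordinates are in planar meters like this, lengths and areas can be computed with ordinary geometry, which is exactly why measurement functions want a projected CRS.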
Good projected CRSs for CA data include:
Here are some useful references:
# Transform the data to a projected CRS using the sp package function spTransform (requires rgdal)
?spTransform
## Help on topic 'spTransform' was found in the following packages:
##
## Package Library
## rgdal /Library/Frameworks/R.framework/Versions/3.2/Resources/library
## sp /Library/Frameworks/R.framework/Versions/3.2/Resources/library
##
##
## Using the first match ...
sf_nhoods2_utm10 <- spTransform(sf_nhoods2, CRS("+init=epsg:26910"))
#use rgeos function gArea to compute area
?gArea
gArea(sf_nhoods2_utm10)
## [1] 122021458
# conversion factor: sqmeters to sqmiles
sqm_to_sqmiles <- 3.861e-7
#compute area of SF
sf_area_sqmiles <- gArea(sf_nhoods2_utm10) * sqm_to_sqmiles
sf_area_sqmiles
## [1] 47.11248
Did that area calculation work? To find out, compare the total area in square miles against that listed on the San Francisco Wikipedia page.
To compute the area of each neighborhood we set the byid argument of the function to TRUE. We can add the result to the sf_nhoods2_utm10 SPDF.
sf_nhoods2_utm10$area_sqm <- gArea(sf_nhoods2_utm10, byid = TRUE)
sf_nhoods2_utm10$area_sqm
## [1] 13390445.7 2790169.1 2218768.5 581470.3 3603577.8 2908697.5
## [7] 1730217.7 4463920.9 1441068.8 1269356.7 1925946.0 3685058.8
## [13] 312392.8 7456415.0 1021782.8 1499388.3 2648543.0 1593673.7
## [19] 4875571.0 2104564.4 1053085.4 2526691.1 1292437.0 2726378.8
## [25] 2599206.5 4635167.3 2058387.2 2137432.0 2943798.8 6118982.7
## [31] 1300202.4 1278685.0 551104.6 2290739.4 10945010.5 1016762.8
## [37] 2300826.7 1715247.6 1583637.4 7916524.9 1510120.9
# convert the area to sq miles
sf_nhoods2_utm10$area_sqmiles <- sf_nhoods2_utm10$area_sqm * sqm_to_sqmiles
sf_nhoods2_utm10$area_sqmiles
## [1] 5.1700511 1.0772843 0.8566665 0.2245057 1.3913414 1.1230481 0.6680371
## [8] 1.7235199 0.5563967 0.4900986 0.7436077 1.4228012 0.1206149 2.8789218
## [15] 0.3945103 0.5789138 1.0226024 0.6153174 1.8824580 0.8125723 0.4065963
## [22] 0.9755554 0.4990099 1.0526549 1.0035536 1.7896381 0.7947433 0.8252625
## [29] 1.1366007 2.3625392 0.5020081 0.4937003 0.2127815 0.8844545 4.2258686
## [36] 0.3925721 0.8883492 0.6622571 0.6114424 3.0565703 0.5830577
head(sf_nhoods2_utm10@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles
## 1 5.1700511
## 2 1.0772843
## 3 0.8566665
## 4 0.2245057
## 5 1.3913414
## 6 1.1230481
# 5 biggest neighborhoods?
head(sf_nhoods2_utm10@data[order(-sf_nhoods2_utm10$area_sqmiles),], 5)
## nhood park_count park_count2 area_sqm area_sqmiles
## 1 Bayview Hunters Point 9 9 13390446 5.170051
## 35 Sunset/Parkside 6 6 10945011 4.225869
## 40 West of Twin Peaks 8 8 7916525 3.056570
## 14 Lakeshore 1 1 7456415 2.878922
## 30 Presidio 1 1 6118983 2.362539
Challenges:
- Confirm the biggest neighborhood in your leaflet map
- Confirm the area of the Presidio neighborhood on Wikipedia
Always check your output to make sure your calculation method is correct.
Now that we have the area and count of parks for each neighborhood we can calculate park density.
sf_nhoods2_utm10$park_per_sqmile <- sf_nhoods2_utm10$park_count / sf_nhoods2_utm10$area_sqmiles
head(sf_nhoods2_utm10@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile
## 1 5.1700511 1.740795
## 2 1.0772843 2.784780
## 3 0.8566665 5.836577
## 4 0.2245057 8.908460
## 5 1.3913414 1.437462
## 6 1.1230481 1.780868
# Show just the relevant columns
# sf_nhoods2_utm10@data[, c('nhood','park_count','park_per_sqmile')]
Question: Which neighborhoods have the lowest park density (bottom 5)?
- Show the numbers
- Map the neighborhoods by park density
# 5 highest park densities
head(sf_nhoods2_utm10@data[order(-sf_nhoods2_utm10$park_per_sqmile),], 5)
## nhood park_count park_count2 area_sqm area_sqmiles
## 7 Glen Park 7 7 1730217.7 0.6680371
## 32 Russian Hill 5 5 1278685.0 0.4937003
## 4 Chinatown 2 2 581470.3 0.2245057
## 13 Japantown 1 1 312392.8 0.1206149
## 41 Western Addition 4 4 1510120.9 0.5830577
## park_per_sqmile
## 7 10.478461
## 32 10.127602
## 4 8.908460
## 13 8.290853
## 41 6.860385
# 5 lowest park densities
head(sf_nhoods2_utm10@data[order(sf_nhoods2_utm10$park_per_sqmile),], 5)
## nhood park_count park_count2 area_sqm area_sqmiles
## 8 Golden Gate Park 0 0 4463920.9 1.7235199
## 15 Lincoln Park 0 0 1021782.8 0.3945103
## 20 Mission Bay 0 0 2104564.4 0.8125723
## 33 Seacliff 0 0 551104.6 0.2127815
## 37 Treasure Island 0 0 2300826.7 0.8883492
## park_per_sqmile
## 8 0
## 15 0
## 20 0
## 33 0
## 37 0
Choropleth maps are a type of data map - they symbolize areas by data values.
leaflet() %>% addTiles() %>%
setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
addPolygons(
data=sf_nhoods2_utm10,
stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
color = "red"
)
Where are the data? Why aren’t we seeing the neighborhood polygons?
sf_nhoods2_geo <- spTransform(sf_nhoods2_utm10, CRS("+init=epsg:4326"))
# What is another way to do this with spTransform?
#http://zevross.com/blog/2015/10/14/manipulating-and-mapping-us-census-data-in-r-using-the-acs-tigris-and-leaflet-packages-3/
leaflet() %>% addTiles() %>%
setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
addPolygons(
data=sf_nhoods2_geo,
stroke = T, fillOpacity= 0.7,smoothFactor = 0.5,
color = "black",
weight= 0.5,
fillColor = ~colorQuantile("YlOrRd", sf_nhoods2_geo$area_sqmiles)(area_sqmiles),
popup=paste0(sf_nhoods2_geo$nhood, " : ", sf_nhoods2_geo$area_sqmiles)
)
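The same pattern answers the earlier challenge of mapping neighborhoods by park density (a sketch: it reuses the transformed sf_nhoods2_geo object, which already carries the park_per_sqmile column; note that colorQuantile can complain if quantile breaks are not unique):

```r
# Choropleth of park density per neighborhood
leaflet() %>% addTiles() %>%
  setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
  addPolygons(
    data = sf_nhoods2_geo,
    stroke = TRUE, fillOpacity = 0.7, smoothFactor = 0.5,
    color = "black", weight = 0.5,
    fillColor = ~colorQuantile("YlGnBu", sf_nhoods2_geo$park_per_sqmile)(park_per_sqmile),
    popup = paste0(sf_nhoods2_geo$nhood, " : ",
                   round(sf_nhoods2_geo$park_per_sqmile, 2))
  )
```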
Where are the people?
The number or density of parks in a neighborhood is interesting, but parks serve people, so it is important to bring in the populations served by these parks. To do that we may wish to:
- bring in some population data for SF
- calculate the population within each neighborhood
- calculate a rate of people per park by neighborhood
- calculate the population served within walking distance of each park
- calculate the population within 1/2 mile of each neighborhood park
- identify the most/least accessible parks
- compute the neighborhood population within walking distance of a neighborhood park
We can get population information and other demographic data from the US Census web site. The US Census database is perhaps the largest and most widely used database in the world. Needless to say, it's complicated. So I will walk through one method for getting these data. You do not need to follow along - just load the resulting shapefile.
The census stores its geospatial data in TIGER/Line files. It makes these available in ESRI Shapefile format. You can use the tigris library to fetch the census tracts for SF.
library(tigris) # used to fetch TIGER data (shapefiles)
sf_tracts <- tracts('CA', 'San Francisco', cb=TRUE) # state abbreviation & county name
class(sf_tracts)
str(sf_tracts)
head(sf_tracts@data)
# Map the tracts
leaflet() %>% addTiles() %>%
setView(lng = mean(parks$lon), lat = mean(parks$lat), zoom = 12) %>%
addPolygons(
data=sf_tracts,
stroke = TRUE, fillColor="transparent", smoothFactor = 0.5,
color = "red"
#popup = paste0("<strong>Neighborhood:</strong><br>", sf_nhoods2$nhood)
)
Once you have the TIGER data you can download the census demographic data and link it to the census tracts.
You can figure out what variable you want to fetch by going to the Census API web page at:
You will see that the variable B01003_001E contains total population by census tract.
Below we use the acs14lite library to fetch the census data via the Census API. This requires a Census API key.
library(acs14lite)
my_census_api_key <- "f2d6f4f743545d3a42a67412b05935dc7712c432" # get your own key please!
set_api_key(my_census_api_key)
sf_pop <- acs14(geography = 'tract', state = 'CA', county = 'San Francisco',
variable = c('B01003_001E'))
# Look at the data
head(sf_pop)
tot_sf_pop <- sum(sf_pop$B01003_001E)
tot_sf_pop
Check your result for tot_sf_pop - does it match what Wikipedia says is the population for SF?
This is never fun! And there are many ways to do it. The basic approach is a table join by a unique identifier, in this case the tract id. With census data the GEOID is typically used. For tract-level data, this is a concatenation of the state, county and tract codes.
sf_pop$GEOID <- paste0(sf_pop$state, sf_pop$county, sf_pop$tract)
head(sf_pop)
# let's simplify the data so that we just have pop for the tracts
# and the area data
sf_pop2 <- sf_pop[c('GEOID','B01003_001E')]
head(sf_pop2)
# rename the columns
names(sf_pop2) <- c('GEOID','pop14')
head(sf_pop2)
# Now join the tabular data to the tracts
# using the tigris function geo_join
pop_by_tracts<- geo_join(sf_tracts, sf_pop2, "GEOID", "GEOID")
head(pop_by_tracts@data)
pop_by_tracts$GEOID.1 <- NULL # get rid of duplicative column data
head(pop_by_tracts@data)
# save the data to a shapefile so we do not need to do that again!
# http://zevross.com/blog/2016/01/13/tips-for-reading-spatial-files-into-r-with-rgdal/
# write to current directory as sf_pop_by_tracts.shp
writeOGR(pop_by_tracts, dsn=".", layer="sf_pop_by_tracts", driver="ESRI Shapefile")
dir()
Ok - you can just read in the sf_pop_by_tracts.shp shapefile instead of doing all that work!
pop_by_tracts <- readOGR(dsn=".", layer="sf_pop_by_tracts")
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "sf_pop_by_tracts"
## with 196 features
## It has 10 fields
head(pop_by_tracts@data)
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
## 0 06 075 010700 1400000US06075010700 06075010700 107 CT
## 1 06 075 012201 1400000US06075012201 06075012201 122.01 CT
## 2 06 075 013102 1400000US06075013102 06075013102 131.02 CT
## 3 06 075 013500 1400000US06075013500 06075013500 135 CT
## 4 06 075 015500 1400000US06075015500 06075015500 155 CT
## 5 06 075 016300 1400000US06075016300 06075016300 163 CT
## ALAND AWATER pop14
## 0 183170 0 5311
## 1 92048 0 4576
## 2 237886 0 2692
## 3 235184 0 2592
## 4 311339 0 3918
## 5 245867 0 4748
Ok, so now that we have pop by census tract, how do we compute the population of each neighborhood?
Start by mapping the neighborhoods & the census tracts together. Note - we want to work with the neighborhood data that has the area and density info - sf_nhoods2_utm10.
plot(pop_by_tracts)
plot(sf_nhoods2_utm10, add=T, border="red")
The neighborhoods don’t show on the map - why?
#transform the census data to utm10
pop_by_tracts@proj4string
## CRS arguments:
## +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
tracts_utm10 <- spTransform(pop_by_tracts, CRS(proj4string(sf_nhoods2_utm10)))
plot(tracts_utm10)
plot(sf_nhoods2_utm10, border="red", add=T)
# remove census tracts not in our neighborhoods
nhood_tracts_utm10 <- tracts_utm10[sf_nhoods2_utm10, ]
# map them again
plot(nhood_tracts_utm10)
plot(sf_nhoods2_utm10, border="red", add=T)
We can see that our two data sets overlap. They are both in a projected coordinate system (utm10). We now need to transfer the population data from the census tracts to the neighborhoods. To do this we need to:
- intersect the census tracts with the neighborhoods
- compute the area of each resulting overlap polygon
- compute the proportion of each tract's area that falls within each neighborhood
- use those proportions to apportion tract populations to neighborhoods
Although this operation is common, there are two things to keep in mind:
#These are the data
## sf_nhoods2_utm10
head(sf_nhoods2_utm10@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile
## 1 5.1700511 1.740795
## 2 1.0772843 2.784780
## 3 0.8566665 5.836577
## 4 0.2245057 8.908460
## 5 1.3913414 1.437462
## 6 1.1230481 1.780868
## nhood_tracts_utm10
head(nhood_tracts_utm10@data)
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
## 0 06 075 010700 1400000US06075010700 06075010700 107 CT
## 1 06 075 012201 1400000US06075012201 06075012201 122.01 CT
## 2 06 075 013102 1400000US06075013102 06075013102 131.02 CT
## 3 06 075 013500 1400000US06075013500 06075013500 135 CT
## 4 06 075 015500 1400000US06075015500 06075015500 155 CT
## 5 06 075 016300 1400000US06075016300 06075016300 163 CT
## ALAND AWATER pop14
## 0 183170 0 5311
## 1 92048 0 4576
## 2 237886 0 2692
## 3 235184 0 2592
## 4 311339 0 3918
## 5 245867 0 4748
# The polygon objects have arbitrary ids.
# Let's set them to GEOID for tracts.
# But nhood is too long a string to use for sf_nhoods,
# so let's set it to the row.names.
sf_nhoods2_utm10$nhood_id <- row.names(sf_nhoods2_utm10@data)
head(sf_nhoods2_utm10@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile nhood_id
## 1 5.1700511 1.740795 1
## 2 1.0772843 2.784780 2
## 3 0.8566665 5.836577 3
## 4 0.2245057 8.908460 4
## 5 1.3913414 1.437462 5
## 6 1.1230481 1.780868 6
# Buffer around the polygons a distance of zero (no buffer)
# We do this only to specify the column for the ID of each polygon!!!!
sfhoods_buf0 <- gBuffer(sf_nhoods2_utm10, width=0, byid=TRUE, id=sf_nhoods2_utm10$nhood_id)
sftracts_buf0 <- gBuffer(nhood_tracts_utm10, width=0, byid=TRUE, id=nhood_tracts_utm10$GEOID)
head(sfhoods_buf0@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile nhood_id
## 1 5.1700511 1.740795 1
## 2 1.0772843 2.784780 2
## 3 0.8566665 5.836577 3
## 4 0.2245057 8.908460 4
## 5 1.3913414 1.437462 5
## 6 1.1230481 1.780868 6
head(sftracts_buf0@data)
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID
## 06075010700 06 075 010700 1400000US06075010700 06075010700
## 06075012201 06 075 012201 1400000US06075012201 06075012201
## 06075013102 06 075 013102 1400000US06075013102 06075013102
## 06075013500 06 075 013500 1400000US06075013500 06075013500
## 06075015500 06 075 015500 1400000US06075015500 06075015500
## 06075016300 06 075 016300 1400000US06075016300 06075016300
## NAME LSAD ALAND AWATER pop14
## 06075010700 107 CT 183170 0 5311
## 06075012201 122.01 CT 92048 0 4576
## 06075013102 131.02 CT 237886 0 2692
## 06075013500 135 CT 235184 0 2592
## 06075015500 155 CT 311339 0 3918
## 06075016300 163 CT 245867 0 4748
plot(sftracts_buf0, col="black")
plot(sfhoods_buf0, col="red", add=T)
# Calculate the fraction of each census tract that overlaps each neighborhood
#After: http://rstudio-pubs-static.s3.amazonaws.com/6577_3b66f8d8f4984fb2807e91224defa854.html
#
# first intersect the two geometries (tracts and neighborhoods)
t2 <- gIntersection(sfhoods_buf0,sftracts_buf0, byid = T)
plot(t2) # take a look
#
head(t2,1) # CHECK OUT THE ID SLOT!!!
## An object of class "SpatialPolygons"
## Slot "polygons":
## [[1]]
## An object of class "Polygons"
## Slot "Polygons":
## [[1]]
## An object of class "Polygon"
## Slot "labpt":
## [1] 554229.2 4176527.4
##
## Slot "area":
## [1] 489397.3
##
## Slot "hole":
## [1] FALSE
##
## Slot "ringDir":
## [1] 1
##
## Slot "coords":
## x y
## [1,] 553978.1 4176561
## [2,] 553997.6 4176668
## [3,] 553877.3 4176753
## [4,] 554160.7 4177173
## [5,] 554490.3 4176940
## [6,] 554271.5 4176871
## [7,] 554373.1 4176538
## [8,] 554615.2 4176335
## [9,] 554517.5 4176486
## [10,] 554637.1 4176415
## [11,] 554652.5 4176128
## [12,] 554506.4 4176103
## [13,] 554448.9 4176109
## [14,] 554331.4 4176008
## [15,] 554167.1 4176124
## [16,] 553835.3 4176358
## [17,] 553978.1 4176561
##
##
##
## Slot "plotOrder":
## [1] 1
##
## Slot "labpt":
## [1] 554229.2 4176527.4
##
## Slot "ID":
## [1] "1 06075023102"
##
## Slot "area":
## [1] 489397.3
##
##
##
## Slot "plotOrder":
## [1] 1
##
## Slot "bbox":
## min max
## x 553835.3 554652.5
## y 4176008.5 4177172.6
##
## Slot "proj4string":
## CRS arguments:
## +init=epsg:26910 +proj=utm +zone=10 +datum=NAD83 +units=m +no_defs
## +ellps=GRS80 +towgs84=0,0,0
#
# second, calculate the areas of the output polygons formed by the intersection
# A combo id (tracts-nhoods) is associated with each area
overlap.area <- gArea(t2, byid = TRUE)
head(overlap.area) # take a look
## 1 06075023102 1 06075025701 1 06075061400 1 06075025800 1 06075023200
## 489397.2910 2361.9119 12785.4676 45.4304 846310.3119
## 1 06075025702
## 325.5555
n.overlaps <- length(overlap.area)
n.overlaps #this is the number of resultant overlaps (intersections)
## [1] 553
# NOTE: the type and order of calls to R functions below ensures that
# resulting dataframe retains IDs in character format; and the area of
# each overlap in numeric format
tmp <- as.character(scan(text = names(overlap.area), what="character"))
head(tmp)
## [1] "1" "06075023102" "1" "06075025701" "1"
## [6] "06075061400"
tmp <- matrix(tmp, nrow = n.overlaps, ncol = 2, byrow = TRUE, dimnames = list(c(NULL), c("nhood.id", "tract.id")))
tmp <- as.data.frame(tmp, stringsAsFactors = FALSE)
overlaps <- cbind(tmp, overlap.area)
rownames(overlaps) <- NULL
# We now have a dataframe of nhood id + tract id + area of overlap for these two geometries
head(overlaps) # Take a look
## nhood.id tract.id overlap.area
## 1 1 06075023102 489397.2910
## 2 1 06075025701 2361.9119
## 3 1 06075061400 12785.4676
## 4 1 06075025800 45.4304
## 5 1 06075023200 846310.3119
## 6 1 06075025702 325.5555
# Calculate the Proportion of area of overlap
# First compute the area of each neighborhood polygon
tmp <- gArea(sfhoods_buf0, byid = TRUE)
nhood.area <- data.frame(nhood.id = names(tmp), nhood.area = tmp, stringsAsFactors = FALSE)
head(nhood.area, 1)
## nhood.id nhood.area
## 1 1 13390446
# Second add the area of each polygon to the overlaps table
overlaps <- data.frame(overlaps, nhood.area = nhood.area[match(overlaps[, "nhood.id"],
nhood.area[, "nhood.id"]), "nhood.area"])
head(overlaps)
## nhood.id tract.id overlap.area nhood.area
## 1 1 06075023102 489397.2910 13390446
## 2 1 06075025701 2361.9119 13390446
## 3 1 06075061400 12785.4676 13390446
## 4 1 06075025800 45.4304 13390446
## 5 1 06075023200 846310.3119 13390446
## 6 1 06075025702 325.5555 13390446
# Now compute the area of each sftracts polygon
tmp <- gArea(sftracts_buf0, byid = TRUE)
tract.area <- data.frame(tract.id = names(tmp), tract.area = tmp, stringsAsFactors = FALSE)
head(tract.area, 1)
## tract.id tract.area
## 06075010700 06075010700 183022.4
# Then add the area of each polygon to the overlaps table
overlaps <- data.frame(overlaps, tract.area = tract.area[match(overlaps[, "tract.id"],
tract.area[, "tract.id"]), "tract.area"])
#check it out
head(overlaps)
## nhood.id tract.id overlap.area nhood.area tract.area
## 1 1 06075023102 489397.2910 13390446 489397.3
## 2 1 06075025701 2361.9119 13390446 638558.3
## 3 1 06075061400 12785.4676 13390446 805193.3
## 4 1 06075025800 45.4304 13390446 240480.9
## 5 1 06075023200 846310.3119 13390446 850098.4
## 6 1 06075025702 325.5555 13390446 516727.8
tail(overlaps)
## nhood.id tract.id overlap.area nhood.area tract.area
## 548 41 06075016100 3.686220e+05 1510121 368635.6
## 549 41 06075016200 2.059215e-02 1510121 272357.3
## 550 41 06075016400 4.729286e+01 1510121 308717.7
## 551 41 06075012402 4.680288e+02 1510121 359515.4
## 552 41 06075015700 9.575137e+02 1510121 830881.0
## 553 41 06075015802 1.702923e+05 1510121 170306.1
Now that we know the area of overlap and the area of the input polygons, we can compute the proportion of each census tract that falls within a neighborhood. We can then use that proportion to compute the population within the neighborhood.
# Divide the area of overlap by the total area to get the proportion of overlap
overlaps$tract.area.share <- overlaps$overlap.area/overlaps$tract.area
head(overlaps)
## nhood.id tract.id overlap.area nhood.area tract.area tract.area.share
## 1 1 06075023102 489397.2910 13390446 489397.3 1.0000000000
## 2 1 06075025701 2361.9119 13390446 638558.3 0.0036988195
## 3 1 06075061400 12785.4676 13390446 805193.3 0.0158787563
## 4 1 06075025800 45.4304 13390446 240480.9 0.0001889148
## 5 1 06075023200 846310.3119 13390446 850098.4 0.9955439123
## 6 1 06075025702 325.5555 13390446 516727.8 0.0006300330
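A quick sanity check on these proportions may help (a sketch): every share should fall in [0, 1], and for a tract that lies entirely within the neighborhoods, its shares should sum to roughly 1 across all the neighborhoods it touches.

```r
# All overlap shares should lie between 0 and 1
range(overlaps$tract.area.share)

# For tracts fully inside the neighborhoods, shares should sum to ~1;
# tracts clipped at the study-area boundary will sum to less
head(tapply(overlaps$tract.area.share, overlaps$tract.id, sum))
```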
For each census tract overlapping a neighborhood, we multiply the tract's population by the proportion of the tract that falls within the neighborhood, then sum these contributions to get the neighborhood population.
#sfhoods_buf0,sftracts_buf0
sfhoods_buf0$pop14 <- 0
for (i in 1:n.overlaps) {
sfhoods_buf0@data[overlaps$nhood.id[i], "pop14"] <- sfhoods_buf0@data[overlaps$nhood.id[i],
"pop14"] + (overlaps$tract.area.share[i] * sftracts_buf0@data[overlaps$tract.id[i],
"pop14"])
}
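The loop above can also be written without explicit iteration (a sketch, assuming the same overlaps table and buffered objects; it relies on the polygon row names being the tract GEOIDs, as set earlier via gBuffer): weight each tract's population by its overlap share, then sum by neighborhood.

```r
# Population contributed by each tract-neighborhood overlap
overlaps$pop_contrib <- overlaps$tract.area.share *
  sftracts_buf0@data[overlaps$tract.id, "pop14"]

# Sum the contributions by neighborhood id
pop_by_nhood <- tapply(overlaps$pop_contrib, overlaps$nhood.id, sum)
head(pop_by_nhood)
```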
head(sfhoods_buf0@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile nhood_id pop14
## 1 5.1700511 1.740795 1 37101.20
## 2 1.0772843 2.784780 2 25775.41
## 3 0.8566665 5.836577 3 20134.59
## 4 0.2245057 8.908460 4 14608.08
## 5 1.3913414 1.437462 5 39449.89
## 6 1.1230481 1.780868 6 13861.97
Did it work? Let’s check.
#tot_sf_pop
#sum(sfhoods_buf0$pop14)
#tot_sf_pop - sum(sfhoods_buf0$pop14)
### ~11,323 difference between the two totals; methods A & B give the same answer
Ok - so now that we have the population within each neighborhood, we can compute the people-per-park ratio for each neighborhood.
head(sfhoods_buf0@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile nhood_id pop14
## 1 5.1700511 1.740795 1 37101.20
## 2 1.0772843 2.784780 2 25775.41
## 3 0.8566665 5.836577 3 20134.59
## 4 0.2245057 8.908460 4 14608.08
## 5 1.3913414 1.437462 5 39449.89
## 6 1.1230481 1.780868 6 13861.97
sfhoods_buf0$pop_per_park <- ifelse(sfhoods_buf0$park_count==0, 0, sfhoods_buf0$pop14/sfhoods_buf0$park_count)
head(sfhoods_buf0@data)
## nhood park_count park_count2 area_sqm
## 1 Bayview Hunters Point 9 9 13390445.7
## 2 Bernal Heights 3 3 2790169.1
## 3 Castro/Upper Market 5 5 2218768.5
## 4 Chinatown 2 2 581470.3
## 5 Excelsior 2 2 3603577.8
## 6 Financial District/South Beach 2 2 2908697.5
## area_sqmiles park_per_sqmile nhood_id pop14 pop_per_park
## 1 5.1700511 1.740795 1 37101.20 4122.356
## 2 1.0772843 2.784780 2 25775.41 8591.805
## 3 0.8566665 5.836577 3 20134.59 4026.917
## 4 0.2245057 8.908460 4 14608.08 7304.041
## 5 1.3913414 1.437462 5 39449.89 19724.944
## 6 1.1230481 1.780868 6 13861.97 6930.984
# top 10
head(sfhoods_buf0@data[order(-sfhoods_buf0$pop_per_park),c('nhood','pop_per_park')], 10)
## nhood pop_per_park
## 21 Nob Hill 25534.38
## 11 Inner Richmond 21322.08
## 5 Excelsior 19724.94
## 16 Lone Mountain/USF 17193.91
## 28 Portola 15893.31
## 36 Tenderloin 13798.50
## 14 Lakeshore 13313.46
## 35 Sunset/Parkside 13276.49
## 25 Outer Mission 11864.77
## 27 Pacific Heights 11799.00
# bottom 10
head(sfhoods_buf0@data[order(sfhoods_buf0$pop_per_park),c('nhood','pop_per_park')], 10)
## nhood pop_per_park
## 8 Golden Gate Park 0.0000
## 15 Lincoln Park 0.0000
## 20 Mission Bay 0.0000
## 33 Seacliff 0.0000
## 37 Treasure Island 0.0000
## 18 McLaren Park 470.7181
## 7 Glen Park 1209.5866
## 32 Russian Hill 3463.6078
## 38 Twin Peaks 3491.6903
## 30 Presidio 3509.2412
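One way to start answering (a sketch, using the sfhoods_buf0 data built above): neighborhoods that have residents but no parks at all are natural candidates, so list them with their estimated populations.

```r
# Neighborhoods with zero parks, ordered by estimated 2014 population
no_parks <- sfhoods_buf0@data[sfhoods_buf0$park_count == 0,
                              c('nhood', 'pop14')]
no_parks[order(-no_parks$pop14), ]
```

Population alone is not the whole story - walking distance to parks in adjacent neighborhoods matters too, which is why the accessibility analysis outlined earlier is worth pursuing.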
Question: If you had the money to fund one new neighborhood park, in what neighborhood would you suggest it be built and why?
Coming soon!