Introduction
A number of UCB students and researchers have asked the D-Lab about tools for geocoding street addresses. The variety of departments and schools these researchers represent - including Public Health, ESPM (Environmental Science, Policy, and Management), Music, History and Sociology, among others - indicates the widespread usefulness of geocoding. In response to this interest, the D-Lab is taking a number of steps to help the UCB community with its geocoding needs. This semester we have two workshops on geocoding which you can check out on our website: 1) Introductory Geocoding and Mapping (April 6, 2015), which emphasizes programming and the Data Science Toolkit (DSTK), and 2) Address Geocoding with ArcGIS (April 21, 2015), which focuses on geocoding without programming. As part of our consulting services, D-Lab provides individual consultations on geocoding and other spatial analysis methods and tools. Additionally, with funding from Prof. Rachel Morello-Frosch of the School of Public Health and ESPM, the D-Lab has set up a secure server to help UCB researchers geocode restricted access data.
Below we provide an overview of geocoding and geocoding software options to help you get started. Contact the D-Lab if you have any questions.
What is Geocoding?
For those who may not be familiar with geocoding, here is a very brief introduction. Geocoding is the process of computing the geographic coordinates for a place reference such as a city name, zip code or street address. I will focus on address geocoding, although the methods for geocoding other types of places are similar. Address geocoding requires (1) an input file that contains the list of addresses to be geocoded, (2) a reference database of geographic features, such as streets with address ranges, against which these places are compared, and (3) software for determining the match between these two, i.e., whether the input address can be located in the reference database.
The output of geocoding is a point for each input address or place for which a match was found, typically expressed as latitude and longitude. Most geocoders will also output metadata on the matching process, including how the software parsed and standardized the input place, whether or not it found a match for that place in the reference database, and the quality of that match. For address geocoding, the matching process is typically one of linear interpolation, where the coordinates for the input address are estimated using a reference database of street centerlines for which address ranges are known for each street segment and, in many cases, for each side of the street as well. A key assumption of this approach is that parcels are uniformly distributed along a street segment, i.e., that they are the same size and consistently numbered. In the image shown below the geocoding software assumes nine similarly sized parcels on the left side of the street between 5640 and 5656 Spring Garden Rd, and thus the location of 5650 would be a little past the halfway point on the line segment for that address range, measured from the 5640 end.
Source: http://www.directionsmag.com/articles/three-standard-geocoding-methods/1...
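To make the interpolation idea concrete, here is a minimal sketch in Python. It is not the algorithm of any particular geocoder; the function name and the segment end coordinates are made up for illustration, and it assumes a straight segment with evenly spaced, consistently numbered parcels.

```python
def interpolate_address(house_number, range_start, range_end, start_xy, end_xy):
    """Estimate a point for house_number along a straight street segment.

    range_start and range_end are the address numbers at the two ends of the
    segment (for one side of the street). start_xy and end_xy are (x, y)
    tuples for those two ends.
    """
    if range_start == range_end:
        fraction = 0.5  # single-address segment: fall back to the midpoint
    else:
        fraction = (house_number - range_start) / (range_end - range_start)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("house number is outside the segment's address range")
    # Move the same fraction of the way along the line between the two ends.
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return fraction, (x, y)


# Hypothetical end coordinates for the 5640-5656 address range.
fraction, point = interpolate_address(5650, 5640, 5656, (0.0, 0.0), (160.0, 0.0))
print(fraction, point)
```

For 5650 in the range 5640 to 5656 the fraction is 10/16, so the estimated point sits at 0.625 of the segment length from the 5640 end. Real geocoders typically also offset the point to the correct side of the street.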
Why Geocode?
There are three main reasons for geocoding addresses and other named places:
- to display places on a map,
- to determine relationships between places, such as distance, direction, or pattern of dispersion, and
- to use geographic location to associate places with other geographically referenced data such as census demographic data, sales transactions, locations of historic settlements, pollution concentrations, etc.
For a detailed, though dated, example of a public health application see the Harvard Public Health Geocoding Project.
Related Terms
It’s a good idea to get a handle on a set of related terms so that you can seek help or reference materials more effectively.
- Batch Geocoding - geocoding more than one address at the same time.
- Reverse Geocoding - determining the place name, zip code or address for a location specified by geographic coordinates.
- Address matching - a synonym for address geocoding.
- Geolocating - determining the geographic coordinates of an object or feature on or near the surface of the earth, for example with GPS.
- Georeferencing - making a non-geospatial object, like a scanned map or photograph, geospatial by referencing points within it to known geographic locations.
Geocoding Multiple Addresses
It is pretty easy to geocode one address. You can put the address in the search bar of Google Maps and the returned URL will contain the longitude and latitude of the geocoded address. However, this approach does not scale well to multiple addresses. For more than a few, you will need to use an address geocoder.
What is an address geocoder?
An address geocoder is software that:
- parses input addresses into identifiable components, such as street name and number, city, state and zip code,
- matches input addresses to a reference database of spatial features (e.g. streets or parcels) having those same identifiable components,
- determines the geographic coordinates for the matched input addresses,
- evaluates (or scores) the matches, and
- outputs coordinates for the matched addresses along with metadata that describes the match.
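The steps in this list can be sketched as a toy geocoder in Python. Everything here is an assumption for illustration: the input format ("number street, city, ST zip"), the tiny reference table with made-up coordinates, and the use of the standard library's difflib to score street-name similarity. Real geocoders use far richer parsing and matching.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference database: (number, street, zip) -> (lat, lon).
# The place and the coordinates are made up.
REFERENCE = {
    ("100", "MAIN ST", "90000"): (37.0, -122.0),
    ("200", "OAK AVE", "90000"): (37.1, -122.1),
}

# Assumed input format: "100 Main St, Exampleville, CA 90000"
ADDRESS_RE = re.compile(
    r"^\s*(\d+)\s+(.+?)\s*,\s*(.+?)\s*,\s*([A-Z]{2})\s+(\d{5})\s*$"
)


def parse(address):
    """Split an address into components, or return None if it does not parse."""
    m = ADDRESS_RE.match(address.upper())
    if m is None:
        return None
    number, street, city, state, zip_code = m.groups()
    return {"number": number, "street": street, "city": city,
            "state": state, "zip": zip_code}


def geocode(address, reference=REFERENCE, min_score=0.8):
    """Parse, match, score, and return coordinates plus match metadata."""
    result = {"input": address, "matched": False, "score": 0.0,
              "lat": None, "lon": None}
    parts = parse(address)
    if parts is None:
        return result
    best_score, best_coords = 0.0, None
    for (number, street, zip_code), coords in reference.items():
        # Require exact house number and zip; score the street name fuzzily.
        if number != parts["number"] or zip_code != parts["zip"]:
            continue
        score = SequenceMatcher(None, parts["street"], street).ratio()
        if score > best_score:
            best_score, best_coords = score, coords
    result["score"] = round(best_score, 2)
    if best_coords is not None and best_score >= min_score:
        result["matched"] = True
        result["lat"], result["lon"] = best_coords
    return result


print(geocode("100 Main St, Exampleville, CA 90000"))
print(geocode("100 Mian St, Exampleville, CA 90000"))  # misspelled street
```

The misspelled input still matches, but with a score below 1, which is the kind of metadata a researcher should review rather than take at face value.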
Geocoding software comes in many forms. It can be one of the features of a desktop software package installed locally on your computer, such as ArcGIS or QGIS. It can be a mapping tool like Google Earth Pro (now free!) or an online service like the US Census Geocoder. It can be an online service with an application programming interface (or API) that allows you to write a program in R, Python, or JavaScript, among other languages, to geocode one or more addresses. The Google Geocoding API, MapQuest Open Geocoding Service, Photon and Data Science Toolkit (DSTK) are all examples of popular online geocoders with APIs. Geopy is a Python library for programmatically accessing several popular geocoding APIs. ggmap is an R package for doing the same.
Geocoder Reference Database
The reference database is an extremely important component of the geocoder. Its geographic coverage, completeness, currency and spatial resolution are the primary determinants of the quality of the output - as well as the cost of the geocoder. A lookup table, sometimes called an address locator, is used by the geocoder to identify the data components, or fields, that are common to both the input addresses and reference database, e.g., street name and number, city, state, and zip code, and enable the matching process. The lookup table may operate behind the scenes - on the backend - and the end user may not be aware of it. It is primarily important in that it determines the way in which input addresses need to be formatted in order for the geocoder to work. If a geocoder allows for the input of a user supplied reference database then the geocoder also will have software for creating a custom lookup table. This level of customization greatly increases the functionality of a geocoder but also its complexity.
The reference database is often bundled with the geocoding software, whether online or in a local desktop software package. In other configurations, the geocoding software may be installed locally on your computer but the reference database may be an online resource. Examples of this type of setup include the QGIS plugin MMQGIS, which links to the Google Maps Geocoder, and ArcGIS Desktop, which links by default to the ArcGIS Online World Geocoding Service. Both QGIS and ArcGIS can be configured to use a local reference database, although this approach is much more robust and sophisticated in ArcGIS.
Address Cleaning and Standardization
The geocoding process always begins with a set of addresses to be geocoded. From the user perspective, the most important, tedious, and time consuming step in the geocoding process is data cleaning. This includes tasks like:
- Reducing addresses to their core components like street & number, city, state, zip and removing non-essentials like suite or Apt #.
- Standardizing the format of the address components, e.g., highway becomes HWY, SO becomes S. (south), extra spaces are removed, etc.
- Standardizing the format of the addresses prior to geocoding, e.g., one address per line with/without delimited columns and column headers.
The standardized address format is geocoder specific and thus this step should not be taken until you have selected your geocoder and have tested a small sample of your addresses with it to make sure you have identified the specific input address format. If you have lots of addresses you may need to use software like MS Excel or write a short computer program or script to help automate the cleaning process.
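As a rough illustration of such a script, the Python sketch below upper-cases a street line, strips unit designators such as Apt or Suite, and applies a small abbreviation table. The abbreviation table and the list of unit keywords are assumptions; replace them with whatever your chosen geocoder expects.

```python
import re

# Assumed abbreviation table; adjust to your geocoder's expected format.
ABBREVIATIONS = {
    "HIGHWAY": "HWY",
    "STREET": "ST",
    "AVENUE": "AVE",
    "SOUTH": "S",
    "SO": "S",
}

# Unit designators (e.g. "APT #4B", "SUITE 200", "# 12") to be dropped.
UNIT_RE = re.compile(
    r"(?:\b(?:APT|APARTMENT|SUITE|STE|UNIT)\b\.?\s*#?\s*|#\s*)[\w-]+"
)


def clean_street(street):
    """Return a standardized street line: upper case, no unit, abbreviated."""
    text = street.upper()
    text = UNIT_RE.sub(" ", text)        # remove suite/apartment numbers
    text = re.sub(r"[.,]", " ", text)    # drop periods and commas
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)               # also collapses extra spaces


print(clean_street("  123 So. Main Street,  Apt #4B "))
print(clean_street("4500 Highway 1 Suite 200"))
```

Running the cleaner on a small sample first, and comparing the result with what the geocoder actually accepts, is a cheap way to catch rules that are too aggressive.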
Geocoding Output
Although geographic coordinates are typically returned by a geocoder, the coordinate system can vary. Most online geocoders will return geographic coordinates referenced to a coordinate system based on either the WGS84 or NAD83 datum. In most but not all use cases the difference between data in these two coordinate systems will be negligible. We caution you to be aware of the issue and consider its impact on your research. Whenever possible you should identify and document the coordinate system used by your digital geographic data. Moreover, point data in geographic coordinates should almost always be converted to a projected coordinate system before they are mapped or used to determine spatial relationships or associations with other data sets. If you are unfamiliar with map projections and coordinate systems we suggest you read up on this topic before post-processing your geocoded addresses. A good, basic reference which you should be able to find online is the ESRI document, Understanding Map Projections.
Most geocoders return a score or confidence value that indicates the strength of the match between the input address and the reference database. For example, the score may range from 0 to 1, where a value of 1 indicates a strong match. However, a strong match with the reference database is not necessarily a strong match on the ground, especially if the quality of the reference database is poor. On the other hand, a low confidence score, say 0.75, may be a fantastic match on the ground and only a poor match with the reference database if, for example, the input street name was misspelled but was still matched to the correct street. In this case the lower score indicates the geocoder’s uncertainty due to the spelling issue and is not a locational accuracy issue. A geocoder may also provide additional information on the output quality. For example, a geocoded address may have address, street or city level accuracy, where address is the most desirable. The important point here is that a researcher needs to carefully examine and understand the geocoding output and assess whether it meets the research needs.
Geocoding output quality will depend on the following factors:
- the detail, accuracy and currency of the geocoder’s reference database,
- the accuracy, completeness and standardization of the input addresses,
- the geocoder’s address parsing sophistication (i.e., its ability to parse non-standard address components and handle misspellings, abbreviations, etc.),
- the sophistication of the geocoder’s matching algorithm, and
- the researcher’s review of the output, with revision of unmatched addresses and iteration as needed.
Address standardization, geocoder selection, output review, revision and iteration are the parts of the process that the researcher can control. Getting access to a suitable geocoder with a high quality reference database is much harder.
Tips for processing a large number of addresses (more than ~100,000)
- Test the process on a small sample of addresses to get an understanding of which types of addresses might not match or might match poorly, and why. This will also help you understand the geocoder’s output format and metadata.
- Preprocess your addresses to standardize formatting.
- Sort your addresses by state, city, zip as this will speed up the geocoding process.
- Provide all the address information you know. For example, if all addresses are in CA make sure that CA is listed as the state for all addresses.
- Chunk input addresses into smaller files, e.g. of 25,000 addresses each, that can be processed sequentially.
- Assess output quality of geocoding with simple summary statistics to see if the output meets your research needs. Summary statistics may include:
  - count of matched vs. unmatched addresses,
  - count of addresses matched at each score or confidence level range, e.g., 0.75-0.85, 0.85-1.
- Review a random sample of your output on a map with the reference data to get a visual impression of the quality of the output. Check a few of these results against what you would get from Google Maps.
- Consider the geocoding output format and the format in which you will need these data for further analysis.
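Two of the tips above, chunking the input and summarizing the output, are easy to script. The Python sketch below is one possible approach; the result dictionaries with "matched" and "score" keys are a hypothetical output format, so adapt the field names to what your geocoder actually returns.

```python
from collections import Counter


def chunk(addresses, size=25000):
    """Yield successive lists of at most `size` addresses."""
    for start in range(0, len(addresses), size):
        yield addresses[start:start + size]


def summarize(results, bins=((0.85, "0.85-1.00"), (0.75, "0.75-0.85"))):
    """Count matched and unmatched results, and matches per score range.

    `results` is a list of dicts with hypothetical keys "matched" (bool)
    and "score" (0 to 1). `bins` lists (lower bound, label) pairs from the
    highest range to the lowest.
    """
    counts = Counter()
    lowest_bound = bins[-1][0]
    for row in results:
        if not row["matched"]:
            counts["unmatched"] += 1
            continue
        counts["matched"] += 1
        for lower, label in bins:
            if row["score"] >= lower:
                counts[label] += 1
                break
        else:
            # Matched, but below the lowest listed range.
            counts[f"below {lowest_bound:.2f}"] += 1
    return dict(counts)


sizes = [len(part) for part in chunk(list(range(60000)), size=25000)]
print(sizes)

sample = [
    {"matched": True, "score": 0.98},
    {"matched": True, "score": 0.80},
    {"matched": True, "score": 0.60},
    {"matched": False, "score": 0.0},
]
print(summarize(sample))
```

Each chunk can then be written to its own file and processed sequentially, and the summary gives a quick check on whether the match rate meets your research needs before you look at a sample on a map.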
Geocoding Considerations
Before you can select a geocoder you need to identify your project needs. Here are some of the factors you will need to consider.
- What level of locational specificity do you need from your output - street block, within a few houses, on the property, on the structure?
- What level of completeness do you require? Do you need to geocode all of your addresses or will 90% suffice?
- What is your geocoding budget? Are you trying to do this for free? Due to its business intelligence value and system costs, a for-fee geocoding service can be very expensive. For example, at $0.004 per address it would cost $4 to geocode one thousand or $4,000 to geocode one million addresses using the ArcGIS Online geocoder.
- Do you know and want to use a specific geocoding tool, like ArcGIS or QGIS, or programming language, like R or Python?
- What is the geographic scope of your addresses - are they all in the U.S.? Most free geocoders have limited geographic coverage outside of the U.S.
- What is the temporal scope of your addresses? For example, are they gathered from historical documents or are they from 2010 survey data? Consider that the geocoder’s reference database will reflect a specific time period, say 2010. Streets developed with new homes after that period will not be in the database. Changes over time, e.g., properties replaced by freeways or address range changes on streets, will not be captured in the database. You may need to create a custom reference database or use several developed at different times.
- Are you geocoding restricted access data? If this is the case you may not be able to use an online geocoding service (unless it can meet your security and price point needs). You may instead need to use local geocoding software with a local reference database. This greatly limits your options.
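The budget question above comes down to simple arithmetic. The Python sketch below uses two figures quoted elsewhere in this post, $0.004 per address for ArcGIS Online credits and a free limit of 2,500 addresses per day for the Google geocoder; both are assumptions that may be out of date, so check the current terms before relying on them.

```python
import math


def paid_cost(n_addresses, price_per_address=0.004):
    """Dollar cost of geocoding n_addresses at a flat per-address price."""
    return n_addresses * price_per_address


def days_under_daily_limit(n_addresses, daily_limit=2500):
    """Days needed to stay within a free daily quota."""
    return math.ceil(n_addresses / daily_limit)


for n in (1000, 100000, 1000000):
    print(n, round(paid_cost(n), 2), days_under_daily_limit(n))
```

Under these assumptions, 100,000 addresses would cost about $400 on the paid service or take 40 days under the free daily limit, which is the kind of trade-off to settle before choosing a geocoder.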
Which Geocoder to Use?
Below is a table of characteristics of popular geocoders that I would recommend to the UCB community. This table is not comprehensive and it may be out of date as the terms of use, particularly for the online services, change frequently. Moreover, new online geocoding tools are continually being released. However, it indicates the key factors to consider and the relative merits of listed options.
Address Geocoding: Some Options for the UCB Community
CHARACTERISTIC | GOOGLE GEOCODING SERVICE | ESRI ARCGIS ONLINE | ESRI ARCGIS DESKTOP | DATA SCIENCE TOOLKIT (DSTK) | CENSUS GEOCODER | OPENSTREETMAP VIA MAPQUEST OPEN, PHOTON, OR OPENCAGE |
---|---|---|---|---|---|---|
REQUIRES PROGRAMMING | Not if used via Google Earth Pro or QGIS mmqgis plugin | Not if used via ArcGIS Desktop | No | Yes | No | Yes |
LIMITS ON FREE GEOCODING | 2,500 addresses per day | 1,250 addresses with new account | None if using local reference database | No | No, but only 1,000 addresses at a time | No |
RELATIVE OUTPUT QUALITY | High | High | High | Medium | Medium | Medium |
GEOGRAPHIC COVERAGE | Global | Global | Depends on reference database | USA, UK | USA | Global, with inconsistent coverage outside USA |
ONLINE OR LOCAL SERVICE | Online | Online | Either | Either | Online | Online |
ESTIMATED CURRENCY | Relatively current | Relatively current | Depends on reference database | Varies, ~2010 or later for US locations | Varies, ~2010 or later for US locations | Varies, ~2010 or later for US locations |
Recommendations
My specific geocoder recommendations are as follows. If you can use a Google geocoder you should. In my opinion, Google’s geocoding output quality, speed, currency, and geographic scope are unmatched. It is a bit tricky to get the output in a simple CSV file without programming but you can do it via the following workflow: Input addresses to Google Earth Pro > Output to KML > Import KML to geojson.io > Save as CSV.
If you cannot use the Google Geocoder because you have many more than the 2,500 addresses that can be processed per day, you can try the US Census Geocoder. The Census Geocoder is super simple to use and it has the added benefit of outputting Census FIPS codes - making it easier to link your addresses to census data. If you cannot use an online geocoder due to restrictions on the use of your data or if you have more addresses than you can reasonably geocode using the Google or Census geocoders then you should try the ESRI ArcGIS geocoder. It’s fast, accurate, customizable, robust, outputs rich metadata, has a user friendly interface, and, if used with the ESRI streets database, is free to the UCB Community for research & educational uses.
You can install the ArcGIS software on your personal computer by requesting a license via the UCB Geospatial Innovation Facility (GIF) website. If you are installing ArcGIS on a campus computer, talk to your unit’s system administrator for information on how to proceed. Note, ArcGIS runs on a Microsoft Windows-based PC or on a Mac with Windows installed via Boot Camp, Parallels, or VMware Fusion. ArcGIS is memory intensive software and you may need to run it on a computer with more memory and processing power than what you have in your personal computer, especially if you are running it on a Mac. If this is the case, you can try the GIS workstation in the D-Lab or UCB Earth Sciences & Map Library. Another big benefit to using ArcGIS is that ESRI provides a ton of online documentation for this tool and on geocoding in general.
The ArcGIS Desktop geocoder uses the ArcGIS Online World Geocoding service as the default reference database. This service requires that you first register for an ESRI ArcGIS Online account and either buy ESRI credits (which translate to $0.004 per address at the time of this writing) or use free introductory offer credits. This approach is not viable for geocoding more than approximately 1,000 addresses unless you wish to pay for the credits. A solid and extremely useful free alternative is to access or obtain a copy of the North American Streets data from the UCB Maps Library or the D-Lab. This will give you a local geocoding service with superfast, robust geocoding of unlimited addresses, on the order of 1,000,000 per hour. Unfortunately, the most recent version of the NA Streets data available to the UCB campus community is circa 2009 and thus will not include addresses for housing developments created post-2009. This is the main disadvantage to the ArcGIS geocoder. One alternative is to use a hybrid approach where you geocode most of your addresses in ArcGIS and then use the Google Geocoder for those addresses which ArcGIS cannot match.
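The hybrid approach can be sketched as a simple fallback loop in Python. The two geocoders here are stand-in functions backed by made-up lookup tables, not calls to ArcGIS or Google; the only assumption about a real geocoder is that it can be wrapped in a function that returns coordinates or None.

```python
def hybrid_geocode(addresses, primary, fallback):
    """Geocode with `primary`; retry unmatched addresses with `fallback`.

    Both geocoders are callables that take an address string and return a
    (lat, lon) tuple, or None when no match is found.
    """
    results = {}
    for address in addresses:
        coords = primary(address)
        source = "primary"
        if coords is None:
            coords = fallback(address)
            source = "fallback" if coords is not None else None
        results[address] = {"coords": coords, "source": source}
    return results


# Stand-in geocoders with made-up data, for illustration only.
LOCAL_2009 = {"100 MAIN ST": (37.0, -122.0)}
ONLINE = {"100 MAIN ST": (37.0, -122.0), "5 NEW HOMES CT": (37.2, -122.2)}

results = hybrid_geocode(
    ["100 MAIN ST", "5 NEW HOMES CT", "1 NOWHERE RD"],
    primary=LOCAL_2009.get,
    fallback=ONLINE.get,
)
for address, info in results.items():
    print(address, info)
```

Recording which geocoder produced each point is worth the extra column, since the two sources may differ in currency and positional accuracy.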
If you want to use ArcGIS with non-U.S. or historical addresses you will need to find or create (e.g., digitize from historical maps) your own reference database first. If you are not a member of the UCB community and/or don’t have an ArcGIS license and are looking for a free geocoding service, try the Data Science Toolkit or one of the geocoding options based on OpenStreetMap (see table above). These are all solid tools if you cannot use the Google Geocoder, though I prefer DSTK due to its speed and ease of use.
Beyond Address Geocoding
For those of you who want to map your geocoded addresses, ArcGIS/ArcMap and QGIS are great desktop tools, CartoDB is a great online mapping tool, and R and Python offer mapping functionality via spatial packages and libraries. CartoDB is also a great tool if you have a dataset that has place name references (as opposed to street addresses) that you want to very easily geocode and map or export to CSV, KML, or Shapefile.
If you want to link your geocoded addresses to census data you need to get the Census FIPS code for each address. If you used the Census Geocoder you can get this automatically, otherwise it requires a bit of work. You can get FIPS codes programmatically via the FCC Census Block Conversions API. In ArcGIS, you can get these using the spatial intersect tool (not Spatial Join as that could be too slow) to join the FIPS codes in census tract, block group, or block level geographic data, which you can download here, with your geocoded addresses. Note, you must first make sure your two data sets have the same spatial reference system (i.e. map projection) before intersecting. D-Lab is planning a workshop for May 2015 on linking points to census data so keep your eyes on our calendar.
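Conceptually, linking points to census geography is a point-in-polygon test. The pure-Python sketch below uses the standard ray-casting method on made-up square "tracts" with placeholder FIPS codes; it assumes the points and polygons already share the same planar coordinate system, and it is an illustration of the idea rather than a substitute for the ArcGIS or PostGIS tools, which handle real boundaries and large data sets far better.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon`.

    `polygon` is a list of (x, y) vertices; the last vertex connects back
    to the first. Points exactly on an edge may fall on either side.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_fips(points, tracts):
    """Map each point id to the FIPS code of the tract containing it."""
    assigned = {}
    for point_id, (x, y) in points.items():
        assigned[point_id] = None
        for fips, polygon in tracts.items():
            if point_in_polygon(x, y, polygon):
                assigned[point_id] = fips
                break
    return assigned


# Placeholder tract FIPS codes and made-up square boundaries.
TRACTS = {
    "99999000100": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "99999000200": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
POINTS = {"a": (0.5, 0.5), "b": (1.5, 0.5), "c": (5.0, 5.0)}
print(assign_fips(POINTS, TRACTS))
```

A tract-level FIPS code has 11 digits: 2 for the state, 3 for the county and 6 for the tract. Block group and block codes extend it with further digits.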
Moving forward, ArcGIS offers an unparalleled toolkit for spatial analysis of your geocoded addresses. That said, there are free and open source, or FOSS, options for spatial analysis including QGIS, a very popular ArcGIS alternative, and spatial packages in R and Python. I also highly recommend that you check out PostGIS, an extension to PostgreSQL for creating a spatially-enabled database, as an alternative spatial analysis environment to ArcGIS. As a detailed example of working with both ArcGIS and PostGIS, the Harvard Center for Geographic Analysis published a great blog post on geocoding 53 million addresses.
Getting Help
Please sign up for a consult with me via the D-Lab website if you have geocoding questions. Check out the D-Lab website Trainings page and join our mailing list if you want to be alerted to upcoming geocoding workshops. You can also reach out to Susan Powell, the UCB Maps & GIS librarian, the GIF, or BIDS (Berkeley Institute for Data Science) for consulting on geocoding, GIS, and related topics.
Happy Geocoding!