Simple features in R
Would you like to make a color-coded choropleth map in R? Until recently, working with geospatial objects was a bit more complex than with many other types of data. That's because R "Spatial Polygons Data Frames" objects were structured something like this if you ran an R str() command to see their structure:
Structure of a Spatial Polygons Data Frame in R, without simple features
The non-spatial data -- things like unemployment rates that you might want to map by county -- resided in special @data slots, which were unfamiliar to many R users. Slots weren't that tough to deal with, but they were Yet Another Thing To Learn.
Thanks to the Simple Features for R project, though, it's possible to represent geospatial data in what looks very much like a conventional data frame.
Here's what the same data looks like as a simple features object:
The more human-readable structure of a simple features R object.
The data is now in what appears to be a "regular" R data frame; the geospatial data is tucked into a special list column containing geographies. This makes it easy to merge data sets like employment or median wages with a shapefile containing geographies.
I'll show you one way to do this using R simple features, creating a map of median wages for IT managers by state. As is appropriate for pretty much any real-world analysis and visualization, most of the project involves acquiring and cleaning the data. Once the data is in the right format, mapping itself will be a breeze. If you want to skip to the mapping section, head straight to step 4.
First, though, make sure to install and load two mapping packages with:
install_packages("tmap")
install_packages("tmaptools")
library("tmap")
library("tmaptools")
1. Download geography file of U.S. states and import it into R. (There are several popular geospatial file formats to choose from. For this tutorial, I'll use shapefiles.)
The easiest way to do this is Kyle E. Walker's tigris package.
For this project, you'll need the development version of tigris from GitHub, not the older version available on CRAN, because the CRAN version doesn't yet support simple features. If you've already got the devtools R package on your system, you can install newer tigris with devtools::install_github('walkerke/tigris')
. If you don't have devtools, you can either install devtools first, or a lighter-weight GitHub-installation package called remotes. I suggest opting for remotes; you can install that on your system with install.packages("remotes")
. Next, install tigris with remotes::install_github('walkerke/tigris')
and load it with library("tigris")
.
To import a shapefile of U.S. states into R with tigris as a simple features object, just run the command:
us_geo <- states(class = "sf")
Note that states()
without class = "sf"
will return a Spatial Polygons Data Frame and not a simple features object.
There's more information about the tigris package's capabilities at the tigris GitHub website.
If you have problems with this, as I occasionally had due to issues with the Census Bureau's API this week, you can download a shape file manually from the bureau's TIGER Cartographic Boundary Shapefiles site here: https://www.census.gov/geo/maps-data/data/cbf/cbf_state.html. Choose the cb_2015_us_state_20m.zip file, download and unzip it.
Or, download and unzip the file within R using the code:
download.file("http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_state_20m.zip", destfile = "states.zip")
unzip("states.zip")
(Assuming the file downloaded into your working directory. If not, make sure to include the full path to your zip file ).
Then read the geographic information into an R object called us_geo with
us_geo <- read_shape("cb_2015_us_state_20m.shp", as.sf = TRUE, stringsAsFactors = FALSE)
2. Import the data you want to map. I went to the U.S. Bureau of Labor Statistics Occupational Employment Statistics query page, chose "One occupation for multiple geographical areas," and then picked "Computer and Information Systems Managers" by geographic type State, "All states in this list," "Annual median wage" as the data type, and Excel as the output.
If you open the file, you'll see there are 3 rows of meta data at the top (R will count merged rows 2, 3 and 4 as a single row) and 4 rows at the bottom. I moved the footnotes to a separate tab and skipped the first 3 rows when importing (there's no easy way to skip rows at the end).
I like the rio package for reading in data. If you don't have it, you can install it with install.packages("rio")
.
wages <- rio::import("mydatafile.xlsx", skip = 3)
The spreadsheet's column names are somewhat R-unfriendly -- both have spaces in them -- so I'll change them with
names(wages) <- c("Area", "Median.Wage")
More importantly, the wage column imports as character strings, not numbers, I'm guessing because of extra spaces in the data. I trimmed white space and then converted Median.Wages into numbers with
wages$Median.Wage <- trimws(wages$Median.Wage)
wages$Median.Wage <- as.numeric(wages$Median.Wage)
Unfortunately, state names in this data look like "Alabama(0100000)" instead of "Alabama."
There are several ways of getting rid of the (01000000) and similar text in the state names. The most robust way is with a regular expression -- delete anything that's not an alpha character. This code does so:
wages$State <- gsub("[^[:alpha:]]", "", wages$Area)
But if you're not yet ready for regular expressions in R, you could simply count the number of characters in the (#######) portion of each state name -- there are 9 -- and remove that number of characters at the end of each state name. The format for subsetting a string is substr(thestring, startIndex, stopIndex)
. Here, we want just the part of the string that starts with the first character and ends with the last character - 9:
wages$State <- substr(wages$Area, 1, nchar(wages$Area) - 9)
Now, data in the wages State column in wages matches state names in the us_geo NAME column.
Finally, get rid of any rows where Median.Wages are unavailable:
wages <- wages[!is.na(wages$Median.Wage),]
Or, if you use the dplyr package (my preference):
wages <- dplyr::filter(wages, !is.na(Median.Wage))
3. Merge (join) the geospatial and data files. This is easy with the tmaptools package's append_data() function:
wagemap <- append_data(us_geo, wages, key.shp = "NAME", key.data = "State")
4. Finally, Create your map. This part is incredibly easy with tmap's qtm() (quick theme map) function:
qtm(wagemap, fill = "Median.Wage")
Or with the more robust tmap() function:
tm_shape(wagemap) +
tm_polygons("Median.Wage", id = "NAME")
The id = "NAME"
portion isn't actually necessary for this map, but it's useful for the next step.
For this map, it's a little tough to see the contiguous 48 states because the map is zoomed out to include all of Alaska and Hawaii.
Static map with all 50 states is a bit difficult to view.
There are a couple of ways to deal with this. For static maps, you can temporarily zoom in by removing Alaska, Hawaii and Puerto Rico (which is what happens if you zoom in to the contiguous 48), with:
contig_48 <- subset(wagemap, !(NAME %in% c("Alaska", "Hawaii", "Puerto Rico")))
and then mapping the contig_48:
tm_shape(contig_48) +
tm_polygons("Median.Wage", id = "NAME")
Static map of median U.S. wages for computer and IS managers created with the R tmap package.
For presentation purposes, there's an excellent tmap tutorial on creating insets for Alaska and Hawaii.
But the best option for data exploration and Web publication:
5. Create an interactive map with two simple lines of code.
Plot the original version of 50 states plus Puerto Rico again:
tm_shape(wagemap) +
tm_polygons("Median.Wage", id = "NAME")
To make this interactive, you just need to switch tmap's mode from "plot," which is static, to "view", which is interactive, using the tmap_mode() function:
tmap_mode("view")
Then, to re-draw the last map, run last_map()
. That's it! You should have a zoomable interactive map, where clicking on a state gives data details.
The leaflet R mapping package has many more ways to customize an interactive map than tmap's interactive mode offers. In addition, the leaflet package lets you save interactive maps as stand-alone HTML files. The good news: It's easy to turn a tmap interactive map into a leaflet map object, and then run any additional leaflet functions on it. Save the tmap map to a variable
my_interactive_map <- tm_shape(wagemap) +
tm_polygons("Median.Wage", id = "NAME")
and then turn it into a leaflet object with the tmap_leaflet() function
my_interactive_map <- tmap_leaflet(my_interactive_map)
With the htmlwidgets package, save the map with saveWidget(my_interactive_map, "mymap.html")
. You can see a version of the map below, which I centered and zoomed using leaflet's setView() function after loading the leaflet package:
my_interactive_map <- my_interactive_map %>%
setView(-96, 37.8, zoom = 4)
For more on tmap, check out the tmap in a nutshell vignette. For leaflet capabilities, see the leaflet package website.