How to download new Census data with R

2015 American Community Survey data is out. Here's how to get the data you want using R.

Histogram of California median household income by community
Credit: Screenshot by Sharon Machlis of histogram created with R

The Census Bureau collects lots of data between its once-every-10-years census. One of the most popular data sets is the bureau's annual American Community Survey (ACS), which fills in some gaps about the US population between decennial censuses. It also compiles a lot of information not asked in the "regular" census, such as household income, that's of great interest to businesses, government policymakers and journalists.

Results from the 2015 American Community Survey were released this morning. Here's how to pull data for your community, county or state with R.

1. Install the censusapi package if it's not already on your system. You'll need the devtools package to install censusapi, so if you don't have that either, install devtools with install.packages("devtools") and then install censusapi with devtools::install_github("hrecht/censusapi").

2. You need an API key from the US Census Bureau in order to use the bureau's API. Keys are free and very quick to receive once you request one. If you don't have a key, head to the Census Bureau's API key signup page and request one.

3. Load the censusapi library:

library(censusapi)

4. Save your API key in a variable; here I'll call it mycensuskey (replace the key string with your actual Census Bureau API key). If you'd rather not store it right in your script, another option is storing it as an environment variable.

mycensuskey <- "YOURKEYSTRINGHERE"
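If you go the environment-variable route, a minimal sketch might look like this. The variable name CENSUS_KEY and the helper read_census_key() are my own assumptions for illustration. The Sys.setenv() call is there only so the example runs on its own; in practice you'd set the variable outside the script, for instance in your .Renviron file.

```r
# Hypothetical helper: read the API key from an environment variable
# instead of hard-coding it in the script.
read_census_key <- function(varname = "CENSUS_KEY") {
  key <- Sys.getenv(varname, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", varname, " is not set")
  }
  key
}

# For demonstration only: normally this is set outside the script.
Sys.setenv(CENSUS_KEY = "YOURKEYSTRINGHERE")

mycensuskey <- read_census_key()
```

The helper stops with an error when the variable is missing, so a forgotten key shows up right away rather than as a failed API request later.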

5. Now you need to know the proper syntax and function arguments. This is the basic format used by censusapi:

myresults <- getCensus(name="apiname", 
  vintage=year, key="yourcensuskey",
  vars=c("datacolumn1", "datacolumn2","datacolumn3"),
  region="yourgeography")

"name" means the name of the api you want to use. You can get a list of all available APIs at the Census Bureau website or by running

availableapis <- listCensusApis()

But I can save you the trouble: The API for getting data from one-year American Community Surveys is called acs1 (there are also 3- and 5-year ACS data releases, with acs3 and acs5 APIs). The vintage is 2015; store it in a variable, myvintage, so it's easy to reuse the script next year: myvintage <- 2015. You should already have a key.

6. Next up is deciding on variables. You can download all available variables with this function:

availablevars <- listCensusMetadata(name="acs1", vintage=myvintage)

Unfortunately, this took a fairly long time to load when I tried it -- possibly because it was right around a major data release. It did eventually download, though, so be patient. Once you get the info, you may want to save it to a CSV file for future use, with `write.csv(availablevars, file="availablevars.csv", row.names = FALSE)`.
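Since the download is slow, one option is to wrap it so the CSV is reused whenever it already exists. This is only a sketch: cached_csv() and fake_fetch() are hypothetical names, and fake_fetch() stands in for the real listCensusMetadata() call so the example runs without a network connection or API key.

```r
# Hypothetical helper: return the CSV at `path` if it exists;
# otherwise call `fetch()`, save the result to `path` and return it.
cached_csv <- function(path, fetch) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  result <- fetch()
  write.csv(result, file = path, row.names = FALSE)
  result
}

# Stand-in for the slow download; counts how often it is called.
fetch_count <- 0
fake_fetch <- function() {
  fetch_count <<- fetch_count + 1
  data.frame(
    name = c("B19013_001E", "B01003_001E"),
    label = c("Median household income in the past 12 months", "Total"),
    stringsAsFactors = FALSE
  )
}

demo_path <- tempfile(fileext = ".csv")
first  <- cached_csv(demo_path, fake_fetch)  # calls fake_fetch() and writes the file
second <- cached_csv(demo_path, fake_fetch)  # reads the file; no second fetch
```

In the real script, the second argument would be something like function() listCensusMetadata(name="acs1", vintage=myvintage), and the path would be "availablevars.csv".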

Take a look at the structure of this data with head(availablevars). Then, you can extract a subset of those variables containing "median household income" with

possible_vars <- subset(availablevars, 
  grepl("median household income", availablevars$label, ignore.case = TRUE))

If you're not familiar with grepl, the function finds which items in a vector match a given text pattern.
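Here's a small illustration of what grepl returns, using a made-up three-row stand-in for the variable list (the real list is far longer, and its labels may be worded differently):

```r
# Toy stand-in for the variable metadata; labels are illustrative only.
toy_vars <- data.frame(
  name = c("B19013_001E", "B19013_001M", "B01003_001E"),
  label = c(
    "Median household income in the past 12 months",
    "Margin of error for median household income in the past 12 months",
    "Total"
  ),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per label; ignore.case = TRUE makes the
# lowercase pattern match "Median" as well as "median".
matches <- grepl("median household income", toy_vars$label, ignore.case = TRUE)

# subset() keeps only the rows where the match is TRUE.
toy_matches <- subset(toy_vars, matches)
```

The logical vector has the same length as the number of rows, which is why it can be passed straight to subset().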

Look at the possible variables, and you'll see the first listing is what you want: B19013_001E, Median household income in the past 12 months. You may also want to take a look at the second listing, B19013_001M, which shows the margin of error. Basic population might be of interest as well, which is simply "Total" if you want to search for it yourself (if you don't want to find it yourself: It's B01003_001E for the data and B01003_001M for the total margin of error).

NAME is also a good variable to include in your query, as it gives the name of your geographic area.

7. Finally, you want to specify a region. You can see available geographies with the same code as when pulling available variables, but add type = "g" to specify geography:

availablegeos <- listCensusMetadata(name="acs1", vintage=myvintage, type="g")

Geography may be the trickiest part of this. To get median income for each state in the US, you can use region="state:*" and run:

median_incomes_us <- getCensus(name="acs1", vintage=myvintage, key=mycensuskey, 
    vars=c("NAME", "B19013_001E", "B19013_001M", "B01003_001E", "B01003_001M"), region="state:*")

But what if you want all the available cities in California? You'll need the FIPS code for California, which conveniently is under the state column in median_incomes_us; get it with

calcode <- median_incomes_us$state[median_incomes_us$NAME=="California"]

(or you can just view the median_incomes_us dataframe since it's not very large.)

Then the format is region="place:*", regionin="state:06", which translates into "all places in the state with FIPS code 06":

median_incomes_cal <- getCensus(name="acs1", 
  vintage=myvintage, key=mycensuskey, 
    vars=c("NAME", "B19013_001E", "B19013_001M", "B01003_001E", "B01003_001M"), region="place:*", regionin="state:06")
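If you'd rather not hard-code "state:06", the regionin string can be built from the looked-up code instead. A minimal sketch, using a toy data frame in place of median_incomes_us so it runs without an API call (the variable name myregionin is my own):

```r
# Toy stand-in for median_incomes_us: just the state code and name columns.
toy_states <- data.frame(
  state = c("01", "06", "36"),
  NAME = c("Alabama", "California", "New York"),
  stringsAsFactors = FALSE
)

# Same lookup as above, against the toy data.
calcode <- toy_states$state[toy_states$NAME == "California"]

# Build the regionin argument from the code rather than typing it by hand.
myregionin <- paste0("state:", calcode)
```

You could then pass regionin=myregionin to getCensus, and switching to another state means changing only the name in the lookup. This assumes the state column is stored as text; if it were numeric, the leading zero would be lost.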

For counties, you'd replace region="place:*" with region="county:*". Note that only places with 65,000 people or more are available in the one-year ACS. Smaller areas are reported in the 3- and 5-year ACS; the newest 5-year survey should be out in December.

If you'd like to rename the 4th column, which holds your main data point of interest, from B19013_001E to something more readable, you can use the code names(median_incomes_cal)[4] <- "MedianIncome" .
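Renaming by position breaks if the column order ever changes. A slightly more defensive alternative is to rename by the existing column name. Here's a sketch on a toy data frame with invented values:

```r
# Toy stand-in for median_incomes_cal; the values are made up.
toy_incomes <- data.frame(
  state = c("06", "06"),
  place = c("00001", "00002"),
  NAME = c("Place A, California", "Place B, California"),
  B19013_001E = c(52000, 71000),
  stringsAsFactors = FALSE
)

# Rename whichever column is called B19013_001E, wherever it sits.
names(toy_incomes)[names(toy_incomes) == "B19013_001E"] <- "MedianIncome"
```

If no column has that name, nothing is renamed, so it's worth checking names() afterward.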

You can get a quick histogram of California median incomes with hist(median_incomes_cal$MedianIncome) or a nicer-looking one with the ggplot2 and scales packages (install them first if they're not on your system) with

library(ggplot2)
library(scales)  # provides the comma label formatter

ggplot(median_incomes_cal, aes(x=MedianIncome)) +
  geom_histogram(fill = "dark green") +
  scale_x_continuous(labels = comma) +
  ggtitle("California Communities' Median Household Income")

Want to compare to the year before? Get the prior year's data,

median_incomes_cal_prioryear <- getCensus(name="acs1", vintage=myvintage-1, key=mycensuskey, 
    vars=c("NAME", "B19013_001E", "B19013_001M", "B01003_001E", "B01003_001M"), region="place:*", regionin="state:06")
names(median_incomes_cal_prioryear)[4] <- "MedianIncomePriorYear"

merge it with the current year's data,

median_incomes_cal_combined <- merge(median_incomes_cal_prioryear, median_incomes_cal, by = c("NAME", "state", "place"))

and calculate the change between years:

median_incomes_cal_combined$change <- median_incomes_cal_combined$MedianIncome - median_incomes_cal_combined$MedianIncomePriorYear
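Two things about merge are easy to miss: By default it keeps only rows found in both data frames, and it adds .x and .y suffixes to same-named columns that aren't part of the key. A toy example with invented numbers shows both:

```r
# Toy data: Place B appears only in the prior year, Place C only in the
# current year. All values are made up.
toy_prior <- data.frame(
  NAME = c("Place A", "Place B"),
  MedianIncomePriorYear = c(50000, 60000),
  B19013_001M = c(2000, 2500),
  stringsAsFactors = FALSE
)
toy_current <- data.frame(
  NAME = c("Place A", "Place C"),
  MedianIncome = c(52000, 70000),
  B19013_001M = c(2100, 3000),
  stringsAsFactors = FALSE
)

# Default merge is an inner join: only Place A survives.
toy_combined <- merge(toy_prior, toy_current, by = "NAME")

# The shared non-key column is now B19013_001M.x (prior) and B19013_001M.y (current).
toy_combined$change <- toy_combined$MedianIncome - toy_combined$MedianIncomePriorYear
```

So if a community crossed the 65,000-population threshold between years, it will quietly drop out of the combined data. Adding all = TRUE to merge would keep it, with NA for the missing year.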

Before making income comparisons between communities or in the same community over time, however, make sure to understand what is a significant difference and what's within the sampling margin of error. The Census Bureau's A Compass for Understanding and Using American Community Survey Data: What the Media Need to Know PDF guide can help -- see the Making Comparisons appendix.
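As a rough sketch of that kind of check: ACS margins of error are published at the 90 percent confidence level, so a standard error is the margin of error divided by 1.645, and the difference between two estimates can be tested against the combined standard error. The function names below are my own, the numbers are invented, and this is a simplified version of the approach, so check the Census Bureau's guidance before relying on it.

```r
# Convert a 90 percent margin of error to a standard error.
moe_to_se <- function(moe) {
  moe / 1.645
}

# TRUE when the difference between two estimates is statistically
# significant at the 90 percent level (the default critical value).
is_significant_change <- function(est1, moe1, est2, moe2, critical = 1.645) {
  z <- (est1 - est2) / sqrt(moe_to_se(moe1)^2 + moe_to_se(moe2)^2)
  abs(z) > critical
}

# Invented examples: a $2,000 difference with wide margins of error,
# and a $5,000 difference with narrow ones.
wide_margins   <- is_significant_change(62000, 3500, 60000, 3000)
narrow_margins <- is_significant_change(65000, 1200, 60000, 1000)
```

The first comparison comes out not significant and the second significant. Because the function is vectorized, it should also work on whole columns of the combined data frame, where the merge leaves the two income margins of error as B19013_001M.x and B19013_001M.y.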

If you want to map this data in R, check out Create maps in R in 10 (fairly) easy steps. Or, download our free Advanced Beginner's Guide to R PDF, which includes a section on mapping.
