Useful new R packages for data visualization and analysis


Now that we’ve enjoyed some eye candy, I want to spend the rest of the session today on dplyr, a relatively new package by Hadley Wickham. Hadley is the author of a number of popular R packages including the ggplot2 visualization library.

The goal of dplyr is to offer a fairly easy, rational approach to data manipulation. Hadley talks about a handful of basic, core things you want to do when manipulating data:

To choose only certain observations or rows by 1 or more criteria: filter()

To choose only certain variables or columns: select()

To sort: arrange()

To add new columns: mutate()

To summarize or otherwise analyze by subgroups: group_by() and summarise()

To apply a function to data by subgroups: group_by() and do()
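
To make the list above concrete, here is a minimal sketch of the main verbs on a tiny made-up data frame. The `flights` table and its values are invented for illustration and are not the Georgia data; only the column names mirror the ones used later. It leaves out do().

```r
library(dplyr)

# Made-up flights, for illustration only (not the Georgia data)
flights <- data.frame(
  CARRIER   = c("DL", "DL", "WN", "WN", "AA"),
  DEST      = c("BOS", "ORD", "BOS", "BOS", "ORD"),
  DEP_DELAY = c(10, -3, 25, 5, 40),
  stringsAsFactors = FALSE
)

to_bos     <- filter(flights, DEST == "BOS")                 # keep rows matching a condition
two_cols   <- select(flights, CARRIER, DEP_DELAY)            # keep only some columns
sorted     <- arrange(flights, desc(DEP_DELAY))              # sort, longest delay first
with_hours <- mutate(flights, delay_hrs = DEP_DELAY / 60)    # add a new column

# Summarize by subgroup: one row per carrier
by_carrier <- flights %>%
  group_by(CARRIER) %>%
  summarise(avgdelay = mean(DEP_DELAY))

by_carrier
```

Each verb takes a data frame as its first argument and returns a new one, which is what lets them chain together with %>%.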

There are other useful functions, such as top_n() for the top n items in a group, the ranking functions min_rank() and dense_rank(), and the offset functions lead() and lag().
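
A quick sketch of how the ranking and offset helpers behave on a plain vector. The delay values are made up for illustration.

```r
library(dplyr)

delays <- c(30, 10, 30, 50)        # made-up delays in minutes

r_min   <- min_rank(delays)        # ties share the lowest rank, next rank is skipped: 2 1 2 4
r_dense <- dense_rank(delays)      # ties share a rank, no gaps: 2 1 2 3
prev    <- lag(delays)             # previous value: NA 30 10 30
nxt     <- lead(delays)            # next value: 10 30 50 NA
```

Note that once dplyr is loaded, lag() refers to dplyr's version rather than the one in base R's stats package.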

dplyr creates a class of data frame called tbl_df that behaves largely like a data frame but has some convenience functionality, such as not accidentally printing out hundreds of rows if you type its name.

Hadley has a sample data package called nycflights13 for learning dplyr, but let’s see if we can load in my CSV file of domestic flights in and out of Georgia airports.

Note that data was available only for January through November 2014 when I downloaded this.

library(dplyr)
ga <- read.csv("GAontime.csv", stringsAsFactors = FALSE, header = TRUE)

NOTE: read.csv can take a while to process large data files. In a hurry?

Use the data.table package’s fread function. data.table has its own object classes and its own ecosystem of functions. If you’re not planning to use those (I don’t), just convert the object back to a data frame or dplyr tbl_df object:

ga <- data.table::fread("GAontime.csv")
# We can turn this into a dplyr class tbl_df object with
# (tbl_df() is deprecated in dplyr 1.0.0 and later; as_tibble(ga) is the replacement there)
ga <- tbl_df(ga)
# Now see what happens if you just type the variable name
ga
# We'll look at the structure:
str(ga)
# There's also a dplyr-specific function glimpse() with a slightly better format
glimpse(ga)
# Let's just get Hartsfield data. We want to filter for either ORIGIN or DEST being Hartsfield with code ATL
atlanta <- filter(ga, ORIGIN == "ATL" | DEST == "ATL")
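
One detail worth seeing on a small scale: inside filter(), the | operator means OR, while separating conditions with a comma means AND. The rows below are made up for illustration.

```r
library(dplyr)

# Made-up rows, for illustration only
ga_toy <- data.frame(
  ORIGIN = c("ATL", "SAV", "BOS", "SAV"),
  DEST   = c("BOS", "ATL", "SAV", "ORD"),
  stringsAsFactors = FALSE
)

either <- filter(ga_toy, ORIGIN == "ATL" | DEST == "ATL")   # OR: keeps 2 rows
both   <- filter(ga_toy, ORIGIN == "ATL", DEST == "BOS")    # comma means AND: keeps 1 row
```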

Now there are all sorts of questions we can answer with this data.

What’s the average, median and longest delay for flights to a specific place by carrier? Feel free to pick the airport you’re flying to from Atlanta if it’s domestic; I’m going to use Boston’s Logan Airport.

bosdelays1 <- atlanta %>%
  filter(DEST == "BOS") %>%
  group_by(CARRIER) %>%
  summarise(
    avgdelay = mean(DEP_DELAY, na.rm = TRUE),
    mediandelay = median(DEP_DELAY, na.rm = TRUE),
    maxdelay = max(DEP_DELAY, na.rm = TRUE)
    )

bosdelays1
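
Why the na.rm = TRUE in each of those calls? A small sketch, assuming some rows have a missing DEP_DELAY (the vector here is made up):

```r
delays <- c(12, NA, -4, 30)                # NA stands in for a row with no recorded delay

with_na    <- mean(delays)                 # NA: one missing value makes the whole result missing
without_na <- mean(delays, na.rm = TRUE)   # 12.66667: the NA is dropped first
longest    <- max(delays, na.rm = TRUE)    # 30
```

Without na.rm = TRUE, a single missing value in a carrier's group would turn that carrier's summary into NA.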

# Or just the average delay by airline to Boston? 

avg_delays <- atlanta %>%
  filter(DEST == "BOS") %>%
  group_by(CARRIER) %>%
  summarise(avgdelay = mean(DEP_DELAY, na.rm=TRUE))

avg_delays

# What's the average delay by airline for each month to a specific destination? 

avg_delays_by_month <- atlanta %>%
  filter(DEST == "BOS") %>%
  group_by(CARRIER, MONTH) %>%
  summarise(avgdelay = round(mean(DEP_DELAY, na.rm=TRUE),1))

avg_delays_by_month
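
A side note on grouping by two columns: by default, summarise() peels off only the last grouping variable, so the result above is still grouped by CARRIER. Here is a minimal sketch on made-up data, assuming dplyr 1.0.0 or later (which is where the .groups argument was added):

```r
library(dplyr)

# Made-up rows, for illustration only
toy <- data.frame(
  CARRIER   = c("DL", "DL", "DL", "WN"),
  MONTH     = c(1, 1, 2, 1),
  DEP_DELAY = c(10, 20, 5, 30),
  stringsAsFactors = FALSE
)

monthly <- toy %>%
  group_by(CARRIER, MONTH) %>%
  summarise(avgdelay = mean(DEP_DELAY), .groups = "drop_last")  # spell out the default

group_vars(monthly)        # "CARRIER" is still a grouping variable
flat <- ungroup(monthly)   # drop the remaining grouping before further work
```

Leftover grouping matters if you go on to arrange(), mutate() or slice the result, since those then work within each carrier.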

# Not as easy to see those, let's make a datatable (the function comes from the DT package):

library(DT)
datatable(avg_delays_by_month)

Miss Excel pivot tables? You can do them in R too!

First let’s get a subset of the data we want

bos_delays <- subset(atlanta, DEST=="BOS", select=c("CARRIER", "DEP_DELAY", "MONTH"))
library("rpivotTable")
rpivotTable(bos_delays)

Let’s select Average from the dropdown, and then DEP_DELAY. Drag CARRIER to the row box. Want to see average delay by month? Drag MONTH to the column header. Want more visuals? Select a heatmap by column.

But back to “regular” R….

What were the top 5 longest delays per airline?

delays <- atlanta %>%
  select(CARRIER, DEP_DELAY, DEST, FL_NUM, FL_DATE) %>%    # columns I want
  group_by(CARRIER) %>%
  top_n(5, DEP_DELAY) %>%
  arrange(CARRIER, desc(DEP_DELAY))

View(delays)
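
If you are on dplyr 1.0.0 or later, top_n() still works but has been superseded by slice_max(). A minimal sketch of the same idea on made-up data, taking the two longest delays per carrier:

```r
library(dplyr)

# Made-up rows, for illustration only
toy <- data.frame(
  CARRIER   = c("DL", "DL", "DL", "WN", "WN"),
  DEP_DELAY = c(100, 250, 40, 90, 300),
  stringsAsFactors = FALSE
)

top2 <- toy %>%
  group_by(CARRIER) %>%
  slice_max(DEP_DELAY, n = 2) %>%   # rows with the 2 largest DEP_DELAY values in each group
  ungroup()

top2
```

Like top_n(), slice_max() keeps ties by default, so a group can return more than n rows when delays are equal.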

# Which are the unlucky destinations in those top 5?

table(delays$DEST)

# What were the top 5 longest delays per destination?

delays2 <- atlanta %>%
  select(CARRIER, DEP_DELAY, DEST, FL_NUM, FL_DATE) %>%    # columns I want
  group_by(DEST) %>%
  top_n(5, DEP_DELAY) %>%
  arrange(DEST, desc(DEP_DELAY))    # sort by destination, longest delay first

View(delays2)

# Can do basics such as percentage delayed flights by airline
# Can use either subset or the true dplyr-way below

atlanta_delays1 <- subset(atlanta, select=c("CARRIER", "DEP_DEL15")) %>%  
  group_by(CARRIER) %>%
  summarize(
    Percent = sum(DEP_DEL15, na.rm = TRUE) / n()
    )

atlanta_delays2 <- atlanta %>%
  group_by(CARRIER) %>%
  summarize(
    Delays = sum(DEP_DEL15, na.rm = TRUE),
    Total = n(),
    Percent = round((Delays / Total) * 100, 1)
  ) %>%
  arrange(desc(Percent))
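
One subtlety in that percentage, sketched on made-up data and assuming DEP_DEL15 is a 0/1 flag that can be missing: n() counts every row, including rows where the flag is NA, so those rows stay in the denominator. Taking the mean of the flag with na.rm = TRUE drops them instead.

```r
dep_del15 <- c(1, 0, 0, 1, NA, 0)   # made-up 0/1 flags, one missing

# Equivalent to sum(...) / n(): the NA row still counts toward the total
share_all_rows <- sum(dep_del15, na.rm = TRUE) / length(dep_del15)   # 2 / 6

# The NA row is removed from both numerator and denominator
share_known <- mean(dep_del15, na.rm = TRUE)                         # 2 / 5
```

Which one you want depends on whether rows with no delay flag should count as flights at all.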

# and a basic bar chart of the percentages

library(ggplot2)
ggplot(data = atlanta_delays2, aes(x=CARRIER, y=Percent))  + geom_bar(stat="identity")

# To order the bars by Percent instead of alphabetically, plus add color and a title:

ggplot(data = atlanta_delays2, aes(x=reorder(CARRIER, Percent), y=Percent))  + geom_bar(stat="identity", fill="lightblue", color="black") + 
  xlab("Airline") + ggtitle("Percent delayed flights from Atlanta Jan-Nov 2014")
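
What reorder() is doing there, as a small sketch with made-up carriers and percentages: it turns the labels into a factor whose levels are sorted by the second argument, and ggplot2 draws discrete axes in level order.

```r
carriers <- c("AA", "DL", "WN")   # made-up carriers
percent  <- c(20, 10, 30)         # made-up percentages

alpha_levels <- levels(factor(carriers))             # alphabetical: "AA" "DL" "WN"
pct_levels   <- levels(reorder(carriers, percent))   # by increasing percent: "DL" "AA" "WN"
```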

Not a new package, but if you’re not familiar with googleVis and want to see the code, this generates an HTML page:

library("googleVis")
# get just the data we want - carrier and percent
delay_subset <- subset(atlanta_delays2, select=c("CARRIER", "Percent"))
gchart <- gvisColumnChart(delay_subset, options = list(title="Percent ATL delays by carrier"))
plot(gchart)

Are there specific airplanes that flew in/out of Atlanta most often?

by_plane <- count(atlanta, TAIL_NUM) %>%
  arrange(desc(n))
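
count() is shorthand for a group_by() plus summarise() with n(). A minimal sketch on made-up tail numbers shows that the two give the same counts:

```r
library(dplyr)

# Made-up tail numbers, for illustration only
toy <- data.frame(
  TAIL_NUM = c("N101", "N202", "N101", "N303", "N101", "N202"),
  stringsAsFactors = FALSE
)

short_way <- count(toy, TAIL_NUM) %>%
  arrange(desc(n))

long_way <- toy %>%
  group_by(TAIL_NUM) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
```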

# what's the distribution? (this chart uses the ggvis package)

library(ggvis)

by_plane %>%
  ggvis(x = ~n, fill := "gray") %>% 
  layer_histograms(width =  input_slider(10, 200, step = 10, label = "binwidth")) 

# How might delays be related to distances flown? Hadley shows this code:

by_tailnum <- group_by(atlanta, TAIL_NUM)
delay <- summarise(by_tailnum,
                   count = n(),
                   dist = mean(DISTANCE, na.rm = TRUE),
                   delay = mean(ARR_DELAY, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

Hadley also found not much correlation between distance flown and delays in the NYC data.

The googleVis library may not be a new library, but it’s got some new options. Let’s read in some Atlanta weather data:

atlwx <- read.csv("AtlantaTemps.csv", stringsAsFactors = FALSE)
atlwx$date <- as.Date(atlwx$date, format="%Y-%m-%d")
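
The format string has to match how dates are written in the file. The line above assumes ISO-style dates such as 2014-11-30; here is a small sketch with made-up strings showing what happens when the style differs:

```r
iso_date <- as.Date("2014-11-30", format = "%Y-%m-%d")   # ISO style parses as expected
us_date  <- as.Date("11/30/2014", format = "%m/%d/%Y")   # US style needs its own format string
mismatch <- as.Date("11/30/2014", format = "%Y-%m-%d")   # wrong format string gives NA
```

If the date column comes back as all NA after conversion, a mismatched format string is the first thing to check.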

# Run this code and see what happens:

dataviz <- gvisCalendar(atlwx, datevar="date", numvar="max",
                  options=list(title="Daily high temps in Atlanta",
                               width = 1000, height = 8000))
plot(dataviz)

We won’t run this now because everyone would need a plot.ly account and API key, both free. But with 2 lines of code you can turn a ggplot image into an interactive JavaScript visualization and embed it.

library("plotly")

# if you had a plot.ly API key installed:

myplotly <- plotly()
myplotly$ggplotly()

For more on R, see my Beginner’s Guide to R:

PDF download: http://cwrld.us/LearnRpdf HTML version: http://cwrld.us/IntroToR

Other packages I would have liked us to work with if we had more time:

rvest: Easy Web scraping package by Hadley Wickham. Step-by-step instructions on using it with the Selectorgadget bookmarklet: http://bit.ly/1zgq8JW

FredR: More exploring of FRED data from the Federal Reserve Bank of St. Louis; you need a free API key from the Federal Reserve site. https://github.com/jcizel/FredR

plot.ly: Turn static ggplot2 graphics into interactive JavaScript visualizations easily embedded on the Web. Free plot.ly account and API key needed. https://plot.ly/ggplot2/

rbokeh: implements an R version of the Python bokeh interactive Web plotting library. It’s still under development but already well documented. http://hafen.github.io/rbokeh/rd.html

metricsgraphics: another interesting graphing project, interfacing to the MetricsGraphics.js D3 JavaScript library https://github.com/hrbrmstr/metricsgraphics

Additional things we could have done with more time using the packages we tried:

A script created by a TCU prof lets you create choropleth maps of World Bank data with Leaflet in a single line of code! More info here:

http://rpubs.com/walkerke/wdi_leaflet

You can do considerably more sophisticated GIS work with Leaflet and R.

Draw circles with a 2km radius around each marker, for example. Tutorial by TCU assistant prof Kyle Walker: http://rpubs.com/walkerke/rstudio_gis

Tutorial on creating choropleth maps with your own shapefiles and data

http://rpubs.com/walkerke/leaflet_choropleth

More info about Leaflet on the Leaflet project page http://rstudio.github.io/leaflet/

Do you use the ggplot2 visualization package? Save typing -- and syntax-lookup -- time with our free ggplot2 code snippets.

Copyright © 2015 IDG Communications, Inc.
