
How to use the dtplyr package

InfoWorld | Nov 15, 2019

The dtplyr 1.0 package lets you write dplyr R code and access speedy data.table performance. Find out how.

Copyright © 2019 IDG Communications, Inc.

Hi. I’m Sharon Machlis at IDG, here with episode 37 of Do More With R: dplyr syntax + data.table speed with the new dtplyr 1.0 package.

The R worlds of tidyverse and data.table moved a little closer this week with the release of dtplyr 1.0 on CRAN. It lets fans of tidyverse’s dplyr syntax access data.table speed on the back end – all without having to learn a new format for their code.

But there’s also good news for dplyr users who want to learn how to use data.table. That’s because you can see the data.table code that dtplyr generates from your dplyr functions.

Let’s take a look.

I’ll start by importing a data set with two and a half million rows: flight delays at US airports between May and August this year.

Next, I’ll use regular dplyr to find the average arrival delay at each destination airport, by airline, ordered first by airport and then average delay within each airport (from largest to smallest).
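The code shown on screen isn't part of this transcript, so here is a minimal sketch of what that dplyr step might look like. The tiny `flights` tibble and the column names `DEST`, `CARRIER`, and `ARR_DELAY` are stand-ins I made up for illustration, not the real flight delay file or its headers.

```r
library(dplyr)

# Hypothetical stand-in for the 2.5-million-row flight delay data
flights <- tibble(
  DEST      = c("BOS", "BOS", "BOS", "JFK", "JFK", "JFK"),
  CARRIER   = c("AA", "AA", "DL", "AA", "DL", "DL"),
  ARR_DELAY = c(10, 20, 5, 12, NA, 40)
)

# Average arrival delay per destination and airline,
# sorted by airport, then by average delay from largest to smallest
avg_delays <- flights %>%
  group_by(DEST, CARRIER) %>%
  summarize(avg_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  arrange(DEST, desc(avg_delay))

print(avg_delays)
```

The `na.rm = TRUE` argument is my assumption too; it keeps missing delay values from turning a whole group's average into NA.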

Now, how would I do that with dtplyr? First I’d create a “lazy” data table version of my data. (“Lazy” just means that code won’t be executed right away, but only when specifically requested.)

This “lazy” object is a dtplyr “step” object, as you can see from checking its class. Let me print it out.

Check the message in the last line. This lazy dtplyr object isn’t a regular data frame, so if I want to access its data, I need to turn it into a data.table, data frame, or tibble.
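Creating and inspecting the lazy object can be sketched roughly like this. `lazy_dt()` is the dtplyr function for this step; the object name `lazy_flights` and the small made-up `flights` data are my own placeholders.

```r
library(dplyr)
library(dtplyr)

# Hypothetical stand-in data; column names are assumptions
flights <- tibble(
  DEST      = c("BOS", "BOS", "BOS", "JFK", "JFK", "JFK"),
  CARRIER   = c("AA", "AA", "DL", "AA", "DL", "DL"),
  ARR_DELAY = c(10, 20, 5, 12, NA, 40)
)

# Create the lazy data table; nothing is computed yet
lazy_flights <- lazy_dt(flights)

# The class includes "dtplyr_step"
class(lazy_flights)

# Printing shows a preview plus a note about how to get at the data
print(lazy_flights)
```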

OK, next, I can run the same dplyr code that I did before, but on my lazy object. If I print out the result, I see I have another lazy object. Again, if I want to access the data, I need to turn it back into a data table, data frame, or tibble. But look above, where it says Call: that shows the data.table code that dtplyr generated.
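Here is a hedged sketch of that step, again with made-up data and column names. Besides printing the lazy result, dtplyr supports `show_query()`, which prints just the generated data.table call.

```r
library(dplyr)
library(dtplyr)

# Hypothetical stand-in data; column names are assumptions
flights <- tibble(
  DEST      = c("BOS", "BOS", "BOS", "JFK", "JFK", "JFK"),
  CARRIER   = c("AA", "AA", "DL", "AA", "DL", "DL"),
  ARR_DELAY = c(10, 20, 5, 12, NA, 40)
)

lazy_flights <- lazy_dt(flights)

# Same dplyr verbs as before, applied to the lazy object
lazy_result <- lazy_flights %>%
  group_by(DEST, CARRIER) %>%
  summarize(avg_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  arrange(DEST, desc(avg_delay))

# Still a lazy object; this prints only the data.table translation
show_query(lazy_result)

# Converting is what actually runs the query and returns usable data
result <- as_tibble(lazy_result)
print(result)
```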

You can use that code to run the same commands using the data.table package itself. First, I need the data as a data table. Now if I copy the Call: code and change DT2 to mydt, it should work. And there we are.
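For reference, a native data.table version of this query could look something like the sketch below. It follows the general shape of code dtplyr tends to generate for a grouped summary plus a sort, but the exact Call: text depends on your dtplyr version, and the data and column names here are assumptions.

```r
library(data.table)

# Hypothetical stand-in data, created directly as a data.table
mydt <- data.table(
  DEST      = c("BOS", "BOS", "BOS", "JFK", "JFK", "JFK"),
  CARRIER   = c("AA", "AA", "DL", "AA", "DL", "DL"),
  ARR_DELAY = c(10, 20, 5, 12, NA, 40)
)

# Summarize by group, then chain a second [ ] to sort:
# airport ascending, average delay descending
dt_result <- mydt[
  , .(avg_delay = mean(ARR_DELAY, na.rm = TRUE)),
  keyby = .(DEST, CARRIER)
][order(DEST, -avg_delay)]

print(dt_result)
```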

To recap, here’s full code for using dtplyr: Create a lazy data table, run your usual dplyr code, then turn the result back into a data frame, tibble, or data table.

In this code I don’t create a second, stand-alone lazy copy of the object, as I did before. That’s something I wouldn’t want to do with large data. Instead, I create the lazy data table in the first step of the piped commands, as hopefully you can see here.
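The recap pipeline might look something like this sketch, with `lazy_dt()` as the first piped step and `as_tibble()` as the last. As before, the data and column names are illustrative assumptions.

```r
library(dplyr)
library(dtplyr)

# Hypothetical stand-in data; column names are assumptions
flights <- tibble(
  DEST      = c("BOS", "BOS", "BOS", "JFK", "JFK", "JFK"),
  CARRIER   = c("AA", "AA", "DL", "AA", "DL", "DL"),
  ARR_DELAY = c(10, 20, 5, 12, NA, 40)
)

avg_delays_dtplyr <- flights %>%
  lazy_dt() %>%                      # step 1: make it lazy, no saved copy
  group_by(DEST, CARRIER) %>%        # step 2: usual dplyr code
  summarize(avg_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  arrange(DEST, desc(avg_delay)) %>%
  as_tibble()                        # step 3: run it and get a tibble back

print(avg_delays_dtplyr)
```

Swapping `as_tibble()` for `as.data.table()` or `as.data.frame()` returns the other two formats.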

Let me run this. I hope you could see how fast that was. I ran some crude benchmarking, and dtplyr was 4 to 5 times faster for this particular task. As always, speed comparisons depend on the data set and operations used. But it’s a good bet that dtplyr will usually be faster than dplyr. If your data set is large enough that speed is important, and you’re a dplyr syntax fan, dtplyr may be a good option. I’m thinking of rewriting some of my Shiny web app code from dplyr to dtplyr to improve performance.
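If you want to try a crude timing comparison yourself, one simple approach is base R's `system.time()`. This is not necessarily how the numbers above were measured; it's a sketch on randomly generated data with made-up column names, and on a data set this small the gap may be tiny or even reversed.

```r
library(dplyr)
library(dtplyr)

# Simulated data; size, values, and column names are all assumptions
set.seed(1)
n <- 200000
flights <- tibble(
  DEST      = sample(c("BOS", "JFK", "ORD", "SFO"), n, replace = TRUE),
  CARRIER   = sample(c("AA", "DL", "UA"), n, replace = TRUE),
  ARR_DELAY = rnorm(n, mean = 5, sd = 30)
)

# Time the plain dplyr version
dplyr_time <- system.time(
  dplyr_result <- flights %>%
    group_by(DEST, CARRIER) %>%
    summarize(avg_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
    arrange(DEST, desc(avg_delay))
)

# Time the dtplyr version, including conversion back to a tibble
dtplyr_time <- system.time(
  dtplyr_result <- flights %>%
    lazy_dt() %>%
    group_by(DEST, CARRIER) %>%
    summarize(avg_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
    arrange(DEST, desc(avg_delay)) %>%
    as_tibble()
)

# Elapsed seconds for each approach; results vary by machine and data
dplyr_time["elapsed"]
dtplyr_time["elapsed"]
```

A single run like this is noisy, so repeating it several times gives a fairer picture.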

And if you’re a tidyverse user who wants to learn data.table, dtplyr may be a good tool to translate code you know into code you’re learning.

That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at go dot infoworld dot com slash more with R, all lowercase except for the R.

You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel -- where you can subscribe so you never miss an episode. Hope to see you next time, when I’ll be talking about joining data three ways: base R, dplyr, and data.table.