Closed Captioning Closed captioning available on our YouTube channel

Get month-over-month comparisons in R

InfoWorld | Jun 26, 2019

See various ways to calculate month-over-month, week-over-week, and year-over-year comparisons in R using dplyr and lubridate.

Similar
Hi. I’m Sharon Machlis at IDG Communications, here with episode 30 of Do More With R: Month-Over-Month and Year-Over-Year Calculations.
If you ever have to analyze changes like “last month vs the prior month” or “last month vs the same month a year earlier,” R is a nice choice. It’s easy to do those calculations. And, you don’t have to worry whether a spreadsheet formula was properly clicked and dragged to cover all your needed cells.
Like so many things in R, there are multiple ways to do this. I’ll show you a couple.
First, I’ll import some data about daily cycling trips on Bluebikes, the bicycle-share system in Boston, Cambridge, and 3 other nearby cities. If you want to follow along, the associated article at InfoWorld has the data.

My initial run-through will have a few extra steps so you can see what’s going on.
The first block of code adds a new column to the data, called YearMonth. It uses base R’s format() function; and it creates a yyyy-mm format for each trip starting date. Next is a typical dplyr group_by and summarize. In this case, I’m grouping by YearMonth and then, within each group, creating a column called MonthlyTrips that has the sum of all that month’s trips. Finally, I make sure the data is arranged by YearMonth.
Now that I’ve got monthly subtotals, I’ll calculate the month-over-month and year-over-year changes by using dplyr’s lag() function. lag() defaults to “the previous value in the data frame column as currently ordered.” You can change the number of items back you want to lag so it’s more than 1 back. To compare with the previous month, a default of 1 is fine. To compare with the previous year, I’ll want the lag to be 12, for 12 items back. Note that this only works if there aren’t any missing months. You might want to add in some code to make sure of that.
You can also set the column you want to order and lag by if the data isn’t ordered by that column. (I don’t need to in this case, since my data is already ordered by YearMonth.)
All the data is here, although the format could be nicer. This code multiplies the fractions by 100 to create percent format, and then rounds to 1 decimal place.
That’s the easiest example of monthly reporting – only one data point per month. Sometimes, though, you’ll have multiple categories you want to track each month, such as comparisons by city, or age group, or customer type. It’s pretty easy to tweak this code for that, you just group by your category in addition to the month before summarizing.
See here, I’m doing the same thing as I did before, except I’d also like to compare trips by month and user type. There are two types of users in this data set: “customer” and “subscriber”. I’m reading in the data, adding the YearMonth column, grouping by YearMonth AND usertype, and then doing everything else like before: adding up trips within each month, and arranging. You don’t have to arrange by category as well as month, but I find it easier to read results that way
You see now I’ve got month-over-month and year-over-year comparisons by both Subscriber user type and Customer user type.
This could also use some data formatting to turn the fractions into percents.
Now it’s easy to graph the data with ggplot2 since it’s the right format. I’ll load the library and create the graph.
If you just need “last month” for a report, use dplyr’s filter() function and set the YearMonth column to equal whatever the maximum value is that data.
Week over week is slightly different. For that, I generally use the “week starting date” instead of a format like year hyphen week number. Week numbers can get complicated when there are partial weeks at the start and end of years
For weekly reporting, I’ll use the lubridate package and its floor_date() function. You can also use base R’s cut.Date() function, but that’s a bit more complicated since it returns factors instead of dates, so then I usually end up running extra code to turn them back to dates. Floor_date() is one step for what I want.
To create my “WeekStarting” column, I’m using floor_date, the date column as the first argument, and “weeks” as my unit.
Next, the usual calculations, this time for week-over-week and year-over-year. Note that for year over year, the lag is 52 instead of 12. I also need to know my data, and that I’ve got rows for every week.
If you’re thinking – OK, that’s straightforward, but it’s still a lot to type just for a weekly report … make an RStudio code snippet! If you don’t know how they work, check out the Do More With R episode on code snippets.
Here’s the snippet I made for monthly reports:
You can also download it at the associated article – URL should be on screen. If I open a new R script file and load the snippet. I have a script template for my monthly report. I can now tab for each variable.
There you have it: Once it’s set up, maybe even faster than Excel.
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at go dot infoworld dot com slash more with R, all lowercase except for the R
You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel.
Hope to see you next episode!
Featured videos from IDG.tv