Data manipulation tricks: Even better in R

Anything Excel can do, R can do -- at least as well.

big data charts graphs analysis woman user
Thinkstock

After covering a recent session on data munging with Excel, I wanted to see how those tasks could be accomplished in R. Surely anything you can do in a spreadsheet should be doable in a platform designed for heavy-duty statistical analysis!

(New to R? You can get up and running with our Beginner's guide to R series.)

To more easily follow these tips and code, we have a companion PDF that you can download here. (Free registration is required.)

To get started, you can download the Excel Magic PDF and sample data spreadsheet and then follow along. (The original Excel tips come from MaryJo Webster, senior data reporter with Digital First Media. You may see an error message when trying to view the Excel file in her Dropbox account, but the download link should still work.)

If you want to follow along, you'll first want to load data from her sample spreadsheet into R. There are several ways to do this, including:

Note: If a package I reference is not already installed on your system, you'll need to install it first by using R's install.packages() function. Here's how to install the xlsx package:

install.packages("xlsx", dependencies=TRUE)

This package can be a little finicky in Windows due to Java issues.

You only need to install a package once on a system. However, in order to use it, you need to load it in each session. Here's how to load the xlsx package with library():

library(xlsx)

If you downloaded the spreadsheet to run R code on that sample data, set your working R directory to whatever directory holds the spreadsheet. Replace DIRECTORY with your actual directory; keep in mind that capitalization matters:

setwd("DIRECTORY")

If you are using the RStudio IDE for R, you can create a new RStudio project in the directory with the spreadsheet, and then automatically be switched to your working directory each time you load that project. See more about projects in RStudio.

Finally: Let's start coding!

Dates: Extract month, day and year from each date in a column

We'll start by parsing a single example date: "4/3/04." If you load in an Excel spreadsheet with dates, your dates may already be R date objects. If you've pulled in a CSV file, though, they may just be character strings. If your date is just a text string, first we'd need to turn that text into a date object and store it in a variable. The package lubridate is helpful for date parsing.

If lubridate isn't already installed on your system, make sure to run:

install.packages("lubridate", dependencies=TRUE)

Then we'll load lubridate with library(lubridate) and use lubridate's mdy() function to let R know that the date format is month/day/year -- and not, say, the European day/month/year. We can then use lubridate functions such as year() and month() to parse the date, similar to functions in Excel:

library(lubridate)

## Warning: package 'lubridate' was built under R version 3.0.3

mydate <- mdy("4/3/04") #get year<br> year(mydate)

## [1] 2004

# get month<br> month(mydate)<br> #as number

## [1] 4

month(mydate, label=TRUE)#as name of month

## [1] Apr ## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

day(mydate)

## [1] 3

#day of week as number<br> wday(mydate)

## [1] 7

#day of week as name of day

wday(mydate, label=TRUE)

## [1] Sat<br> ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

# number of the week<br> week(mydate)

## [1] 14

Calculating ages and other date arithmetic

The lone tricky thing about doing date arithmetic in R is making sure you've got your data in the correct date format for the package and function you decide to use.

The eeptools package has an extremely handy and elegant age_calc() function. It requires R Date objects as input, which are easy to create with base R's as.Date() function. When using as.Date(), you just need to remember to tell R the format of your character string, such as %m/%d/%y' for mm/dd/yy and %m/%d/%Y' for mm/dd/yyyy. There's a list of how to describe some common date formats at Quick-R.

In this test, we'll calculate how many days of summer there are between Memorial Day (May 26) and Labor Day (Sept. 1) in 2014. Remember to install eeptools with install.packages("eeptools") if it's not already on your system, then load it with:

library(eeptools)

## Loading required package: ggplot2 <br> ## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.0.3

## Loading required namespace: car

MemorialDay <- as.Date("5/26/2014", format="%m/%d/%Y")#Create date object for Memorial Day LaborDay <- as.Date("9/1/2014", format="%m/%d/%Y")#Create date object for Labor Day<br> summerdays <- age_calc(MemorialDay, LaborDay, units="days")#The difference between the two dates in units of days<br> summerdays #This variable is an object of class difftime

## Time difference of 98 days

#To see the number of days as an integer, use as.integer():<br> as.integer(summerdays)

## [1] 98

Calculating ages, as MaryJo did in her Excel sheet, is even easier with age_calc(), because if no second date is given, the function defaults to the current system date. So, you don't even have to explicitly state you want today's date to calculate someone's age as of "today," as you need to do in Excel.

Here's how to do it:

dob <- as.Date("2/4/1982", format="%m/%d/%Y")#Create a test date of birth date as a date object<br> age <- age_calc(dob, units='years')# Find today's age as of when I wrote this script and saved the results to an html file<br> #Round off the age to whole years with the floor() function<br> wholeyears <- floor(age) # Now let's try this with an entire column of birth dates from MaryJo's spreadsheet. <br><br> # Read in data from the ExcelTricks2014 Dates worksheet using the xlsx package:<br> library(xlsx)

## Loading required package: rJava<br> ## Loading required package: xlsxjars

testdates <- read.xlsx("ExcelTricks2014.xlsx", sheetName="Dates") #What does the structure of that testdates object look like?<br> str(testdates)

## 'data.frame': 58 obs. of 9 variables:<br> ## $ Player.: Factor w/ 58 levels "Adrian Awasom",..: 47 33 14 2 40 49 30 55 23 36 ...<br> ## $ Pos. : Factor w/ 11 levels "Center ","Defensive Back ",..: 7 7 7 8 8 8 8 8 11 11 ...<br> ## $ Status.: Factor w/ 2 levels "Active ","Out ": 1 1 1 1 1 1 1 1 1 1 ...<br> ## $ Ht. : Factor w/ 12 levels "5'10' ","5'11' ",..: 6 8 6 5 1 7 8 5 5 4 ...<br> ## $ Wt. : num 215 220 229 217 191 240 258 237 190 204 ...<br> ## $ DOB. : Date, format: "1985-07-02" "1986-11-14" ...<br> ## $ DATEDIF: logi NA NA NA NA NA NA ...<br> ## $ YEAR : logi NA NA NA NA NA NA ...<br> ## $ WEEKDAY: logi NA NA NA NA NA NA ...

#Excellent, the DOB column was already read in as date objects!<br> #We want MaryJo's DATEDIF column to have the ages:<br> testdates$DATEDIF <- round(age_calc(testdates$DOB., units='years'))<br> <br>#While we're at it, let's add year and weekday columns:<br> testdates$YEAR <- year(testdates$DOB.)<br> testdates$WEEKDAY <- wday(testdates$DOB., label=TRUE)<br> <br>#We can add week numbers per MaryJo's discussion of seeking patterns in the data:<br> testdates$WEEKNUMs <- strftime(testdates$DOB., format="%W")<br> table(testdates$WEEKDAY, testdates$WEEKNUMs)

## <br> ## 00 02 03 04 05 06 07 08 11 12 13 14 15 17 18 19 21 23 24 25 26 27<br> ## Sun 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0<br> ## Mon 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br> ## Tues 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0<br> ## Wed 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0<br> ## Thurs 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0<br> ## Fri 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1<br> ## Sat 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0<br> ## <br> ## 28 29 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 50 51 52<br> ## Sun 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0<br> ## Mon 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0<br> ## Tues 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0<br> ## Wed 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0<br> ## Thurs 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 2 1<br> ## Fri 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0<br> ## Sat 0 1 0 0 2 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0

#or just a frequency table for days of the week:<br> table(testdates$WEEKDAY)

## <br> ## Sun Mon Tues Wed Thurs Fri Sat <br> ## 8 3 13 7 11 6 10

#And let's save that to a new spreadsheet:<br> write.xlsx(testdates, "ExcelToR.xlsx", sheetName = "Dates")

A final note about dates: The lubridate mdy() function creates an object of the class POSIXct. If you are familiar with Unix (or some other programming languages that handle POSIX dates), you can probably already guess what that means: POSIXct stores the date as the number of seconds since January 1, 1970. If you try to print this object, it will show in R as a human-readable date such as "2004-04-03 UTC."

But don't be fooled: It's not actually an R Date object. So, not all R date arithmetic functions that require an object of R class "Date" will work, because they're trying to use the wrong type of object.

lubridate's mdy() and ymd() functions parse most date-like character strings into POSIXct objects, not R Date objects. You can turn POSIXct objects into R Date objects with as.Date():

mydate <- mdy("2/28/14") mydateAsDate <- as.Date(mydate)

Does it annoy you that R takes two steps (or one more complex single step) to do something that Excel does in a single function? Well, that's the beauty of using a scripting language: If you don't want to repeat multiple lines of code, you can write your own function to simplify it.

Here's one way to create a function called myDateFunc() that combines the lubridate mdy() and base R's as.Date() into a simpler single line of code:

myDateFunc <- function(dateliketext){<br> #Reminder: This requires text to be in some month-date-year format<br> require(lubridate)#Load the lubridate package<br> thedate <- mdy(dateliketext)#Create a POSIXct object from the date-like string<br> thedate <- as.Date(thedate)#Turn the POSIXct object into a Date object<br> }

Voila! Now if we want to create a date object from "February 28, 2014," we run just a single line of code using the new function:

mynewdate <- myDateFunc("February 28, 2014")<br> #See what that mynewdate object looks like:<br> print(mynewdate)

## [1] "2014-02-28"

#Check the class of mynewdate:<br> class(mynewdate)

## [1] "Date"

You can put that function in a separate file -- mynewdate.R, for example -- and then add the code:

source("mynewdate.R")

to your script file. That tells the script file to run all the code in the mynewdate.R file. (This assumes that the script file is in your working directory. If not, just include the full path to the file such as "C:/Rscripts/mynewdate.R.")

Text functions: Search and substring extraction

R's substr() function performs the same task as Excel's LEFT and MID, using the syntax:

substr(thestring, start, stop)

where start and stop are integers. So, substring("Computerworld", 1, 8) would return "Computer": It slices the string starting at position 1 and stopping at position 8.

Searching is much more robust in R than in Excel, in part because R can use powerful regular expressions. In addition, there are many ways to handle and process strings in R.

One demonstrated Excel task was to extract a two-letter state abbreviation from a city and state when there's no comma separating them -- but you know the state is always the last two letters of the character string -- using LEFT and MID.

We can use the same technique to find the last two letters of "New York NY" by finding the length of the string with nchar() and two characters before the end of the string with nchar() - 2, like so:

mytext <- "New York NY"<br> substr(mytext, nchar(mytext) - 2, nchar(mytext))

## [1] " NY"

#Get the rest of the string before the space and two-letter state abbreviation:<br> mytext <- "New York NY"<br> substr(mytext, 1, nchar(mytext)-3)

## [1] "New York"

As you probably guessed, nchar() returns the number of characters in a character string, including number of spaces.

If you are familiar with regular expressions, you can also search for a more complex pattern than "last two characters," such as "all characters except a space and the last two letters." Base R handles regular expressions, but I find the >stringr package to be more convenient for some text operations, including matching regular expressions with str_match():

1 2 3 Page 1
Page 1 of 3
It’s time to break the ChatGPT habit
Shop Tech Products at Amazon