
4 data wrangling tasks in R for advanced beginners

November 5, 2013 06:30 AM ET

Syntax 4. mapply() and the simpler sapply() can also apply a function to some -- but not necessarily all -- columns in a data frame, without your having to worry about numbering each item like x[1] and x[2] above. The mapply() format for creating a new column in a data frame is:

dataFrame$newColumn <- mapply(someFunction, dataFrame$column1, dataFrame$column2, dataFrame$column3)

The code above would apply the function someFunction() to the data in column1, column2 and column3 of each row of the data frame.

Note that the first argument of mapply() here is the name of a function, not an equation or formula. So if we want (profit/revenue) * 100 as our result, we could first write our own function to do this calculation and then use it with mapply().

Here's how to create a named function, profitMargin(), that takes two arguments -- in this case we're calling them netIncome and revenue just within the function -- and returns the first divided by the second, times 100, rounded to one decimal place:

profitMargin <- function(netIncome, revenue) {
    mar <- (netIncome/revenue) * 100
    mar <- round(mar, 1)
    return(mar)
}
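As a quick sanity check, you can call the function directly; the input values below are made up for illustration:

```r
# profitMargin() as defined above, repeated so this snippet runs on its own
profitMargin <- function(netIncome, revenue) {
    mar <- (netIncome/revenue) * 100
    mar <- round(mar, 1)
    return(mar)
}

profitMargin(2, 15)   # returns 13.3
```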

Now we can use that user-created named function with mapply():

companiesData$margin <- mapply(profitMargin, companiesData$profit, companiesData$revenue)

Or we could create an anonymous function within mapply():

companiesData$margin <- mapply(function(x, y) round((x/y) * 100, 1), companiesData$profit, companiesData$revenue)

One advantage mapply() has over transform() is that you can use columns from different data frames (though note that if the columns have different lengths, mapply() recycles the shorter ones to match the longest, which may not be what you want). Another is that it has an elegant syntax for applying a function that takes more than one argument to several vectors of data, such as:

mapply(someFunction, vector1, vector2, vector3)
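For example, here's that pattern on three made-up vectors, using an anonymous function in place of someFunction:

```r
# Made-up example: sum three vectors, element by element
v1 <- c(1, 2, 3)
v2 <- c(10, 20, 30)
v3 <- c(100, 200, 300)
mapply(function(x, y, z) x + y + z, v1, v2, v3)
# returns 111 222 333
```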

sapply() has a somewhat different syntax from mapply(), and there are yet more functions in R's apply family. I won't go into them further here, but this may give you a sense of why R maestro Hadley Wickham created his own package called plyr, with functions all having the same syntax, in order to try to rationalize applying functions in R. (We'll get to plyr in the next section.)
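To illustrate the difference: sapply() loops a one-argument function over each element of a single object, such as each column of a data frame (the data below is made up):

```r
# sapply() applies one function to each column of a data frame
df <- data.frame(profit = c(1, 2, 3), revenue = c(10, 20, 30))
sapply(df, max)
#  profit revenue
#       3      30
```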

For a more detailed look at base R's various apply options, A brief introduction to 'apply' in R by bioinformatician Neil Saunders is a useful starting point.

Update: dplyr

Hadley Wickham's dplyr package, released in early 2014 to rationalize and dramatically speed up operations on data frames, is another option worth learning. To add a column to an existing data frame with dplyr, first install the package with install.packages("dplyr") -- you only need to do this once -- and then load it with library("dplyr"). To add a column using dplyr:

companiesData <- mutate(companiesData, margin = round((profit/revenue) * 100, 1))
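Here's that call end to end on a small made-up data frame (this assumes dplyr is installed):

```r
library(dplyr)

# Made-up sample data
companiesData <- data.frame(profit = c(2, 6), revenue = c(10, 40))
companiesData <- mutate(companiesData, margin = round((profit/revenue) * 100, 1))
companiesData$margin
# returns 20 15
```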

Getting summaries by subgroups of your data

It's easy to find, say, the highest profit margin in our data with max(companiesData$margin). To assign the value of the highest profit margin to a variable named highestMargin, this simple code does the trick.

highestMargin <- max(companiesData$margin)

That just returns:

[1] 33.09838

but you don't know anything more about the other variables in the row, such as year and company.

To see the entire row with the highest profit margin, not only the value, this is one option:

highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]

and here's another:

highestMargin <- subset(companiesData, margin==max(margin))
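A third base R option, shown here on made-up data, is which.max(), which returns the position of the first maximum. Note that unlike the two versions above, it returns only one row even if several rows tie for the highest margin:

```r
# Made-up sample data
companiesData <- data.frame(
    company = c("Alpha", "Beta", "Gamma"),
    margin  = c(21.2, 33.1, 25.7),
    stringsAsFactors = FALSE
)

# which.max() gives the row position of the (first) largest margin
highestMargin <- companiesData[which.max(companiesData$margin), ]
highestMargin$company   # returns "Beta"
```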

(For an explanation on these two techniques for extracting subsets of your data, see Get slices or subsets of your data from the Computerworld Beginner's Guide to R.)

But what if you want to find rows with the highest profit margin for each company? That involves applying a function by groups -- what R calls factors.

The plyr package created by Hadley Wickham considers this type of task "split-apply-combine": Split up your data set by one or more factors, apply some function, then combine the results back into a data set.

plyr's ddply() function performs a "split-apply-combine" on a data frame and then produces a new separate data frame with your results. That's what the first two letters, dd, stand for in ddply(), by the way: Input a data frame and get a data frame back. There's a whole group of "ply" functions in the plyr package: alply to input an array and get back a list, ldply to input a list and get back a data frame, and so on.

To use ddply(), first you need to install the plyr package if you never have, with:

install.packages("plyr")

Then, if you haven't yet for your current R session, load the plyr package with:

library("plyr")

The format for splitting a data frame by multiple factors and applying a function with ddply would be:

ddply(mydata, c('column name of a factor to group by', 'column name of the second factor to group by'), summarize OR transform, newcolumn = myfunction(column name(s) I want the function to act upon))

Let's take a more detailed look at that. The first argument to ddply() is the name of the original data frame, and the second is the name of the column or columns you want to split your data by. The third tells ddply() whether to return just the resulting data points (summarize) or the entire data frame with a new column giving the desired data point for each factor in every row (transform). Finally, the fourth argument names the new column and then lists the function you want ddply() to use.

If you don't want to have to put the column names in quotes, an alternate syntax you'll likely see frequently uses a dot before the column names:

myresult <- ddply(mydata, .(column name of factor I'm splitting by, column name second factor I'm splitting by), summarize OR transform, newcolumn = myfunction(column name I want the function to act upon))
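Here's the dot syntax filled in on a small made-up data frame, finding each company's highest margin (this assumes plyr is installed and loaded):

```r
library(plyr)

# Made-up sample data: two years of margins for two companies
companiesData <- data.frame(
    company = c("Alpha", "Alpha", "Beta", "Beta"),
    fy      = c(2011, 2012, 2011, 2012),
    margin  = c(22.0, 24.0, 25.7, 21.4)
)

# summarize returns one row per company; transform would instead return
# all four original rows with bestMargin repeated within each company
ddply(companiesData, .(company), summarize, bestMargin = max(margin))
#   company bestMargin
# 1   Alpha       24.0
# 2    Beta       25.7
```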


