4 data wrangling tasks in R for advanced beginners

Learn how to add columns, get summaries, sort your results and reshape your data.

1 2 3 4 5 6 7 8 Page 8
Page 8 of 8

It's actually fairly simple after you understand the basic concept. Here, the code assumes that all the other columns except fy and company are measurements -- items you might want to plot.

You can be lengthier in your code if you prefer, especially if you think that will help you remember what you did down the road. The statement below lists all the column in the data frame, assigning them to either id.vars or measure.vars, and also changes the new column names from the default "variable" and "value":

I find it a bit confusing that reshape2 calls category variables "id.vars" (short for ID variables) and not categories or factors, but after a while you'll likely get used to that. Measurement variables in reshape2 are somewhat more intuitively called measure.vars.

companiesLong <- melt(companiesData, id.vars=c("fy", "company"), measure.vars=c("revenue", "profit", "margin"), variable.name="financialCategory", value.name="amount")

This produces:

  fy company financialCategory amount
1 2010 Apple revenue 65225.0
2 2011 Apple revenue 108249.0
3 2012 Apple revenue 156508.0
4 2010 Google revenue 29321.0
5 2011 Google revenue 37905.0
6 2012 Google revenue 50175.0
7 2010 Microsoft revenue 62484.0
8 2011 Microsoft revenue 69943.0
9 2012 Microsoft revenue 73723.0
10 2010 Apple profit 14013.0
11 2011 Apple profit 25922.0
12 2012 Apple profit 41733.0
13 2010 Google profit 8505.0
14 2011 Google profit 9737.0
15 2012 Google profit 10737.0
16 2010 Microsoft profit 18760.0
17 2011 Microsoft profit 23150.0
18 2012 Microsoft profit 16978.0
19 2010 Apple margin 21.5
20 2011 Apple margin 23.9
21 2012 Apple margin 26.7
22 2010 Google margin 29.0
23 2011 Google margin 25.7
24 2012 Google margin 21.4
25 2010 Microsoft margin 30.0
26 2011 Microsoft margin 33.1
27 2012 Microsoft margin 23.0

Reshaping: Long to wide

Once your data frame is "melted," it can be "cast" into any shape you want. reshape2's dcast() function takes a "long" data frame as input and allows you to create a reshaped data frame in return. (The somewhat similar acast() function can return an array, vector or matrix.) One of the best explanations I've seen on going from long to wide with dcast() is from the R Graphics Cookbook by Winston Chang:

"[S]pecify the ID variables (those that remain in columns) and the variable variables (those that get 'moved to the top'). This is done with a formula where the ID variables are before the tilde (~) and the variable variables are after it."

In other words, think briefly about the structure you want to create. The variables you want repeating in each row are your "ID variables." Those that should become column headers are your "variable variables."

Look at this row from the original, "wide" version of our table:

fy company revenue profit margin
2010 Apple 65225 14013 21.5

Everything following fiscal year and company is a measurement relating to that specific year and company. That's why fy and company are the ID variables; while revenue, profit and margin are the "variable variables" that have been "moved to the top" as column headers.

How to re-create a wide data frame from the long version of the data? Here's code, if you've got two columns with ID variables and one column with variable variables:

wideDataFrame <- dcast(longDataFrame, idVariableColumn1 + idVariableColumn2 ~ variableColumn, value.var="Name of column with the measurement values")

dcast() takes the name of a long data frame as the first argument. You need to create a formula of sorts as the second argument with the syntax:

id variables ~ variable variables

The id and measurement variables are separated by a tilde, and if there are more than one on either side of the tilde they are listed with a plus sign between them.

The third argument for dcast() assigns the name of the column that holds your measurement values to value.var.

So, to produce the original, wide data frame from companiesLong using dcast():

companiesWide <- dcast(companiesLong, fy + company ~ financialCategory, value.var="amount")

To break that down piece by piece: companiesLong is the name of my long data frame; fy and company are the columns I want to remain as items in each row of my new, wide data frame; I want to create a new column for each of the different categories in the financialCategory column -- move them up to the top to become column headers, as Chang said; and I want the actual measurements for each of those financial categories to come from the amount column.

Update: Hadley Wickham created the tidyr package to perform a subset of reshape2's capabilities with two main functions: gather() to take multiple values and turn them into key-value pairs and spread() to go from long to wide. I still use reshape2 for these tasks, but you may find tidyr better fits your needs.

Wrap-up

Hopefully instructions for these data-wrangling tasks have helped you solve a particular problem and/or continue on the journey of mastering R for data work. To learn more about R, see Computerworld's 60+ R resources to improve your data skills.

Find this article useful? You can download it as a free PDF or as part of our Advanced Beginner's Guide to R.

1 2 3 4 5 6 7 8 Page 8
Page 8 of 8
First look: Office 2019’s likeliest new features
  
Shop Tech Products at Amazon