4 data wrangling tasks in R for advanced beginners
If you don't want to keep typing the name of the data frame followed by the dollar sign for each of the column names, R's with() function takes the name of a data frame as the first argument and then lets you leave it off in subsequent arguments in one command:
companiesOrdered <- companiesData[with(companiesData, order(fy, -margin)),]
While this does save typing, it can make your code somewhat less readable, especially for less experienced R users.
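To see that the with() shortcut gives the same result as the longer dollar-sign form, here is a minimal sketch. It assumes a small stand-in for companiesData with made-up figures and placeholder company names; only the column names fy and margin come from the command above.

```r
# Hypothetical stand-in for companiesData; all figures are invented for illustration.
companiesData <- data.frame(
  fy = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4),
  stringsAsFactors = FALSE
)

# Long form: repeat the data frame name for every column.
orderedLong <- companiesData[order(companiesData$fy, -companiesData$margin), ]

# Short form: with() evaluates order() inside the data frame,
# so the column names can be used on their own.
orderedShort <- companiesData[with(companiesData, order(fy, -margin)), ]

# Both approaches return the same rows in the same order.
print(identical(orderedLong, orderedShort))
print(orderedShort)
```

Note that the minus sign trick only works for numeric columns such as margin; it would not reverse the order of a character column like company.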
Packages offer some more elegant sorting options. The doBy package features orderBy() using the syntax
orderBy(~columnName + secondColumnName, data=dataFrameName)
The ~ at the beginning just means "by" (as in "order by this"). To sort in descending order, put a minus sign after the tilde and before the column name. This orders the data frame by margin, highest first:
companiesOrdered <- orderBy(~-margin, companiesData)
Both plyr and dplyr have an arrange() function with the syntax
arrange(dataFrameName, columnName, secondColumnName)
To sort descending, use desc(columnName):
companiesOrdered <- arrange(companiesData, desc(margin))
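arrange() also accepts several sort columns at once, which makes it possible to reproduce the earlier sort by fiscal year and then by descending margin. The sketch below assumes the dplyr package is installed and uses a made-up stand-in for companiesData; the figures and company names are placeholders.

```r
library(dplyr)

# Hypothetical stand-in for companiesData; all figures are invented for illustration.
companiesData <- data.frame(
  fy = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4),
  stringsAsFactors = FALSE
)

# Sort by fiscal year ascending, then by margin descending within each year.
companiesOrdered <- arrange(companiesData, fy, desc(margin))
print(companiesOrdered)
```

Unlike the minus sign, desc() also works on character columns, so desc(company) would sort company names in reverse alphabetical order.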
Reshaping: Wide to long (and back)
Different analysis tools in R -- including some graphing packages -- require data in specific formats. One of the most common -- and important -- tasks in R data manipulation is switching between "wide" and "long" formats in order to use a desired analysis or graphics function. For example, it is usually easier to visualize data using the popular ggplot2 graphing package if it's in long format. Wide means that you've got multiple measurement columns across each row, like we've got here:
Each row includes a column for revenue, for profit and, after some calculations above, profit margin.
Long means that there's only one measurement per row, but likely multiple categories, as you see below:
Please trust me on this (I discovered it the hard way): Once you thoroughly understand the concept of wide to long, actually doing it in R becomes much easier.
If you find it confusing to figure out what's a category and what's a measurement, here's some advice: Don't pay too much attention to definitions that say long data frames should contain only one "value" in each row. Why? For people with experience programming in other languages, pretty much everything seems like a "value." If the year equals 2011 and the company equals Google, isn't 2011 your value for year and Google your value for company?
For data reshaping, though, the term "value" is being used a bit differently.
I like to think of a "long" data frame as having only one "measurement that would make sense to plot on its own" per row. In the case of these financial results, would it make sense to plot that the year changed from 2010 to 2011 to 2012? No, because the year is a category I set up in advance to decide what measurements I want to look at.
Even if I'd broken down the financial results by quarter -- and quarters 1, 2, 3 and 4 certainly look like numbers and thus "values" -- it wouldn't make sense to plot the quarter changing from 1 to 2 to 3 to 4 and back again as a "value" on its own. Quarter is a category -- a factor in R -- that you might want to group data by. However, it's not a measurement you would want to plot by itself.
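The difference between the two shapes can be sketched with a tiny example. The data frame below is hypothetical (invented figures, a placeholder company name, and a made-up column name financialCategory), and the stacking is done by hand in base R purely to show what moves where; it is not necessarily how you would reshape real data.

```r
# Hypothetical wide data frame: one row per year and company,
# with two measurement columns (revenue and profit). Figures are invented.
companiesWide <- data.frame(
  fy = c(2010, 2011),
  company = c("A", "A"),
  revenue = c(100, 120),
  profit = c(20, 30),
  stringsAsFactors = FALSE
)

# Long version built by hand: the categories (fy, company) are repeated,
# a new category column records which measurement each row holds,
# and a single value column holds the one measurement per row.
companiesLong <- data.frame(
  fy = rep(companiesWide$fy, times = 2),
  company = rep(companiesWide$company, times = 2),
  financialCategory = rep(c("revenue", "profit"), each = nrow(companiesWide)),
  value = c(companiesWide$revenue, companiesWide$profit),
  stringsAsFactors = FALSE
)

print(companiesWide)
print(companiesLong)
```

In the long version, fy, company and financialCategory are all categories you might group by, while value is the only column that would make sense to plot on its own.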