Data manipulation tricks: Even better in R

Anything Excel can do, R can do -- at least as well.


## Source: local data frame [22 x 5]
## 
##                      TEAM     sums count  average   median
##  1          Brooklyn Nets 67699917     4 16924979 16899732
##  2        New York Knicks 57169384     3 19056461 21388953
##  3             Miami Heat 56808000     3 18936000 19067500
##  4     Los Angeles Lakers 49739655     2 24869828 24869828
##  5  Oklahoma City Thunder 44876533     3 14958844 14693906
##  6  Golden State Warriors 40746632     3 13582211 13878000
##  7   Los Angeles Clippers 35109931     2 17554966 17554966
##  8        Houston Rockets 34214428     2 17107214 17107214
##  9      Memphis Grizzlies 33098856     2 16549428 16549428
## 10          Chicago Bulls 32932688     2 16466344 16466344
## 11 Minnesota Timberwolves 26793906     2 13396953 13396953
## 12      Charlotte Bobcats 26700000     2 13350000 13350000
## 13       Dallas Mavericks 22721381     1 22721381 22721381
## 14        Toronto Raptors 17888932     1 17888932 17888932
## 15 Portland Trail Blazers 14878000     1 14878000 14878000
## 16           Phoenix Suns 14487500     1 14487500 14487500
## 17         Indiana Pacers 14283844     1 14283844 14283844
## 18   New Orleans Pelicans 14283844     1 14283844 14283844
## 19    Cleveland Cavaliers 14275000     1 14275000 14275000
## 20        Detroit Pistons 13500000     1 13500000 13500000
## 21     Washington Wizards 13000000     1 13000000 13000000
## 22      San Antonio Spurs 12500000     1 12500000 12500000

# If we want just the Dallas Mavericks and Miami Heat data, pick whichever of these syntax options you like best:
mydata <- subset(summaries_by_team, TEAM=="Dallas Mavericks" | TEAM=="Miami Heat")
# Or:
mydata <- summaries_by_team[summaries_by_team$TEAM=="Dallas Mavericks" |
                            summaries_by_team$TEAM=="Miami Heat",]
# Or dplyr's filter():
mydata <- filter(summaries_by_team, TEAM=="Dallas Mavericks" | TEAM=="Miami Heat")
mydata

## Source: local data frame [2 x 5]
## 
##               TEAM     sums count  average   median
## 1 Dallas Mavericks 22721381     1 22721381 22721381
## 2       Miami Heat 56808000     3 18936000 19067500

# Likewise, you can easily count how many teams have at least 3 players in this list:

subset(summaries_by_team, count >= 3)

## Source: local data frame [5 x 5]
## 
##                     TEAM     sums count  average   median
##  1         Brooklyn Nets 67699917     4 16924979 16899732
##  7 Golden State Warriors 40746632     3 13582211 13878000
## 13            Miami Heat 56808000     3 18936000 19067500
## 16       New York Knicks 57169384     3 19056461 21388953
## 17 Oklahoma City Thunder 44876533     3 14958844 14693906

R also has round() and rank() functions.

rank() gives the numerical rank of the values in whatever column you want, while a Stack Exchange thread suggested this easy function for percentile rank:

perc.rank <- function(x) trunc(rank(x))/length(x)
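
For example, here's a minimal sketch applying rank(), perc.rank() and round() to the summaries_by_team data from above (the new column names are just for illustration):

# Rank each team's total salary, with the largest total ranked 1:
summaries_by_team$sums_rank <- rank(-summaries_by_team$sums)

# Percentile rank of each team's total, as a value between 0 and 1:
summaries_by_team$sums_pctile <- perc.rank(summaries_by_team$sums)

# Average salary expressed in millions, rounded to one decimal place:
summaries_by_team$avg_millions <- round(summaries_by_team$average / 1000000, 1)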

Lookup tables

I confess: I have indeed used combinations of VLOOKUP, INDEX and MATCH in Excel to look up the value of a key on one worksheet and insert a related value in another. However, in general I'm not a fan of trying to use Excel as a relational database unless there's a good reason for keeping my data in Excel (such as when I'm sharing a spreadsheet with colleagues who don't use MySQL or R).

With several different robust lookup options, R is a much better tool than Excel for using lookup tables. One choice: You can run SQL commands on a data frame with the sqldf package, much like running SQL queries on a relational database.
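
As a quick illustration of the sqldf approach, here's a minimal sketch using two small made-up data frames (the table and column names here are hypothetical, not from the sample spreadsheet):

library(sqldf)

emp <- data.frame(fips = c("001", "003", "005"), employ = c(1200, 875, 430), stringsAsFactors = FALSE)
countynames <- data.frame(fips = c("001", "003"), county = c("Adams", "Brown"), stringsAsFactors = FALSE)

# A SQL left join run directly on the two data frames:
sqldf("SELECT emp.*, countynames.county
       FROM emp
       LEFT JOIN countynames ON emp.fips = countynames.fips")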

Another option: The data.table package, which comes highly recommended for its speed with large data sets, creates index keys for data frames and offers many join options.
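
Here's a brief sketch of what a keyed data.table join can look like, again with hypothetical data; setkey() indexes each table by the join column, and indexing one table by the other then performs the join:

library(data.table)

emp <- data.table(fips = c("001", "003", "005"), employ = c(1200, 875, 430))
countynames <- data.table(fips = c("001", "003", "005"), county = c("Adams", "Brown", "Clark"))

setkey(emp, fips)
setkey(countynames, fips)

# With keys set, this looks up each row of countynames in emp by fips:
emp[countynames]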

Finally, there are several R functions that offer SQL-like joins, such as dplyr's inner join and left join options and base R's merge() function. (Some options require the common column in each table to have the same name.) You can read more about all these options in this Stack Overflow thread.
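
For comparison, base R's merge() can do the same sort of left join; the all.x = TRUE argument tells merge() to keep every row of the first (left) data frame. A minimal sketch with hypothetical data:

emp <- data.frame(fips = c("001", "003", "005"), employ = c(1200, 875, 430))
countynames <- data.frame(fips = c("001", "003"), county = c("Adams", "Brown"))

# Keep all rows of emp; rows with no match in countynames get NA for county:
merge(emp, countynames, by = "fips", all.x = TRUE)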

In the Excel Tricks example, there is a Lookups table with a fipscty column that holds a numerical code for each county. MaryJo wants to add the county name to this worksheet (a separate table, Lookup2, has a list of all the codes and county names).

I'll use dplyr's left_join() to accomplish this task (several other techniques would work well too). Why a "left join"? That's a SQL database term for joining two tables by one or more common columns, keeping all the rows in the left table (here, "left" means the first table mentioned in the join statement) and adding whatever matching values exist in the right table.

Here's the code:

# Read in data from the spreadsheet:
Lookups <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookups", startRow = 2, stringsAsFactors = FALSE)
Lookup2 <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookup2", startRow = 2, stringsAsFactors = FALSE)

# I am going to rename the first column in the lookup table to match the name in the Lookups table:
names(Lookup2)[1] <- "fipscty"

# One line of code adds the county name from Lookup2 to the Lookups table:
Lookups <- left_join(Lookups, Lookup2, by="fipscty")

# Check our results:
head(Lookups)

##   fipstate fipscty Tot.Employ An.Payroll Num.Estab. n1_4 n5_9 n10_19
## 1       27     085      16223     522051        987  495  231    126
## 2       27     135       7721     215072        414  240   84     43
## 3       27     129       5517     133031        556  337   99     76
## 4       27     127       4742     113020        572  348  115     60
## 5       27     125        835      18330        109   74   17      8
## 6       27     143       3050      71132        399  258   76     36
##   n20_49 n50_99 n100_249 n250_499 n500_999 n1000 COUNTY.NAME   County
## 1     86     30       13        4        0     2          NA   McLeod
## 2     34      9        1        1        0     2          NA   Roseau
## 3     29      8        5        1        1     0          NA Renville
## 4     32     10        7        0        0     0          NA  Redwood
## 5      7      2        1        0        0     0          NA Red Lake
## 6     20      5        3        1        0     0          NA   Sibley

# Get rid of the blank COUNTY.NAME column:
Lookups$COUNTY.NAME <- NULL

Note that left_join() works regardless of where the matching columns are located within each table, unlike Excel's VLOOKUP, which requires the lookup column to be the leftmost column of the range you're searching.

Copying down a date versus date sequence

In Excel, if you click and drag a date down a column, Excel assumes you want to increment the date by 1 each row, so you need a special technique to copy the same date down a column. In R, if you want every row in a column to hold the same date, you just assign it, such as:

df$mycolumn <- as.Date("2014-03-21")

With the code above, every row in the df dataframe will have the date value 2014-03-21 in the mycolumn column.

But what if you want the default Excel behavior in R: incrementing the date by one day in each row? Use the seq() function. For instance, to get a sequence of 15 dates incremented by one day:

seq(as.Date("2014-03-21"), by="day", length.out=15 )

## [1] "2014-03-21" "2014-03-22" "2014-03-23" "2014-03-24" "2014-03-25" ## [6] "2014-03-26" "2014-03-27" "2014-03-28" "2014-03-29" "2014-03-30" ## [11] "2014-03-31" "2014-04-01" "2014-04-02" "2014-04-03" "2014-04-04"

If you want to do this for a data frame column, you have to tell seq() how many items your column needs. You can do this by changing the hard-coded number 15 from the example above to the number of rows in your data frame using the nrow() function:

df$mycolumn <- seq(as.Date("2014-03-21"), by="day", length.out=nrow(df) )

You can create date sequences by week, month, quarters and years as well.
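
For example, changing the by argument gives you other intervals:

# Six weekly dates starting 2014-03-21:
seq(as.Date("2014-03-21"), by="week", length.out=6)

# Six monthly dates:
seq(as.Date("2014-03-21"), by="month", length.out=6)

# Quarterly and yearly sequences work the same way:
seq(as.Date("2014-03-21"), by="quarter", length.out=4)
seq(as.Date("2014-03-21"), by="year", length.out=4)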

Using column names

In Excel, you need to explicitly create names in a spreadsheet in order to use column names in formulas. In R, you can use either the column name or its numerical index position.
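
For example, with the Lookups data frame from the lookup-table example above, where fipscty is the second column, all of these refer to the same data:

Lookups$fipscty        # by column name with $
Lookups[["fipscty"]]   # by column name with double brackets
Lookups[, "fipscty"]   # by column name inside single brackets
Lookups[, 2]           # by numerical index position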

Reshaping data

The sample spreadsheet features an example of Affordable Health Care premium data where each plan's row had age group data across multiple columns. The desired format was to have one plan price per age group per row, not many age groups in the same row. This means the data needs to be reshaped. In R lingo, we want to reshape the data frame from wide to long.

Webster demonstrated a very useful free add-in for Excel from Tableau to perform this kind of reshaping. To use the Tableau reshaping add-in for Excel, all the columns you want to be moved down from being column headers must be on the right side of your spreadsheet, and all the columns you want to keep as column headers must be on the left. In addition, you need to manually open the sheet and click on the correct cell -- fine if you're working on a one-time project, but less ideal if this is data you process frequently (or if you want others to be able to easily reproduce and check your work).

With an R script, the columns can be in any order, and a script that's written once can be run from a batch file.
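
For example, if you saved your reshaping code in a file called reshape_premiums.R (a hypothetical file name), you could run it from the command line or a batch file with R's Rscript program:

Rscript reshape_premiums.R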

Please see my detailed explanation of Reshaping: Wide to long (and back) in R for a full run-through of this type of reshaping. But in brief, you want to use the reshape2 package and tell it which column headers you want to move down so they're no longer separate columns.

In other words, if a dataframe had column headers for "young," "middle age" and "old" with a price for each but you wanted only one price per row, you'd want to move those three column headers into one new variable column, perhaps called something like "age group."

To go from wide to long you use reshape2's melt() function and tell melt either which columns you want to move into a new variable column or which columns you want to stay as ID variables and not move. In this sample data, there are far fewer ID variables that don't need to move than there are column variables that do need to move, so I'll specify the id variables.

In addition, we have the option of specifying what we want the new variable and value columns to be called, which I'll do below:

library(reshape2)
widedata <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Reshaper", header = TRUE)

# id.vars are the columns to keep as column headers. There's then no need to identify all the
# age group columns that are moving from being column headers to being part of a new agegroup
# column; premium is the value column.
reshaped <- melt(widedata, id.vars = c("Company", "PlanName", "Metal", "RatingArea", "RateAreatxt"),
                 variable.name="agegroup", value.name="premium")

# Check results:
head(reshaped)

##      Company           PlanName  Metal RatingArea RateAreatxt agegroup
## 1 All Savers Lowest cost Silver Silver          3   Colorado3   X0.20.
## 2 All Savers Lowest cost Silver Silver          7   Colorado7   X0.20.
## 3 All Savers Lowest cost Silver Silver          8   Colorado8   X0.20.
## 4 All Savers Lowest cost Silver Silver          9   Colorado9   X0.20.
## 5 All Savers Lowest cost Silver Silver         10  Colorado10   X0.20.
## 6      Cigna Lowest cost Silver Silver          3   Colorado3   X0.20.
##   premium
## 1   189.1
## 2   186.5
## 3   196.8
## 4   245.8
## 5   242.7
## 6   158.1

# There's an X in front of all the age groups because R column names can't start with a number,
# and those columns in the spreadsheet all started with numbers. In addition, the - in 0-20 was
# turned into a period because - is not a legal character for an R data frame column name. If
# that bothers us, we can use some search-and-replace strategies we learned above to remove the
# X from the age groups and return the - to 0-20:

library(stringr)   # str_replace_all() comes from the stringr package
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "X", "")
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "0.20.", "0-20")

# We can see how many unique values of reshaped$agegroup there are with unique():
unique(reshaped$agegroup)

## [1] "0-20" "21" "22" "23" <br> ## [5] "24" "25" "26" "27" <br> ## [9] "28" "29" "30" "31" <br> ## [13] "32" "33" "34" "35" <br> ## [17] "36" "37" "38" "39" <br> ## [21] "40" "41" "42" "43" <br> ## [25] "44" "45" "46" "47" <br> ## [29] "48" "49" "50" "51" <br> ## [33] "52" "53" "54" "55" <br> ## [37] "56" "57" "58" "59" <br> ## [41] "60" "61" "62" "63" <br> ## [45] "64.and.other."

#I'll change "64.and.other." to "64+" -- like so:<br> reshaped$agegroup <- str_replace_all(reshaped$agegroup, "64.and.other.", "64+")<br> <br> #Check unique values again:<br> unique(reshaped$agegroup)

## [1] "0-20" "21" "22" "23" "24" "25" "26" "27" "28" "29" <br> ## [11] "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" <br> ## [21] "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" <br> ## [31] "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" <br> ## [41] "60" "61" "62" "63" "64+"

To more easily follow these tips and code, we have a companion PDF that you can download here. (Free registration is required.)
