Data manipulation tricks: Even better in R

Anything Excel can do, R can do -- at least as well.

Page 2 of 3

library(stringr)

# This pattern says the first group in parentheses is "everything up until a space and two capital letters." The second group in parentheses is "two capital letters."
mypattern <- "(.*?) ([A-Z]{2})"
parsed <- str_match(mytext, mypattern)

# The first column of the parsed object contains the entire match. The second column is the first group -- that is, the match just within the first parentheses, which in this case is the city. The third column is the match within the second parentheses, in this case the state.
parsed

##      [,1]          [,2]       [,3]
## [1,] "New York NY" "New York" "NY"

To perform this task on the sample spreadsheet, we can read in data from the CityState worksheet and then populate the blank CITY and STATE columns:

cities <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "CityState")
parsed <- str_match(cities$CITY.STATE, mypattern)

# The second column of parsed has all the matches of the first group -- in this case, everything before the space and two capital letters:
cities$CITY <- parsed[,2]

# The third column has the matches of the second group -- the two-letter state:
cities$STATE <- parsed[,3]

#Append this as a sheet to our new ExcelToR.xlsx spreadsheet:

write.xlsx(cities, "ExcelToR.xlsx", sheetName = "CityState", append = TRUE)

Text functions: Search and replace

For Excel's SUBSTITUTE replacing old text with new text, there is base R's:

gsub("pattern to search for", "pattern to replace it with", CharacterString)

# And stringr's:
str_replace_all(CharacterString, "pattern to search for", "pattern to replace it with")

Pick a syntax and structure you like, and off you go.

To create a new column in a dataframe named df that removes "PUBLIC SCHOOL DISTRICT" from a SchoolDistricts column, you could run stringr's str_replace_all() function on the SchoolDistricts column:

df$SchoolDistrictsEdited <- str_replace_all(df$SchoolDistricts, "PUBLIC SCHOOL DISTRICT", "")

Miscellaneous text functions

  • For Excel's EXACT to see if two strings are identical, base R has identical().
  • For Excel's LEN(text) to get the length of a string, base R has nchar() and stringr has str_length().
  • For Excel's REPT(text, number) to repeat a text string a certain number of times, the stringr package has str_dup().
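A quick sketch of those equivalents, using a made-up sample string (assumes the stringr package is installed):

```r
library(stringr)

identical("Texas", "Texas")   # like Excel's EXACT
identical("Texas", "texas")   # FALSE -- the comparison is case-sensitive

nchar("Texas")        # base R string length: 5
str_length("Texas")   # stringr equivalent: 5

str_dup("ab", 3)      # repeat a string, like Excel's REPT: "ababab"
```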

To capitalize the first letter of each word, you can adapt the example in the toupper() function's help file into your own function, here called titleCase:

titleCase <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}
titleCase("hello there, world!")

Using a wildcard search

Because R supports regular expressions, wildcard searching is a bit simpler than Excel's somewhat convoluted

=IF(ISERROR(SEARCH("Texas",B4,1)>0)=FALSE, "X","")

which adds a column marking with an X all rows where one column includes "Texas." In R, you can just use an if-else statement such as:

ifelse(str_detect(mycolumn, "Texas"), "X", "")
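As a self-contained sketch with a made-up vector (mycolumn here is hypothetical sample data, not from the spreadsheet):

```r
library(stringr)

mycolumn <- c("Austin, Texas", "Portland, Oregon", "Dallas, Texas")

# Mark rows containing "Texas" with an X, all others with an empty string:
ifelse(str_detect(mycolumn, "Texas"), "X", "")
# Returns "X" "" "X"
```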

However, there isn't always a need to add a column to do this, since you can easily filter a dataframe by searching for a string within a column. Here's some code to find all rows that include the phrase "WESTERN DISTRICT" from the BasicIF tab:

# Note we need to tell R to start reading on row 5 here because rows 1-4 are not part of the table:
df <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "BasicIF", stringsAsFactors = FALSE, startRow = 5, header = TRUE)

# Now we just want rows where the SUBDEPT includes the phrase "WESTERN DISTRICT":
justWestern <- subset(df, str_detect(SUBDEPT, "WESTERN DISTRICT"))

# Check the first 20 rows & first 5 columns:
head(justWestern[, 1:5], n = 20)

##      LASTNAME FIRSTNAME DEPT                  SUBDEPT YRS.EXP
## 9    ANDERSON      NEIL SPPD WESTERN DISTRICT - SOUTH       3
## 10   ANDERSON     ALLEN SPPD   WESTERN DISTRICT-NORTH       5
## 15   ANDERSON     STEVE SPPD   WESTERN DISTRICT-NORTH      18
## 16   ANDERSON      ERIC SPPD         WESTERN DISTRICT      20
## 17     ARNOLD    THOMAS SPPD WESTERN DISTRICT - SOUTH      14
## 27     BAILEY      SARA SPPD   WESTERN DISTRICT-NORTH      20
## 33    BARABAS   MICHAEL SPPD   WESTERN DISTRICT-NORTH      19
## 40  BAUMHOFER       AMY SPPD WESTERN DISTRICT - SOUTH      15
## 50    BENNETT CONSTANCE SPPD         WESTERN DISTRICT       4
## 51    BENNETT     BRUCE SPPD   WESTERN DISTRICT-NORTH      13
## 55     BITNEY  TERRANCE SPPD         WESTERN DISTRICT       6
## 60    BOERGER    DARRYL SPPD WESTERN DISTRICT - SOUTH       4
## 62       BOHN       TIM SPPD WESTERN DISTRICT - SOUTH      11
## 69      BOYLE   JEFFERY SPPD   WESTERN DISTRICT-NORTH      20
## 76      BRODT      MARY SPPD WESTERN DISTRICT - SOUTH       1
## 80      BROWN   ANTHONY SPPD   WESTERN DISTRICT-NORTH      17
## 96     CARTER   MICHAEL SPPD         WESTERN DISTRICT      16
## 104    CHERRY   LYNETTE SPPD WESTERN DISTRICT - SOUTH      12
## 111 CLEVELAND      KENT SPPD   WESTERN DISTRICT-NORTH       1
## 115    CONROY   MICHAEL SPPD   WESTERN DISTRICT-NORTH      16

If statements

Excel's basic IF statement is similar to R's ifelse(): Both use the format (logical test, result if true, result if false). This code can determine whether a home or visiting team won a game based on points scored by each, using the sample spreadsheet's More BasicIF worksheet:

scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF")
str(scores)

## 'data.frame': 256 obs. of 8 variables:
##  $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
##  $ WeekNum    : num 1 1 1 1 1 1 1 1 1 1 ...
##  $ Visit.Team : Factor w/ 32 levels "ARI","ATL","BAL",..: 22 1 10 14 3 13 15 18 19 26 ...
##  $ Visit.Score: num 13 24 30 9 15 21 23 30 0 14 ...
##  $ Home.Team  : Factor w/ 32 levels "ARI","ATL","BAL",..: 32 11 7 8 25 17 5 12 4 16 ...
##  $ Home.Score : num 16 42 10 6 34 20 24 25 31 27 ...
##  $ Winner     : logi NA NA NA NA NA NA ...
##  $ WinTeam    : logi NA NA NA NA NA NA ...

# The team names are coming in as "factors" and not characters. We'll re-import the data, this time adding stringsAsFactors = FALSE to the function arguments:
scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF", stringsAsFactors = FALSE)
str(scores)

## 'data.frame': 256 obs. of 8 variables:
##  $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
##  $ WeekNum    : num 1 1 1 1 1 1 1 1 1 1 ...
##  $ Visit.Team : chr "NYJ" "ARI" "DEN" "IND" ...
##  $ Visit.Score: num 13 24 30 9 15 21 23 30 0 14 ...
##  $ Home.Team  : chr "WAS" "DET" "CIN" "CLE" ...
##  $ Home.Score : num 16 42 10 6 34 20 24 25 31 27 ...
##  $ Winner     : logi NA NA NA NA NA NA ...
##  $ WinTeam    : logi NA NA NA NA NA NA ...

# That's better.
# Note that if there's a space in a column name, R converts it to a period.

# Use an ifelse statement to find whether the home or visiting team had more points and thus won the game:
scores$Winner <- ifelse(scores$Home.Score > scores$Visit.Score, "Home", "Visitor")

# Find out which team had more points:
scores$WinTeam <- ifelse(scores$Home.Score > scores$Visit.Score, scores$Home.Team, scores$Visit.Team)

# Save to our new spreadsheet:
write.xlsx(scores, "ExcelToR.xlsx", sheetName = "MoreBasicIF", append = TRUE)

As with Excel IF, R ifelse statements can be nested.
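For example, a nested ifelse() could add a third outcome for tied games. A sketch with made-up scores (the home and visit vectors here are hypothetical, not the spreadsheet data):

```r
home  <- c(16, 42, 10, 21)
visit <- c(13, 24, 30, 21)

# The second ifelse() runs only for rows where the first test is FALSE,
# splitting those into "Visitor" wins and ties:
winner <- ifelse(home > visit, "Home",
                 ifelse(visit > home, "Visitor", "Tie"))
winner
# "Home" "Home" "Visitor" "Tie"
```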

Data where column headers are rows within the data

Look at the "Copy Down" tab on the ExcelTricks2014.xlsx spreadsheet, and you'll see the problem: There's a single row with the name of a team, the players on that team, the name of a second team, a list of players on that team and so on. This interspersing of categories and values means that if you do any sorting or aggregating of that column, you'll no longer know which player is on what team. What's needed is a way to add a new column identifying which team each player is on.

I'm sure there's a more elegant "R way" to do this, but here I'll use a simple for loop instead. For loops are discouraged in R in favor of vectorized functions, but those of us with experience in languages where loops are common find them a handy go-to.

# Read player data into R:
players <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Copy Down", stringsAsFactors = FALSE)

# See the structure of the data:
str(players)

## 'data.frame': 451 obs. of 3 variables:
##  $ Name    : chr "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
##  $ Position: chr NA "DB" "DB" "DT" ...
##  $ NA.     : chr NA NA "" "" ...

# Not sure what that NA. column is about, but we can get rid of it by setting it to NULL:
players$NA. <- NULL
str(players)

## 'data.frame': 451 obs. of 2 variables:
##  $ Name    : chr "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
##  $ Position: chr NA "DB" "DB" "DT" ...

# That's better.

# To see if a value is missing in R, use the is.na() function. Here we'll create a new column called Team. If there's no value in the Position column, we'll use the value of the players$Name column. If there is a value in the Position column, we'll use the value of the Team column one row higher.
for(i in 1:length(players$Name)){
    players$Team[i] <- ifelse(is.na(players$Position[i]), players$Name[i], players$Team[i-1])
}

# We can delete rows with the team names by using the handy na.omit(dataframe) function; that will eliminate all rows in a data frame that have at least one missing value:
players <- na.omit(players)

# Here we can add the reformatted data to our new spreadsheet. Don't forget append = TRUE or the spreadsheet will be overwritten.
write.xlsx(players, "ExcelToR.xlsx", sheetName = "CopyDownTeams", append = TRUE)
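For comparison, here's one vectorized alternative to the loop, sketched on a tiny made-up data frame in the same shape (named roster here to keep it separate from the real data):

```r
# Hypothetical sample in the same shape: team-name rows have NA in Position.
roster <- data.frame(
    Name = c("Arizona Cardinals", "Starks, Duane", "Stone, Michael",
             "Atlanta Falcons", "Vick, Michael"),
    Position = c(NA, "DB", "DB", NA, "QB"),
    stringsAsFactors = FALSE)

# cumsum() over the logical "is this a team row?" vector gives each row
# the index of the most recent team row at or above it:
teamrows <- which(is.na(roster$Position))
roster$Team <- roster$Name[teamrows[cumsum(is.na(roster$Position))]]

# As before, na.omit() drops the team-name rows themselves:
roster <- na.omit(roster)
roster$Team
# "Arizona Cardinals" "Arizona Cardinals" "Atlanta Falcons"
```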

Functions by groups: SUMIF and COUNTIF equivalents

This is one of many areas where R shines over Excel -- grouping items for any purpose, not just subtotals or counts.

Assuming all the data is in rows 2 to 424, the team name is in column c and the salaries are in column e, the Excel tip was to use:

=sumif(c2:c424, "Dallas Mavericks", e2:e424)

to get just the Mavericks total and

=sumif(Salaries!c2:Salaries!c424, a3, Salaries!$e2:Salaries!e$424)

to get subtotals by all teams where Team names are in column a of another worksheet.

I prefer something that's not hardcoded with total row numbers and that's more easily reproducible on slightly different data.

In R, there are numerous ways to apply functions to a data set by group. My current favorite is the relatively new dplyr package for R because of its consistent and (to me) fairly human-readable functions.

Since salary data isn't included in the sample spreadsheet, I'm going to load a short table of top 40 salaries from ESPN using the incredibly handy readHTMLTable() function in R's XML package. Note that I'm just scraping the top 40 and not all the salaries to save time. Also note that it is indeed possible to scrape and clean data from the Web using R :-).

# Load in data from table 1 at ESPN with the XML package's readHTMLTable():
library(XML)
url <- 'http://espn.go.com/nba/salaries'
salaries <- readHTMLTable(url, stringsAsFactors = FALSE, which=1, header=TRUE) # which=1 means load the first table on the page
str(salaries)

## 'data.frame': 43 obs. of 4 variables:
##  $ RK    : chr "1" "2" "3" "4" ...
##  $ NAME  : chr "Kobe Bryant, SG" "Dirk Nowitzki, PF" "Amar'e Stoudemire, PF" "Joe Johnson, SG" ...
##  $ TEAM  : chr "Los Angeles Lakers" "Dallas Mavericks" "New York Knicks" "Brooklyn Nets" ...
##  $ SALARY: chr "$30,453,805" "$22,721,381" "$21,679,893" "$21,466,718" ...

# It's necessary to remove dollar signs and commas to turn the SALARY character strings into numbers for R.

# Remove the dollar signs:
salaries$SALARY <- str_replace_all(salaries$SALARY, '\\$', '')

# A handy decomma() function in the eeptools package removes commas and turns the numerical character strings into numbers:
library(eeptools)
salaries$SALARY <- decomma(salaries$SALARY)

## Warning: NAs introduced by coercion

# Rows that don't contain numbers will appear in R as NA; we can remove those rows with na.omit():
salaries <- na.omit(salaries)

# Now that we have the data, it's time to sum and count top salaries by team -- and let's add mean and median for good measure:
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.0.3
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Probably self-explanatory: Create a new variable salaries_grouped_by_team that uses dplyr's group_by() function to group the salaries data by the TEAM column:
salaries_grouped_by_team <- group_by(salaries, TEAM)

# This creates new columns in a new variable, summaries_by_team, with summaries by group:
summaries_by_team <- summarise(salaries_grouped_by_team,
                               sums = sum(SALARY),
                               count = n(),
                               average = mean(SALARY),
                               median = median(SALARY))

# Finally, the arrange() function sorts by sums descending:
arrange(summaries_by_team, desc(sums))
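The same group-summarize-sort pattern on a tiny made-up data frame, so the dplyr steps can be run without the scraped salary data (the toy teams and numbers are hypothetical):

```r
library(dplyr)

# Hypothetical mini data set with the same column names as the scraped salaries:
toy <- data.frame(
    TEAM   = c("A", "A", "B", "B", "B"),
    SALARY = c(10, 20, 15, 10, 20),
    stringsAsFactors = FALSE)

toy_by_team <- group_by(toy, TEAM)
toy_summary <- summarise(toy_by_team,
                         sums    = sum(SALARY),
                         count   = n(),
                         average = mean(SALARY),
                         median  = median(SALARY))
arrange(toy_summary, desc(sums))
# Team B (sums 45, count 3) sorts above team A (sums 30, count 2)
```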
