Beginner's guide to R: Easy ways to do basic data analysis

Part 3 of our hands-on series covers pulling stats from your data frame, and related topics.


If you're finding that your selection statement is starting to get unwieldy, you can put your row and column selections into variables first, such as:

mpg20 <- mtcars$mpg > 20

cols <- c("mpg", "hp")

Then you can select the rows and columns with those variables:

mtcars[mpg20, cols]

This makes for a more compact selection statement, at the cost of a few more lines of code.
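To see what those two variables actually hold, here is a small sketch using the built-in mtcars data. The name fuel_efficient is made up for illustration and does not appear elsewhere in this series.

```r
# Logical vector: TRUE for each row where mpg is above 20
mpg20 <- mtcars$mpg > 20

# Character vector naming the columns to keep
cols <- c("mpg", "hp")

# sum() on a logical vector counts the TRUE values,
# so this is the number of rows the selection will return
sum(mpg20)

# fuel_efficient is a hypothetical name for the selected rows and columns
fuel_efficient <- mtcars[mpg20, cols]
dim(fuel_efficient)
```

Because mpg20 and cols are ordinary vectors, you can reuse them in later selections without retyping the conditions.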

Getting tired of including the name of the data set multiple times per command? If you're using only one data set and you are not making any changes to the data that need to be saved, you can attach and detach a copy of the data set temporarily.

The attach() function works like this:

attach(mtcars)

So, instead of having to type:

mpg20 <- mtcars$mpg > 20

You can leave out the data set reference and type this instead:

mpg20 <- mpg > 20

After using attach(), remember to call detach() when you're finished:

detach()

Some R users advise avoiding attach() because it can be easy to forget to detach(). If you don't detach() the copy, your variables could end up referencing the wrong data set.
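One commonly suggested alternative, not covered above, is base R's with() function, which lets you refer to column names directly for a single expression without attaching anything. A minimal sketch, assuming the built-in mtcars data:

```r
# with() evaluates an expression using the data frame's columns as variables,
# without adding anything to the search path, so there is nothing to detach
mpg20 <- with(mtcars, mpg > 20)

# Same logical vector as the fully qualified version
identical(mpg20, mtcars$mpg > 20)

# The result can be used in bracket notation as before
mtcars[mpg20, c("mpg", "hp")]
```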

Alternative to bracket notation

Bracket syntax is pretty common in R code, but it's not your only option. If you dislike that format, you might prefer the subset() function instead, which works with vectors and matrices as well as data frames. The format is:

subset(your data object, logical condition for the rows you want to return, select statement for the columns you want to return)

So, in the mtcars example, to find all rows where mpg is greater than 20 and return only those rows with their mpg and hp data, the subset() statement would look like:

subset(mtcars, mpg>20, c("mpg", "hp"))

What if you wanted to find the row with the highest mpg?

subset(mtcars, mpg==max(mpg))

If you just wanted to see the mpg information for the highest mpg:

subset(mtcars, mpg==max(mpg), mpg)
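For comparison, base R's which.max() is another way to find the top row. The sketch below uses the hypothetical names top_by_subset and top_by_index. The two approaches differ when there are ties: the == max() test keeps every tied row, while which.max() returns only the first.

```r
# subset() with == max() keeps every row tied for the highest mpg
top_by_subset <- subset(mtcars, mpg == max(mpg))

# which.max() returns the position of the first maximum only
top_by_index <- mtcars[which.max(mtcars$mpg), ]

# mtcars has a single car with the highest mpg, so both pick the same row here
rownames(top_by_subset)
rownames(top_by_index)
```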

If you just want to use subset to extract some columns and display all rows, you can either leave the row conditional spot blank with a comma, similar to bracket notation:

subset(mtcars, , c("mpg", "hp"))

Or, indicate your second argument is for columns with select= like this:

subset(mtcars, select=c("mpg", "hp"))
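One practical difference between subset() and bracket notation is how missing values are handled. The sketch below uses a tiny made-up data frame called scores (mtcars has no missing mpg values, so it would not show the effect).

```r
# A tiny made-up data frame with one missing value
scores <- data.frame(id = 1:3, score = c(10, NA, 30))

# Bracket notation: the NA comparison produces an extra all-NA row
by_bracket <- scores[scores$score > 15, ]
nrow(by_bracket)  # 2

# subset(): rows where the condition evaluates to NA are dropped
by_subset <- subset(scores, score > 15)
nrow(by_subset)   # 1
```

If your data contains NAs, this is worth keeping in mind when switching between the two styles.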

Update: The dplyr package, released in early 2014, is aimed at making manipulation of data frames faster and more rational, with similar syntax for a variety of tasks. To select certain rows based on specific logical criteria, you'd use the filter() function with the syntax filter(dataframename, logical expression). As with subset(), column names stand alone after the data frame name, so mpg>20 and not mtcars$mpg > 20.

filter(mtcars, mpg>20)

To choose only certain columns, you use the select() function with syntax such as select(dataframename, columnName1, columnName2). No quotation marks are needed with the column names:

select(mtcars, mpg, hp)

You can also combine filter() and select() with the dplyr %>% "chaining" operator, which lets you string together multiple commands on a data frame. The chaining syntax in general is:

dataframename %>%
  firstfunction(argument for first function) %>%
  secondfunction(argument for second function) %>%
  thirdfunction(argument for third function)

So viewing just mpg and hp for rows where mpg is greater than 20:

mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, hp)

No need to keep repeating the data frame name. To order those results from highest to lowest mpg, add the arrange() function to the chain with desc(columnName) for descending order:

mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))
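For comparison, here is a rough base R sketch of the same three steps, which may help show what each link in the chain is doing. The names efficient and efficient_sorted are made up for illustration, and details such as row-name handling may differ from dplyr's output.

```r
# Base R sketch of the same steps, for comparison with the dplyr chain

# filter + select: keep rows with mpg above 20 and only the mpg and hp columns
efficient <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

# arrange(desc(mpg)): order() with decreasing = TRUE sorts highest to lowest
efficient_sorted <- efficient[order(efficient$mpg, decreasing = TRUE), ]

# Show the three most fuel-efficient cars
head(efficient_sorted, 3)
```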

You can find out more about dplyr in the dplyr package's introduction vignette.

Counting factors

To tally up counts by factor, try the table() function. For the diamonds data set (included in the ggplot2 package), to see how many diamonds of each category of cut are in the data, you can use:

table(diamonds$cut)

This will return how many diamonds fall into each level of the cut factor (Fair, Good, Very Good, Premium and Ideal). Want to see a cross-tab by cut and color?

table(diamonds$cut, diamonds$color)

R's table function returns a count of each factor level in your data.
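The diamonds data is only available once ggplot2 is loaded, so here is a similar sketch using the built-in mtcars data instead. The variable names are made up for illustration, and prop.table() is an addition not mentioned above; it converts counts into proportions.

```r
# cyl and gear are numeric in mtcars, but table() tallies each distinct value
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Cross-tab: rows are cylinder counts, columns are numbers of gears
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
cyl_by_gear

# prop.table() converts the counts to proportions of the total
prop.table(cyl_counts)
```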

If you are interested in learning more about statistical functions in R and how to slice and dice your data, there are a number of books and academic downloads with more details. One with a lot of information about both R and statistics is Statistics (the Easier Way) With R, by Nicole M. Radziwill.

Next: Painless data visualization.

Copyright © 2017 IDG Communications, Inc.
