Beginner's guide to R: Easy ways to do basic data analysis

Part 3 of our hands-on series covers pulling stats from your data frame, and related topics.

Page 3 of 6

Oddly, the mode() function returns information about data type instead of the statistical mode; there's an add-on package, modeest, that adds an mfv() function (most frequent value) to find the statistical mode.
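If you'd rather not install a package, one possible way to get the statistical mode with base R alone is to count values with table(). This is only a sketch: myvalues is a made-up vector, not data from this series, and the approach returns every value tied for most frequent.

```r
# Hypothetical sample vector (not from the article)
myvalues <- c(3, 7, 7, 2, 7, 3)

# mode() reports how the data is stored, not the most frequent value
storage_type <- mode(myvalues)

# Count how often each value occurs, then keep the value(s) with the top count
counts <- table(myvalues)
stat_mode <- as.numeric(names(counts)[counts == max(counts)])

storage_type  # "numeric"
stat_mode     # 7
```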

R also contains a load of more sophisticated functions that let you do analyses with one or two commands: probability distributions, correlations, significance tests, regressions, ANOVA (analysis of variance between groups) and more.

As just one example, running the correlation function cor() on a data frame such as:

cor(mydata)

will give you a matrix of correlations comparing each column with every other column. Every column must be numeric, though; cor() stops with an error if the data frame includes a non-numeric column.

R's correlation function
Results of the correlation function on the sample data set of U.S. arrests.
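To try this yourself, here's a small sketch that assumes the sample data is R's built-in USArrests data set (an assumption on our part; substitute your own data frame if it differs). The second half shows one way to keep only the numeric columns when a data frame mixes data types; the state column is added purely for illustration.

```r
# USArrests ships with R in the datasets package; assumed here to be
# the sample data set of U.S. arrests
mydata <- USArrests

# Correlation matrix for every pair of columns, rounded for readability
cor_matrix <- cor(mydata)
round(cor_matrix, 2)

# Illustration only: add a text column, then keep just the numeric
# columns before calling cor()
mixed <- data.frame(mydata, state = rownames(mydata))
numeric_only <- mixed[sapply(mixed, is.numeric)]
cor(numeric_only)
```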

Note: Be aware that you can run into problems when trying to run some functions on data where there are missing values. In some cases, R's default is to return NA even if just a single value is missing. For example, while the summary() function returns column statistics excluding missing values (and also tells you how many NAs are in the data), the mean() function will return NA if even one value in a vector is missing.

In most cases, adding the argument:

na.rm=TRUE

to NA-sensitive functions will tell that function to remove any NAs when performing calculations, such as:

mean(myvector, na.rm=TRUE)
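Here's a quick sketch of the difference, using a made-up myvector with one missing value (the numbers are ours, chosen only for illustration):

```r
# Hypothetical vector with a single missing value
myvector <- c(2, 4, NA, 8)

# Default behavior: one NA makes the whole result NA
with_na <- mean(myvector)

# na.rm = TRUE drops the NA first, so this averages 2, 4 and 8
without_na <- mean(myvector, na.rm = TRUE)

with_na
without_na

# summary() skips the NA on its own and reports how many it found
summary(myvector)
```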

If you've got data with some missing values, read a function's help file by typing a question mark followed by the name of the function, such as:

?median

The function description should say whether the na.rm argument is needed to exclude missing values.

Checking a function's help files -- even for simple functions -- can also uncover additional useful options, such as an optional trim argument for mean() that drops a fraction of the lowest and highest values before averaging, which lets you exclude some outliers.
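As a rough illustration of trim, consider a made-up vector with one extreme value (scores is a hypothetical name, not from this series). With five values, trim = 0.2 drops one observation from each end before the mean is taken.

```r
# Hypothetical data with one obvious outlier
scores <- c(1, 2, 3, 4, 100)

# Ordinary mean is pulled upward by the outlier
plain_mean <- mean(scores)

# trim = 0.2 removes 20% of the values from each end (here, 1 and 100),
# leaving the mean of 2, 3 and 4
trimmed_mean <- mean(scores, trim = 0.2)

plain_mean    # 22
trimmed_mean  # 3
```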

Not all R functions need a robust data set to be useful for statistical work. For example, how many ways can you select a committee of 4 people from a group of 15? You can pull out your calculator and find 15! divided by the product of 4! and 11! ... or you can use the R choose() function:

choose(15,4)
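If you want to convince yourself that choose() matches the calculator approach, a short check with R's factorial() function might look like this (by_hand and with_choose are just illustrative variable names):

```r
# The long way: 15! / (4! * 11!)
by_hand <- factorial(15) / (factorial(4) * factorial(11))

# The short way
with_choose <- choose(15, 4)

by_hand      # 1365
with_choose  # 1365
```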

Or, perhaps you want to see all of the possible pair combinations of a group of 5 people, not simply count them. You can create a vector with the people's names and store it in a variable called mypeople:

mypeople <- c("Bob", "Joanne", "Sally", "Tim", "Neal")

In the example above, c() is the combine function.
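One way to list the pairs themselves is base R's combn() function, which generates every combination of a given size. This is a sketch of one option rather than the only approach; mypairs is a name we've made up for the result.

```r
mypeople <- c("Bob", "Joanne", "Sally", "Tim", "Neal")

# combn() returns a matrix with one combination per column;
# the second argument is the size of each group
mypairs <- combn(mypeople, 2)

# The number of columns matches choose(5, 2)
ncol(mypairs)  # 10

# Transpose so each pair prints on its own row
t(mypairs)
```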
