4 data wrangling tasks in R for advanced beginners

Learn how to add columns, get summaries, sort your results and reshape your data.

Page 2 of 8

Syntax 2: R's transform() function

This is another way to accomplish what we did above. Here's the basic transform() syntax:

dataFrame <- transform(dataFrame, newColumnName = some equation)

So, to get the sum of two columns and store that into a new column with transform(), you would use code such as:

dataFrame <- transform(dataFrame, newColumn = oldColumn1 + oldColumn2)

To add a profit margin column to our data frame with transform() we'd use:

companiesData <- transform(companiesData, margin = (profit/revenue) * 100)

We can then use the round() function to round the column results to one decimal place. Or, in one step, we can create a new column that's already rounded to one decimal place:

companiesData <- transform(companiesData, margin = round((profit/revenue) * 100, 1))
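To see the one-step version in action, here is a minimal sketch on a small, made-up stand-in for companiesData. The company names and figures below are invented for illustration and are not the article's data:

```r
# Hypothetical stand-in for companiesData; names and figures are invented
companiesData <- data.frame(
  company = c("Acme", "Globex", "Initech"),
  revenue = c(1000, 3000, 700),
  profit  = c(250, 400, 50)
)

# Add a margin column that is already rounded to one decimal place
companiesData <- transform(companiesData,
                           margin = round((profit / revenue) * 100, 1))

print(companiesData$margin)
# [1] 25.0 13.3  7.1
```

Inside transform(), column names such as profit and revenue can be used directly, without the companiesData$ prefix.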

One brief aside about round(): You can use negative numbers for the second, "number of decimal places" argument. While round(73842.421, 1) will round to one decimal, in this case 73842.4, round(73842.421, -3) will round to the nearest thousand, in this case 74000.
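A quick sketch of how the second argument to round() behaves, using the same number as above (the variable names are just for illustration):

```r
x <- 73842.421

oneDecimal      <- round(x, 1)   # one decimal place: 73842.4
wholeNumber     <- round(x, 0)   # nearest whole number: 73842
nearestThousand <- round(x, -3)  # nearest thousand: 74000

print(c(oneDecimal, wholeNumber, nearestThousand))
```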

Syntax 3: R's apply() function

As the name helpfully suggests, this will apply a function to a data frame (or several other R data structures, but we'll stick with data frames for now). This syntax is more complicated than the first two but can be useful for some more complex calculations.

The basic format for apply() is:

dataFrame$newColumn <- apply(dataFrame, 1, function(x) { . . . } )

The line of code above will create a new column called "newColumn" in the data frame; the contents will be whatever the code in { . . . } does.

Here's what each of those apply() arguments is doing. The first argument for apply() is the existing data frame. The second argument, 1 in this example, means "apply a function by row." If that argument were 2, it would mean "apply a function by column," which is what you'd want to get a sum or average for each column instead of for each row.
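As a rough illustration of the difference between 1 and 2 as the second argument, here is a sketch on a small all-numeric data frame. The data frame and its figures are invented for this example:

```r
# Hypothetical all-numeric data frame, invented for illustration
numbers <- data.frame(
  revenue = c(1000, 3000, 700),
  profit  = c(250, 400, 50)
)

# Second argument 1: the function is called once per row
rowTotals <- apply(numbers, 1, function(x) sum(x))
# row totals are 1250, 3400 and 750

# Second argument 2: the function is called once per column
columnMeans <- apply(numbers, 2, function(x) mean(x))
# revenue averages about 1566.67, profit about 233.33
```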

The third argument, function(x), should appear as written. More specifically, the function() part must be written exactly that way, while the "x" can be any variable name. It means "What follows is an ad-hoc function that I haven't named. I'll call its input argument x." What's x in this case? It's each item (row or column) being iterated over by apply().

Finally, { . . . } is whatever you want to be doing with each item you're iterating over.
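To make the anonymous-function idea concrete, the sketch below runs the same calculation three ways: with the argument called x, with a different argument name, and with a function that was named ahead of time. The data frame and the names thisRow and largestValue are made up for this example:

```r
# Hypothetical all-numeric data frame, invented for illustration
numbers <- data.frame(
  revenue = c(1000, 3000, 700),
  profit  = c(250, 400, 50)
)

# Anonymous function with the argument called x
withX <- apply(numbers, 1, function(x) { max(x) })

# Same calculation, with the argument called thisRow instead
withThisRow <- apply(numbers, 1, function(thisRow) { max(thisRow) })

# Same calculation again, using a function that was named beforehand
largestValue <- function(x) { max(x) }
withNamed <- apply(numbers, 1, largestValue)

identical(withX, withThisRow)  # TRUE
identical(withX, withNamed)    # TRUE
```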

Keep in mind that apply() will seek to apply the function on every item in each row or column. That can be a problem when the function works only on numbers and some of your data frame columns aren't numeric.

That's exactly the case with our sample data of financial results. For the companiesData data frame, this won't work:

apply(companiesData, 1, function(x) sum(x))

Why? Because apply() will try to sum every item per row, and company names can't be summed.
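One way to see what goes wrong: apply() converts a data frame to a matrix before doing anything else, and a matrix can hold only one type of data, so a single text column turns every value into text. The sketch below shows this on a made-up stand-in for companiesData (names and figures are invented) and catches the error rather than letting it stop the script:

```r
# Hypothetical stand-in for companiesData; names and figures are invented
companiesData <- data.frame(
  company = c("Acme", "Globex", "Initech"),
  revenue = c(1000, 3000, 700),
  profit  = c(250, 400, 50)
)

# With a text column in the mix, every value becomes a character string
asMatrix <- as.matrix(companiesData)
is.character(asMatrix)  # TRUE

# Catch the error from sum() instead of halting the script
result <- tryCatch(
  apply(companiesData, 1, function(x) sum(x)),
  error = function(e) conditionMessage(e)
)
print(result)  # the error message produced by sum()
```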

To use the apply() function on only some columns in the data frame, such as adding the revenue and profit columns together (which, I'll admit, is an unlikely need in the real world of financial analysis), we'd need to use a subset of the data frame as our first argument. That is, instead of using apply() on the entire data frame, we just want apply() on the revenue and profit columns, like so:

apply(companiesData[,c('revenue', 'profit')], 1, function(x) sum(x))

Where it says:

[,c('revenue', 'profit')]

after the name of the data frame, it means "use only the revenue and profit columns" in the sum. The empty space before the comma means "keep all rows."

You then might want to store the results of apply() in a new column, such as:

companiesData$sums <- apply(companiesData[,c('revenue', 'profit')], 1, function(x) sum(x))
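Here is that line run end to end, as a sketch on a made-up stand-in for companiesData (names and figures are invented for illustration):

```r
# Hypothetical stand-in for companiesData; names and figures are invented
companiesData <- data.frame(
  company = c("Acme", "Globex", "Initech"),
  revenue = c(1000, 3000, 700),
  profit  = c(250, 400, 50)
)

# Sum only the revenue and profit columns, row by row
companiesData$sums <- apply(companiesData[, c('revenue', 'profit')], 1,
                            function(x) sum(x))

print(companiesData$sums)
# the sums are 1250, 3400 and 750
```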

That's fine for a function like sum, where the order of the inputs doesn't matter. But let's go back to our earlier example of calculating a profit margin for each row. In that case, we need to pass profit and revenue in a certain order (it's profit divided by revenue, not the other way around) and then multiply by 100.

How can we pass multiple items to apply() in a certain order for use in an anonymous function(x)? By referring to the items in our anonymous function as x[1] for the first one, x[2] for the second, etc., such as:

companiesData$margin <- apply(companiesData[,c('revenue', 'profit')], 1, function(x) { (x[2]/x[1]) * 100 } )

That line of code above creates an anonymous function that uses the second item — in this case profit, since it's listed second in companiesData[,c('revenue', 'profit')] — and divides it by the first item in each row, revenue. This will work because there are only two items here, revenue and profit — remember, we told apply() to use only those columns.
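The sketch below runs the positional version on a made-up stand-in for companiesData (names and figures are invented). It also shows an alternative that is not part of the approach described above: because apply() passes each row as a named vector when the columns have names, the items can be referred to by column name, in which case the column order no longer matters. The marginByName column is a hypothetical name used only for this comparison:

```r
# Hypothetical stand-in for companiesData; names and figures are invented
companiesData <- data.frame(
  company = c("Acme", "Globex", "Initech"),
  revenue = c(1000, 3000, 700),
  profit  = c(250, 400, 50)
)

# By position: x[1] is revenue and x[2] is profit, matching the column order
companiesData$margin <- apply(companiesData[, c('revenue', 'profit')], 1,
                              function(x) { (x[2] / x[1]) * 100 })

# By name: the columns are listed in the opposite order here,
# but the result is the same because each item is looked up by name
companiesData$marginByName <- apply(companiesData[, c('profit', 'revenue')], 1,
                                    function(x) { (x['profit'] / x['revenue']) * 100 })

print(round(companiesData$margin, 1))
# margins are roughly 25, 13.3 and 7.1
```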
