IBM, Cloudera join RStudio to create R interface to Apache Spark


R users can now use the popular dplyr package to tap into Apache Spark big data.

The new sparklyr package is a native dplyr interface to Spark, according to RStudio. After installing the package, users can "interactively manipulate Spark data using both dplyr and SQL (via DBI)," according to an RStudio blog post, as well as "filter and aggregate Spark data sets then bring them into R for analysis and visualization." The package also provides access to Spark's distributed machine-learning algorithms.

Connecting to a local Spark cluster would look something like the code below, according to the sparklyr deployment documentation:


library(sparklyr)
# Connect to a Spark standalone cluster at the given master URL
sc <- spark_connect(master = "spark://local:7077")
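
Once a connection is established, the dplyr workflow the post describes could look something like the sketch below. This is an illustration rather than code from RStudio's post: the local connection, the mtcars table and the column names are assumptions.


library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")        # connect to a local Spark instance
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")  # copy an R data frame into Spark

# Filter and aggregate in Spark, then collect the small result back into R
mpg_by_cyl <- mtcars_tbl %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()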

And running SQL with sparklyr might look like this:


library(DBI)
# Assumes the iris data set has already been copied into Spark, e.g. with copy_to(sc, iris, "iris")
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
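
The post also notes access to Spark's distributed machine-learning algorithms. A minimal sketch, assuming a Spark table such as the mtcars_tbl created above and an illustrative model formula, might fit a linear model with sparklyr's ml_linear_regression():


library(sparklyr)

# Fit a linear regression in Spark; mtcars_tbl is assumed to be a Spark table
# created earlier with copy_to()
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)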