IBM, Cloudera join RStudio to create R interface to Apache Spark


R users can now use the popular dplyr package to tap into Apache Spark big data.

The new sparklyr package is a native dplyr interface to Spark, according to RStudio. After installing the package, users can "interactively manipulate Spark data using both dplyr and SQL (via DBI)," according to an RStudio blog post, as well as "filter and aggregate Spark data sets then bring them into R for analysis and visualization." There is also access to Spark's distributed machine-learning algorithms.
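A rough sketch of that workflow might look like the following, assuming sparklyr, dplyr, and a local Spark installation are available (the table and column names here are illustrative, not from the RStudio post):

```r
# Hypothetical sketch: manipulate Spark data with dplyr verbs,
# then collect the result back into R. Assumes Spark is installed locally.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")     # connect to a local Spark instance
iris_tbl <- copy_to(sc, iris, "iris")     # copy an R data frame into Spark
                                          # (dots in column names become underscores)

# Filter and aggregate inside Spark, then bring the summary into R
iris_summary <- iris_tbl %>%
  filter(Petal_Length > 1) %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal_Length)) %>%
  collect()

spark_disconnect(sc)
```

Because the dplyr verbs are translated to Spark SQL and executed remotely, only the final `collect()` moves data into the R session.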

Connecting to a local Spark cluster would look something like the code below, according to the sparklyr deployment documentation:

library(sparklyr)
sc <- spark_connect(master = "spark://local:7077")

And, running SQL with sparklyr might look like this:

library(DBI)  # dbGetQuery comes from the DBI package
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")