R users can now use the popular dplyr package to work with big data in Apache Spark.
The new sparklyr package is a native dplyr interface to Spark, according to RStudio. After installing the package, users can "interactively manipulate Spark data using both dplyr and SQL (via DBI)," according to an RStudio blog post, as well as "filter and aggregate Spark data sets then bring them into R for analysis and visualization." The package also provides access to Spark's distributed machine-learning algorithms.
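A minimal sketch of that dplyr workflow might look like the following (the table name "iris" and the use of a local Spark master are illustrative assumptions, not taken from RStudio's post; sparklyr replaces dots in column names with underscores, so iris's Petal.Length becomes Petal_Length):

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed locally)
sc <- spark_connect(master = "local")

# Copy the built-in iris data frame into Spark as a table named "iris"
iris_tbl <- copy_to(sc, iris, "iris")

# dplyr verbs are translated to Spark SQL and run remotely;
# collect() brings the aggregated result back into R
iris_summary <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal_Length)) %>%
  collect()

Here the filtering and aggregation happen inside Spark, and only the small summarized result is pulled into the R session for analysis or plotting.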
Connecting to a local Spark cluster would look something like the code below, according to the sparklyr deployment documentation:
library(sparklyr)
sc <- spark_connect(master = "spark://local:7077")
And, running SQL with sparklyr might look like this:
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")