Using R in Magpie

Magpie supports the R statistical programming language. R is popular in data science for analyzing and visualizing data and for building machine learning models.

Magpie’s notebook interface is well-suited to the iterative workflows of data scientists and data analysts. Furthermore, Magpie is built on Apache Spark and makes it easy to integrate with big data environments, a typical pain point for R users.

R Libraries

Many popular R packages are already installed in your Magpie environment, such as tidyverse, data.table, and broom. You can install more packages as you normally would in R by running install.packages("NAME"). To see the installed packages, run installed.packages().
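As a minimal sketch of checking for a package before installing it, the base R snippet below lists the installed package names and works out which ones are still missing. The names in `wanted` are only examples taken from this article, and the install call is left commented out because it needs network access to a package repository.

```r
# Names of every package visible to this R session
installed <- rownames(installed.packages())

# Packages we would like to have (example names from this article)
wanted <- c("tidyverse", "data.table", "broom")

# Which of the wanted packages are not installed yet?
to_install <- setdiff(wanted, installed)
print(to_install)

# Install only the missing ones; uncomment to run
# if (length(to_install) > 0) install.packages(to_install)
```

Installing only what is missing avoids re-downloading packages that already ship with the environment.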


Magpie supports both the SparkR and sparklyr libraries for Apache Spark. Functions in the magpie package return a SparkR or sparklyr DataFrame depending on the argument you pass to them; SparkR is used by default. To change the default Spark library, set the variable mc$defaultLibrary. The example below sets the default library to sparklyr.

mc$defaultLibrary <- "sparklyr"


In addition to sparklyr and SparkR, other useful analytical libraries in Magpie include magrittr for piping data, dplyr for transforming data, ggplot2 for creating visualizations from DataFrames, and Spark's MLlib, which is available through SparkR and sparklyr, for machine learning use cases.
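To illustrate how piping and transformation fit together, here is a small sketch that runs on a local data frame rather than a Magpie table. The `listings` data is hypothetical and exists only for this example; it assumes the magrittr and dplyr packages are installed.

```r
library(magrittr)  # provides the %>% pipe
library(dplyr)     # data transformation verbs

# Hypothetical listings data, used only for illustration
listings <- data.frame(
  borough = c("Brooklyn", "Brooklyn", "Queens"),
  price   = c(100, 200, 90)
)

# Average price and number of listings per borough
by_borough <- listings %>%
  group_by(borough) %>%
  summarise(avg_price = mean(price), total_listings = n())

print(by_borough)
```

The same verbs can be applied to a sparklyr DataFrame, where dplyr translates them into Spark SQL instead of running them in local memory.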

Magpie Context

The magpie R package provides functions to interact with your Magpie environment within R. The MagpieContext is available as the mc variable, which is a required argument to most functions in the magpie R package. Visit our documentation to learn more.

When using sparklyr, the Spark connection is already configured and available as the variable sc.
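As a rough sketch of what the sc connection can be used for, the snippet below copies R's built-in mtcars data set to Spark and summarizes it with dplyr. The temporary table name mtcars_tmp is made up for this example, and it is an assumption that your Magpie environment permits copying local data to Spark this way. The guard around the code simply skips it when no sc variable exists.

```r
# `sc` is provided by Magpie; skip the example if it is not available
has_connection <- exists("sc")

if (has_connection) {
  library(sparklyr)
  library(dplyr)

  # Copy a small built-in R data set to Spark as a temporary table
  # ("mtcars_tmp" is a hypothetical name chosen for this example)
  cars_tbl <- copy_to(sc, mtcars, "mtcars_tmp", overwrite = TRUE)

  # dplyr verbs are translated to Spark SQL and run on the cluster;
  # collect() brings the small result back into R
  avg_mpg <- cars_tbl %>%
    group_by(cyl) %>%
    summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
    collect()

  print(avg_mpg)
}
```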


R scripting blocks in the notebook are denoted by the %r header. The following code block uses the MagpieContext to read a table into a sparklyr DataFrame.

# Read magpie table into DataFrame
sample_table_df <- getTableDataFrame(mc, "sample_table")

# Read sql into DataFrame 
df <- sql(mc, "SELECT col1, col2 FROM sample_table")

# Show data
print(head(df))

# Save result to temp table
execute(mc, "save result as temp table filtered_sample_table")

In the first example, the getTableDataFrame and sql functions were used to load data from a Magpie table into a DataFrame. The example also showed how Magpie commands can be executed through the MagpieContext. The next example assumes that there is a table called taxi_zone_airbnb_listings, which joins Airbnb listings to the number of taxi dropoffs in their borough. Using this table, it shows how R can be used to analyze and plot data in Magpie with the tidyverse and ggplot2 libraries. Executing the following code block in Magpie displays a scatter plot of average listed price against total dropoffs, with points sized by number of listings and colored by borough.

taxi_zone_and_listing <- magpie::sql(mc, "select zone_id, borough, zone,
  zone_latitude, zone_longitude, dropoffs, price from taxi_zone_airbnb_listings")

# Aggregate to one row per taxi zone, keeping borough and dropoffs
summary_df <- taxi_zone_and_listing %>%
  group_by(
    zone_id,
    borough,
    total_dropoffs = dropoffs
  ) %>%
  summarise(
    avg_listed_price = mean(price),
    total_listings = n()
  )

# Plot average listed price against dropoffs, sized by listing count
summary_df %>%
  ggplot(aes(avg_listed_price, total_dropoffs, size = total_listings, color = borough)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 15), name = "# of Listings") +
  labs(y = "Total Dropoffs", x = "Average Listed Price", title = "New York", subtitle = "Airbnb Listing Prices v. Taxi/Rideshare dropoffs") +
  theme_bw()

For more in-depth examples, visit “Using R in Magpie” in the “Magpie Tutorials” notebook on your Magpie cluster.
