R support in Magpie is deprecated as of Version 1.59.0 and will be removed in Version 2.0.0. We recommend using RStudio on your local computer and connecting to Magpie as a data source. For more assistance, contact your Silectis support representative.
Magpie supports the R statistical programming language. R is a popular language for data science that lets users analyze and visualize data and build machine learning models.
Magpie’s notebook interface is well-suited to the iterative workflows of data scientists and data analysts. Furthermore, Magpie is built on Apache Spark and makes it easy to integrate with big data environments, a typical pain point for R users.
R Libraries
Many popular R packages are already installed in your Magpie environment, such as tidyverse, data.table, and broom. You can install more packages as you normally would in R by running install.packages("NAME"). To see the available libraries, run installed.packages(), as seen below.
%r
installed.packages()
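To check for specific packages rather than scanning the full listing, a small helper along these lines may be useful. This is a minimal sketch in base R: the helper name missing_packages is hypothetical, not part of Magpie, and in a Magpie notebook the block would start with the %r header.

```r
# Hypothetical helper (not part of Magpie): report which of the requested
# packages are not present in the current R library.
missing_packages <- function(pkgs) {
  # installed.packages() returns a matrix whose row names are package names
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Package names taken from the examples mentioned above
wanted <- c("tidyverse", "data.table", "broom")
to_install <- missing_packages(wanted)

# Show what is missing; these names could then be passed to install.packages()
print(to_install)
```

Checking first avoids reinstalling packages that the environment already provides.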
Magpie supports both the SparkR and sparklyr R libraries for Apache Spark. Magpie returns a SparkR or sparklyr DataFrame depending on the argument you pass to functions in the magpie package. SparkR is used by default. You can change the default Spark library by setting the variable mc$defaultLibrary. In the example below, we set the default library to sparklyr and load a few other libraries that can be used for analysis.
%r
mc$defaultLibrary <- "sparklyr"
library(scales)
library(tidyverse)
library(sparklyr)
In addition to sparklyr and SparkR, some other useful analytical libraries in Magpie include magrittr for piping data, dplyr for transforming data, ggplot2 for creating visualizations from DataFrames, and MLlib to support machine learning use cases.
Magpie Context
The magpie R package provides functions to interact with your Magpie environment within R. The MagpieContext is available as the mc variable, which is a required argument to most functions in the magpie R package. Visit our documentation to learn more.
When using sparklyr, the Spark connection is already configured and available as the variable sc.
Examples
R scripting blocks in the notebook are denoted by the %r header. The following code block shows an example that uses the MagpieContext to get a table as a sparklyr DataFrame.
%r
# Read a Magpie table into a DataFrame
sample_table_df <- getTableDataFrame(mc, "sample_table")

# Run a SQL query and read the result into a DataFrame
df <- sql(mc, "SELECT col1, col2 FROM sample_table")

# Show the data
print(df)

# Save the result to a temp table
execute(mc, "save result as temp table filtered_sample_table")
In the first example, the getTableDataFrame and sql methods were used to load data from a Magpie table into a DataFrame. We also showed how Magpie commands can be executed through the MagpieContext. In the following example, we assume that there is a table called taxi_zone_airbnb_listings which contains data joined to show Airbnb listings with the number of drop-offs in their borough. Using this table, we will show how R can be used to analyze and plot data in Magpie using the tidyverse and ggplot2 libraries. After you execute the following code block, Magpie displays the resulting scatter plot.
%r
# Assumes sparklyr is the default library and that scales, tidyverse, and sparklyr are loaded
taxi_zone_and_listing <- magpie::sql(mc, "select zone_id, borough, zone, zone_latitude, zone_longitude, dropoffs, price from taxi_zone_airbnb_listings")

# Aggregate per zone; dropoffs is carried through the grouping as total_dropoffs
summary_df <- taxi_zone_and_listing %>%
  group_by(zone_id, borough, zone, zone_latitude, zone_longitude, total_dropoffs = dropoffs) %>%
  summarise(
    avg_listed_price = mean(price),
    total_listings = n()
  ) %>%
  collect()  # bring the aggregated result into a local data frame for ggplot

# Scatter plot: point size is the listing count, color is the borough
summary_df %>%
  ggplot(aes(avg_listed_price, total_dropoffs, size = total_listings, color = borough)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 15), name = "# of Listings") +
  labs(y = "Total Dropoffs", x = "Average Listed Price", title = "New York", subtitle = "Airbnb Listing Prices v. Taxi/Rideshare dropoffs") +
  theme_bw() +
  scale_y_continuous(labels = comma)
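To see what the grouping step produces without a Spark connection, the same aggregation can be tried on a small in-memory data frame. This is only a sketch: the rows below are made up for illustration, it uses base R aggregate instead of dplyr so it runs without extra packages, and it assumes the drop-off count is constant within a zone, which is why grouping by it simply carries the value through.

```r
# Made-up listings: each row is one listing, with its zone's drop-off count repeated
listings <- data.frame(
  zone_id  = c(1, 1, 2),
  borough  = c("Queens", "Queens", "Brooklyn"),
  dropoffs = c(500, 500, 120),
  price    = c(100, 200, 90),
  stringsAsFactors = FALSE
)

# Average listed price per (zone_id, borough, dropoffs) group
summary_local <- aggregate(price ~ zone_id + borough + dropoffs,
                           data = listings, FUN = mean)
names(summary_local)[names(summary_local) == "price"] <- "avg_listed_price"

# Number of listings per group; same grouping, so rows come back in the same order
counts <- aggregate(price ~ zone_id + borough + dropoffs,
                    data = listings, FUN = length)
summary_local$total_listings <- counts$price

print(summary_local)
```

Each output row corresponds to one zone, which is the shape the scatter plot above expects: one point per zone, sized by listing count.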
For more in-depth examples, visit “Using R in Magpie” in the “Magpie Tutorials” notebook on your Magpie cluster.