Magpie supports the R statistical programming language. R is a popular language for data science that lets users analyze and visualize data and build machine learning models.
Magpie’s notebook interface is well-suited to the iterative workflows of data scientists and data analysts. Furthermore, Magpie is built on Apache Spark and makes it easy to integrate with big data environments, a typical pain point for R users.
Many popular R packages are already installed in your Magpie environment, such as broom. You can install more packages as you normally would in R, by running install.packages("NAME"). To see the available libraries, run installed.packages(), as seen below.
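As a minimal sketch of the two calls above, the following block lists the installed package names and installs a package only when it is missing. The helper name ensure_package is hypothetical and not part of Magpie; in a Magpie notebook the block would be preceded by the %r header.

```r
# List the names of all installed packages; installed.packages() returns a
# matrix whose row names are the package names.
installed <- rownames(installed.packages())
head(sort(installed))

# Hypothetical helper (not a Magpie function): install a package only if it
# cannot already be loaded, then report whether it is available.
ensure_package <- function(name) {
  if (!requireNamespace(name, quietly = TRUE)) {
    install.packages(name)
  }
  invisible(requireNamespace(name, quietly = TRUE))
}

# "stats" ships with R, so no installation is triggered here.
ensure_package("stats")
```

Checking before installing avoids reinstalling a package every time the notebook block is rerun.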
Magpie supports both the SparkR and sparklyr R libraries for Apache Spark. Magpie will return a SparkR or sparklyr DataFrame depending on the argument you pass to functions in the magpie package. SparkR is used by default. You can change your default Spark library by updating the variable mc$defaultLibrary. In the example below, we set the default library to sparklyr and load a few other libraries that can be used for analysis.
    %r
    mc$defaultLibrary <- "sparklyr"
    library(scales)
    library(tidyverse)
    library(sparklyr)
In addition to sparklyr and SparkR, some other useful analytical libraries in Magpie include magrittr for piping data, dplyr for transforming data, ggplot2 for creating visualizations from DataFrames, and mllib to support machine learning use cases.
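To illustrate how piping and transformation fit together, here is a small sketch that runs on R's built-in mtcars data frame rather than on a Magpie table. The conversion factor and the column name kpl are assumptions chosen for illustration only.

```r
library(magrittr)  # provides the %>% pipe
library(dplyr)     # provides filter(), mutate(), select()

# Keep fuel-efficient cars, add a kilometres-per-litre column, and keep
# only the columns of interest. Each step receives the previous result.
efficient_cars <- mtcars %>%
  filter(mpg > 25) %>%
  mutate(kpl = mpg * 0.425144) %>%
  select(mpg, kpl, cyl)

print(efficient_cars)
```

The same verbs can generally be applied to a sparklyr DataFrame, where they are translated to Spark SQL instead of running in local memory.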
The magpie R package provides functions to interact with your Magpie environment within R. The MagpieContext is available as the mc variable, which is a required argument to most functions in the magpie R package. Visit our documentation to learn more.
When using sparklyr, the Spark connection is already configured and available as a predefined variable.
R scripting blocks in the notebook are denoted by the %r header. The following code block shows an example that uses the MagpieContext to get a table as a sparklyr DataFrame.
    %r
    # Read magpie table into DataFrame
    sample_table_df <- getTableDataFrame(mc, "sample_table")

    # Read sql into DataFrame
    df <- sql(mc, "SELECT col1, col2 FROM sample_table")

    # Show data
    print(df)

    # Save result to temp table
    execute(mc, "save result as temp table filtered_sample_table")
In the first example, the getTableDataFrame and sql methods were used to load data from a Magpie table into a DataFrame. We also showed how Magpie commands can be executed through the MagpieContext. In the following example, we assume that there is a table called taxi_zone_airbnb_listings, which contains joined data showing Airbnb listings alongside the number of drop-offs in their borough. Using this table, we will show how R can be used to analyze and plot data in Magpie using libraries such as ggplot2. After executing the following code block in Magpie, a visualization such as the one below will be displayed.
    %r
    # Load the joined listings and drop-off data from Magpie
    taxi_zone_and_listing <- magpie::sql(mc, "select zone_id, borough, zone, zone_latitude, zone_longitude, dropoffs, price from taxi_zone_airbnb_listings")

    # Aggregate to one row per zone
    summary_df <- taxi_zone_and_listing %>%
      group_by(
        zone_id, borough, zone, zone_latitude, zone_longitude,
        total_dropoffs = dropoffs
      ) %>%
      summarise(
        avg_listed_price = mean(price),
        total_listings = n()
      )

    # Plot average price against drop-offs, sized by listing count
    summary_df %>%
      ggplot(aes(avg_listed_price, total_dropoffs, size = total_listings, color = borough)) +
      geom_point(alpha = 0.5) +
      scale_size(range = c(1, 15), name = "# of Listings") +
      labs(y = "Total Dropoffs", x = "Average Listed Price", title = "New York ", subtitle = "Airbnb Listing Prices v. Taxi/Rideshare dropoffs") +
      theme_bw() +
      scale_y_continuous(labels = comma)
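To see what the aggregation step produces without a Spark connection, the sketch below applies the same group_by and summarise pattern to a small local data frame. All rows are made-up values for illustration, not real listing or trip data, and the column set is a simplified assumption based on the table described above.

```r
library(dplyr)
library(ggplot2)
library(scales)

# Made-up rows for illustration only; each row stands for one listing.
listings <- data.frame(
  zone     = c("A", "A", "B", "B", "B", "C"),
  borough  = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"),
  dropoffs = c(1200, 1200, 800, 800, 800, 300),
  price    = c(200, 300, 100, 150, 110, 90)
)

# One row per zone: average listing price and number of listings.
summary_df <- listings %>%
  group_by(zone, borough, total_dropoffs = dropoffs) %>%
  summarise(
    avg_listed_price = mean(price),
    total_listings = n(),
    .groups = "drop"
  )

# Build the bubble chart; printing p would render it.
p <- ggplot(summary_df, aes(avg_listed_price, total_dropoffs,
                            size = total_listings, color = borough)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(labels = comma)
```

Because the drop-off count is constant within a zone, including it in group_by carries it through to the summary without a separate aggregation.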
For more in-depth examples, visit “Using R in Magpie” in the “Magpie Tutorials” notebook on your Magpie cluster.