Using Python in Magpie

November 19, 2020 21:03

Overview

Magpie supports the Python programming language. Python is a general-purpose language with powerful data science and data engineering use cases. It has a rich developer community and continues to grow in popularity year over year. Magpie’s notebook interface is well suited to the iterative workflows of data scientists and data engineers.

Python Libraries

Many popular Python libraries are already installed in your Magpie environment, such as pandas, matplotlib, pyspark, and scikit-learn. You can load packages as you normally would in Python with an import statement, and you can list the available libraries by running help("modules"), as shown below. Please reach out to Support if you would like another module added to your cluster.

%python
help("modules")
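
The output of help("modules") can be long, so it is sometimes quicker to check for a handful of specific libraries. The following is a minimal sketch using only the Python standard library; the helper name missing_modules and the list of names are illustrative and not part of Magpie. Note that scikit-learn is imported under the name sklearn. In a Magpie notebook, the code would go in a %python block.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Libraries this article lists as preinstalled; edit the list to suit your work.
wanted = ["pandas", "matplotlib", "pyspark", "sklearn"]
print("Not installed:", missing_modules(wanted))
```

An empty list means every library in wanted can be imported. Any name that is printed is a candidate for a request to Support.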

Magpie Context

You will often interact with data in Magpie using the MagpieContext (available as mc). Read our documentation for the magpie Python package to learn more.

Note that the underlying SparkContext and SparkSession are available within Magpie as well. There is no need to instantiate these objects separately. They are exposed within Magpie as sc for the SparkContext and spark for the SparkSession.
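
To confirm that these objects are already defined before relying on them, a small check along the following lines may help. This is only a sketch: predefined_objects is a hypothetical helper written for this example, not a Magpie function, and it simply looks up the names mc, sc, and spark in whatever namespace it is given. In a Magpie notebook, the code would go in a %python block.

```python
def predefined_objects(namespace, names=("mc", "sc", "spark")):
    """Map each name to the type name of the object bound to it, or None if absent."""
    return {
        name: type(namespace[name]).__name__ if name in namespace else None
        for name in names
    }


# Inside a Magpie notebook all three names are expected to be bound already;
# in a plain Python session each value is None unless you defined the name yourself.
print(predefined_objects(globals()))
```

If all three names are reported with a type, there is no need to create a SparkContext or SparkSession yourself.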

Examples

Python scripting blocks in the notebook are denoted by the %python header. The following code block uses the Magpie Context to get a table as a DataFrame, then uses PySpark to filter and display the data in that DataFrame.

%python
from pyspark.sql.functions import col

# Read Magpie table into a DataFrame
listings_df = mc.getTableDataFrame("airbnb_listings.listings")

# Keep only the listings in New York
nyc_listings_df = listings_df.filter(col("city") == "New York")

# Show data
nyc_listings_df.show()

The first example used the mc.getTableDataFrame method to load a table into a DataFrame. Another way to load Magpie tables into Spark DataFrames is through mc.sql. The next example uses the MagpieContext to execute SQL on tables in Magpie, then uses the pandas and matplotlib Python libraries to create a data visualization. Executing the following code block in Magpie displays a bar chart of the average listed price by borough.

%python
import matplotlib.pyplot as plt

# Execute SQL through MagpieContext to get Spark dataframe
# Assume the table listings in schema airbnb_listings exists
df = mc.sql("""
  select neighbourhood_group_cleansed as borough, avg(price) as avg_price
    from airbnb_listings.listings
    group by neighbourhood_group_cleansed""")

# Sort by average price (highest first) and convert the Spark DataFrame to a pandas DataFrame
pdf = df.orderBy(df.avg_price.desc()).toPandas()

# Plot through matplotlib
x = list(range(len(pdf)))

plt.bar(x, pdf.avg_price)  # bars are vertical by default
plt.xticks(x, pdf.borough)
plt.ylabel('Average Listed Price [$]')
plt.xlabel('Borough')
plt.title('Average Listed Price by Borough')

For more in-depth examples, visit “Using Python in Magpie” in the “Magpie Tutorials” on your Magpie cluster.