Magpie supports the Python programming language. Python is a general-purpose language with strong data science and data engineering use cases. It has a rich developer community and continues to grow in popularity year over year. Magpie’s notebook interface is well suited to the iterative workflows of data scientists and data engineers.
Many popular Python libraries, such as scikit-learn, are already installed in your Magpie environment. You can load packages as you normally would in Python with an import statement, and list the available libraries by running help("modules"). Please reach out to Support if you would like another module added to your cluster.
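To check for a few specific libraries without reading the full help("modules") listing, a minimal sketch using only the Python standard library might look like the following. The helper name is_installed is hypothetical, and the module names are examples only; apart from scikit-learn (import name sklearn), they are not confirmed to be installed on any given cluster. In a Magpie notebook the block would start with the %python header.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# "sklearn" is the import name of scikit-learn. "pandas" and "matplotlib"
# are listed only as examples and may or may not be installed.
for name in ["sklearn", "pandas", "matplotlib"]:
    status = "available" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

If a library you need is reported as not installed, that is the point at which to contact Support.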
You will often interact with data in Magpie using the MagpieContext (available as
mc). Read our documentation for the
magpie Python package to learn more.
Note that the underlying SparkContext and SparkSession are available within Magpie as well. There is no need to instantiate these objects separately. They are exposed within Magpie as
sc for the SparkContext and
spark for the SparkSession.
Python scripting blocks in the notebook are denoted by the %python header. The following code block uses the MagpieContext to load a table as a DataFrame, then uses PySpark to filter and display the data.
%python
from pyspark.sql.functions import col

# Read Magpie table into a DataFrame
listings_df = mc.getTableDataFrame("airbnb_listings.listings")

# Keep only the listings in New York
nyc_listings_df = listings_df.filter(col("city") == "New York")

# Show data
nyc_listings_df.show()
In the first example, the mc.getTableDataFrame method was used to load a table into a DataFrame. Another way to load Magpie tables into Spark DataFrames is through mc.sql. The next example shows how the MagpieContext can be used to execute SQL on tables in Magpie, and how the matplotlib library can then be used to create data visualizations. Executing the following code block in Magpie displays a bar chart of the average listed price by borough.
%python
import matplotlib.pyplot as plt

# Execute SQL through MagpieContext to get a Spark DataFrame
# Assume the table listings in schema airbnb_listings exists
df = mc.sql("""
    select neighbourhood_group_cleansed as borough,
           avg(price) as avg_price
    from airbnb_listings.listings
    group by neighbourhood_group_cleansed""")

# Convert Spark DataFrame to pandas, sorted by descending average price
pdf = df.orderBy(df.avg_price.desc()).toPandas()

# Plot through matplotlib
x = list(range(len(pdf)))
plt.bar(x, pdf.avg_price, orientation='vertical')
plt.xticks(x, pdf.borough)
plt.ylabel('Average Listed Price [$]')
plt.xlabel('Borough')
plt.title('Average Listed Price by Borough')
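The plotting step can also be tried on its own, outside Magpie, to see how the borough and average price columns map onto bars. The sketch below is an illustration under stated assumptions: the function name plot_avg_price, the output file name, and the input values are all hypothetical placeholders, not real listing data, and the Agg backend is chosen only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt


def plot_avg_price(boroughs, avg_prices):
    """Draw one bar per borough and return the matplotlib Figure."""
    fig, ax = plt.subplots()
    x = list(range(len(boroughs)))
    ax.bar(x, avg_prices)
    ax.set_xticks(x)
    ax.set_xticklabels(boroughs)
    ax.set_ylabel("Average Listed Price [$]")
    ax.set_xlabel("Borough")
    ax.set_title("Average Listed Price by Borough")
    return fig


# Placeholder labels and values for illustration only, not real listing data.
fig = plot_avg_price(["A", "B", "C"], [100.0, 80.0, 60.0])
fig.savefig("avg_price_by_borough.png")  # hypothetical output file name
```

With the pandas DataFrame from the Magpie example, the equivalent call would pass pdf.borough and pdf.avg_price as the two arguments.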
For more in-depth examples, visit “Using Python in Magpie” in the “Magpie Tutorials” on your Magpie cluster.