Getting Started with the Magpie CLI

January 14, 2020 21:19

Our Philosophy

The Magpie notebook is a powerful interface for collaborative analysis, ad hoc exploration, and building pipelines. It’s well-suited to the iterative workflows that data engineers and data scientists experience on a daily basis while creating new analyses and pipelines.

However, we believe that notebooks are not the best medium for deploying pipelines, versioning code, or integrating with external continuous integration or scheduling tools. With that in mind, we’ve created a command line interface (CLI) that surfaces all of the features of Magpie in a portable tool.

Installing the Magpie CLI

The Magpie CLI requires Java to operate. Before proceeding, ensure you have Java installed on the machine you want to use the Magpie CLI on.
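One way to confirm the Java prerequisite is a quick check from the terminal. The sketch below is illustrative only: check_java is a hypothetical helper name, and the check only confirms that some java executable is on the PATH, not that it is a version the CLI supports.

```shell
# check_java: report whether a Java runtime is available on the PATH.
# (check_java is a hypothetical helper name, not part of the Magpie CLI.)
check_java() {
  if command -v java >/dev/null 2>&1; then
    # java -version writes its banner to stderr, so merge it into stdout
    java -version 2>&1 | head -n 1
    return 0
  fi
  echo "Java not found on PATH" >&2
  return 1
}

check_java || echo "Install Java, then re-run this check"
```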

1. First, navigate to your Magpie cluster at https://<orgname>.silect.is/#/. In the top right corner, select the “Magpie CLI” item from the drop-down menu. This will begin an automatic download.

2. Unzip the folder in your preferred installation location.

Note: In older versions of macOS, the default “Archive Utility” does not properly recognize the bin/magpie file as executable. You can work around this by navigating to the ZIP file location and running the unzip command from the command line, or by using an alternate unzip utility. The issue is fixed in Catalina and newer versions of macOS.
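The command line workaround could look like the following sketch. The function name install_magpie_cli and the example paths are placeholders, since the actual name of the downloaded archive is not given here; the final step sets the executable bit explicitly in case it was lost during extraction.

```shell
# install_magpie_cli ZIP_FILE INSTALL_DIR
# Hypothetical helper; both arguments are placeholders for your own paths.
install_magpie_cli() {
  zip_file=$1
  install_dir=$2
  if [ ! -f "$zip_file" ]; then
    echo "Archive not found: $zip_file" >&2
    return 1
  fi
  mkdir -p "$install_dir" || return 1
  # Extract quietly, overwriting any files from a previous install
  unzip -o -q "$zip_file" -d "$install_dir" || return 1
  # Make sure bin/magpie is executable wherever it landed under INSTALL_DIR
  find "$install_dir" -type f -path '*/bin/magpie' -exec chmod +x {} +
}

# Example (hypothetical paths):
# install_magpie_cli "$HOME/Downloads/magpie-cli.zip" "$HOME/magpie"
```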

3. Navigate to the folder where you unzipped the CLI package and test the Magpie CLI installation by displaying the CLI usage information.

Mac/Unix Users:

./bin/magpie --help

Windows Users:

.\bin\magpie --help

4. Optionally, configure environment variables for your Magpie credentials. By default, Magpie prompts for a host name, username, and password. The CLI also offers command line flags for specifying the username and host (see the CLI’s README.md file for details). Alternatively, you can set these parameters through environment variables, which is the best option when integrating the CLI into an external tool, because Magpie then applies the appropriate credentials automatically when starting a session.

Mac/Unix Users:

For username and password login

export MAGPIE_LOGIN_PARADIGM="password"
export MAGPIE_USER="matt@silect.is"
export MAGPIE_PASSWORD="12345"
export MAGPIE_HOST="<orgname>.silect.is"

For access token login

export MAGPIE_LOGIN_PARADIGM="token"
export MAGPIE_ACCESS_TOKEN="ABCDEFG12345678-ABCDEFG12345678"
export MAGPIE_HOST="<orgname>.silect.is"

For single sign on login

export MAGPIE_LOGIN_PARADIGM="sso"
export MAGPIE_USER="matt@silect.is"
export MAGPIE_HOST="<orgname>.silect.is"

Note: if your password contains characters that have special meaning in the shell (such as !, #, $, or &), escape each of them with a backslash (\), or wrap the whole value in single quotes so that the shell treats it literally.
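As a small illustration of the quoting issue, the sketch below sets a made-up password containing several special characters. Inside single quotes the shell performs no expansion, so the value is stored exactly as typed.

```shell
# Single quotes stop the shell from interpreting $, !, #, and & in the value.
# The password is a made-up example.
export MAGPIE_PASSWORD='pa$$w0rd!#&'

# Print the stored value to confirm nothing was expanded
printf '%s\n' "$MAGPIE_PASSWORD"
```

A password that itself contains a single quote cannot be written this way directly; the usual shell idiom is to close the quotes, add an escaped quote (\'), and reopen them.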

Windows Users: Right click on My Computer and select ‘Properties’. In the System folder, select ‘Advanced system settings’.

Select ‘Environment Variables’, then click ‘New’ to add environment variables on the next window.

5. For ease of use, you can optionally add the bin directory to your PATH environment variable, which will allow you to execute the magpie command from anywhere without having to specify the path to the bin directory.

Mac/Unix Users:

export PATH=$PATH:~/magpie/bin
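An export like the one above lasts only for the current shell session, so it is commonly placed in a shell startup file such as ~/.bashrc or ~/.zshrc. Because startup files can be read more than once, a guard against duplicate entries can be useful. The sketch below shows one way to do this; add_to_path is a hypothetical helper name, and ~/magpie/bin is the same example location used above.

```shell
# add_to_path DIR: append DIR to PATH only if it is not already listed.
# (add_to_path is a hypothetical helper name.)
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$PATH:$1" ;;     # not present, append it
  esac
  export PATH
}

add_to_path "$HOME/magpie/bin"
```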

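Windows Users: In the ‘Environment Variables’ window described in step 4, select the Path variable, click ‘Edit’, and add the full path to the bin directory.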

Using the CLI

An overview of command syntax is available in the CLI’s README.md file. There are three primary ways to interact with the Magpie CLI:

1. By default, the CLI runs in interactive mode: a user types a command, and the CLI submits it to the cluster and returns the results. Each command must be terminated by a semicolon, which allows a single command to be entered over multiple lines. The syntax and usage are otherwise the same as within a standard Magpie notebook.

2. The second is piping commands to the CLI (e.g., to execute jobs already defined within Magpie).

echo "list schemas;" | ./bin/magpie

3. Finally, the most powerful use case is passing files written in Magpie’s domain-specific language through the CLI, allowing for more flexibility to create, execute, and automate data loading tasks. These files can be versioned through version control systems and deployed via continuous integration tools. Magpie files are useful containers for data pipeline code, for example when being triggered by external scheduling or workflow management platforms.

./bin/magpie < sample_script.magpie
# If you have added the `bin` directory to your PATH:
magpie < sample_script.magpie
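When a script file is run from a scheduler or continuous integration job, it often helps to keep the output and hand the result back to the calling tool. The following is a minimal sketch under stated assumptions: run_magpie_script and the MAGPIE_BIN variable are hypothetical names, and whether the CLI returns a non-zero exit status when a command fails is not stated here, so the log should be inspected as well.

```shell
# run_magpie_script SCRIPT [LOG]
# Feed a Magpie script file to the CLI on stdin, save everything it prints
# to LOG, and return the CLI's exit status to the caller.
# MAGPIE_BIN is a hypothetical variable naming the CLI executable;
# it defaults to "magpie" found on the PATH.
run_magpie_script() {
  script=$1
  log=${2:-magpie_run.log}
  if [ ! -r "$script" ]; then
    echo "Cannot read script: $script" >&2
    return 2
  fi
  "${MAGPIE_BIN:-magpie}" < "$script" > "$log" 2>&1
}

# Example:
# run_magpie_script sample_script.magpie run.log || echo "Magpie run failed; see run.log"
```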

Integrating with External Tools

The process of integrating Magpie with external tools largely follows the steps above.

1. Instead of deploying the Magpie CLI to your local machine, you’ll want to unpack the CLI into a directory on the external tool’s instance.
2. Supply environment variables in a secure manner using your preferred secrets management tool.
3. Depending on the installation location of the external tool relative to your Magpie cluster, you may need to separately configure a VPN and/or modify firewall settings of your cloud provider. The CLI connects to your Magpie cluster on port 443 using SSL.
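Because an external tool runs the CLI unattended, it can be worth verifying that the expected environment variables are present before starting a session. The sketch below checks the variables listed in the installation steps for each login paradigm; magpie_env_check is a hypothetical helper name and not part of the CLI.

```shell
# magpie_env_check: confirm that the variables required by the selected
# login paradigm are set. (Hypothetical helper, not part of the Magpie CLI.)
magpie_env_check() {
  case "${MAGPIE_LOGIN_PARADIGM:-}" in
    password) required="MAGPIE_USER MAGPIE_PASSWORD MAGPIE_HOST" ;;
    token)    required="MAGPIE_ACCESS_TOKEN MAGPIE_HOST" ;;
    sso)      required="MAGPIE_USER MAGPIE_HOST" ;;
    *)
      echo "MAGPIE_LOGIN_PARADIGM must be password, token, or sso" >&2
      return 1
      ;;
  esac
  missing=0
  for name in $required; do
    # Look up the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# magpie_env_check && magpie < sample_script.magpie
```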

You can view an example integration in the Implementing Your Data Lake with Apache Airflow and Silectis Magpie blog post.

Sourcing Magpie Files

One special CLI command is :source. This allows you to source and execute Magpie script files within other Magpie script files, or from an interactive session. For example, the following command will source and execute the build_etl.magpie script file.

echo ":source build_etl.magpie;" | ./bin/magpie

In turn, the build_etl.magpie script file might source additional files. In this way, you can intuitively structure data pipelining jobs and more easily handle dependencies, improve versioning, and mitigate complexity. Note: script files are specified according to their relative paths.

// build_etl.magpie
:source data_staging/stage_1.magpie
:source data_staging/stage_2.magpie
:source data_staging/stage_3.magpie
:source etl/job_1.magpie
:source etl/job_2.magpie
:source etl/job_3.magpie
:source view_definitions/create_statements.magpie
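Since a top-level script can pull in many files, a pre-flight check that every sourced file exists can catch a broken path before anything runs on the cluster. The sketch below is one possible approach, with two assumptions that should be verified against your setup: check_sources is a hypothetical helper, and paths are resolved relative to the directory the check is run from. It does not follow :source lines inside the sourced files.

```shell
# check_sources SCRIPT: verify that each file named on a ":source" line exists.
# Hypothetical helper. Assumes paths are relative to the current directory.
check_sources() {
  status=0
  # The "|| [ -n ... ]" clause handles a last line with no trailing newline
  while read -r keyword path rest || [ -n "$keyword" ]; do
    [ "$keyword" = ":source" ] || continue
    path=${path%;}    # tolerate a trailing semicolon
    if [ ! -f "$path" ]; then
      echo "Missing sourced file: $path" >&2
      status=1
    fi
  done < "$1"
  return "$status"
}

# Example:
# check_sources build_etl.magpie && echo ":source build_etl.magpie;" | ./bin/magpie
```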

Uploading Files to Magpie Data Source (Beta)

The :upload command allows you to upload a file from your local machine to any Magpie file system data source (i.e., HDFS, Amazon S3, Google Cloud Storage, Azure Storage). The command is parameterized as follows: :upload <file> <data-source> <path>. The optional <path> parameter can be used to nest the file in a folder within the data source and/or rename the file being uploaded. If <path> is omitted, the local file name is used as the destination file name.

For example, the following command will upload a file test_data.csv to the Magpie data source my_s3_source at the data source root path.

:upload test_data.csv my_s3_source

In this example, we upload the file to the nested folder sample-data. Note that when the destination is a folder, the path must end with a /.

:upload test_data.csv my_s3_source sample-data/

In this final example, we will upload the file to a nested folder with a new file name.

:upload test_data.csv my_s3_source sample-data/my_renamed_test_data.csv
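To upload several files in one pass, the :upload commands can be generated by a small shell function and piped to the CLI. This is only a sketch under assumptions: make_upload_commands is a hypothetical helper, each generated command ends with a semicolon as in the piped :source example, file names are assumed to contain no spaces, and the destination file name is assumed to be taken from the base name of the local path.

```shell
# make_upload_commands LOCAL_DIR DATA_SOURCE DEST_FOLDER
# Print one ":upload" command per CSV file found in LOCAL_DIR.
# DEST_FOLDER must end with "/" so it is treated as a folder.
# (Hypothetical helper, not part of the Magpie CLI.)
make_upload_commands() {
  for file in "$1"/*.csv; do
    [ -f "$file" ] || continue    # skip the literal pattern when nothing matches
    printf ':upload %s %s %s;\n' "$file" "$2" "$3"
  done
}

# Example: review the generated commands first, then pipe them to the CLI
# make_upload_commands ./exports my_s3_source sample-data/
# make_upload_commands ./exports my_s3_source sample-data/ | ./bin/magpie
```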