Overview
This article describes how to create and alter clusters within Magpie. Clusters can contain two types of nodes:
Driver - The driver node is the lead node on the cluster. All commands start execution here, and all user Scala, Python, and R scripts execute here. If the driver node is unhealthy, the cluster is not usable and may need to be restarted to continue processing.
Worker - Worker nodes process data on behalf of the driver. Workers contain one or more executor processes that parallelize cluster operations, enabling the processing of much more data than a single node could. If a worker node is unhealthy, the cluster will enter a degraded state, but will be able to continue processing with reduced parallelism.
Single-node clusters only contain a driver, and all data processing happens directly on the driver. This can be sufficient for moderate to large data volumes, but more intensive workloads will require a multi-node cluster with worker nodes.
Once a cluster has been defined, it can be started and stopped using the Start Cluster and Stop Cluster commands and selected for use with the Use command. These commands can be executed as part of a job, for example to spin up additional processing capacity specifically for an intensive data processing pipeline.
Cluster size and count are limited by your Magpie subscription agreement. If you attempt to create or start a cluster outside of these limits, Magpie will return an error message. Contact Silectis support to increase your cluster management limits.
Creating
The following is an example of creating a simple single-node cluster on AWS using the Create command.
create cluster { "name": "main", "driverType": "m5.4xlarge" };
Here, we create a cluster called main that consists of a single AWS m5.4xlarge node.
Understanding Instance Types
Different instance types are available depending on the cloud provider that your Magpie instance is configured on. To see which instance types are available, run the List Instance Types command.
list instance types;
To see the details of a particular instance type, use the Describe command. The instance type name will likely need to be quoted using backticks due to the presence of special characters in the name.
describe instance type `m5.4xlarge`;
Altering
If we need additional processing capacity, we could alter our cluster to add worker nodes using the Alter command.
alter cluster main { "name": "main", "driverType": "m5.2xlarge", "workerCount": 4, "workerType": "m5.4xlarge" };
Here, we alter the main cluster to reduce the size of the driver node to m5.2xlarge and add 4 m5.4xlarge worker nodes. Note that clusters can only be altered when they are in the Stopped state. Additionally, properties like storage volume size can be altered for a cluster. To see all of the available properties, visit the Cluster JSON specification documentation.
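When cluster definitions are generated by a script rather than typed by hand, the create and alter statements above can be assembled from a plain dictionary. The following is a minimal sketch in Python, assuming only the JSON fields shown in this article (name, driverType, workerCount, workerType). The helper names cluster_spec and render_statement are hypothetical and are not part of Magpie.

```python
import json


def cluster_spec(name, driver_type, worker_count=0, worker_type=None):
    """Build a cluster JSON specification using only the fields shown in this article."""
    spec = {"name": name, "driverType": driver_type}
    if worker_count > 0:
        if worker_type is None:
            raise ValueError("worker_type is required when worker_count > 0")
        spec["workerCount"] = worker_count
        spec["workerType"] = worker_type
    return spec


def render_statement(action, spec):
    """Render a create or alter statement in the form used in the examples above."""
    body = json.dumps(spec)
    if action == "create":
        return f"create cluster {body};"
    if action == "alter":
        return f"alter cluster {spec['name']} {body};"
    raise ValueError(f"unsupported action: {action}")


# Single-node cluster, then the same cluster altered to add four workers.
single = render_statement("create", cluster_spec("main", "m5.4xlarge"))
multi = render_statement("alter", cluster_spec("main", "m5.2xlarge", 4, "m5.4xlarge"))
print(single)
print(multi)
```

Omitting the worker fields for a single-node cluster mirrors the create example above, which specifies only a name and a driver type.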
Installing Custom Libraries
Magpie clusters support an optional Bootstrap Script that can be used to install Python, Scala, or R libraries or do other node-level customization. Both the Create and Alter commands support a with bootstrap script option that can be used to perform this customization. For example, to install a Python library:
create cluster main { "name": "main", "driverType": "m5.4xlarge" } with bootstrap script """ pip install simple-salesforce """;
This bootstrap script will install the simple-salesforce Python package for interacting with the Salesforce API.
To install a Scala or Java library, copy the jar into the /usr/lib/magpie/jars folder:
create cluster main { "name": "main", "driverType": "m5.4xlarge" } with bootstrap script """
wget -O /usr/lib/magpie/jars/spark-stringmetric_2.11-0.3.0.jar \
  https://repo1.maven.org/maven2/com/github/mrpowers/spark-stringmetric_2.11/0.3.0/spark-stringmetric_2.11-0.3.0.jar
""";
This bootstrap script will add the spark-stringmetric library to the cluster.
To install an R package, use Rscript and install.packages:
create cluster main { "name": "main", "driverType": "m5.4xlarge" } with bootstrap script """ Rscript -e "install.packages('tidyquant', repos = 'https://cloud.r-project.org'); library(tidyquant)" """;
This bootstrap script installs the tidyquant R package and verifies the install by attempting to load the package after installation.
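When a cluster needs several libraries, the script body can be generated rather than written by hand. The sketch below, in Python, composes one script from the three patterns shown above (pip install, wget into /usr/lib/magpie/jars, and Rscript with install.packages). It assumes that a bootstrap script may contain several commands on separate lines, which this article does not state explicitly. The function name bootstrap_script is hypothetical.

```python
def bootstrap_script(pip_packages=(), jar_urls=(), r_packages=()):
    """Compose bootstrap script lines following the three patterns shown above."""
    lines = []
    for pkg in pip_packages:
        # Python libraries are installed with pip.
        lines.append(f"pip install {pkg}")
    for url in jar_urls:
        # Scala and Java jars are downloaded into the Magpie jars folder.
        jar_name = url.rsplit("/", 1)[-1]
        lines.append(f"wget -O /usr/lib/magpie/jars/{jar_name} {url}")
    for pkg in r_packages:
        # R packages are installed, then loaded to verify the install.
        lines.append(
            f"Rscript -e \"install.packages('{pkg}', "
            f"repos = 'https://cloud.r-project.org'); library({pkg})\""
        )
    # Assumption: one command per line is accepted in a bootstrap script.
    return "\n".join(lines)


script = bootstrap_script(
    pip_packages=["simple-salesforce"],
    r_packages=["tidyquant"],
)
print(script)
```

The resulting text would be placed between the triple quotes of the with bootstrap script option.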
Default Clusters
Magpie supports setting a cluster as the default cluster for a particular user or the entire organization. When a user logs in, they will automatically use their default cluster, if it is started. A user-level default cluster setting will override an organization-level default cluster. For example, to set the cluster main as the default for the Silectis organization:
alter organization silectis set default cluster main;
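The precedence rule described above can be illustrated with a short Python sketch. This is a model of the stated behavior, not Magpie's implementation. It assumes that when the user-level default is not started, no cluster is selected automatically (the article does not say whether Magpie falls back to the organization default in that case). The function names and the cluster name analytics are hypothetical.

```python
def effective_default(user_default, org_default):
    """A user-level default overrides the organization-level default."""
    return user_default if user_default is not None else org_default


def cluster_on_login(user_default, org_default, started_clusters):
    """Return the cluster used automatically at login, or None if it is not started."""
    default = effective_default(user_default, org_default)
    # Assumption: there is no fallback when the effective default is stopped.
    if default is not None and default in started_clusters:
        return default
    return None


# The user-level setting wins when both are set and started.
print(cluster_on_login("analytics", "main", {"main", "analytics"}))
# Without a user-level setting, the organization default applies.
print(cluster_on_login(None, "main", {"main"}))
```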