Creating and Altering Clusters

Overview

This article describes how to create and alter clusters within Magpie. Clusters can contain two types of nodes:

  • Driver - The driver node is the lead node on the cluster. All commands start execution here, and all user Scala, Python, and R scripts execute here. If the driver node is unhealthy, the cluster is not usable and may need to be restarted to continue processing.

  • Worker - Worker nodes process data on behalf of the driver. Workers contain one or more executor processes that parallelize cluster operations, enabling the processing of much more data than a single node could. If a worker node is unhealthy, the cluster will enter a degraded state, but will be able to continue processing with reduced parallelism.

Single-node clusters contain only a driver, and all data processing happens directly on the driver. This can be sufficient for moderate to large data volumes, but more intensive workloads will require a multi-node cluster with worker nodes.

Once a cluster has been defined, it can be started and stopped using the Start Cluster and Stop Cluster commands and selected for use with the Use command. These commands can be executed as part of a job, for example to spin up additional processing capacity specifically for an intensive data processing pipeline.

Cluster size and count are limited by your Magpie subscription agreement. If you attempt to create or start a cluster outside of these limits, Magpie will return an error message. Contact Silectis support to increase your cluster management limits.

Creating

The following is an example of creating a simple single-node cluster on AWS using the Create command.

create cluster {
  "name": "main",
  "driverType": "m5.4xlarge"
};

Here, we create a cluster called main that consists of a single AWS m5.4xlarge node.
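
If cluster definitions are generated by a script rather than typed by hand, the command text above can be rendered from a dictionary. The following is a minimal Python sketch, not part of Magpie itself: the function name render_create_cluster is hypothetical, and treating name and driverType as the required properties is an assumption based on the example above. How the resulting text is submitted to Magpie is outside the scope of the sketch.

```python
import json


def render_create_cluster(spec: dict) -> str:
    """Render a cluster spec dict as `create cluster` command text.

    Hypothetical helper; it only builds a string in the format
    shown in the example above.
    """
    # Assumption: "name" and "driverType" are the minimum properties,
    # since they are the only ones in the single-node example.
    for key in ("name", "driverType"):
        if key not in spec:
            raise ValueError(f"cluster spec is missing '{key}'")
    # indent=2 matches the two-space layout used in this article.
    return f"create cluster {json.dumps(spec, indent=2)};"


if __name__ == "__main__":
    print(render_create_cluster({"name": "main", "driverType": "m5.4xlarge"}))
```

Keeping the specification as data makes it easier to review changes to instance types before the command is run.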

Understanding Instance Types

Different instance types are available depending on the cloud provider on which your Magpie instance is configured. To see which instance types are available, run the List Instance Types command.

list instance types;

To see the details of a particular instance type, use the Describe command. The instance type name will likely need to be quoted using backticks due to the presence of special characters in the name.

describe instance type `m5.4xlarge`;

Altering

If we need additional processing capacity, we could alter our cluster to add worker nodes using the Alter command.

alter cluster main {
  "name": "main",
  "driverType": "m5.2xlarge",
  "workerCount": 4,
  "workerType": "m5.4xlarge"
};

Here, we alter the main cluster to reduce the size of the driver node to m5.2xlarge and add 4 m5.4xlarge worker nodes. Note that clusters can only be altered when they are in the Stopped state. Additionally, properties like storage volume size can be altered for a cluster. To see all of the available properties, visit the Cluster JSON specification documentation.

Installing Custom Libraries

Magpie clusters support an optional Bootstrap Script that can be used to install Python, Scala, or R libraries or do other node-level customization. Both the Create and Alter commands support a with bootstrap script option that can be used to perform this customization. For example, to install a Python library:

create cluster main {
  "name": "main",
  "driverType": "m5.4xlarge"
} with bootstrap script """
pip install simple-salesforce
""";

This bootstrap script will install the simple-salesforce Python package for interacting with the Salesforce API.
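
After the cluster starts, it can be useful to confirm from a Python script that the library is actually importable. The following is a minimal sketch using only the Python standard library; the helper name is_importable is hypothetical. Note that the simple-salesforce package is imported under the module name simple_salesforce.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found by this interpreter."""
    # find_spec returns None when no module of that name is on the path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The pip package simple-salesforce installs the module simple_salesforce.
    if is_importable("simple_salesforce"):
        print("simple_salesforce is available")
    else:
        print("simple_salesforce is missing; check the bootstrap script")
```

The check does not import the module, so it is cheap to run at the top of a longer script.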

To install a Scala or Java library, copy the jar into the /usr/lib/magpie/jars folder:

create cluster main {
  "name": "main",
  "driverType": "m5.4xlarge"
} with bootstrap script """
wget -O /usr/lib/magpie/jars/spark-stringmetric_2.11-0.3.0.jar \
  https://repo1.maven.org/maven2/com/github/mrpowers/spark-stringmetric_2.11/0.3.0/spark-stringmetric_2.11-0.3.0.jar
""";

This bootstrap script will add the spark-stringmetric library to the cluster.

To install an R package, use Rscript and install.packages:

create cluster main {
  "name": "main",
  "driverType": "m5.4xlarge"
} with bootstrap script """
Rscript -e "install.packages('tidyquant', repos = 'https://cloud.r-project.org'); library(tidyquant)"
""";

This bootstrap script installs the tidyquant R package and verifies the install by attempting to load the package after installation.

Default Clusters

Magpie supports setting a cluster as the default cluster for a particular user or the entire organization. When a user logs in, they will automatically use their default cluster, if it is started. A user-level default cluster setting will override an organization-level default cluster. For example, to set the cluster main as the default for the Silectis organization:

alter organization silectis set default cluster main;
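
The precedence rule described above can be modeled in a few lines of Python. This is only an illustration of the stated behavior, not Magpie's implementation: the function name and parameters are hypothetical, and the sketch assumes that when the chosen default cluster is not started, no cluster is selected automatically (the article does not say whether Magpie falls back to the organization-level default in that case).

```python
from typing import Optional, Set


def resolve_default_cluster(
    user_default: Optional[str],
    org_default: Optional[str],
    started: Set[str],
) -> Optional[str]:
    """Pick the cluster a user would land on at login, or None."""
    # A user-level setting overrides the organization-level setting.
    chosen = user_default if user_default is not None else org_default
    # The default is only used if that cluster is started.
    # Assumption: there is no fallback to another cluster otherwise.
    if chosen is not None and chosen in started:
        return chosen
    return None


if __name__ == "__main__":
    print(resolve_default_cluster("dev", "main", {"dev", "main"}))
    print(resolve_default_cluster(None, "main", {"main"}))
```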