Accessing Data from Distributed File Systems

Another common type of data source used within Magpie is a distributed file system. File system sources can be object stores such as AWS S3, Azure Storage, or Google Cloud Storage, or an actual HDFS cluster. A single file system can be treated as more than one data source, depending on the path specified.

File system sources have a number of attributes that may need to be specified during their creation; see the Data Source JSON specification for more details. The following is an example of creating a data source that points to an S3 file system:

create data source {
  "name": "my_s3_source",
  "sourceType": "hdfs"
  "hdfsType": "S3a",
  "host": "my-magpie-bucket",
  "pathPrefix": "/my_data",
  "defaultPath": "/default_files",
}
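
Because the path attributes scope what a data source exposes, the same bucket can back more than one data source. For example, a second data source could point at a different path prefix in the same bucket (the values below are purely illustrative):

create data source {
  "name": "my_s3_raw_source",
  "sourceType": "hdfs",
  "hdfsType": "S3a",
  "host": "my-magpie-bucket",
  "pathPrefix": "/raw_data",
  "defaultPath": "/landing"
}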

Data within a file system can be stored in one of several formats:

  • Delimited/CSV

    • Magpie supports delimited file formats. These formats are particularly useful when loading new data. For example, a large delimited source file can be pushed to S3 or HDFS, and then a Magpie table can be created directly from it using Magpie's create table from file capability. These files may also optionally be compressed, improving performance and reducing the storage required.

  • JSON

    • Magpie can automatically parse JSON files and generate a table structure. Other than the storage format, this approach operates similarly to delimited formats.

  • Parquet

    • The Parquet format is a binary, compressed column-oriented format that is optimized for analytics. Parquet differs from other, text-based formats because it imposes a schema on the data stored within it and Magpie's representation of the data maps to that schema. By default, new objects created by Magpie are stored in the host cloud provider's default object store (S3, Google Cloud Storage, or Azure Blob Storage), using Parquet. You can find more information about the Parquet format here.

  • ORC

    • ORC is an alternative columnar format for data storage, similar to Parquet in its structure and performance characteristics. It is the default format for Apache Hive and may be readily available in environments that have existing Hive implementations. You can find more detailed information about the ORC format here.

  • Avro

    • Avro is a row-oriented data serialization framework developed by the Apache Software Foundation. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Avro is the default serializer/deserializer for Kafka Streams.

Regardless of the data format, all of the data appears as tables within Magpie. As a Magpie user, you don't necessarily need to worry about the underlying file format when saving data to a table.

Parquet often has the best performance of these formats because its columnar structure lends itself to analytical queries. Because data is stored as indexed columns, it is efficient to retrieve and aggregate a subset of columns in a wide table without having to retrieve and parse each individual row.
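
For instance, an aggregate query like the following (illustrative SQL with hypothetical table and column names) only needs to read two columns, so a columnar format such as Parquet can skip the rest of a wide table entirely:

select customer_id, sum(order_total) as total_spend
from orders
group by customer_id;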

Once a file system data source has been created, files within that data source can be turned into Magpie tables using the create table command. The following example creates a table from a CSV file, inferring the columns and data types present and using the file's header line to generate column names automatically.

create table my_table 
  from data source my_file_system_data_source 
  file "default_files/my_source_file.csv" 
  with format csv 
  with infer schema 
  with header;

It is also possible to manually specify the columns and persistence mapping at the table level.
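
As a hypothetical sketch only (the column definitions and types below are illustrative; consult the create table command reference for the exact syntax), a manually specified table might look something like this:

create table my_table (
    id long,
    name string,
    created_at timestamp
  )
  from data source my_file_system_data_source
  file "default_files/my_source_file.csv"
  with format csv
  with header;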


Security

To access this type of data source, you may need to adjust the security configuration of your cloud environment. Examples are shown below. Please reach out to a member of the Silectis team with any questions or support requests.

Amazon S3

In AWS, you will configure access by adding a bucket policy to your S3 bucket. To do so using the AWS Console:

  1. Navigate to the S3 service in the AWS Console.

  2. Select the bucket to which you want to add the bucket policy.

    1. Choose the Permissions tab.

    2. Choose the Bucket Policy section.

    3. In the Bucket policy editor field, enter the bucket policy.

The following JSON is the bucket policy that will give Magpie read and write access to your S3 bucket(s). Replace <bucket_name> with the bucket of interest and <role_arn> with the role ARN of your Magpie cluster.

Read/Write Bucket Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3BucketAllow",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<role_arn>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>"
      ]
    },
    {
      "Sid": "S3ObjectAllow",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<role_arn>"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}

Alternatively, you can add the following read-only bucket policy.

Read-Only Bucket Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3BucketAllow",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<role_arn>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>"
      ]
    },
    {
      "Sid": "S3ObjectAllow",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<role_arn>"
      },
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}

Google Cloud Storage

In Google Cloud Platform, you will need to create a service account associated with Magpie (if one has not been created for you). Access is granted by adding a member to a bucket-level policy, as described in Using Cloud IAM permissions.

You will need to grant two roles to the Magpie cluster service account to allow access:

  • roles/storage.legacyBucketReader

  • roles/storage.objectAdmin

To do so using the Google Cloud Console:

  1. Navigate to the Cloud Storage browser in the Google Cloud Console.

  2. Select the bucket to which you want to grant a member a role.

    1. Choose Edit bucket permissions.

    2. In the Add members field, enter the service account associated with the Magpie cluster.

  3. Select the two relevant roles from the Select a role drop-down menu. The roles you select appear in the pane with a short description of the permissions they grant.

    1. Click Add.
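
If you prefer to manage access as a policy file rather than through the console, the equivalent bucket-level IAM policy bindings look roughly like the following (replace <service_account_email> with the Magpie cluster service account and apply the file with a tool such as gsutil iam set):

{
  "bindings": [
    {
      "role": "roles/storage.legacyBucketReader",
      "members": [
        "serviceAccount:<service_account_email>"
      ]
    },
    {
      "role": "roles/storage.objectAdmin",
      "members": [
        "serviceAccount:<service_account_email>"
      ]
    }
  ]
}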


Azure

Azure offers multiple types of storage accounts and multiple types of storage resources, which must be interacted with in slightly different ways. Read more in Microsoft’s Introduction to Azure Storage.

At present, Magpie supports the most common storage accounts and services. See the Data Source JSON spec for details on configuring an Azure Storage data source.

  • General-Purpose v2 Storage Accounts: Basic storage account type for blobs, files, queues, and tables. Recommended for most scenarios using Azure Storage.

    • Magpie supports accessing blobs in Azure Data Lake Storage Gen2 using the abfss connector.

    • The connector forces a secure connection and requires that hierarchical namespace is enabled.

  • Azure Data Lake Storage Gen1 Service: Magpie no longer supports Azure Data Lake Storage Gen1. Use Azure Data Lake Storage Gen2 instead.

  • General-Purpose Storage Accounts: Basic storage account type for blobs. Recommended for legacy accounts or when hierarchical namespace cannot be enabled on an account.

    • Magpie supports accessing blobs in general purpose storage accounts over a secure connection using the wasbs connector.

To configure access to data in Azure Data Lake Storage Gen2 (abfss) storage containers, we recommend granting access to an Azure service principal associated with the Magpie cluster.

Note: we do not recommend using Shared Access Signature (SAS) tokens because they are harder to manage and must be supplied during routine data loading tasks.

  1. The Magpie Azure Application can be given permissions on the storage account by navigating to the following URL: https://login.microsoftonline.com/<tenant_id>/oauth2/authorize?client_id=<app_id>&response_type=code

    1. Replace tenant_id with the tenant_id associated with the storage account containing data of interest. You can find this identifier by logging into the Azure Portal and navigating to the Azure Active Directory service. The alphanumeric GUID appears in the “Overview” section.

    2. Replace app_id with the app_id (a.k.a. client_id) associated with the Magpie instance and provided to you by Silectis.

  2. This page displays the permissions requested. Click Accept. This generates a token that allows the Magpie service principal on Silectis’ infrastructure to access resources on your tenant. However, permissions must be granted to the service principal before access is actually allowed.

  3. Log in to the Azure Portal. Navigate to the Storage Accounts service and select the storage account containing relevant data.

    1. Click Access Control (IAM), then click Add > Add Role Assignment

    2. Select the following role(s) to grant to the Magpie service principal:

      1. For Azure Data Lake Gen2 storage type:

        1. Storage Blob Data Contributor - read and write access. Allows loading data from and unloading data to the storage account.

        2. Storage Blob Data Reader - read only access. Allows loading data from the storage account.

    3. Note that your Azure user will need the following permissions to grant roles; reach out to an account administrator for assistance.

      1. Microsoft.Authorization/roleAssignments/write and Microsoft.Authorization/roleAssignments/delete

  4. Repeat step 3 for any other storage accounts containing relevant data.

  5. Role assignments may take several minutes to propagate. Once complete, you should be able to execute a create data source command in Magpie to interact with data in Azure Storage.
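
As a rough illustration only, an Azure Data Lake Storage Gen2 data source follows the same pattern as the S3 example above. The exact field names and values for abfss sources are defined in the Data Source JSON spec, so treat the fields below as placeholders:

create data source {
  "name": "my_adls_source",
  "sourceType": "hdfs",
  "hdfsType": "<abfss value from the Data Source JSON spec>",
  "host": "<container_and_storage_account>",
  "pathPrefix": "/my_data"
}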

To configure access to data in legacy blob storage containers (wasbs), we recommend using Shared Access Signature (SAS) tokens, since access to legacy blob storage cannot be granted using Azure Active Directory.

  1. Follow the Azure instructions for creating a SAS token for the container in the storage account containing the relevant data.

  2. Specify the SAS token in the credentials part of the data source spec in the create data source command.
