Data Source

Distributed File System Source (e.g., S3, Google Cloud Storage, Azure Storage)

Specification

{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "hdfs",
  "hdfsType": "<string>",
  "host": "<string>",
  "port": <integer>,
  "pathPrefix": "<string>",
  "defaultPath": "<string>",
  "credentials": <hdfs credentials spec>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; hdfs for file system sources.

• hdfsType (String, required): The type of Hadoop Distributed Filesystem. Supported values:
    • hdfs - Hadoop cluster
    • s3a - AWS S3
    • gs - Google Cloud Storage
    • abfss - Azure Data Lake Storage Gen2
    • adl - Azure Data Lake Storage Gen1
    • wasbs - Azure Blob Storage

• host (String, required): On HDFS, the host name of the namenode; in cloud storage file systems, this is the bucket. On Azure, hosts take the following formats:
    • Data Lake Storage Gen2: <container>@<account>.dfs.core.windows.net
    • Data Lake Storage Gen1: <store>.azuredatalakestore.net
    • Blob Storage: <container>@<account>.blob.core.windows.net

• port (Integer, optional): On HDFS, the port of the namenode; not applicable for cloud storage file systems.

• pathPrefix (String, optional; default: /): Restricts the data source to a specified folder/path within the file system.

• defaultPath (String, optional; default: /): The default path to use when saving data to this data source.

• credentials (HDFS Credentials, optional): Set of credentials to use to access the file system. By default, the default credentials of the cluster are used.

Example

Amazon S3

{
  "name": "my_s3_source",
  "sourceType": "hdfs",
  "hdfsType": "s3a",
  "host": "my-magpie-bucket",
  "pathPrefix": "/my_data",
  "defaultPath": "/default_files"
}
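The same structure covers the other cloud stores; only hdfsType and the host format change. A sketch for Azure Data Lake Storage Gen2, assuming a hypothetical container and account name:

```json
{
  "name": "my_adls_source",
  "sourceType": "hdfs",
  "hdfsType": "abfss",
  "host": "my-container@myaccount.dfs.core.windows.net",
  "pathPrefix": "/my_data"
}
```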



JDBC Source (e.g., most relational databases)

Specification

{
  "name": "<string>",
  "description": "<string>"
  "sourceType": "jdbc",
  "url": "<string>",
  "user": "<string>",
  "password": "<string>",
  "defaultSchema": "<string>",
  "options": <map<string,string>>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; jdbc for JDBC sources.

• url (String, required): The connection string URL, tailored for your desired DBMS. JDBC URLs must start with jdbc:<subprotocol>:.

• user (String, required): Username that Magpie uses to connect to the data source.

• password (String, optional): Password associated with the username on the data source.

• options (Object, optional): A set of string key-value pair options to provide to the JDBC driver.

• defaultSchema (String, optional): The default schema from which to execute queries or search for tables.


Example

JDBC PostgreSQL

{
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:sqlserver://mydb.mycompany.com:1433/analytics",
  "user": "my_user_name",
  "password": "**********"
}
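Driver-specific settings pass through the options map verbatim. As a sketch, assuming the standard PostgreSQL JDBC driver's ssl and sslmode connection properties (the property names belong to the driver, not to Magpie):

```json
{
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:postgresql://mydb.mycompany.com:5432/analytics",
  "user": "my_user_name",
  "password": "**********",
  "options": {
    "ssl": "true",
    "sslmode": "require"
  }
}
```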



PostgreSQL Source

Specification

{
  "name": "<string>",
  "description": "<string>"
  "sourceType": "postgres",
  "host": "<string>",
  "port": <integer>,
  "user": "<string>",
  "password": "<string>",
  "defaultSchema": "<string>",
  "options": <map<string,string>>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; postgres for PostgreSQL sources.

• host (String, required): The hostname of the PostgreSQL data source.

• port (Integer, optional): The port number for connecting to the PostgreSQL data source.

• database (String, optional): The database to connect to within the PostgreSQL data source.

• user (String, required): Username that Magpie uses to connect to the data source.

• password (String, optional): Password associated with the username on the data source.

• options (Object, optional): A set of string key-value pair options to provide to the PostgreSQL driver.

• defaultSchema (String, optional): The default schema from which to execute queries or search for tables.


Example

PostgreSQL Source

{
  "name": "my_postgres_source",
  "sourceType": "postgres",
  "host": "mydb.mycompany.com",
  "port": 5432,
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
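When most mapped tables live in one schema, setting defaultSchema avoids qualifying every table mapping. A sketch, with "reporting" as an assumed schema name:

```json
{
  "name": "my_postgres_source",
  "sourceType": "postgres",
  "host": "mydb.mycompany.com",
  "port": 5432,
  "database": "analytics",
  "defaultSchema": "reporting",
  "user": "my_user_name",
  "password": "**********"
}
```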

Amazon Redshift Source

Specification

{
  "name": "<string>",
  "description": "<string>"
  "sourceType": "redshift",
  "endpoint": "<string>",
  "user": "<string>",
  "password": "<string>",
  "database": "<string>",
  "options": <map<string,string>>,
  "defaultSchema": "<string>",
  "iamRole": "<String>",
  "defaultStagingSourceName": "<string>"
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; redshift for Redshift data sources.

• endpoint (String, required): The endpoint of the Amazon Redshift cluster; Redshift sources use the host name with port for the Redshift cluster as the endpoint.

• user (String, required): Username that Magpie uses to connect to the cluster.

• password (String, optional): Password associated with the username on the cluster.

• database (String, required): The database that you created for your cluster.

• options (Object, optional): A set of JDBC options to provide to the Redshift driver.

• defaultSchema (String, optional): The default schema from which to execute queries and search for tables. By default, a database in Redshift has a single schema, with the name PUBLIC.

• iamRole (String, optional): A reference to an AWS Identity and Access Management (IAM) role that your cluster will use when performing COPY and UNLOAD operations for RedshiftBulk tables mapped to this data source. See AWS documentation on creating an IAM role for Redshift.

• defaultStagingSourceName (String, optional): The default data source to use as a staging area for RedshiftBulk tables mapped to this data source. If specified, must be an existing S3 file system data source.


Example

Amazon Redshift

{
  "name": "my_redshift_source",
  "sourceType": "redshift",
  "endpoint": "magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com:5439",
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
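For RedshiftBulk tables, iamRole and defaultStagingSourceName can be added; the staging source must be an existing S3 file system data source. A sketch with a hypothetical IAM role ARN and an assumed staging source name:

```json
{
  "name": "my_redshift_source",
  "sourceType": "redshift",
  "endpoint": "magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com:5439",
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********",
  "iamRole": "arn:aws:iam::123456789012:role/my-redshift-copy-role",
  "defaultStagingSourceName": "my_s3_source"
}
```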

Snowflake Source

Specification

{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "snowflake",
  "hostname": "<string>",
  "user": "<string>",
  "password": "<string>",
  "database": "<string>",
  "defaultSchema": "<string>",
  "warehouse": "<string>",
  "role": "<string>",
  "options": <map<string,string>>,
  "isCaseSensitive": <boolean>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; snowflake for Snowflake data sources.

• hostname (String, required): The hostname for your Snowflake account, in the format <account_name>.snowflakecomputing.com; it may contain additional segments that identify the region and cloud provider for your Snowflake account.

• user (String, required): Username that Magpie uses to connect to Snowflake.

• password (String, optional): Password associated with the username used to connect to Snowflake.

• database (String, required): The Snowflake database that contains the underlying data to be accessed by Magpie.

• defaultSchema (String, optional): The Snowflake schema containing the underlying data to be accessed by Magpie. By default, the user's default schema in Snowflake is used if one exists; otherwise, no schema is set, and all table mappings in Magpie must contain a schema name.

• warehouse (String, optional): The default virtual warehouse to use when a query requires compute at Snowflake, such as when a complex query can take advantage of query pushdown (enabled by default). By default, the default warehouse assigned to the user in Snowflake is used.

• role (String, optional): The default security role for Magpie to assume when connecting to Snowflake. By default, the default role set on the user account in Snowflake is used.

• options (Object, optional): A key-value map of additional options to pass to the Snowflake Spark Connector.

• isCaseSensitive (Boolean, optional; default: false): Whether schema and table names in Snowflake are case-sensitive.

Example

Snowflake

{
  "name": "my_snowflake_source",
  "sourceType": "snowflake",
  "hostname": "xy12345.us-east-1.snowflakecomputing.com",
  "user": "my_username",
  "password": "********",
  "database": "analytics",
  "defaultSchema": "public"
}
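To pin compute and permissions rather than rely on the user's Snowflake defaults, warehouse and role can be set explicitly. A sketch, with hypothetical warehouse and role names:

```json
{
  "name": "my_snowflake_source",
  "sourceType": "snowflake",
  "hostname": "xy12345.us-east-1.snowflakecomputing.com",
  "user": "my_username",
  "password": "********",
  "database": "analytics",
  "defaultSchema": "public",
  "warehouse": "ANALYTICS_WH",
  "role": "ANALYST"
}
```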

Google BigQuery Source

Specification

{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "bigquery",
  "projectId": "<string>",
  "billedProjectId": "<string>",
  "defaultDatasetId": "<string>",
  "credentials": "<string>"
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; bigquery for BigQuery data sources.

• projectId (String, optional; default: the project of the service account used by this data source): The Google Cloud Project ID of the project that contains the BigQuery datasets to access.

• billedProjectId (String, optional; default: projectId): The Google Cloud Project ID of the project to be billed for access charges.

• defaultDatasetId (String, optional): The ID of the default dataset from which to execute queries and search for tables.

• credentials (String, optional; default: the Magpie cluster service account credentials): A base64-encoded string of service account JSON credentials to use for this data source.

Example

Google BigQuery

{
  "name": "my_bigquery_source",
  "sourceType": "bigquery",
  "projectId": "sample-bq-project",
  "billedProjectId": "sample-bq-project"
}
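To use credentials other than the Magpie cluster's service account, supply them as a base64-encoded service account JSON string. A sketch with placeholder credentials and an assumed dataset ID:

```json
{
  "name": "my_bigquery_source",
  "sourceType": "bigquery",
  "projectId": "sample-bq-project",
  "defaultDatasetId": "analytics",
  "credentials": "<base64-encoded service account JSON>"
}
```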



Kafka Source

Specification

{
  "name": "<string>",
  "sourceType": "Kafka",
  "bootstrapServers": [{
    "host": "<string>",
    "port": <integer>
  }]
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; kafka for Kafka data sources.

• bootstrapServers (Array, required): An array containing the hosts and ports of the bootstrap servers for your Kafka cluster.


Example

Kafka

{
  "name": "kafka_source",
  "sourceType": "Kafka",
  "bootstrapServers": [{
    "host": "my.kafka.host.com",
    "port": 9092
  }]
}
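Because bootstrapServers is an array, multiple brokers can be listed for redundancy. A sketch with hypothetical broker hostnames:

```json
{
  "name": "kafka_source",
  "sourceType": "kafka",
  "bootstrapServers": [
    { "host": "broker1.kafka.host.com", "port": 9092 },
    { "host": "broker2.kafka.host.com", "port": 9092 }
  ]
}
```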



Generic Spark Source

Specification

{
  "name": "<string>",
  "sourceType": "<string>",
  "options": <map<string, string>>,
  "usesSchema" <boolean>,
  "supportsTables": <boolean>,
  "supportsStreams": <boolean>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The format of the underlying Spark connector.

• options (Object, optional): An object of string key-value pairs to supply to Spark when using this connector.

• usesSchema (Boolean, required; default: true): Whether this data source uses the schema of the table when reading from the underlying data source. If the data source always supplies the schema, this should be set to false.

• supportsTables (Boolean, required; default: false): Whether this data source supports batch reads, which feed tables in Magpie. A data source must support either tables, streams, or both.

• supportsStreams (Boolean, required; default: false): Whether this data source supports streaming reads, which feed streams in Magpie. A data source must support either tables, streams, or both.

Example

ElasticSearch

{
  "name": "elastic_test",
  "sourceType": "org.elasticsearch.spark.sql",
  "supportsTables": true,
  "options": {
    "es.nodes" : "vpc-magpie-test-afdslk5l23klj2323.us-east-1.es.amazonaws.com:80",
    "es.nodes.wan.only" : "true"
  }
}
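A streaming-capable connector is declared the same way, with supportsStreams set. A sketch using a hypothetical connector format (the sourceType string and option key are illustrative, not a real connector):

```json
{
  "name": "my_stream_source",
  "sourceType": "com.example.spark.streaming",
  "supportsStreams": true,
  "usesSchema": false,
  "options": {
    "endpoint": "stream.example.com:443"
  }
}
```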









