Data Source

Distributed File System Source (e.g., S3, Google Cloud Storage, Azure Storage)

Specification

{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "hdfs",
  "hdfsType": "<string>",
  "host": "<string>",
  "port": <integer>,
  "pathPrefix": "<string>",
  "defaultPath": "<string>",
  "credentials": <hdfs credentials spec>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; hdfs for file system sources.

• hdfsType (String, required): The type of Hadoop Distributed Filesystem. Supported values:
    • hdfs - Hadoop cluster
    • s3a - AWS S3
    • gs - Google Cloud Storage
    • abfss - Azure Data Lake Storage Gen2
    • adl - Azure Data Lake Storage Gen1
    • wasbs - Azure Blob Storage

• host (String, required): On HDFS, the host name of the namenode; in cloud storage file systems, this is the bucket. On Azure, hosts take the following formats:
    • Data Lake Storage Gen2: <container>@<account>.dfs.core.windows.net
    • Data Lake Storage Gen1: <store>.azuredatalakestore.net
    • Blob Storage: <container>@<account>.blob.core.windows.net

• port (Integer, optional): On HDFS, the port of the namenode; not applicable for cloud storage file systems.

• pathPrefix (String, optional; default: /): Restricts the data source to a specified folder/path within the file system.

• defaultPath (String, optional; default: /): The default path to use when saving data to this data source.

• credentials (HDFS Credentials, optional): Set of credentials to use to access the file system. By default, the default credentials of the cluster are used.

Example

Amazon S3

{
  "name": "my_s3_source",
  "sourceType": "hdfs",
  "hdfsType": "s3a",
  "host": "my-magpie-bucket",
  "pathPrefix": "/my_data",
  "defaultPath": "/default_files"
}
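The same structure covers the other cloud stores; only hdfsType and the host format change. A sketch for Azure Data Lake Storage Gen2, assuming a hypothetical container and account name:

```json
{
  "name": "my_adls_source",
  "sourceType": "hdfs",
  "hdfsType": "abfss",
  "host": "my-container@myaccount.dfs.core.windows.net",
  "pathPrefix": "/my_data"
}
```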



JDBC Source (e.g., most relational databases)

Specification

{
  "name": "<string>",
  "description": "<string>"
  "sourceType": "jdbc",
  "url": "<string>",
  "user": "<string>",
  "password": "<string>",
  "defaultSchema": "<string>",
  "options": <map<string,string>>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; jdbc for JDBC sources.

• url (String, required): The connection string URL, tailored for your desired DBMS. JDBC URLs must start with jdbc:<subprotocol>:.

• user (String, required): Username that Magpie uses to connect to the data source.

• password (String, optional): Password associated with the username on the data source.

• options (Object, optional): A set of string key-value pair options to provide to the JDBC driver.

• defaultSchema (String, optional): The default schema from which to execute queries or search for tables.


Example

JDBC PostgreSQL

{
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:sqlserver://mydb.mycompany.com:1433/analytics",
  "user": "my_user_name",
  "password": "**********"
}
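Driver-specific settings pass through the options map verbatim. As a sketch, assuming the standard PostgreSQL JDBC driver's ssl and sslmode connection properties (the property names belong to the driver, not to Magpie):

```json
{
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:postgresql://mydb.mycompany.com:5432/analytics",
  "user": "my_user_name",
  "password": "**********",
  "options": {
    "ssl": "true",
    "sslmode": "require"
  }
}
```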



PostgreSQL Source

Specification

{
  "name": "<string>",
  "description": "<string>"
  "sourceType": "postgres",
  "host": "<string>",
  "port": <integer>,
  "user": "<string>",
  "password": "<string>",
  "defaultSchema": "<string>",
  "options": <map<string,string>>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; postgres for PostgreSQL sources.

• host (String, required): The hostname of the PostgreSQL data source.

• port (Integer, optional): The port number for connecting to the PostgreSQL data source.

• database (String, optional): The database to connect to within the PostgreSQL data source.

• user (String, required): Username that Magpie uses to connect to the data source.

• password (String, optional): Password associated with the username on the data source.

• options (Object, optional): A set of string key-value pair options to provide to the PostgreSQL driver.

• defaultSchema (String, optional): The default schema from which to execute queries or search for tables.


Example

PostgreSQL Source

{
  "name": "my_postgres_source",
  "sourceType": "postgres",
  "host": "mydb.mycompany.com",
  "port": 5432,
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
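When most mapped tables live in one schema, setting defaultSchema avoids qualifying every table mapping. A sketch, with "reporting" as an assumed schema name:

```json
{
  "name": "my_postgres_source",
  "sourceType": "postgres",
  "host": "mydb.mycompany.com",
  "port": 5432,
  "database": "analytics",
  "defaultSchema": "reporting",
  "user": "my_user_name",
  "password": "**********"
}
```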

Amazon Redshift Source

Specification

{
  "name": "<string>",
  "description": "<string>"
  "sourceType": "redshift",
  "endpoint": "<string>",
  "user": "<string>",
  "password": "<string>",
  "database": "<string>",
  "options": <map<string,string>>,
  "defaultSchema": "<string>",
  "iamRole": "<String>",
  "defaultStagingSourceName": "<string>"
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; redshift for Redshift data sources.

• endpoint (String, required): The endpoint of the Amazon Redshift cluster; Redshift sources use the host name with port for the Redshift cluster as the endpoint.

• user (String, required): Username that Magpie uses to connect to the cluster.

• password (String, optional): Password associated with the username on the cluster.

• database (String, required): The database that you created for your cluster.

• options (Object, optional): A set of JDBC options to provide to the Redshift driver.

• defaultSchema (String, optional): The default schema from which to execute queries and search for tables. By default, a database in Redshift has a single schema, with the name PUBLIC.

• iamRole (String, optional): A reference to an AWS Identity and Access Management (IAM) role that your cluster will use when performing COPY and UNLOAD operations for RedshiftBulk tables mapped to this data source. See AWS documentation on creating an IAM role for Redshift.

• defaultStagingSourceName (String, optional): The default data source to use as a staging area for RedshiftBulk tables mapped to this data source. If specified, must be an existing S3 file system data source.


Example

Amazon Redshift

{
  "name": "my_redshift_source",
  "sourceType": "redshift",
  "endpoint": "magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com:5439",
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
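For RedshiftBulk tables, iamRole and defaultStagingSourceName can be added; the staging source must be an existing S3 file system data source. A sketch with a hypothetical IAM role ARN and an assumed staging source name:

```json
{
  "name": "my_redshift_source",
  "sourceType": "redshift",
  "endpoint": "magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com:5439",
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********",
  "iamRole": "arn:aws:iam::123456789012:role/my-redshift-copy-role",
  "defaultStagingSourceName": "my_s3_source"
}
```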

Snowflake Source

Specification

{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "snowflake",
  "hostname": "<string>",
  "user": "<string>",
  "password": "<string>",
  "database": "<string>",
  "defaultSchema": "<string>",
  "warehouse": "<string>",
  "role": "<string>",
  "options": <map<string,string>>,
  "isCaseSensitive": <boolean>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; snowflake for Snowflake data sources.

• hostname (String, required): The hostname for your Snowflake account, in the format <account_name>.snowflakecomputing.com; it may contain additional segments that identify the region and cloud provider for your Snowflake account.

• user (String, required): Username that Magpie uses to connect to Snowflake.

• password (String, optional): Password associated with the username used to connect to Snowflake.

• database (String, required): The Snowflake database that contains the underlying data to be accessed by Magpie.

• defaultSchema (String, optional): The Snowflake schema containing the underlying data to be accessed by Magpie. By default, the user's default schema in Snowflake is used if one exists; otherwise, no schema is set, and all table mappings in Magpie must contain a schema name.

• warehouse (String, optional): The default virtual warehouse to use when a query requires compute at Snowflake, such as when a complex query can take advantage of query pushdown (enabled by default). By default, the default warehouse assigned to the user in Snowflake is used.

• role (String, optional): The default security role for Magpie to assume when connecting to Snowflake. By default, the default role set on the user account in Snowflake is used.

• options (Object, optional): A key-value map of additional options to pass to the Snowflake Spark Connector.

• isCaseSensitive (Boolean, optional; default: false): Whether schema and table names in Snowflake are case-sensitive.

Example

Snowflake

{
  "name": "my_snowflake_source",
  "sourceType": "snowflake",
  "hostname": "xy12345.us-east-1.snowflakecomputing.com",
  "user": "my_username",
  "password": "********",
  "database": "analytics",
  "defaultSchema": "public"
}
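To pin compute and permissions rather than rely on the user's Snowflake defaults, warehouse and role can be set explicitly. A sketch, with hypothetical warehouse and role names:

```json
{
  "name": "my_snowflake_source",
  "sourceType": "snowflake",
  "hostname": "xy12345.us-east-1.snowflakecomputing.com",
  "user": "my_username",
  "password": "********",
  "database": "analytics",
  "defaultSchema": "public",
  "warehouse": "ANALYTICS_WH",
  "role": "ANALYST"
}
```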

Google BigQuery Source

Specification

{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "bigquery",
  "projectId": "<string>",
  "billedProjectId": "<string>",
  "defaultDatasetId": "<string>",
  "credentials": "<string>"
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; bigquery for BigQuery data sources.

• projectId (String, optional; default: the project of the service account used by this data source): The Google Cloud Project ID of the project that contains the BigQuery datasets to access.

• billedProjectId (String, optional; default: projectId): The Google Cloud Project ID of the project to be billed for access charges.

• defaultDatasetId (String, optional): The ID of the default dataset from which to execute queries and search for tables.

• credentials (String, optional; default: the Magpie cluster service account credentials): A base64-encoded string of service account JSON credentials to use for this data source.

Example

Google BigQuery

{
  "name": "my_bigquery_source",
  "sourceType": "bigquery",
  "projectId": "sample-bq-project",
  "billedProjectId": "sample-bq-project"
}
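To use credentials other than the Magpie cluster's service account, supply them as a base64-encoded service account JSON string. A sketch with placeholder credentials and an assumed dataset ID:

```json
{
  "name": "my_bigquery_source",
  "sourceType": "bigquery",
  "projectId": "sample-bq-project",
  "defaultDatasetId": "analytics",
  "credentials": "<base64-encoded service account JSON>"
}
```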



Kafka Source

Specification

{
  "name": "<string>",
  "sourceType": "Kafka",
  "bootstrapServers": [{
    "host": "<string>",
    "port": <integer>
  }]
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The type of data source; kafka for Kafka data sources.

• bootstrapServers (Array, required): An array containing the hosts and ports of the bootstrap servers for your Kafka cluster.


Example

Kafka

{
  "name": "kafka_source",
  "sourceType": "Kafka",
  "bootstrapServers": [{
    "host": "my.kafka.host.com",
    "port": 9092
  }]
}
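Because bootstrapServers is an array, multiple brokers can be listed for redundancy. A sketch with hypothetical broker hostnames:

```json
{
  "name": "kafka_source",
  "sourceType": "kafka",
  "bootstrapServers": [
    { "host": "broker1.kafka.host.com", "port": 9092 },
    { "host": "broker2.kafka.host.com", "port": 9092 }
  ]
}
```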



Generic Spark Source

Specification

{
  "name": "<string>",
  "sourceType": "<string>",
  "options": <map<string, string>>,
  "usesSchema" <boolean>,
  "supportsTables": <boolean>,
  "supportsStreams": <boolean>
}

Structure Values

• name (String, required): Data source name. Must be unique within a repository.

• description (String, optional): Expanded, user-generated description of the data source.

• sourceType (String, required): The format of the underlying Spark connector.

• options (Object, optional): An object of string key-value pairs to supply to Spark when using this connector.

• usesSchema (Boolean, required; default: true): Whether this data source uses the schema of the table when reading from the underlying data source. If the data source always supplies the schema, this should be set to false.

• supportsTables (Boolean, required; default: false): Whether this data source supports batch reads, which feed tables in Magpie. A data source must support either tables, streams, or both.

• supportsStreams (Boolean, required; default: false): Whether this data source supports streaming reads, which feed streams in Magpie. A data source must support either tables, streams, or both.

Example

ElasticSearch

{
  "name": "elastic_test",
  "sourceType": "org.elasticsearch.spark.sql",
  "supportsTables": true,
  "options": {
    "es.nodes" : "vpc-magpie-test-afdslk5l23klj2323.us-east-1.es.amazonaws.com:80",
    "es.nodes.wan.only" : "true"
  }
}
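A streaming-capable connector is declared the same way, with supportsStreams set. A sketch using a hypothetical connector format (the sourceType string and option key are illustrative, not a real connector):

```json
{
  "name": "my_stream_source",
  "sourceType": "com.example.spark.streaming",
  "supportsStreams": true,
  "usesSchema": false,
  "options": {
    "endpoint": "stream.example.com:443"
  }
}
```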









