Distributed File System Source (e.g., S3, Google Cloud Storage, Azure Storage)
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "hdfs",
  "hdfsType": "<string>",
  "host": "<string>",
  "port": <integer>,
  "pathPrefix": "<string>",
  "defaultPath": "<string>",
  "credentials": <hdfs credentials spec>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| hdfsType | String | The type of Hadoop Distributed Filesystem. Supported values: | Required | |
| host | String | On HDFS, the host name of the namenode; in cloud storage file systems, this is the bucket. On Azure, hosts take the following formats: | Required | |
| port | Integer | On HDFS, the port of the namenode; N/A for cloud storage file systems. | Optional | |
| pathPrefix | String | Restricts the data source to a specified folder/path within the file system. | Optional | / |
| defaultPath | String | The default path to use when saving data to this data source. | Optional | / |
| credentials | | Optional set of credentials to use to access the filesystem. By default, the default credentials of the cluster are used. | Optional | |
Example
Amazon S3
```json
{
  "name": "my_s3_source",
  "sourceType": "hdfs",
  "hdfsType": "S3a",
  "host": "my-magpie-bucket",
  "pathPrefix": "/my_data",
  "defaultPath": "/default_files"
}
```
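Before registering a source, the required fields from the table above can be checked with a short script. This is a minimal sketch: the `validate_hdfs_source` helper and `REQUIRED_HDFS_FIELDS` set are illustrative, not part of the Magpie API.

```python
# Required fields for an HDFS-type source, per the structure values table above.
REQUIRED_HDFS_FIELDS = {"name", "sourceType", "hdfsType", "host"}

def validate_hdfs_source(spec: dict) -> list:
    """Return a sorted list of required fields missing from the spec."""
    return sorted(REQUIRED_HDFS_FIELDS - spec.keys())

spec = {
    "name": "my_s3_source",
    "sourceType": "hdfs",
    "hdfsType": "S3a",
    "host": "my-magpie-bucket",
    "pathPrefix": "/my_data",
    "defaultPath": "/default_files",
}

print(validate_hdfs_source(spec))           # []
print(validate_hdfs_source({"name": "x"}))  # ['hdfsType', 'host', 'sourceType']
```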
JDBC Source (e.g., most relational databases)
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "jdbc",
  "url": "<string>",
  "user": "<string>",
  "password": "<string>",
  "defaultSchema": "<string>",
  "options": <map<string,string>>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| url | String | The connection string URL, tailored for your desired DBMS. JDBC URLs must start with | Required | |
| user | String | Username that Magpie uses to connect to the data source. | Required | |
| password | String | Password associated with the username on the data source. | Optional | |
| options | Object | A set of string key-value pair options to provide to the JDBC driver. | Optional | |
| defaultSchema | String | The default schema from which to execute queries or search for tables. | Optional | |
Example
JDBC SQL Server
```json
{
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:sqlserver://mydb.mycompany.com:1433/analytics",
  "user": "my_user_name",
  "options": {"trustServerCertificate": "true"},
  "password": "**********"
}
```
Due to breaking changes in the MS SQL JDBC driver, the option trustServerCertificate must be set to "true".
PostgreSQL Source
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "postgres",
  "host": "<string>",
  "port": <integer>,
  "database": "<string>",
  "user": "<string>",
  "password": "<string>",
  "defaultSchema": "<string>",
  "options": <map<string,string>>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| host | String | The hostname of the PostgreSQL data source. | Required | |
| port | Integer | The port number for connecting to the PostgreSQL data source. | Optional | |
| database | String | The database to connect to within the PostgreSQL data source. | Optional | |
| user | String | Username that Magpie uses to connect to the data source. | Required | |
| password | String | Password associated with the username on the data source. | Optional | |
| options | Object | A set of string key-value pair options to provide to the PostgreSQL driver. | Optional | |
| defaultSchema | String | The default schema from which to execute queries or search for tables. | Optional | |
Example
PostgreSQL Source
```json
{
  "name": "my_postgres_source",
  "sourceType": "postgres",
  "host": "mydb.mycompany.com",
  "port": 5432,
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
```
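The host, port, and database fields above correspond to the components of a standard PostgreSQL JDBC URL. The sketch below shows that mapping; the `build_postgres_url` helper is illustrative and assumes the standard `jdbc:postgresql://` scheme, not necessarily how Magpie assembles its connection internally.

```python
def build_postgres_url(host: str, port: int = 5432, database: str = "postgres") -> str:
    """Assemble a standard PostgreSQL JDBC connection URL from its parts."""
    return f"jdbc:postgresql://{host}:{port}/{database}"

print(build_postgres_url("mydb.mycompany.com", 5432, "analytics"))
# jdbc:postgresql://mydb.mycompany.com:5432/analytics
```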
MySQL Source
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "mysql",
  "host": "<string>",
  "port": <integer>,
  "database": "<string>",
  "user": "<string>",
  "password": "<string>",
  "options": <map<string,string>>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| host | String | The hostname of the MySQL data source. | Required | |
| port | Integer | The port number for connecting to the MySQL data source. | Optional | |
| database | String | The default database to connect to within the MySQL data source. | Optional | |
| user | String | Username that Magpie uses to connect to the data source. | Required | |
| password | String | Password associated with the username on the data source. | Optional | |
| options | Object | A set of string key-value pair options to provide to the MySQL driver. | Optional | |
Example
MySQL Source
```json
{
  "name": "my_mysql_source",
  "sourceType": "mysql",
  "host": "mydb.mycompany.com",
  "port": 3306,
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
```
Amazon Redshift Source
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "redshift",
  "endpoint": "<string>",
  "user": "<string>",
  "password": "<string>",
  "database": "<string>",
  "options": <map<string,string>>,
  "defaultSchema": "<string>",
  "iamRole": "<string>",
  "defaultStagingSourceName": "<string>"
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| endpoint | String | The endpoint of the Amazon Redshift cluster. Redshift sources use the host name with port for the Redshift cluster as the endpoint. | Required | |
| user | String | Username that Magpie uses to connect to the cluster. | Required | |
| password | String | Password associated with the username on the cluster. | Optional | |
| database | String | The database that you created for your cluster. | Required | |
| options | Object | A set of JDBC options to provide to the Redshift driver. | Optional | |
| defaultSchema | String | The default schema from which to execute queries and search for tables. By default, a database in Redshift has a single schema, with the name PUBLIC. | Optional | |
| iamRole | String | A reference to an AWS Identity and Access Management (IAM) role that your cluster will use when performing COPY and UNLOAD operations for | Optional | |
| defaultStagingSourceName | String | The default data source to use as a staging area for | Optional | |
Example
Amazon Redshift
```json
{
  "name": "my_redshift_source",
  "sourceType": "redshift",
  "endpoint": "magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com:5439",
  "database": "analytics",
  "user": "my_user_name",
  "password": "**********"
}
```
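Unlike the PostgreSQL and MySQL sources, Redshift takes a single endpoint string combining host and port. Splitting it back into its components can be sketched as follows; `split_endpoint` is an illustrative helper, not a Magpie API.

```python
def split_endpoint(endpoint: str) -> tuple:
    """Split a 'host:port' Redshift endpoint string into (host, port)."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)

host, port = split_endpoint(
    "magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com:5439"
)
print(host)  # magpie-test.fdas53lkj234y.us-east-1.redshift.amazonaws.com
print(port)  # 5439
```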
Snowflake Source
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "snowflake",
  "hostname": "<string>",
  "loginParadigm": "<string>",
  "user": "<string>",
  "password": "<string>",
  "database": "<string>",
  "defaultSchema": "<string>",
  "warehouse": "<string>",
  "role": "<string>",
  "options": <map<string,string>>,
  "isCaseSensitive": <boolean>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| hostname | String | The hostname for your Snowflake account in the format | Required | |
| loginParadigm | String | The login paradigm to use for the connection: password or privateKey. | Required | |
| user | String | Username that Magpie uses to connect to Snowflake. | Required | |
| password | String | Password associated with the username used to connect to Snowflake if loginParadigm is set to password. | Optional | |
| privateKey | String | Private key associated with the username used to connect to Snowflake if loginParadigm is set to privateKey. Since JSON specifications do not allow multiline strings, this field should not be set during the creation or alteration of a data source when passing the JSON object, but by using the | Optional | |
| database | String | The Snowflake database that contains the underlying data to be accessed by Magpie. | Required | |
| defaultSchema | String | The Snowflake schema containing the underlying data to be accessed by Magpie. By default, the user’s default schema in Snowflake will be used if one exists. If not, no schema will be set, and all table mappings in Magpie must contain a schema name. | Optional | |
| warehouse | String | The default virtual warehouse to use when a query requires compute at Snowflake, such as when a complex query is able to take advantage of query pushdown (enabled by default). By default, the default warehouse assigned to the user in Snowflake will be used. | Optional | |
| role | String | The default security role for Magpie to assume when connecting to Snowflake. By default, the default role set on the user account in Snowflake will be used. | Optional | |
| options | Object | A key-value map of additional options that can be passed to the Snowflake Spark Connector. | Optional | |
| isCaseSensitive | Boolean | A boolean flag indicating if schema and table names in Snowflake are case-sensitive. | Optional | false |
Example
Snowflake
```json
{
  "name": "my_snowflake_source",
  "sourceType": "snowflake",
  "hostname": "xy12345.us-east-1.snowflakecomputing.com",
  "loginParadigm": "password",
  "user": "my_username",
  "password": "********",
  "database": "analytics",
  "defaultSchema": "public"
}
```
```json
{
  "name": "my_snowflake_source",
  "sourceType": "snowflake",
  "hostname": "xy12345.us-east-1.snowflakecomputing.com",
  "loginParadigm": "privateKey",
  "user": "my_username",
  "database": "analytics",
  "defaultSchema": "public"
}
```
Google BigQuery Source
Specification
```json
{
  "name": "<string>",
  "description": "<string>",
  "sourceType": "bigquery",
  "projectId": "<string>",
  "billedProjectId": "<string>",
  "defaultDatasetId": "<string>",
  "defaultStagingSourceName": "<string>",
  "credentials": "<string>",
  "options": <map<string,string>>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| projectId | String | The Google Cloud Project ID of the project that contains the BigQuery datasets to access. | Optional | The project of the service account used by this data source |
| billedProjectId | String | The Google Cloud Project ID of the project to be billed for access charges. | Optional | |
| defaultDatasetId | String | The ID of the default dataset from which to execute queries and search for tables. | Optional | |
| defaultStagingSourceName | String | The default data source to use as a staging area for BigQuery tables mapped to this data source. If specified, must be an existing Google Cloud Storage filesystem data source. | Optional | |
| credentials | String | A base64-encoded string of service account JSON credentials to use for this data source. | Optional | The Magpie cluster service account credentials |
| options | Map | A key-value map of extra options that can be applied to the BigQuery Spark connector when reading and writing data to BigQuery. | Optional | |
Example
Google BigQuery
```json
{
  "name": "my_bigquery_source",
  "sourceType": "bigquery",
  "projectId": "sample-bq-project",
  "billedProjectId": "sample-bq-project",
  "options": { "materializationDataset": "mat-ds" }
}
```
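The credentials field expects a base64-encoded string of service account JSON. One way to produce that string is sketched below; the inline `service_account` dict is a placeholder for the key file you would download from Google Cloud.

```python
import base64
import json

# Placeholder service account key; in practice, read the contents of the
# JSON key file downloaded from Google Cloud instead.
service_account = {"type": "service_account", "project_id": "sample-bq-project"}
raw = json.dumps(service_account).encode("utf-8")

# Base64-encode the raw JSON bytes; the resulting string goes in "credentials".
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)
```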
Kafka Source
Specification
```json
{
  "name": "<string>",
  "sourceType": "Kafka",
  "bootstrapServers": [
    { "host": "<string>", "port": <integer> }
  ]
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The type of data source. | Required | |
| bootstrapServers | Array | An array containing the hosts and ports of the bootstrap servers for your Kafka cluster. | Required | |
Example
Kafka
```json
{
  "name": "kafka_source",
  "sourceType": "Kafka",
  "bootstrapServers": [
    { "host": "my.kafka.host.com", "port": 9092 }
  ]
}
```
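Kafka clients conventionally take bootstrap servers as a single comma-separated host:port string (the `bootstrap.servers` client property). Converting the bootstrapServers array above into that form can be sketched as follows; `to_bootstrap_string` is an illustrative helper, not a Magpie API.

```python
def to_bootstrap_string(servers: list) -> str:
    """Join [{'host': ..., 'port': ...}] entries into 'host:port,host:port'."""
    return ",".join(f"{s['host']}:{s['port']}" for s in servers)

print(to_bootstrap_string([{"host": "my.kafka.host.com", "port": 9092}]))
# my.kafka.host.com:9092
```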
Generic Spark Source
Specification
```json
{
  "name": "<string>",
  "sourceType": "<string>",
  "options": <map<string, string>>,
  "usesSchema": <boolean>,
  "supportsTables": <boolean>,
  "supportsStreams": <boolean>
}
```
Structure Values
| Field Name | Type | Description | Required | Default |
|---|---|---|---|---|
| name | String | Data source name. Must be unique within a repository. | Required | |
| description | String | Expanded, user-generated description of the data source. | Optional | |
| sourceType | String | The Spark connector name to use for this data source. | Required | |
| options | Object | An object of string key-value pairs to supply to Spark when using this connector. | Optional | |
| usesSchema | Boolean | Whether this data source uses the schema of the table when reading from the underlying data source. If the data source always supplies the schema, this should be set to | Required | |
| supportsTables | Boolean | Whether this data source supports batch reads which feed tables in Magpie. A data source must support either tables, streams, or both. | Required | |
| supportsStreams | Boolean | Whether this data source supports streaming reads which feed streams in Magpie. A data source must support either tables, streams, or both. | Required | |
Example
ElasticSearch
```json
{
  "name": "elastic_test",
  "sourceType": "org.elasticsearch.spark.sql",
  "supportsTables": true,
  "options": {
    "es.nodes": "vpc-magpie-test-afdslk5l23klj2323.us-east-1.es.amazonaws.com:80",
    "es.nodes.wan.only": "true"
  }
}
```
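The constraint that a generic source must support tables, streams, or both can be checked before submitting the spec. A minimal sketch; `validate_generic_source` and its field list are illustrative, not part of Magpie.

```python
REQUIRED_GENERIC_FIELDS = ("name", "sourceType", "usesSchema",
                           "supportsTables", "supportsStreams")

def validate_generic_source(spec: dict) -> list:
    """Return a list of problems found in a generic Spark source spec."""
    problems = [f"missing required field: {f}"
                for f in REQUIRED_GENERIC_FIELDS if f not in spec]
    # Per the table above, a source must support tables, streams, or both.
    if not (spec.get("supportsTables") or spec.get("supportsStreams")):
        problems.append("must support tables, streams, or both")
    return problems

spec = {
    "name": "elastic_test",
    "sourceType": "org.elasticsearch.spark.sql",
    "usesSchema": True,
    "supportsTables": True,
    "supportsStreams": False,
}
print(validate_generic_source(spec))  # []
```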