Overview
As described in the overview of Magpie's metadata, all tables within Magpie are backed by an underlying data source. There are several types of data sources:
Relational Database Sources: These are typically relational databases or data warehouses. By creating a database or data warehouse source, you can use Magpie to access the data within the source and either import it into Magpie for further analysis or execute queries directly against the data where it lives. Magpie provides multiple connectors for these sources (an example definition follows the list):
- JDBC Sources - relational databases like Microsoft SQL Server or MySQL
- PostgreSQL - connection to PostgreSQL relational databases
- Amazon Redshift - cloud-based data warehouse
- Google BigQuery - cloud-based data warehouse
- Snowflake - cloud-based data warehouse
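For example, a PostgreSQL source might be defined as follows. This is a minimal sketch: the "sourceType" value and connection attributes are assumptions modeled on the JDBC example later on this page, so consult the Data Source JSON specification for the exact fields each connector accepts.

```
create data source {
  "name": "my_pg_source",
  "sourceType": "postgresql",
  "url": "jdbc:postgresql://pg.mycompany.com:5432/analytics",
  "user": "my_user_name",
  "password": "**********"
}
```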
Distributed File Systems: Magpie supports access to files stored in HDFS (Hadoop Distributed File System), Amazon S3 (Simple Storage Service), Google Cloud Storage, and multiple types of Azure Storage. Files stored within a file system can be accessed as tables within Magpie as long as they have a valid persistence mapping. While several of the technologies listed are object stores, they can behave much like a file system and are treated as one by Magpie.
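As a sketch of what a file system source might look like, the definition below uses an assumed "s3" source type and assumed bucket and credential attribute names; the actual fields are documented in the Data Source JSON specification.

```
create data source {
  "name": "my_s3_source",
  "sourceType": "s3",
  "bucket": "my-company-data",
  "awsAccessKeyId": "AKIAEXAMPLEKEY",
  "awsSecretAccessKey": "**********"
}
```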
Apache Kafka Clusters: Apache Kafka is a distributed streaming platform that allows users to store streams of records in a fault-tolerant, durable way. Topics within Kafka clusters can be accessed as tables within Magpie.
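A Kafka source definition might look like the sketch below; the "kafka" source type and the "bootstrapServers" attribute name are assumptions for illustration.

```
create data source {
  "name": "my_kafka_source",
  "sourceType": "kafka",
  "bootstrapServers": "broker1.mycompany.com:9092,broker2.mycompany.com:9092"
}
```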
Generic: Magpie also supports access to any type of data source supported by the underlying Spark environment using a generic data source type. This provides more flexibility to integrate data from sources that are not yet fully supported, but generic sources may be more complex to manage than other Magpie source types, requiring a more detailed understanding of the underlying configuration of sources of that type. An example of a data source supported as a generic type by Magpie is ElasticSearch.
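As an illustration, a generic source wrapping ElasticSearch might pass the Spark connector's format and options through, as in the sketch below. The "generic" source type and the "format"/"options" wrapper keys are assumptions; "es.nodes" and "es.port" are standard settings of the elasticsearch-hadoop Spark connector.

```
create data source {
  "name": "my_es_source",
  "sourceType": "generic",
  "format": "org.elasticsearch.spark.sql",
  "options": {
    "es.nodes": "es.mycompany.com",
    "es.port": "9200"
  }
}
```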
Creating and Managing Data Sources
At a general level, there are three components to any data source definition:

1. A Magpie command: either `create data source` or `alter data source`.
2. A data source JSON specification: this varies from source to source and provides the attributes and configuration needed to connect to the data source. See the Data Source JSON specification for more details.
3. Security: to access a data source, you may need to adjust the security configuration of your cloud environment. This can be as simple as generating a service account for Magpie on a database, or it may require networking configuration on your side.
An example command is shown below:
```
create data source {
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:sqlserver://mydb.mycompany.com:1433/analytics",
  "user": "my_user_name",
  "password": "**********"
}
```
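An existing definition can be updated with `alter data source`. As a sketch, assuming the command accepts the same JSON shape (whether a partial specification is allowed is not covered here):

```
alter data source {
  "name": "my_db_source",
  "user": "new_user_name",
  "password": "**********"
}
```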
Understanding Data Source-backed Tables
As described in the Magpie Metadata documentation, objects in Magpie are often just pointers to underlying data stores. This has a few implications:
- Changes to the data in the underlying database table, file system, stream, etc., are reflected immediately in Magpie.
- Changes to the structure of the underlying database, file system, stream, etc., could interfere with the mapping from Magpie to the underlying table.
- Queries against data sources will sometimes be slow relative to queries against Magpie's internal data store, because data may need to travel over the network.
- Even though these tables reside in separate infrastructure, they may still be queried and joined to other tables in Magpie as if they all resided in the same system.
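For example, assuming Magpie accepts standard SQL via its underlying Spark environment, a source-backed table and a locally stored table can be joined directly (`orders` and `customers` are hypothetical table names):

```
select c.customer_id, c.name, o.order_total
from orders o
join customers c
  on c.customer_id = o.customer_id
```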
There are several practices that can help ensure you get good performance from underlying tables:

- The ability to query tables in place provides a way to review and understand the data in the source quickly, even allowing for joins with data stored directly within Magpie (for example, to determine whether the new data can be merged). However, to achieve this, all of the data in the table needs to be pulled into Magpie.
- For very large tables, querying the data in place can be very slow, and it may be easier to replicate the table locally prior to analysis.
- For ongoing analysis, it is often advisable to copy the table into Magpie's storage layer using the SAVE RESULT or SAVE TABLE AS command after querying, as sketched below. This also minimizes the dependencies between the source system and the Magpie data lake.
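A minimal sketch of copying a table into Magpie's storage layer; the exact syntax of SAVE TABLE AS is not shown on this page, so the form below, along with the `remote_orders` and `local_orders` names, is an assumption for illustration:

```
save table remote_orders as local_orders
```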
Guides
We've created several tutorials to guide you through the process of integrating new data sets.
- Guide to integrating data from file systems such as HDFS, S3, Azure Storage, and Google Cloud Storage
- Guide to integrating data from JDBC sources, including most relational databases
- Guide to integrating data from PostgreSQL sources
- Guide to integrating data from Redshift
- Guide to integrating data from BigQuery
- Guide to integrating data from Snowflake
- Guide to integrating data from generic Spark data sources
- Guide to integrating data from the web, with example data available for import
- Guide to integrating data from Kafka
- Guide to exporting data from Magpie