Overview
As described in the overview of Magpie's metadata, all tables within Magpie are backed by an underlying data source. There are several types of data sources:
Relational Database Sources: These are typically relational databases or data warehouses. By creating a database or data warehouse source, you can use Magpie to access the data within the source and either import it into Magpie for further analysis or execute queries directly against the data where it lives. Magpie provides multiple connectors for these sources (an example definition follows the list):
- JDBC Sources - relational databases like Microsoft SQL Server or MySQL
- PostgreSQL - connection to PostgreSQL relational databases
- Amazon Redshift - cloud-based data warehouse
- Google BigQuery - cloud-based data warehouse
- Snowflake - cloud-based data warehouse
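For example, a PostgreSQL source might be defined as follows. This is a minimal sketch: the "sourceType" value and connection attributes are assumptions modeled on the JDBC example later on this page, so consult the Data Source JSON specification for the exact fields each connector accepts.

```
create data source {
  "name": "my_pg_source",
  "sourceType": "postgresql",
  "url": "jdbc:postgresql://pg.mycompany.com:5432/analytics",
  "user": "my_user_name",
  "password": "**********"
}
```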
Distributed File Systems: Magpie supports access to files stored in HDFS (Hadoop Distributed File System), Amazon S3 (Simple Storage Service), Google Cloud Storage, and multiple types of Azure Storage. Files stored within a file system can be accessed as tables within Magpie as long as they have a valid persistence mapping. While several of the technologies listed are object stores, they can behave much like a file system and are treated as one by Magpie.
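As a sketch of what a file system source might look like, the definition below uses an assumed "s3" source type and assumed bucket and credential attribute names; the actual fields are documented in the Data Source JSON specification.

```
create data source {
  "name": "my_s3_source",
  "sourceType": "s3",
  "bucket": "my-company-data",
  "awsAccessKeyId": "AKIAEXAMPLEKEY",
  "awsSecretAccessKey": "**********"
}
```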
Apache Kafka Clusters: Apache Kafka is a distributed streaming platform that allows users to store streams of records in a fault-tolerant, durable way. Topics within Kafka clusters can be accessed as tables within Magpie.
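A Kafka source definition might look like the sketch below; the "kafka" source type and the "bootstrapServers" attribute name are assumptions for illustration.

```
create data source {
  "name": "my_kafka_source",
  "sourceType": "kafka",
  "bootstrapServers": "broker1.mycompany.com:9092,broker2.mycompany.com:9092"
}
```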
Generic: Magpie also supports access to any type of data source supported by the underlying Spark environment using a generic data source type. This provides more flexibility to integrate data from sources that are not yet fully supported, but generic sources may be more complex to manage than other Magpie source types, requiring a more detailed understanding of the underlying configuration of sources of that type. An example of a data source supported as a generic type by Magpie is ElasticSearch.
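As an illustration, a generic source wrapping ElasticSearch might pass the Spark connector's format and options through, as in the sketch below. The "generic" source type and the "format"/"options" wrapper keys are assumptions; "es.nodes" and "es.port" are standard settings of the elasticsearch-hadoop Spark connector.

```
create data source {
  "name": "my_es_source",
  "sourceType": "generic",
  "format": "org.elasticsearch.spark.sql",
  "options": {
    "es.nodes": "es.mycompany.com",
    "es.port": "9200"
  }
}
```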
Creating and Managing Data Sources
At a general level, there are three components to any data source definition:

1. A Magpie command: either `create data source` or `alter data source`.
2. A data source JSON specification: this varies from source to source and provides the attributes and configuration needed to connect to the data source. See the Data Source JSON specification for more details.
3. Security: to access a data source, you may need to adjust the security configuration of your cloud environment. This can be as simple as generating a service account for Magpie on a database, or it may require networking configuration on your side.
An example command is shown below:
```
create data source {
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:sqlserver://mydb.mycompany.com:1433/analytics",
  "user": "my_user_name",
  "password": "**********"
}
```
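An existing definition can be updated with `alter data source`. As a sketch, assuming the command accepts the same JSON shape (whether a partial specification is allowed is not covered here):

```
alter data source {
  "name": "my_db_source",
  "user": "new_user_name",
  "password": "**********"
}
```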
Understanding Data Source-backed Tables
As described in the Magpie Metadata documentation, objects in Magpie are often just pointers to underlying data stores. This has a few implications:
- Changes to the data in the underlying database table, file system, stream, etc., are reflected immediately in Magpie.
- Changes to the structure of the underlying database, file system, stream, etc., could interfere with the mapping from Magpie to the underlying table.
- Queries against data sources will sometimes be slow relative to queries against Magpie's internal data store, because data may need to travel over the network.
- Even though these tables reside in separate infrastructure, they may still be queried and joined to other tables in Magpie as if they all resided in the same system.
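For example, assuming Magpie accepts standard SQL via its underlying Spark environment, a source-backed table and a locally stored table can be joined directly (`orders` and `customers` are hypothetical table names):

```
select c.customer_id, c.name, o.order_total
from orders o
join customers c
  on c.customer_id = o.customer_id
```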
There are several practices that can help ensure you get good performance from underlying tables:

- The ability to query tables in place provides a way to review and understand the data in the source quickly, even allowing for joins with data stored directly within Magpie (for example, to determine whether the new data can be merged). However, to achieve this, all of the data in the table needs to be pulled into Magpie.
- For very large tables, querying the data in place can be very slow, and it may be easier to replicate the table locally prior to analysis.
- For ongoing analysis, it is often advisable to copy the table into Magpie's storage layer using the SAVE RESULT or SAVE TABLE AS command after querying, as sketched below. This also minimizes the dependencies between the source system and the Magpie data lake.
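A minimal sketch of copying a table into Magpie's storage layer; the exact syntax of SAVE TABLE AS is not shown on this page, so the form below, along with the `remote_orders` and `local_orders` names, is an assumption for illustration:

```
save table remote_orders as local_orders
```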
Guides
We've created several tutorials to guide you through the process of integrating new data sets.
- Guide to integrating data from file systems such as HDFS, S3, Azure Storage, and Google Cloud Storage
- Guide to integrating data from JDBC sources, including most relational databases
- Guide to integrating data from PostgreSQL sources
- Guide to integrating data from Redshift
- Guide to integrating data from BigQuery
- Guide to integrating data from Snowflake
- Guide to integrating data from generic Spark data sources
- Guide to integrating data from the web, with example data available for import
- Guide to integrating data from Kafka
- Guide to exporting data from Magpie