Platform Architecture

June 29, 2018 18:26

Architecture Overview

Understanding the basics of Magpie's underlying architecture can be helpful in getting the most out of the platform. Knowing how the components of Magpie interact makes it easier to understand the commands that you will need to use to manage data within Magpie and how best to integrate your own data sources into the platform.

The figure below provides a high-level view of the architecture:

The following are the key components of the architecture shown:

Component	Description
Magpie Notebook	The Magpie Notebook interface is the means by which end users access Magpie. It provides a notebook interface similar to Jupyter or Apache Zeppelin allowing users to create a narrative around their work and keep track of the steps needed to reproduce an analysis or pipeline. A more detailed description of the Magpie Notebook and how to use it can be found here – Using the Magpie Notebook.
Magpie Command Line Interface	We’ve created a command line interface (CLI) that surfaces all of the features of Magpie in a portable tool. The CLI is a powerful medium for deploying pipelines, versioning code, or integrating with external continuous integration or scheduling tools. A more detailed description of the Magpie Command Line Interface and how to use it can be found here – Magpie Command Line Interface.
Business Intelligence Tools	Magpie supports a variety of business intelligence and data visualization tools. The platform's flexible data lake architecture allows it to serve as a hub for your enterprise data, and its underlying compute capabilities allow for ad hoc analysis and visualization.
Magpie Management Layer	The core Magpie Engine manages all of the computation associated with the Magpie platform. This includes processing queries, executing jobs, controlling access based on security settings, and managing all of the metadata associated with a particular implementation.
Catalog	This component surfaces Magpie's rich metadata to users to help them discover data and understand where it resides within the system.
Security	Magpie controls access to objects (including tables and jobs) at the role and individual user level. Each request to Magpie first passes through a security filter to ensure that the user has appropriate access to use, read, write, or create an object.
Scheduling	This component manages the execution of pre-defined jobs within Magpie. This includes running jobs on a scheduled basis or interactively, and tracking job execution history.
Pipelines	Magpie provides a streamlined format for creating and managing data pipelines, and orchestrates jobs while managing dependencies.
Monitoring & Alerting	Magpie's robust logging features can be leveraged to monitor data loading jobs, alert upon success/failure, and track user/organization behavior in the lake.
Metadata Repository	The metadata repository stores all of the structural information about the data accessible by Magpie (table names, column names and types, etc.), the job flow data associated with Magpie (job definitions, task predecessors and successors, execution history, and schedules), and the configuration information for data sources (connection end points, credentials, etc.). The specifics of the objects contained in the metadata repository are covered in more detail here – Magpie Metadata.
Magpie Engine: Apache Spark	Magpie's underlying computation is provided by an internal Spark cluster that distributes computation across nodes. This allows Magpie to scale horizontally for demanding computational tasks and allows us to expose all of the Magpie table data within the Spark layer for further processing using all of the capabilities of the Spark platform. Magpie allows the user to provision and manage one or many Apache Spark clusters, depending on the organization's data volume and processing requirements.
Data Sources	Data storage and computation in Magpie are largely independent of one another. That is, Magpie does not have a proprietary internal data store, as is the case in most relational database systems. Instead, Magpie allows data to be accessed across a range of different repositories while providing a single point of entry for users to query those sources. Consequently, rather than being materialized objects within Magpie, tables are pointers to files in a distributed file system (like Amazon S3 or HDFS) or to tables stored within other databases.
Databases	Magpie can integrate with a range of different databases, supporting whatever data sources are supported by the underlying Spark layer. In addition, Magpie provides easier configuration for some specific data source types, including core relational databases like MS SQL Server, Amazon Redshift, PostgreSQL, and MySQL, without the need to add drivers or configure detailed connection URLs.
HDFS	Magpie also supports HDFS (Hadoop Distributed File System) as a storage mechanism for data. HDFS storage operates similarly to S3 but can provide better performance if HDFS is local to the compute cluster. Magpie does not generally include HDFS as part of its clusters, but it can be added if necessary.
Cloud Infrastructure Providers	Magpie can be deployed on the major cloud providers (Amazon Web Services, Google Cloud Platform, and Azure).
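
The Pipelines and Metadata Repository rows above mention jobs with dependencies and task predecessors and successors. As a minimal sketch of what dependency-aware ordering means, the Python below uses the standard library's graphlib module. The job names and the dictionary layout are hypothetical and chosen only for illustration; this is not Magpie's job definition format or its scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
predecessors = {
    "load_orders": set(),
    "load_customers": set(),
    "join_orders_customers": {"load_orders", "load_customers"},
    "publish_report": {"join_orders_customers"},
}


def execution_order(deps):
    """Return one valid run order in which every job follows its predecessors."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(predecessors)
print(order)
```

Any order produced this way runs both load jobs before the join and the join before the report, which is the guarantee an orchestrator needs before it starts a dependent job.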

How Magpie Stores Data

A key element of Magpie's architecture is the way in which it stores data and manages references to stored data. As noted above, one of the implications of this architecture is that tables are actually "pointers" to information that is stored elsewhere. Magpie stores the information necessary to pull the data from storage and make it available for query. The figure below illustrates the relationship between the table representation in Magpie and the data source, as well as how metadata relates the two.

As shown in the figure, the relationship between a table and its underlying storage is defined by two key elements of metadata:

Data Source – The data source captures all of the general information about the endpoint from which data is gathered, and it can be shared among multiple tables. For example, a data source could be a database connection or a reference to a path within a bucket in a cloud services provider.
Persistence Mapping – A persistence mapping contains all of the mapping information specific to a particular table. That would be references to the specific file format for data that is backed by a file, or references to a specific schema and table within a database. The persistence mapping also always contains a mapping from the fields in the source storage format to the fields in the Magpie environment. This allows database field names to be aliased within Magpie and delimited files lacking header information to have recognizable, persistent names for parsed fields.
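
To make the split between the two metadata elements concrete, here is a minimal Python sketch. The class names, attributes, connection URL, and table names are all hypothetical and are not Magpie's actual metadata schema; the sketch only shows one data source shared by two tables, each with its own persistence mapping and field aliases.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class DataSource:
    """General endpoint information that many tables can share (hypothetical)."""
    name: str
    kind: str       # e.g. "database" or "file_system"
    location: str   # e.g. a connection URL or a path within a bucket


@dataclass
class PersistenceMapping:
    """Table-specific details layered on top of a data source (hypothetical)."""
    source: DataSource
    target: str                        # schema.table, or a path relative to the source
    file_format: Optional[str] = None  # e.g. "csv" or "parquet"; None for database tables
    field_map: Dict[str, str] = field(default_factory=dict)  # source field -> Magpie field

    def rename(self, record: dict) -> dict:
        """Alias source field names; unmapped fields keep their original names."""
        return {self.field_map.get(key, key): value for key, value in record.items()}


warehouse = DataSource("warehouse", "database", "postgresql://example-host/sales")
orders = PersistenceMapping(
    warehouse, "public.ord_hdr", field_map={"ord_id": "order_id", "cust_no": "customer_id"}
)
customers = PersistenceMapping(warehouse, "public.cust", field_map={"cust_no": "customer_id"})

row = orders.rename({"ord_id": 17, "cust_no": 4, "total": 99.5})
print(row)
```

Keeping connection details on the data source means a changed endpoint or credential is updated once, while each table keeps only what is specific to it.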

In the example above, three types of storage for tables are shown:

Database Storage – Database storage maps the table to an underlying structure stored in a database like MS SQL Server, MySQL, PostgreSQL, or Amazon Redshift.
Delimited File Storage – This represents data that is stored in a flat file format that uses delimiters to separate fields from one another. In this case, the persistence mapping needs to include information about the delimiter used in the file and other information that directs Magpie in how to parse the file, like whether multi-line fields should be expected. If the path specified in the persistence mapping stops at the folder level, all files within that folder will be considered part of the source information for the table. This allows partitioned files to be read into a single table within Magpie.
Structured File Storage – In this case, structured file storage refers to formats that carry schema information about the data stored within them. For example, the Parquet and ORC formats are compressed columnar formats that have schema header information that defines the types of the columns. As with delimited files, more than one file can be used for a single table.
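
The folder-level behavior described for delimited files can be sketched in plain Python. This is an illustration only, not how Magpie reads files (Magpie's computation runs on Spark); the function name, file names, and field names are hypothetical. It reads every file in a folder as one table, applies persistent field names to headerless data, and tolerates a quoted multi-line field.

```python
import csv
import tempfile
from pathlib import Path


def read_delimited(path, field_names, delimiter=","):
    """Read one headerless delimited file, or every file in a folder, as one table."""
    root = Path(path)
    files = sorted(p for p in root.iterdir() if p.is_file()) if root.is_dir() else [root]
    rows = []
    for file in files:
        # newline="" lets the csv module handle newlines inside quoted fields itself.
        with file.open(newline="") as handle:
            for values in csv.reader(handle, delimiter=delimiter):
                rows.append(dict(zip(field_names, values)))
    return rows


with tempfile.TemporaryDirectory() as folder:
    # Two partition files without header rows; the second has a multi-line field.
    Path(folder, "part-0001.txt").write_text("1|Ada\n2|Grace\n")
    Path(folder, "part-0002.txt").write_text('3|"Multi\nline"\n')
    table = read_delimited(folder, ["id", "name"], delimiter="|")

print(len(table), table[0])
```

Passing a single file path instead of the folder reads only that file, which mirrors the distinction between a path that names a file and one that stops at the folder level.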

Once a table has been created in Magpie, it can be queried and joined with the other tables within Magpie as if they were all located in a single database. This capability is called federated query. It allows you to explore data in place without needing to move it into the Magpie environment. Note that the ease of analyzing data in place needs to be balanced against the performance overhead of using data that is stored remotely.
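
The idea behind federated query can be illustrated with a small, self-contained Python sketch. It is a conceptual stand-in rather than Magpie's implementation or syntax: an in-memory SQLite database plays the role of a remote database, an in-memory string plays the role of a delimited file in object storage, and the table names, columns, and values are invented for the example.

```python
import csv
import io
import sqlite3

# Source 1: a database table (in-memory SQLite stands in for a remote database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Source 2: a delimited file (an in-memory string stands in for a stored file).
orders_file = io.StringIO(
    "order_id,customer_id,total\n10,1,25.0\n11,2,40.0\n12,1,5.5\n"
)


def federated_totals(connection, orders_handle):
    """Join file-backed orders with database-backed customers; total per customer."""
    names = dict(connection.execute("SELECT customer_id, name FROM customers"))
    totals = {}
    for order in csv.DictReader(orders_handle):
        name = names[int(order["customer_id"])]
        totals[name] = totals.get(name, 0.0) + float(order["total"])
    return totals


totals = federated_totals(db, orders_file)
print(totals)
```

Both sources are read where they live and combined only at query time, which is also where the performance trade-off noted above comes from: every query pays the cost of fetching from each source.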

When new objects are created within Magpie, they are stored using the Parquet format in the default storage location specified for the data repository.