Magpie is a cloud-based data management platform that brings together many of the core capabilities needed to build an enterprise analytics environment. Magpie's capabilities are focused on four areas:
Data Integration – Magpie makes it easy to pull data from multiple types of data sources. These include relational databases like MySQL, Microsoft SQL Server, PostgreSQL, and Amazon Redshift, as well as NoSQL sources like Elasticsearch. Magpie can also pull data from web-based sources by creating a table from a URL.
Data Exploration and Profiling – Magpie lets you explore data using SQL and supports data profiling, so you can quickly understand the contents of a data table with a single command.
Data Pipelines and Transformation – Magpie allows you to create data processing tasks and integrate them into multi-step jobs that can be run on a scheduled basis.
AI and ML Infrastructure – Magpie is built on top of the Apache Spark framework, so it benefits from Spark's scalability and allows users to work with all of the objects created in Magpie as DataFrames or Datasets within Spark.
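To make the profiling idea under Data Exploration and Profiling more concrete, here is a minimal sketch of the kind of per-column summary a profiling command typically reports. It is not Magpie's DSL or API. It uses Python's built-in sqlite3 module as a stand-in data store, and the `profile_table` helper and the `orders` table are hypothetical names chosen for illustration.

```python
import sqlite3


def profile_table(conn, table):
    """Return per-column row count, null count, distinct count, min, and max."""
    # Hypothetical helper for illustration only; not part of Magpie.
    # Identifiers cannot be bound as parameters, so quote them explicitly.
    quoted_table = '"' + table.replace('"', '""') + '"'
    # A zero-row SELECT still populates cursor.description with column names.
    cursor = conn.execute(f"SELECT * FROM {quoted_table} LIMIT 0")
    columns = [d[0] for d in cursor.description]

    profile = {}
    for column in columns:
        quoted_col = '"' + column.replace('"', '""') + '"'
        rows, non_null, distinct, minimum, maximum = conn.execute(
            f"SELECT COUNT(*), COUNT({quoted_col}), "
            f"COUNT(DISTINCT {quoted_col}), MIN({quoted_col}), MAX({quoted_col}) "
            f"FROM {quoted_table}"
        ).fetchone()
        profile[column] = {
            "rows": rows,
            "nulls": rows - non_null,  # COUNT(col) skips NULL values
            "distinct": distinct,      # COUNT(DISTINCT col) also skips NULLs
            "min": minimum,
            "max": maximum,
        }
    return profile


# Small in-memory table with some missing values to profile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", None), (3, "east", 25.5), (4, None, 7.25)],
)

orders_profile = profile_table(conn, "orders")
for column, stats in orders_profile.items():
    print(column, stats)
```

A summary like this shows at a glance which columns have missing values and how many distinct values each holds, which is the kind of quick understanding profiling is meant to provide.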
These capabilities are supported by a set of underlying services:
Magpie DSL – Users can interact with Magpie using a range of languages including SQL, Scala, and Python, but Magpie also supports its own Domain Specific Language (DSL) that allows you to manipulate the underlying data structures within Magpie, specify data processing jobs, explore existing metadata, and manage users and roles.
Metadata Repository – At the core of the Magpie platform is the metadata repository. It stores all of the data structures, data source configurations, jobs, and schedules associated with Magpie and provides the backbone for much of Magpie's functionality. A more detailed view of Magpie's metadata management is provided here – Magpie Metadata.
Security – Magpie provides fine-grained control over the objects stored in Magpie, with support for permission assignments at the role and user level.
Federated Query – Magpie allows you to execute queries that span multiple data stores including its internal storage, Amazon S3 buckets, Hadoop Distributed File System (HDFS), and relational and NoSQL databases. This supports ingestion of data from many different types of sources and the ability to analyze data "in place" without moving it into Magpie.
Scheduling and Notifications – Magpie has a built-in scheduling system that runs jobs at specific time intervals, sends notifications to users when jobs succeed or fail, and maintains a detailed history of past job runs.
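The relationship between tasks, multi-step jobs, notifications, and run history described above can be sketched in a few lines of Python. This is an illustration of the concept only, not Magpie's implementation or DSL. The `Job` and `JobRun` classes, the task functions, and the job names are all hypothetical. The scheduler itself is left out; in a sketch like this, it would simply call `run()` at each interval.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class JobRun:
    """One entry in a job's run history (hypothetical structure)."""
    started_at: datetime
    succeeded: bool
    detail: str


@dataclass
class Job:
    """A multi-step job: ordered tasks, a notifier, and a run history."""
    name: str
    tasks: List[Callable[[], None]]
    notify: Callable[[str], None]
    history: List[JobRun] = field(default_factory=list)

    def run(self) -> bool:
        started_at = datetime.now(timezone.utc)
        try:
            # Run tasks in order; the first failure stops the job.
            for task in self.tasks:
                task()
            run = JobRun(started_at, True, f"{len(self.tasks)} task(s) completed")
        except Exception as exc:
            run = JobRun(started_at, False, str(exc))
        # Record the outcome, then notify on both success and failure.
        self.history.append(run)
        status = "succeeded" if run.succeeded else "failed"
        self.notify(f"Job {self.name} {status}: {run.detail}")
        return run.succeeded


staged = []    # stands in for a staging table
messages = []  # stands in for a notification channel


def extract():
    staged.extend([3, 1, 2])


def transform():
    staged.sort()


def broken():
    raise ValueError("source unavailable")


nightly = Job("nightly_load", [extract, transform], messages.append)
nightly.run()

failing = Job("bad_load", [broken], messages.append)
failing.run()

for message in messages:
    print(message)
```

Keeping the history on the job object means each run, successful or not, leaves a record that can be inspected later, which mirrors the run history the scheduling service maintains.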
Magpie is deployed in the cloud, and one or more dedicated clusters are provisioned for your organization. This helps isolate your data and prevents other Magpie users from impacting your work. Magpie functionality is accessed through our web-based user interface, the Magpie Notebook, which allows you to import data, create analyses, build data pipelines, and share the results with teammates. More detail on the Magpie Notebook interface can be found here – Using the Magpie Notebook.