Magpie Metadata

Overview

Magpie's capabilities are built around a few core elements that store the information necessary to allow queries to execute, jobs to run, and data to be integrated from multiple sources. These elements are referred to as metadata and are stored in the Magpie Metadata Repository. Many of Magpie's commands are built around manipulating this metadata. Examples of these metadata elements include schemas, tables, jobs, and tasks.

Metadata Object Types

The list below describes the different types of metadata objects that exist within Magpie:

Organization

The organization is the top-level element in Magpie's metadata; it groups together all of the elements for a particular organizational entity. You will not need to deal with organization objects directly because your access to Magpie is limited to your own organization. All users belong to an organization and can only access data and jobs within that organization.

Cluster

A cluster is the runtime environment that Magpie queries and scripts execute in. A cluster consists of a driver node and optionally a set of worker nodes, depending on the processing requirements of your organization. Your organization will likely start with a single cluster, but depending on the terms of your Magpie subscription agreement, you can dynamically add and use additional clusters to process more data more quickly or segment user workloads.

Repository

A repository is the next level grouping of objects below the organization level. Your organization is likely to have only one repository to start with, but may add others if data and processing need to be strictly segregated to different environments or computing resources. Repositories contain data sources, schemas, and projects.

Schema

Schemas are containers for tables. All tables are assigned to a specific schema. Much like traditional database systems, tables can be accessed across different schemas by using the schema name followed by a period as a prefix (e.g., sales.purchase).
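For example, assuming a table named purchase in a schema named sales (the illustrative names from above), a query can reference it from any schema by using the prefixed form:

select * from sales.purchase;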

Table

A table is a container of data within the Magpie environment that can be queried using SQL. A table, in the Magpie context, is really a pointer to some underlying storage along with a schema that tells Magpie how to interpret the source data. This underlying source can be a file in a distributed file system such as Amazon S3 or HDFS (Hadoop Distributed File System), or a table within a database. A table is mapped onto a data source based on its persistence mapping.

Field

A field is an individual column within a table. It can have any one of a number of different data types (e.g., string, integer, long, float, double, date, timestamp). Complex types such as maps and arrays are also supported, but they cannot be queried directly within Magpie and must be accessed through Spark or Python.

Profile

A profile is a summary of the data within a table, including a count of total rows, the highest-frequency values in each column, a count of null values in each column, and a count of distinct values in each column. A higher-level profile type is also inferred for each of the columns. A history of profiles is retained for each table, meaning that each table can have zero or more profiles associated with it.

Project

A project is a way of grouping jobs within a repository. 

Job

A job is a sequence of tasks that can be run either interactively or on a schedule.

Task

A task is a single step within a job. It can be a simple SQL statement, a Scala or Python script, or a Magpie script.

Schedule

An execution schedule associated with a job. A single job may have several schedules. The schedule will also have an execution context for the associated job.

History

A record of past job executions. This includes detailed information about each step in the process and the final status of the job (i.e., success or failure).

Data Source

A data source is an underlying storage mechanism for the data available within Magpie. As mentioned above, a data source can be any one of several types including a range of relational databases (SQL Server, PostgreSQL, MySQL, and Amazon Redshift), file systems (HDFS, S3), or any source supported directly within Spark via a generic adapter (e.g., ElasticSearch, Cassandra, etc.). All tables are linked to their underlying data source through a persistence mapping.

User

A user of the Magpie platform.

Role

A role can be assigned to a number of users and can have permissions on objects. Roles simplify permission management by avoiding the need to assign permissions to each user individually.

Permission

An assignment of the ability to manipulate a metadata object to a user or a role.


The diagram below illustrates the relationships between the different metadata elements (users, roles, and permissions not shown).

Creating and Accessing Metadata

Each type of metadata object shown above can be created, modified, or deleted through commands in the Magpie scripting language. For example, tables can be created explicitly by specifying all of the related metadata, or by saving the results of a previous query.
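As a sketch of the second approach, a query result could be saved as a new table with a create-table-as-select style statement. The exact Magpie syntax may differ, and the table and column names here (drawn from the chicago_crime_detail example below) are illustrative only:

create table crimes_by_ward as
select ward, count(*) as crime_count
from chicago_crime_detail
group by ward;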

The following is an example of creating a table by explicitly specifying all of the metadata. This serves both to illustrate the syntax for creating tables and to provide an inventory of the individual data elements that comprise the full set of table metadata. Similar commands exist for the other types of metadata shown above.


Create Table
drop table if exists chicago_crime_detail;

create table {
  "name": "chicago_crime_detail",
  "description": "Detailed list of all crimes in Chicago from 2001 to present",
  "objectType": "Entity",
  "fields": [
    {
      "name": "id",
      "description": "The unique identifier for the crime.",
      "length": null,
      "fieldType": "Int",
      "fieldIndex": 0
    },
    {
      "name": "case_number",
      "description": null,
      "length": null,
      "fieldType": "String",
      "fieldIndex": 1
    },
    {
      "name": "date",
      "description": null,
      "length": null,
      "fieldType": "String",
      "fieldIndex": 2
    },
    {
      "name": "block",
      "description": null,
      "length": null,
      "fieldType": "String",
      "fieldIndex": 3
    },
    {
      "name": "primary_type",
      "description": null,
      "length": null,
      "fieldType": "String",
      "fieldIndex": 5
    },
    {
      "name": "arrest",
      "description": "Was there an arrest in the crime.",
      "length": null,
      "fieldType": "Boolean",
      "fieldIndex": 8
    },
    {
      "name": "ward",
      "description": null,
      "length": null,
      "fieldType": "Int",
      "fieldIndex": 12
    },
    {
      "name": "fbi_code",
      "description": null,
      "length": null,
      "fieldType": "String",
      "fieldIndex": 14
    },
    {
      "name": "year",
      "description": null,
      "length": null,
      "fieldType": "Int",
      "fieldIndex": 17
    },
    {
      "name": "latitude",
      "description": null,
      "length": null,
      "fieldType": "Double",
      "fieldIndex": 19
    },
    {
      "name": "longitude",
      "description": null,
      "length": null,
      "fieldType": "Double",
      "fieldIndex": 20
    }
  ],
  "persistenceMapping": {
    "sourceName": "tests3source",
    "path": "default_files",
    "fileName": "chicago_crime_detail_93bf8cca-571c-4157-b53d-fc242073ea46",
    "isWritable": true,
    "mappingType": "ParquetFile"
  },
  "lifecycle": "Global",
  "isVersioned": false
};


This detailed metadata can be accessed by running the "describe" statement on the metadata object. An example for a data source is shown below.
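The statement below is a minimal sketch of this usage. It assumes that describe takes the object type and name, and it uses the tests3source data source referenced in the persistence mapping of the create table example above; the exact syntax may differ in your Magpie version.

describe data source tests3source;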

By clicking on the "JSON" tab in the describe view, you can view the underlying JSON representation of the metadata. This is useful for understanding the specification used when creating these objects using the "create" command.



