Accessing Data from JDBC Sources

February 04, 2020 17:04

One of the most common sources for data used within Magpie are relational databases that can be queried using SQL over JDBC. Amazon Redshift supports connections via JDBC, but Magpie also has a special connector for this source. Database sources have a few key attributes that need to be configured as part of their creation, please view the Data Source JSON specification for more details.

These attributes are set during the initial creation of the source or through an ALTER statement using a JSON specification to set the values.

An example of such a command is shown below.

create data source {
  "name": "my_db_source",
  "sourceType": "jdbc",
  "url": "jdbc:sqlserver://mydb.mycompany.com:1433/analytics",
  "user": "my_user_name",
  "password": "**********"
}

Create Table

Once a database source has been created, the tables within it can be used as the underlying storage for tables within Magpie. The following is an example of creating a Magpie table that references an underlying database table:

create table transactions 
  from data source my_db_source 
  table analytics.transactions;

This will result in a table within the Magpie Context that in turn references the source table in the database. When this table is queried within Magpie, Magpie will "reach" into the source database and pull the data before combining it with other local data.

Create Schema

Whole schemas can also be created from source databases in Magpie. In most cases, it is useful to have multiple tables from any given data source within Magpie. All of the tables within a given schema can be mirrored in Magpie using the following command:

create schema magpie_schema 
  from data source my_db_source
  using schema analytics;

This command results in a new Magpie schema that contains tables referencing all of the tables in the source schema within the source database.

Show Command

Once a table has been created, we can then query the data in that table using the SHOW SQL command.

show 100 from sql(
	select column_1, count(*) 
	from my_table group by column_1
) with count;

This will display the first 100 rows for the results set and also provide a count of the total number of rows. More information on querying data within Magpie can be found here.

Profile Table Command

Data in this database backed table can also be profiled.

profile table my_table;

For more information on using profiling, see this more detailed primer.

Save Result

As mentioned above, the overall performance and reliability of data will be greater if it is stored directly within Magpie. To copy the table created above into a local table, stored in the default data source for the Magpie instance, you can use the SAVE RESULT command or the SAVE TABLE AS command.

sql(select * from my_table);
save result as my_table_copy;

The new table created, my_table_copy, will now be accessible for query within Magpie. Because no specific storage is specified this table will be stored in the default location set for the Magpie environment in which it was created, typically a Parquet format file in a cloud object store.

Execute SQL Command

This command allows you to execute SQL query directly on the database. This can be useful for performing advanced activities such as some DDL and DML tasks on your database. Note that the SQL query will be executed locally on the database, so it must match the database’s syntax. In general, you should not use this command for routine data analysis, as the Create Table from Database and Create Table from Database SQL commands will perform better and be easier to interact with. Example command:

execute sql script on data source my_db_source with sql """
CREATE VIEW my_analytics_view AS
    SELECT transaction_type, avg(cost), min(date), max(date), count(*) 
    FROM analytics.transactions
    GROUP BY transaction_type;
""";

Case Sensitivity and Special Characters

Note that some databases are case sensitive (e.g., SQL Server), whereas Magpie is case insensitive and converts all references to lowercase before processing. If you need to access tables or schemas with capital letters in their names from a case sensitive database, quote the reference using backticks. For example:

list tables in data source my_db_source schema `MyCaseSensitiveSchema`

This approach also works for table and schema names that contain special characters.

Security

To access this type of data source, you may need to adjust the security configuration of your cloud environment. In particular, you need to generate database user credentials for the Magpie cluster, along with appropriate permissions to access relevant data. Please reach out to a member of the Silectis team with any questions or support requests.

Note, database data sources should generally be created using a user with read-only permissions to the target database. This can prevent unwanted updates to the target database resulting from actions within Magpie. If you do need to write data back to the database, it may be useful to define a separate data source to be used for writing, with stricter permissions.