Loading Data from URL

August 02, 2018 04:30

Data can also be imported into Magpie from external sources, in particular the web and other sources addressable by URL. With the SAVE URL AS TABLE command, you specify a source URL, and Magpie copies the data into its environment and creates a table from it.

This approach is generally used to get data from external web sources via HTTP, but it can also be used for S3 and HDFS sources. Unlike a table created on a defined Magpie data source, a table created from a URL holds a copy of the data, made at the time the table is created. If the source data is updated after the table is created, the data must be pulled in again to update the Magpie table.

Saving data from a URL

The SAVE URL AS TABLE command saves the data at a URL as a Magpie table. It is suited to pulling individual files from web sources when they are directly available as a CSV. For more complicated, API-based approaches to data access, you can instead write a script in Scala or Python, using the appropriate interpreter in the Magpie notebook.

save url "https://data.iowa.gov/api/views/bzed-t5zc/rows.csv" 
  as table city_budgets in schema analysis_team
  with header 
  with infer schema
  with replace with delete;

In typical iterative ETL workflows, we recommend specifying the "with replace with delete" options, as in the example above, so that the table is recreated on each run. The locally saved data is then automatically overwritten with a fresh copy from the source.

For larger files, specifying the partition option can improve performance by allowing greater parallelism. Note also that several options apply only to certain file formats (e.g., infer schema applies only to CSV). See the full documentation for details: Save URL as Table
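
For the API-based case mentioned earlier, where the data is not directly available as a CSV file, the following is a minimal Python sketch of the kind of script that could be adapted in a notebook. It assumes the API returns a JSON array of flat objects. The function names fetch_json and records_to_csv are hypothetical helpers introduced for illustration, not part of Magpie. The sketch uses only the Python standard library, and how the resulting rows are then written to a Magpie table depends on the notebook interpreter and is not shown here.

```python
import csv
import io
import json
import urllib.request


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON.

    Assumes the endpoint returns UTF-8 encoded JSON (an assumption,
    not something every API guarantees).
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def records_to_csv(records):
    """Turn a list of flat JSON objects into CSV text with a header row.

    Columns are the union of all keys in first-seen order; values
    missing from a record are written as empty fields.
    """
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample records standing in for an API response. With a real
# endpoint, they would come from fetch_json("<your API URL>") instead.
sample_records = [
    {"city": "Ames", "budget": 100},
    {"city": "Des Moines", "fund": "General"},
]
print(records_to_csv(sample_records))
```

Separating the download step from the flattening step keeps the conversion logic easy to check against small hand-written records before pointing the script at a live API.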