File

A file is used in File System Persistence Mappings to define the location and format of data on a filesystem. Several file types are supported, and each type may have different required fields, depending on the value of the format field.
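As a rough illustration of how the format field determines which other fields apply, the following Python sketch checks a file definition against the fields documented on this page. The function name check_file_spec and both lookup tables are hypothetical and are not part of Magpie. Exact, case-sensitive matching of format values and the reporting of undocumented fields are assumptions made for the example.

```python
# Fields that every file type on this page marks as Required.
REQUIRED = {"format", "path", "fileName"}

# Optional fields per "format" value, copied from the tables on this page.
OPTIONAL = {
    "text": {"compression"},
    "DelimitedText": {
        "compression", "encoding", "delimiter", "header", "multiLine",
        "ignoreLeadingWhiteSpace", "ignoreTrailingWhiteSpace",
        "quoteCharacter", "escapeCharacter", "nullValue",
        "dateFormat", "timestampFormat",
    },
    "parquet": {"compression", "mergeSchema"},
    "json": {"compression", "encoding", "multiLine",
             "dateFormat", "timestampFormat"},
    "orc": {"compression"},
    "avro": {"compression", "ignoreExtension", "recordName", "recordNamespace"},
    "delta": {"mergeSchema"},
}


def check_file_spec(spec):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = ["missing required field: " + name
                for name in sorted(REQUIRED - set(spec))]
    fmt = spec.get("format")
    if fmt is None:
        return problems
    if fmt not in OPTIONAL:
        # Assumption: format values are matched exactly as written on this page.
        problems.append("unknown format: " + str(fmt))
        return problems
    for name in sorted(set(spec) - REQUIRED - OPTIONAL[fmt]):
        problems.append("unexpected field for " + fmt + ": " + name)
    return problems


print(check_file_spec({"path": "/my_data", "fileName": "transactions.txt", "format": "text"}))
print(check_file_spec({"format": "orc", "path": ""}))
```

An empty string for path passes the check because the field only needs to be present.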

Text File

A text file has a single field, value, for each row in the file.
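The single-field layout can be pictured with a small Python sketch. It is only an illustration of the resulting row shape, not Magpie's reader; the helper name read_text_rows is hypothetical.

```python
def read_text_rows(text):
    """Return one row per line, each with a single field named "value"."""
    return [{"value": line} for line in text.splitlines()]


rows = read_text_rows("2021-01-01,42\n2021-01-02,17\n")
for row in rows:
    print(row)
```

Note that the commas are not interpreted: each line is kept whole in the value field. Splitting lines into separate fields is the job of the Delimited Text File type.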

Specification

{
  "format": "text",
  "path": "<string>",
  "fileName": "<string>",
  "compression": "<string>"
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards (e.g., *.txt). | Required | |
| compression | String | Compression type for source file. Supported compression types: none, bzip2, gzip, lz4, snappy, deflate. Note: this compression is only used when writing files. Magpie will detect the compression of files based on the file extension when reading. | Optional | gzip |
| format | String | File's storage format, text for text files. | Required | |
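The note on the compression field says that compression is detected from the file extension when reading. The Python sketch below shows what such a lookup might look like. The extension-to-codec mapping follows common naming conventions and is an assumption; this page does not list the extensions Magpie actually recognizes, and detect_compression is a hypothetical helper.

```python
import os

# Assumed mapping from file extension to the compression names listed above.
EXTENSION_TO_COMPRESSION = {
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".lz4": "lz4",
    ".snappy": "snappy",
    ".deflate": "deflate",
}


def detect_compression(file_name):
    """Guess the compression type from the last extension of file_name."""
    _, extension = os.path.splitext(file_name)
    return EXTENSION_TO_COMPRESSION.get(extension.lower(), "none")


print(detect_compression("transactions.tsv.gz"))
print(detect_compression("transactions.txt"))
```

Only the final extension is inspected, so a name such as transactions.tsv.gz is treated as gzip-compressed regardless of the .tsv part.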


Example

{
  "path": "/my_data",
  "fileName": "transactions.txt",
  "format": "text"
}


Delimited Text File (e.g., csv, tsv)

A text file with fields delimited by a specified separator.

Specification

{
  "format": "DelimitedText",
  "path": "<string>",
  "fileName": "<string>",
  "compression": "<string>",
  "encoding": "<string>",
  "delimiter": "<string>",
  "header": <boolean>,
  "multiLine": <boolean>,
  "ignoreLeadingWhiteSpace": <boolean>,
  "ignoreTrailingWhiteSpace": <boolean>,
  "quoteCharacter": "<string>",
  "escapeCharacter": "<string>",
  "nullValue": "<string>",
  "dateFormat": "<string>",
  "timestampFormat": "<string>"
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards (e.g., *.csv). | Required | |
| delimiter | String | The separator used to partition records into fields. | Optional | , |
| compression | String | Compression type for source file. Supported compression types: none, bzip2, gzip, lz4, snappy, deflate. Note: this compression is only used when writing files. Magpie will detect the compression of files based on the file extension when reading. | Optional | gzip |
| encoding | String | Encoding of the file. | Optional | utf-8 |
| header | Boolean | If true, the first line of each file will be used as field names for the resulting tables. | Optional | false |
| multiLine | Boolean | If true, multiple lines of the file will be parsed as one record, with new records starting based on field counts. | Optional | false |
| ignoreLeadingWhiteSpace | Boolean | If true, leading white space will be trimmed from each field. | Optional | false |
| ignoreTrailingWhiteSpace | Boolean | If true, trailing white space will be trimmed from each field. | Optional | false |
| quoteCharacter | String | The character optionally used to enclose fields within the files. | Optional | " |
| escapeCharacter | String | The character optionally used to escape quotations within a quoted field. | Optional | " |
| nullValue | String | The value to treat as null when reading files and the value to use as null when writing files. | Optional | <empty string> |
| dateFormat | String | The Java date format used to identify fields as dates within the files. | Optional | yyyy-MM-dd |
| timestampFormat | String | The Java date time format used to identify fields as timestamps within the files. | Optional | yyyy-MM-dd HH:mm:ss |
| format | String | File's storage format, DelimitedText for delimited text files. | Required | |
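To make the interaction of delimiter, header, quoteCharacter, escapeCharacter, and nullValue more concrete, here is a minimal Python sketch built on the standard csv module. It approximates the documented behavior and is not Magpie's parser. The helper name read_delimited, the column_N names used when header is false, and the reading of the default escape character (a doubled quote inside a quoted field) are all assumptions.

```python
import csv
import io


def read_delimited(text, delimiter=",", quote_character='"',
                   header=False, null_value=""):
    """Parse delimited text into a list of dicts, one per record."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quote_character,
        doublequote=True,  # a doubled quote inside a quoted field is a literal quote
    )
    rows = list(reader)
    if not rows:
        return []
    if header:
        # The first line supplies the field names.
        names, rows = rows[0], rows[1:]
    else:
        # Hypothetical placeholder names for files without a header line.
        names = ["column_" + str(i) for i in range(len(rows[0]))]
    return [
        {name: (None if value == null_value else value)
         for name, value in zip(names, row)}
        for row in rows
    ]


sample = 'id\tnote\n1\t"say ""hi"""\n2\t\n'
records = read_delimited(sample, delimiter="\t", header=True)
for record in records:
    print(record)
```

With the default nullValue of an empty string, the empty note on the second record is returned as None rather than as an empty string.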


Example

{
  "path": "/my_data",
  "fileName": "transactions.tsv.gz",
  "delimiter": "\t",
  "compression": "gzip",
  "encoding": "UTF-8",
  "header": true,
  "multiLine": false,
  "ignoreLeadingWhiteSpace": false,
  "ignoreTrailingWhiteSpace": false,
  "quoteCharacter": "\"",
  "escapeCharacter": "\"",
  "dateFormat": "yyyy-MM-dd",
  "timestampFormat": "yyyy-MM-dd'T'HH:mm:ss.SSSXXX",
  "format": "DelimitedText"
}
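The dateFormat and timestampFormat values are Java patterns. For readers more familiar with Python, the sketch below maps the three patterns that appear on this page to rough strptime equivalents and parses a sample value with each. The mapping table and the helper parse_with_java_pattern are hypothetical, and the equivalence is approximate: for example, Python's %f accepts one to six fractional digits, whereas SSS denotes milliseconds.

```python
from datetime import date, datetime, timedelta

# Approximate strptime equivalents of the Java patterns used on this page.
JAVA_TO_STRPTIME = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
    "yyyy-MM-dd'T'HH:mm:ss.SSSXXX": "%Y-%m-%dT%H:%M:%S.%f%z",
}


def parse_with_java_pattern(value, java_pattern):
    """Parse value using the strptime format mapped to java_pattern."""
    return datetime.strptime(value, JAVA_TO_STRPTIME[java_pattern])


d = parse_with_java_pattern("2021-03-04", "yyyy-MM-dd").date()
ts = parse_with_java_pattern("2021-03-04T05:06:07.123+01:00",
                             "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
print(d)
print(ts.isoformat())
```

A value that does not match the configured pattern raises ValueError in this sketch; how Magpie treats non-matching values is not described on this page.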


Parquet File

A columnar storage format, and the default storage format in Magpie.

Specification

{
  "format": "parquet",
  "path": "<string>",
  "fileName": "<string>",
  "compression": "<string>",
  "mergeSchema": <boolean>
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards. | Required | |
| compression | String | Compression type for source file. Supported compression types: none, gzip, snappy, lzo. Note: this compression is only used when writing files. Magpie will detect the compression of files based on the file extension when reading. | Optional | snappy |
| mergeSchema | Boolean | If true and this file points to a folder of parquet files, Magpie will merge the schema for each file to create the final schema for the table. If false, Magpie will only use the schema of the first file for the table. | Optional | false |
| format | String | File's storage format, parquet for parquet files. | Required | |


Example

{
  "path": "/my_data",
  "fileName": "transactions",
  "mergeSchema": true,
  "format": "parquet"
}
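The effect of mergeSchema on a folder of files can be sketched in a few lines of Python. Schemas are modeled here as plain dictionaries of field name to type name, which is a simplification. The helper table_schema is hypothetical, and raising an error on conflicting types is an assumption, since this page does not say how conflicts are resolved.

```python
def table_schema(file_schemas, merge_schema=False):
    """Derive a table schema from per-file schemas (dicts of name -> type)."""
    if not file_schemas:
        return {}
    if not merge_schema:
        # Only the first file's schema is used.
        return dict(file_schemas[0])
    merged = {}
    for schema in file_schemas:
        for name, type_name in schema.items():
            if name in merged and merged[name] != type_name:
                # Assumption: conflicting types are treated as an error.
                raise ValueError("conflicting types for field " + name)
            merged.setdefault(name, type_name)
    return merged


schemas = [
    {"id": "long", "amount": "double"},
    {"id": "long", "currency": "string"},
]
print(table_schema(schemas))
print(table_schema(schemas, merge_schema=True))
```

With mergeSchema left at false, the currency field that appears only in the second file is absent from the table schema; with true, it is included.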


JSON File

A JSON or newline-delimited JSON file (NDJSON).

Specification

{
  "format": "json",
  "path": "<string>",
  "fileName": "<string>",
  "compression": "<string>",
  "encoding": "<string>",
  "multiLine": <boolean>,
  "dateFormat": "<string>",
  "timestampFormat": "<string>"
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards (e.g., *.json). | Required | |
| compression | String | Compression type for source file. Supported compression types: none, bzip2, gzip, lz4, snappy, deflate. Note: this compression is only used when writing files. Magpie will detect the compression of files based on the file extension when reading. | Optional | gzip |
| encoding | String | The file’s encoding. If not specified, encoding is auto-detected. | Optional | |
| multiLine | Boolean | If true, each file will be parsed as a single record (each file is a single JSON record). If false, files are parsed as newline-delimited JSON, with each new line starting a separate JSON record. | Optional | false |
| dateFormat | String | The Java date format used to identify fields as dates within the files. | Optional | yyyy-MM-dd |
| timestampFormat | String | The Java datetime format used to identify fields as timestamps within the files. | Optional | yyyy-MM-dd'T'HH:mm:ss.SSSXXX |
| format | String | File's storage format, json for JSON files. | Required | |
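The difference between the two multiLine settings can be illustrated with Python's standard json module. This is a sketch of the parsing rule as described in the table, not Magpie's reader; the helper name parse_json_file is hypothetical, and skipping blank lines in newline-delimited input is an assumption.

```python
import json


def parse_json_file(text, multi_line=False):
    """Return the records contained in the text of one JSON file."""
    if multi_line:
        # The whole file is a single JSON record.
        return [json.loads(text)]
    # Newline-delimited JSON: every non-blank line is its own record.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


ndjson_text = '{"id": 1}\n{"id": 2}\n'
pretty_text = '{\n  "id": 1,\n  "tags": ["a", "b"]\n}\n'

print(parse_json_file(ndjson_text))
print(parse_json_file(pretty_text, multi_line=True))
```

Parsing the pretty-printed text with multi_line left at false fails in this sketch, because the first line on its own is not a complete JSON value. This is a common reason to set multiLine to true for files that hold one formatted document each.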


Example

{
  "path": "/my_data",
  "fileName": "clicks/*.json",
  "multiLine": true,
  "dateFormat": "yyyy-MM-dd",
  "timestampFormat": "yyyy-MM-dd'T'HH:mm:ss.SSSXXX",
  "format": "json"
}
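The example above uses a glob in fileName. The Python sketch below shows which names such a pattern would select from a hypothetical listing of the folder at path. It uses the standard fnmatch module, whose wildcard rules may differ in detail from the glob implementation Magpie uses; the file names are invented for illustration.

```python
import fnmatch


def matching_files(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return fnmatch.filter(names, pattern)


# Hypothetical names relative to the configured path.
names = [
    "clicks/2021-01-01.json",
    "clicks/2021-01-02.json",
    "clicks/readme.txt",
]
selected = matching_files(names, "clicks/*.json")
print(selected)
```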


ORC File

A columnar storage format.

Specification

{
  "format": "orc",
  "path": "<string>",
  "fileName": "<string>",
  "compression": "<string>"
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards. | Required | |
| compression | String | Compression type for source file. Supported compression types: none, snappy, zlib, lzo. Note: this compression is only used when writing files. Magpie will detect the compression of files based on the file extension when reading. | Optional | snappy |
| format | String | File's storage format, orc for ORC files. | Required | |


Example

{
  "path": "/my_data",
  "fileName": "transactions",
  "format": "orc"
}
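The example above omits compression, so the documented default applies when writing. The Python sketch below shows one way to picture how defaults combine with an explicit definition. The DOCUMENTED_DEFAULTS table copies a subset of the Default column values from this page, and the helper with_defaults is hypothetical; it does not describe how Magpie stores or resolves settings internally.

```python
# A subset of the defaults listed in the tables on this page.
DOCUMENTED_DEFAULTS = {
    "text": {"compression": "gzip"},
    "parquet": {"compression": "snappy", "mergeSchema": False},
    "orc": {"compression": "snappy"},
    "avro": {
        "compression": "snappy",
        "ignoreExtension": False,
        "recordName": "topLevelRecord",
        "recordNamespace": "",
    },
}


def with_defaults(spec):
    """Return spec with documented defaults filled in for missing fields."""
    defaults = DOCUMENTED_DEFAULTS.get(spec.get("format"), {})
    # Explicit values in spec take precedence over defaults.
    return {**defaults, **spec}


orc_spec = {"path": "/my_data", "fileName": "transactions", "format": "orc"}
print(with_defaults(orc_spec))
print(with_defaults({**orc_spec, "compression": "zlib"}))
```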


Avro File

A row-based storage format widely used as a serialization platform.

Specification

{
  "format": "avro",
  "path": "<string>",
  "fileName": "<string>",
  "compression": "<string>",
  "ignoreExtension": <boolean>,
  "recordName": "<string>",
  "recordNamespace": "<string>"
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards. | Required | |
| compression | String | Compression type for source file. Supported compression types: none, snappy, deflate, bzip2, xz. Note: this compression is only used when writing files. Magpie will detect the compression of files based on the file extension when reading. | Optional | snappy |
| ignoreExtension | Boolean | If false, read only .avro files at the specified path. If true, read all files as Avro regardless of the extension. | Optional | false |
| recordName | String | When writing, the name of the top level record to write. | Optional | topLevelRecord |
| recordNamespace | String | When writing, the namespace of the record to write. | Optional | <empty string> |
| format | String | File's storage format, avro for Avro files. | Required | |
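The ignoreExtension rule amounts to a simple filter over the files found at the specified path, sketched below in Python. The helper name avro_inputs and the sample file names are hypothetical, and case-sensitive matching of the .avro extension is an assumption.

```python
def avro_inputs(file_names, ignore_extension=False):
    """Select the files that would be read as Avro."""
    if ignore_extension:
        # Every file is read as Avro, whatever its extension.
        return list(file_names)
    return [name for name in file_names if name.endswith(".avro")]


listing = ["part-0000.avro", "part-0001.avro", "_SUCCESS", "notes.txt"]
print(avro_inputs(listing))
print(avro_inputs(listing, ignore_extension=True))
```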


Example

{
  "path": "/my_data",
  "fileName": "transactions",
  "format": "avro"
}


Delta File

A Delta Lake file.

Specification

{
  "format": "delta",
  "path": "<string>",
  "fileName": "<string>",
  "mergeSchema": <boolean>
}

Structure Values

| Field Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| path | String | Parent path of the file. Can be an empty string. | Required | |
| fileName | String | Name of the file or folder. Can be a glob using wildcards. | Required | |
| mergeSchema | Boolean | If true and this file points to a folder of delta files, Magpie will merge the schema for each file to create the final schema for the table. If false, Magpie will only use the schema of the first file for the table. | Optional | false |
| format | String | File's storage format, delta for Delta Lake files. | Required | |


Example

{
  "path": "/my_data",
  "fileName": "transactions",
  "format": "delta"
}
