Variable Storage Formats
Every use of the DATASET data type must specify a storage format. The STORAGE FORMAT syntax has been extended to support the DATASET data type, and Vantage provides built-in storage formats for it.
The storage format specification does not necessarily affect the data format on disk, but associates particular data with a specific well-known format.
Built-In Storage Formats
Vantage provides the Avro and CSV storage formats for the DATASET data type, which are based on the Apache Avro and CSV specifications. Each instance contains a schema conforming to the specification, although for the CSV storage format the schema is always optional. The schema is interpreted either on a per-instance basis or at the column level.
Storage Format Terminology
Term | Description |
---|---|
Schema | For storage format AVRO, the schema is a JSON document describing the binary-encoded Avro value format. It is specified as JSON text in UTF-8 encoded characters, using a VARBYTE or BLOB data type. For CSV, the schema is a JSON document describing the extended CSV options, such as a field or record delimiter, and column names or header information. It can be specified in any supported JSON format. It is stored in the same character set as the CSV data type for instance-level DATASET values, and as UNICODE text encoded in UTF-8 if stored in the Data Dictionary for column-level DATASET values. |
Binary-encoded Avro Value | The actual Avro data, encoded according to the scheme described by the schema. |
CSV Value | The CSV value in the Latin or Unicode character set. |
JSON-encoded Value | JSON-text representation of the data, as described by the schema. |
Transform format or cast format | For storage format AVRO, this is a null-terminated, UTF-8 encoded schema followed immediately by a binary-encoded value. For CSV, the transform and cast format uses the original CSV value. If a schema is specified for a CSV value, it is not included in the cast or transform. |
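
The AVRO transform and cast format described in the table can be illustrated with a minimal Python sketch. It assumes only what the table states: a UTF-8 JSON schema, a single null terminator, then the binary-encoded value. The function name `split_avro_transform` and the sample record schema are hypothetical and not part of Vantage; the sketch does not decode the Avro value itself.

```python
import json


def split_avro_transform(payload: bytes):
    """Split an AVRO transform/cast payload into (schema, encoded_value).

    Hypothetical helper: assumes the payload is a UTF-8 JSON schema,
    one null byte, then the binary-encoded Avro value.
    """
    # JSON text cannot contain a raw null byte, so the first null byte
    # marks the end of the schema.
    schema_bytes, terminator, value = payload.partition(b"\x00")
    if not terminator:
        raise ValueError("no null terminator found after the schema")
    schema = json.loads(schema_bytes.decode("utf-8"))
    return schema, value


# Hypothetical sample: a record with a single Avro "int" field.
schema_text = (
    '{"type":"record","name":"rec",'
    '"fields":[{"name":"n","type":"int"}]}'
)
# Avro writes the int 1 as the zigzag varint byte 0x02.
payload = schema_text.encode("utf-8") + b"\x00" + b"\x02"

schema, value = split_avro_transform(payload)
```

Splitting on the first null byte is safe for the schema portion only. The binary-encoded value that follows may itself contain null bytes, which is why the sketch uses `partition` rather than `split`.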