Delta Lake Manifest Files Limitations - Analytics Database

Teradata Vantage™ - Native Object Store Getting Started Guide - 17.20

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database, Teradata Vantage
Release Number: 17.20
Published: June 2022
Language: English (United States)
Last Update: 2024-04-05

The Vantage integration with Delta Lake manifest files has known limitations in data consistency, performance, and schema evolution.

Data Consistency

Whenever Delta Lake generates updated manifests, it atomically overwrites existing manifest files. Therefore, Vantage will always see a consistent view of the data files. However, the granularity of the consistency guarantees depends on whether the table is partitioned or not.

Unpartitioned tables: All the file names are written in one manifest file, which is updated atomically. Vantage sees a consistent snapshot of the full table.
Partitioned tables: The manifest is partitioned in the same Hive-partitioning-style directory structure as the original Delta table. Each partition's manifest is updated atomically, so Vantage sees a consistent view of each partition, but not a consistent view across partitions. Furthermore, because the manifests of all partitions cannot be updated together, concurrent attempts to generate manifests can leave different partitions with manifests of different versions. While this consistency guarantee under data change is weaker than that of reading Delta tables with Spark, it is still stronger than that of formats such as Parquet, which do not provide partition-level consistency.
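
The partition-level guarantee can be illustrated with a small sketch. It assumes the manifest layout that Delta Lake generates (a file named manifest directly under the manifest directory for an unpartitioned table, or under one column=value subdirectory per partition), and it uses a local directory in place of object storage. The helper names write_manifest_atomically and read_manifests are hypothetical, and the rename-based overwrite is only one way to get an atomic replace on a POSIX file system; it does not describe how Delta Lake or Vantage is implemented.

```python
import os
import tempfile
from pathlib import Path


def write_manifest_atomically(manifest_path: Path, data_files: list[str]) -> None:
    """Replace one manifest in a single step (hypothetical helper).

    The new content goes to a temporary file in the same directory and is
    then renamed over the old manifest, so a reader sees either the old
    file list or the new one, never a partial file.
    """
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=manifest_path.parent, prefix=".manifest-")
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(data_files) + "\n")
    # Assumption: a POSIX file system, where os.replace is atomic.
    os.replace(tmp_name, manifest_path)


def read_manifests(manifest_root: Path) -> dict[str, list[str]]:
    """Map each partition directory ('' if unpartitioned) to its data files."""
    snapshot = {}
    for manifest in sorted(manifest_root.rglob("manifest")):
        partition = manifest.parent.relative_to(manifest_root).as_posix()
        if partition == ".":
            partition = ""  # manifest sits directly under the root
        snapshot[partition] = manifest.read_text().splitlines()
    return snapshot
```

If only one partition's manifest is rewritten between two calls to read_manifests, the second call returns the new file list for that partition and the old list for every other partition. Each entry is internally consistent, but the snapshot as a whole mixes versions, which is the cross-partition gap described above.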

Depending on the storage system that holds the Delta tables, a query engine that reads the manifest (Vantage, as well as Presto or Athena) can return incorrect results if it queries the manifest while the manifest files are being rewritten. In file system implementations that lack atomic file overwrites, a manifest file may be momentarily unavailable. Therefore, use manifests with caution if their updates are likely to coincide with queries.
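
The "momentarily unavailable" case can be sketched for a client that reads a manifest file itself. This is an illustration only: read_manifest_with_retry is a hypothetical helper, the retry count and delay are arbitrary assumptions, and nothing here reflects how Vantage reads manifests internally.

```python
import time
from pathlib import Path


def read_manifest_with_retry(
    manifest_path: Path, attempts: int = 3, delay_seconds: float = 0.5
) -> list[str]:
    """Read a manifest, retrying if it is briefly missing during a rewrite."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return manifest_path.read_text().splitlines()
        except FileNotFoundError as error:
            # On storage without atomic overwrite, the file can vanish
            # between the delete and the write of the new version.
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

A retry only covers the window in which the file is missing. It does not help if a reader picks up a manifest that lists files from a version that is being replaced, so scheduling manifest generation away from query windows remains the safer approach.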

Performance

Very large numbers of files can hurt Vantage performance. Databricks recommends that you compact the files of the table before generating the manifests. The number of files should not exceed 1000 (for the entire unpartitioned table or for each partition in a partitioned table).
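
As a rough pre-check before pointing Vantage at a table, the file count per manifest can be compared against the limit of 1000 stated above. The sketch below assumes the manifests have been copied or mounted to a local directory, that each manifest lists one data file path per line, and that every manifest file is named manifest; oversized_manifests is a hypothetical helper name.

```python
from pathlib import Path

MAX_FILES_PER_MANIFEST = 1000  # limit taken from the recommendation above


def oversized_manifests(
    manifest_root: Path, limit: int = MAX_FILES_PER_MANIFEST
) -> dict[str, int]:
    """Return {manifest path relative to the root: file count} for manifests over the limit."""
    over = {}
    for manifest in sorted(manifest_root.rglob("manifest")):
        # Count non-empty lines; each one is assumed to name a data file.
        count = sum(1 for line in manifest.read_text().splitlines() if line.strip())
        if count > limit:
            over[manifest.relative_to(manifest_root).as_posix()] = count
    return over
```

Any manifest reported by this check points to a table or partition that should be compacted in Delta Lake, after which the manifests need to be regenerated.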

Schema Evolution

Delta Lake supports schema evolution, and queries on a Delta table automatically use the latest schema regardless of the schema defined for the table in the Hive metastore. However, Vantage uses the schema defined in its own table definition and does not query with the updated schema until that table definition is updated to match the new schema.
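
One way to notice that a table definition has fallen behind is to compare column names on both sides. The sketch below is a minimal illustration under stated assumptions: the two column lists are supplied by the caller (how they are obtained from Delta Lake and from Vantage is outside its scope), names are compared case-insensitively, and data types are ignored because the two systems name their types differently. The function name schema_drift is hypothetical.

```python
def schema_drift(delta_columns: list[str], vantage_columns: list[str]) -> dict[str, list[str]]:
    """Compare column names from the Delta table and the Vantage table definition.

    Returns the columns that exist only on one side, sorted by name.
    Names are lowercased first (assumption: case-insensitive matching).
    """
    delta = {name.lower() for name in delta_columns}
    vantage = {name.lower() for name in vantage_columns}
    return {
        "missing_in_vantage": sorted(delta - vantage),
        "missing_in_delta": sorted(vantage - delta),
    }
```

For example, schema_drift(["id", "amount", "region"], ["ID", "amount"]) reports region under missing_in_vantage, which signals that the Vantage table definition must be updated before queries can see the new column.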