Java OTF performance enhancements shipped in the August release can cause data inaccuracy when working with VARCHAR data in OTF tables. The enhancements use the Iceberg upper_bound metric to determine the size of a VARCHAR column. Although this improved performance, testing found that the upper_bound field cannot always be relied upon to give the maximum length: it is an upper bound on the column's value, not on its size. This can lead to data truncation on certain VARCHAR columns, producing incorrect results.
- Workaround for August Drop
- The performance enhancement that uses the upper_bound metric is enabled by default in the August drop. If a customer has a dataset that was populated with incorrect upper_bound values, the feature can be disabled as a workaround through the DATALAKE USING clause property retrieve_metadata_column_stats('false'). With this setting, the Teradata OTF engine skips the upper_bound check and applies the default VARCHAR(32000) size for STRING columns. Note that this can affect performance, as it effectively disables the enhancement.
The workaround involves dropping and recreating, or replacing, the existing DATALAKE to add the retrieve_metadata_column_stats property.
Use the same DATALAKE definition that was used to create the DATALAKE, and add the retrieve_metadata_column_stats property.
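For the drop-and-recreate path, the sequence can be sketched as follows. This is a minimal sketch: the DATALAKE name is illustrative, and it assumes Teradata's DROP DATALAKE statement is available in your release.

```sql
-- Drop the existing DATALAKE (illustrative name; substitute your own).
DROP DATALAKE TD_SERVER_DB.TPG_DATALAKE_MIXED_ABENCH_UNITY_3000;

-- Recreate it with the ORIGINAL definition, adding this property
-- to the USING clause:
--   retrieve_metadata_column_stats ('false')
```

The full CREATE DATALAKE example below shows where the property is placed within the USING clause.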
Here is an example Iceberg DATALAKE definition with the retrieve_metadata_column_stats property:
```sql
CREATE DATALAKE TD_SERVER_DB.TPG_DATALAKE_MIXED_ABENCH_UNITY_3000
EXTERNAL SECURITY INVOKER TRUSTED CATALOG ICEBERG_CATALOG_UNITY,
EXTERNAL SECURITY INVOKER TRUSTED STORAGE ICEBERG_INVOKER_UNITY
USING
  CATALOG_TYPE ('UNITY')
  CATALOG_LOCATION ('https://adb-8074943983864086.6.azuredatabricks.net/api/2.1/unity-catalog/iceberg')
  UNITY_CATALOG_NAME ('abench_3tb')
  STORAGE_ACCOUNT_NAME ('icebergstorageacct')
  TENANT_ID ('391c8c4c-6a2a-40fd-ab98-9876baa1234')
  default_cluster_id ('0210-232334-abc12345')
  retrieve_metadata_column_stats ('false')
TABLE FORMAT iceberg;
```