Improve Java OTF Performance|Teradata OTF for Apache Iceberg and Delta Lake - Guidelines to Improve Java OTF Performance - Teradata Vantage

Teradata® Open Table Format for Apache Iceberg and Delta Lake User Guide

Deployment
VantageCloud
VantageCore
Edition
VMware
Enterprise
IntelliFlex
Lake
Product
Teradata Vantage
Release Number
20.00
Published
October 2025
ft:locale
en-US
ft:lastEdition
2025-10-25
dita:mapPath
qrj1749167830193.ditamap
dita:ditavalPath
lli1749584660955.ditaval
dita:id
bsr1702324250454

Based on the internal performance testing done by Engineering, these are the recommendations for ensuring the best performance from the Java OTF solution:

Use of Large Primary Clusters

Query performance improved significantly when Abench queries were run on the Large Primary cluster instance. For example, on a Large VantageCloud Lake Azure instance – E96 with 56 AMPs/Node, Abench queries performed ~1.7 - 2x faster than Spark at a 4x price performance.

Rewrite Queries using WITH Clause

Queries that reference the same table multiple times can experience performance issues. This is because it results in invoking the underlying Java table operator multiple times. To improve efficiency, these queries can be rewritten using the WITH clause.​ An example of this is the Abench Query 21 that can be rewritten to use the WITH clause that improves query run time by 80%:

Original Query

/* Query 21 - Var_0 Rev_01 - TPC-H/TPC-R The Suppliers Who Kept Orders Waiting Query – NO WITH CLAUSE / ​
​SELECT S_NAME, COUNT() AS NUMWAIT FROM TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.SUPPLIER, TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.LINEITEM L1, TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.ORDERTBL, TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.NATION​
WHERE​
S_SUPPKEY = L1.L_SUPPKEY AND O_ORDERKEY = L1.L_ORDERKEY​
AND O_ORDERSTATUS='F' AND L1.L_RECEIPTDATE > L1.L_COMMITDATE ​
AND EXISTS (SELECT * FROM ​
TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_300 0.LINEITEM L2 ​
WHERE L2.L_ORDERKEY = L1.L_ORDERKEY ​
AND L2.L_SUPPKEY <> L1.L_SUPPKEY ) ​
AND NOT EXISTS (SELECT * FROM ​
TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_300 0.LINEITEM L3 ​
WHERE L3.L_ORDERKEY = L1.L_ORDERKEY ​
AND L3.L_SUPPKEY <> L1.L_SUPPKEY ​
AND L3.L_RECEIPTDATE > L3.L_COMMITDATE )​
AND S_NATIONKEY = N_NATIONKEY ​
AND N_NAME = 'KENYA’ ​
GROUP BY S_NAME ORDER BY NUMWAIT DESC, S_NAME; 

Query After Rewrite using with Clause

/* Query 21 - Var_0 Rev_01 - TPC-H/TPC-R The Suppliers Who Kept Orders Waiting Query  - USING WITH CLAUSE */​
WITH LINEITEM_DT AS​
  ( SELECT L_SUPPKEY, L_ORDERKEY, L_RECEIPTDATE, L_COMMITDATE FROM​
     TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.LINEITEM)​
SELECT S_NAME, COUNT(*) AS NUMWAIT FROM   ​
   TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.SUPPLIER,     ​
    LINEITEM_DT L1,        ​
    TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.ORDERTBL,       ​
    TPG_DATALAKE_MIXED_ABENCH_UNITY_3000.MIXED_ABENCH_3000.NATION​
WHERE S_SUPPKEY = L1.L_SUPPKEY ​
  AND O_ORDERKEY = L1.L_ORDERKEY​
  AND O_ORDERSTATUS='F’​
  AND L1.L_RECEIPTDATE > L1.L_COMMITDATE​
  AND EXISTS (SELECT * FROM LINEITEM_DT L2​
                WHERE L2.L_ORDERKEY = L1.L_ORDERKEY​
                     AND L2.L_SUPPKEY <> L1.L_SUPPKEY )​
  AND NOT EXISTS (SELECT * FROM LINEITEM_DT L3​
                    WHERE L3.L_ORDERKEY = L1.L_ORDERKEY​
                          AND L3.L_SUPPKEY <> L1.L_SUPPKEY​
                          AND L3.L_RECEIPTDATE > L3.L_COMMITDATE)  AND S_NATIONKEY = N_NATIONKEY    AND N_NAME = 'KENYA’​
   GROUP BY S_NAME   ORDER BY NUMWAIT DESC, S_NAME;

Setting VARCHAR Sizes through DATALAKE Definition & Query Band

All VARCHAR columns are mapped to a String data type in Iceberg / Delta Lake. When these values are read in, they are mapped to a default size of 4K. When these columns are used in a GROUP BY or JOIN condition, the string comparisons for these 4K columns have a severe impact on performance.

There are 2 ways to mitigate this performance hit, provided the actual sizes of the varchar dataset is well known:

  • Defining the DATALAKE with a smaller default mapping
  • Using Query Band to set VARCHAR column sizes per table.

Defining the DATALAKE with a Smaller Default VARCHAR Size Mapping

The DATALAKE definition can be changed to add the sql_varchar_fixed_size_bytes parameter to set a VARCHAR default size. The following example shows the DATALAKE definition using a sql_varchar_fixed_size_bytes of 264.

CREATE DATALAKE UNITY_DATALAKE
 EXTERNAL SECURITY CATALOG ICEBERG_CATALOG_UNITY_SA,
 EXTERNAL SECURITY STORAGE ICEBERG_INVOKER_UNITY_SA
  USING CATALOG_TYPE('UNITY')
   CATALOG_LOCATION('https://adb-807494398312345.6.azuredatabricks.net/api/2.1/unity-catalog/iceberg')
    UNITY_CATALOG_NAME ('tpcds_unity') 
    STORAGE_ACCOUNT_NAME ('icebergstorageacct')
    TENANT_ID ('391c8c4c-6a2a-40fd-ab98-226b6123456')   
    default_cluster_id ('0123-232334-ab0q59t3') 
    sql_varchar_fixed_size_bytes ('264') 
    TABLE FORMAT iceberg;

Using Query Band to Set VARCHAR Column Sizes per Table

When the dataset is well known and the maximum length of various character columns is known, Query band setting can be used to set the individual VARCHAR size columns. This will override the varchar length set in DATALAKE through sql_varchar_fixed_size_bytes USING clause for the specified columns in QueryBand. The scope of QueryBand is session level, which means the SET QUERY_BAND statement needs to be executed on each session.

The property name is changed to have a suffix for an index to support large number of columns from multiple tables. Here is an example with a smaller number of columns involved with just one column_sizes name/value pair.

SET QUERY_BAND='column_sizes1=ITEM.i_item_id:32,ITEM.i_category:100,ITEM.i_class:100,ITEM.i_brand:100,ITEM.i_product_name:100,ITEM.i_color:40,ITEM.i_units:20,ITEM.i_size:40,ITEM.i_item_desc:400;' FOR SESSION;

When a large number of varchar columns need size adjustment, they may not all fit into a single name/value pair due to query band property size limitations. In such cases, you should create multiple column_sizes name/value pairs. For example: column_sizes1=<value1>; column_sizes2=<value2>; column_sizes3=<value3>, and so on, within the same query band statement. below is an example with multiple name/value pairs.

SET QUERY_BAND='column_sizes1=NATION_iceberg.N_NAME:28,PARTTBL_iceberg.P_MFGR:28,REGION_iceberg.R_NAME:22;column_sizes2=SUPPLIER_iceberg.S_NAME:36,SUPPLIER_iceberg.S_ADDRESS:80,SUPPLIER_iceberg.S_NATIONKEY:4,SUPPLIER_iceberg.S_PHONE:15,SUPPLIER_iceberg.S_COMMENT:200;' FOR SESSION;