Join Index Storage | Database Design | Teradata Vantage - 17.10 - Join Index Storage - Advanced SQL Engine - Teradata Database

Teradata Vantage™ - Database Design

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Release Date
July 2021
Content Type
User Guide
Publication ID
B035-1094-171K
Language
English (United States)

The storage organization for join indexes supports a row compressed format to reduce storage space.

If you know that a join index contains groups of rows with repeating information, then its definition DDL can specify repeating groups, indicating the repeating columns in parentheses. The column list is specified as two groups of columns, with each group stipulated within parentheses. The first group contains the non-repeating columns and the second group contains the repeating columns.

You can elect to store join indexes in value order, ordered by the values of a 4-byte column. Value-ordered storage provides better performance for queries that specify selection constraints on the value ordering column. For example, suppose a common task is to look up sales information by sales date. You can create a join index on the sales table and order it by sales date. The benefit is that queries that request sales by sales date only need to access those data blocks that contain the value or range of values that the queries specify.

Physical Join Index Row Compression

Compression refers to a logical row compression in which multiple sets of nonrepeating column values are appended to a single set of repeating column values. This allows the system to store the repeating value set only once, while any nonrepeating column values are stored as logical segmental extensions of the base repeating set.

A physical join index row has:

  • One set of values for the columns in the column_1 list. When a column_2 list is specified, the columns specified in the column_1 list are the data that is compressed for rows that have the same values for the columns in the column_1 list.
  • Multiple sets of values of the columns in the column_2 list for rows that have the same column_1 list values.

For example, if a logical join result has n rows with the same fixed part value, then there is one corresponding physical join index row that includes n repeated parts in the physical join index. A physical join index row represents one or more logical join index rows. The number of logical join index rows represented in the physical row depends on how many repeated values are stored.

Compression is only done on rows inserted by the same INSERT statement. Newly inserted rows are not added as logical rows to existing compressed rows.

When the number of repeated values associated with a given fixed value exceeds the system row size limit, the join index row is automatically split into multiple physical rows, each having the same fixed value but different lists of repeated values. Note that a given logical join index result row cannot exceed the system 1 MB row size limit.

Guidelines for Using Row Compression With Join Index Columns

A column set with a high number of distinct values cannot be row compressed because it rarely (and in the case of a primary key, never) repeats. Other column sets, notably foreign key and status code columns, are generally highly non-distinct: their values repeat frequently. Row compression capitalizes on the redundancy of frequently repeating column values by storing them once in the fixed part of the index row along with multiple repeated values in the repeated part of the index row.

Typically, primary key table column values are specified as the repeating part and foreign key table columns are specified in the fixed part of the index definition.

Using Outer Joins to Define Join Indexes and Creating Join Indexes Using Outer Joins are examples of how you can use row compression to take advantage of repeated column values.

A row-compressed join index might be used for a query when the join index would be used if it were not row compressed, Check the EXPLAIN for a query to see if join indexes are being used.

Determining the Space Overhead for a Hash or Join Index

The ROWID, if included, is always 10 bytes long for hash indexes. For join indexes, the space overhead for the ROWID is 10 bytes for 2-byte partitioning and 16 bytes for 8-byte partitioning.

The following equation provides an estimate of the space overhead required for an uncompressed hash index. Double the result if you define fallback on the index.

Uncompressed hash or join index size = N × (F + O)

where:

Symbol Description
N Cardinality of the base table.
F Length of the columns, including base row identifier information.
O Row overhead.
Assume the following row overhead for a join index depending on whether it is partitioned or not:
  • 12 bytes for an nonpartitioned join index
  • 14 bytes for a join index with 2-byte partitioning
  • 20 bytes for a join index with 8-byte partitioning

The following equation provides an estimate of the space overhead required for a row-compressed hash or join index. Double the result if you define fallback on the index.

Row-compressed hash or join index size = U × (F + O + (R × A))

where:

Symbol Description
F Length of the fixed column column_name.

For a hash index, this length includes the base row identifier information in the fixed column.

R Length of a single repeating column column_name.

For a hash index, this length includes the base row identifier information in the repeating column.

A Average number of repeated columns for a given value in column_1_name.
U Number of unique values in the specified column_1_name
O Row overhead.
Assume the following row overhead for a hash or join index depending on whether it is partitioned or not:
  • 12 bytes for an nonpartitioned hash or join index
  • 14 bytes for a join index with 2-byte partitioning
  • 20 bytes for a join index with 8-byte partitioning

Considerations for Measuring Disk Space

Keep the following rules of thumb in mind whenever you perform these calculations.

  • Implement your measurements over a very large number of rows to avoid distorting the figures with table overhead issues.
  • An INTEGER column uses 4 bytes.

    A SUM(INTEGER) column uses 8 bytes because the system always casts it to FLOAT.

  • If you have not explicitly specified a COUNT column in your aggregate join index definition, add 4 bytes to the definition to account for the required COUNT column.

Example Measurement 1

Table A and Table B both have 60 million 100-byte rows. An aggregate join index with two INTEGER columns and one SUM(INTEGER) column.

From these figures, the computed size of the aggregate join index is as follows:

Row size = (2 × 4 INTEGER bytes) + 4 COUNT bytes + 8 SUM bytes + 14 overhead bytes = 34 bytes

Type of Join Index Join Index Row
Non-aggregate For every matching row in the base table.
Aggregate Per group of rows in the join of the base tables.

The permanent space specification is as follows:

Current Permanent Space (Bytes) Peak Permanent Space (Bytes) Maximum Permanent Space (Bytes)
2,041,595,904 2,041,595,904 0

Using these figures, you can compute a measured row size:

Row size = 2,041,595,904 ÷ 60,000,000 = 34.03 bytes

Because the measured and computed disk spaces are identical within a narrow confidence interval, it is safe to conclude that the disk space formula is accurate for an aggregate join index.

Example Measurement 2

This example examines the size of a simple join index created for the same tables used in Example Measurement 1 above. There is a savings of 8 bytes because there is no COUNT column and no SUM(INTEGER) column.

The computed row size is as follows:

Row size = 34 bytes – (4 INTEGER bytes + 4 SUM(INTEGER) bytes) = 26 bytes

The permanent space specification is as follows.

Current Permanent Space (Bytes) Peak Permanent Space (Bytes) Maximum Permanent Space (Bytes)
1,561,216,000 1,561,216,000 0

Using these figures, you can compute a measured row size.

Row size = 1,561,216,000 ÷ 60,000,000 = 26.02

As with Example Measurement 1, the measured and computed disk spaces are identical, and you can conclude that the disk space formula is accurate for a simple join index.