16.10 - Duplicate Row Checks for NUPIs - Teradata Database

Teradata Database Design

Product: Teradata Database
Release Number: 16.10
Release Date: June 2017
Content Type: User Guide
Publication ID: B035-1094-161K
Language: English (United States)

Teradata Database tables defined with the SET option in the CREATE TABLE table kind clause do not permit duplicate rows. Any time a row is inserted into a SET table that has a NUPI and no uniqueness constraints or unique indexes, the system performs a duplicate row check.
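The following DDL sketch, using illustrative table and column names, contrasts the cases: a SET table with a NUPI and no uniqueness constraints (every insert of a repeated index value triggers a duplicate row check), a MULTISET table (duplicate rows are permitted, so no check is performed), and a SET table with a unique secondary index (the system can enforce uniqueness through the index rather than a full duplicate row check).

  -- SET table with a NUPI on customer_id and no uniqueness constraints:
  -- inserts of repeated customer_id values require a duplicate row check.
  CREATE SET TABLE orders_set (
    order_id    INTEGER,
    customer_id INTEGER,
    order_date  DATE
  )
  PRIMARY INDEX (customer_id);

  -- MULTISET table: duplicate rows are permitted, so no duplicate row check.
  CREATE MULTISET TABLE orders_multiset (
    order_id    INTEGER,
    customer_id INTEGER,
    order_date  DATE
  )
  PRIMARY INDEX (customer_id);

  -- A unique secondary index on the SET table gives the system a uniqueness
  -- constraint to check instead of performing a full duplicate row check.
  CREATE UNIQUE INDEX (order_id) ON orders_set;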

When Teradata Database checks for duplicate rows, it considers nulls to be equal, while in SQL conditions, null comparisons evaluate to UNKNOWN. See Inconsistencies in How SQL Treats Nulls for information about how nulls are interpreted by commercially available relational database management systems.
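A minimal sketch of the difference, using a hypothetical SET table named null_demo with a NUPI on column a and no unique indexes:

  CREATE SET TABLE null_demo (
    a INTEGER,
    b INTEGER
  )
  PRIMARY INDEX (a);

  INSERT INTO null_demo VALUES (1, NULL);

  -- In an SQL condition, a comparison involving a null evaluates to UNKNOWN,
  -- so this query returns no rows:
  SELECT * FROM null_demo WHERE b = b;

  -- The duplicate row check, by contrast, treats the nulls as equal:
  -- this row is identical to the one already stored, so the SET table
  -- rejects it as a duplicate.
  INSERT INTO null_demo VALUES (1, NULL);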

Performance Effects of Duplicate NUPI Row Checks

Duplicate row checking can be a very I/O-intensive operation. The example below illustrates the number of logical reads performed for the duplicate row checks required by a given number of duplicate NUPI row insertions under the following conditions:

  • The table is a SET table (no duplicate rows permitted).
  • There are no previous hash synonyms for the NUPI value being inserted.

The center column indicates the number of logical reads the system must perform before the last row counted in the left column can be inserted: to insert the Nth duplicate row, the N-1 rows already stored with the same row hash must first be read.

The right column indicates the cumulative number of logical reads required to insert all of the rows counted in the left column, which is the sum 0 + 1 + ... + (N-1), or N(N-1)/2.
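As a quick arithmetic check of the table that follows, the illustrative query below reproduces the cumulative values for 50, 500, and 1,000 inserted rows from the N(N-1)/2 relationship.

  -- Cumulative logical reads to insert N duplicate NUPI rows = N*(N-1)/2
  SELECT   50 * (  50 - 1) / 2 AS cum_reads_50,     --   1,225
          500 * ( 500 - 1) / 2 AS cum_reads_500,    -- 124,750
         1000 * (1000 - 1) / 2 AS cum_reads_1000;   -- 499,500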

Duplicate NUPI Row Read I/O as a Function of Number of Rows Inserted

This table illustrates the absolute and cumulative number of read I/Os required as a function of the number of duplicate NUPI rows inserted into a table. Because all rows with the same NUPI value hash to the same AMP, these figures do not depend on the number of AMPs in the configuration, and the magnitude of the effect is the same regardless of system size.

As you can see, the read I/O burden is tractable up to about 100 duplicate NUPI rows inserted. Beyond that point, the read I/O requirements are increasingly intractable.

Beyond that point, the performance of any system worsens noticeably as additional duplicate NUPI rows are added.

Number of Rows Inserted    Logical Reads Required Before This Row Can Be Inserted    Cumulative Number of Logical Reads
1                          0                                                          0
2                          1                                                          1
3                          2                                                          3
4                          3                                                          6
5                          4                                                          10
6                          5                                                          15
7                          6                                                          21
8                          7                                                          28
9                          8                                                          36
10                         9                                                          45
20                         19                                                         190
30                         29                                                         435
40                         39                                                         780
50                         49                                                         1,225
60                         59                                                         1,770
70                         69                                                         2,415
80                         79                                                         3,160
90                         89                                                         4,005
100                        99                                                         4,950
200                        199                                                        19,900
300                        299                                                        44,850
400                        399                                                        79,800
500                        499                                                        124,750
600                        599                                                        179,700
700                        699                                                        244,650
800                        799                                                        319,600
900                        899                                                        404,550
1,000                      999                                                        499,500

Guidelines for Keeping NUPI Duplicates Below 100 Rows Per Value

As the previous table shows, once more than a very few rows share a NUPI value, the number of logical reads required to insert additional duplicate NUPI rows is prohibitive if duplicate row checks must be performed.

The file system must read all the rows on an AMP that have the same row hash, in order to determine their uniqueness values, before it can insert a new NUPI duplicate row. Beyond 100 rows per value, the performance decrement created by this activity becomes significant, so you should always limit NUPI duplicates to fewer than 100 rows per value for any table. This figure depends on data block and row sizes: the larger the row, the fewer rows fit into a single data block, and the larger the data block, the more rows, on average, fit into it. Keep this in mind as you read this section: the 100 rows per value measure is a relative, not an absolute, criterion.
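To check whether an existing SET table already exceeds this guideline, a simple aggregate over the primary index column reports how many rows share each NUPI value. The sketch below assumes the hypothetical orders_set table and its customer_id NUPI from the earlier example.

  -- Find NUPI values that exceed the 100-rows-per-value guideline.
  SELECT customer_id,
         COUNT(*) AS rows_per_value
  FROM orders_set
  GROUP BY customer_id
  HAVING COUNT(*) > 100
  ORDER BY rows_per_value DESC;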

The figure of 100 duplicates per NUPI value is based on the overwhelming likelihood that the duplicate rows for such a value spill over into more than five data blocks. When the number of duplicate rows per NUPI value in a table exceeds 100, performance degrades significantly. The specific operations affected are described in the following table.

Operation                Cause of Performance Degradation
Updates                  Increased I/O
Inserts and FastLoads    Increased comparisons, resulting in greater I/O activity and CPU usage

Note that performance of the following operations also degrades significantly when the number of duplicate rows per NUPI value exceeds 100:

  • Restore operation of the Archive/Recovery utility
  • Table rebuilds using the Table Rebuild utility

These performance problems persist even for the unlikely case where every NUPI value in a table has the same number of duplicates, thus eliminating skew as a factor in the decrement.

Number of Reads for Inserting 10, 100, and 1,000 NUPI Duplicate Rows

Note that a single order of magnitude increase in the number of duplicate NUPI row inserts into a table, from 10 to 100, results in roughly a two orders of magnitude increase in the cumulative number of block reads that must be performed (from 45 to 4,950).

When duplicate NUPI inserts increase by another order of magnitude, from 100 to 1,000, the cumulative read count (499,500) is roughly four orders of magnitude greater than the number of reads required to insert 10 duplicate NUPI rows (45).

The count of cumulative read operations in the following table is a measure of data block reads, not individual row reads. For example, if there are 100 rows in a data block, then only one block read is required to retrieve all 100 rows.

Table 1

Duplicate NUPI Rows to Insert (N)    log10 N    Cumulative Data Block Read Operations    log10 (Cumulative Reads)
10                                   1          45                                       1.6532
100                                  2          4,950                                    3.6946
1,000                                3          499,500                                  5.6985
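The figures in Table 1 can be reproduced directly from the N(N-1)/2 relationship; the sketch below assumes Teradata's LOG function, which returns the base-10 logarithm.

  -- Reproduce Table 1: N, log10(N), cumulative reads N*(N-1)/2, and its log10.
  SELECT   10 AS n, LOG(  10) AS log_n,     45 AS cum_reads, LOG(    45) AS log_cum_reads
  UNION ALL
  SELECT  100,      LOG( 100),            4950,              LOG(  4950)
  UNION ALL
  SELECT 1000,      LOG(1000),          499500,              LOG(499500);

The expected logarithm columns are 1 and 1.6532, 2 and 3.6946, and 3 and 5.6985, matching Table 1.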

Cumulative Number of Data Block Read I/Os as a Function of Duplicate NUPI Inserts

A log-log plot of the data from the preceding table shows the rapid growth in the cumulative number of data block read I/Os required to process duplicate NUPIs as a function of the number of duplicate NUPI rows on the AMP: because the cumulative count is N(N-1)/2, it grows roughly with the square of the number of duplicate rows. As before, these are data block reads, not individual row reads.