15.00 - Disadvantages of Other Row Partitioning Methods - Teradata Database

Teradata Database Design

Teradata Database
User Guide

Disadvantages of Other Row Partitioning Methods

There are several problems with traditional data placement schemes, particularly in a VLDB environment.

  • Data set partitioning guarantees hot spots and interconnect log jams.
  • This method requires the DBA to batch data as it arrives in the database and then manually control how it is partitioned. While it is possible to maintain a balanced data load with these scheme, it is not possible to maintain a balanced processing load.
  • Table rows are received, batched, and partitioned serially. This means that the newest data is always clustered together. Because a high percentage of data warehouse processing involves comparing current data with historical data to detect trends, this results in the majority of users attempting to access the same, co-located, data simultaneously.

  • VLDB data warehouse environments are typically configured as massively parallel or clustered processing units. This requires some sort of network interconnect channel.
  • There is no way to know whether rows from tables that are to be joined are co-located or not, and there is no way to ensure that they are. This means that table joins in a data set partitioning situation typically need to transmit vast quantities of data across their interconnect channels in order to join tables, severely reducing system throughput in the process.

  • Pure range partitioning is not a useful method for most data warehouse applications. When data access is repeatedly based on the same constraints, then range partitioning is an excellent choice. Data can be partitioned into multiple sets based on the repeatedly used constraints.
  • As the following list of potential problems indicates, there are more problems with pure range partitioning in a massively parallel data warehouse environment than there are solutions provided.

  • Business data is rarely well-balanced. Because of this, the distribution of the data must be carefully analyzed in order to determine how best to partition the data across network nodes.
  • Some way must be found to ensure that commonly joined table rows are co-located in order to minimize traffic on the interconnect channels.
  • Some method of balancing table data across multiple nodes must be determined.
  • Because the demographics of data always change as a function of time, these considerations must be revisited repeatedly to ensure continued smooth performance.
  • Notice that all these issues require intensive intervention on the part of the DBA. Distribution and other demographic data must be collected and analyzed (and an inexpensive, yet reliable method of doing the collection and analysis must be found and implemented), algorithms must be discovered or developed and then tested, and every aspect of the data must be monitored continually.

    The Teradata Database solution to range partitioning is the partitioned primary index, which hashes table rows to the AMPs using the same rowhash method that assigns nonpartitioned primary index rows, but adds the ability to further assign those rows to user‑defined range partitions. See “Row-partitioned and Nonpartitioned Primary Indexes” on page 266 for further information about partitioned primary indexes.

  • Random partitioning can have disadvantages.
  • Random, or round-robin, partitioning is formally related to hash partitioning. Unlike hashing, which uses an algorithm with known partitioning properties to distribute table rows, random partitioning uses a random number generator to distribute rows. The resulting distribution is even, but unrepeatable. This method makes it impossible to know where a table row is stored, so it can never be accessed directly.

    Random partitioning may cause data to be redistributed for join and aggregate processing, resulting in suboptimal system performance.

  • Schema partitioning almost always requires extensive redistribution of table rows for join and aggregate processing in a networked configuration.
  • This method, which is a scheme to assign specific table groups (“schemas”) to specific physical processors or nodes, has proven useful for optimizing the retrieval performance of specific tables in small, single node systems.

    When applied to a multiple node parallel environment, however, its deficiencies are readily apparent.

  • Most join cases require all rows to be redistributed across the interconnect for each query, reducing system throughput in the process.
  • Node balancing problems are always made worse when schema partitioning is used for anything other than very small tables.