Disadvantages of Other Row Partitioning Methods
There are several problems with traditional data placement schemes, particularly in a very large database (VLDB) environment.
Table rows are received, batched, and partitioned serially, so the newest data is always clustered together. Because a high percentage of data warehouse processing compares current data with historical data to detect trends, the majority of users end up attempting to access the same co-located data simultaneously.
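The contrast can be sketched as follows. This is a minimal illustration, not Teradata's actual implementation: the 8-AMP system size is an assumption, and MD5 stands in for the real rowhash algorithm.

```python
import hashlib

NUM_AMPS = 8  # illustrative cluster size, not a Teradata default

def hash_amp(key):
    """Stand-in rowhash: map a key to an AMP deterministically."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS

# Serial placement clusters an entire load batch on one unit of
# parallelism, so the newest (hottest) rows become a hot spot.
todays_batch = list(range(1000))  # hypothetical row ids from one load
serial_units = {0}                # the whole batch lands together

# Hashing the same batch on a well-distributed key spreads it
# across the AMPs instead.
hashed_units = {hash_amp(row_id) for row_id in todays_batch}
```

With serial placement, queries against recent data all converge on the unit holding the latest batch; with hash placement, the same queries engage every AMP in parallel.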
There is no way to know whether rows from tables that are to be joined are co-located, and no way to ensure that they are. As a result, table joins under such partitioning schemes typically must transmit vast quantities of data across the interconnect, severely reducing system throughput in the process.
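The co-location guarantee that hashing provides can be sketched as follows. This is an illustrative model only: the AMP count, the table names, and the use of MD5 in place of the real rowhash are all assumptions.

```python
import hashlib

NUM_AMPS = 8  # illustrative; production systems configure many more

def amp_for(join_key):
    """Stand-in rowhash: the same join-key value always hashes to
    the same AMP, regardless of which table the row belongs to."""
    digest = hashlib.md5(str(join_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS

# Hypothetical rows keyed by customer_id in two tables.
orders     = [10042, 10057, 10042]
line_items = [10042, 10057]

# Because both tables hash on the same key, matching rows land on
# the same AMP and the join can run AMP-locally, with no
# interconnect traffic.
co_located = all(
    amp_for(k) == amp_for(k) for k in set(orders) & set(line_items)
)
```

A placement scheme without this property must redistribute one or both tables over the interconnect before every join on that key.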
As the following list of potential problems indicates, pure range partitioning creates more problems in a massively parallel data warehouse environment than it solves.
Notice that all these issues require intensive intervention on the part of the DBA. Distribution and other demographic data must be collected and analyzed (and an inexpensive, yet reliable method of doing the collection and analysis must be found and implemented), algorithms must be discovered or developed and then tested, and every aspect of the data must be monitored continually.
The Teradata Database solution to range partitioning is the partitioned primary index, which hashes table rows to the AMPs using the same rowhash method that assigns nonpartitioned primary index rows, but adds the ability to further assign those rows to user‑defined range partitions. See “Row-partitioned and Nonpartitioned Primary Indexes” on page 266 for further information about partitioned primary indexes.
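The two-level placement described above can be sketched as follows. This is a conceptual model, not Teradata's implementation: the AMP count, the monthly RANGE_N-style partitioning expression, and the MD5 stand-in for the rowhash are all assumptions.

```python
import hashlib
from datetime import date

NUM_AMPS = 8  # illustrative cluster size

def rowhash_amp(pi_value):
    """Level 1: hash the primary index value to an AMP, exactly as
    for a nonpartitioned primary index."""
    digest = hashlib.md5(str(pi_value).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS

def range_partition(order_date, start=date(2020, 1, 1)):
    """Level 2: map the partitioning column to a monthly range
    partition number (in the spirit of a RANGE_N expression)."""
    return ((order_date.year - start.year) * 12
            + (order_date.month - start.month) + 1)

def place_row(pi_value, order_date):
    """A row's placement is (AMP, partition): the AMP comes from the
    rowhash, the partition from the user-defined ranges. The two
    assignments are independent."""
    return rowhash_amp(pi_value), range_partition(order_date)
```

A query that filters on the partitioning column can then skip every partition outside the requested range on every AMP, while even data distribution across AMPs is still governed by the hash.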
Random, or round-robin, partitioning is formally related to hash partitioning. Unlike hashing, which distributes table rows using an algorithm with known partitioning properties, random partitioning distributes rows using a random number generator. The resulting distribution is even but not repeatable: because there is no way to determine where a given table row is stored, a row can never be accessed directly.
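The repeatability difference can be shown in a few lines. As before, the AMP count and the MD5 stand-in for a real hash function are illustrative assumptions.

```python
import hashlib
import random

NUM_AMPS = 8  # illustrative cluster size

def hash_place(key):
    """Repeatable: the same key always maps to the same AMP, so a
    single row can be located later without scanning every AMP."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS

def random_place(_key):
    """Even on average, but unrepeatable: nothing records where the
    row went, so direct single-row access is impossible."""
    return random.randrange(NUM_AMPS)
```

With `hash_place`, recomputing the hash at query time locates the row; with `random_place`, the only way to find a row is to ask every AMP.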
Random partitioning may cause data to be redistributed for join and aggregate processing, resulting in suboptimal system performance.
This method, a scheme that assigns specific table groups (“schemas”) to specific physical processors or nodes, has proven useful for optimizing the retrieval performance of specific tables in small, single-node systems.
When applied to a multiple node parallel environment, however, its deficiencies are readily apparent.