This topic presents guidelines for analyzing column distribution demographics when selecting optimal primary index columns.
Among the column demographics retrieved by the COLLECT STATISTICS (Optimizer Form) statement, several figures are particularly useful for monitoring how evenly your primary index columns distribute rows across the AMPs.
The relevant demographics are the following.
- Number of distinct values within a primary index column
- Number of rows per distinct value
- Number of rows partly null
- Number of rows all null
- Typical number of rows per distinct value
Number of Distinct Values
This number reports how many distinct values a particular primary index column contains.
With respect to NUPI behavior, the closer the ratio of distinct index column values to table cardinality is to 1.0, the better. For a UPI, this value is always 1.0 by definition.
|FOR this type of primary index …||The ratio of distinct index column values to table cardinality is …|
|nonunique||best for distribution when it is close to 1.0 and usually, but not necessarily, worse the further away from 1.0 it becomes.|
The evenness of the distribution of rows per value is also an important factor: the more uniform the number of rows per value across all values, the more evenly the rows distribute across the AMPs.
With respect to space utilization, the number of distinct values should always be very much greater than the number of AMPs in the configuration to ensure that each table distributes its rows to all AMPs.
While this might not be possible to achieve for very small tables, it is an absolute necessity for anything larger.
If the number of rows per distinct value for a column is significantly larger than 1, then it might not be a good candidate for a primary index.
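The measures described above can be sketched in a few lines of code. This is a minimal illustration, not Teradata internals: the function name and the sample column data are hypothetical, and only the demographic definitions (distinct-value count, distinct-to-cardinality ratio, maximum rows per value) come from the text.

```python
from collections import Counter

def pi_demographics(values):
    """Summarize candidate primary index column demographics:
    distinct-value count, the ratio of distinct values to table
    cardinality, and the largest number of rows sharing one value."""
    counts = Counter(values)
    cardinality = len(values)
    return {
        "cardinality": cardinality,
        "distinct_values": len(counts),
        # Closer to 1.0 is better for a NUPI; exactly 1.0 for a UPI.
        "ratio": len(counts) / cardinality,
        "max_rows_per_value": max(counts.values()),
    }

# Hypothetical columns: a unique one versus a low-cardinality, skewed one.
upi_like = pi_demographics([1, 2, 3, 4, 5, 6, 7, 8])       # ratio 1.0
skewed   = pi_demographics(["CA"] * 6 + ["NY", "TX"])      # ratio 0.375
```

A column whose `ratio` is far below 1.0, or whose `max_rows_per_value` is much larger than 1, is usually a weaker primary index candidate, for the reasons given above.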
Number of Rows Per Distinct Value
This value reports the largest number of rows per primary index column value.
|FOR this type of primary index …||The value for this measure is …|
|nonunique||better as it approaches 1 and usually, but not necessarily, worse as it gets larger.|
If the maximum number of rows per column value is much larger than 1, then the column is often not a good candidate for a primary index. However, because the evenness of the distribution of rows per value is itself an important factor, a large maximum number of rows per distinct value does not necessarily indicate that the column set is a poor choice for a primary index: what matters is whether most values carry a similar number of rows. The severity of the penalty paid for larger values is a function of several variables, including the cardinality of the table, the number of AMPs in the configuration, and so on.
In the past, it was commonly stated that if the typical quantity of rows per column value does not fit into a single data block, then the column set is not a good candidate for a primary index. With the significant increase in data block size now used by most sites, the evaluation of this measure is less certain to provide strong guidance in picking a primary index column set.
You can graph these figures to provide an easy-to-comprehend view of column value distributions. In the example provided here, a NUPI on an attribute called State was analyzed. Note the exceedingly skewed distribution, which suggests that State is not a good candidate for the primary index on this table.
The following table indicates the raw and logarithmic row cardinalities per state code value.
|State Code||Number of Rows||log10(Number of Rows)|
The following graph displays the row cardinalities on a logarithmic scale. Notice how skewed the distribution is even when displayed on a logarithmic scale.
Number of Null Rows
This value reports the number of rows having a null primary index.
|FOR this type of primary index …||The value for this measure is …|
|unique||better as it approaches 0 and usually, but not necessarily, worse as it gets larger.|
A very large number reflects a spike in the distribution, indicating the likelihood of serious problems with uneven consumption of disk space, because all rows with a null primary index hash to the same AMP (see Number of Distinct Values for a description of the problem).
If the number of rows with null primary index columns is significantly larger than 0, then the column set is not a good candidate for a primary index. See Designing for Missing Information for a description of the many problems nulls cause for database management and suggestions about how to minimize or eliminate nulls from your tables.
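The spike described above can be sketched with a toy hash. This is an illustration only, not Teradata's actual hashing: the function and AMP count are hypothetical, and the point is simply that every null collapses to one hash value, so all null-PI rows land on a single AMP.

```python
def amp_for(value, n_amps=4):
    """Toy row-to-AMP mapping: all NULLs map to one bucket,
    non-null values spread across AMPs by their hash."""
    return 0 if value is None else hash(value) % n_amps

# Hypothetical table: 50 rows with a null PI value, 50 distinct non-null rows.
rows = [None] * 50 + list(range(50))
per_amp = {}
for v in rows:
    amp = amp_for(v)
    per_amp[amp] = per_amp.get(amp, 0) + 1
# AMP 0 receives every null row on top of its normal share of the rest.
```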
Effects of Skew on Query Processing Summarized
In the final analysis, a skewed rows per value measure does not necessarily indicate a problem. It is often possible to achieve an even distribution of rows across AMPs, and evenly distributed workloads, when executing a query against skewed data. As long as the Optimizer has good statistics to work with, it can generate good query plans even when the distribution of table rows is skewed.
It is always better if the data is not skewed, but Teradata Database is better equipped to deal with skew than the database management systems of other vendors, and can often process skewed data successfully while other systems fail the task. Sometimes data is just naturally skewed, and the Optimizer has no choice but to deal with it.
SQL Scripts For Detecting Skew
Note that skew can be measured in several different ways.
- Rows per value
- Rows per hash bucket
- Rows per AMP
- Rows per row partition
- Rows per row partition per AMP
- Rows per row partition per hash bucket
- Rows per row partition per value
Each type of skew can have a different effect on the query plan the Optimizer chooses, as does the concentration of the relevant rows within the data blocks.
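Several of the measures listed above can be computed directly once each row is described by its AMP, row partition, and index value. The tuples below are hypothetical sample data used only to show the counting; none of it comes from a real system.

```python
from collections import Counter

# Hypothetical rows as (amp, row_partition, index_value) tuples.
rows = [
    (0, 1, "CA"), (0, 1, "CA"), (1, 1, "NY"),
    (2, 2, "TX"), (3, 2, "CA"), (0, 2, "CA"),
]

rows_per_value = Counter(v for _, _, v in rows)              # rows per value
rows_per_amp = Counter(a for a, _, _ in rows)                # rows per AMP
rows_per_partition = Counter(p for _, p, _ in rows)          # rows per row partition
rows_per_partition_per_amp = Counter((a, p) for a, p, _ in rows)
```

Each resulting Counter exposes a different axis of skew; comparing its largest and smallest counts shows how uneven that particular distribution is.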
If you are analyzing the demographics of an existing table, whether production or prototype, you can use the following set of useful scripts written by the Teradata technical support team to check for data skew. You can adjust the details of these statements to suit the needs of your site.
This is a good practice to undertake when new applications are being loaded on the system. It is also good practice to run these queries regularly if there are many data changes.
The following query identifies tables that are not evenly distributed. Ideally, variance should be less than 5%. In this query, variance is set to 1000%, which generally indicates that some or many AMPs have no rows from the table in question. You can use the BTEQ command RETLIMIT to limit the number of rows returned.
SELECT (MAX(CurrentPerm) - MIN(CurrentPerm)) * 100 / (NULLIF(MIN(CurrentPerm),0)) (NAMED variance) (FORMAT 'zzzzz9.99%'),
       MAX(CurrentPerm) (TITLE 'Max') (FORMAT 'zzz,zzz,zzz,999'),
       MIN(CurrentPerm) (TITLE 'Min') (FORMAT 'zzz,zzz,zzz,999'),
       TRIM(DatabaseName)||'.'||TableName (NAMED Tables)
FROM DBC.TablesizeV
WHERE DatabaseName NOT IN ('CrashDumps','DBC')
GROUP BY DatabaseName, TableName
HAVING SUM(CurrentPerm) > 1000000
   AND variance > 1000
ORDER BY Tables;
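The variance measure in the preceding query is simply (MAX − MIN) × 100 / MIN over each table's per-AMP CurrentPerm values. A minimal sketch of that arithmetic, with hypothetical byte counts (the function name is illustrative):

```python
def perm_variance(current_perm):
    """(MAX - MIN) * 100 / MIN over per-AMP CurrentPerm values;
    returns None when MIN is 0, mirroring NULLIF(MIN(CurrentPerm), 0)."""
    lo, hi = min(current_perm), max(current_perm)
    if lo == 0:
        return None
    return (hi - lo) * 100 / lo

even   = perm_variance([1000, 1020, 990, 1010])   # roughly 3%: well distributed
skewed = perm_variance([5000, 100, 100, 100])     # 4900%: badly skewed
```

Values under 5% indicate good distribution; a result over 1000% usually means one or more AMPs hold few or no rows of the table.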
Use the following query to display the detailed distribution of a table that has been identified as having a skewed distribution:
SELECT Vproc, CurrentPerm
FROM DBC.TablesizeV
WHERE DatabaseName = '<databasename>'
AND   TableName = '<tablename>'
ORDER BY 1;
The following query reports the row distribution of a table by AMP:
SELECT dt1.a (TITLE 'AMP'), dt1.b (TITLE 'Rows'),
       ((dt1.b / dt2.x (FLOAT)) - 1.0) * 100 (FORMAT '+++9%', TITLE 'Deviation')
FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(<index>))), COUNT(*)
      FROM <databasename>.<tablename>
      GROUP BY 1) AS dt1 (a,b),
     (SELECT (COUNT(*) / (HASHAMP()+1) (FLOAT))
      FROM <databasename>.<tablename>) AS dt2 (x)
ORDER BY 2 DESC, 1;
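The Deviation column in the query above is ((rows on this AMP ÷ mean rows per AMP) − 1.0) × 100. A sketch of that computation with hypothetical per-AMP counts (note the query derives the mean as total rows divided by the number of AMPs in the system; here we assume every AMP appears in the input mapping):

```python
def amp_deviation(rows_per_amp):
    """Percentage deviation of each AMP's row count from the mean,
    matching ((rows / mean) - 1.0) * 100 in the SQL above."""
    mean = sum(rows_per_amp.values()) / len(rows_per_amp)
    return {amp: (n / mean - 1.0) * 100 for amp, n in rows_per_amp.items()}

dev = amp_deviation({0: 150, 1: 100, 2: 50})   # mean is 100 rows per AMP
```

A well-distributed table shows deviations near 0% on every AMP; large positive deviations identify hot AMPs.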
The following query reports the distribution by AMP of the specified index or column.
SELECT HASHAMP(HASHBUCKET(HASHROW(<index or column>))), COUNT(*)
FROM <databasename>.<tablename>
GROUP BY 1
ORDER BY 2 DESC;
The following query reports the number of row hash collisions for the specified index or column.
SELECT HASHROW(<index or column>), COUNT(*)
FROM <databasename>.<tablename>
GROUP BY 1
HAVING COUNT(*) > 10
ORDER BY 1;
The following query reports the number of AMPs and the number of rows a given query accesses.
LOCKING TABLE <tablename> FOR ACCESS
SELECT COUNT(dt.ampNum) (TITLE '#AMPS'),
       SUM(dt.numRows) (TITLE '#ROWS')
FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(<index>))), COUNT(*)
      FROM <tablename>
      WHERE <selection criteria>
      GROUP BY 1) AS dt (ampNum, numRows);