Column Distribution Demographics and Primary Index Selection - Advanced SQL Engine

Column Distribution Demographics and Primary Index Selection - Advanced SQL Engine - Teradata Database

Database Design

Product

Advanced SQL Engine

Teradata Database

Release Number

17.10

Published

July 2021

Language

English (United States)

Last Update

2021-07-27

dita:mapPath

kko1591750222108.ditamap

dita:ditavalPath

kko1591750222108.ditaval

dita:id

B035-1094

lifecycle

Product Category

Teradata Vantage™

Effects of Skew on Query Processing

A skewed rows per value measure does not necessarily indicate a problem. It is often possible to have an even distribution of rows across AMPs and evenly distributed workloads when executing a query against skewed data. As long as the Optimizer has good statistics to work with, it is quite good at generating good query plans even when the distribution of table rows is skewed.

It is always better if the data is not skewed, but Vantage is equipped to deal with skew, and can often process skewed data successfully. Sometimes data is just naturally skewed, and the Optimizer has no choice but to deal with it.

SQL Scripts For Detecting Skew

Notice that several different measures of skew can be made.

Rows per value
Rows per hash bucket
Rows per AMP
Rows per row partition
Rows per row partition per AMP
Rows per row partition per hash bucket
Rows per row partition per value

Each type of skew can have a different effect on the query plan the Optimizer chooses, as does the concentration of the relevant rows within the data blocks.

If you are analyzing the demographics of an existing table, whether production or prototype, you can use the following set of useful scripts written by the Teradata technical support team to check for data skew. You can adjust the details of these statement to suit the needs of your site.

This is a good practice to undertake when new applications are being loaded on the system. It is also good practice to run these queries regularly if there are many data changes.

The following query identifies tables that are not evenly distributed. Ideally, variance should be less than 5%. In this query, variance is set to 1000%, which generally indicates that some or many AMPs have no rows from the table in question. You can use the BTEQ command RETLIMIT to limit the number of rows returned.

SELECT (MAX(CurrentPerm) - MIN(CurrentPerm)) * 100
        /(NULLIF(MIN(currentperm),0))(NAMED variance)
       (FORMAT ‘zzzzz9.99%’),MAX(CurrentPerm)(TITLE ‘Max’)
       (FORMAT ‘zzz,zzz,zzz,999’),MIN(currentperm)
       (TITLE ‘Min’)(FORMAT ‘zzz,zzz,zzz,999’),
        TRIM(DatabaseName)||’.’||TableName (NAMED Tables)
FROM DBC.TablesizeV
GROUP BY DatabaseName, TableName
HAVING SUM(CurrentPerm) > 1000000
AND    variance > 1000
WHERE DatabaseName NOT IN(‘CrashDumps’,’DBC’)
ORDER BY Tables;

Use the following query to display the detailed distribution of a table that has been identified as having a skewed distribution:

SELECT vproc, CurrentPerm
FROM DBC.TablesizeV
WHERE DatabaseName = ‘<databasename>’
AND   TableName = ‘<tablename>’
ORDER BY 1;

The following query reports the row distribution of a table by AMP:

SELECT dt1.a (TITLE ’AMP’), dt1.b (TITLE ’Rows’),
      ((dt1.b/dt2.x (FLOAT)) - 1.0)*100 (FORMAT’+++9%’,
      TITLE ’Deviation’)
FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(<index>))),COUNT(*)
      FROM <databasename>.<tablename>
      GROUP BY 1) AS dt1 (a,b),
     (SELECT (COUNT(*) / (HASHAMP()+1)(FLOAT))
      FROM <databasename>.<tablename>) AS dt2(x)
ORDER BY 2 DESC,1;

The following query reports the distribution by AMP of the specified index or column.

SELECT HASHAMP(HASHBUCKET(HASHROW(<index or column>))) ,COUNT(*)
FROM <databasename>.<tablename>
GROUP BY 1
ORDER BY 2 DESC;

The following query reports the number of row hash collisions for the specified index or column.

SELECT HASHROW(index or column), COUNT(*)
FROM <databasename>.<tablename>
GROUP BY 1
ORDER BY 1
HAVING COUNT(*) > 10;

The following query reports the number of AMPs and the number of rows a given query accesses.

LOCKING TABLE <tablename> FOR ACCESS
SELECT COUNT(dt.ampNum)(TITLE ‘#AMPS’),
       SUM(dt.numRows)(TITLE ‘#ROWS’)
FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(<index>))), COUNT(*)
      FROM <tablename>
      WHERE <selection criteria>
      GROUP BY 1)AS dt (ampNum, numRows);