Data Skew | Teradata Vantage - Determining if Data Skew Might Impact Performance

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10

In some cases, transfer of data between Advanced SQL Engine and ML Engine results in uneven distribution of data across ML Engine worker nodes, which can significantly decrease the performance (calculation speed) of some analytic functions.

If you observe slow performance, use the following procedure to determine whether data skew might be the cause.

Examine the distribution of data across workers:

```
SELECT * FROM RowDistribution (ON input_table);
```

The output is similar to this:

```
 task_index | ip_address  | row_count 
------------+-------------+-----------
          0 | 172.24.0.15 |        10
          1 | 172.24.0.15 |        10
          2 | 172.24.1.17 |        10
          3 | 172.24.1.17 |        10
(4 rows)
```

This cluster has two worker nodes, each of which has two vworkers.

Calculate a rough estimate of the data skew:
1. Calculate the mean of the row_count values (m).
2. Find the row_count value farthest from the mean (d).
3. Calculate the skew, defined as (|m-d|)/d.
Example:
```
 task_index | ip_address  | row_count 
------------+-------------+-----------
          0 | 172.24.0.15 |       500
          1 | 172.24.0.15 |       400
          2 | 172.24.1.17 |       400
          3 | 172.24.1.17 |       600
(4 rows)
```
1. Calculate the mean of the row_count values:
  m = (500+400+400+600)/4 = 475
2. The row_count value farthest from the mean (475) is 600, so d = 600.
3. Calculate the skew:
  (|m-d|)/d = (|475-600|)/600 = 125/600 ≈ 0.21

Postrequisite: If the skew is greater than 0.1, Teradata recommends using the UniqueID syntax element to ensure that the data are evenly distributed.
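
The skew calculation and the 0.1 threshold can also be scripted once the row_count values have been copied out of the RowDistribution output. The following Python sketch is illustrative only and is not part of the Teradata product. The names `estimate_skew`, `exceeds_threshold`, and `SKEW_THRESHOLD` are hypothetical. The handling of ties (the first farthest value wins) and of a farthest value of 0 (reported as infinite skew) are assumptions, because the procedure above does not define those cases.

```python
import math
from statistics import mean

# Threshold above which the procedure recommends using UniqueID.
SKEW_THRESHOLD = 0.1


def estimate_skew(row_counts):
    """Return the rough skew estimate (|m-d|)/d for a list of row counts.

    m is the mean of the row counts and d is the row count farthest
    from the mean.
    """
    if not row_counts:
        raise ValueError("row_counts must not be empty")
    m = mean(row_counts)
    # Assumption: if several values are equally far from the mean,
    # the first one encountered is used as d.
    d = max(row_counts, key=lambda count: abs(count - m))
    if d == 0:
        # Assumption: a farthest value of 0 would divide by zero, so the
        # result is reported as infinite skew (or no skew if m is also 0).
        return 0.0 if m == 0 else math.inf
    return abs(m - d) / d


def exceeds_threshold(row_counts, threshold=SKEW_THRESHOLD):
    """Return True if the estimated skew is greater than the threshold."""
    return estimate_skew(row_counts) > threshold


# Row counts taken from the worked example above.
example_counts = [500, 400, 400, 600]
print(f"skew = {estimate_skew(example_counts):.2f}")
print("UniqueID recommended:", exceeds_threshold(example_counts))
```

For the worked example, the script reports a skew of 0.21, which is above the threshold. For the evenly distributed output shown earlier (10 rows per vworker), the estimate is 0.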