Data Explorer - Frequency Analysis

Teradata Warehouse Miner User Guide - Volume 1: Introduction and Profiling

brand: Software
prodname: Teradata Warehouse Miner
vrm_release: 5.4.4
category: User Guide
featnum: B035-2300-077K

If a Frequency analysis is requested, and the option to “Compute unique values for each column selected” is also requested along with the Values analysis, a Frequency analysis is performed on every requested numeric and date type column that has no more than a user-specified number of unique values (by default 20), and on every requested character type column that has no more than a user-specified number of unique values (by default 100). Character type columns with more values can be analyzed with a restricted Frequency analysis, which returns only “prominent” values, that is, values that occur in at least a user-specified x% of rows (by default 1%), provided the ratio of unique values to rows is less than (100 - x)% (by default 99%). The option to perform a restricted Frequency analysis, as well as the threshold values described above, can be set on the expert options tab.
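The selection rules above can be sketched in a few lines of Python. This is only an illustration of the thresholds as described in the text, not the product's implementation: the function names, the type labels, and the `restricted_enabled` flag are hypothetical, and the defaults are the ones quoted above (20, 100, and x = 1%).

```python
# Illustrative sketch only; names and type labels are assumptions,
# not Teradata Warehouse Miner identifiers.

def choose_frequency_mode(col_type, unique_values, row_count,
                          restricted_enabled=True,
                          max_numeric_date=20, max_character=100,
                          x_percent=1.0):
    """Return 'regular', 'restricted', or 'none' for one column."""
    if col_type in ("numeric", "date"):
        return "regular" if unique_values <= max_numeric_date else "none"
    if col_type == "character":
        if unique_values <= max_character:
            return "regular"
        if restricted_enabled and row_count > 0:
            # Ratio of unique values to rows, expressed as a percentage.
            ratio_percent = 100.0 * unique_values / row_count
            if ratio_percent < 100.0 - x_percent:
                return "restricted"
    return "none"


def prominent_values(counts, row_count, x_percent=1.0):
    """Keep only values occurring in at least x% of the rows."""
    threshold = row_count * x_percent / 100.0
    return {value: n for value, n in counts.items() if n >= threshold}
```

For example, a character column with 500 unique values in 1,000 rows has a ratio of 50%, which is below 99%, so under these assumptions it qualifies for the restricted analysis; a column with 995 unique values does not.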

If both restricted and regular frequency processing are to be performed, restricted frequency processing is performed first in order to facilitate restart processing, should it become necessary. Once restricted frequency processing is complete, a strategy for efficiently calculating regular frequencies must be chosen. One strategy is simply to calculate each frequency individually (i.e., one column at a time). The other strategy is to combine columns into an intermediate table of counts and then select individual column frequencies from the intermediate table. This can enhance performance dramatically when there are not too many combinations of values and there are enough rows to make the effort worthwhile. Too many combined values can, however, greatly degrade performance.
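The combining strategy can be pictured with a small in-memory sketch. It assumes rows are plain Python dictionaries and uses hypothetical function names; the product performs the equivalent work in the database, so this only illustrates why one pass over the data can serve several columns.

```python
from collections import Counter

# Illustrative sketch only; function names are assumptions.

def combined_counts(rows, columns):
    """One pass over the data: count each distinct combination of values."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def column_frequencies(combined, columns):
    """Derive each column's frequencies from the intermediate counts."""
    freqs = {c: Counter() for c in columns}
    for combo, n in combined.items():
        for c, value in zip(columns, combo):
            freqs[c][value] += n
    return freqs


rows = [
    {"gender": "M", "region": "E"},
    {"gender": "F", "region": "E"},
    {"gender": "M", "region": "W"},
    {"gender": "M", "region": "E"},
]
intermediate = combined_counts(rows, ["gender", "region"])
frequencies = column_frequencies(intermediate, ["gender", "region"])
```

The intermediate table has at most one entry per combination of values, so the second step is cheap when combinations are few. When the number of combinations approaches the number of rows, the intermediate table is nearly as large as the data and the advantage disappears.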

Two parameters control the calculation strategy for regular frequency processing.
  • The minimum number of rows required to use the combining strategy, by default 25,000.
  • The maximum number of possible combined values in combined columns, by default 10,000.
To apply the second parameter, the columns to analyze are first placed in ascending order by number of values, as previously calculated in the Values analysis. The number of possible combined values is then calculated as the running product of the number of values in successive columns. As many columns as possible are combined without exceeding the maximum number of combined values. Any leftover columns are processed individually.
Data is inserted into a volatile table first to avoid lock contention on the final result table when multiple threads are used. The two threshold values above can also be set on the expert options tab.
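The column-combining procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's algorithm: the function name is hypothetical, a single combined group is formed, the row threshold is treated as inclusive, and a "group" of fewer than two columns is treated as no combining at all.

```python
# Illustrative sketch only; defaults are the ones quoted in the text.

def plan_regular_frequencies(value_counts, row_count,
                             min_rows=25_000, max_combined=10_000):
    """Return (combined_columns, individual_columns).

    value_counts maps column name -> number of unique values,
    as a Values analysis would have computed it.
    """
    # Ascending order by number of unique values.
    ordered = sorted(value_counts, key=value_counts.get)

    # Too few rows: combining is not worth the effort.
    if row_count < min_rows:
        return [], ordered

    combined = []
    product = 1  # running product of unique-value counts
    for name in ordered:
        if product * value_counts[name] > max_combined:
            break
        product *= value_counts[name]
        combined.append(name)

    # Assumption: combining a single column gains nothing.
    if len(combined) < 2:
        return [], ordered
    return combined, ordered[len(combined):]


plan = plan_regular_frequencies(
    {"state": 50, "gender": 2, "segment": 10, "zip": 900},
    row_count=100_000,
)
```

In this example the running product is 2, then 20, then 1,000; including `zip` would raise it to 900,000, which exceeds 10,000, so `gender`, `segment`, and `state` are combined and `zip` is processed individually.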