Data Explorer - Histogram Analysis

Teradata Warehouse Miner User Guide - Volume 1Introduction and Profiling

brand
Software
prodname
Teradata Warehouse Miner
vrm_release
5.4.4
category
User Guide
featnum
B035-2300-077K

If requested, a Histogram analysis is performed on every numeric and date column for which a Frequency analysis was not performed. The analysis is performed either with a user specified number of equal-width bins (by default 10) between the minimum and maximum values encountered in the data, or with the same number of quantile bins, depending on option settings. There are many similarities in Histogram and Frequency processing. Histogram processing may even be thought of as Frequency processing where the data is first ‘bin-coded’. That is, numeric values are first replaced with the histogram bin or bar that they fall into.

As with a Frequency analysis, a strategy for efficiently calculating histograms or bins must be determined. One strategy is simply to calculate each histogram individually (i.e., one column at a time). The other strategy is to combine columns into an intermediate table of counts and then select individual column histogram data from the intermediate table. This can enhance performance dramatically in cases where there are not too many combinations of bins and where there are enough rows to make the effort worth while. Too many combined bin values can however lead to greatly degraded performance.

The same two parameters used to control the calculation strategy in a Frequency analysis are used here as well.
  • The minimum number of rows to use the combining strategy with, by default 25000.
  • The maximum number of possible combined values (in this case, bins) in combined columns, by default 10,000.
It is not necessary to order the columns based on number of values because all columns have the same number of potential bins, by default 10. The number of possible combined values is calculated as the running product of the number of bins in successive columns. As many columns are combined as possible without exceeding the parameter for the maximum number of combined values. Any left over single columns are processed individually.
Data is inserted into a volatile table first to avoid lock contention on the final result table when multiple threads are used. Also, the threshold values underlined above can be set on the expert options tab.

Columns of type DATE are handled by subtracting the date ‘1900-01-01’ from the date and from the minimum and maximum values, so that the calculations are based on the number of days since 1900.