Adaptive Histogram

Teradata Warehouse Miner User Guide - Volume 1 Introduction and Profiling

Brand: Software
Product: Teradata Warehouse Miner
Release: 5.4.4
Category: User Guide
Feature number: B035-2300-077K

The Adaptive Histogram analysis supplements the Histogram analysis by offering options to further subdivide the distribution. The analysis works with a frequency percentage above which a value is treated as a Spike, and a similar percentage above which a bin is treated as Overpopulated. A Spike is a specific value of a variable at which a disproportionately large (user-defined) number of rows occurs, while an Overpopulated Bin is a range of values of a variable that contains a disproportionately large (user-defined) number of rows.

When spikes or overpopulated bins are found, the analysis modifies the computed equal-sized bins to include a separate bin for each spike value and to further subdivide each overpopulated bin, returning counts and boundaries for each resulting bin. An overpopulated bin is subdivided in two steps. First, the bin is divided into the same number of bins. Second, the region around the mean value within the bin, from the mean minus the standard deviation to the mean plus the standard deviation, is divided into that same number of bins; where this region extends outside the original bin, the bin boundary is used in its place. The two sets of boundaries are then merged. Subdividing may optionally be done using quantiles, giving approximately equally populated bins.
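The subdivision described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not the SQL that Teradata Warehouse Miner generates: the function name `adaptive_edges` and the parameters `num_bins`, `spike_pct`, and `overpop_pct` are hypothetical, percentages are taken over all rows, spike rows are assumed to be excluded before bins are tested for overpopulation, and the quantile option is not shown.

```python
from collections import Counter
from statistics import mean, pstdev


def adaptive_edges(values, num_bins=4, spike_pct=20.0, overpop_pct=50.0):
    """Return (spikes, edges) for a simplified adaptive histogram sketch."""
    n = len(values)
    lo, hi = min(values), max(values)

    # A spike is a single value holding more than spike_pct percent of rows.
    counts = Counter(values)
    spikes = sorted(v for v, c in counts.items() if 100.0 * c / n > spike_pct)
    rest = [v for v in values if v not in spikes]  # assumption: spikes removed

    # Start from equal-width bins over the full range.
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    result = set(edges)

    for i in range(num_bins):
        b_lo, b_hi = edges[i], edges[i + 1]
        last = i == num_bins - 1
        members = [v for v in rest if b_lo <= v < b_hi or (last and v == b_hi)]
        if 100.0 * len(members) / n <= overpop_pct:
            continue  # bin is not overpopulated, leave it alone

        # Step 1: divide the overpopulated bin into num_bins equal parts.
        sub = (b_hi - b_lo) / num_bins
        result.update(b_lo + j * sub for j in range(1, num_bins))

        # Step 2: divide the region mean -/+ standard deviation, clipped
        # to the bin boundaries, into num_bins parts and merge the edges.
        m, s = mean(members), pstdev(members)
        r_lo, r_hi = max(b_lo, m - s), min(b_hi, m + s)
        if r_hi > r_lo:
            r_sub = (r_hi - r_lo) / num_bins
            result.update(r_lo + j * r_sub for j in range(num_bins + 1))

    return spikes, sorted(result)
```

For example, `adaptive_edges([7] * 6 + [1, 2, 3, 4, 40, 100], spike_pct=40.0, overpop_pct=30.0)` reports `7` as a spike and returns extra boundaries inside the first equal-width bin, where the values 1 through 4 are concentrated.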

Adaptive binning is useful for an initial investigation of the distribution of one or more columns in a table, in order to decide what analysis to perform next. Without adaptive binning, spike values and overpopulated bins can distort the bin counts because they are not separated or subdivided. However, adaptive binning does not offer many of the specialized options of the normal Histogram analysis, such as binning by width, quantile, or boundary, or binning over multiple dimensions. It also does not allow the use of overlay or statistics on other columns.

Beginning range values are generally inclusive and ending range values are generally exclusive, with the following exceptions:
  • The last ending range value is inclusive.
  • The ending range value of a spike is inclusive (because the beginning and ending values of a spike are the same).
  • The beginning range value of a bin that follows and adjoins a spike is exclusive (since this value is the same as the spike value).
  • The ending range value of a quantile sub-bin is inclusive.
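The boundary rules above can be expressed as a small membership test. The following is a minimal sketch in Python; the function `in_bin` and its flags `is_last`, `follows_spike`, and `is_quantile_sub_bin` are hypothetical names introduced only for illustration and are not part of the product.

```python
def in_bin(value, begin, end, *, is_last=False, follows_spike=False,
           is_quantile_sub_bin=False):
    """Return True if value falls in the bin [begin, end) after applying
    the documented exceptions to the default inclusive/exclusive rule."""
    # A spike has identical beginning and ending values; both are inclusive.
    if begin == end:
        return value == begin

    # The beginning is inclusive unless the bin adjoins a preceding spike,
    # in which case the spike value itself belongs to the spike bin.
    lower_ok = value > begin if follows_spike else value >= begin

    # The ending is exclusive unless this is the last bin or a quantile
    # sub-bin, whose ending values are inclusive.
    upper_ok = value <= end if (is_last or is_quantile_sub_bin) else value < end

    return lower_ok and upper_ok
```

With these rules, a spike at 5 followed by the bin from 5 to 10 assigns the value 5 only to the spike: `in_bin(5, 5, 5)` is true, while `in_bin(5, 5, 10, follows_spike=True)` is false.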

An optional WHERE clause may be used to reduce the range of bins or to reduce the rows to bin in some other way.

The Adaptive Histogram analysis is parameterized by specifying the table and column(s) to analyze, the options unique to the Adaptive Histogram analysis, the desired results, and SQL or Expert Options.

For general information about output, see OUTPUT Tab.