Histogram

Teradata Warehouse Miner User Guide - Volume 1Introduction and Profiling

brand
Software
prodname
Teradata Warehouse Miner
vrm_release
5.4.4
category
User Guide
featnum
B035-2300-077K

Histogram analysis is designed to study the distribution of continuous numeric or date values in a column by providing the data necessary to create a histogram graph. This type of analysis is sometimes also referred to as binning because it counts the occurrence of values in a series of numeric ranges called bins. The histogram analysis provided in Teradata Warehouse Miner is particularly rich in functionality, providing a number of ways to define bins and allowing multi-dimensional binning, overlaying of categorical data, and the calculation of numeric statistics within bins.

Given a table name, the name of one or more numeric columns, and either the desired number of equal sized data bins, the desired number of bins with a nearly equal number of values, a desired width, or the specific boundaries, the Histogram analysis separates the data to show its distributional properties. It does this by separating the data by “bin” number and gives counts and percentages over the requested rows. Percentages will always sum to 100%. Separate options are available to specify a number of equal sized data bins in which the analysis determines the minimum and maximum value, as well as a user specified minimum and maximum value. If the minimum and maximum are specified, all values less than the minimum are put in to “bin 0,” while all values greater than the maximum are put in to “bin N+1.” The same is true when the boundary option is specified.

The Histogram analysis optionally provides subtotals within each bin of the count, percentage within the bin, and percentage overall for each value or combination of values of one or more overlaid columns. Another option is provided to collect simple statistics for a binned column or another column of numeric or date type within the table, providing the minimum, maximum, mean, and standard deviation. When statistics are collected for a date type column, the standard deviation is given in units of days.

Part of the information returned for each bin is the beginning and ending range value delimiting the bin. In general, beginning range values are inclusive and ending range values are exclusive except for the last bin. One exception to this is that if equally populated bins are requested the ending range values are inclusive since they are actually the maximum values in the ranges. Another exception is that the beginning and ending values of bins of a date type column may be truncated to a whole date value when equal width or specified width bins are requested. When this happens to an ending range value, the value is inclusive. If this happens to a beginning range value, the value is exclusive. Therefore, since it will probably not be obvious when truncation has occurred, beginning and ending date range values for equal width or specified width bins should be considered approximate.

If multiple columns are requested, the select is repeated for each column, unless the Cross-tab (Multidimensional analysis) option is selected. In this case, all columns are cross-tabulated with one another within a single select statement. If the create table option is requested along with multiple columns, the create table occurs only once with the insert/select repeated.

An optional WHERE clause may be used to reduce the range of bins or to reduce the rows to bin in some other way.