Histogram - Teradata Warehouse Miner

Teradata Profiler Plug-in User Guide

Product
Teradata Warehouse Miner
Release Number
5.4.4
Published
July 2017
Language
English (United States)
Last Update
2018-05-03
dita:mapPath
tjx1484331808948.ditamap
dita:ditavalPath
ft:empty
dita:id
B035-2304
lifecycle
previous
Product Category
Software

Histogram analysis studies the distribution of continuous numeric or date values in a column by providing the data necessary to create a histogram graph. This type of analysis is also referred to as binning because it counts the occurrence of values in a series of numeric ranges called bins. The Histogram analysis provided in Teradata Profiler is rich in functionality, providing a number of ways to define bins and allowing multi-dimensional binning, overlaying of categorical data, and the calculation of numeric statistics within bins.

Given a table name, the name of one or more numeric columns, and either the number of equal-sized data bins, the number of bins with a nearly equal number of values, a desired width, or the specific boundaries, the Histogram analysis separates the data to show its distributional properties. It does this by separating the data by bin number and gives counts and percentages over the requested rows. Percentages always add up 100%.

Separate options are available to specify a number of equal-sized data bins in which the analysis determines the minimum and maximum value, as well as a user-specified minimum and maximum value. If the minimum and maximum are specified, all values less than the minimum are put in to bin 0, while all values greater than the maximum are put in to bin N+1. The same is true when the boundary option is specified.

The Histogram analysis provides subtotals within each bin of the count, percentage within the bin, and percentage overall for each value or combination of values of one or more overlaid columns. Another option collects simple statistics for a binned column or another column of numeric or date type within the table, providing the minimum, maximum, mean, and standard deviation. When statistics are collected for a date type column, the standard deviation is given in units of days.

Part of the information returned for each bin is the beginning and ending range value delimiting the bin. In general, beginning range values are inclusive and ending range values are exclusive except for the last bin. One exception is that if equally-populated bins are requested, the ending range values are inclusive since they are actually the maximum values in the ranges. Another exception is that the beginning and ending values of bins of a date type column may be truncated to a whole date value when equal width or specified width bins are requested. When this happens to an ending range value, the value is inclusive; when this happens to a beginning range value, the value is exclusive. Therefore, since it is not obvious when truncation has occurred, beginning and ending date range values for equal width or specified width bins should be considered approximate.

If multiple columns are requested, the select is repeated for each column. If the create table option is requested along with multiple columns, the create table occurs only once with the insert/select repeated.

An optional WHERE clause can reduce the range of bins or the rows to bin in some other way.