5.4.2 - Histogram - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

prodname
Teradata Warehouse Miner
vrm_release
5.4.2
created_date
October 2016
category
User Guide
featnum
B035-2306-106K

Purpose

Histogram analysis is designed to study the distribution of continuous numeric or date values in a column. This type of analysis is sometimes also referred to as binning because it counts the occurrence of values in a series of numeric ranges called bins. The histogram analysis provides a number of ways to define bins, allowing multi-dimensional binning, overlaying of categorical data, and the calculation of numeric statistics within bins.

Given the desired number of equal sized data bins, the desired number of bins with a nearly equal number of values, a desired width, or the specific boundaries, the Histogram analysis separates the data to show its distributional properties. It does this by separating the data by bin number and gives counts and percentages over the requested rows. Percentages will always sum to 100%. Separate options are available to specify a number of equal sized data bins in which the analysis determines the minimum and maximum value, as well as a user specified minimum and maximum value. If the minimum and maximum are specified, all values less than the minimum are put in to bin 0, while all values greater than the maximum are put in to bin N+1. The same is true when the boundary option is specified.

The Histogram analysis optionally provides subtotals within each bin of the count, percentage within the bin and percentage overall for each value or combination of values of one or more overlaid columns. Another option is provided to collect simple statistics for a binned column or another column of numeric or date type within the table, providing the minimum, maximum, mean, and standard deviation. When statistics are collected for a date type column, the standard deviation is given in units of days.

A Histogram analysis can be performed on columns of numeric or date data type.

Syntax

call twm. td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;Optional Parameters;');

Required Parameters

columns
The columns to be analyzed. For example, columns=c1,c2,c3. If columns=all is entered, all columns in the input table are analyzed.
database
The database containing the table to be analyzed.
histogram
The histogram parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes
tablename
The table containing the columns to be analyzed.

Optional Parameters

bins style
There are a number of options available to define a bin style. Only one of these options can be used at a time. If you do not specify a bins style the default will be bins=10.
Option Description/Example
bins The number of equal width bins to create.

bins=5

widths The width of the bins to create.

widths=100

quantiles The number of approximately equally populated bins to create.

qualtiles=10

boundaries Specific boundaries that define the bins.

For example, boundaries=0,50,100,150 provides 3 bins between 0 and 150 (0 to 50, 50 to 100, and 100 to 150)

binwithminmax The number of bins spanning a range specified by the minimum and maximum values.

For example, binwithminmax=5,0,200 creates 5 bins ranging from 0 to 200.

outputdatabase
This parameter specifies the name of the database that will contain the analysis results table.
outputtablename
Specifies the name of the table that will store the analysis results. If this parameter is not supplied, the results are returned as a result set.
overlaycolumns
An overlay column is typically a categorical variable with only a few values. If an overlay column is specified, frequencies within each bin are calculated for each value of that overlay column (frequencies for crosstabs of values are given if more than one overlay column is requested). A specific column can be used in either Overlay Columns or Statistics Columns, but not both.
statisticscolumns
This parameter represents a list of numeric columns/aliases for which simple statistics will be calculated (minimum, maximum, mean and standard deviation) in each bin. This option is not available for DATE columns. A specific column can be used in either Statistics Columns or Overlay Columns but not both.
histogram style
Option to set the histogram style to one of two possible values:
Style Description
basic Option to create a histogram for individual columns (default.
crosstab Option to create a multi-dimensional histogram by combining columns.
where
Option to have the specified SQL WHERE clause generated within the Histogram SQL to filter rows selected for analysis. For example: where=cust_id > 0.

Examples

The examples in this section demonstrate how to use the Histogram analysis. These examples assume that the td_analyze function has been installed in a database named twm.

The following is a simple case with no special options (and 10 equal width bins by default).

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;');

The following demonstrates the use of the overlay columns option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;overlaycolumns=gender;');

The following demonstrates the use of the statistics columns option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;statisticscolumns=years_with_bank;');

The following demonstrates the use of the crosstab option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;style=crosstab;');

The following demonstrates the use of the where clause option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;where=income > 0;');

The following demonstrates the use of the bins option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;bins=5;');

The following demonstrates the use of the widths option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;widths=25000;');

The following demonstrates the use of the quantiles option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;quantiles=4;');

The following demonstrates the use of the boundaries option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;boundaries=0, 50000, 100000, 150000;');

The following demonstrates the use of the bin with min and max option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;binwithminmax=10,0, 200000;');

The following example creates a result table with 5 equal width bins.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;bins=5;outputdatabase=twm_results;outputtablename=_twm_histogram;');