5.4.5 - Histogram - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

prodname
Teradata Warehouse Miner
vrm_release
5.4.5
created_date
February 2018
category
User Guide
featnum
B035-2306-028K

Purpose

Histogram analysis studies the distribution of continuous numeric or date values in a column. This type of analysis is also referred to as binning because it counts the occurrence of values in a series of numeric ranges called bins. The histogram analysis provides a number of ways to define bins, allowing multi-dimensional binning, overlaying of categorical data, and the calculation of numeric statistics within bins.

Given the desired number of equal sized data bins, the desired number of bins with a nearly equal number of values, a desired width, or the specific boundaries, the Histogram analysis separates the data to show its distributional properties. It does this by separating the data by bin number and gives counts and percentages over the requested rows. Percentages always sum to 100%. Separate options are available to specify a number of equal sized data bins in which the analysis determines the minimum and maximum value, as well as a user-specified minimum and maximum value. If the minimum and maximum are specified, all values less than the minimum are put in bin 0, while all values greater than the maximum are put in bin N+1. The same is true when the boundary option is specified.

The Histogram analysis optionally provides subtotals within each bin of the count, percentage within the bin and percentage overall for each value or combination of values of one or more overlaid columns. Another option is provided to collect simple statistics for a binned column or another column of numeric or date type within the table, providing the minimum, maximum, mean, and standard deviation. When statistics are collected for a date type column, the standard deviation is given in units of days.

A Histogram analysis can be performed on columns of numeric or date data type.

Syntax

call twm. td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;Optional Parameters;');

Required Parameters

columns
The columns to analyze. For example, columns=c1,c2,c3. Enter columns=all is entered, all columns in the input table are analyzed.
database
The database containing the table to analyze.
histogram
The histogram parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes
tablename
The table containing the columns to analyze.

Optional Parameters

bins style
Use these options to define a bins style. Only one option can be used at a time. If you do not specify a bins style, the default is bins=10.
Option Description/Example
bins The number of equal width bins to create.

For example, bins=5.

widths The width of the bins to create.

For example, widths=100.

quantiles The number of approximately equally populated bins to create.

For example, quantiles=10.

boundaries Specific boundaries that define the bins.

For example, boundaries=0,50,100,150 provides 3 bins between 0 and 150 (0 to 50, 50 to 100, and 100 to 150).

binwithminmax The number of bins spanning a range specified by the minimum and maximum values.

For example, binwithminmax=5,0,200 creates 5 bins ranging from 0 to 200.

outputdatabase
Specifies the name of the database to contain the analysis results table.
outputtablename
Specifies the name of the table to store the analysis results. If not supplied, the results are returned as a result set.
overlaycolumns
A categorical variable with only a few values. If an overlay column is specified, frequencies within each bin are calculated for each value of that overlay column (frequencies for crosstabs of values are given if more than one overlay column is requested). A specific column can be used in either Overlay Columns or Statistics Columns, but not both.
overwrite

When overwrite is set to true (default), the output tables are dropped before creating new ones.

statisticscolumns
Represents a list of numeric columns/aliases for which simple statistics are calculated (minimum, maximum, mean and standard deviation) in each bin. This option is not available for DATE columns. A specific column can be used in either Statistics Columns or Overlay Columns but not both.
histogram style
Sets the histogram style to one of two possible values:
Style Description
basic Creates a histogram for individual columns (default).
crosstab Creates a multi-dimensional histogram by combining columns.
where
Specifies the SQL WHERE clause generated within the Histogram SQL to filter rows selected for analysis. For example: where=cust_id > 0.

Examples

These examples demonstrate how to use the Histogram analysis. To execute the provided examples, the td_analyze function must be installed in a database called twm and the Teradata Warehouse Miner tutorial data must be installed in the twm_source database.

The following case has no special options (and 10 equal width bins by default).

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;');

The following demonstrates the use of the overlay columns option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;overlaycolumns=gender;');

The following demonstrates the use of the statistics columns option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;statisticscolumns=years_with_bank;');

The following demonstrates the use of the crosstab option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;style=crosstab;');

The following demonstrates the use of the where clause option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;where=income > 0;');

The following demonstrates the use of the bins option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;bins=5;');

The following demonstrates the use of the widths option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;widths=25000;');

The following demonstrates the use of the quantiles option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;quantiles=4;');

The following demonstrates the use of the boundaries option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;boundaries=0, 50000, 100000, 150000;');

The following demonstrates the use of the bin with min and max option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;binwithminmax=10,0, 200000;');

The following example creates a result table with 5 equal width bins.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;bins=5;outputdatabase=twm_results;outputtablename=_twm_histogram;');