Histogram - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

Product
Teradata Warehouse Miner
Release Number
5.4.6
Published
November 2018
Language
English (United States)
Last Update
2018-12-07
dita:mapPath
akh1538171534882.ditamap
dita:ditavalPath
ft:empty
dita:id
B035-2306
lifecycle
previous
Product Category
Teradata® Warehouse Miner

Purpose

Histogram analysis reveals the distribution of continuous numeric or date values in a column. Histogram analysis is also referred to as binning because it counts the occurrence of values in a series of numeric ranges called bins. The histogram analysis provides a number of ways to define bins, allowing multidimensional binning, overlaying of categorical data, and the calculation of numeric statistics within bins.

If you set the desired number of equal sized data bins, the desired number of bins with a nearly equal number of values, a desired width, or the specific boundaries, the Histogram analysis separates the data to show its distributional properties. It does this by separating the data by bin number and gives counts and percentages over the requested rows. Percentages always sum to 100%. Separate options are available to specify a number of equal sized data bins in which the analysis determines the minimum and maximum value, as well as a user-specified minimum and maximum value. If the minimum and maximum are specified, all values less than the minimum are put in bin 0, while all values greater than the maximum are put in bin N+1. The same is true when the boundary option is specified.

The Histogram analysis optionally provides subtotals within each bin of the count, percentage within the bin and percentage overall for each value or combination of values of one or more overlaid columns. Another option is provided to collect simple statistics for a binned column or another column of numeric or date type within the table, providing the minimum, maximum, mean, and standard deviation. When statistics are collected for a date type column, the standard deviation is given in units of days.

You can perform Histogram analysis on columns of numeric or date data type.

Syntax

call twm. td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;Optional Parameters;');

Required Parameters

columns
The columns to analyze. For example, columns=c1,c2,c3. When columns=all is entered, all columns in the input table are analyzed.
database
The database containing the table to analyze.
histogram
The histogram parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes
tablename
The table containing the columns to analyze.

Optional Parameters

bins style
Use these options to define a bins style. You can use only one option at a time. If you do not specify a bins style, the default is bins=10.
Option Description/Example
bins The number of equal width bins to create.

For example, bins=5. If multiple columns are requested, multiple bin sizes may be specified, such as bins=5, 10. If fewer sizes are specified than columns, left-over columns are associated with the default size of 10 bins.

widths The width of the bins to create.

For example, widths=100. If multiple columns are requested, multiple widths must be specified, such as widths=5, 10. If fewer sizes are specified than columns, an error message displays.

quantiles The number of approximately equally populated bins to create.

For example, quantiles=10. If multiple columns are requested, multiple quantile sizes may be specified, such as quantiles=5, 10. If fewer sizes are specified than columns, left-over columns are associated with the default size of 10 quantiles.

boundaries Specific boundaries that define the bins.

For example, boundaries=0,50,100,150 provides 3 bins between 0 and 150 (0 to 50, 50 to 100, and 100 to 150). If multiple columns are requested, multiple sets of parameters must be specified, such as boundaries={0, 50000, 100000, 150000}, {0, 50, 100}.

binwithminmax The number of bins spanning a range specified by the minimum and maximum values.

For example, binwithminmax=5,0,200 creates 5 bins ranging from 0 to 200. If multiple columns are requested, multiple sets of parameters must be specified, such as binwithminmax={10, 0, 200000}, {5, 0, 100}.

gensqlonly
When true, the SQL for the requested function is returned as a result set but not run. When not specified or set to false, the SQL is run but not returned.
outputdatabase
Specifies the name of the database to contain the analysis results table.
outputtablename
Specifies the name of the table to store the analysis results. If not supplied, the results are returned as a result set.
overlaycolumns
A categorical variable with only a few values. If an overlay column is specified, frequencies within each bin are calculated for each value of that overlay column (frequencies for crosstabs of values are given if more than one overlay column is requested). You can use a specific column in either Overlay Columns or Statistics Columns, but not both.
overwrite

When overwrite is set to true (default), the output tables are dropped before creating new ones.

statisticscolumns
Represents a list of numeric columns/aliases for which simple statistics are calculated (minimum, maximum, mean and standard deviation) in each bin. This option is not available for DATE columns. You can use a specific column in either Statistics Columns or Overlay Columns, but not both.
style
Sets the histogram style to one of two possible values:
Style Description
basic Creates a histogram for individual columns (default).
crosstab Creates a multidimensional histogram by combining columns.
where
Specifies the SQL WHERE clause generated within the Histogram SQL to filter rows selected for analysis. For example: where=cust_id > 0.

Examples

These examples demonstrate how to use the Histogram analysis. To run the provided examples, the td_analyze function must be installed in a database called twm and the TWM tutorial data must be installed in the twm_source database.

The following case has no special options (and 10 equal width bins by default).

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;');

The following demonstrates the use of the overlay columns option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;overlaycolumns=gender;');

The following demonstrates the use of the statistics columns option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;statisticscolumns=years_with_bank;');

The following demonstrates the use of the crosstab option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income,age;style=crosstab;');

The following demonstrates the use of the where clause option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;where=income > 0;');

The following demonstrates the use of the bins option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;bins=5;');

The following demonstrates the use of the widths option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;widths=25000;');

The following demonstrates the use of the quantiles option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;quantiles=4;');

The following demonstrates the use of the boundaries option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;boundaries=0, 50000, 100000, 150000;');

The following demonstrates the use of the bin with min and max option.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;binwithminmax=10,0, 200000;');

The following example creates a result table with 5 equal width bins.

call twm.td_analyze('histogram','database=twm_source;tablename=twm_customer;columns=income;bins=5;outputdatabase=twm_results;outputtablename=_twm_histogram;');