Data Explorer

In-Database Analytic Functions User Guide

brand
Software
prodname
Teradata Warehouse Miner
vrm_release
5.4.2
category
User Guide
featnum
B035-2306-106K

Purpose

The Data Explorer performs basic statistical analysis on a set of selected tables, or on selected columns from selected tables, in one or more databases. It stores results from four fundamental types of analysis based on simplified versions of the Descriptive Statistics analyses:

  • Values
  • Statistics
  • Frequency
  • Histogram

An answer table is produced for each requested type of analysis, the output including requested database, table and column names in order to allow results from multiple tables to be included in each answer table. Each analysis can be selected individually, with the following exceptions:

  • If Frequency is selected, Values must be selected
  • If Histogram is selected, Values and Statistics must be selected, including the Count, Minimum, Maximum, Mean and Standard Deviation.

The Data Explorer includes intelligence about which functions should be performed on which columns, with decisions based partly on column type and partly on obtained results. It also includes performance enhancements in the generated SQL to help minimize passes through the data. A separate SQL Where Clause can optionally be specified for each of the input tables selected for analysis.

A Values analysis can be performed on columns of any data type, though the measures displayed vary according to the column type.

A Statistics analysis can be performed on columns of numeric or date data type. For columns of type DATE, statistics other than count, minimum, maximum and mean are calculated by first converting to the number of days since 1900.

Generally, a Frequency analysis can be performed on columns of any data type except types that hold byte data (for example, BYTE and VARBYTE type data).

A Histogram analysis can be performed on columns of numeric or date data type.

Syntax

call twm. td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;Optional Parameters;');

Required Parameters

database
The database or databases containing the table or tables to be analyzed.
The number of databases in the database parameter and the number of tables in the tablename parameter must be the same. They are matched by position, and the database names may be repeated. For example, database=d1,d1,d2 and tablename=t1,t2,t3 indicate that d1.t1, d1.t2 and d2.t3 are the tables to be analyzed, with tables t1 and t2 both residing in database d1.
DataExplorer
The DataExplorer parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes
outputdatabase
The results of the analysis will be stored in one or more tables in this database.
There are four output table name parameters as shown below. These affect not only the names of the output tables but whether or not the indicated types of analysis are performed. If none of these parameters is supplied, all four fundamental types of analysis, Values, Statistics, Frequency and Histogram, are performed and the output table names are:
  • TwmExploreValues
  • TwmExploreStatistics
  • TwmExploreFrequency
  • TwmExploreHistogram
If one or more of the output table name parameters below are specified, then only the indicated types of analysis are performed and the output tables are named as indicated.
  • valuesoutputtablename=
  • statisticsoutputtablename=
  • frequencyoutputtablename=
  • histogramoutputtablename=
Not all combinations of the above output table name parameters are valid however. The valid combinations are:
  • values
  • statistics
  • values + statistics
  • values + frequency
  • values + statistics + histogram
  • values + statistics + frequency + histogram
For example, a combination of Values and Statistics analyses, creating output tables v1 and s1, can be requested as shown below. The Examples section provides additional samples.

...valuesoutputtablename=v1;statisticsoutputtablename=s1;

tablename
The table or tables containing the columns to be analyzed. For example, tablename=t1,t2,t3 indicates that three tables (t1, t2 and t3) are to be analyzed. Note that the number of databases in the database parameter and the number of tables in the tablename parameter must be the same, as described for the database parameter.

Optional Parameters

bins
The number of bins to create if the Histogram analysis is performed. The default is 10.
binstyle
The style of binning to perform if the Histogram analysis is performed. The possibilities are binstyle=bins (default) and binstyle=quantiles.
columns
The columns to be analyzed. For example, columns=c1,c2,c3. If columns=all is entered, all columns in the input table are analyzed.
maxnumcombvalues
The maximum number of combined values for frequency or histogram analysis. The default is 10000.
See the Teradata Warehouse Miner User Guide, Volume 1, Introduction and Profiling for additional information.
maxuniquecharvalues
The maximum number of unique character values for unrestricted frequency analysis. The default value is 100.
See the Teradata Warehouse Miner User Guide, Volume 1, Introduction and Profiling for additional information.
maxuniquenumvalues
The maximum number of unique numeric or date values for frequency analysis. The default is 20.
See the Teradata Warehouse Miner User Guide, Volume 1, Introduction and Profiling for additional information.
minrowsforcomb
Minimum number of rows before frequency or histogram combining attempted. The default is 25000.
See the Teradata Warehouse Miner User Guide, Volume 1, Introduction and Profiling for additional information.
restrictedfreqproc
Performs restricted frequency processing including prominent values. The default is true.
See the Teradata Warehouse Miner User Guide, Volume 1, Introduction and Profiling for additional information.
restrictedthreshold
The minimum fraction of rows frequency value must occur in for restricted frequency processing to be performed. The default is 1.
See the Teradata Warehouse Miner User Guide, Volume 1, Introduction and Profiling for additional information.
statisticalmethod
Determines if Population or Sample statistics should be calculated. For example, statisicalmethod=population or statisticalmethod=sample can be specified. Population statistics is the default if neither is specified.
statsoptions
Requests the basic statistics to be calculated if the Statistics analysis is performed. For convenience, shortened aliases can be used instead of the lengthier names (for example, statsoptions=all or statsoptions=cnt,min,max,mean,std). Available statistics include:
  • all
  • none
  • count (cnt)
  • minimum (min)
  • maximum (max)
  • mean
  • standarddeviation (std)
  • skewness (skew)
  • kurtosis (kurt)
  • standarderror (ste)
  • coefficientofvariance (cv)
  • variance (var)
  • sum
  • uncorrectedsumofsquares (uss)
  • correctedsumofsquares (css)
uniques
When set to true, unique values are counted for each selected column. By default, unique values are set to true and not counted.
where

An optional WHERE clause that can be used to filter the data to be processed. For example:

where=cust_id > 0

Examples

The examples in this section demonstrate the performance of the valid combinations of analyses with a single input table. These examples assume that the td_analyze function has been installed in a database named twm.

This example performs all analyses (Values, Statistics, Frequency and Histogram) and creates the output tables for each of these analyses with default names.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;');

This example performs only the Values analysis, naming the Values output table v1.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;');

This example performs only the Statistics analysis, naming the Statistics output table s1.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;statisticsoutputtablename=s1;');

This example performs a Values and Statistics analysis, naming the output tables v1 and s1 respectively.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;statisticsoutputtablename=s1;');

This example performs both a Values and Frequency analysis, naming the output tables v1 and f1 respectively.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;frequencyoutputtablename=f1;');

This example performs a Values, Statistics and Histogram analysis, naming the output tables v1, s1 and h1 respectively.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;statisticsoutputtablename=s1;histogramoutputtablename=h1;');

This example performs one of each of the four analyses, Values, Statistics, Frequency and Histogram, and names the output tables v1, s1, f1 and h1 respectively.

call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;statisticsoutputtablename=s1;frequencyoutputtablename=f1;histogramoutputtablename=h1;');