5.4.6 - Statistics - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

prodname
Teradata Warehouse Miner
vrm_release
5.4.6
created_date
November 2018
category
User Guide
featnum
B035-2306-118K

Purpose

When working with numeric data columns, use statistical measures to understand the characteristics and properties of each numeric column, and to look for outlying values and anomalies. The Statistics analysis provides several common and not so common statistical measures for numeric data columns. Extended options include additional analyses and measures such as Values, Modes, Quantiles, and Ranks.

You can perform Statistics analysis on columns of numeric or date data type. For columns of type DATE, statistics other than count, minimum, maximum, and mean are calculated by first converting to the number of days since 1900.

Syntax

call twm. td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;Optional Parameters;');

Required Parameters

columns
The columns to analyze. Enter columns=all to analyze all columns in the input table.

For example: columns=column1,column2,column3 or columns=all

database
The database containing the table to analyze.
Statistics
The Statistics parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes
tablename
The table containing the columns to analyze.

Optional Parameters

extendedoptions
Use to request any of the following extended options to be calculated:
  • all
  • none (default if extendedoptions is not specified)
  • modes
  • quantiles
  • values
  • rank
gensqlonly
When true, the SQL for the requested function is returned as a result set but not run. When not specified or set to false, the SQL is run but not returned.
groupby

If columns are specified with the groupby parameter, a separate analysis is performed for each value or combination of values in the specified columns. For example: groupby=gender,marital_status.

outputdatabase
Specifies the name of the database to contain the analysis results table.
outputtablename
Specifies the name of the table to store the analysis results. If not supplied, the results are returned as a result set.
overwrite

When overwrite is set to true (default), the output tables are dropped before creating new ones.

statisticalmethod
The calculated statistic as specified by either population or sample. For example, you can specify either statisicalmethod=population or statisticalmethod=sample. Population statistics is the default if neither is specified.
statsoptions
The basic statistics to be calculated if the Statistics analysis is performed. You can use shortened aliases instead of the lengthier names (for example, statsoptions=all or statsoptions=cnt,min,max,mean,std). Available statistics include:
  • all
  • none
  • count (cnt)
  • minimum (min)
  • maximum (max)
  • mean
  • standarddeviation (std)
  • skewness (skew)
  • kurtosis (kurt)
  • standarderror (ste)
  • coefficientofvariance (cv)
  • variance (var)
  • sum
  • uncorrectedsumofsquares (uss)
  • correctedsumofsquares (css)
where

The optional WHERE clause to filter the data to process. For example:

where=cust_id > 0

Examples

To run the provided examples, the td_analyze function must be installed in a database called twm and the TWM tutorial data must be installed in the twm_source database.

These examples demonstrate the invocation of the Statistics analysis with minimal parameters. The statistics calculated are count, min, max, mean, and standard deviation.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;');

The following example produces an output table with group-by and where clause.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;outputdatabase=twm_results;outputtablename=_twm_statistics;groupby=gender;where=income > 0;');

The following example demonstrates the selection of all statistical measures and extended options.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;statsoptions=all;extendedoptions=all;');

The following example demonstrates the selection of individual statistical measures and extended options, returning sample statistics.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;statsoptions=all;extendedoptions=modes;statisticalmethod=sample');