5.4.2 - Statistics - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

prodname
Teradata Warehouse Miner
vrm_release
5.4.2
created_date
October 2016
category
User Guide
featnum
B035-2306-106K

Purpose

When dealing with numeric data columns, it is useful to have several statistical measures to understand the characteristics and properties of each of those numeric columns, to assess their quality, and to look for outlying values and other possible anomalies. The Statistics analysis provides several common and not so common statistical measures for numeric data columns. Extended options include additional analyses and measures such as Values, Modes, Quantiles, and Ranks.

A Statistics analysis can be performed on columns of numeric or date data type. For columns of type DATE, statistics other than count, minimum, maximum and mean are calculated by first converting to the number of days since 1900.

Syntax

call twm. td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;Optional Parameters;');

Required Parameters

columns
The columns to be analyzed.
database
The database containing the table to be analyzed.
Statistics
The Statistics parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes
tablename
The table containing the columns to be analyzed.

Optional Parameters

extendedoptions
This parameter is used to request any of the following extended options to be calculated:
  • all
  • none (default if extendedoptions is not specified)
  • modes
  • quantiles
  • values
  • rank
groupby

If one or more columns are specified with the groupby parameter, a separate analysis is performed for each value or combination of values in the specified columns. For example: groupby=gender,marital_status.

outputdatabase
This parameter specifies the name of the database that will contain the analysis results table.
outputtablename
Specifies the name of the table that will store the analysis results. If this parameter is not supplied, the results are returned as a result set.
statisticalmethod
Determines if Population or Sample statistics should be calculated. For example, statisicalmethod=population or statisticalmethod=sample can be specified. Population statistics is the default if neither is specified.
statsoptions
Requests the basic statistics to be calculated if the Statistics analysis is performed. For convenience, shortened aliases can be used instead of the lengthier names (for example, statsoptions=all or statsoptions=cnt,min,max,mean,std). Available statistics include:
  • all
  • none
  • count (cnt)
  • minimum (min)
  • maximum (max)
  • mean
  • standarddeviation (std)
  • skewness (skew)
  • kurtosis (kurt)
  • standarderror (ste)
  • coefficientofvariance (cv)
  • variance (var)
  • sum
  • uncorrectedsumofsquares (uss)
  • correctedsumofsquares (css)
where

An optional WHERE clause that can be used to filter the data to be processed. For example:

where=cust_id > 0

Examples

Examples in this section show how to use Statistics. These examples assume that the td_analyze function has been installed in a database named twm.

The following example demonstrates the invocation of the Statistics analysis with minimal parameters. The statistics calculated are count, min, max, mean and standard deviation.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;');

The following example produces an output table with group-by and where clause.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;outputdatabase=twm_results;outputtablename=_twm_statistics;groupby=gender;where=income > 0;');

The following example demonstrates the selection of all statistical measures and extended options.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;statsoptions=all;extendedoptions=all;');

The following example demonstrates the selection of individual statistical measures and extended options, returning sample statistics.

call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;statsoptions=all;extendedoptions=modes;statisticalmethod=sample');