Statistical Analysis

Teradata Warehouse Miner User Guide - Volume 1Introduction and Profiling

Teradata Warehouse Miner
User Guide

When dealing with numeric data columns, it is useful to have several statistical measures to understand the characteristics and properties of each of those numeric columns, to assess their quality, and to look for outlying values and other possible anomalies. Statistical analysis provides several common and not so common statistical measures for numeric data columns. Extended options include additional analyses and measures such as Values, Modes, Quantiles, and Ranks (top 5 and bottom 5 values, and their respective counts). The Values analysis provided is also available separately, as described in Values.

Given a table name and the name(s) of numeric column(s), Statistical analysis determines descriptive statistics for each of the column(s). Univariate Statistics provided include the following:
  • Count
  • Minimum
  • Maximum
  • Mean
  • Standard Deviation
  • Skewness
  • Kurtosis
  • Standard Error
  • Coefficient of Variance
  • Variance
  • Sum
  • Uncorrected Sums of squares
  • Corrected Sums of squares

For columns of type DATE, statistics other than count, minimum, maximum and mean are calculated by first converting to the number of days since 1900.

In addition to these basic numerical statistics, extended statistics can be requested to add the following to the analysis:
  • Values — Count of rows with value 0, rows with a unique value, rows with a positive value, rows with a negative value, and the number of rows containing blanks in the given column
  • Modes — Modal value and number of modes
  • Quantiles — Bottom ten, top ten, deciles, quartiles, and tertiles. With the extended option Quantiles, the bottom 10, top 10, deciles, quartiles and tertiles are determined by dividing the data set into the respective equal size groups and providing the max value from that group.
  • Rank — Top 5 and bottom 5 ranked values, with respective counts

With the extended option Modes, only the minimum modal value is returned and an additional column called “xnbrmodes” for the number of modes or modality is generated. For example, if the modes are 10 and 20, the value 10 is returned for the mode with xnbrmodes equal to 2.

With the extended option Rank, the top 5 and bottom 5 values are determined as distinct values, with counts also provided for each value. For example, if the top values are 5,6,7,8,9,10,10,10, the top 5 values returned are 6,7,8,9,10 with counts 1,1,1,1,3.

If multiple columns are requested, a VOLATILE table is built, and all columns are processed in a single CREATE VOLATILE TABLE AS SELECT… statement. Data is reformatted with individual INSERT/SELECT statements into the final output dataset as described below. In this case, the create view option may not be requested.

By default, population statistics are generated unless the Sample option is selected.

The Statistical analysis is parameterized by specifying the table and column(s) to analyze, options unique to the Statistical analysis, as well as specifying the desired results and SQL or Expert Options.

For general information about output, see OUTPUT Tab.