Overview of Statistical Tests - Teradata Warehouse Miner

Teradata Warehouse Miner User Guide - Volume 3 Analytic Functions

Teradata Warehouse Miner
February 2018
English (United States)
This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata database.

Teradata Warehouse Miner contains both parametric and nonparametric statistical tests from the classical statistics literature, as well as more recently developed tests. In addition, “group by” variables make it possible to analyze groups of data defined by specific values of selected variables. In this way, multiple tests can be conducted at once to build a profile of customer data that reveals hidden clues about customer behavior.
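
The idea of running one test per group can be illustrated outside the product with a small Python sketch. It is not how Teradata Warehouse Miner is implemented (the product generates SQL that runs in the database); the function names `t_statistic` and `t_by_group`, the region labels, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import fmean, stdev


def t_statistic(values, mu0):
    """One-sample t statistic for the hypothesis that the mean equals mu0."""
    n = len(values)
    return (fmean(values) - mu0) / (stdev(values) / n ** 0.5)


def t_by_group(rows, mu0):
    """Split (group_key, value) rows by key and compute one t statistic per group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: t_statistic(vals, mu0) for key, vals in groups.items()}


# Made-up purchase amounts tagged with a hypothetical "region" group-by variable.
rows = [
    ("east", 52.0), ("east", 48.0), ("east", 50.0),
    ("west", 61.0), ("west", 59.0), ("west", 63.0),
]
result = t_by_group(rows, mu0=50.0)
for region, t in result.items():
    print(region, round(t, 3))
```

Each group yields its own statistic, so a single pass over the data produces one test result per combination of group-by values.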

In simplified terms, statistical inference lets us determine whether the outcome of an experiment could plausibly have happened by chance, or whether it is extremely unlikely to have done so. A very well-designed experiment would, of course, have outcomes that are clearly different and require no statistical test. In practice, however, noisy outcomes are common, and statistical inference is required to get an answer. It does not matter whether the data come from an experiment we designed or from a retail database: questions can be asked of the data, and statistical inference can provide the answers.

What is statistical inference? It is the process of drawing conclusions about the parameters of a statistical distribution. There are three principal approaches to statistical inference. One is Bayesian estimation, where conclusions are based upon posterior judgments about the parameter given an experimental outcome. A second is the likelihood approach, in which all conclusions are inferred from the likelihood function of the parameter given an experimental outcome. A third is hypothesis testing, which includes both nonparametric and parametric inference. For nonparametric inference, estimators concerning the distribution function are independent of the specific mathematical form of the distribution function. Parametric inference, by contrast, involves estimators that assume a particular mathematical form for the distribution function, most often the normal distribution. Parametric tests are based on the sampling distribution of a particular statistic. Given knowledge of the underlying distribution of a variable, it is possible to predict how the statistic is distributed across multiple equal-size samples.
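
The last point, that the distribution of a statistic across equal-size samples can be predicted, can be checked with a short simulation. The sketch below is illustrative only and assumes a normal population with mean 100 and standard deviation 15 (arbitrary values, not taken from the product). It draws many samples of size 25 and compares the observed spread of the sample means with the predicted standard error, sigma / sqrt(n).

```python
import random
import statistics


def sample_means(population_mean, population_sd, sample_size, n_samples, seed=0):
    """Draw n_samples samples of equal size from a normal population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(
            rng.gauss(population_mean, population_sd) for _ in range(sample_size)
        )
        for _ in range(n_samples)
    ]


means = sample_means(100.0, 15.0, sample_size=25, n_samples=5000)

predicted_se = 15.0 / 25 ** 0.5          # theory: sigma / sqrt(n) = 3.0
observed_se = statistics.stdev(means)    # spread of the simulated sample means

print(f"predicted standard error: {predicted_se:.2f}")
print(f"observed standard error:  {observed_se:.2f}")
```

The two values should agree closely, which is the property a parametric test relies on when it converts a statistic into a probability.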

The statistical tests provided in Teradata Warehouse Miner are solely those of the hypothesis testing type, both parametric and nonparametric. Hypothesis tests generally belong to one of five classes:
  1. parametric tests, including the class of t-tests and F-tests, which assume normality of data populations
  2. nonparametric tests of the binomial type
  3. nonparametric tests of the chi-square type, based on contingency tables
  4. nonparametric tests based on ranks
  5. nonparametric tests of the Kolmogorov-Smirnov type

Within each class of tests there exist many variants, some of which have risen to the level of being named for their authors. Often tests have multiple names due to different originators. The tests may be applied to data in different ways, such as on one sample, two samples or multiple samples. The specific hypothesis of the test may be two-tailed, upper-tailed or lower-tailed.
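
To make the difference between two-tailed, upper-tailed and lower-tailed hypotheses concrete, here is a minimal sketch of an exact one-sample binomial test in plain Python. It is not the product's implementation; the function names and the example counts (15 successes in 20 trials against a null proportion of 0.5) are assumptions for illustration. The two-tailed value uses one common convention: it sums the probabilities of all outcomes that are no more likely than the observed one.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_test(successes, n, p0=0.5):
    """Exact p-values for the null hypothesis that the success probability is p0."""
    pmf = [binomial_pmf(k, n, p0) for k in range(n + 1)]
    lower = sum(pmf[: successes + 1])   # P(X <= observed)
    upper = sum(pmf[successes:])        # P(X >= observed)
    observed = pmf[successes]
    # Two-tailed: all outcomes at most as probable as the observed one
    # (the small relative tolerance guards against floating-point ties).
    two_tailed = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return {"lower": lower, "upper": upper, "two_tailed": min(1.0, two_tailed)}


p_values = binomial_test(successes=15, n=20, p0=0.5)
for tail, p in p_values.items():
    print(f"{tail}: {p:.4f}")
```

For these counts the upper-tailed p-value is about 0.0207 and the two-tailed p-value is about 0.0414, while the lower-tailed p-value is close to 1. The same data can therefore lead to different conclusions depending on which hypothesis is specified.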

Hypothesis tests vary depending on the assumptions made in the context of the experiment, and care must be exercised that those assumptions are valid for the particular data to be examined. For example, is it a fair assumption that the variables are normally distributed? The choice of which test to apply will depend on the answer to this question. Failure to exercise proper judgment in choosing a test may result in false alarms (Type I errors), where the null hypothesis is rejected incorrectly, or misses (Type II errors), where the null hypothesis is accepted improperly.
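
False alarms and misses can be estimated by simulation. The sketch below is an illustration under stated assumptions, not a product feature: it uses a two-tailed one-sample z-test with a known standard deviation (chosen because it needs only the Python standard library, in place of the t-tests named above), a significance level of 0.05, samples of size 30, and a hypothetical true mean shift of 0.5 for the case where the null hypothesis is false.

```python
import random
from statistics import NormalDist, fmean


def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample z-test with known sigma; True if the null is rejected."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=30, trials=4000, seed=1):
    """Fraction of simulated experiments in which the null (mean == mu0) is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        z_test_rejects([rng.gauss(true_mean, sigma) for _ in range(n)], mu0, sigma)
        for _ in range(trials)
    )
    return rejections / trials


# Null hypothesis true: every rejection is a false alarm.
false_alarm_rate = rejection_rate(true_mean=0.0)
# Null hypothesis false (true mean is 0.5): every non-rejection is a miss.
miss_rate = 1 - rejection_rate(true_mean=0.5)

print(f"false alarm rate: {false_alarm_rate:.3f}")
print(f"miss rate:        {miss_rate:.3f}")
```

When the test's assumptions hold, the false alarm rate stays near the chosen significance level. Applying a test whose assumptions do not hold can push either error rate away from its nominal value, which is why the choice of test matters.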

Identity columns (i.e., columns defined with the attribute “GENERATED … AS IDENTITY”) cannot be analyzed by many of the statistical test functions and should therefore generally be avoided.