Kolmogorov-Smirnov Tests

Vantage Analytics Library User Guide

Release 2.2.0 | Published March 2023 | Last Update 2024-01-02

Kolmogorov-Smirnov tests use the maximum vertical distance between distribution functions as a measure of their similarity. They compare either two empirical distribution functions against each other, or a single empirical distribution function against a hypothetical distribution (for example, a normal distribution), and determine the likelihood that the two distributions are the same.
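
As an illustration of this distance measure only (a minimal NumPy sketch, not the Vantage Analytics Library implementation), the following code computes the maximum vertical distance between two empirical distribution functions:

```python
import numpy as np

def ks_distance(x, y):
    """Maximum vertical distance between the empirical CDFs of x and y."""
    x, y = np.sort(x), np.sort(y)
    # Evaluate both empirical CDFs at every observed point; the largest
    # gap between the two step functions occurs at one of these points.
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

rng = np.random.default_rng(0)
d = ks_distance(rng.normal(size=200), rng.normal(0.5, 1.0, size=200))
print(f"KS distance D = {d:.3f}")
```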

Kolmogorov-Smirnov Test (One Sample)

The Kolmogorov-Smirnov test determines whether a dataset follows the normal distribution.

This test assumes nothing about the data distribution (that is, the test is nonparametric and distribution-free). Less general tests (for example, Student's t-test) may be more sensitive if the data meet the test requirements.

This test is usually less powerful than tests specifically designed to test for normality, especially when the mean and variance are not specified in advance.

This test does not indicate the type of nonnormality—for example, whether the distribution is skewed, heavy-tailed, or both. Examining the skewness and kurtosis, and the histogram, boxplot, and normal probability plot for the data may show why the data failed the Kolmogorov-Smirnov test.

Each unique set of values in the groupby columns is called a group-by value set, or GBV set. The function does a separate test for each GBV set.
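
In the Vantage Analytics Library, this test is run through the kstest function. As an illustrative stand-in (not the library's API), the following SciPy sketch tests a sample against a normal distribution whose mean and standard deviation are estimated from that same sample, which is exactly the situation the Lilliefors test below is designed to correct:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=500)

# One-sample KS test against a normal distribution. The parameters are
# estimated from x rather than specified in advance, so the reported
# p-value is only approximate here.
stat, pvalue = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
print(f"D = {stat:.4f}, p = {pvalue:.4f}")
```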

Lilliefors Test

The Lilliefors test determines whether a dataset matches a particular distribution. It is a modification of the Kolmogorov-Smirnov test that first converts the data to Z-scores.

This test computes the Lilliefors statistic and checks its significance by comparing the computed value with quantiles of the test statistic, which are taken from exact tables built from random numbers in computer simulations.

When this test is for the normal distribution, the null hypothesis is that the distribution function is normal with unspecified mean and variance. The alternative hypothesis is that the distribution function is nonnormal. The test compares the empirical distribution of X with a normal distribution with the same mean and variance as X. It is similar to the Kolmogorov-Smirnov test, but it adjusts for the fact that the parameters of the normal distribution are estimated from X rather than specified in advance.

The function does a separate test for each GBV set.
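
Outside the database, the statsmodels package provides a comparable Lilliefors test; the sketch below is illustrative only and is not the Vantage Analytics Library interface:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(2)
x = rng.normal(size=300)

# Lilliefors test for normality: the mean and variance are estimated
# from x, and the p-value comes from simulated tables of the test
# statistic rather than the standard KS distribution.
stat, pvalue = lilliefors(x, dist="norm")
print(f"Lilliefors statistic = {stat:.4f}, p = {pvalue:.4f}")
```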

Shapiro-Wilk Test

The Shapiro-Wilk test detects departures from the normal distribution without requiring advance specification of the mean or variance of the hypothesized normal distribution. It is considered one of the best omnibus tests of normality, and is usually more powerful than the Kolmogorov-Smirnov test.

The standard algorithm for the Shapiro-Wilk test applies only to sample sizes from 3 to 2000. The test statistic W measures how closely the ordered sample values agree with the values expected from a normal distribution.

The Shapiro-Wilk test performed by the kstest function in the Vantage Analytics Library is based on the approximations and code given by Royston (1982a, b). It too applies only to sample sizes from 3 to 2000. Royston (1982b) gives approximations and tabled values that you can use to compute the coefficients and the significance level of the W statistic. Small values of W are evidence of departure from normality. This test has done very well in comparison studies with other goodness-of-fit tests.

This test does not indicate the type of nonnormality—for example, whether the distribution is skewed, heavy-tailed, or both. Examining the skewness and kurtosis, and the histogram, boxplot, and normal probability plot for the data may show why the data failed the Shapiro-Wilk test.
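
For comparison outside the database, SciPy's shapiro function, which is likewise based on Royston's approximation, computes the W statistic and its significance; this sketch is illustrative, not the Vantage Analytics Library call:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(size=100)  # a clearly nonnormal (skewed) sample

# Shapiro-Wilk test: a small W, and hence a small p-value, is
# evidence of departure from normality.
stat, pvalue = stats.shapiro(x)
print(f"W = {stat:.4f}, p = {pvalue:.4f}")
```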

D'Agostino and Pearson Test

The D'Agostino and Pearson test detects departures from the normal distribution without requiring advance specification of the mean or variance of the hypothesized normal distribution. It is an omnibus test of normality, and is usually more powerful than the Kolmogorov-Smirnov test.

The D'Agostino-Pearson K² statistic combines the sample skewness and kurtosis. It has approximately a chi-squared distribution with two degrees of freedom when the population is normally distributed.

This test does not indicate the type of nonnormality—for example, whether the distribution is skewed, heavy-tailed, or both. Examining the skewness and kurtosis, and the histogram, boxplot, and normal probability plot for the data may show why the data failed the D'Agostino and Pearson test.
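
An illustrative equivalent outside the database is SciPy's normaltest, which implements the D'Agostino-Pearson test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_t(df=3, size=500)  # heavy-tailed, nonnormal sample

# normaltest combines the sample skewness and kurtosis into the K^2
# statistic, approximately chi-squared with 2 degrees of freedom
# under normality.
stat, pvalue = stats.normaltest(x)
print(f"K^2 = {stat:.3f}, p = {pvalue:.4f}")
```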

Smirnov Test

The Smirnov test (also called the two-sample Kolmogorov-Smirnov test) checks whether two datasets have significantly different distributions.

This test assumes nothing about the data distribution (that is, the test is nonparametric and distribution-free). Less general tests (for example, Student's t-test) may be more sensitive if the data meet the test requirements.

If the product of the number of observations in the first sample and the number of observations in the second sample is greater than 10,000, an approximate p-value is computed. Otherwise, an exact p-value is computed.
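
Recent SciPy versions apply a similar size-based rule in ks_2samp, although the exact cutoff differs from the one above; the sketch below is illustrative only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=80)
y = rng.normal(0.4, 1.0, size=90)

# Two-sample (Smirnov) test. With method="auto", SciPy chooses an
# exact p-value for small samples and an asymptotic approximation
# for large ones.
result = stats.ks_2samp(x, y, method="auto")
print(f"D = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```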