td_ks_test_valib | Teradata Package for R Function Reference | 17.00

Product: Teradata Package for R
Release Number: 17.00
Release Date: July 2021
Content Type: Programming Reference
Publication ID: B700-4007-090K
Language: English (United States)

Description

Statistical tests of this type attempt to determine the likelihood that two distribution functions represent the same distribution. Two empirical distribution functions are mapped against each other, or a single empirical function is mapped against a hypothetical (e.g., Normal) distribution. Conclusions are then drawn about the likelihood the two distributions are the same.
The function performs the following tests:

  1. Kolmogorov-Smirnov Test (One Sample)

  2. Lilliefors Test

  3. Shapiro-Wilk Test

  4. D'Agostino and Pearson Test

  5. Smirnov Test

Detailed information about each test can be found in the 'Statistical Tests offered' section.

Usage

td_ks_test_valib(data, dependent.column, ...)

Arguments

data

Required Argument.
Specifies the input data to run statistical tests.
Types: tbl_teradata

dependent.column

Required Argument.
Specifies the name of the numeric column to be tested for a normal distribution.
Types: character

...

Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

The function returns an object of class "td_ks_test_valib", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Other Arguments

columns

Optional Argument.
Specifies a categorical variable with two values that indicate the distribution to which the "dependent.column" belongs.
Note:

  • Used only by the Smirnov test.

Types: character OR vector of Strings (character)

fallback

Optional Argument.
Specifies whether FALLBACK is requested for the output result.
Default Value: FALSE (Not requested)
Types: logical

group.columns

Optional Argument.
Specifies the name(s) of the column(s) for grouping so that a separate result is produced for each value or combination of values in the specified column or columns.
Types: character OR vector of Strings (character)

allow.duplicates

Optional Argument.
Specifies whether duplicates are allowed in the output or not.
Default Value: FALSE
Types: logical

stats.database

Optional Argument.
Specifies the database where the statistical test metadata tables are installed. If not specified, the source database is searched for these metadata tables.
Types: character

style

Optional Argument.
Specifies the test style.
Permitted Values:

  1. 'ks' - Kolmogorov-Smirnov test.

  2. 'l' - Lilliefors test.

  3. 'sw' - Shapiro-Wilk test.

  4. 'p' - D'Agostino and Pearson test.

  5. 's' - Smirnov test.

Default Value: 'ks'
Types: character

probability.threshold

Optional Argument.
Specifies the threshold probability, i.e., alpha probability, below which the null hypothesis is rejected.
Default Value: 0.05
Types: numeric

Statistical Tests offered

Kolmogorov-Smirnov Test (One Sample)

The one-sample Kolmogorov-Smirnov test determines whether a dataset matches a particular distribution (for this test, the normal distribution). The test has the advantage of making no assumption about the distribution of the data (it is non-parametric and distribution-free). This generality comes at some cost: other tests (e.g., the Student's t-test) may be more sensitive if the data meet the requirements of the test. The Kolmogorov-Smirnov test is generally less powerful than tests designed specifically to test for normality. This is especially true when the mean and variance are not specified in advance for the Kolmogorov-Smirnov test, which then becomes conservative. Further, the Kolmogorov-Smirnov test does not indicate the type of nonnormality, e.g., whether the distribution is skewed or heavy-tailed. Examining the skewness and kurtosis, along with the histogram, boxplot, and normal probability plot of the data, may show why the data failed the Kolmogorov-Smirnov test.

You can specify group-by variables (GBVs) with the "group.columns" argument so that a separate test is done for every unique set of values of the GBVs.
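
For intuition outside Vantage, the one-sample test described above can be sketched in base R with stats::ks.test. This is only an illustration on simulated data, not the Vantage Analytic Library implementation. The variables "income" and "years", and the mean and standard deviation fixed in advance, are assumptions made for the sketch.

```r
# Illustrative only: base R on simulated data, no Vantage connection needed.
set.seed(42)
income <- rnorm(200, mean = 50000, sd = 8000)   # hypothetical sample
years  <- rep(1:4, each = 50)                   # hypothetical grouping variable

# One-sample Kolmogorov-Smirnov test against a fully specified normal
# distribution (mean and sd are fixed in advance, not estimated from the data).
res <- ks.test(income, "pnorm", mean = 50000, sd = 8000)

# D is the largest vertical distance between the empirical distribution
# function and the hypothesized normal distribution function.
ks_statistic <- unname(res$statistic)
ks_p_value   <- res$p.value

# Reject the null hypothesis when the p-value falls below the threshold.
alpha <- 0.05
reject_null <- ks_p_value < alpha

# Analogue of group-by variables: one separate test per group value.
p_by_group <- tapply(income, years, function(x) {
  ks.test(x, "pnorm", mean = 50000, sd = 8000)$p.value
})

print(c(D = ks_statistic, p = ks_p_value))
print(p_by_group)
```

The per-group call mirrors what "group.columns" does conceptually: the sample is partitioned by the grouping values and each partition is tested on its own.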

Lilliefors Test

The Lilliefors test determines whether a dataset matches a particular distribution. This test is a modification of the Kolmogorov-Smirnov test in that a conversion to Z-scores is made. The Lilliefors test computes the Lilliefors statistic and checks its significance. Exact tables of the quantiles of the test statistic are computed from random numbers in computer simulations, and the computed value of the test statistic is compared with the quantiles of the statistic.

When the test is for the normal distribution, the null hypothesis is that the distribution function is normal with unspecified mean and variance. The alternative hypothesis is that the distribution function is nonnormal. The empirical distribution of X is compared with a normal distribution with the same mean and variance as X. It is similar to the Kolmogorov-Smirnov test, but it adjusts for the fact that the parameters of the normal distribution are estimated from X rather than specified in advance.

You can specify GBVs so a separate test will be done for every unique set of values of the GBVs.
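
The Z-score conversion and the simulated quantiles described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used by the Vantage Analytic Library: the helper lilliefors_stat, the simulated sample, and the choice of 2000 simulation runs are all hypothetical.

```r
# Illustrative only: base R on simulated data.
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)   # hypothetical sample

# Lilliefors statistic: the Kolmogorov-Smirnov distance computed after
# converting the data to Z-scores, i.e. comparing against N(0, 1) with the
# mean and standard deviation estimated from the sample itself.
lilliefors_stat <- function(x) {
  n  <- length(x)
  z  <- sort((x - mean(x)) / sd(x))
  fz <- pnorm(z)
  max(seq_len(n) / n - fz, fz - (seq_len(n) - 1) / n)
}

d_obs <- lilliefors_stat(x)

# Null distribution by simulation. The statistic is unchanged by shifting or
# rescaling the data, so standard normal samples of the same size suffice.
d_sim <- replicate(2000, lilliefors_stat(rnorm(length(x))))

# Simulated p-value: share of simulated statistics at least as large as d_obs.
p_value <- mean(d_sim >= d_obs)
print(c(D = d_obs, p = p_value))
```

Because the parameters are estimated from the same data, the simulated null distribution is concentrated on smaller values than the ordinary Kolmogorov-Smirnov one, which is the adjustment the paragraph above refers to.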

Shapiro-Wilk Test

The Shapiro-Wilk test detects departures from normality without requiring that the mean or variance of the hypothesized normal distribution be specified in advance. It is considered to be one of the best omnibus tests of normality. The function is based on the approximations and code given by Royston (1982a, b) and can be used in samples as large as 2,000 or as small as 3. Royston (1982b) gives approximations and tabled values that can be used to compute the coefficients and to obtain the significance level of the W statistic. Small values of W are evidence of departure from normality. This test has done very well in comparison studies with other goodness-of-fit tests.

Either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for normality. As omnibus tests, however, they will not indicate the type of nonnormality, e.g., whether the distribution is skewed as opposed to heavy-tailed (or both). Examination of the calculated skewness and kurtosis, and of the histogram, boxplot, and normal probability plot for the data may provide clues as to why the data failed the Shapiro-Wilk or D'Agostino-Pearson test.

The standard algorithm for the Shapiro-Wilk test only applies to sample sizes from 3 to 2000. The test statistic is based on the Kolmogorov-Smirnov statistic for a normal distribution with the same mean and variance as the sample mean and variance.
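
As a rough local illustration of how W behaves, base R provides stats::shapiro.test. This sketch uses simulated data and is not the Vantage Analytic Library implementation; note that R's function accepts sample sizes from 3 to 5000, which differs from the 3 to 2000 range stated above. The sample names are hypothetical.

```r
# Illustrative only: base R on simulated data.
set.seed(7)
normal_sample <- rnorm(100)   # hypothetical sample drawn from a normal distribution
skewed_sample <- rexp(100)    # hypothetical right-skewed sample

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# W is close to 1 for normal-looking data; small values of W are evidence
# of departure from normality.
w_normal <- unname(sw_normal$statistic)
w_skewed <- unname(sw_skewed$statistic)

print(c(W_normal = w_normal, p_normal = sw_normal$p.value))
print(c(W_skewed = w_skewed, p_skewed = sw_skewed$p.value))
```

As the text notes, a small W says only that the data depart from normality, not how; a histogram or normal probability plot of the skewed sample would show the long right tail.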

D'Agostino and Pearson Test

Either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for normality. These tests are designed to detect departures from normality without requiring that the mean or variance of the hypothesized normal distribution be specified in advance. Though these tests cannot indicate the type of nonnormality, they tend to be more powerful than the Kolmogorov-Smirnov test.

The D'Agostino-Pearson K-squared statistic has approximately a chi-squared distribution with 2 degrees of freedom when the population is normally distributed.
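
The K-squared statistic can be sketched in base R as the sum of two squared, approximately standard normal scores: one from the sample skewness and one from the sample kurtosis. The code below follows the commonly published transformations (D'Agostino's for skewness, Anscombe and Glynn's for kurtosis) and is an assumption about the general method, not a statement of how the Vantage Analytic Library computes it. The function name dagostino_pearson and the minimum sample size guard are hypothetical.

```r
# Illustrative only: base R on simulated data.
dagostino_pearson <- function(x) {
  n <- length(x)
  stopifnot(n >= 20)   # conservative guard; the approximations are poor for small n
  d  <- x - mean(x)
  m2 <- mean(d^2); m3 <- mean(d^3); m4 <- mean(d^4)
  g1 <- m3 / m2^1.5    # sample skewness
  b2 <- m4 / m2^2      # sample kurtosis

  # Skewness: transform g1 to an approximately standard normal score z1.
  y     <- g1 * sqrt((n + 1) * (n + 3) / (6 * (n - 2)))
  beta2 <- 3 * (n^2 + 27 * n - 70) * (n + 1) * (n + 3) /
    ((n - 2) * (n + 5) * (n + 7) * (n + 9))
  w2    <- -1 + sqrt(2 * (beta2 - 1))
  delta <- 1 / sqrt(0.5 * log(w2))
  a     <- sqrt(2 / (w2 - 1))
  z1    <- delta * asinh(y / a)

  # Kurtosis: transform b2 to an approximately standard normal score z2.
  e   <- 3 * (n - 1) / (n + 1)
  v   <- 24 * n * (n - 2) * (n - 3) / ((n + 1)^2 * (n + 3) * (n + 5))
  s   <- (b2 - e) / sqrt(v)
  rb1 <- 6 * (n^2 - 5 * n + 2) / ((n + 7) * (n + 9)) *
    sqrt(6 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3)))
  A   <- 6 + 8 / rb1 * (2 / rb1 + sqrt(1 + 4 / rb1^2))
  den <- 1 + s * sqrt(2 / (A - 4))
  cube_root <- sign(den) * abs((1 - 2 / A) / den)^(1 / 3)
  z2  <- ((1 - 2 / (9 * A)) - cube_root) / sqrt(2 / (9 * A))

  # K-squared and its p-value from the chi-squared distribution with 2 df.
  k2 <- z1^2 + z2^2
  list(statistic = k2, p.value = pchisq(k2, df = 2, lower.tail = FALSE))
}

set.seed(11)
res_normal <- dagostino_pearson(rnorm(500))   # hypothetical normal sample
res_skewed <- dagostino_pearson(rexp(500))    # hypothetical skewed sample
print(unlist(res_normal))
print(unlist(res_skewed))
```

Keeping the two component scores separate (z1 for skewness, z2 for kurtosis) is one way to see which kind of departure drives a large K-squared, something the omnibus statistic alone does not reveal.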

Smirnov Test

The Smirnov test ("two-sample Kolmogorov-Smirnov test") checks whether two datasets have significantly different distributions. The test has the advantage of making no assumption about the distribution of the data (it is non-parametric and distribution-free).
Note:

  • This generality comes at some cost: other tests (e.g., the Student's t-test) may be more sensitive if the data meet the test requirements.
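
A local sketch of the two-sample comparison, using base R's stats::ks.test on simulated data, is shown below. It is meant only to illustrate the idea of splitting one numeric column by a two-valued categorical column, as the "columns" argument does; the data, the variable names, and the size of the shift between the two groups are assumptions, and the Vantage Analytic Library implementation may differ.

```r
# Illustrative only: base R on simulated data.
set.seed(3)
income <- c(rnorm(150, mean = 50000, sd = 8000),
            rnorm(150, mean = 56000, sd = 8000))   # hypothetical values
gender <- rep(c("F", "M"), each = 150)             # two-valued categorical column

# Split the dependent column into the two samples to be compared.
groups <- split(income, gender)

# Two-sample Kolmogorov-Smirnov (Smirnov) test: D is the largest vertical
# distance between the two empirical distribution functions.
res <- ks.test(groups[["F"]], groups[["M"]])
smirnov_d <- unname(res$statistic)
smirnov_p <- res$p.value

print(c(D = smirnov_d, p = smirnov_p))
```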

Examples

# Notes:
#   1. To execute Vantage Analytic Library functions, set option 
#      'val.install.location' to the database name where Vantage Analytic 
#      Library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic 
#      Library installer.
#   3. The Statistical Test metadata tables must be loaded into the database 
#      where Vantage Analytic Library is installed.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create an object of class "tbl_teradata".
custanly <- tbl(con, "customer_analysis")
print(custanly)

# Example 1: A Kolmogorov-Smirnov test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly, 
                        dependent.column="income", 
                        group.columns="years_with_bank", 
                        style="ks")

# Print the results.
print(obj$result)

# Example 2: A Lilliefors test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly, 
                        dependent.column="income", 
                        group.columns="years_with_bank", 
                        style="l")

# Print the results.
print(obj$result)

# Example 3: A Shapiro-Wilk test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly, 
                        dependent.column="income", 
                        group.columns="years_with_bank", 
                        style="sw")

# Print the results.
print(obj$result)

# Example 4: A D'Agostino and Pearson test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly, 
                        dependent.column="income", 
                        group.columns="years_with_bank", 
                        style="p")

# Print the results.
print(obj$result)

# Example 5: A Smirnov test by providing "columns" and "group.columns".
obj <- td_ks_test_valib(data=custanly, 
                        columns="gender",
                        dependent.column="income", 
                        group.columns="years_with_bank", 
                        style="s")

# Print the results.
print(obj$result)