Description
Statistical tests of this type attempt to determine the likelihood that two
distribution functions represent the same distribution. Two empirical
distribution functions are mapped against each other, or a single empirical
function is mapped against a hypothetical (e.g., Normal) distribution.
Conclusions are then drawn about the likelihood that the two distributions
are the same.
The function performs the following tests:
Kolmogorov-Smirnov Test (One Sample)
Lilliefors Test
Shapiro-Wilk Test
D'Agostino and Pearson Test
Smirnov Test
Detailed information about each test can be found in the
'Statistical Tests offered' section.
Usage
td_ks_test_valib(data, dependent.column, ...)
Arguments
data
Required Argument.
Specifies the input data, as an object of class "tbl_teradata",
on which the statistical test is performed.

dependent.column
Required Argument.
Specifies the name of the column whose distribution is tested.
Types: character

...
Specifies other arguments supported by the function as described
in the 'Other Arguments' section.
Value
Function returns an object of class "td_ks_test_valib"
which is a named list containing an object of class "tbl_teradata".
Named list member can be referenced directly with the "$" operator
using name: result.
Other Arguments
columns
Optional Argument.
Specifies a categorical variable with two values that
indicate the distribution to which the "dependent.column"
belongs.
Note:
Used only by the Smirnov test.
Types: character OR vector of Strings (character)
fallback
Optional Argument.
Specifies whether FALLBACK is requested in the
output result or not.
Default Value: FALSE (Not requested)
Types: logical
group.columns
Optional Argument.
Specifies the name(s) of the column(s) for grouping
so that a separate result is produced for each value
or combination of values in the specified column or
columns.
Types: character OR vector of Strings (character)
allow.duplicates
Optional Argument.
Specifies whether duplicates are allowed in the
output or not.
Default Value: FALSE
Types: logical
stats.database
Optional Argument.
Specifies the database where the statistical test
metadata tables are installed. If not specified,
the source database is searched for these metadata
tables.
Types: character
style
Optional Argument.
Specifies the test style.
Permitted Values:
'ks' - Kolmogorov-Smirnov test.
'l' - Lilliefors test.
'sw' - Shapiro-Wilk test.
'p' - D'Agostino and Pearson test.
's' - Smirnov test.
Default Value: 'ks'
Types: character
probability.threshold
Optional Argument.
Specifies the threshold probability, i.e.,
alpha probability, below which the null
hypothesis is rejected.
Default Value: 0.05
Types: numeric
Statistical Tests offered
Kolmogorov-Smirnov Test (One Sample)
The Kolmogorov-Smirnov (One Sample) test determines if a dataset matches
a particular distribution (for this test, the normal distribution). The test
has the advantage of making no assumption about the distribution of data
(non-parametric and distribution-free). Note that this generality comes at
some cost: other tests (e.g., the Student's t-test) may be more sensitive if
the data meet the requirements of the test. The Kolmogorov-Smirnov test is
generally less powerful than the tests specifically designed to test for
normality. This is especially true when the mean and variance are not
specified in advance for the Kolmogorov-Smirnov test, which then becomes
conservative. Further, the Kolmogorov-Smirnov test will not indicate the type
of nonnormality, e.g., whether the distribution is skewed or heavy-tailed.
Examination of the skewness and kurtosis, and of the histogram, boxplot, and
normal probability plot for the data may show why the data failed the
Kolmogorov-Smirnov test.
You can specify group by variables (GBVs) so a separate test will be done for
every unique set of values of the GBVs.
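As a local, non-Vantage illustration of the one-sample test, base R's stats::ks.test can compare a sample against a fully specified normal distribution; the data and parameters below are made up for the sketch:

```r
# One-sample Kolmogorov-Smirnov test against a normal distribution
# whose mean and sd are specified in advance (the non-conservative form).
set.seed(42)
x <- rnorm(200, mean = 50000, sd = 12000)  # stand-in for an income column

res <- ks.test(x, "pnorm", mean = 50000, sd = 12000)
res$statistic   # maximum distance between empirical and normal CDFs
res$p.value     # reject normality when this falls below the alpha threshold
```

When the mean and standard deviation are instead estimated from the sample, the plain KS p-value becomes conservative, which is the case the Lilliefors test addresses.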
Lilliefors Test
The Lilliefors test determines whether a dataset matches a particular
distribution. This test is a modification of the Kolmogorov-Smirnov test in
that a conversion to Z-scores is made. The Lilliefors test computes the
Lilliefors statistic and checks its significance. Exact tables of the
quantiles of the test statistic are computed from random numbers in computer
simulations, and the computed value of the test statistic is compared with
the quantiles of the statistic.
When the test is for the normal distribution, the null hypothesis is that the
distribution function is normal with unspecified mean and variance. The
alternative hypothesis is that the distribution function is nonnormal. The
empirical distribution of X is compared with a normal distribution with the
same mean and variance as X. It is similar to the Kolmogorov-Smirnov test,
but it adjusts for the fact that the parameters of the normal distribution
are estimated from X rather than specified in advance.
You can specify GBVs so a separate test will be done for every unique set of
values of the GBVs.
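The Lilliefors statistic itself can be sketched in base R: z-score the sample with its own mean and standard deviation, then take the Kolmogorov-Smirnov distance between the empirical CDF and the standard normal CDF. The random data below is illustrative; significance must come from the simulated quantile tables (e.g., nortest::lillie.test):

```r
set.seed(1)
x <- rnorm(100)

# Convert to Z-scores using the sample's own mean and sd.
z <- sort((x - mean(x)) / sd(x))
n <- length(z)
cdf <- pnorm(z)

# KS distance: maximum over both one-sided gaps of the step-function ECDF.
D <- max((1:n) / n - cdf, cdf - (0:(n - 1)) / n)
D  # the Lilliefors test compares this against simulated quantiles
```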
Shapiro-Wilk Test
The Shapiro-Wilk test detects departures from normality without requiring
that the mean or variance of the hypothesized normal distribution be
specified in advance. It is considered to be one of the best omnibus tests of
normality. The function is based on the approximations and code given by
Royston (1982a, b) and can be used in samples as large as 2,000 or as small
as 3. Royston (1982b) gives approximations and tabled values that can be used
to compute the coefficients, and obtains the significance level of the
W statistic. Small values of W are evidence of departure from normality. This
test has done very well in comparison studies with other goodness of fit
tests.
Either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test
for normality. As omnibus tests, however, they will not indicate the type of
nonnormality, e.g., whether the distribution is skewed as opposed to
heavy-tailed (or both). Examination of the calculated skewness and kurtosis,
and of the histogram, boxplot, and normal probability plot for the data may
provide clues as to why the data failed the Shapiro-Wilk or
D'Agostino-Pearson test.
The standard algorithm for the Shapiro-Wilk test applies only to sample sizes
from 3 to 2,000. The W statistic measures how closely the ordered sample
values follow the expected order statistics of a normal distribution with the
same mean and variance as the sample.
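Base R ships this test as stats::shapiro.test, using the same Royston approximation (R accepts sample sizes from 3 to 5,000). The data below is deliberately non-normal to show a rejection:

```r
set.seed(7)
x <- rexp(100)            # skewed, clearly non-normal sample

res <- shapiro.test(x)
res$statistic             # W near 1 is consistent with normality
res$p.value               # a small value here rejects normality
```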
D'Agostino and Pearson Test
Either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test
for normality. These tests are designed to detect departures from normality
without requiring that the mean or variance of the hypothesized normal
distribution be specified in advance. Though these tests cannot indicate the
type of nonnormality, they tend to be more powerful than the
Kolmogorov-Smirnov test.
The D'Agostino-Pearson K squared statistic has approximately a chi-squared
distribution with 2 df when the population is normally distributed.
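The chi-squared reference distribution makes the decision rule a one-liner in base R; the K squared value below is an illustrative number, not output from td_ks_test_valib:

```r
# Survival probability of chi-squared with 2 degrees of freedom gives
# the p-value for an observed K squared statistic.
k2 <- 7.4                                    # illustrative statistic value
p  <- pchisq(k2, df = 2, lower.tail = FALSE)
p                                            # about 0.0247
p < 0.05                                     # normality rejected at alpha = 0.05
```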
Smirnov Test
The Smirnov test ("two-sample Kolmogorov-Smirnov test") checks whether two
datasets have significantly different distributions. The test has the
advantage of making no assumption about the distribution of the data
(non-parametric and distribution-free).
Note:
This generality comes at some cost: other tests (e.g., the Student's t-test) may be more sensitive if the data meet the test requirements.
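Base R's stats::ks.test also provides the two-sample (Smirnov) form when given two vectors, analogous to splitting "dependent.column" by a two-valued "columns" variable; the shifted samples below are illustrative:

```r
set.seed(3)
a <- rnorm(150, mean = 0)   # first group
b <- rnorm(150, mean = 1)   # second group, shifted distribution

res <- ks.test(a, b)        # no reference distribution needed
res$statistic               # maximum distance between the two ECDFs
res$p.value                 # a small value: the groups differ in distribution
```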
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option
# 'val.install.location' to the database name where Vantage Analytic
# Library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# 3. The Statistical Test metadata tables must be loaded into the database
# where Analytics Library is installed.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
custanly <- tbl(con, "customer_analysis")
print(custanly)
# Example 1: A Kolmogorov-Smirnov test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly,
dependent.column="income",
group.columns="years_with_bank",
style="ks")
# Print the results.
print(obj$result)
# Example 2: A Lilliefors test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly,
dependent.column="income",
group.columns="years_with_bank",
style="l")
# Print the results.
print(obj$result)
# Example 3: A Shapiro-Wilk test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly,
dependent.column="income",
group.columns="years_with_bank",
style="sw")
# Print the results.
print(obj$result)
# Example 4: A D'Agostino and Pearson test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly,
dependent.column="income",
group.columns="years_with_bank",
style="p")
# Print the results.
print(obj$result)
# Example 5: A Smirnov test by providing "group.columns".
obj <- td_ks_test_valib(data=custanly,
columns="gender",
dependent.column="income",
group.columns="years_with_bank",
style="s")
# Print the results.
print(obj$result)