Description
Statistical tests of this type calculate statistics based on the rank of
variables rather than variable values. In general, data that are ranked and
ordinal may be analyzed by these tests. Within some constraints, either
numeric or non-numeric data may be analyzed.
Supported rank tests include the following:
Mann-Whitney/Kruskal-Wallis Test
Mann-Whitney/Kruskal-Wallis Test (Independent Tests)
Wilcoxon Signed Ranks Test
Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho
The choice between the Mann-Whitney and Kruskal-Wallis tests is made
automatically, looking at the number of distinct values of the independent
variable. A variation of the Mann-Whitney test considers each requested
variable individually, rather than combined, performing a series of
independent tests.
Detailed information about each test can be found in the
'Statistical Tests offered' section.
Usage
td_rank_test_valib(data, ...)
Arguments
data
Required Argument.
...
Specifies other arguments supported by the function as described in the 'Other Arguments' section.
Value
Function returns an object of class "td_rank_test_valib",
which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Other Arguments
block.column
Optional Argument.
Specifies the name of the column representing blocks.
Notes:
Used only by the Friedman test.
When pairing treatment and block column values, a division by zero error can occur if unequal cell counts are found.
Types: character
dependent.column
Optional Argument.
Specifies the name of the column representing
the dependent variable. If non-numeric, it will
be ranked alphanumerically.
Note:
Used only by the Mann-Whitney and Friedman tests.
Types: character
columns
Optional Argument.
Specifies the name(s) of the categorical column(s)
representing independent variables.
Note:
Used only by the Mann-Whitney test.
Types: character OR vector of Strings (character)
fallback
Optional Argument.
Specifies whether FALLBACK is requested for the
output result.
Default Value: FALSE (Not requested)
Types: logical
first.column
Optional Argument.
Specifies the name of the column that represents
the first sample variable.
Note:
Used only by the Wilcoxon test.
Types: character
group.columns
Optional Argument.
Specifies the name(s) of the column(s) for grouping
so that a separate result is produced for each
value or combination of values in the specified
column or columns.
Types: character OR vector of Strings (character)
include.zero
Optional Argument.
Specifies whether to discard cases with zero
differences or not. Ordinarily, the Wilcoxon test
discards cases with zero differences. When set to
TRUE, includes these cases with the positive
count.
Note:
Used only by the Wilcoxon test.
Default Value: FALSE
Types: logical
independent
Optional Argument.
Specifies whether variation of the Mann-Whitney test
should be performed or not. When set to TRUE,
Mann-Whitney test variation is performed considering
each requested variable individually, rather than in
combination, performing a series of independent
tests.
Note:
Used only by the Mann-Whitney test.
Default Value: FALSE
Types: logical
allow.duplicates
Optional Argument.
Specifies whether duplicates are allowed in the
output or not.
Default Value: FALSE
Types: logical
second.column
Optional Argument.
Specifies the name of the column that represents
the second sample variable.
Note:
Used only by the Wilcoxon test.
Types: character
single.tail
Optional Argument.
Specifies whether to request single-tailed test or
not. When TRUE, a single-tailed test is requested.
Otherwise, a two-tailed test is requested.
Notes:
Used only by the Mann-Whitney and Wilcoxon tests.
If the Mann-Whitney test becomes a Kruskal-Wallis test, the "single.tail" option is invalid.
Default Value: FALSE
Types: logical
stats.database
Optional Argument.
Specifies the database where
the statistical test metadata tables are
installed. If not specified, the source database
is searched for these metadata tables.
Types: character
style
Optional Argument.
Specifies the test style.
Permitted Values:
'mw' - Mann-Whitney test.
'friedman' - Friedman test.
'wilcoxon' - Wilcoxon test.
Default Value: 'mw'
Types: character
probability.threshold
Optional Argument.
Specifies the threshold probability, i.e.,
'alpha' probability, below which the null
hypothesis is rejected.
Default Value: 0.05
Types: numeric
treatment.column
Optional Argument.
Specifies the name of the column representing
the independent categorical variable.
Notes:
Used only by the Friedman test.
When pairing treatment and block column values, a division by zero error can occur if unequal cell counts are found.
Types: character
Statistical Tests offered
Mann-Whitney/Kruskal-Wallis
The test to execute is selected automatically based on the number
of distinct values of the independent variable. The Mann-Whitney test is used
for two groups, the Kruskal-Wallis test for three or more groups.
A special version of the Mann-Whitney/Kruskal-Wallis test performs a
separate, independent test for each independent variable, and displays the
result of each test with its accompanying column name. Under the primary
version of the Mann-Whitney/Kruskal-Wallis test, all independent variable
value combinations are used, often forcing the Kruskal-Wallis test, since
the number of value combinations exceeds two. When a variable which has more
than two distinct values is included in the set of independent variables,
then the Kruskal-Wallis test is performed for all variables. Since
Kruskal-Wallis is a generalization of Mann-Whitney, the Kruskal-Wallis
results are valid for all the variables, including two-valued ones. In the
discussion below, both types of Mann-Whitney/Kruskal-Wallis are referred to
as Mann-Whitney/Kruskal-Wallis tests, since the only difference is the way
the independent variable is treated.
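The selection rule described above can be sketched locally in base R. This is only an illustration of the rule as read from this description: `choose_rank_test` is a hypothetical helper, not part of tdplyr or the Vantage Analytic Library, and the data frame is made up.

```r
# Hypothetical helper that mimics the documented selection rule on a local
# data frame. It is NOT the library's implementation.
choose_rank_test <- function(df, columns, independent = FALSE) {
  pick <- function(n_groups) {
    if (n_groups <= 2) "Mann-Whitney" else "Kruskal-Wallis"
  }
  if (independent) {
    # Independent variation: one result per column. If any column has more
    # than two distinct values, Kruskal-Wallis is used for all columns.
    n_distinct <- vapply(columns, function(col) length(unique(df[[col]])),
                         integer(1))
    setNames(rep(pick(max(n_distinct)), length(columns)), columns)
  } else {
    # Primary version: groups are the distinct value combinations.
    pick(nrow(unique(df[, columns, drop = FALSE])))
  }
}

toy <- data.frame(
  gender = c("F", "M", "F", "M", "F", "M"),
  svacct = c("Y", "Y", "N", "N", "Y", "N")
)
combined   <- choose_rank_test(toy, c("gender", "svacct"))
individual <- choose_rank_test(toy, c("gender", "svacct"), independent = TRUE)
print(combined)
print(individual)
```

With two two-valued columns, the combined version sees four value combinations, while the independent variation sees two groups per column.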
The Mann-Whitney test, also known as the Wilcoxon Two Sample Test, is the nonparametric
analog of the 2-sample t test. It is used to compare two independent groups
of sampled data, and tests whether they are from the same population or from
different populations (i.e., whether the samples have the same distribution
function). Unlike the parametric t-test, this non-parametric test makes no
assumptions about the distribution of the data (e.g., normality). It is to
be used as an alternative to the independent group t-test, when the
assumption of normality or equality of variance is not met. Like many
non-parametric tests, it uses the ranks of the data rather than the data
itself to calculate the U statistic. Because the Mann-Whitney test makes
no distribution assumption, it is less powerful than the t-test when the
t-test's assumptions hold. On the other hand, the Mann-Whitney test can be
more powerful than the t-test when parametric assumptions are not met.
Another advantage is that it will provide the same
results under any monotonic transformation of the data so the results of the
test are more generalizable.
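As a local illustration of how U is derived from ranks, and of the invariance under monotonic transformations, here is a minimal base R sketch. The two samples are made up, and `wilcox.test` from the stats package stands in for the database computation, which may differ in detail.

```r
# Two small made-up samples with no ties.
x <- c(1.1, 2.3, 3.8, 4.0)
y <- c(2.9, 5.2, 6.1, 7.4, 8.0)

# Rank the pooled data, then derive U for the first sample from its rank sum.
pooled_ranks <- rank(c(x, y))
rank_sum_x <- sum(pooled_ranks[seq_along(x)])
u_x <- rank_sum_x - length(x) * (length(x) + 1) / 2

# Base R reports the same quantity as the statistic "W".
fit <- wilcox.test(x, y)

# A strictly increasing transformation leaves the ranks, and therefore the
# test result, unchanged.
fit_log <- wilcox.test(log(x), log(y))

print(u_x)
print(fit$p.value)
print(fit_log$p.value)
```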
The Mann-Whitney is used when the independent variable is nominal or ordinal
and the dependent variable is ordinal (or treated as ordinal). The main
assumption is that the variable on which the 2 groups are to be compared is
continuously distributed. This variable may be non-numeric, and if so, is
converted to a rank based on alphanumeric precedence.
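Ranking by alphanumeric precedence can be pictured with base R's `rank`, which also accepts character vectors. This is only a local sketch with made-up values; the collation and tie handling used inside the database may differ.

```r
# Made-up non-numeric values; ties receive the average of their ranks.
grades <- c("B", "A", "C", "A")
grade_ranks <- rank(grades)
print(grade_ranks)
```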
The null hypothesis is that both samples have the same distribution. The
alternative hypotheses are that the distributions differ from each other in
either direction (two-tailed test), or in a specific direction (upper-tailed
or lower-tailed tests). Output is a p-value, which when compared to the
user's threshold, determines whether the null hypothesis should be rejected.
Given one or more columns (independent variables) whose values define two
independent groups of sampled data, and a column (dependent variable) whose
distribution is of interest from the same input, the Mann-Whitney test
is performed for each set of unique values of the group-by variables (GBVs),
if any.
The Kruskal-Wallis test is the nonparametric analog of the one-way analysis
of variance or F-test used to compare three or more independent groups of
sampled data. When there are only two groups, it reduces to the Mann-Whitney
test (above). The Kruskal-Wallis test tests whether multiple samples of data
are from the same population or from different populations (i.e., whether the
samples have the same distribution function). Unlike the parametric
independent group ANOVA (one-way ANOVA), this non-parametric test makes no
assumptions about the distribution of the data (e.g., normality). Since this
test does not make a distributional assumption, it is not as powerful as
ANOVA.
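The relationship between the two tests can be checked locally with base R. The sketch below uses made-up data and the stats functions `kruskal.test` and `wilcox.test`, not the Vantage implementation. With two groups, the Kruskal-Wallis p-value matches the Mann-Whitney p-value computed with the normal approximation and no continuity correction.

```r
# Made-up measurements for three independent groups (no ties).
values <- c(1.1, 2.3, 3.8, 4.0, 2.9, 5.2, 6.1, 7.4, 8.0, 9.5, 10.2, 11.7)
groups <- factor(rep(c("a", "b", "c"), times = c(4, 5, 3)))

# Three groups: Kruskal-Wallis H statistic and p-value.
kw3 <- kruskal.test(values, groups)
print(kw3$statistic)
print(kw3$p.value)

# Two groups only: compare Kruskal-Wallis with Mann-Whitney.
two <- groups %in% c("a", "b")
kw2 <- kruskal.test(values[two], droplevels(groups[two]))
mw2 <- wilcox.test(values[groups == "a"], values[groups == "b"],
                   exact = FALSE, correct = FALSE)
print(c(kw2$p.value, mw2$p.value))
```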
Given k independent samples of numeric values, a Kruskal-Wallis test is
produced for each set of unique values of the GBVs, testing whether all the
populations are identical. This test variable may be non-numeric, and if so,
is converted to a rank based on alphanumeric precedence. The null hypothesis
is that all samples have the same distribution. The alternative hypothesis
is that the distributions differ from each other. Output for each unique set
of values of the GBVs is a statistic H, and a p-value, which when compared to
the user's threshold, determines whether the null hypothesis should be
rejected for the unique set of values of the GBVs.
Wilcoxon Signed Ranks Test
The Wilcoxon Signed Ranks Test is an alternative analogous to the t-test for correlated samples. The correlated-samples t-test makes assumptions about the data, and can be properly applied only if certain assumptions are met:
the scale of measurement has the properties of an equal-interval scale
differences between paired values are randomly selected from the source population
the source population has a normal distribution
If any of these assumptions are invalid, the t-test for correlated samples should not be used. Of cases where these assumptions are unmet, the most common are those where the scale of measurement fails to have equal-interval scale properties, e.g., a case in which the measures are from a rating scale. When data within two correlated samples fail to meet one or another of the assumptions of the t-test, an appropriate non-parametric alternative is the Wilcoxon Signed-Rank Test, a test based on ranks. Assumptions for this test are:
the distribution of difference scores is symmetric (implies equal interval scale)
difference scores are mutually independent
difference scores have the same mean
The original measures are replaced with ranks resulting in analysis only of
the ordinal relationships. The signed ranks are organized and summed, giving
a number, W. When the numbers of positive and negative signs are about equal
(i.e., there is no tendency in either direction), the value of W will be near
zero, and the null hypothesis will be supported. A large positive or negative
sum indicates a tendency for the differences to fall in one direction, that
is, a difference between the paired samples in that direction.
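The signed-rank sum described above can be reproduced by hand in base R. This is a sketch on made-up paired data; it discards zero differences (the default behavior described for this test) and does not attempt to reproduce the "include.zero" option.

```r
# Made-up paired measurements.
before <- c(10, 12, 9, 14, 11, 13)
after  <- c(12, 11, 13, 14, 16, 10)

d <- after - before
d <- d[d != 0]                          # discard zero differences
signed_ranks <- sign(d) * rank(abs(d))  # rank the magnitudes, restore signs

w <- sum(signed_ranks)                    # signed-rank sum W
v <- sum(signed_ranks[signed_ranks > 0])  # sum of positive ranks only
print(w)
print(v)
```

Base R's `wilcox.test(..., paired = TRUE)` reports the positive-rank sum (called V) rather than W; for n non-zero differences the two are related by W = 2V - n(n + 1)/2.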
Given the input and names of paired numeric columns, a Wilcoxon test is
produced. The Wilcoxon tests whether a sample comes from a population with a
specific mean or median. The null hypothesis is that the samples come from
populations with the same mean or median. The alternative hypothesis is that
the samples come from populations with different means or medians
(two-tailed test), or that in addition the difference is in a specific
direction (upper-tailed or lower-tailed tests). Output is a p-value, which
when compared to the user's threshold, determines whether the null hypothesis
should be rejected.
Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho
The Friedman test is an extension of the sign test for several independent
samples. It is analogous to the 2-way Analysis of Variance, but depends only
on the ranks of the observations, so it is like a 2-way ANOVA on ranks.
The Friedman test should not be used for only three treatments due to lack of
power, and is best for six or more treatments. It is a test for treatment
differences in a randomized, complete block design. Data consists of b
mutually independent k-variate random variables called blocks. The Friedman
assumptions are that the data in these blocks are mutually independent, and
that within each block, observations are ordinally rankable according to some
criterion of interest.
A Friedman test is produced using rank scores and the F table, though
alternative implementations call it the Friedman Statistic and use the
chi-square table. Note that when all of the treatments are not applied to
each block, it is an incomplete block design. The requirements of the
Friedman test are not met under these conditions, and other tests such as
the Durbin test should be applied.
In addition to the Friedman statistics, Kendall's Coefficient of Concordance
(W) is produced, as well as Spearman's Rho. Kendall's coefficient of
concordance can range from 0 to 1. The higher its value, the stronger the
association. W is 1.0 if all treatments receive the same ranks in all
blocks, and 0 if there is "perfect disagreement" among blocks.
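Kendall's W can be computed from within-block ranks, as in the base R sketch below. The data are made up (4 blocks, 3 treatments, no ties), and the formula is the standard one for untied ranks. The chi-square form of the Friedman statistic used here is the one reported by `friedman.test` in the stats package; as noted above, this function uses the F table, so its reported statistic may differ.

```r
# Made-up scores: rows are blocks, columns are treatments.
scores <- matrix(c(5.1, 6.0, 7.2,
                   4.8, 7.5, 6.9,
                   5.5, 6.1, 8.0,
                   6.3, 5.9, 7.7),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("block", 1:4), c("t1", "t2", "t3")))
b <- nrow(scores)  # number of blocks
k <- ncol(scores)  # number of treatments

# Rank the treatments within each block, then sum the ranks per treatment.
ranks <- t(apply(scores, 1, rank))
rank_sums <- colSums(ranks)

# Kendall's W (no ties) and the chi-square form of the Friedman statistic.
s <- sum((rank_sums - mean(rank_sums))^2)
kendall_w <- 12 * s / (b^2 * (k^3 - k))
chi_sq <- b * (k - 1) * kendall_w

fr <- friedman.test(scores)
print(kendall_w)
print(c(chi_sq, unname(fr$statistic)))
```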
Spearman's rho is a measure of the linear relationship between two variables.
It differs from Pearson's correlation only in that the computations are done
after the numbers are converted to ranks. Spearman's Rho equals 1 if there is
perfect agreement among rankings; disagreement causes rho to be less than 1,
sometimes becoming negative.
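The statement that Spearman's rho is Pearson's correlation computed on ranks can be verified directly in base R. The values below are made up so that the two variables have identical rankings but are not linearly related.

```r
# Made-up values: y increases with x, but not linearly.
x <- c(2.1, 3.5, 1.0, 8.2, 5.5)
y <- c(1.9, 4.0, 0.7, 30.0, 9.1)

rho_spearman <- cor(x, y, method = "spearman")
rho_manual   <- cor(rank(x), rank(y))  # Pearson's correlation on the ranks
r_pearson    <- cor(x, y)              # Pearson's correlation on raw values

print(c(rho_spearman, rho_manual, r_pearson))
```

Because the rankings agree perfectly, rho equals 1 even though Pearson's correlation on the raw values is below 1.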
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option
# 'val.install.location' to the database name where Vantage analytic
# library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# 3. The Statistical Test metadata tables must be loaded into the database
# where Analytics Library is installed.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
custanly <- tbl(con, "customer_analysis")
print(custanly)
cust <- tbl(con, "customer")
print(cust)
# Example 1: Shows the parameters for a Mann-Whitney test with a threshold
#            probability of 0.01.
obj <- td_rank_test_valib(data=cust,
                          dependent.column="income",
                          columns="gender",
                          group.columns="years_with_bank",
                          probability.threshold=0.01,
                          style="mw")
# Print the results.
print(obj$result)
# Example 2: Shows the parameters for a set of Mann-Whitney independent tests.
#            The threshold probability assumes the default value of 0.05.
obj <- td_rank_test_valib(data=custanly,
                          dependent.column="income",
                          columns=c("gender", "ccacct", "svacct"),
                          independent=TRUE,
                          style="mw")
# Print the results.
print(obj$result)
# Example 3: Shows the parameters for a Wilcoxon Test.
obj <- td_rank_test_valib(data=custanly,
                          first.column="avg_ck_bal",
                          second.column="avg_sv_bal",
                          group.columns="years_with_bank",
                          style="wilcoxon")
# Print the results.
print(obj$result)
# Example 4: Shows the parameters for a Friedman Test using a specially
# prepared input.
# Prepare data for test style "friedman" as per example in VAL user guide.
# The "friedman" style needs the same number of rows for each combination
# of "treatment.column" and "block.column" values.
# Let's get the smallest count of value combinations in the 'gender' and
# 'marital_status' columns from custanly tbl_teradata.
cgb <- custanly %>% group_by(marital_status, gender)
min_val <- cgb %>% summarise(cnt_id = n(cust_id)) %>% pull(cnt_id) %>% min()
val <- as.numeric(min_val)
df_cr <- custanly %>% select("cust_id", "gender", "marital_status", "income",
                             "ckacct", "svacct")
# Sample the same number of rows from each gender/marital_status combination.
df_fried <- td_sample(df_cr,
                      when_then=list("gender='F' and marital_status=1"=val,
                                     "gender='F' and marital_status=2"=val,
                                     "gender='F' and marital_status=3"=val,
                                     "gender='F' and marital_status=4"=val,
                                     "gender='M' and marital_status=1"=val,
                                     "gender='M' and marital_status=2"=val,
                                     "gender='M' and marital_status=3"=val,
                                     "gender='M' and marital_status=4"=val))
# Execute the RankTest() function.
obj <- td_rank_test_valib(data=df_fried,
                          style="friedman",
                          dependent.column="income",
                          block.column="marital_status",
                          treatment.column="gender")
# Print the results.
print(obj$result)