Description
Statistical tests of this type are based on a matrix of frequencies or counts. A frequency pattern that is non-random is sought in the matrix. Supported tests of this type include the following:
Chi Square Test - Besides a Chi Square value, other measures are computed in a Chi Square Test, including a Phi Coefficient, Cramer's V, Likelihood Ratio Chi Square, Continuity-Adjusted Chi Square, and Contingency Coefficient.
Median Test - A Median Test is a variation of Chi Square Test wherein samples are tested to see if their populations have the same median value.
Detailed information about each test can be found in the
'Statistical Tests offered' section.
Usage
td_chi_square_test_valib(data, ...)
Arguments
data
Required Argument.
Specifies the input data on which the statistical test is run.
Types: tbl_teradata
...
Specifies other arguments supported by the function, as described in the 'Other Arguments' section.
Value
The function returns an object of class "td_chi_square_test_valib",
which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Other Arguments
dependent.column
Optional Argument.
Specifies the name of the numeric column
representing dependent variable.
Note:
Used only by the Median Test.
Types: character
columns
Optional Argument.
Specifies the name(s) of the categorical column(s)
representing independent variables.
Note:
Used only by the Median Test.
Types: character OR vector of Strings (character)
fallback
Optional Argument.
Specifies whether FALLBACK is requested for the
output result.
Default Value: FALSE (Not requested)
Types: logical
first.columns
Optional Argument.
Specifies the name(s) of the column(s) representing
the first of variable pairs for analysis.
Notes:
Used only by the Chi Square Test.
The number of combinations of "first.columns" and "second.columns" may not exceed 100.
If the product of the number of distinct values in a column pair exceeds 2000, the analysis of that combination is skipped.
Types: character OR vector of Strings (character)
group.columns
Optional Argument.
Specifies the name(s) of the column(s) for grouping
so that a separate result is produced for each
value or combination of values in the specified
column or columns.
Note:
Used only by the Median Test.
Types: character OR vector of Strings (character)
allow.duplicates
Optional Argument.
Specifies whether duplicates are allowed in the
output or not.
Default Value: FALSE
Types: logical
second.columns
Optional Argument.
Specifies the name(s) of the column(s)
representing the second of variable pairs for
analysis.
Notes:
Used only by the Chi Square Test.
The number of combinations of "first.columns" and "second.columns" may not exceed 100.
If the product of the number of distinct values in a column pair exceeds 2000, the analysis of that combination is skipped.
Types: character OR vector of Strings (character)
stats.database
Optional Argument.
Specifies the database where the statistical test
metadata tables are installed. If not specified,
the source database is searched for these metadata
tables.
Types: character
style
Optional Argument.
Specifies the test style.
Permitted Values:
'chisq' - Chi Square test.
'median' - Median test.
Default Value: 'chisq'
Types: character
probability.threshold
Optional Argument.
Specifies the threshold probability, i.e.,
'alpha' probability, below which the null
hypothesis is rejected.
Default Value: 0.05
Types: numeric
Statistical Tests offered
Chi Square Tests
The most common application of the chi-square test is comparing observed counts of
particular cases with the expected counts. For example, a random sample of
N people contains m males and f females, but usually we would not find
exactly m=0.5N and f=0.5N. We could use the chi-square test to determine
whether the difference is significant enough to rule out the 50/50 hypothesis.
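The 50/50 example above can be sketched locally in base R. This is only an illustration with hypothetical counts (58 males and 42 females), not the computation Vantage performs:

```r
# Hypothetical sample (assumption): 58 males and 42 females, N = 100.
observed <- c(male = 58, female = 42)
n <- sum(observed)

# Expected counts under the 50/50 hypothesis.
expected <- n * c(0.5, 0.5)

# Pearson chi-square statistic and its p-value with 1 degree of freedom.
chi_sq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

# Cross-check with the base R goodness-of-fit test.
check <- chisq.test(observed, p = c(0.5, 0.5))

print(c(chi_sq = chi_sq, p_value = p_value))
```

With these counts the p-value is above 0.05, so the 50/50 hypothesis would not be rejected at that threshold.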
The Chi Square Test determines whether the probabilities observed from data
in an R x C contingency table are the same or different. The null hypothesis is
that the observed probabilities are the same. The output is a p-value which, when
compared to the user's threshold, determines whether the null hypothesis
should be rejected.
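As a minimal local sketch of the R x C test, the following base R code derives expected counts from the table margins and compares the resulting p-value with a threshold. The 2 x 3 table and the names `counts`, `dof`, and `alpha` are hypothetical and chosen only for illustration:

```r
# Hypothetical 2 x 3 contingency table of counts (assumption).
counts <- matrix(c(30, 20, 10,
                   20, 30, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("a", "b"),
                                 outcome = c("x", "y", "z")))
n <- sum(counts)

# Expected counts under independence: (row total * column total) / N.
expected <- outer(rowSums(counts), colSums(counts)) / n

# Pearson chi-square statistic with (R - 1) * (C - 1) degrees of freedom.
chi_sq <- sum((counts - expected)^2 / expected)
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Compare with the threshold (0.05 mirrors the documented default).
alpha <- 0.05
reject_null <- p_value < alpha
print(list(chi_sq = chi_sq, dof = dof, p_value = p_value,
           reject_null = reject_null))
```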
Other Calculated Measures of Association:
Phi coefficient - The Phi coefficient is a measure of the degree of association between two binary variables, and represents the correlation between two dichotomous variables. It is based on adjusting chi-square significance to factor out sample size, and is the same as the Pearson correlation for two dichotomous variables.
Cramer's V - Cramer's V is used to examine the association between two categorical variables when the contingency table is larger than 2 X 2 (e.g., 2 X 3). In these more complex designs, phi is not appropriate, but Cramer's statistic is. Cramer's V represents the association or correlation between two variables. It is the most popular of the chi-square-based measures of nominal association, designed so that the attainable upper limit is always 1.
Likelihood Ratio Chi Square - Likelihood ratio chi-square is an alternative test of the hypothesis of no association between columns and rows in nominal-level tabular data. It is based on maximum likelihood estimation, and involves the ratio between the observed and the expected frequencies, whereas the ordinary chi-square test involves the difference between the two. This is a more recent version of chi-square and is directly related to log-linear analysis and logistic regression.
Continuity-Adjusted Chi-Square - The continuity-adjusted chi-square statistic for 2 X 2 tables is similar to the Pearson chi-square, except that it is adjusted for the continuity of the chi-square distribution. The continuity-adjusted chi-square is most useful for small sample sizes. The use of the continuity adjustment is controversial; this chi-square test is more conservative, and more like Fisher's exact test, when your sample size is small. As the sample size increases, the statistic becomes more and more like the Pearson chi-square.
Contingency Coefficient - The contingency coefficient is an adjustment to the phi coefficient, intended for tables larger than 2-by-2. It is always less than 1 and approaches 1.0 only for large tables. The larger the contingency coefficient, the stronger the association. It is recommended only for 5-by-5 tables or larger; for smaller tables it underestimates the level of association.
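The measures above can be illustrated with their usual textbook formulas in base R. This is a sketch on a hypothetical 2 x 2 table; the exact formulas used by the Vantage Analytic Library are not stated here, so treat the definitions below as assumptions:

```r
# Hypothetical 2 x 2 table of counts (assumption).
tab <- matrix(c(30, 20,
                15, 35), nrow = 2, byrow = TRUE)
n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson chi-square: based on differences between observed and expected.
pearson <- sum((tab - expected)^2 / expected)

# Phi coefficient: chi-square with the sample size factored out.
phi <- sqrt(pearson / n)

# Cramer's V: scaled so the attainable upper limit is 1 (equals phi for 2 x 2).
cramers_v <- sqrt(pearson / (n * (min(dim(tab)) - 1)))

# Contingency coefficient: always less than 1.
contingency <- sqrt(pearson / (pearson + n))

# Likelihood ratio chi-square: based on ratios of observed to expected
# (this form assumes all cells are non-zero).
lr_chi_sq <- 2 * sum(tab * log(tab / expected))

# Continuity-adjusted (Yates) chi-square for 2 x 2 tables.
adjusted <- sum(pmax(abs(tab - expected) - 0.5, 0)^2 / expected)

print(round(c(pearson = pearson, phi = phi, cramers_v = cramers_v,
              contingency = contingency, lr_chi_sq = lr_chi_sq,
              adjusted = adjusted), 4))
```

The continuity-adjusted value is smaller than the Pearson value, which reflects the more conservative behaviour described above.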
Median Test
The Median test is a special case of the chi-square test with fixed marginal
totals. It tests whether several samples came from populations with the same
median. The null hypothesis is that all samples have the same median.
The Median test is applied in situations similar to those for ANOVA with
independent samples, but it is used instead when any of the following occurs:
the data are markedly non-normally distributed
the measurement scale of the dependent variable is ordinal (not interval or ratio)
the data sample is too small.
Note:
The Median test is less powerful than alternative non-parametric rank tests because the dependent variable is dichotomized at the median. Because this technique discards most of the information inherent in the data, it is used less often. For two samples, frequencies are evaluated by a simple 2 X 2 contingency table, so it becomes simply a 2 X 2 chi square test of independence with 1 DF.
Given k independent samples of numeric values, a Median test is produced for
each set of unique values of the group-by variables (GBVs), if any, testing
whether all the populations have the same median. Output for each set of
unique values of the GBVs is a p-value, which when compared to the user's
threshold, determines whether the null hypothesis should be rejected for the
unique set of values of the GBVs. For more than 2 samples, this is sometimes
called the Brown-Mood test.
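The idea can be sketched locally in base R: dichotomize the dependent variable at the grand median, cross-tabulate against the sample label, and apply a chi-square test to the resulting table. The data and names below are hypothetical, and counting values equal to the median as "not above" is an assumption; the tie handling used by Vantage is not stated here:

```r
# Hypothetical data (assumption): 3 independent samples of 4 values each.
income <- c(12, 15, 18, 22, 25, 30, 31, 35, 40, 45, 50, 55)
status <- rep(c("single", "married", "widowed"), each = 4)

# Dichotomize at the grand median; ties count as "not above" (assumption).
grand_median <- median(income)
above <- income > grand_median

# 2 x k table of counts: above / not above the median, per sample.
counts <- unclass(table(above = above, sample = status))
n <- sum(counts)
expected <- outer(rowSums(counts), colSums(counts)) / n

# Chi-square test of independence with (2 - 1) * (k - 1) degrees of freedom.
chi_sq <- sum((counts - expected)^2 / expected)
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

print(list(grand_median = grand_median, chi_sq = chi_sq,
           dof = dof, p_value = p_value))
```

With these values the p-value falls between 0.01 and 0.05, so the null hypothesis of equal medians would be rejected at a threshold of 0.05 but not at 0.01. The samples here are far too small for the chi-square approximation to be reliable; they are kept small only to make the arithmetic easy to follow.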
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option
# 'val.install.location' to the database name where Vantage analytic
# library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# 3. The Statistical Test metadata tables must be loaded into the database
# where Analytics Library is installed.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer_analysis")
print(df)
# Example 1: Shows a Chi Square test execution.
obj <- td_chi_square_test_valib(data=df,
                                first.columns=c("female", "single"),
                                second.columns=c("svacct", "ccacct",
                                                 "ckacct"),
                                style="chisq")
# Print the results.
print(obj$result)
# Example 2: Shows a Median test execution with group-by option.
obj <- td_chi_square_test_valib(data=df,
                                dependent.column="income",
                                columns="marital_status",
                                group.columns="years_with_bank",
                                style="median",
                                probability.threshold=0.01)
# Print the results.
print(obj$result)