
Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Published
July 2021
Language
English (United States)
Last Update
2023-08-08
dita:id
B700-4007
NMT
no
Product Category
Teradata Vantage
Parametric Tests

Description

Parametric tests make assumptions about the data, such as the observations being normally distributed. This can be verified with a test of normality prior to executing a parametric test. Both T-Tests and F-Tests are provided. T-Tests can be either paired or unpaired, while the unpaired T-Tests can be with or without an indicator variable.
F-Tests can be 1-way, 2-way or 3-way. 2-way tests can have equal or unequal cell counts (count of rows having a combination of distinct column values), while the 3-way test must have equal cell counts. A 1-way test has 1 independent input column, a 2-way test has 2 independent columns and a 3-way test has 3 independent columns in addition to a dependent "column of interest".
Detailed information about each test can be found in the 'Statistical Tests offered' section.
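
The cell-count requirement described above can be checked before choosing a test style. The following is a minimal sketch in base R on a small local data frame (not a tbl_teradata); the column names a, b and y are hypothetical and serve only to illustrate what "equal cell counts" means.

```r
# Hypothetical local data: two independent columns (a, b) and a dependent column y.
df <- data.frame(
  a = rep(c("x", "y"), each = 4),
  b = rep(c("p", "q"), times = 4),
  y = c(10, 12, 11, 13, 20, 22, 21, 23)
)

# Cell counts: number of rows for each combination of distinct (a, b) values.
cell_counts <- table(df$a, df$b)
print(cell_counts)

# TRUE when every cell holds the same number of rows (equal cell counts).
equal_cells <- length(unique(as.vector(cell_counts))) == 1
print(equal_cells)
```

If the counts differ between cells, the 2-way test with Unequal Cell Counts is the style that applies to two independent columns.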

Usage

td_parametric_test_valib(data, ...)

Arguments

data

Required Argument.
Specifies the input data to run statistical tests.
Types: tbl_teradata

...

Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

Function returns an object of class "td_parametric_test_valib", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Other Arguments

columns

Optional Argument.
Specifies the name(s) of the column(s) representing independent variables to be analyzed in an F-Test N-Way with Equal Cell Counts analysis. There can be 1, 2 or 3 columns listed in this parameter. If 2 or 3 columns are listed, cell counts (the count of rows having a combination of distinct column values) should be the same.
Types: character OR vector of Strings (character)

dependent.column

Optional Argument.
Specifies the name of the column representing the dependent variable in an F-Test.
Types: character

equal.variance

Optional Argument.
Specifies whether the variances of the two samples are assumed to be equal (homoscedastic) or unequal (heteroscedastic).
Note:

  • This is an option for T-Test.

Types: logical

fallback

Optional Argument.
Specifies whether FALLBACK is requested for the output result.
Default Value: FALSE
Types: logical

first.column

Optional Argument.
Specifies the name of the column representing the first variable to analyze for a T-test. For an F-Test, specifies the name of the column representing the first independent variable in the analysis.
Types: character

first.column.values

Optional Argument.
Specifies a list of the "first.column" values to be included in the analysis.
Types: Integer, Numeric, character OR vector of Integers, Numerics, Strings (character)

group.columns

Optional Argument.
Specifies the name(s) of the column(s) for grouping so that a separate result is produced for each value or combination of values in the specified column or columns.
Note:

  • This option is not available for an F-Test 2-Way with Unequal Cell Counts, i.e., when "style" is set to 'f2way'.

Types: character OR vector of Strings (character)

allow.duplicates

Optional Argument.
Specifies whether duplicates are allowed in the output or not.
Default Value: FALSE
Types: logical

paired

Optional Argument.
Specifies whether the first and second column values are matched with each other. When set to TRUE, the mean difference is also analyzed.
Default Value: FALSE
Note:

  • This is an option for T-Test.

Types: logical

second.column

Optional Argument.
Specifies the name of the column representing the second variable to analyze. If the "with.indicator" argument is set to TRUE, the second column is used to define two analysis categories, one where the second column is negative or zero, and another where the second column is positive.
For an F-Test, specifies the name of the column representing the second independent variable in the analysis.
Note:

  • Date Type is not allowed to be used for the paired T-Test.

Types: character

second.column.values

Optional Argument. Required for a 2-way F-Test with Unequal Cell Counts.
Specifies a list of the "second.column" values to be included in the analysis.
Types: Integer, Numeric, character OR vector of Integers, Numerics, Strings (character)

stats.database

Optional Argument.
Specifies the database where the statistical test metadata tables are installed. If not specified, the source database is searched for these metadata tables.
Types: character

style

Optional Argument.
Specifies the test style.
Permitted Values:

  1. 't' - T-Test paired, unpaired or unpaired with indicator variable (second column).

  2. 'fnway' - F-Test N-Way with Equal Cell Counts (1, 2, or 3 columns with same number of cell counts). A cell count is the count of rows having a combination of distinct column values.

  3. 'f2way' - F-Test 2-Way with Unequal Cell Counts (2 columns with possibly different numbers of cell counts). A cell count is the count of rows having a combination of distinct column values.

Default Value: 't'
Types: character

probability.threshold

Optional Argument.
Specifies the threshold probability, i.e., 'alpha' probability, below which the null hypothesis is rejected.
Default Value: 0.05
Types: numeric

with.indicator

Optional Argument.
Specifies whether the second column is used as an indicator that defines two analysis categories: one where the second column is negative or zero, and another where the second column is positive.
Notes:

  • Argument can be used with an unpaired T-Test, i.e., when "style" is set to 't' and "paired" is set to FALSE.

Default Value: FALSE
Types: logical
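
The following is a minimal sketch of how "with.indicator" might be combined with the other T-Test arguments. The helper name ttest_with_indicator and its parameters are hypothetical, and the column names in the usage comment are assumptions about the input table rather than documented names.

```r
# Hypothetical helper: unpaired T-Test with an indicator column.
# 'value_col' and 'indicator_col' are placeholder names supplied by the caller.
ttest_with_indicator <- function(data, value_col, indicator_col, alpha = 0.05) {
  td_parametric_test_valib(data = data,
                           style = "t",
                           paired = FALSE,
                           with.indicator = TRUE,
                           first.column = value_col,
                           second.column = indicator_col,
                           probability.threshold = alpha)
}

# Usage (requires a Vantage connection; the column names are assumptions):
# obj <- ttest_with_indicator(cust, "avg_cc_bal", "female")
# print(obj$result)
```

Rows whose indicator value is negative or zero form the first category, and rows with a positive indicator value form the second.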

Statistical Tests offered

Two Sample T-Test for Equal Means

For the paired t test, a one-to-one correspondence must exist between values in both samples. The test is whether paired values have mean differences which are not significantly different from zero. It assumes differences are identically distributed normal random variables, and that they are independent.

The unpaired t test is similar, but there is no correspondence between values of the samples. It assumes that, within each sample, values are identically distributed normal random variables, and that the two samples are independent of each other. The two sample sizes may be equal or unequal. Variances of both samples may be assumed to be equal (homoscedastic) or unequal (heteroscedastic). In both cases, the null hypothesis is that the population means are equal. Test output is a p-value which, when compared to the threshold, determines whether the null hypothesis should be rejected.

The unpaired t test uses the following methods of data selection:

  • T Unpaired selects the columns with the two unpaired datasets, some of which may be NULL.

  • T Unpaired with Indicator selects the column of interest and a second indicator column which determines to which group the first variable belongs.

If the indicator variable is negative or zero, it will be assigned to the first group; if it is positive, it will be assigned to the second group.

The two sample t tests for unpaired data are defined as shown below:

  • H0: mu1 = mu2

  • H1: mu1 != mu2

  • Test Statistic: T = (Y1 - Y2) / sqrt(s1/N1 + s2/N2)
    where N1 and N2 are the sample sizes, Y1 and Y2 are the sample means, and s1 and s2 are sample variances.
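
As an illustration of the indicator grouping rule and the test statistic above, the following sketch computes T by hand in base R on hypothetical local vectors and compares it with base R's t.test(). With var.equal = FALSE, t.test() uses the same unequal-variance form of the statistic. This only illustrates the formula; it is not the database implementation.

```r
# Hypothetical values and indicator (not taken from any Teradata table).
value     <- c(5.1, 4.8, 6.0, 5.5, 7.2, 6.9, 7.8, 7.1, 6.5)
indicator <- c(0,   -1,  0,   0,   1,   2,   1,   3,   1)

# Indicator <= 0 goes to the first group, indicator > 0 to the second.
g1 <- value[indicator <= 0]
g2 <- value[indicator > 0]

# T = (Y1 - Y2) / sqrt(s1/N1 + s2/N2), with s1 and s2 the sample variances.
t_manual <- (mean(g1) - mean(g2)) /
  sqrt(var(g1) / length(g1) + var(g2) / length(g2))

# Base R's unequal-variance t test computes the same statistic.
t_base <- unname(t.test(g1, g2, var.equal = FALSE)$statistic)

print(c(manual = t_manual, base = t_base))
```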

F-Test - N-Way

  • F-Test/Analysis of Variance - One Way, Equal or Unequal Sample Size.

  • F-Test/Analysis of Variance - Two Way, Equal Sample Size.

  • F-Test/Analysis of Variance - Three Way, Equal Sample Size.

Use the ANOVA or F-test to determine if significant differences exist among treatment means or interactions. This preliminary test indicates if further analysis of the relationship among treatment means is warranted. If the null hypothesis of no difference among treatments is accepted, the test result implies factor levels and response are unrelated, so the analysis is terminated. When the null hypothesis is rejected, the analysis is usually continued to examine the nature of the factor-level effects. Examples are:

  • Tukey's Method - Tests all possible pairwise differences of means.

  • Scheffe's Method - Tests all possible contrasts at the same time.

  • Bonferroni's Method - Tests, or puts simultaneous confidence intervals around, a preselected group of contrasts.

Use the N-way F-Test to execute within groups defined by the distinct values of the group-by variables (GBVs), as with most of the nonparametric tests. Two or more treatments must exist in the data within the groups defined by the distinct GBV values.

Given a column of interest (dependent variable), one or more input columns (independent variables) and optionally one or more group-by columns (all from the same input), an F-Test is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the null hypothesis). Output is a p-value which, when compared to the user's threshold, determines whether the null hypothesis should be rejected.
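
The decision rule described above can be illustrated locally with base R's aov(). This is a sketch on a hypothetical data frame, with made-up column names and values, and is independent of the in-database implementation.

```r
# Hypothetical local data: one dependent column and one independent column.
df <- data.frame(
  income = c(30, 32, 31, 29, 45, 47, 46, 44, 60, 62, 61, 59),
  group  = factor(rep(c("a", "b", "c"), each = 4))
)

# One-way ANOVA: the null hypothesis is that all group means are equal.
fit <- aov(income ~ group, data = df)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Reject the null hypothesis when the p-value falls below the threshold.
alpha <- 0.01
reject_null <- p_value < alpha
print(p_value)
print(reject_null)
```

Here alpha plays the role of the "probability.threshold" argument.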

F-Test/Analysis of Variance - 2-Way Unequal Sample Size

Use the 2-way Unequal Sample Size F-Test to execute on the entire dataset. No group-by parameter is provided for this test, but if such a test is desired, multiple tests must be run on pre-prepared datasets with group-by variables in each as different constants. Two or more treatments must exist in the data within the dataset.
Note:

  • This test creates a temporary work table in the Result Database and drops it at the end of processing, even if the Output option to Store the tabular output of this analysis in the database is not selected.

Given the input of tabulated values, an F-Test is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the null hypothesis). Output is a p-value which, when compared to the user's threshold, determines whether the null hypothesis should be rejected.
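
Because no group-by parameter is available for this test, one way to run it per group is to filter the input once for each constant value of the grouping column. The following is a sketch only: the helper f2way_by_group is hypothetical, and it assumes that a tbl_teradata filtered with dplyr::filter() can be passed as "data", which should be verified in your environment.

```r
# Hypothetical helper: run the 2-way F-Test with Unequal Cell Counts once per
# value of a grouping column by filtering the input first.
f2way_by_group <- function(data, group_col, group_values, ...) {
  results <- list()
  for (v in group_values) {
    # Keep only rows where the grouping column equals the current constant.
    subset <- dplyr::filter(data, .data[[group_col]] == !!v)
    # Remaining arguments (dependent.column, first.column, ...) pass through.
    results[[as.character(v)]] <- td_parametric_test_valib(data = subset,
                                                           style = "f2way",
                                                           ...)
  }
  results
}

# Usage (requires a Vantage connection; column names and values are assumptions):
# res <- f2way_by_group(customer, "gender", c("M", "F"),
#                       dependent.column = "income",
#                       first.column = "years_with_bank",
#                       first.column.values = c(0, 1, 2, 3, 4, 5, 6, 7),
#                       second.column = "marital_status",
#                       second.column.values = c(1, 2, 3, 4))
# print(res[["M"]]$result)
```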

Examples


# Notes:
#   1. To execute Vantage Analytic Library functions, set option 
#      'val.install.location' to the database name where Vantage analytic 
#      library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic 
#      Library installer.
#   3. The Statistical Test metadata tables must be loaded into the database 
#      where Analytics Library is installed.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create required objects of class "tbl_teradata".
customer <- tbl(con, "customer")
print(customer)

cust <- tbl(con, "customer_analysis")
print(cust)

# Example 1: Perform a paired T-Test, grouped by age and gender.
obj <- td_parametric_test_valib(data=cust,
                                first.column="avg_cc_bal",
                                second.column="avg_sv_bal",
                                paired=TRUE,
                                equal.variance=TRUE,
                                group.columns=c("age", "gender"))

# Print the results.
print(obj$result)

# Example 2: Perform One way F-Test.
obj <- td_parametric_test_valib(data=customer,
                                style="fnway",
                                dependent.column="income",
                                columns="gender",
                                probability.threshold=0.01,
                                group.columns=c("years_with_bank", 
                                                "nbr_children"))

# Print the results.
print(obj$result)

# Example 3: Perform a 2-way F-Test with Unequal Cell Counts.
obj <- td_parametric_test_valib(data=customer,
                                style="f2way",
                                dependent.column="income",
                                first.column="years_with_bank",
                                first.column.values=c(0, 1, 2, 3, 4, 5, 6, 7),
                                second.column="marital_status",
                                second.column.values=c(1, 2, 3, 4),
                                probability.threshold=0.01)

# Print the results.
print(obj$result)