Teradata Package for R Function Reference | 17.00 - td_lin_reg_valib - Teradata Package for R

Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Release Date: July 2021
Content Type: Programming Reference
Publication ID: B700-4007-090K
Language: English (United States)

Description

Linear regression is one of the fundamental predictive modeling algorithms. It expresses a dependent numeric variable as the sum of one or more independent numeric variables, each multiplied by a numeric coefficient, usually with a constant term added. The coefficients of the independent variables, together with the constant term, make up the linear regression model. Applying these coefficients to the variables (columns) of each observation (row) in a data set is known as scoring, as described in Linear Regression Scoring.
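
To make the relationship between a model and scoring concrete, here is a minimal sketch in base R. It does not use td_lin_reg_valib or a Vantage connection; the toy data frame and its values are invented for illustration, and stats::lm stands in for the model-building step.

```r
# Toy data; the column names echo the examples later on this page, but the
# values are made up for illustration.
toy <- data.frame(age             = c(25, 32, 47, 51, 38, 29),
                  years_with_bank = c(1, 4, 10, 12, 6, 2))
toy$income <- 1000 + 300 * toy$age + 500 * toy$years_with_bank +
  c(50, -30, 20, -40, 10, -10)

# The "model" is just the constant term plus one coefficient per variable.
fit <- lm(income ~ age + years_with_bank, data = toy)
b <- coef(fit)
print(b)

# Scoring: apply the coefficients to the columns of each row.
scored <- b[["(Intercept)"]] +
  b[["age"]] * toy$age +
  b[["years_with_bank"]] * toy$years_with_bank
print(scored)
```

The hand-computed scores match what predict(fit) returns for the same rows.
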
Some of the key features of Linear Regression are outlined below.

  1. The Teradata table operator CALCMATRIX is used to build an object of class "tbl_teradata" that represents an extended cross-products matrix, which is the input to the algorithm.

  2. One or more group by columns may optionally be specified so that a separate input matrix is built for each combination of group by values, and a separate linear model is built for each matrix. To achieve this, the names of the group by columns are passed to CALCMATRIX as parameters, so it includes them as columns in the matrix output it creates.

  3. The stepwise feature for Linear Regression is a technique for selecting the independent variables in a linear model. It offers several methods that try a variable and then add it to or remove it from the model by checking either a partial F-Statistic or the P-value of a T-Statistic, at the user's choice.

  4. The algorithm is partially scalable because the size of each input matrix depends only on the number of independent variables (columns) and not on the size of the input tbl_teradata. The calculations performed on the client workstation, however, are not scalable when group by columns are used, because the models are built serially, one for each matrix in the matrix tbl_teradata.
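
The scalability point above can be illustrated with a short base R sketch. It assumes nothing about the actual layout of the CALCMATRIX output, which this page does not show. It only demonstrates the general idea that a cross-products matrix over the constant, the independent variables, and the dependent variable is enough to solve for the coefficients, and that its size does not grow with the number of rows. All names and data are hypothetical.

```r
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

# Cross-products matrix over (constant, x1, x2, y).
Z <- cbind(const = 1, x1 = x1, x2 = x2, y = y)
S <- crossprod(Z)
print(dim(S))  # 4 x 4, whatever the value of n

# Solve the normal equations using only entries of S.
pred <- c("const", "x1", "x2")
beta <- solve(S[pred, pred], S[pred, "y"])
print(beta)
```

The coefficients obtained this way agree with those from lm(y ~ x1 + x2) on the raw rows.
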

Usage

td_lin_reg_valib(data, columns, response.column, ...)

Arguments

data

Required Argument.
Specifies the input data to build a linear regression model from. Types: tbl_teradata

columns

Required Argument.
Specifies the name(s) of the column(s) representing the independent variables used in building a linear regression model. Alternatively, one of the pre-defined strings listed below can be passed to select all columns or all numeric columns.
Permitted Values:

  1. Name(s) of the column(s) in "data".

  2. Pre-defined strings:

    1. 'all' - all columns

    2. 'allnumeric' - all numeric columns

Types: character OR vector of Strings (character)

response.column

Required Argument.
Specifies the name of the column that represents the dependent variable being predicted.
Types: character

...

Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

Function returns an object of class "td_lin_reg_valib" which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using names:

  1. model

  2. statistical.measures

  3. xml.reports

Other Arguments

backward

Optional Argument.
Specifies whether to use the backward technique or not. When set to TRUE, starting with all independent variables in the model, one backward step is followed by one forward step until no variables can be removed. A backward step consists of computing the Partial F-Statistic for each variable and removing the variable with the smallest value if that value is less than the criterion to remove. The P-value of the T-Statistic may be used instead of the Partial F-Statistic. In that case, the P-values for all variables in the model can be calculated at once, and the variable with the largest P-value is removed if that value is greater than the criterion to remove.
Types: logical
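
A single backward step, as described above, can be sketched in base R with stats::drop1. This is an illustration on made-up data, not the function's internal implementation. The variable names and the use of 3.84 as the criterion to remove are assumptions chosen to mirror the documented default.

```r
set.seed(7)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1 * d$x2 + rnorm(n)

# Start with all independent variables in the model.
full <- lm(y ~ x1 + x2 + noise, data = d)

# Partial F-Statistic for each variable (first row is the "<none>" baseline).
tab    <- drop1(full, test = "F")
f_stat <- setNames(tab[["F value"]][-1], rownames(tab)[-1])
worst  <- names(which.min(f_stat))

# Remove the weakest variable only if it falls below the criterion to remove.
remove_criterion <- 3.84
removed <- f_stat[[worst]] < remove_criterion
print(f_stat)
print(worst)
print(removed)

# P-value alternative: the largest T-Statistic P-value picks the same variable,
# because each partial F-Statistic is the square of the matching T-Statistic.
coefs      <- summary(full)$coefficients
p_val      <- coefs[-1, "Pr(>|t|)"]
worst_by_p <- names(which.max(p_val))
print(worst_by_p)
```
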

backward.only

Optional Argument.
Specifies whether to use only backward technique or not. This technique is similar to the backward technique, except that a forward step is not performed. It starts with all independent variables in the model. Backward steps are executed until no more are possible.
Types: logical

exclude.columns

Optional Argument.
Specifies the name(s) of the column(s) to exclude from the analysis when a column specifier such as 'all' or 'allnumeric' is used in the "columns" argument. By default, when "exclude.columns" is used, the dependent variable and the group by columns, if any, are automatically excluded as input columns and do not need to be listed in "exclude.columns".
Types: character OR vector of Strings (character)

cond.ind.threshold

Optional Argument.
Specifies the condition index threshold value to use when generating the near dependency report. This is used when "near.dep.report" is set to TRUE.
Default Value: 30
Types: integer

constant

Optional Argument.
Specifies whether the linear model includes a constant term or not. When set to TRUE, model includes a constant term.
Default Value: TRUE
Types: logical

entrance.criterion

Optional Argument.
Specifies the criterion to enter a variable into the model. The Partial F-Statistic must be greater than this value, or the T-Statistic P-value must be less than this value, depending on the value passed to "use.fstat" or "use.pvalue" arguments.
Default Value: 3.84 if "use.fstat" is TRUE and 0.05 if "use.pvalue" is TRUE.
Types: numeric
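
The two default values are closely related, which the following base R check illustrates. This page does not state why 3.84 was chosen, so treat the connection as an observation rather than documented rationale: for a single variable and a large number of residual degrees of freedom, an F-Statistic of about 3.84 corresponds to a P-value of about 0.05.

```r
# 95th percentile of F(1, df2) as df2 grows equals that of chi-square(1).
f_crit_large_sample <- qchisq(0.95, df = 1)
print(f_crit_large_sample)

# P-value that corresponds to an F-Statistic of 3.84 in the same limit.
p_at_default <- pchisq(3.84, df = 1, lower.tail = FALSE)
print(p_at_default)

# With few residual degrees of freedom the 5% critical value is larger,
# so 3.84 is then a somewhat more lenient cutoff than a P-value of 0.05.
f_crit_30 <- qf(0.95, df1 = 1, df2 = 30)
print(f_crit_30)
```
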

forward

Optional Argument.
Specifies whether to use forward technique or not. When set to TRUE, starting with no independent variables in the model, a forward step is performed, adding the best choice in explaining the dependent variable's variance, followed by a backward step, removing the worst choice. A forward step is made by determining the largest partial F-Statistic and adding the corresponding variable to the model, provided the statistic is greater than the criterion to enter (see the "entrance.criterion").
An alternative is to use the P-value of the T-Statistic (the ratio of a variable's B coefficient to its Standard Error). When the P-value is used, a forward step determines the variable with the smallest P-value and adds that variable if the P-value is less than the criterion to enter. (If more than one variable has a P-value of zero, the F-Statistic is used instead.)
Types: logical
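
A single forward step can be sketched in base R with stats::add1. As with any local illustration, this is not the function's internal code. The data, the candidate variable names, and the 3.84 criterion to enter are assumptions made for the example.

```r
set.seed(7)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1 * d$x2 + rnorm(n)

# Start with no independent variables in the model.
current <- lm(y ~ 1, data = d)

# Partial F-Statistic for each candidate (first row is the "<none>" baseline).
tab    <- add1(current, scope = ~ x1 + x2 + noise, test = "F")
f_stat <- setNames(tab[["F value"]][-1], rownames(tab)[-1])
best   <- names(which.max(f_stat))

# Add the best candidate only if it exceeds the criterion to enter.
entrance_criterion <- 3.84
if (f_stat[[best]] > entrance_criterion) {
  current <- update(current, as.formula(paste(". ~ . +", best)))
}
print(f_stat)
print(formula(current))
```

Repeating this step, optionally alternating with a backward step, gives the forward technique described above.
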

forward.only

Optional Argument.
Specifies whether to use only forward technique or not. This technique is similar to the forward technique, except that a backward step is not performed.
Types: logical

group.columns

Optional Argument.
Specifies the name(s) of the column(s) dividing the input into partitions, one for each combination of values in the group by columns. For each partition or combination of values a separate linear model is built.
Types: character OR vector of Strings (character)
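
The effect of group by columns can be mimicked locally in base R. The sketch below is only an analogy on invented data: it splits a data frame by every combination of two grouping columns and fits one model per partition, one after another.

```r
set.seed(11)
toy <- expand.grid(gender = c("F", "M"),
                   marital_status = 1:4,
                   rep = 1:10)
toy$age    <- runif(nrow(toy), 20, 70)
toy$income <- 1000 + 300 * toy$age + rnorm(nrow(toy), sd = 500)

# One partition per combination of values in the group by columns.
parts <- split(toy, list(toy$gender, toy$marital_status), drop = TRUE)

# One model per partition, built serially.
models <- lapply(parts, function(part) coef(lm(income ~ age, data = part)))
print(length(models))
print(models[[1]])
```

With 2 values of gender and 4 values of marital_status, 2 x 4 = 8 models are built.
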

matrix.input

Optional Argument.
Specifies whether the input tbl_teradata passed to the "data" argument represents an ESSCP matrix built by the Matrix Building function. Refer to the td_matrix_valib function for more details.
When set to TRUE, the input passed to the "data" argument is treated as an ESSCP matrix built by the Matrix Building function. This avoids building a matrix internally each time the function is run, which provides a significant performance improvement. The columns specified with the "columns" argument may be a subset of the columns in this matrix and may be specified in any order. The columns must, however, all be present in the matrix. Further, if group by columns are specified in the matrix, the same group by columns must be specified in this function.
Note:

  • If the input represents a saved matrix, make sure to set this argument to TRUE because results can otherwise be unpredictable.

Default Value: FALSE
Types: logical

near.dep.report

Optional Argument.
Specifies whether to produce an XML report showing columns that may be collinear as part of the output or not. The report is included in the XML output only if collinearity is detected.
Two threshold arguments are available for this report, "cond.ind.threshold" and "variance.prop.threshold".
Types: logical
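
This page does not describe how the near dependency report is computed. The two thresholds, however, match the quantities used in the standard Belsley, Kuh, and Welsch collinearity diagnostics. The base R sketch below computes those diagnostics on made-up data, on the assumption that the report is based on something similar. The cutoffs 30 and 0.5 mirror the documented defaults.

```r
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.001)  # nearly collinear with x1
x3 <- rnorm(n)
X  <- cbind(const = 1, x1, x2, x3)

# Scale each column to unit length, then take the singular value decomposition.
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sv <- svd(Xs)

# Condition index of each component: largest singular value over its own.
cond_index <- max(sv$d) / sv$d

# Variance proportions: share of each coefficient's variance per component.
phi      <- sweep(sv$v^2, 2, sv$d^2, "/")
var_prop <- t(phi / rowSums(phi))  # rows: components, columns: variables
colnames(var_prop) <- colnames(X)

# Flag variables with a high variance proportion in a high-index component.
k       <- which.max(cond_index)
flagged <- colnames(var_prop)[cond_index[k] > 30 & var_prop[k, ] > 0.5]
print(round(cond_index, 1))
print(flagged)
```
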

remove.criterion

Optional Argument.
Specifies the criterion to remove a variable from the model. The Partial F-Statistic must be less than this value, or the T-Statistic P-value must be greater than this value, depending on the value passed to "use.fstat" or "use.pvalue" arguments.
Default Value: 3.84 if "use.fstat" is TRUE and 0.05 if "use.pvalue" is TRUE.
Types: numeric

stats.output

Optional Argument.
Specifies whether to produce an additional data quality report which includes the mean and standard deviation of each model variable, derived from an ESSCP matrix. The report is included in the XML output.
Types: logical
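
The mean and standard deviation of a variable can be recovered from cross-products alone, without revisiting the rows. The base R sketch below shows the arithmetic on invented data. It assumes only that the matrix holds the row count, the sum, and the sum of squares of the variable. The exact ESSCP layout is not shown on this page.

```r
set.seed(5)
x <- rnorm(500, mean = 10, sd = 2)

# Cross-products of (constant, x): holds n, sum(x), and sum(x^2).
S <- crossprod(cbind(const = 1, x = x))
n      <- S["const", "const"]
sum_x  <- S["const", "x"]
sum_x2 <- S["x", "x"]

# Mean and sample standard deviation derived from those three numbers.
m <- sum_x / n
s <- sqrt((sum_x2 - n * m^2) / (n - 1))
print(c(mean = m, sd = s))
```
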

stepwise

Optional Argument.
Specifies whether to perform a stepwise procedure or not.
Default Value: FALSE
Types: logical

use.fstat

Optional Argument.
Specifies whether to use the partial F-Statistic in assessing whether a variable should be added or removed.
Default Value: TRUE
Types: logical

use.pvalue

Optional Argument.
Specifies whether to use the T-Statistic P-value in assessing whether a variable should be added or removed.
Default Value: FALSE
Types: logical

variance.prop.threshold

Optional Argument.
Specifies the variance proportion threshold value to use when generating the near dependency report. This is used when "near.dep.report" is set to TRUE.
Default Value: 0.5
Types: numeric

Examples

# Notes:
#   1. To execute Vantage Analytic Library functions, set option 'val.install.location' to
#      the database name where Vantage analytic library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)

# Example 1: Shows how input columns 'age', 'years_with_bank', and 'nbr_children' are
#            used to predict 'income'.
obj <- td_lin_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "nbr_children"),
                        response.column="income")

# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)

# Example 2: Shows how group by columns 'gender' and 'marital_status' result in 2x4=8 models
#            being built.
obj <- td_lin_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "nbr_children"),
                        response.column="income",
                        group.columns=c("gender", "marital_status"))

# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)
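
The following sketch shows how the stepwise arguments from the 'Other Arguments' section might be combined in a call. Every argument name comes from this page, but the particular combination is an illustrative assumption and has not been verified against a Vantage system. The call is guarded so that it runs only when td_lin_reg_valib is available and 'df' is the tbl_teradata created above.

```r
# Argument names are from the 'Other Arguments' section; the combination
# below is an assumption for illustration, not a documented recipe.
stepwise_args <- list(
  columns            = c("age", "years_with_bank", "nbr_children"),
  response.column    = "income",
  stepwise           = TRUE,
  forward            = TRUE,
  use.fstat          = FALSE,
  use.pvalue         = TRUE,
  entrance.criterion = 0.05,
  remove.criterion   = 0.05
)

# Run only when the function is loaded and 'df' is a tbl_teradata.
if (exists("td_lin_reg_valib") && inherits(df, "tbl_teradata")) {
  obj <- do.call(td_lin_reg_valib, c(list(data = df), stepwise_args))
  print(obj$model)
  print(obj$statistical.measures)
}
```
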