Teradata Package for R Function Reference | 17.00 - td_log_reg_valib

Logistic Regression

Description

Logistic Regression is one of the most widely used types of statistical analysis. In Logistic Regression, a set of independent variables (in this case, columns) is processed to predict the value of a dependent variable (column) that assumes two values, referred to as response (1) and non-response (0). The user can specify which value of the dependent variable to treat as the response; all other values assumed by the dependent variable are treated as non-response. The result is not, however, a continuous numeric variable as in Linear Regression, but rather a probability between 0 and 1 that the dependent variable assumes the response value.
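The following base R sketch illustrates how a logistic model turns a linear combination of independent variables into a probability between 0 and 1. The coefficients and customer values are hypothetical and chosen only for illustration; they are not output of td_log_reg_valib().

```r
# Hypothetical coefficients, for illustration only (not produced by td_log_reg_valib()).
intercept <- -3.0
b_income  <- 0.00005
b_age     <- 0.02

# Three made-up customers.
income <- c(20000, 60000, 120000)
age    <- c(25, 40, 55)

# Linear predictor: constant term plus weighted independent variables.
eta <- intercept + b_income * income + b_age * age

# The logistic function maps the linear predictor to a probability in (0, 1).
prob <- 1 / (1 + exp(-eta))
print(prob)
```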

Many types of analysis lend themselves to Logistic Regression and, when a model is scored, benefit from the estimation of a probability rather than a fixed value. For example, when predicting who should be targeted for a marketing campaign, the scored customers can be ordered by predicted probability from most to least likely, and the top n customers taken from the list.
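As a minimal sketch of the marketing example, the base R code below orders scored customers by predicted probability and keeps the top n. The data frame, its column names, and the probabilities are made up for illustration.

```r
# Made-up scored customers; 'prob' stands in for a predicted response probability.
scored <- data.frame(
  cust_id = c(101, 102, 103, 104, 105),
  prob    = c(0.12, 0.87, 0.45, 0.93, 0.30)
)

# Order from most to least likely to respond and keep the top n customers.
n <- 2
ranked <- scored[order(scored$prob, decreasing = TRUE), ]
top_n <- head(ranked, n)
print(top_n)
```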
Some of the key features of Logistic Regression are outlined below.

The Teradata table operator CALCMATRIX is used to build an ESSCP matrix for the purpose of validating the input data, for example by checking for constant values. To avoid rebuilding this matrix every time the algorithm is run, the user may run the Matrix Analysis separately, saving an ESSCP matrix in a tbl_teradata that can then be input to Logistic Regression. Refer to the "matrix.data" argument.
One or more group by columns can optionally be specified so that an input matrix is built for each combination of group by column values, and subsequently a separate Logistic Regression model is built for each matrix. To achieve this, the names of the group by columns are passed to CALCMATRIX as parameters, so it includes them as columns in the matrix data it creates. Refer to the "group.columns" argument.
The stepwise feature for Logistic Regression is a technique for selecting the independent variables in a logistic model. It consists of different methods of 'trying' variables and adding them to or removing them from a model through a series of forward and backward steps, described in the "stepwise" argument.
A Statistics report is available, displaying the mean and standard deviation of each model variable. Refer to the "stats.output" argument.
A Success report is available, displaying counts of predicted versus actual values of the dependent variable in the logistic model. Refer to the "success.output" argument.
A Multi-Threshold Success Table is available. Refer to the "threshold.output" argument.
A Lift Table, such as would be used to build a Lift Chart, is available. Refer to the "lift.output" argument.
A Near Dependency Report is available to identify two or more columns that may be collinear. Refer to the "near.dep.report" argument.
The algorithm is partially scalable because the size of each input matrix depends only on the number of independent variables (columns) and not on the size of the input data. The calculations performed on the client workstation however are not scalable when group by columns are used, because each model is built serially based on each matrix in the matrix data.

Usage

td_log_reg_valib(data, columns, response.column, ...)

Arguments

`data`	Required Argument. Specifies the input data to build a logistic regression model from. Types: tbl_teradata
`columns`	Required Argument. Specifies the name(s) of the column(s) representing the independent variables used in building a logistic regression model. It also accepts pre-defined strings that select all columns or all numeric columns. Permitted Values: name(s) of the column(s) in "data", or one of the pre-defined strings 'all' (all columns) or 'allnumeric' (all numeric columns). Types: character OR vector of Strings (character)
`response.column`	Required Argument. Specifies the name of the column that represents the dependent variable being predicted. Types: character
`...`	Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

The function returns an object of class "td_log_reg_valib", which is a named list containing objects of class "tbl_teradata".
Members of the named list can be referenced directly with the "$" operator using the following names:

model
statistical.measures
xml.reports

Other Arguments

backward

Optional Argument.
Specifies whether to take backward steps or not. Backward steps, i.e., removing variables from a model, use the P-value of the T-statistic, i.e., the ratio of a B-coefficient to its standard error. The variable (column) with the largest P-value is removed if the P-value exceeds the criterion to remove.
Types: logical

backward.only

Optional Argument.
Specifies whether to use only backward technique or not. This technique is similar to the backward technique, except that a forward step is not performed. It starts with all independent variables in the model. Backward steps are executed until no more are possible.
Types: logical

exclude.columns

Optional Argument.
Specifies the name(s) of the column(s) to exclude from the analysis when a column specifier such as 'all' or 'allnumeric' is used in the "columns" argument. When the "exclude.columns" argument is used, the dependent variable and the group by columns, if any, are automatically excluded as input columns and do not need to be listed in "exclude.columns".
Types: character OR vector of Strings (character)

cond.ind.threshold

Optional Argument.
Specifies the condition index threshold value to use when generating the Near Dependency Report. This is used when "near.dep.report" is set to TRUE.
Default Value: 30
Types: integer

constant

Optional Argument.
Specifies whether the logistic model includes a constant term or not. When set to TRUE, model includes a constant term.
Default Value: TRUE
Types: logical

convergence

Optional Argument.
Specifies the convergence criterion such that the algorithm stops iterating when the change in log likelihood function falls below this value.
Default Value: 0.001
Types: numeric

entrance.criterion

Optional Argument.
Specifies the criterion to enter a variable into the model. The W-statistic chi-square P-value must be less than this value for a variable to be added.
Default Value: 0.05
Types: numeric

forward

Optional Argument.
Specifies whether to use the forward technique or not. When set to TRUE, the model starts with no independent variables; a forward step is performed, adding the "best" choice, followed by a backward step, removing the worst choice. Refer to the "stepwise" argument for a description of the steps in this technique.
Types: logical

forward.only

Optional Argument.
Specifies whether to use only the forward technique or not. This technique is similar to the forward technique, except that a backward step is not performed.
Types: logical

group.columns

Optional Argument.
Specifies the name(s) of the column(s) dividing the input into partitions, one for each combination of values in the group by columns. For each partition or combination of values, a separate logistic model and XML report are built.
Types: character OR vector of Strings (character)
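The partitioning idea can be sketched locally in base R: split the rows by each combination of group by values and fit one model per partition. This only illustrates the concept with stats::glm() on made-up data; it is not how the function performs the work in the database.

```r
# Made-up data with two group by columns, each taking two values (4 combinations).
local_df <- data.frame(
  gender         = rep(c("F", "M"), each = 20),
  marital_status = rep(rep(c("1", "2"), each = 10), times = 2),
  x              = rep(1:10, times = 4),
  y              = rep(c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1), times = 4)
)

# One partition per combination of group by column values.
parts <- split(local_df, list(local_df$gender, local_df$marital_status), drop = TRUE)

# One logistic model per partition, built serially.
models <- lapply(parts, function(part) glm(y ~ x, data = part, family = binomial()))

# One row of coefficients per combination of group by values.
coefs <- t(sapply(models, coef))
print(coefs)
```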

lift.output

Optional Argument.
Specifies whether to build a lift table or not and add it to the function's output string. The lift table splits the computed probability values into deciles, with the usual counts and percentages, to show what happens as more and more rows of ordered probabilities are accumulated.
Types: logical
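A decile-based lift table can be sketched in base R as follows. The scores and outcomes are synthetic, and the column names of the resulting data frame are assumptions for illustration; the layout of the report produced by the function may differ.

```r
# Synthetic predicted probabilities and actual outcomes (deterministic).
n <- 100
prob <- seq(0.99, 0.01, length.out = n)
actual <- as.integer(prob > 0.7 | seq_len(n) %% 4 == 0)

# Order rows from highest to lowest probability and assign deciles 1..10.
ord <- order(prob, decreasing = TRUE)
decile <- ceiling(seq_len(n) * 10 / n)

# Counts and responses per decile.
count <- as.vector(tapply(actual[ord], decile, length))
resp  <- as.vector(tapply(actual[ord], decile, sum))

lift <- data.frame(
  decile           = 1:10,
  count            = count,
  responses        = resp,
  pct_response     = 100 * resp / count,
  cum_pct_captured = 100 * cumsum(resp) / sum(actual)
)

# Lift: response rate in the decile relative to the overall response rate.
lift$lift <- lift$pct_response / (100 * mean(actual))
print(lift)
```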

matrix.data

Optional Argument.
Specifies the input matrix data to use for the analysis. Instead of internally building a matrix each time this analysis is performed, the user may build an ESSCP matrix once with the Matrix Analysis, using td_matrix_valib(), and pass it here so that the matrix is read from this data rather than rebuilt. If this argument is specified, the columns given in the "columns" argument must all be present in the matrix; they may be a subset of the matrix columns and may be listed in any order. Further, if group by columns are specified in the matrix, the same group by columns must be specified in this analysis.
Types: tbl_teradata

max.iter

Optional Argument.
Specifies the maximum number of attempts to converge on a solution.
Default Value: 100
Types: integer
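To illustrate how a convergence criterion on the log likelihood and an iteration limit interact, here is a minimal Newton-Raphson sketch in base R. It is an assumption-laden illustration, not the algorithm used by the library; the function name fit_logistic and the sample data are hypothetical, and the arguments merely mirror the names "convergence" and "max.iter".

```r
# Hypothetical helper: Newton-Raphson for logistic regression with a constant term.
fit_logistic <- function(X, y, convergence = 0.001, max.iter = 100) {
  X <- cbind(constant = 1, X)
  beta <- rep(0, ncol(X))
  names(beta) <- colnames(X)

  # Bernoulli log likelihood for coefficient vector b.
  loglik <- function(b) {
    eta <- drop(X %*% b)
    sum(y * eta - log1p(exp(eta)))
  }

  ll_old <- loglik(beta)
  for (iter in seq_len(max.iter)) {
    p    <- plogis(drop(X %*% beta))
    w    <- p * (1 - p)
    grad <- crossprod(X, y - p)   # gradient of the log likelihood
    info <- crossprod(X, X * w)   # information matrix t(X) %*% diag(w) %*% X
    beta <- beta + drop(solve(info, grad))

    # Stop when the change in log likelihood falls below the criterion.
    ll_new <- loglik(beta)
    if (abs(ll_new - ll_old) < convergence) {
      return(list(coefficients = beta, iterations = iter, converged = TRUE))
    }
    ll_old <- ll_new
  }
  list(coefficients = beta, iterations = max.iter, converged = FALSE)
}

# Made-up, non-separable data.
X <- cbind(x = 1:20)
y <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1)
fit <- fit_logistic(X, y)
print(fit)
```

A smaller "convergence" value generally costs more iterations in exchange for coefficients closer to the maximum likelihood estimate, while "max.iter" bounds the work when the criterion is never met.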

mem.size

Optional Argument.
Specifies the memory size in megabytes to allocate for in-memory Logistic Regression. If the data does not fit in this amount of memory, or if this argument is set to 0 or not specified, normal SQL processing is performed.
Types: integer

near.dep.report

Optional Argument.
Specifies whether to produce an XML report showing columns that may be collinear as part of the output or not. The report is included in the XML output only if collinearity is detected.
Two threshold arguments are available for this report, "cond.ind.threshold" and "variance.prop.threshold".
Types: logical

remove.criterion

Optional Argument.
Specifies the criterion to remove a variable from the model. The T-Statistic P-value must be greater than this value for a variable to be removed.
Default Value: 0.05
Types: numeric

response.value

Optional Argument.
Specifies the value assumed by the dependent column that is to be treated as the response value.
Types: character
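Conceptually, choosing a response value recodes the dependent column into response (1) and non-response (0). A base R sketch with made-up values:

```r
# Made-up values of a dependent column that takes more than two values.
nbr_children <- c(0, 1, 2, 1, 0, 3, 1)

# Treat the value 1 as the response; every other value becomes non-response.
response.value <- 1
y <- as.integer(nbr_children == response.value)
print(y)
```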

sample

Optional Argument.
Specifies whether to read a sample of the data into memory for processing when the available memory size is less than the amount of data to process. When set to TRUE, a sample of the data is read.
Types: logical

stats.output

Optional Argument.
Specifies whether an optional data quality report should be delivered in the function's XML output string or not. The report includes the mean and standard deviation of each model variable, derived from an ESSCP matrix.
Default Value: FALSE
Types: logical
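The sketch below shows how means and standard deviations can be derived from a cross-product matrix alone. It assumes that an ESSCP (extended sum of squares and cross products) matrix is the cross-product of the data augmented with a column of ones, and it assumes the sample standard deviation; both are assumptions made for illustration, and the data is made up.

```r
# Made-up data for two model variables.
X <- cbind(age = c(25, 32, 47, 51, 38), income = c(30, 45, 80, 62, 55))

# Assumed ESSCP layout: cross-product of the data with an extra column of ones.
Z <- cbind(const = 1, X)
esscp <- crossprod(Z)   # t(Z) %*% Z

# The extra column yields the row count and the column sums.
n     <- esscp["const", "const"]
sums  <- esscp["const", colnames(X)]
sumsq <- diag(esscp[colnames(X), colnames(X), drop = FALSE])

# Mean and (sample) standard deviation without touching the raw rows again.
means <- sums / n
sds   <- sqrt((sumsq - n * means^2) / (n - 1))
print(rbind(mean = means, sd = sds))
```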

stepwise

Optional Argument.
Specifies whether to perform a stepwise procedure or not.
Forward steps, i.e., adding variables to a model, add the variable with the smallest chi-square P-value connected to its special W-statistic, provided the P-value is less than the criterion to enter.
Backward steps, i.e., removing variables from a model, use the P-value of the T-statistic, i.e., the ratio of a B-coefficient to its standard error. The variable (column) with the largest P-value is removed if the P-value exceeds the criterion to remove.
Default Value: FALSE
Types: logical
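The backward step can be sketched locally with stats::glm(). This is an illustration only: the helper name backward_only and the data are hypothetical, and the P-value used is the one glm() reports for the ratio of a coefficient to its standard error, which may not match the reference distribution used by the library.

```r
# Hypothetical helper: repeat backward steps until no variable can be removed.
backward_only <- function(data, response, vars, remove.criterion = 0.05) {
  while (length(vars) > 0) {
    fit   <- glm(reformulate(vars, response = response), data = data,
                 family = binomial())
    coefs <- summary(fit)$coefficients
    keep  <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, "Pr(>|z|)"], rownames(coefs)[keep])

    # Remove the variable with the largest P-value if it exceeds the criterion.
    worst <- which.max(pvals)
    if (pvals[worst] <= remove.criterion) break
    vars <- setdiff(vars, names(pvals)[worst])
  }
  vars
}

# Made-up data: y trends with x1; x2 and x3 are unrelated repeating patterns.
d <- data.frame(
  x1 = 1:40,
  x2 = rep(c(1, -1, 2, -2), times = 10),
  x3 = rep(c(3, 1, 2), length.out = 40),
  y  = c(rep(0, 12), 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, rep(1, 12))
)
selected <- backward_only(d, "y", c("x1", "x2", "x3"))
print(selected)
```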

success.output

Optional Argument.
Specifies whether an optional success report should be delivered in the function's XML output string or not. The report displays counts of predicted versus actual values of the dependent variable of the logistic regression model. It is similar to the Decision Tree Confusion Matrix, but the success report includes only two values of the dependent variable, namely response versus non-response.
Default Value: FALSE
Types: logical
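A table of predicted versus actual counts can be sketched in base R as follows. The outcomes and probabilities are made up, and the cut-off of 0.5 for calling a prediction a response is an assumption for illustration.

```r
# Made-up actual outcomes and predicted probabilities.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.91, 0.22, 0.63, 0.41, 0.72, 0.13, 0.85, 0.35)

# Assumed cut-off: a probability above 0.5 is predicted as a response.
predicted <- as.integer(prob > 0.5)

labels <- c("response", "non-response")
success <- table(
  actual    = factor(actual,    levels = c(1, 0), labels = labels),
  predicted = factor(predicted, levels = c(1, 0), labels = labels)
)
print(success)
```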

start.threshold

Optional Argument.
Specifies the beginning threshold value utilized in the Multi-Threshold Success output.
Types: numeric

end.threshold

Optional Argument.
Specifies the ending threshold value utilized in the Multi-Threshold Success output.
Types: numeric

increment.threshold

Optional Argument.
Specifies the difference in threshold values between adjacent rows in the Multi-Threshold Success output.
Types: numeric

threshold.output

Optional Argument.
Specifies whether the Multi-Threshold Success output should be produced or not and included in the XML output string in the result. This report can be thought of as a table where each row is a Prediction Success Table, and each row has a different threshold value as generated by the "start.threshold", "end.threshold", and "increment.threshold" arguments. What is meant by a threshold here is the value above which the predicted probability indicates a response.
Default Value: FALSE
Types: logical
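The idea of one success table per threshold can be sketched in base R. The data and the output column names are made up; the variables start.threshold, end.threshold, and increment.threshold simply mirror the argument names and are used here to generate the sequence of thresholds.

```r
# Made-up actual outcomes and predicted probabilities.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.91, 0.22, 0.63, 0.41, 0.72, 0.13, 0.85, 0.35)

start.threshold     <- 0.2
end.threshold       <- 0.8
increment.threshold <- 0.2
thresholds <- seq(start.threshold, end.threshold, by = increment.threshold)

# One row per threshold; a probability above the threshold indicates a response.
rows <- lapply(thresholds, function(th) {
  pred <- prob > th
  data.frame(
    threshold = th,
    true_pos  = sum(actual == 1 & pred),
    false_neg = sum(actual == 1 & !pred),
    false_pos = sum(actual == 0 & pred),
    true_neg  = sum(actual == 0 & !pred)
  )
})
multi <- do.call(rbind, rows)
print(multi)
```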

variance.prop.threshold

Optional Argument.
Specifies the variance proportion threshold value to use when generating the Near Dependency Report. This is used when "near.dep.report" is set to TRUE.
Default Value: 0.5
Types: numeric
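One common way to detect near dependencies combines condition indices with variance decomposition proportions computed from a singular value decomposition of the column-scaled design matrix. The base R sketch below follows that general approach on made-up data; whether the library computes its report exactly this way is an assumption, and the variables cond.ind.threshold and variance.prop.threshold only mirror the argument names and defaults.

```r
# Made-up design matrix in which x2 is nearly a copy of x1.
x1 <- 1:50
x2 <- x1 + 0.001 * sin(x1)
x3 <- cos(x1)
X  <- cbind(constant = 1, x1 = x1, x2 = x2, x3 = x3)

# Scale each column to unit length, then take the singular value decomposition.
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sv <- svd(Xs)

# Condition index of each component: largest singular value over its own.
cond_index <- max(sv$d) / sv$d

# Variance decomposition proportions: rows are components, columns are variables.
phi <- sweep(sv$v^2, 2, sv$d^2, "/")
var_prop <- t(phi / rowSums(phi))
colnames(var_prop) <- colnames(X)

cond.ind.threshold      <- 30
variance.prop.threshold <- 0.5

# Report variables involved in each component that exceeds the condition index threshold.
for (k in which(cond_index > cond.ind.threshold)) {
  involved <- colnames(var_prop)[var_prop[k, ] > variance.prop.threshold]
  cat("Condition index", round(cond_index[k], 1), ":",
      paste(involved, collapse = ", "), "\n")
}
```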

Examples


# Notes:
#   1. To execute Vantage Analytic Library functions, set option 'val.install.location' to
#      the database name where Vantage analytic library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)

# Example 1: Request the Near Dependency Report with its related threshold options.
obj <- td_log_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "income"),
                        response.column="nbr_children",
                        response.value=1,
                        near.dep.report=TRUE,
                        cond.ind.threshold=3,
                        variance.prop.threshold=0.3)

# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)

# Example 2: Request two group by columns. The output contains one row
#            for each combination of group by column values.
obj <- td_log_reg_valib(data=df, 
                        columns=c("age", "years_with_bank", "income"), 
                        response.column="nbr_children", 
                        group.columns=c("gender", "marital_status"))

# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)

# Example 3: Use a pre-built matrix to generate a logistic regression model.
# Generate the ESSCP matrix.
mat_obj <- td_matrix_valib(data=df,
                           columns=c("income", "age", "years_with_bank", "nbr_children"),
                           type="esscp")

# Print the results.
print(mat_obj$result)

# Use the generated matrix in building logistic regression model.
obj <- td_log_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "income"),
                        response.column="nbr_children",
                        response.value=1,
                        matrix.data=mat_obj$result)

# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)