Description
Logistic Regression is one of the most widely used types of statistical analysis.
In Logistic Regression, a set of independent variables (in this case, columns) is
processed to predict the value of a dependent variable (column) that assumes two
values referred to as response (1) and non-response (0). The user can specify which
value of the dependent variable to treat as the response, and all other values
assumed by the dependent variable are treated as non-response. The result is not,
however, a continuous numeric variable as seen in Linear Regression, but rather a
probability between 0 and 1 that the response value is assumed by the dependent
variable.
Many types of analysis lend themselves to Logistic Regression and, when a model is
scored, benefit from the estimation of a probability rather than a fixed value. For
example, when predicting who should be targeted for a marketing campaign, the scored
customers can be ordered by predicted probability from most to least likely, and the
top n customers taken from the list.
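The marketing example can be sketched in base R, independently of Vantage. The coefficients, customer values, and column names below are made up for illustration and do not come from a fitted model; the sketch only shows how a logistic model turns a linear score into a probability that can be ranked.

```r
# Hypothetical coefficients and customer data, made up for illustration only.
intercept <- -3.0
beta <- c(age = 0.02, income = 0.00004)

customers <- data.frame(
  cust_id = c(101, 102, 103, 104, 105),
  age     = c(25, 40, 58, 33, 47),
  income  = c(20000, 55000, 90000, 30000, 70000)
)

# Linear predictor, then the logistic function maps it into (0, 1).
eta <- intercept + as.matrix(customers[, names(beta)]) %*% beta
customers$probability <- as.vector(1 / (1 + exp(-eta)))

# Order from most to least likely and keep the top n customers.
n <- 2
ranked <- customers[order(customers$probability, decreasing = TRUE), ]
top_n <- head(ranked, n)
print(top_n)
```

Because the output is a probability rather than a hard 0/1 label, the cut-off n can be chosen after scoring, for example to match a campaign budget.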
Some of the key features of Logistic Regression are outlined below.
The Teradata table operator CALCMATRIX is used to build an ESSCP (Extended Sum of Squares and Cross Products) matrix for validating the input data, for example by checking for constant values. To avoid rebuilding this matrix every time the algorithm is run, the user may run the Matrix Analysis separately and save the ESSCP matrix in a tbl_teradata that can then be passed to Logistic Regression. Refer to the "matrix.data" argument.
One or more group by columns can optionally be specified so that an input matrix is built for each combination of group by column values, and a separate Logistic Regression model is then built for each matrix. To achieve this, the names of the group by columns are passed to CALCMATRIX as parameters, so that it includes them as columns in the matrix data it creates. Refer to the "group.columns" argument.
The stepwise feature for Logistic Regression is a technique for selecting the independent variables in a logistic model. It consists of different methods of 'trying' variables and adding them to or removing them from a model through a series of forward and backward steps, described in the 'Other Arguments' section.
A Statistics report is available, displaying the mean and standard deviation of each model variable. Refer to the "stats.output" argument.
A Success report is available, displaying counts of predicted versus actual values of the dependent variable in the logistic model. Refer to the "success.output" argument.
A Multi-Threshold Success Table is available. Refer to the "threshold.output" argument.
A Lift Table, such as would be used to build a Lift Chart, is available. Refer to the "lift.output" argument.
A Near Dependency Report is available to identify two or more columns that may be collinear. Refer to the "near.dep.report" argument.
The algorithm is partially scalable because the size of each input matrix depends only on the number of independent variables (columns) and not on the size of the input data. The calculations performed on the client workstation, however, are not scalable when group by columns are used, because each model is built serially from each matrix in the matrix data.
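The scalability point can be illustrated with a small base R sketch. It assumes that an ESSCP-style matrix is the cross-product of the input columns augmented with a constant column of ones; this is an approximation of the idea, not the CALCMATRIX operator itself, and the data and helper name are hypothetical.

```r
# Build a cross-product matrix of the columns plus a constant column of ones.
# ASSUMPTION: this mirrors the idea of an ESSCP matrix; it is not CALCMATRIX.
cross_products <- function(df) {
  X <- cbind(const = 1, as.matrix(df))
  crossprod(X)  # t(X) %*% X
}

small <- data.frame(age = c(30, 40, 50), income = c(10, 20, 60), flag = c(1, 1, 1))
large <- small[rep(1:3, times = 1000), ]

m_small <- cross_products(small)
m_large <- cross_products(large)

# The matrix has the same dimensions for 3 rows and for 3000 rows.
print(dim(m_small))
print(dim(m_large))

# Mean and variance of each column recovered from the matrix alone.
n <- m_small["const", "const"]
cols <- setdiff(colnames(m_small), "const")
means <- m_small["const", cols] / n
vars <- (diag(m_small)[cols] - n * means^2) / (n - 1)

# A column with zero variance is constant and can be caught during validation.
constant_cols <- cols[vars < 1e-12]
print(constant_cols)
```

The same matrix therefore supports both the data validation and the summary statistics without another pass over the input rows.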
Usage
td_log_reg_valib(data, columns, response.column, ...)
Arguments
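data
Required Argument.
Specifies the input data containing the independent and dependent columns.
Types: tbl_teradata
columns
Required Argument.
Specifies the name(s) of the column(s) to use as independent variables.
Types: character OR vector of Strings (character)
response.column
Required Argument.
Specifies the name of the column that represents the dependent variable.
Types: character
...
Specifies other arguments supported by the function as described in the 'Other Arguments' section.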
Value
Function returns an object of class "td_log_reg_valib"
which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the following names:
model
statistical.measures
xml.reports
Other Arguments
backward
Optional Argument.
Specifies whether to take backward steps or not. Backward steps,
i.e., removing variables from a model, use the P-value of the
T-statistic, i.e., the ratio of a B-coefficient to its standard
error. The variable (column) with the largest P-value is removed
if the P-value exceeds the criterion to remove.
Types: logical
backward.only
Optional Argument.
Specifies whether to use only backward technique or not. This
technique is similar to the backward technique, except that a
forward step is not performed. It starts with all independent
variables in the model. Backward steps are executed until no
more are possible.
Types: logical
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the
analysis when a column specifier such as 'all' or 'allnumeric'
is used in the "columns" argument. The dependent variable and
the group by columns, if any, are automatically excluded as
input columns and do not need to be listed in
"exclude.columns".
Types: character OR vector of Strings (character)
cond.ind.threshold
Optional Argument.
Specifies the condition index threshold value to use
when generating the Near Dependency Report. This is used
only when "near.dep.report" is set to TRUE.
Default Value: 30
Types: integer
constant
Optional Argument.
Specifies whether the logistic model includes a constant term
or not. When set to TRUE, model includes a constant term.
Default Value: TRUE
Types: logical
convergence
Optional Argument.
Specifies the convergence criterion such that the algorithm
stops iterating when the change in log likelihood function
falls below this value.
Default Value: 0.001
Types: numeric
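How a convergence criterion of this kind interacts with an iteration cap (see the "max.iter" argument) can be sketched in base R. The sketch assumes a Newton-Raphson fit on made-up data; the source does not state which optimizer the function uses, so treat the update rule as an assumption.

```r
# Deterministic toy data (made up): one predictor plus a constant term.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 1, 0, 1, 0, 1, 1)
X <- cbind(const = 1, x = x)

convergence <- 0.001  # stop when the log likelihood changes by less than this
max_iter <- 100       # cap on the number of iterations

log_lik <- function(beta) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

beta <- c(0, 0)
ll_old <- log_lik(beta)
converged <- FALSE
for (iter in seq_len(max_iter)) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))
  gradient <- t(X) %*% (y - p)
  information <- t(X) %*% (X * (p * (1 - p)))
  # Newton-Raphson update (ASSUMPTION: the optimizer is not named in the source).
  beta <- beta + as.vector(solve(information, gradient))
  ll_new <- log_lik(beta)
  if (abs(ll_new - ll_old) < convergence) {
    converged <- TRUE
    break
  }
  ll_old <- ll_new
}
print(list(iterations = iter, converged = converged, beta = beta))
```

A smaller criterion generally costs more iterations; the cap guarantees termination when the likelihood keeps changing.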
entrance.criterion
Optional Argument.
Specifies the criterion to enter a variable into the model.
The W-statistic chi-square P-value must be less than this
value for a variable to be added.
Default Value: 0.05
Types: numeric
forward
Optional Argument.
Specifies whether to use the forward technique or not. This technique
starts with no independent variables in the model. A forward step is
performed, adding the "best" choice, followed by a backward step,
removing the worst choice. Refer to the "stepwise" argument for a
description of the steps in this technique.
Types: logical
forward.only
Optional Argument.
Specifies whether to use only forward technique or not. This
technique is similar to the forward technique, except that a
backward step is not performed.
Types: logical
group.columns
Optional Argument.
Specifies the name(s) of the column(s) dividing the input into
partitions, one for each combination of values in the group by
columns. For each partition or combination of values a separate
logistic model and XML report is built.
Types: character OR vector of Strings (character)
lift.output
Optional Argument.
Specifies whether to build a lift table or not and add it to the
function's XML output string. The lift table splits the computed
probability values into deciles, with the usual counts and percentages,
to demonstrate what happens as more and more rows of ordered
probabilities are accumulated.
Types: logical
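The decile idea can be sketched in base R on made-up scored data. The column names in the resulting table are hypothetical and are not taken from the function's XML output; the sketch only shows the accumulation over ordered probabilities.

```r
# Made-up scored data: 20 rows with a predicted probability and actual outcome.
scored <- data.frame(
  probability = (20:1 - 0.5) / 20,
  actual      = c(1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Order by probability, highest first, and assign 10 equal-sized deciles.
scored <- scored[order(scored$probability, decreasing = TRUE), ]
scored$decile <- rep(1:10, each = nrow(scored) / 10)

counts <- as.vector(tapply(scored$actual, scored$decile, length))
responses <- as.vector(tapply(scored$actual, scored$decile, sum))
overall_rate <- mean(scored$actual)

# Hypothetical column names; cumulative figures show the effect of taking
# more and more of the ordered rows.
lift_table <- data.frame(
  decile           = 1:10,
  count            = counts,
  responses        = responses,
  cum_responses    = cumsum(responses),
  cum_pct_captured = 100 * cumsum(responses) / sum(responses),
  cum_lift         = (cumsum(responses) / cumsum(counts)) / overall_rate
)
print(lift_table)
```

A cumulative lift above 1 in the early deciles means the model concentrates responders at the top of the ordered list; by the last decile the lift returns to 1.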
matrix.data
Optional Argument.
Specifies the input matrix data to use for the analysis. Instead
of internally building a matrix with td_matrix_valib() each time
this analysis is performed, the user may build an ESSCP matrix
once with the Matrix Analysis using td_matrix_valib(). The matrix
can subsequently be read from this data instead of being rebuilt
each time. If this argument is specified, the columns specified with
the "columns" argument must all be present in the matrix; they may be
a subset of the matrix columns and can be specified in any order.
Further, if group by columns are specified in the matrix, the same
group by columns must be specified in this analysis.
Types: tbl_teradata
max.iter
Optional Argument.
Specifies the maximum number of attempts to converge on a solution.
Default Value: 100
Types: integer
mem.size
Optional Argument.
Specifies the memory size in megabytes to allocate for in-memory
Logistic Regression. If the data does not fit in this amount of
memory, or if this argument is set to 0 or not specified, normal SQL
processing is performed.
Types: integer
near.dep.report
Optional Argument.
Specifies whether to produce an XML report showing columns
that may be collinear as part of the output or not. The report
is included in the XML output only if collinearity is detected.
Two threshold arguments are available for this report,
"cond.ind.threshold" and "variance.prop.threshold".
Types: logical
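The two thresholds can be illustrated with a base R sketch of the classic condition index and variance-proportion diagnostics. This is an assumption about what the report measures; the source does not define the computation, and the data below are made up so that one column is nearly the sum of the other two.

```r
# Made-up data in which x3 is almost exactly x1 + x2.
k <- 1:20
X <- cbind(x1 = k, x2 = (k^2) %% 7, x3 = k + (k^2) %% 7 + 1e-3 * sin(k))

cond_ind_threshold <- 30
variance_prop_threshold <- 0.5

# Scale each column to unit length, then take the singular value decomposition.
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")
s <- svd(Xs)

# Condition index of each component: largest singular value over this one.
cond_idx <- max(s$d) / s$d

# Variance-decomposition proportions: rows are components, columns are variables.
phi <- sweep(s$v^2, 2, s$d^2, "/")
props <- t(phi / rowSums(phi))
colnames(props) <- colnames(X)

# Variables that load heavily on any component with a high condition index.
high <- which(cond_idx > cond_ind_threshold)
involved <- colSums(props[high, , drop = FALSE] > variance_prop_threshold) > 0
flagged <- colnames(props)[involved]
print(round(cond_idx, 1))
print(flagged)
```

Under this reading, a near dependency is reported only when a component exceeds the condition index threshold and two or more variables exceed the variance proportion threshold on it.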
remove.criterion
Optional Argument.
Specifies the criterion to remove a variable from the model.
The T-Statistic P-value must be greater than this value for a
variable to be removed.
Default Value: 0.05
Types: numeric
response.value
Optional Argument.
Specifies the value assumed by the dependent column that is to
be treated as the response value.
Types: character
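The effect of choosing a response value can be sketched in base R: the chosen value maps to response (1) and every other value maps to non-response (0). The data are made up and the recoding shown is an illustration of the rule stated in the Description, not the function's internal code.

```r
# Made-up values of a dependent column such as "nbr_children".
nbr_children <- c(0, 1, 2, 1, 0, 3)
response_value <- "1"

# Rows equal to the response value become 1; every other value becomes 0.
response <- as.integer(as.character(nbr_children) == response_value)
print(table(response))
```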
sample
Optional Argument.
Specifies whether to read a sample of the data into memory for
processing when the available memory size is less than the amount of
data to process. When set to TRUE, a sample of the data is read.
Types: logical
stats.output
Optional Argument.
Specifies whether an optional data quality report should be
delivered in the function's XML output string or not, which
includes the mean and standard deviation of each model variable,
derived from an ESSCP matrix.
Default Value: FALSE
Types: logical
stepwise
Optional Argument.
Specifies whether to perform a stepwise procedure or not.
Forward steps, i.e., adding variables to a model, add the variable
with the smallest chi-square P-value connected to its special
W-statistic, provided the P-value is less than the criterion to
enter.
Backward steps, i.e., removing variables from a model, use the
P-value of the T-statistic, i.e., the ratio of a B-coefficient to
its standard error. The variable (column) with the largest P-value
is removed if the P-value exceeds the criterion to remove.
Default Value: FALSE
Types: logical
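The backward step described above can be sketched with base R's glm() on simulated data. This is not the Vantage implementation: glm() labels the ratio of a coefficient to its standard error a z value rather than a T-statistic, and the helper name backward_step is hypothetical.

```r
set.seed(42)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Only x1 truly drives the response in this simulated data.
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * dat$x1))

remove_criterion <- 0.05

# One backward step: drop the variable with the largest P-value,
# but only if that P-value exceeds the criterion to remove.
backward_step <- function(vars, data, criterion) {
  fit <- glm(reformulate(vars, response = "y"), data = data, family = binomial())
  coefs <- summary(fit)$coefficients
  p_values <- setNames(coefs[vars, "Pr(>|z|)"], vars)
  worst <- names(which.max(p_values))
  if (p_values[[worst]] > criterion) setdiff(vars, worst) else vars
}

# Repeat backward steps until no more are possible.
vars <- c("x1", "x2", "x3")
repeat {
  kept <- backward_step(vars, dat, remove_criterion)
  done <- identical(kept, vars) || length(kept) == 0
  vars <- kept
  if (done) break
}
print(vars)
```

A forward step would work in the opposite direction, testing each variable not yet in the model and adding the one with the smallest P-value if it falls below the criterion to enter.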
success.output
Optional Argument.
Specifies whether an optional success report should be delivered
in the function's XML output string or not. The report displays
counts of predicted versus actual values of the dependent variable
of the logistic regression model. It is similar to the Decision Tree
Confusion Matrix, but it includes only two values of the dependent
variable, namely response and non-response.
Default Value: FALSE
Types: logical
start.threshold
Optional Argument.
Specifies the beginning threshold value utilized in the
Multi-Threshold Success output.
Types: numeric
end.threshold
Optional Argument.
Specifies the ending threshold value utilized in the
Multi-Threshold Success output.
Types: numeric
increment.threshold
Optional Argument.
Specifies the difference in threshold values between
adjacent rows in the Multi-Threshold Success output.
Types: numeric
threshold.output
Optional Argument.
Specifies whether the Multi-Threshold Success output should
be produced or not and included in the XML output string in
the result. This report can be thought of as a table where
each row is a Prediction Success Table, and each row has a
different threshold value as generated by the "start.threshold",
"end.threshold", and "increment.threshold" arguments. What
is meant by a threshold here is the value above which the
predicted probability indicates a response.
Default Value: FALSE
Types: logical
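The structure of this output can be sketched in base R on made-up scored data: the three threshold arguments generate a sequence, and each threshold yields one row of predicted-versus-actual counts. The column names are hypothetical and are not taken from the XML output.

```r
# Made-up scored data: predicted probability and actual outcome (1 = response).
probability <- c(0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05)
actual      <- c(1,    1,    0,    1,    0,    1,    0,    0)

start_threshold <- 0.25
end_threshold <- 0.75
increment_threshold <- 0.25

thresholds <- seq(start_threshold, end_threshold, by = increment_threshold)

# One row per threshold; a probability above the threshold predicts a response.
success <- do.call(rbind, lapply(thresholds, function(thr) {
  predicted <- as.integer(probability > thr)
  data.frame(
    threshold                   = thr,
    actual_resp_pred_resp       = sum(actual == 1 & predicted == 1),
    actual_resp_pred_nonresp    = sum(actual == 1 & predicted == 0),
    actual_nonresp_pred_resp    = sum(actual == 0 & predicted == 1),
    actual_nonresp_pred_nonresp = sum(actual == 0 & predicted == 0)
  )
}))
print(success)
```

Raising the threshold trades correctly predicted responses for fewer false alarms, which is what the row-by-row comparison is meant to expose.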
variance.prop.threshold
Optional Argument.
Specifies the variance proportion threshold value to
use when generating the Near Dependency Report. This is
used only when "near.dep.report" is set to TRUE.
Default Value: 0.5
Types: numeric
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option 'val.install.location' to
# the database name where Vantage analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)
# Example 1: Shows how the Near Dependency Report is requested with related options.
obj <- td_log_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "income"),
                        response.column="nbr_children",
                        response.value=1,
                        near.dep.report=TRUE,
                        cond.ind.threshold=3,
                        variance.prop.threshold=0.3)
# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)
# Example 2: Shows that 2 group by columns are requested. The output contains 1 row
# for each combination of group by column values.
obj <- td_log_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "income"),
                        response.column="nbr_children",
                        group.columns=c("gender", "marital_status"))
# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)
# Example 3: Shows how a pre-built matrix can be used for generating logistic regression model.
# Generate the ESSCP matrix.
mat_obj <- td_matrix_valib(data=df,
                           columns=c("income", "age", "years_with_bank", "nbr_children"),
                           type="esscp")
# Print the results.
print(mat_obj$result)
# Use the generated matrix in building logistic regression model.
obj <- td_log_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "income"),
                        response.column="nbr_children",
                        response.value=1,
                        matrix.data=mat_obj$result)
# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)