| |
- LogReg(data, matrix_data=None, columns=None, response_column=None, backward=None, backward_only=None, exclude_columns=None, cond_ind_threshold=30, constant=True, convergence=0.001, entrance_criterion=0.05, forward=None, forward_only=None, group_columns=None, lift_output=None, max_iter=100, mem_size=None, near_dep_report=None, remove_criterion=0.05, response_value=None, sample=None, stats_output=False, stepwise=False, success_output=False, start_threshold=None, end_threshold=None, increment_threshold=None, threshold_output=False, variance_prop_threshold=0.5)
- DESCRIPTION:
Logistic Regression is one of the most widely used types of statistical analysis.
In Logistic Regression, a set of independent variables (in this case, columns) is
processed to predict the value of a dependent variable (column) that assumes two
values referred to as response (1) and non-response (0). The user can specify which
value of the dependent variable to treat as the response, and all other values
assumed by the dependent variable are treated as non-response. The result is not,
however, a continuous numeric variable as seen in Linear Regression, but rather a
probability between 0 and 1 that the response value is assumed by the dependent
variable.
There are many types of analysis that lend themselves to the use of Logistic Regression,
and when scoring a model, benefit from the estimation of a probability rather than
a fixed value. For example, when predicting who should be targeted for a marketing
campaign, the scored customers can be ordered by the predicted probability from most
to least likely, and the top n values taken from the customer list.
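The ranking step in the marketing example can be sketched in plain Python. This is an illustration only, not the function's implementation: the coefficients, the customer rows, and the helpers logistic(), score() and top_n() are all hypothetical.

```python
import math

def logistic(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def score(row, coefficients, constant):
    """Predicted probability of response for one row of a hypothetical model."""
    z = constant + sum(coefficients[name] * row[name] for name in coefficients)
    return logistic(z)

def top_n(rows, coefficients, constant, n):
    """Order rows from most to least likely to respond and keep the first n."""
    ranked = sorted(rows, key=lambda r: score(r, coefficients, constant), reverse=True)
    return ranked[:n]

# Hypothetical coefficients and customers, for illustration only.
coefficients = {"age": 0.02, "income": 0.00001}
constant = -2.0
customers = [
    {"id": 1, "age": 25, "income": 20000},
    {"id": 2, "age": 60, "income": 90000},
    {"id": 3, "age": 40, "income": 50000},
]
best = top_n(customers, coefficients, constant, n=2)
print([c["id"] for c in best])
```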
Some of the key features of Logistic Regression are outlined below.
* The Teradata table operator CALCMATRIX is used to build an ESSCP matrix for
purposes of validating the input data, such as by checking for constant values.
Also, to avoid rebuilding this matrix every time the algorithm is run, the user
may run the Matrix Analysis separately, saving an ESSCP matrix in a teradataml
DataFrame that can then be input to Logistic Regression.
Refer to the "matrix_data" argument.
* One or more group by columns can optionally be specified so that an input
matrix is built for each combination of group by column values, and subsequently
a separate Logistic Regression model is built for each matrix. To achieve
this, the names of the group by columns are passed to CALCMATRIX as parameters,
so it includes them as columns in the matrix data it creates.
Refer to the "group_columns" argument.
* The stepwise feature for Logistic Regression is a technique for selecting the
independent variables in a logistic model. It consists of different methods of
'trying' variables and adding or removing them from a model through a series of
forward and backward steps described in the parameter section.
* Statistics data is available, displaying the mean and standard deviation of
each model variable. Refer to the "stats_output" argument.
* Success data is available, displaying counts of predicted versus actual
values of the dependent variable in the logistic model.
Refer to the "success_output" argument.
* A Multi-Threshold Success Table is available. Refer to the "threshold_output"
argument.
* A Lift Table, such as would be used to build a Lift Chart, is available.
Refer to the "lift_output" argument.
* A Near Dependency Report is available to identify two or more columns that
may be collinear. Refer to the "near_dep_report" argument.
* The algorithm is partially scalable because the size of each input matrix
depends only on the number of independent variables (columns) and not on the size
of the input data. The calculations performed on the client workstation, however,
are not scalable when group by columns are used, because the models are built
serially, one for each matrix in the matrix data.
PARAMETERS:
data:
Required Argument.
Specifies the input data to build a logistic regression model from.
Types: teradataml DataFrame
columns:
Required Argument.
Specifies the name(s) of the column(s) representing the independent variables
used in building a logistic regression model. Alternatively, it accepts one of
the pre-defined strings below to select all columns or all numeric columns.
Permitted Values:
* Name(s) of the column(s) in "data".
* Pre-defined strings:
* 'all' - all columns
* 'allnumeric' - all numeric columns
Types: str OR list of Strings (str)
response_column:
Required Argument.
Specifies the name of the column that represents the dependent variable being
predicted.
Types: str
backward:
Optional Argument.
Specifies whether to take backward steps or not. Backward steps, i.e., removing
variables from a model, use the P-value of the T-statistic, i.e., the ratio of
a B-coefficient to its standard error. The variable (column) with the largest
P-value is removed if the P-value exceeds the criterion to remove.
Types: bool
backward_only:
Optional Argument.
Specifies whether to use only the backward technique or not. This technique is
similar to the backward technique, except that a forward step is not performed.
It starts with all independent variables in the model. Backward steps are executed
until no more are possible.
Types: bool
exclude_columns:
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the analysis, if a column
specifier such as 'all' or 'allnumeric' is used in the "columns" argument. When
the "exclude_columns" argument is used, the dependent variable and the group by
columns, if any, are automatically excluded as input columns and do not need to
be listed in "exclude_columns".
Types: str OR list of Strings (str)
cond_ind_threshold:
Optional Argument.
Specifies the condition index threshold value to use when generating the Near
Dependency Report. This is used when "near_dep_report" is set to True.
Default Value: 30
Types: int
constant:
Optional Argument.
Specifies whether the logistic model includes a constant term or not. When set
to True, the model includes a constant term.
Default Value: True
Types: bool
convergence:
Optional Argument.
Specifies the convergence criterion. The algorithm stops iterating when the
change in the log likelihood function falls below this value.
Default Value: 0.001
Types: float
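The stopping rule described by "convergence", together with the iteration cap set by "max_iter", can be sketched as follows. This is a minimal illustration on a constant-only model fitted with Newton-Raphson, not the library's algorithm; the helper names and the data are hypothetical, and the data is assumed to contain both responses and non-responses.

```python
import math

def log_likelihood(y, p):
    """Log likelihood of 0/1 outcomes y under a single response probability p."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi in y)

def fit_constant_only(y, convergence=0.001, max_iter=100):
    """Fit a constant-only logistic model; return (b0, iterations, converged)."""
    b0 = 0.0
    previous = log_likelihood(y, 0.5)  # b0 = 0 corresponds to p = 0.5
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + math.exp(-b0))
        gradient = sum(yi - p for yi in y)
        curvature = len(y) * p * (1.0 - p)
        b0 += gradient / curvature  # Newton-Raphson update
        p = 1.0 / (1.0 + math.exp(-b0))
        current = log_likelihood(y, p)
        # Stop once the change in log likelihood falls below the criterion.
        if abs(current - previous) < convergence:
            return b0, iteration, True
        previous = current
    # The iteration cap was reached without meeting the criterion.
    return b0, max_iter, False

# Hypothetical outcomes: 2 responses out of 10 rows.
outcomes = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
b0, iterations, converged = fit_constant_only(outcomes)
print(b0, iterations, converged)
```

A tighter "convergence" value generally costs more iterations, and a small "max_iter" can end the loop before the criterion is met.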
entrance_criterion:
Optional Argument.
Specifies the criterion to enter a variable into the model. The W-statistic
chi-square P-value must be less than this value for a variable to be added.
Default Value: 0.05
Types: float
forward:
Optional Argument.
Specifies whether to use the forward technique or not. When set to True, the
technique starts with no independent variables in the model and performs a
forward step, adding the "best" choice, followed by a backward step, removing
the worst choice. Refer to the "stepwise" argument for a description of the
steps in this technique.
Types: bool
forward_only:
Optional Argument.
Specifies whether to use only the forward technique or not. This technique is
similar to the forward technique, except that a backward step is not performed.
Types: bool
group_columns:
Optional Argument.
Specifies the name(s) of the column(s) dividing the input into partitions, one
for each combination of values in the group by columns. For each partition, that
is, each combination of values, a separate logistic model and XML report are built.
Types: str OR list of Strings (str)
lift_output:
Optional Argument.
Specifies whether to build a lift chart or not and add it to the function's output
string. It splits the computed probability values into deciles with the usual
counts and percentages to demonstrate what happens when more and more rows of
ordered probabilities are accumulated.
Types: bool
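The decile accumulation described for "lift_output" can be sketched in plain Python. The column names, the lift_table() helper, and the sample data are hypothetical; the actual report layout may differ.

```python
def lift_table(actual, probabilities, bins=10):
    """Split rows ordered by descending probability into bins and accumulate responses."""
    ranked = sorted(zip(probabilities, actual), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_responses = sum(a for _, a in ranked)
    rows = []
    cumulative = 0
    for b in range(bins):
        chunk = ranked[b * total // bins:(b + 1) * total // bins]
        responses = sum(a for _, a in chunk)
        cumulative += responses
        captured = 100.0 * cumulative / total_responses if total_responses else 0.0
        rows.append({
            "decile": b + 1,
            "count": len(chunk),
            "responses": responses,
            "cumulative_responses": cumulative,
            "cumulative_pct_captured": captured,
        })
    return rows

# Hypothetical scores for 20 rows; the 5 highest-scored rows are the responders.
probabilities = [i / 20 for i in range(20)]
actual = [1 if i >= 15 else 0 for i in range(20)]
for row in lift_table(actual, probabilities):
    print(row)
```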
matrix_data:
Optional Argument.
Specifies the input matrix data to use for the analysis. Instead of internally
building a matrix each time this analysis is performed, the user may build an
ESSCP matrix once using the Matrix() function. The matrix can subsequently be
read from this data instead of being rebuilt each time. If this is specified,
the columns specified with the "columns" argument should be a subset of the
columns in this matrix and can be specified in any order. The columns must,
however, all be present in the matrix. Further, if group by columns are
specified in the matrix, these same group by columns must be specified in this
analysis.
Types: teradataml DataFrame
max_iter:
Optional Argument.
Specifies the maximum number of attempts to converge on a solution.
Default Value: 100
Types: int
mem_size:
Optional Argument.
Specifies the memory size in megabytes to allocate for in-memory Logistic
Regression. Normal SQL processing is performed if the data does not fit in this
amount of memory, if the value is set to 0, or if the argument is not specified.
Types: int
near_dep_report:
Optional Argument.
Specifies whether to produce an XML report showing columns that may be
collinear as part of the output or not. The report is included in the XML
output only if collinearity is detected.
Two threshold arguments are available for this report, "cond_ind_threshold" and
"variance_prop_threshold".
Types: bool
remove_criterion:
Optional Argument.
Specifies the criterion to remove a variable from the model. The T-Statistic
P-value must be greater than this value for a variable to be removed.
Default Value: 0.05
Types: float
response_value:
Optional Argument.
Specifies the value assumed by the dependent column that is to be treated as
the response value.
Types: str
sample:
Optional Argument.
Specifies whether to read a sample of the data into memory for processing when
the memory size available is less than the amount of data to process. When set
to True, a sample of data is read.
Types: bool
stats_output:
Optional Argument.
Specifies whether an optional data quality report should be delivered in the
function's XML output string or not, which includes the mean and standard
deviation of each model variable, derived from an ESSCP matrix.
Default Value: False
Types: bool
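As a rough illustration of how a mean and standard deviation can be recovered from sums and sums of squares, the kind of information a sums-of-squares-and-cross-products matrix carries, consider the sketch below. It assumes the row count, the column sum, and the column sum of squares are already at hand, and it uses the sample (n - 1) standard deviation. The helper name and both assumptions are hypothetical and are not taken from the library.

```python
import math

def mean_and_std(count, total, total_of_squares):
    """Mean and sample standard deviation from a count, a sum and a sum of squares."""
    mean = total / count
    variance = (total_of_squares - count * mean * mean) / (count - 1)
    # Guard against tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))

# Hypothetical column values, summarized the way a matrix build might summarize them.
values = [2, 4, 4, 4, 5, 5, 7, 9]
count = len(values)
total = sum(values)
total_of_squares = sum(v * v for v in values)
mean, std = mean_and_std(count, total, total_of_squares)
print(mean, std)
```

A column whose standard deviation comes out as zero is constant, which is one of the input checks mentioned in the description.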
stepwise:
Optional Argument.
Specifies whether to perform a stepwise procedure or not.
Forward steps, i.e., adding variables to a model, add the variable with the
smallest chi-square P-value connected to its special W-statistic, provided the
P-value is less than the criterion to enter.
Backward steps, i.e., removing variables from a model, use the P-value of the
T-statistic, i.e., the ratio of a B-coefficient to its standard error. The
variable (column) with the largest P-value is removed if the P-value exceeds
the criterion to remove.
Default Value: False
Types: bool
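The two selection rules above can be sketched as small helpers. The P-values are hypothetical inputs; computing the W-statistic and T-statistic themselves is outside the scope of this sketch, and the function names are illustrative only.

```python
def forward_step(candidate_p_values, entrance_criterion=0.05):
    """Return the candidate with the smallest P-value if it is below the criterion."""
    if not candidate_p_values:
        return None
    name = min(candidate_p_values, key=candidate_p_values.get)
    return name if candidate_p_values[name] < entrance_criterion else None

def backward_step(model_p_values, remove_criterion=0.05):
    """Return the model variable with the largest P-value if it exceeds the criterion."""
    if not model_p_values:
        return None
    name = max(model_p_values, key=model_p_values.get)
    return name if model_p_values[name] > remove_criterion else None

# Hypothetical P-values for variables not yet in the model and already in it.
candidates = {"age": 0.20, "income": 0.01, "years_with_bank": 0.04}
in_model = {"gender": 0.30, "marital_status": 0.02}
print(forward_step(candidates))   # variable to add, if any
print(backward_step(in_model))    # variable to remove, if any
```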
success_output:
Optional Argument.
Specifies whether an optional success report should be delivered in the function's
XML output string or not. The report displays counts of predicted versus actual
values of the dependent variable of the logistic regression model. It is similar
to the Decision Tree Confusion Matrix, but the success report only includes two
values of the dependent variable, namely response versus non-response.
Default Value: False
Types: bool
start_threshold:
Optional Argument.
Specifies the beginning threshold value utilized in the Multi-Threshold Success
output.
Types: float, int
end_threshold:
Optional Argument.
Specifies the ending threshold value utilized in the Multi-Threshold Success output.
Types: float, int
increment_threshold:
Optional Argument.
Specifies the difference in threshold values between adjacent rows in the
Multi-Threshold Success output.
Types: float, int
threshold_output:
Optional Argument.
Specifies whether the Multi-Threshold Success output should be produced or not
and included in the XML output string in the result. This report can be thought
of as a table where each row is a Prediction Success Table, and each row has a
different threshold value as generated by the "start_threshold", "end_threshold",
and "increment_threshold" arguments. What is meant by a threshold here is the
value above which the predicted probability indicates a response.
Default Value: False
Types: bool
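The way the three threshold arguments generate one success row per threshold can be sketched as follows. The sketch assumes the end threshold is inclusive and that a probability strictly above the threshold counts as a predicted response; the helper names, column names, and data are hypothetical.

```python
def threshold_values(start, end, increment):
    """Threshold values from start to end (assumed inclusive), spaced by increment."""
    steps = int(round((end - start) / increment))
    return [round(start + i * increment, 10) for i in range(steps + 1)]

def success_row(actual, probabilities, threshold):
    """Counts of predicted versus actual response at a single threshold."""
    row = {"threshold": threshold, "true_pos": 0, "false_pos": 0,
           "false_neg": 0, "true_neg": 0}
    for a, p in zip(actual, probabilities):
        predicted = 1 if p > threshold else 0
        if a == 1 and predicted == 1:
            row["true_pos"] += 1
        elif a == 0 and predicted == 1:
            row["false_pos"] += 1
        elif a == 1 and predicted == 0:
            row["false_neg"] += 1
        else:
            row["true_neg"] += 1
    return row

# Hypothetical actual values and predicted probabilities.
actual = [1, 0, 1, 0]
probabilities = [0.9, 0.6, 0.4, 0.1]
table = [success_row(actual, probabilities, t)
         for t in threshold_values(0.25, 0.75, 0.25)]
for row in table:
    print(row)
```

Raising the threshold trades false positives for false negatives, which is what the multi-row report makes visible.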
variance_prop_threshold:
Optional Argument.
Specifies the variance proportion threshold value to use when generating the Near
Dependency Report. This is used when "near_dep_report" is set to True.
Default Value: 0.5
Types: float
RETURNS:
An instance of LogReg.
Output teradataml DataFrames can be accessed using attribute references, such as
LogRegObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. model
2. statistical_measures
3. xml_reports
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import the valib object and DataFrame from teradataml to execute this function.
from teradataml import DataFrame, valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
df = DataFrame("customer")
print(df)
# Example 1: Request the Near Dependency Report with related options.
obj = valib.LogReg(data=df,
columns=["age", "years_with_bank", "income"],
response_column="nbr_children",
response_value=1,
near_dep_report=True,
cond_ind_threshold=3,
variance_prop_threshold=0.3)
# Print the results.
print(obj.model)
print(obj.statistical_measures)
print(obj.xml_reports)
# Example 2: Shows that 2 group by columns are requested. The output contains 1 row
# for each combination of group by column values.
obj = valib.LogReg(data=df,
columns=["age", "years_with_bank", "income"],
response_column="nbr_children",
group_columns=["gender", "marital_status"])
# Print the results.
print(obj.model)
print(obj.statistical_measures)
print(obj.xml_reports)
# Example 3: Shows how a pre-built matrix can be used to build a logistic
# regression model.
# Generate the ESSCP matrix.
mat_obj = valib.Matrix(data=df,
columns=["income", "age", "years_with_bank", "nbr_children"],
type="esscp")
# Print the results.
print(mat_obj.result)
# Use the generated matrix in building logistic regression model.
obj = valib.LogReg(data=df,
columns=["age", "years_with_bank", "income"],
response_column="nbr_children",
response_value=1,
matrix_data=mat_obj.result)
# Print the results.
print(obj.model)
print(obj.statistical_measures)
print(obj.xml_reports)
|