Teradata Package for Python Function Reference - LogReg - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.

LogReg

Functions
LogReg(data, matrix_data=None, columns=None, response_column=None, backward=None,
       backward_only=None, exclude_columns=None, cond_ind_threshold=30, constant=True,
       convergence=0.001, entrance_criterion=0.05, forward=None, forward_only=None,
       group_columns=None, lift_output=None, max_iter=100, mem_size=None,
       near_dep_report=None, remove_criterion=0.05, response_value=None, sample=None,
       stats_output=False, stepwise=False, success_output=False, start_threshold=None,
       end_threshold=None, increment_threshold=None, threshold_output=False,
       variance_prop_threshold=0.5)

    DESCRIPTION:
        Logistic Regression is one of the most widely used types of statistical analysis.
        In Logistic Regression, a set of independent variables (in this case, columns) is
        processed to predict the value of a dependent variable (column) that assumes two
        values referred to as response (1) and non-response (0). The user can specify
        which value of the dependent variable to treat as the response, and all other
        values assumed by the dependent variable are treated as non-response. The result
        is not, however, a continuous numeric variable as seen in Linear Regression, but
        rather a probability between 0 and 1 that the response value is assumed by the
        dependent variable.

        Many types of analysis lend themselves to Logistic Regression and, when scoring a
        model, benefit from the estimation of a probability rather than a fixed value.
        For example, when predicting who should be targeted for a marketing campaign, the
        scored customers can be ordered by the predicted probability from most to least
        likely, and the top n values taken from the customer list.

        Some of the key features of Logistic Regression are outlined below.
        * The Teradata table operator CALCMATRIX is used to build an ESSCP matrix for
          purposes of validating the input data, such as by checking for constant values.
          Also, to avoid rebuilding this matrix every time the algorithm is run, the user
          may run the Matrix Analysis separately, saving an ESSCP matrix in a teradataml
          DataFrame that can then be input to Logistic Regression. Refer to the
          "matrix_data" argument.
        * One or more group by columns can optionally be specified so that an input
          matrix is built for each combination of group by column values, and
          subsequently a separate Logistic Regression model is built for each matrix.
          To achieve this, the names of the group by columns are passed to CALCMATRIX as
          parameters, so it includes them as columns in the matrix data it creates.
          Refer to the "group_columns" argument.
        * The stepwise feature for Logistic Regression is a technique for selecting the
          independent variables in a logistic model. It consists of different methods of
          'trying' variables and adding or removing them from a model through a series of
          forward and backward steps described in the parameter section.
        * A Statistics report is available, displaying the mean and standard deviation of
          each model variable. Refer to the "stats_output" argument.
        * A Success report is available, displaying counts of predicted versus actual
          values of the dependent variable in the logistic model. Refer to the
          "success_output" argument.
        * A Multi-Threshold Success Table is available. Refer to the "threshold_output"
          argument.
        * A Lift Table, such as would be used to build a Lift Chart, is available. Refer
          to the "lift_output" argument.
        * A Near Dependency Report is available to identify two or more columns that may
          be collinear. Refer to the "near_dep_report" argument.
        * The algorithm is partially scalable because the size of each input matrix
          depends only on the number of independent variables (columns) and not on the
          size of the input data. The calculations performed on the client workstation,
          however, are not scalable when group by columns are used, because each model is
          built serially based on each matrix in the matrix data.

    PARAMETERS:
        data:
            Required Argument.
            Specifies the input data to build a logistic regression model from.
            Types: teradataml DataFrame

        columns:
            Required Argument.
            Specifies the name(s) of the column(s) representing the independent variables
            used in building a logistic regression model. It can also accept a
            pre-defined string that selects all columns or all numeric columns.
            Permitted Values:
                * Name(s) of the column(s) in "data".
                * Pre-defined strings:
                    * 'all' - all columns
                    * 'allnumeric' - all numeric columns
            Types: str OR list of Strings (str)

        response_column:
            Required Argument.
            Specifies the name of the column that represents the dependent variable being
            predicted.
            Types: str

        backward:
            Optional Argument.
            Specifies whether to take backward steps or not. Backward steps, i.e.,
            removing variables from a model, use the P-value of the T-statistic, i.e.,
            the ratio of a B-coefficient to its standard error. The variable (column)
            with the largest P-value is removed if the P-value exceeds the criterion to
            remove.
            Types: bool

        backward_only:
            Optional Argument.
            Specifies whether to use only the backward technique or not. This technique
            is similar to the backward technique, except that a forward step is not
            performed. It starts with all independent variables in the model. Backward
            steps are executed until no more are possible.
            Types: bool

        exclude_columns:
            Optional Argument.
            Specifies the name(s) of the column(s) to exclude from the analysis, if a
            column specifier such as 'all' or 'allnumeric' is used in the "columns"
            argument. When the "exclude_columns" argument is used, the dependent variable
            and group by columns, if any, are automatically excluded as input columns and
            do not need to be listed in "exclude_columns".
            Types: str OR list of Strings (str)

        cond_ind_threshold:
            Optional Argument.
            Specifies the condition index threshold value to use while generating the
            near dependency report. This is used when "near_dep_report" is set to True.
            Default Value: 30
            Types: int

        constant:
            Optional Argument.
            Specifies whether the logistic model includes a constant term or not.
            When set to True, the model includes a constant term.
            Default Value: True
            Types: bool

        convergence:
            Optional Argument.
            Specifies the convergence criterion such that the algorithm stops iterating
            when the change in the log likelihood function falls below this value.
            Default Value: 0.001
            Types: float

        entrance_criterion:
            Optional Argument.
            Specifies the criterion to enter a variable into the model. The W-statistic
            chi-square P-value must be less than this value for a variable to be added.
            Default Value: 0.05
            Types: float

        forward:
            Optional Argument.
            Specifies whether to use the forward technique or not. When set to True, the
            technique starts with no independent variables in the model and performs a
            forward step, adding the "best" choice, followed by a backward step, removing
            the worst choice. Refer to the "stepwise" argument for a description of the
            steps in this technique.
            Types: bool

        forward_only:
            Optional Argument.
            Specifies whether to use only the forward technique or not. This technique is
            similar to the forward technique, except that a backward step is not
            performed.
            Types: bool

        group_columns:
            Optional Argument.
            Specifies the name(s) of the column(s) dividing the input into partitions,
            one for each combination of values in the group by columns. For each
            partition or combination of values, a separate logistic model and XML report
            is built.
            Types: str OR list of Strings (str)

        lift_output:
            Optional Argument.
            Specifies whether to build a lift chart and add it to the function's output
            string. The computed probability values are split into deciles with the usual
            counts and percentages to demonstrate what happens when more and more rows of
            ordered probabilities are accumulated.
            Types: bool

        matrix_data:
            Optional Argument.
            Specifies the input matrix data to use for the analysis. Instead of
            internally building a matrix each time this analysis is performed, the user
            may build an ESSCP matrix once with the Matrix Analysis, using the Matrix()
            function.
            The matrix can subsequently be read from this data instead of re-building it
            each time. If this is specified, the columns specified with the "columns"
            argument must all be present in the matrix; they may be a subset of the
            matrix columns and can be specified in any order. Further, if group by
            columns are specified in the matrix, these same group by columns must be
            specified in this analysis.
            Types: teradataml DataFrame

        max_iter:
            Optional Argument.
            Specifies the maximum number of attempts to converge on a solution.
            Default Value: 100
            Types: int

        mem_size:
            Optional Argument.
            Specifies the memory size in megabytes to allocate for in-memory Logistic
            Regression. If the data does not fit in this amount of memory, or if the
            argument is set to 0 or not specified, normal SQL processing is performed.
            Types: int

        near_dep_report:
            Optional Argument.
            Specifies whether to produce an XML report showing columns that may be
            collinear as part of the output or not. The report is included in the XML
            output only if collinearity is detected. Two threshold arguments are
            available for this report, "cond_ind_threshold" and
            "variance_prop_threshold".
            Types: bool

        remove_criterion:
            Optional Argument.
            Specifies the criterion to remove a variable from the model. The T-statistic
            P-value must be greater than this value for a variable to be removed.
            Default Value: 0.05
            Types: float

        response_value:
            Optional Argument.
            Specifies the value assumed by the dependent column that is to be treated as
            the response value.
            Types: str

        sample:
            Optional Argument.
            Specifies whether to read a sample of the data into memory for processing
            when the available memory size is less than the amount of data to process.
            When set to True, a sample of data is read.
            Types: bool

        stats_output:
            Optional Argument.
            Specifies whether to deliver an optional data quality report in the
            function's XML output string or not. The report includes the mean and
            standard deviation of each model variable, derived from an ESSCP matrix.
            Default Value: False
            Types: bool

        stepwise:
            Optional Argument.
            Specifies whether to perform a stepwise procedure or not. Forward steps,
            i.e., adding variables to a model, add the variable with the smallest
            chi-square P-value connected to its special W-statistic, provided the P-value
            is less than the criterion to enter. Backward steps, i.e., removing variables
            from a model, use the P-value of the T-statistic, i.e., the ratio of a
            B-coefficient to its standard error. The variable (column) with the largest
            P-value is removed if the P-value exceeds the criterion to remove.
            Default Value: False
            Types: bool

        success_output:
            Optional Argument.
            Specifies whether to deliver an optional success report in the function's
            XML output string or not. The report displays counts of predicted versus
            actual values of the dependent variable of the logistic regression model. It
            is similar to the Decision Tree Confusion Matrix, but it includes only two
            values of the dependent variable, namely response versus non-response.
            Default Value: False
            Types: bool

        start_threshold:
            Optional Argument.
            Specifies the beginning threshold value utilized in the Multi-Threshold
            Success output.
            Types: float, int

        end_threshold:
            Optional Argument.
            Specifies the ending threshold value utilized in the Multi-Threshold Success
            output.
            Types: float, int

        increment_threshold:
            Optional Argument.
            Specifies the difference in threshold values between adjacent rows in the
            Multi-Threshold Success output.
            Types: float, int

        threshold_output:
            Optional Argument.
            Specifies whether to produce the Multi-Threshold Success output and include
            it in the XML output string in the result or not.
            This report can be thought of as a table where each row is a Prediction
            Success Table, and each row has a different threshold value as generated by
            the "start_threshold", "end_threshold", and "increment_threshold" arguments.
            A threshold here is the value above which the predicted probability indicates
            a response.
            Default Value: False
            Types: bool

        variance_prop_threshold:
            Optional Argument.
            Specifies the variance proportion threshold value to use while generating the
            near dependency report. This is used when "near_dep_report" is set to True.
            Default Value: 0.5
            Types: float

    RETURNS:
        An instance of LogReg.
        Output teradataml DataFrames can be accessed using attribute references, such as
        LogRegObj.<attribute_name>.
        Output teradataml DataFrame attribute names are:
            1. model
            2. statistical_measures
            3. xml_reports

    RAISES:
        TeradataMlException, TypeError, ValueError

    EXAMPLES:
        # Notes:
        #   1. To execute Vantage Analytic Library functions,
        #       a. import "valib" object from teradataml.
        #       b. set 'configure.val_install_location' to the database name where Vantage
        #          analytic library functions are installed.
        #   2. Datasets used in these examples can be loaded using Vantage Analytic Library
        #      installer.

        # Import valib object from teradataml to execute this function.
        from teradataml import valib

        # Set the 'configure.val_install_location' variable.
        from teradataml import configure
        configure.val_install_location = "SYSLIB"

        # Create required teradataml DataFrame.
        from teradataml import DataFrame
        df = DataFrame("customer")
        print(df)

        # Example 1: Request the Near Dependency Report with its related options.
        obj = valib.LogReg(data=df,
                           columns=["age", "years_with_bank", "income"],
                           response_column="nbr_children",
                           response_value=1,
                           near_dep_report=True,
                           cond_ind_threshold=3,
                           variance_prop_threshold=0.3)

        # Print the results.
        print(obj.model)
        print(obj.statistical_measures)
        print(obj.xml_reports)

        # Example 2: Shows that 2 group by columns are requested. The output contains 1 row
        # for each combination of group by column values.
        obj = valib.LogReg(data=df,
                           columns=["age", "years_with_bank", "income"],
                           response_column="nbr_children",
                           group_columns=["gender", "marital_status"])

        # Print the results.
        print(obj.model)
        print(obj.statistical_measures)
        print(obj.xml_reports)

        # Example 3: Shows how a pre-built matrix can be used for generating logistic
        # regression model.
        # Generate the ESSCP matrix.
        mat_obj = valib.Matrix(data=df,
                               columns=["income", "age", "years_with_bank", "nbr_children"],
                               type="esscp")

        # Print the results.
        print(mat_obj.result)

        # Use the generated matrix in building logistic regression model.
        obj = valib.LogReg(data=df,
                           columns=["age", "years_with_bank", "income"],
                           response_column="nbr_children",
                           response_value=1,
                           matrix_data=mat_obj.result)

        # Print the results.
        print(obj.model)
        print(obj.statistical_measures)
        print(obj.xml_reports)
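
The Multi-Threshold Success output described under the "threshold_output" argument can be pictured with a small client-side sketch. The code below is plain Python and is not the Vantage Analytic Library implementation. The function names (success_counts, multi_threshold_table), the dictionary keys, and the sample probabilities are hypothetical and chosen only for illustration. It assumes that a row counts as a predicted response when its probability is strictly above the threshold, and that the ending threshold is included in the table; the actual report layout may differ.

```python
# Illustrative sketch only: plain Python, not the Vantage Analytic Library code.
# All names and sample values here are hypothetical.


def success_counts(probabilities, actuals, threshold):
    """Count predicted versus actual outcomes at one threshold.

    A row is predicted as a response (1) when its probability is strictly
    above the threshold, otherwise as a non-response (0).
    """
    counts = {
        "resp_pred_resp": 0,        # actual response, predicted response
        "resp_pred_nonresp": 0,     # actual response, predicted non-response
        "nonresp_pred_resp": 0,     # actual non-response, predicted response
        "nonresp_pred_nonresp": 0,  # actual non-response, predicted non-response
    }
    for probability, actual in zip(probabilities, actuals):
        predicted = 1 if probability > threshold else 0
        actual_part = "resp" if actual == 1 else "nonresp"
        predicted_part = "resp" if predicted == 1 else "nonresp"
        counts[f"{actual_part}_pred_{predicted_part}"] += 1
    return counts


def multi_threshold_table(probabilities, actuals, start, end, increment):
    """Build one success-count row per threshold from start to end (inclusive)."""
    steps = int(round((end - start) / increment))
    rows = []
    for i in range(steps + 1):
        # Round to suppress floating-point drift in the generated thresholds.
        threshold = round(start + i * increment, 10)
        row = {"threshold": threshold}
        row.update(success_counts(probabilities, actuals, threshold))
        rows.append(row)
    return rows


# Hypothetical predicted probabilities and actual outcomes (1 = response).
probabilities = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
actuals = [1, 1, 1, 0, 0, 0]

table = multi_threshold_table(probabilities, actuals,
                              start=0.25, end=0.75, increment=0.25)
for row in table:
    print(row)
```

Each printed row plays the role of one Prediction Success Table: raising the threshold moves rows from the predicted-response counts to the predicted-non-response counts, which is the trade-off the "start_threshold", "end_threshold", and "increment_threshold" arguments let you inspect.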