- LinReg(data, columns=None, response_column=None, backward=None, backward_only=None, exclude_columns=None, cond_ind_threshold=30, constant=True, entrance_criterion=None, forward=None, forward_only=None, group_columns=None, matrix_input=False, near_dep_report=None, remove_criterion=None, stats_output=None, stepwise=False, use_fstat=True, use_pvalue=False, variance_prop_threshold=0.5, charset=None)
- DESCRIPTION:
Linear Regression is one of the fundamental types of predictive modeling algorithms.
In linear regression, a dependent numeric variable is expressed in terms of the sum
of one or more independent numeric variables, which are each multiplied by a numeric
coefficient, usually with a constant term added to the sum of independent variables.
It is the coefficients of the independent variables together with a constant term
that comprise a linear regression model. Applying these coefficients to the variables
(columns) of each observation (row) in a data set is known as scoring, as described
in Linear Regression Scoring.
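As a minimal sketch of what scoring means here, the following pure-Python snippet applies a set of coefficients and a constant term to each row. The function name `score_row` and the coefficient values are hypothetical and chosen only for illustration; they are not part of teradataml and do not come from a fitted model.

```python
def score_row(row, coefficients, constant=0.0):
    """Return constant + sum(coefficient * value) for one observation."""
    return constant + sum(coef * row[name] for name, coef in coefficients.items())


# Hypothetical coefficients; a real model would supply these.
coefficients = {"age": 2.0, "years_with_bank": 10.0}

rows = [
    {"age": 30, "years_with_bank": 5},
    {"age": 40, "years_with_bank": 0},
]

# One predicted value per observation (row).
scores = [score_row(row, coefficients, constant=100.0) for row in rows]
print(scores)
```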
Some of the key features of Linear Regression are outlined below.
* The Teradata table operator CALCMATRIX is used to build a DataFrame that
represents an extended cross-products matrix that is the input to the algorithm.
* One or more group by columns may optionally be specified so that an input
matrix is built for each combination of group by column values, and subsequently
a separate linear model is built for each matrix.
To achieve this, the names of the group by columns are passed to CALCMATRIX as
parameters, so it includes them as columns in the matrix output it creates.
* The stepwise feature for Linear Regression is a technique for selecting the
independent variables in a linear model. It consists of different methods of
"trying" a variable and adding or removing it from a model by checking either
a partial F-Statistic or the P-Value of a T-Statistic, at the user's choice.
* The algorithm is partially scalable because the size of each input matrix
depends only on the number of independent variables (columns) and not on the
size of the input DataFrame. The calculations performed on the client workstation,
however, are not scalable when group by columns are used, because the models are
built serially, one for each matrix in the matrix DataFrame.
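The scalability point above can be illustrated with a small pure-Python sketch. It accumulates a cross-products matrix over a constant column, one predictor, and the response, and shows that the matrix size depends on the number of columns, not the number of rows. This is not how CALCMATRIX is implemented; `cross_products` is a hypothetical helper, and the closed-form slope and intercept apply only to the one-predictor case used here.

```python
def cross_products(rows, names):
    """Accumulate sums of pairwise products over [1, *names] in a single pass."""
    size = len(names) + 1
    matrix = [[0.0] * size for _ in range(size)]
    for row in rows:
        values = [1.0] + [float(row[name]) for name in names]
        for i in range(size):
            for j in range(size):
                matrix[i][j] += values[i] * values[j]
    return matrix


rows = [{"x": 1, "y": 3}, {"x": 2, "y": 5}, {"x": 3, "y": 7}]

small = cross_products(rows, ["x", "y"])
large = cross_products(rows * 1000, ["x", "y"])

# Both matrices are 3x3 even though the second input has 1000 times more rows.
print(len(small), len(large))

# With one predictor, the model follows from the matrix entries alone.
n, sum_x, sum_y = small[0][0], small[0][1], small[0][2]
sum_xx, sum_xy = small[1][1], small[1][2]
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n
print(slope, intercept)
```

Once the matrix is built, the row-level data is no longer needed to fit the model, which is why only the matrix-building step has to scale with the input size.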
PARAMETERS:
data:
Required Argument.
Specifies the input data to build a predictive model from.
Types: teradataml DataFrame
columns:
Required Argument.
Specifies the name(s) of the column(s) representing the independent variables
used in building a linear regression model. Alternatively, it accepts one of the
permitted strings below to specify all columns, or all numeric columns.
Permitted Values:
* Name(s) of the column(s) in "data".
* Pre-defined strings:
* 'all' - all columns
* 'allnumeric' - all numeric columns
Types: str OR list of Strings (str)
response_column:
Required Argument.
Specifies the name of the column that represents the dependent variable.
Types: str
backward:
Optional Argument.
Specifies whether to use the backward technique. When set to True, starting
with all independent variables in the model, one backward step is followed by
one forward step until no variables can be removed. A backward step consists
of computing the Partial F-Statistic for each variable and removing the variable
with the smallest value if it is less than the criterion to remove. Alternatively,
the P-value of the T-Statistic may be used instead of the Partial F-Statistic. All
of the P-values for variables in the model can be calculated at once, and the
variable with the largest P-value is removed if it is greater than the criterion
to remove.
Types: bool
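The removal rule of a single backward step can be sketched as follows. The sketch assumes the statistics have already been computed and only shows the selection logic. The function names and the statistic values are hypothetical, and the default thresholds mirror the documented defaults of "remove_criterion".

```python
def backward_step_fstat(partial_f, remove_criterion=3.84):
    """Return the variable with the smallest partial F if it is below the criterion."""
    if not partial_f:
        return None
    candidate = min(partial_f, key=partial_f.get)
    return candidate if partial_f[candidate] < remove_criterion else None


def backward_step_pvalue(p_values, remove_criterion=0.05):
    """Return the variable with the largest P-value if it is above the criterion."""
    if not p_values:
        return None
    candidate = max(p_values, key=p_values.get)
    return candidate if p_values[candidate] > remove_criterion else None


# Made-up statistics for the variables currently in the model.
partial_f = {"age": 25.1, "years_with_bank": 1.2, "nbr_children": 7.9}
p_values = {"age": 0.001, "years_with_bank": 0.27, "nbr_children": 0.005}

removed_by_f = backward_step_fstat(partial_f)
removed_by_p = backward_step_pvalue(p_values)
print(removed_by_f, removed_by_p)
```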
backward_only:
Optional Argument.
Specifies whether to use only backward technique or not. This technique is
similar to the backward technique, except that a forward step is not performed.
It starts with all independent variables in the model. Backward steps are
executed until no more are possible.
Types: bool
exclude_columns:
Optional Argument.
Specifies the name(s) of the specific column(s) to exclude from the analysis,
if a column specifier such as 'all' or 'allnumeric' is used in the "columns"
argument. For convenience, when the "exclude_columns" argument is used, the
dependent variable and group by columns, if any, are automatically excluded as
input columns and do not need to be included as "exclude_columns".
Types: str OR list of Strings (str)
cond_ind_threshold:
Optional Argument.
Specifies the condition index threshold value to use while generating near
dependency report. This is used when "near_dep_report" is set to True.
Default Value: 30
Types: int
constant:
Optional Argument.
Specifies whether the linear model includes a constant term or not. When set
to True, the linear model includes a constant term.
Default Value: True
Types: bool
entrance_criterion:
Optional Argument.
Specifies the criterion to enter a variable into the model. The Partial
F-Statistic must be greater than this value, or the T-Statistic P-value must
be less than this value, depending on the "use_fstat" or "use_pvalue" argument
value.
Default Value: 3.84 if "use_fstat" is True, and 0.05 if "use_pvalue" is True.
Types: float
forward:
Optional Argument.
Specifies whether to use forward technique or not. When set to True, starting
with no independent variables in the model, a forward step is performed, adding
the best choice in explaining the dependent variable's variance, followed by a
backward step, removing the worst choice. A forward step is made by determining
the largest partial F-Statistic and adding the corresponding variable to the
model, provided the statistic is greater than the criterion to enter (see the
"entrance_criterion" argument).
An alternative is to use the P-value of the T-Statistic (the ratio of a variable's
B coefficient to its Standard Error). When the P-value is used, a forward step
determines the variable with the smallest P-value and adds that variable if the
P-value is less than the criterion to enter. (If more than one variable has a
P-value of zero, the F-Statistic is used instead.)
Types: bool
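The P-value variant of a forward step, including the zero P-value tie rule, can be sketched as below. The statistics are assumed to be precomputed. The function name is hypothetical, and resolving the tie by taking the largest partial F-Statistic among the tied variables is one possible reading of the rule, not a confirmed description of the implementation.

```python
def forward_step_pvalue(p_values, partial_f, entrance_criterion=0.05):
    """Pick the candidate variable to add, or None if none qualifies."""
    if not p_values:
        return None
    zero_p = [name for name, p in p_values.items() if p == 0.0]
    if len(zero_p) > 1:
        # Assumed reading of the tie rule: fall back to the F-Statistic
        # and take the tied variable with the largest partial F.
        return max(zero_p, key=lambda name: partial_f[name])
    candidate = min(p_values, key=p_values.get)
    return candidate if p_values[candidate] < entrance_criterion else None


# Made-up statistics for variables not yet in the model.
partial_f = {"age": 40.0, "years_with_bank": 55.0, "nbr_children": 1.1}

no_tie = forward_step_pvalue(
    {"age": 0.2, "years_with_bank": 0.01, "nbr_children": 0.04}, partial_f
)
tie = forward_step_pvalue(
    {"age": 0.0, "years_with_bank": 0.0, "nbr_children": 0.3}, partial_f
)
print(no_tie, tie)
```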
forward_only:
Optional Argument.
Specifies whether to use only forward technique or not. This technique is similar
to the forward technique, except that a backward step is not performed.
Types: bool
group_columns:
Optional Argument.
Specifies the name(s) of the column(s) dividing the input into partitions, one
for each combination of values in the group by columns. For each partition or
combination of values a separate linear model is built.
Types: str OR list of Strings (str)
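To illustrate the partitioning, the sketch below groups rows by the combination of values in the group by columns using plain Python. `partition_rows` is a hypothetical helper and the rows are made-up data; in this sketch a partition exists only for combinations that actually occur in the input.

```python
from collections import defaultdict


def partition_rows(rows, group_columns):
    """Map each combination of group by values to the rows that carry it."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in group_columns)
        partitions[key].append(row)
    return dict(partitions)


# Made-up rows for illustration.
rows = [
    {"gender": "F", "marital_status": 1, "income": 100},
    {"gender": "F", "marital_status": 2, "income": 120},
    {"gender": "M", "marital_status": 1, "income": 90},
    {"gender": "F", "marital_status": 1, "income": 110},
]

partitions = partition_rows(rows, ["gender", "marital_status"])

# One model would be built per key.
for key, members in partitions.items():
    print(key, len(members))
```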
matrix_input:
Optional Argument.
Specifies whether the input teradataml DataFrame passed to argument "data"
represents an ESSCP matrix built by the Matrix Building function or not. Refer
to the "valib.Matrix()" function for more details.
When this is set to True, the input passed to "data" argument represents an
ESSCP matrix built by the Matrix Building function. Use of this feature saves
internally building a matrix each time this function is performed, providing
a significant performance improvement. The columns specified with the "columns"
argument may be a subset of the columns in this matrix and may be specified in
any order. The columns must, however, all be present in the matrix. Further,
if group by columns are specified in the matrix, these same group by columns
must be specified in this function.
Note:
If the input represents a saved matrix, make sure to set matrix_input=True
because results can otherwise be unpredictable. A saved matrix may look
like an ordinary teradataml DataFrame to this function.
Default Value: False
Types: bool
near_dep_report:
Optional Argument.
Specifies whether to produce an XML report showing columns that may be collinear
as part of the output or not. The report is included in the XML output only if
collinearity is detected.
Two threshold arguments are available for this report, "cond_ind_threshold" and
"variance_prop_threshold".
Types: bool
remove_criterion:
Optional Argument.
Specifies the criterion to remove a variable from the model. The Partial
F-Statistic must be less than this value, or the T-Statistic P-value must be
greater than this value, depending on the "use_fstat" or "use_pvalue" argument
value.
Default Value: 3.84 if "use_fstat" is True, and 0.05 if "use_pvalue" is True.
Types: float
stats_output:
Optional Argument.
Specifies whether to produce an additional data quality report which includes
the mean and standard deviation of each model variable, derived from an ESSCP
matrix. The report is included in the XML output.
Types: bool
stepwise:
Optional Argument.
Specifies whether to perform a stepwise procedure or not.
Default Value: False
Types: bool
use_fstat:
Optional Argument.
Specifies whether to use the partial F-Statistic in assessing whether a variable
should be added or removed.
Default Value: True
Types: bool
use_pvalue:
Optional Argument.
Specifies whether to use the T-Statistic P-value in assessing whether a variable
should be added or removed.
Default Value: False
Types: bool
variance_prop_threshold:
Optional Argument.
Specifies the variance proportion threshold value to use while generating near
dependency report. This is used when "near_dep_report" is set to True.
Default Value: 0.5
Types: float
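How the two thresholds might interact can be sketched with the conventional collinearity diagnostic rule: a near dependency is flagged when a condition index exceeds "cond_ind_threshold" and at least two variables have a variance proportion above "variance_prop_threshold" for that index. Whether the function applies exactly this rule is an assumption. `near_dependencies` is a hypothetical helper, and the condition indices and variance proportions are made-up, precomputed values.

```python
def near_dependencies(condition_indices, variance_proportions,
                      cond_ind_threshold=30, variance_prop_threshold=0.5):
    """Return (condition index, involved variables) for each flagged near dependency."""
    report = []
    for index, proportions in zip(condition_indices, variance_proportions):
        if index <= cond_ind_threshold:
            continue
        involved = [name for name, proportion in proportions.items()
                    if proportion > variance_prop_threshold]
        # Assumed rule: a dependency needs at least two participating variables.
        if len(involved) >= 2:
            report.append((index, involved))
    return report


# Made-up diagnostics: one row of variance proportions per condition index.
condition_indices = [1.0, 4.2, 58.3]
variance_proportions = [
    {"age": 0.01, "years_with_bank": 0.02, "nbr_children": 0.60},
    {"age": 0.04, "years_with_bank": 0.03, "nbr_children": 0.39},
    {"age": 0.95, "years_with_bank": 0.95, "nbr_children": 0.01},
]

flagged = near_dependencies(condition_indices, variance_proportions)
print(flagged)
```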
charset:
Optional Argument.
Specifies the character set for the table name and column names.
If this argument is not set, the function uses the default value set by
the VAL library.
Permitted Values:
* 'UTF8'
* 'ASCII'
Types: str
RETURNS:
An instance of LinReg.
Output teradataml DataFrames can be accessed using attribute references, such as
LinRegObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. model
2. statistical_measures
3. xml_reports
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
from teradataml import DataFrame
df = DataFrame("customer")
print(df)
# Example 1: Shows how input columns 'age', 'years_with_bank', and 'nbr_children' are
# used to predict 'income'.
obj = valib.LinReg(data=df,
columns=["age", "years_with_bank", "nbr_children"],
response_column="income")
# Print the results.
print(obj.model)
print(obj.statistical_measures)
print(obj.xml_reports)
# Example 2: Shows how group by columns 'gender' and 'marital_status' result in
# 2x4=8 models being built.
obj = valib.LinReg(data=df,
columns=["age", "years_with_bank", "nbr_children"],
response_column="income",
group_columns=["gender", "marital_status"])
# Print the results.
print(obj.model)
print(obj.statistical_measures)
print(obj.xml_reports)