Description
Linear Regression is one of the fundamental types of predictive modeling algorithms.
In linear regression, a dependent numeric variable is expressed as the sum
of one or more independent numeric variables, each multiplied by a numeric
coefficient, usually with a constant term added to the sum.
The coefficients of the independent variables, together with the constant term,
make up a linear regression model. Applying these coefficients to the variables
(columns) of each observation (row) in a data set is known as scoring, as described
in Linear Regression Scoring.
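As a minimal local sketch of the description above, the following base R snippet fits a model with lm() on a small synthetic data frame and then scores new rows by applying the coefficients by hand. It uses base R only and does not call td_lin_reg_valib or Vantage; the column values are made up for illustration.

```r
# Synthetic data with an exact linear relationship (illustrative values only).
train <- data.frame(age = c(25, 32, 47, 51, 38, 29),
                    years_with_bank = c(1, 4, 9, 3, 7, 2))
train$income <- 1000 + 300 * train$age + 500 * train$years_with_bank

# The model is the constant term plus one coefficient per independent variable.
fit <- lm(income ~ age + years_with_bank, data = train)
coefs <- coef(fit)
print(coefs)

# Scoring: apply the coefficients to the columns of each new observation (row).
new_rows <- data.frame(age = c(40, 60), years_with_bank = c(5, 10))
scores <- coefs[["(Intercept)"]] +
  coefs[["age"]] * new_rows$age +
  coefs[["years_with_bank"]] * new_rows$years_with_bank
print(scores)
```

The hand-computed scores should agree with predict(fit, new_rows), which performs the same calculation.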
Some of the key features of Linear Regression are outlined below.
The Teradata table operator CALCMATRIX is used to build an object of class "tbl_teradata" that represents an extended cross-products matrix, which is the input to the algorithm.
One or more group by columns may optionally be specified so that an input matrix is built for each combination of group by column values, and a separate linear model is built for each matrix.
To achieve this, the names of the group by columns are passed to CALCMATRIX as parameters, so it includes them as columns in the matrix output it creates.
The stepwise feature for Linear Regression is a technique for selecting the independent variables in a linear model. It consists of different methods of trying a variable and adding or removing it from a model by checking either a partial F-Statistic or the P-Value of a T-Statistic, at the user's choice.
The algorithm is partially scalable because the size of each input matrix depends only on the number of independent variables (columns) and not on the size of the input tbl_teradata. The calculations performed on the client workstation, however, are not scalable when group by columns are used, because the models are built serially, one for each matrix in the matrix tbl_teradata.
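The scalability point can be illustrated locally. The sketch below is a base R analogy, not the CALCMATRIX implementation: it builds a cross-products matrix from synthetic data and solves the normal equations from that matrix alone. The variable names and data are assumptions made for the example.

```r
set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.1)

# Extended data matrix: a column of ones (for the constant term), the
# independent variables, and the dependent variable.
z <- cbind(const = 1, x1 = x1, x2 = x2, y = y)

# Cross-products matrix: 4 x 4 here, however many rows the data has.
xp <- crossprod(z)
print(dim(xp))

# Solve the normal equations (X'X) b = X'y using only blocks of that matrix.
pred <- c("const", "x1", "x2")
b <- solve(xp[pred, pred], xp[pred, "y"])
print(b)
```

Once the matrix is built, the row-level data is no longer needed to estimate the coefficients, which is why the matrix size depends on the number of columns rather than the number of rows.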
Usage
td_lin_reg_valib(data, columns, response.column, ...)
Arguments
data
Required Argument.
columns
Required Argument.
Types: character OR vector of Strings (character)
response.column
Required Argument.
...
Specifies other arguments supported by the function as described in the 'Other Arguments' section.
Value
Function returns an object of class "td_lin_reg_valib"
which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using names:
model
statistical.measures
xml.reports
Other Arguments
backward
Optional Argument.
Specifies whether to use backward technique or not. When set to TRUE,
starting with all independent variables in the model, one backward step
is followed by one forward step until no variables can be removed.
A backward step consists of computing the Partial F-Statistic for each
variable and removing the one with the smallest value if it is less than
the criterion to remove. The P-value of the T-Statistic may be used
instead of the Partial F-Statistic. All of the P-values for variables in
the model can be calculated at once, removing the variable with the
largest P-Value if greater than the criterion to remove.
Types: logical
backward.only
Optional Argument.
Specifies whether to use only backward technique or not. This
technique is similar to the backward technique, except that a
forward step is not performed. It starts with all independent
variables in the model. Backward steps are executed until no
more are possible.
Types: logical
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the
analysis, if a column specifier such as 'all', 'allnumeric'
is used in the "columns" argument. By default, when the
"exclude.columns" argument is used, dependent variable and
group by columns, if any, are automatically excluded as
input columns and do not need to be included as
"exclude.columns".
Types: character OR vector of Strings (character)
cond.ind.threshold
Optional Argument.
Specifies the condition index threshold value to use
when generating the near dependency report. This is used
when "near.dep.report" is set to TRUE.
Default Value: 30
Types: integer
constant
Optional Argument.
Specifies whether the linear model includes a constant term
or not. When set to TRUE, model includes a constant term.
Default Value: TRUE
Types: logical
entrance.criterion
Optional Argument.
Specifies the criterion to enter a variable into the model.
The Partial F-Statistic must be greater than this value, or
the T-Statistic P-value must be less than this value, depending
on the value passed to "use.fstat" or "use.pvalue" arguments.
Default Value: 3.84 if "use.fstat" is TRUE and
0.05 if "use.pvalue" is TRUE.
Types: numeric
forward
Optional Argument.
Specifies whether to use forward technique or not. When set to TRUE,
starting with no independent variables in the model, a forward step is
performed, adding the best choice in explaining the dependent variable's
variance, followed by a backward step, removing the worst choice. A forward
step is made by determining the largest partial F-Statistic and adding
the corresponding variable to the model, provided the statistic is greater
than the criterion to enter (see the "entrance.criterion").
An alternative is to use the P-value of the T-Statistic (the ratio of a
variable's B coefficient to its Standard Error). When the P-value is used,
a forward step determines the variable with the smallest P-value and
adds that variable if the P-value is less than the criterion to enter.
(If more than one variable has a P-value of zero, the F-Statistic is used
instead.)
Types: logical
forward.only
Optional Argument.
Specifies whether to use only forward technique or not. This
technique is similar to the forward technique, except that a
backward step is not performed.
Types: logical
group.columns
Optional Argument.
Specifies the name(s) of the column(s) dividing the input into
partitions, one for each combination of values in the group by
columns. For each partition or combination of values a separate
linear model is built.
Types: character OR vector of Strings (character)
matrix.input
Optional Argument.
Specifies whether the input tbl_teradata passed to argument "data"
represents an ESSCP matrix built by the Matrix Building function or not.
Refer to the td_matrix_valib function for more details.
When this is set to TRUE, the input passed to "data" argument
represents an ESSCP matrix built by the Matrix Building function.
Using this feature avoids building a matrix internally each time
this function is run, providing a significant performance
improvement. The columns specified with the "columns" argument may
be a subset of the columns in this matrix and may be specified in
any order. The columns must, however, all be present in the matrix.
Further, if group by columns are specified in the matrix, these
same group by columns must be specified in this function.
Note:
If the input represents a saved matrix, make sure to set this argument to TRUE because results can otherwise be unpredictable.
Default Value: FALSE
Types: logical
near.dep.report
Optional Argument.
Specifies whether to produce an XML report showing columns
that may be collinear as part of the output or not. The report
is included in the XML output only if collinearity is detected.
Two threshold arguments are available for this report,
"cond.ind.threshold" and "variance.prop.threshold".
Types: logical
remove.criterion
Optional Argument.
Specifies the criterion to remove a variable from the model.
For a variable to be removed, the Partial F-Statistic must be
less than this value, or the T-Statistic P-value must be greater
than this value, depending on the value passed to "use.fstat" or
"use.pvalue" arguments.
Default Value: 3.84 if "use.fstat" is TRUE and
0.05 if "use.pvalue" is TRUE.
Types: numeric
stats.output
Optional Argument.
Specifies whether to produce an additional data quality report
which includes the mean and standard deviation of each model
variable, derived from an ESSCP matrix. The report is included
in the XML output.
Types: logical
stepwise
Optional Argument.
Specifies whether to perform a stepwise procedure or not.
Default Value: FALSE
Types: logical
use.fstat
Optional Argument.
Specifies whether to use the partial F-Statistic in assessing whether
a variable should be added or removed.
Default Value: TRUE
Types: logical
use.pvalue
Optional Argument.
Specifies whether to use the T-Statistic P-value in assessing whether
a variable should be added or removed.
Default Value: FALSE
Types: logical
variance.prop.threshold
Optional Argument.
Specifies the variance proportion threshold value to
use when generating the near dependency report. This is
used when "near.dep.report" is set to TRUE.
Default Value: 0.5
Types: numeric
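The forward and backward steps described for the "forward", "backward", "entrance.criterion" and "remove.criterion" arguments can be sketched locally with base R's add1() and drop1(). This is an illustration of the partial F-Statistic idea only, on synthetic data; it is not the code the Vantage Analytic Library runs, and the 3.84 thresholds simply mirror the documented defaults.

```r
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)  # x3 is unrelated to y

enter_criterion <- 3.84   # mirrors the documented default for the F-Statistic
remove_criterion <- 3.84

# Forward step: from the constant-only model, compute the partial
# F-Statistic of each candidate variable.
fit0 <- lm(y ~ 1, data = d)
cand <- add1(fit0, scope = ~ x1 + x2 + x3, test = "F")
f_add <- cand[["F value"]][-1]            # drop the "<none>" row
names(f_add) <- rownames(cand)[-1]

# Add the variable with the largest partial F if it exceeds the criterion.
best <- names(which.max(f_add))
if (f_add[[best]] > enter_criterion) {
  fit1 <- update(fit0, as.formula(paste(". ~ . +", best)))
  print(coef(fit1))
}

# Backward step: from the full model, find the variable with the
# smallest partial F and flag it for removal if it is below the criterion.
fit_full <- lm(y ~ x1 + x2 + x3, data = d)
drops <- drop1(fit_full, test = "F")
f_drop <- drops[["F value"]][-1]
names(f_drop) <- rownames(drops)[-1]
worst <- names(which.min(f_drop))
remove_worst <- f_drop[[worst]] < remove_criterion
print(c(best = best, worst = worst))
```

Passing test = "F" also produces a "Pr(>F)" column in both tables, which could be compared against a 0.05 threshold to imitate the P-value variant.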
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option 'val.install.location' to
# the database name where Vantage Analytic Library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)
# Example 1: Shows how input columns 'age', 'years_with_bank', and 'nbr_children' are
# used to predict 'income'.
obj <- td_lin_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "nbr_children"),
                        response.column="income")
# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)
# Example 2: Shows how group by columns 'gender' and 'marital_status' result in 2x4=8 models
# being built.
obj <- td_lin_reg_valib(data=df,
                        columns=c("age", "years_with_bank", "nbr_children"),
                        response.column="income",
                        group.columns=c("gender", "marital_status"))
# Print the results.
print(obj$model)
print(obj$statistical.measures)
print(obj$xml.reports)
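For readers without a Vantage connection, the effect of group by columns in Example 2 can be imitated locally. The sketch below is a base R analogy on a synthetic data frame, not the VAL execution path: it fits one model per combination of two grouping columns. The data frame 'cust' and its values are assumptions made for the example.

```r
set.seed(7)
n <- 400
cust <- data.frame(
  gender = sample(c("F", "M"), n, replace = TRUE),
  marital_status = sample(1:4, n, replace = TRUE),
  age = runif(n, 20, 70)
)
cust$income <- 5000 + 400 * cust$age + rnorm(n, sd = 1000)

# One partition per combination of the grouping columns.
parts <- split(cust, list(cust$gender, cust$marital_status), drop = TRUE)

# One model per partition, built serially, as the description notes.
models <- lapply(parts, function(p) coef(lm(income ~ age, data = p)))
print(length(models))
print(models[[1]])
```

With two values of 'gender' and four of 'marital_status', the list holds one coefficient vector per non-empty combination.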