Linear Regression is one of the fundamental types of predictive modeling algorithms. In linear regression, a dependent numeric variable is expressed in terms of the sum of one or more independent numeric variables, which are each multiplied by a numeric coefficient, usually with a constant term added to the sum of independent variables. Linear Regression is the coefficients of the independent variables together with a constant term that comprise a linear regression model. Applying these coefficients to the variables (columns) of each observation (row) in a data set (table) is known as scoring, as described in Linear Regression Scoring.
- The Teradata supplied table operator CALCMATRIX is used to build a table that represents an extended cross-products matrix that is the input to the algorithm
- One or more group by columns may optionally be specified so that an input matrix is built for each combination of group by column values, and subsequently a separate linear model is built for each matrix.
To achieve this, the names of the group by columns are passed to CALCMATRIX as parameters, so it includes them as columns in the matrix table it creates.
- The algorithm is partially scalable because the size of each input matrix depends only on the number of independent variables (columns) and not on the size of the input table. The calculations performed on the client workstation however are not scalable when group by columns are used, because each model is built serially based on each matrix in the matrix table.
- Teradata Release 14.10 or later is required by this algorithm due to the use of the CALCMATRIX table operator, first available in that release.
To run the stand-alone version of the linear regression algorithm or to score a model built by this algorithm the td_analyze stored procedure must be installed on the Teradata system, with appropriate permissions granted. Refer to In-Database Analytic Function Setup for instructions on how to install td_analyze.
Syntaxcall twm.td_analyze('linear','database=twm_database;tablename=twm_tablename;columns=c1,c2;dependent=c0;groupby=c3,c4;outputdatabase=twm_out_db;outputtablename=twm_out_tbl;constant=true; ');
- The input columns representing independent variables. The columns must reside in the table named with the tablename parameter, residing in the database named with the database parameter.
For example: columns=column1,column2,column3
- The database containing the input table.
The name of the column that represents the dependent variable.
- The input table to build a predictive model from.
- Set to true if the linear model includes a constant term or false otherwise. The default value is true.
- The input columns dividing the input table into partitions, one for each combination of values in the group by columns. For each partition or combination of values a separate linear model is built. The columns must reside in the table named with the tablename parameter. The default case is no group by columns.
For example: groupby=column1,column2,column3
- The database that contains the resulting output table that represents one or more linear models.
- If outputdatabase and outputtablename are not both specified, a volatile output table with randomly generated name is created in the logon user database.
- The name of the output table representing one or more linear models. A second output table reporting statistical measures is automatically named on the user’s behalf by appending “_rpt” to the end of this name. These two output tables represent a single linear model with coefficients and statistical measures in the absence of group by columns, or if group by columns are specified, there is a model for each combination of group bycolumn values in these output tables.
Note that both of the output tables must first be dropped by the user if outputdatabase and outputtablename are both specified. If outputdatabase and outputtablename are not both specified, volatile output tables with randomly generated names are created in the logon user database, and the two output result sets are returned to the user instead.
When overwrite is set to true (default), the output tables are dropped before creating new ones.
Examples in this section demonstrate the use of Linear Regression, first without and then with group by parameters. To run the provided examples, the td_analyze function must be installed in a database called twm and the TWM tutorial data must be installed in the twm_source database.
In this example, input columns age, years_with_bank, and nbr_children are used to predict income. No permanent output tables are created, just two result data sets that are returned to the user, one with coefficients and statistics, the other with statistical measures.
In this example, two permanent output tables are created, one with coefficients and statistics, the other with statistical measures.
In this example, group by columns gender and marital_status result in 2x4=8 models being built.