Decision Trees - Teradata Warehouse Miner

Purpose

Decision Trees are a popular algorithm class found in many data mining products. Teradata Warehouse Miner includes a few classification algorithms such as gain ratio, gini index and Chaid as well as one regression algorithm. The only algorithm that resides in-database currently is gain ratio which is available in the Teradata Warehouse Miner Product as a decision tree splitting method called Gain Ratio Extreme. It is a stand-alone external stored procedure that can be executed directly in the Teradata database, independently of TWM. It is the stand-alone version and its parameters are described in this document.

In order to execute the in database decision tree algorithm the td_analyze stored procedure and the tda_dt_calc table operator must be installed on the Teradata system, with appropriate permissions such as Execute Procedure granted to the user. The in-database decision tree feature is dependent on Release 15.00 of the Teradata RDBMS.

For each call to td_analyze, a decision tree is performed. The first parameter for decision trees is the decisiontree function name, followed by decision tree parameters.

A Gain Ratio Extreme Decision Tree returns a data set that can be viewed as result set. The result set contains one row with two columns. The second column contains an XML string representing the resulting decision tree model described in Predictive Model Markup Language (PMML).

Syntax

call twm. td_analyze('decisiontree','database=twm_source;tablename=twm_customer_analysis;columns=col names;dependent=column;General Parameters');

Required Parameters

columns: The independent input columns used in decision tree building. These columns must reside in the table named with the tablename parameter, residing in the database named with the database parameter.; For example: columns=column1,column2,column3
database: The database containing the input table.
decisiontree: The parameter that identifies the type of function being performed.
dependent: The dependent value is the name of a column whose values are being predicted. It is selected from the available columns that reside in the table specified by the database and tablename parameters.
tablename: The name of the table to be transformed.

Optional Parameters

algorithm: This option contains the value of the algorithm the decision tree will use during building. Currently this option only allows gainratio.
binning: Option to automatically Bincode the continuous independent variables. Continuous data is separated into one hundred bins when this option is selected. If the variable has less than one hundred distinct values, this option is ignored. Default is false.
max_depth: Another method of stopping the tree is to specify the maximum depth the tree may grow to. This option will stop the algorithm if the tree being built has this many levels. The default is 100.
min_records: This option determines how far the splitting of the decision tree will go. Unless a node is pure (meaning it has only observations with the same dependent value) it will split if each branch that can come off this node will contain at least this many observations. The default is a minimum of two cases for each branch.
operatordatabase: The database where the tda_kmeans table operator called by td_analyze resides. If not specified, the database software searches the standard search path for table operators, including the current user database.; For example: operatordatabase=twm
outputdatabase: The database that will contain the resulting output table when outputstyle=table or view.
outputtablename: The name of the output table representing the decision tree model.
pruning: Option which determines the style of pruning to use after the tree is fully built. The default option is gainratio. The only other option at this time is none which will do no pruning of the tree.

Example

The following example assumes that the td_analyze function has been installed in a database named twm.

This example shows how to invoke the td_analyze stored procedure and the tda_dt_calc table operator to perform decision tree. The resulting model is returned from the td_analyze stored procedure or placed in the output database and output table chosen.

call twm.td_analyze('decisiontree','database=twm_source;tablename=twm_customer_analysis;columns=income,age,nbr_children;dependent=gender;min_records=2;max_depth=5;binning=false;algorithm=gainratio;pruning=gainratio;outputdatabase=twm;outputtable=cust_analysis_dt;operatordatabase=twm;');