Decision Trees - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

Product
Teradata Warehouse Miner
Release Number
5.4.4
Published
August 2017
Language
English (United States)
Last Update
2018-05-04
dita:id
B035-2306
Product Category
Teradata® Warehouse Miner

Purpose

Teradata Warehouse Miner includes several decision tree algorithms, such as gain ratio, Gini index, and CHAID, as well as one regression algorithm. Currently, the only algorithm that resides in-database is gain ratio, which is available in Miner as a decision tree splitting method called Gain Ratio Extreme. It is a stand-alone external stored procedure executed directly in the Teradata database.

To execute the in-database decision tree algorithm, the td_analyze stored procedure and the tda_dt_calc table operator must be installed on the Teradata system, and the user must be granted the appropriate permissions, such as Execute Procedure. The in-database decision tree feature requires Release 15.00 of the Teradata RDBMS.

Each call to td_analyze builds one decision tree. The first argument is the function name, decisiontree. The second argument is a string of semicolon-separated decision tree parameters.

A Gain Ratio Extreme decision tree returns a data set that can be viewed as a result set. The result set contains one row with two columns. The second column contains an XML string that represents the resulting decision tree model in Predictive Model Markup Language (PMML).
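
As a rough illustration of consuming that second column, the sketch below parses a PMML string with Python's standard library and lists the leaf predictions. The sample XML is a hand-written, simplified stand-in, not actual td_analyze output. The element names (TreeModel, Node, score) follow the general PMML tree layout, and the PMML version and namespace shown are assumptions about what Miner emits.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified PMML fragment used only for illustration.
# It is NOT actual td_analyze output; the namespace and version are assumed.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <TreeModel functionName="classification">
    <Node id="1" score="M" recordCount="10">
      <True/>
      <Node id="2" score="F" recordCount="6">
        <SimplePredicate field="income" operator="lessOrEqual" value="30000"/>
      </Node>
      <Node id="3" score="M" recordCount="4">
        <SimplePredicate field="income" operator="greaterThan" value="30000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def leaf_scores(pmml_text):
    """Return (node id, score) for every Node that has no child Node."""
    root = ET.fromstring(pmml_text)
    leaves = []
    for elem in root.iter():
        if local_name(elem.tag) != "Node":
            continue
        has_child_node = any(local_name(child.tag) == "Node" for child in elem)
        if not has_child_node:
            leaves.append((elem.get("id"), elem.get("score")))
    return leaves


print(leaf_scores(SAMPLE_PMML))
```

Matching on the local tag name keeps the sketch independent of whichever PMML namespace the real model string declares.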

Syntax

call twm.td_analyze('decisiontree','database=twm_source;tablename=twm_customer_analysis;columns=col names;dependent=column;General Parameters');

Required Parameters

columns
The independent input columns used to build the decision tree. These columns must exist in the table named by the tablename parameter, in the database named by the database parameter.
For example: columns=column1,column2,column3
database
The database containing the input table.
decisiontree
Identifies the type of function being performed.
dependent
The name of the column whose values are being predicted. It must be one of the columns in the table specified by the database and tablename parameters.
tablename
The name of the input table used to build the decision tree.
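
The required parameters above combine into a single semicolon-separated string. The following is a minimal sketch of assembling that string and the surrounding CALL text in Python. The helper names build_decisiontree_call and format_value are hypothetical and not part of Teradata Warehouse Miner. The twm procedure database and the trailing semicolon inside the string mirror the example in this guide, and the character check is an assumption made to keep the delimiters unambiguous.

```python
def format_value(value):
    """Render a Python value the way the parameter string spells it."""
    if isinstance(value, bool):
        return "true" if value else "false"  # the guide writes binning=false
    if isinstance(value, (list, tuple)):
        return ",".join(str(item) for item in value)  # e.g. columns=a,b,c
    return str(value)


def build_decisiontree_call(database, tablename, columns, dependent,
                            procedure="twm.td_analyze", **optional):
    """Hypothetical helper: build the CALL text for one decisiontree run."""
    if not columns:
        raise ValueError("at least one independent column is required")
    params = {
        "database": database,
        "tablename": tablename,
        "columns": columns,
        "dependent": dependent,
    }
    params.update(optional)  # optional parameters follow the required ones

    parts = []
    for key, value in params.items():
        text = format_value(value)
        # Assumption: reject characters that would collide with the
        # ';' and '=' delimiters or with the quoted SQL literal.
        if any(ch in text for ch in ";='"):
            raise ValueError(f"unsupported character in {key}={text!r}")
        parts.append(f"{key}={text};")

    param_string = "".join(parts)
    return f"call {procedure}('decisiontree','{param_string}');"


print(build_decisiontree_call(
    "twm_source", "twm_customer_analysis",
    ["income", "age", "nbr_children"], "gender",
))
```

Optional parameters can be passed as keyword arguments, for example max_depth=5 or binning=False, and are appended after the required ones.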

Optional Parameters

algorithm
The algorithm the decision tree uses during building. Currently this option only allows gainratio.
binning
Whether to automatically bin (Bincode) the continuous independent variables. When this option is selected, continuous data is separated into one hundred bins. A variable with fewer than one hundred distinct values is not binned. The default is false.
max_depth
Specifies the maximum number of levels the tree can grow. The default is 100.
min_records
Limits how far the decision tree can split by setting the minimum number of observations per branch. Unless a node is pure (all of its observations have the same dependent value), it splits if each branch that can come off the node contains at least this many observations. The default is a minimum of two observations for each branch.
operatordatabase
The database where the tda_dt_calc table operator called by td_analyze resides. If not specified, the database software searches the standard search path for table operators, including the current user database.
For example: operatordatabase=twm
outputdatabase
The database containing the resulting output table when outputstyle=table or view.
outputtablename
The name of the output table representing the decision tree model.
pruning
Determines the style of pruning to use after the tree is fully built. The default is gainratio. The only other option at this time is none, which leaves the tree unpruned.
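
The defaults and the min_records rule described above can be sketched in a few lines of Python. This is an illustration only, not the logic td_analyze itself runs: resolve_options and may_split are hypothetical names, the accepted binning value "true" is assumed (the guide only shows false), and the requirement that max_depth and min_records be positive integers is also an assumption.

```python
# Defaults stated in the parameter descriptions above.
DEFAULTS = {"binning": "false", "max_depth": 100, "min_records": 2,
            "pruning": "gainratio"}

# Values the descriptions mention; "true" for binning is an assumption.
ALLOWED = {
    "algorithm": {"gainratio"},
    "binning": {"true", "false"},
    "pruning": {"gainratio", "none"},
}

# Parameters that take a database or table name and are passed through as-is.
FREE_FORM = {"operatordatabase", "outputdatabase", "outputtablename"}


def resolve_options(**given):
    """Merge caller-supplied optional parameters over the documented defaults."""
    options = dict(DEFAULTS)
    for key, value in given.items():
        if key in ALLOWED:
            if value not in ALLOWED[key]:
                raise ValueError(f"{key} must be one of {sorted(ALLOWED[key])}")
        elif key in ("max_depth", "min_records"):
            # Assumption: both limits are positive whole numbers.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError(f"{key} must be a positive integer")
        elif key not in FREE_FORM:
            raise ValueError(f"unknown optional parameter: {key}")
        options[key] = value
    return options


def may_split(class_counts, branch_sizes, min_records=2):
    """Apply the min_records rule to one node.

    class_counts: dependent value -> number of observations in the node.
    branch_sizes: observations in each branch a candidate split would create.
    """
    distinct_values = [v for v, count in class_counts.items() if count > 0]
    if len(distinct_values) <= 1:
        return False  # a pure node is not split further
    return all(size >= min_records for size in branch_sizes)


print(resolve_options(max_depth=5))
print(may_split({"M": 5, "F": 5}, [9, 1]))
```

In the last call, the node is not pure, but one candidate branch would hold a single observation, so with the default min_records of two the node is not split.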

Example

This example assumes the td_analyze stored procedure is installed in a database named twm.

This example shows how to invoke the td_analyze stored procedure, which uses the tda_dt_calc table operator, to build a decision tree. The resulting model is either returned from the td_analyze stored procedure or placed in the specified output database and output table.
call twm.td_analyze('decisiontree','database=twm_source;tablename=twm_customer_analysis;columns=income,age,nbr_children;dependent=gender;min_records=2;max_depth=5;binning=false;algorithm=gainratio;pruning=gainratio;outputdatabase=twm;outputtablename=cust_analysis_dt;operatordatabase=twm;');