DecisionForest Syntax Elements

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10

OutputTable

Specify the name for the model table that the function outputs.

OutputMessageTable

[Optional] Specify the name for the message table that the function outputs.

Default: default_dt_monitor_table in the current schema

ResponseColumn

Specify the name of the InputTable column that contains the response variable (that is, the quantity that you want to predict).

NumericInputs

[Required if you omit CategoricalInputs.] Specify the names of the InputTable columns to treat as the numeric predictor variables (which must be numeric values).

CategoricalInputs

[Required if you omit NumericInputs.] Specify the names of the InputTable columns to treat as the categorical predictor variables (which can be either numeric or VARCHAR values).

For information about columns that you must identify as numeric or categorical, see Identification of Numeric and Categorical Columns.

CategoricalEncoding

[Optional with CategoricalInputs, disallowed otherwise.] Specify the algorithm for encoding categorical columns:

Option	Description
GrayCode	Recommended when accuracy is critical. Depending on available memory, performance may be impacted if a categorical column has a large number (for example, 20) of unique levels, even with a small data set.
Hashing	Optimizes calculation speed and minimizes memory use, possibly decreasing accuracy.

Default: 'GrayCode'
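
To illustrate why a hashing-style encoding can save memory at the cost of accuracy, the following Python sketch maps category levels to a fixed number of buckets. It is a generic illustration of the hashing trick, not the function's actual algorithm, which the source does not describe. The names hash_encode and num_buckets and the sample levels are hypothetical.

```python
import hashlib

def hash_encode(level: str, num_buckets: int) -> int:
    """Map a categorical level to a bucket index in [0, num_buckets)."""
    # hashlib gives a digest that is stable across runs (unlike built-in hash()).
    digest = hashlib.md5(level.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical categorical levels and bucket count.
levels = ["red", "green", "blue", "cyan", "magenta"]
codes = {level: hash_encode(level, num_buckets=4) for level in levels}

# Five levels cannot fit into four buckets without at least two sharing a code,
# so a model trained on the codes can no longer tell those levels apart.
distinct_codes = len(set(codes.values()))
```

Memory use depends on the number of buckets rather than the number of levels, which is the trade-off the Hashing option describes.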

TreeType

[Optional] Specify whether the analysis is a regression (continuous response variable) or a multiple-class classification (response variable that takes one of a fixed number of classes).

Default: 'regression' if the response variable is numeric, 'classification' otherwise
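
The default rule can be sketched in Python as follows. This is an illustration only: the function decides from the response column's type, whereas this hypothetical helper, default_tree_type, inspects a single sample value.

```python
from numbers import Number

def default_tree_type(sample_response) -> str:
    """Return the default TreeType implied by one response value."""
    # bool is a Number subclass in Python; treating it as non-numeric is an
    # assumption of this sketch.
    if isinstance(sample_response, Number) and not isinstance(sample_response, bool):
        return "regression"
    return "classification"

print(default_tree_type(249000.0))   # numeric response
print(default_tree_type("bungalow")) # non-numeric response
```

Under this rule, a response stored as numeric class labels (for example, 0 and 1) defaults to 'regression', so TreeType would need to be set explicitly for such data.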

NumTrees

[Optional] Specify the number of trees to grow in the forest model. When specified, number_of_trees must be greater than or equal to the number of vworkers. When not specified, the function builds the minimum number of trees that provides the input data set with full coverage.

TreeSize

[Optional] Specify the number of rows that each tree uses as its input data set.

Default behavior: The function builds a tree using either the number of rows on a vworker or the number of rows that fit into the vworker’s memory, whichever is less.

MinNodeSize

[Optional] Specify a decision tree stopping criterion: the minimum size of any node within each decision tree.

Default: 1

Variance

[Optional] Specify a decision tree stopping criterion. If the variance within any node dips below this value, the algorithm stops looking for splits in the branch.

Default: 0
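
A minimal Python sketch of this stopping check follows. It assumes population variance and a strict "below" comparison; the source does not state which variance estimator the function uses. The name stop_on_variance is hypothetical.

```python
from statistics import pvariance

def stop_on_variance(responses, variance_threshold=0.0):
    """Return True when the node's response variance is below the threshold."""
    # Assumption: population variance of the response values in the node.
    return pvariance(responses) < variance_threshold

# Nearly identical responses: variance is far below 0.05, so splitting stops.
tight_node = stop_on_variance([10.0, 10.1, 9.9], variance_threshold=0.05)

# Widely spread responses: variance is 25, so the branch keeps splitting.
spread_node = stop_on_variance([10.0, 20.0], variance_threshold=0.05)

# With the default threshold of 0, variance can never be strictly below it.
default_node = stop_on_variance([5.0, 5.0])
```

Under these assumptions, the default of 0 effectively disables the criterion, because a variance is never negative.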

MaxDepth

[Optional] Specify a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow to (2^(max_depth+1) - 1) nodes. This stopping criterion has the greatest effect on the performance of the function.

Default: 12
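
As a quick check of the node bound 2^(max_depth+1) - 1, the following Python sketch computes the maximum number of nodes in a binary tree of a given depth. The helper name max_nodes is hypothetical.

```python
def max_nodes(max_depth: int) -> int:
    """Upper bound on nodes in a binary tree whose root is at depth 0."""
    return 2 ** (max_depth + 1) - 1

for depth in (3, 12, 16):
    print(depth, max_nodes(depth))
# 3 15
# 12 8191
# 16 131071
```

The bound roughly doubles with each additional level, which is why MaxDepth has such a large effect on run time.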

Mtry

[Optional] Specify the number of predictor variables to randomly sample at each split. For example, if mtry is 3, then the function randomly samples 3 of the predictor variables at each split. The mtry value must be an INTEGER.

Default behavior: The function randomly samples all predictors.

Tip:

Calculate the initial value for mtry, where p is number of variables used for prediction, as follows:

Tree Type	mtry Initial Value
Classification	round(sqrt(p))
Regression	round(p/3)
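
The tip can be worked through in Python as follows. The sketch assumes round-half-up rounding and a minimum of 1, neither of which the source specifies; the names initial_mtry and round_half_up are hypothetical.

```python
import math

def round_half_up(x: float) -> int:
    # Python's built-in round() rounds halves to the nearest even integer,
    # so this sketch rounds halves up explicitly (assumption).
    return math.floor(x + 0.5)

def initial_mtry(p: int, tree_type: str) -> int:
    """Suggested starting mtry for p predictor variables."""
    if tree_type == "classification":
        value = round_half_up(math.sqrt(p))
    elif tree_type == "regression":
        value = round_half_up(p / 3)
    else:
        raise ValueError("tree_type must be 'classification' or 'regression'")
    # Assumption: sample at least one variable per split.
    return max(1, value)

print(initial_mtry(20, "classification"))  # 4
print(initial_mtry(20, "regression"))      # 7
```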

MtrySeed

[Optional] Specify a LONG value to use in determining the random seed for mtry.
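
The effect of seeding the mtry sampling can be illustrated with a short Python sketch. It uses Python's own random module, not the function's generator, and the predictor names and the helper sample_predictors are hypothetical.

```python
import random

def sample_predictors(predictors, mtry, rng):
    """Pick mtry distinct predictors for one split."""
    return rng.sample(predictors, mtry)

# Hypothetical predictor column names.
predictors = ["price", "lotsize", "bedrooms", "bathrms", "stories", "garagepl"]

# Two generators created with the same seed produce the same sample.
first = sample_predictors(predictors, 3, random.Random(42))
second = sample_predictors(predictors, 3, random.Random(42))
same_choice = first == second
```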

OutOfBag

[Optional] Specify whether to output the out-of-bag estimate of error rate.

The bootstrapping technique provides a convenient method for estimating the test error that does not require cross-validation. When creating a decision forest, each tree is built on a subset of the original data set that is created by sampling with replacement. The points in the original data set that were not used in creating a particular tree are called the out-of-bag observations for that tree. You can use this set of data points as a test set for this particular tree. By creating such a test set for each tree in the forest, you can estimate the test error of the model.

If OutOfBag is 'true', the function calculates the out-of-bag error for a decision forest model with this procedure:

1. For each observation i, use each tree for which the observation was out-of-bag to predict a response.
2. Combine the responses to make an out-of-bag prediction for observation i:
   - For classification trees, take the majority vote of the predicted responses.
   - For regression trees, take the average of the predicted responses.
3. Compare the out-of-bag predictions to the actual response for all observations to calculate the overall mean squared error or misclassification rate.

The preceding calculations increase the time that the function takes to complete.
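
The procedure above can be sketched in Python as follows. This is a minimal illustration, not the function's implementation: each tree is represented as a hypothetical pair of a prediction function and the set of row identifiers it was trained on, and the stub trees and data are invented for the example.

```python
from collections import Counter
from statistics import mean

def oob_predictions(rows, trees, tree_type):
    """rows: {row_id: (features, response)}; trees: [(predict_fn, in_bag_ids)]."""
    predictions = {}
    for row_id, (features, _) in rows.items():
        # Use only the trees that did not train on this row.
        votes = [predict(features) for predict, in_bag in trees
                 if row_id not in in_bag]
        if not votes:
            continue  # in-bag for every tree: no out-of-bag prediction exists
        # Majority vote for classification, average for regression.
        if tree_type == "classification":
            predictions[row_id] = Counter(votes).most_common(1)[0][0]
        else:
            predictions[row_id] = mean(votes)
    return predictions

def oob_error(rows, predictions, tree_type):
    """Misclassification rate (classification) or mean squared error (regression)."""
    pairs = [(predictions[i], rows[i][1]) for i in predictions]
    if tree_type == "classification":
        return sum(pred != actual for pred, actual in pairs) / len(pairs)
    return mean([(pred - actual) ** 2 for pred, actual in pairs])

# Three rows and three stub "trees", each out-of-bag for exactly one row.
rows = {1: ((0.2,), "a"), 2: ((0.8,), "b"), 3: ((0.6,), "b")}
trees = [
    (lambda f: "a" if f[0] < 0.7 else "b", {1, 2}),  # out-of-bag for row 3
    (lambda f: "a", {2, 3}),                         # out-of-bag for row 1
    (lambda f: "a" if f[0] < 0.7 else "b", {1, 3}),  # out-of-bag for row 2
]
preds = oob_predictions(rows, trees, "classification")
error_rate = oob_error(rows, preds, "classification")  # row 3 is misclassified
```

Each observation is scored only by trees that never saw it, which is why the estimate behaves like a test error without a separate hold-out set. The identifier used to track in-bag rows corresponds to the role of IDColumn.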

Default: 'false'

DisplayNumProcessedRows

[Optional] Specify whether to output the number of input rows allocated to each worker and the number of input rows processed by each worker (excluding rows skipped because they contained NULL values).

Default: 'false'

Seed

[Optional] Specify the random seed the algorithm uses for repeatable results. The seed must be a LONG value.

For repeatable results, use both the Seed and UniqueID syntax elements. For more information, see Nondeterministic Results and UniqueID Syntax Element.

IDColumn

[Required with OutOfBag, optional otherwise.] Specify the name of the InputTable column that contains the row identifier.

Default: First InputTable column
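
The requirement rules stated for the syntax elements above can be collected into a small pre-flight check. The Python sketch below is hypothetical: validate_arguments, the dictionary layout, and the error messages are not part of the product, and the check reads "Required with OutOfBag" as applying whenever OutOfBag is specified.

```python
def validate_arguments(args: dict, num_vworkers: int) -> list:
    """Return a list of rule violations for a dict of syntax-element values."""
    errors = []
    # Elements not marked [Optional] in the descriptions above.
    for required in ("OutputTable", "ResponseColumn"):
        if required not in args:
            errors.append(f"{required} is required.")
    has_numeric = bool(args.get("NumericInputs"))
    has_categorical = bool(args.get("CategoricalInputs"))
    if not (has_numeric or has_categorical):
        errors.append("Specify NumericInputs, CategoricalInputs, or both.")
    if "CategoricalEncoding" in args and not has_categorical:
        errors.append("CategoricalEncoding is disallowed without CategoricalInputs.")
    if "NumTrees" in args and args["NumTrees"] < num_vworkers:
        errors.append("NumTrees must be >= the number of vworkers.")
    # Assumption: IDColumn is required whenever OutOfBag is specified.
    if "OutOfBag" in args and "IDColumn" not in args:
        errors.append("IDColumn is required with OutOfBag.")
    return errors

# Hypothetical argument sets.
good = {"OutputTable": "rft_model", "ResponseColumn": "homestyle",
        "NumericInputs": ["price", "lotsize"], "NumTrees": 4}
bad = {"OutputTable": "rft_model", "ResponseColumn": "homestyle",
       "CategoricalEncoding": "Hashing", "OutOfBag": "true", "NumTrees": 2}

good_errors = validate_arguments(good, num_vworkers=4)
bad_errors = validate_arguments(bad, num_vworkers=4)
```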