DecisionForestPerSegment Syntax Elements

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage

Release Number: 9.02, 9.01, 2.0, 1.3

Published: February 2022

Language: English (United States)

Last Update: 2022-02-10

ResponseColumn

Specify the name of the InputTable column that contains the response variable (that is, the quantity you want to predict).

NumericInputs

[Optional] Specify the names of the InputTable columns that have the numeric predictor variables (which must be numeric values).

If you specify NumericInputs, its values are the default values for every model.

If you specify values for a partition using AttributeTable, the values in AttributeTable override those specified in NumericInputs for the model the function builds for that partition.

Teradata recommends specifying NumericInputs with values as a superset of all values supplied through AttributeTable for collaborative optimizations.

CategoricalInputs

[Optional] Specify the names of the InputTable columns that have the categorical predictor variables (which can be either numeric or VARCHAR values).

If you specify CategoricalInputs, its values are the default values for every model.

If you specify values for a partition using AttributeTable, the values in AttributeTable override those specified in CategoricalInputs for the model the function builds for that partition.

Teradata recommends specifying CategoricalInputs with values as a superset of all values supplied through AttributeTable for collaborative optimizations.

TreeType

[Optional] Specify whether the analysis is a regression (continuous response variable) or a multiple-class classification (predicting one result from a fixed number of classes).

Default: 'regression' if the response variable is numeric, 'classification' otherwise

NumTrees

[Optional] Specify the number of trees to grow in the forest model.

Default: 10

MinNodeSize

[Optional] Specify a decision tree stopping criterion: the minimum size of any node within each decision tree.

Default: 1

Variance

[Optional] Specify a decision tree stopping criterion. If the variance within any node dips below this value, the algorithm stops looking for splits in the branch.

Default: 0

MaxDepth

[Optional] Specify a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow to (2^(max_depth+1) - 1) nodes. Of the stopping criteria, this one has the greatest effect on the performance of the function.

Default: 12
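
To give a feel for how quickly the node bound grows with MaxDepth, the following Python sketch evaluates 2^(max_depth+1) - 1 for a few depths. The helper name max_tree_nodes is hypothetical and is not part of the function; the sketch assumes binary splits with the root at depth 0.

```python
def max_tree_nodes(max_depth: int) -> int:
    """Upper bound on the node count of a binary tree with the root at depth 0."""
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return 2 ** (max_depth + 1) - 1


if __name__ == "__main__":
    # Compare a shallow tree, a medium tree, and the documented default depth.
    for depth in (4, 8, 12):
        print(f"max_depth={depth}: up to {max_tree_nodes(depth)} nodes")
```

Each additional level roughly doubles the bound, which is consistent with MaxDepth having the largest effect on run time.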

CategoricalEncoding

[Optional with CategoricalInputs, disallowed otherwise.] Specify the encoding scheme to use for categorical variables.

Option	Description
Target	Uses the target encoding described in https://dl.acm.org/citation.cfm?id=507538. Supports regression and binary classification. Does not create high dimensionality, but requires careful validation, because it is prone to overfitting when the distribution of categorical variables differs significantly between training data and test data.
GrayCode	Recommended when accuracy is critical. Depending on available memory, performance may be impacted if a categorical column has a large number (for example, 20) of unique levels, even with a small data set.
Hashing	Optimizes calculation speed and minimizes memory use, possibly decreasing accuracy.

Default: GrayCode

MinSamplesForEncoding

[Optional with CategoricalEncoding ('Target'), disallowed otherwise.] Specify the minimum number of samples for target encoding, which is k in the following formula, where n is the number of samples:

Ɣ(n) = 1 / (1 + e^(-(n - k)/f))

The Target encoding algorithm uses the hyperparameter Ɣ.

MinSamplesForEncoding is the same as the min_samples_leaf parameter in https://dl.acm.org/citation.cfm?id=507538.

Default: 1.0

Smoothing

[Optional with CategoricalEncoding ('Target'), disallowed otherwise.] Specify the smoothing parameter for target encoding, which is f in the following formula, where n is the number of samples:

Ɣ(n) = 1 / (1 + e^(-(n - k)/f))

The Target encoding algorithm uses the hyperparameter Ɣ.

Smoothing is the same as the smoothing parameter in https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html.

Default: 1.0
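
The following Python sketch shows how k (MinSamplesForEncoding) and f (Smoothing) shape the weight Ɣ(n). The function names gamma and target_encode are hypothetical. The blending rule in target_encode (a weighted average of the category mean and the overall mean) is the form commonly used in target-encoding literature and is an assumption here; this page defines only Ɣ itself.

```python
import math


def gamma(n: float, k: float = 1.0, f: float = 1.0) -> float:
    """Weight from the formula above: 1 / (1 + e^(-(n - k) / f))."""
    return 1.0 / (1.0 + math.exp(-(n - k) / f))


def target_encode(category_mean: float, prior_mean: float, n: float,
                  k: float = 1.0, f: float = 1.0) -> float:
    """Assumed blending rule: weight the category mean by gamma(n),
    and the overall (prior) mean by 1 - gamma(n)."""
    w = gamma(n, k, f)
    return w * category_mean + (1.0 - w) * prior_mean


if __name__ == "__main__":
    # With the defaults k = 1 and f = 1, a category seen once gets weight 0.5.
    for n in (1, 2, 5, 20):
        print(f"n={n}: gamma={gamma(n):.4f}")
```

Under these assumptions, a larger k requires more samples before a category's own mean dominates, and a larger f makes the transition between the two means more gradual.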

ErrorHandler

[Optional] Specify whether the function stops on error or continues with the next model:

Value	Function Behavior
'true'	Function skips partition where error occurs and continues with next partition. Output table displays error for the partitions for which an error occurs.
'false'	Function stops and reports error.

Default: 'false'
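
The two ErrorHandler behaviors can be pictured with the following Python sketch. It is only an analogy for the per-partition control flow, not the function's implementation; build_models, the train callable, and the result layout are all hypothetical.

```python
def build_models(partitions, train, error_handler=False):
    """Train one model per partition.

    partitions: dict mapping a partition key to its rows.
    train: callable that builds a model from rows and may raise.
    error_handler: True mimics 'true' (record the error and continue),
                   False mimics 'false' (stop on the first error).
    """
    results = {}
    for key, rows in partitions.items():
        try:
            results[key] = ("model", train(rows))
        except Exception as exc:
            if not error_handler:
                raise  # 'false': stop and report the error
            results[key] = ("error", str(exc))  # 'true': note it, move on
    return results
```

With error_handler=True, the returned mapping holds an error entry for each failed partition, mirroring the way the output table displays errors for the partitions where they occur.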

Mtry

[Optional] Specify the number of variables to randomly sample from the inputs at each split. For example, if Mtry is 3, the function randomly samples 3 variables from the inputs at each split. The Mtry value must be an INTEGER.

Default behavior: The function randomly samples all predictors.

Tip:

Calculate the initial value for mtry, where p is the number of variables used for prediction, as follows:

Tree Type	mtry Initial Value
Classification	round(sqrt(p))
Regression	round(p/3)
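
The table above can be turned into a small helper, sketched here in Python. The name initial_mtry is hypothetical, and the floor of 1 is an added assumption so that very small p does not produce an mtry of 0.

```python
import math


def initial_mtry(p: int, tree_type: str) -> int:
    """Suggested starting mtry for p predictor variables."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if tree_type == "classification":
        value = round(math.sqrt(p))
    elif tree_type == "regression":
        value = round(p / 3)
    else:
        raise ValueError("tree_type must be 'classification' or 'regression'")
    # Assumption: always sample at least one variable.
    return max(1, value)


if __name__ == "__main__":
    for p in (4, 10, 16):
        print(p, initial_mtry(p, "classification"), initial_mtry(p, "regression"))
```

Treat the result as a starting point for tuning rather than a fixed setting.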

MtrySeed

[Optional] Specify a LONG value to use in determining the random seed for mtry.

DisplayNumProcessedRows

[Optional] Specify whether to output the number of input rows allocated to each worker and the number of input rows processed by each worker (excluding rows skipped because they contained NULL values).

Default: 'false'

Seed

[Optional] Specify the random seed the algorithm uses for repeatable results. The seed must be a LONG value.

For repeatable results, use both the Seed and UniqueID syntax elements. For more information, see Nondeterministic Results and UniqueID Syntax Element.