- Specifies the name of the table that contains the input data set.
- Specifies the name of the output table where the function stores the predictive model that it generates.
If a table with this name exists in the database, the function drops the existing table and creates a new table with the same name.
- Specifies the name of the column that contains the response variable (that is, the quantity that you want to predict).
- [Required if CategoricalInputs is omitted.] Specifies the names of the columns that contain the numeric predictor variables (which must be numeric values).
- [Optional] Specifies the maximum number of distinct values for a single categorical variable. The max_cat_values must be a positive INTEGER. Default: 20. A max_cat_values greater than 20 is not recommended.
- [Required if NumericInputs is omitted.] Specifies the names of the columns that contain the categorical predictor variables (which can be either numeric or VARCHAR values).
Each categorical input column can have at most max_cat_values distinct categorical values. If max_cat_values exceeds 20, the function might run out of memory, because classification trees grow rapidly as max_cat_values increases.
- [Optional] Specifies the number of trees to grow in the forest model. When specified, number_of_trees must be greater than or equal to the number of vworkers.
When not specified, the function builds the minimum number of trees that provides the input data set with full coverage.
- [Optional] Specifies whether the analysis is a regression (continuous response variable) or a multiple-class classification (the response variable takes one of a fixed number of classes). Default: 'regression' if the response variable is numeric, 'classification' otherwise.
- [Optional] Specifies the number of rows that each tree uses as its input data set. Default behavior: The function builds a tree using either the number of rows on a vworker or the number of rows that fit into the vworker’s memory, whichever is less.
- [Optional] Specifies a decision tree stopping criterion: the minimum size of any node within each decision tree. Default: 1.
- [Optional] Specifies a decision tree stopping criterion. If the variance within any node dips below this value, the algorithm stops looking for splits in the branch. Default: 0.
- [Optional] Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow to (2^(max_depth+1) - 1) nodes. This stopping criterion has the greatest effect on the performance of the function. Default: 12.
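To see how quickly the node bound grows with depth, the following is a minimal sketch that evaluates 2^(max_depth+1) - 1 for a few depths. The helper name `max_tree_nodes` is hypothetical and is not part of the function's interface; it assumes a binary tree whose root is at depth 0.

```python
def max_tree_nodes(max_depth):
    """Upper bound on the node count of a binary tree with the root at depth 0.

    Hypothetical helper for illustration only.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return 2 ** (max_depth + 1) - 1


if __name__ == "__main__":
    # Compare a few depths, including the documented default of 12.
    for depth in (4, 8, 12, 16):
        print(depth, max_tree_nodes(depth))
```

Each additional level roughly doubles the bound, which is consistent with the note that this criterion has the greatest effect on performance.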
- [Optional] Specifies the name of the table in which the function stores monitoring information. Default: 'default_dt_monitor_table' in the current schema.
- [Optional] Specifies whether to drop monitor_table, if it exists. Default: 'true'.
- [Optional] Specifies the number of variables to randomly sample from each input at each split. For example, if mtry is 3, then the function randomly samples 3 variables from each input at each split. The mtry must be an INTEGER. Default behavior: The function randomly samples all predictors.
Tip: Calculate the initial value for mtry, where p is the number of variables used for prediction, as follows:
- For classification trees: round(sqrt(p))
- For regression trees: round(p/3)
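The tip above can be sketched as a small helper. This is an illustration only: the name `suggested_mtry` is hypothetical, the clamp to a minimum of 1 is an added assumption (so that a very small p never yields 0), and Python's `round` rounds exact halves to the nearest even integer, which may differ from the rounding the tip intends.

```python
import math


def suggested_mtry(p, tree_type):
    """Starting value for mtry given p predictor variables.

    Hypothetical helper: applies round(sqrt(p)) for classification
    and round(p/3) for regression, as described in the tip.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if tree_type == "classification":
        value = round(math.sqrt(p))
    elif tree_type == "regression":
        value = round(p / 3)
    else:
        raise ValueError("tree_type must be 'classification' or 'regression'")
    # Assumption: never suggest fewer than one variable per split.
    return max(1, value)


if __name__ == "__main__":
    print(suggested_mtry(16, "classification"))
    print(suggested_mtry(9, "regression"))
```

Treat the result as an initial value to tune from, not as a required setting.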
- [Optional] Specifies a LONG value to use in determining the random seed for mtry.
- [Optional] Specifies whether to output the out-of-bag estimate of error rate.
The bootstrapping technique for creating a random forest provides a convenient method for estimating the test error that does not require cross-validation. When creating a random forest, each tree is built on a subset of the original data set that is created by sampling with replacement. The points in the original data set that were not used in creating a particular tree are called the out-of-bag observations for that tree. You can use this set of data points as a test set for this particular tree. By creating such a test set for each tree in the forest, you can estimate the test error of the model.
If OutOfBag is 'true', the out-of-bag error for a random forest model is calculated as follows:
- For each observation i, use each tree for which the observation was out-of-bag to predict a response.
- Combine the responses to make an out-of-bag prediction for observation i:
- For classification trees, take the majority vote of the predicted responses.
- For regression trees, take the average of the predicted responses.
- Compare the out-of-bag predictions to the actual response for all observations to calculate the overall mean squared error or misclassification rate.
The preceding calculations increase the time that the function takes to complete.
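The out-of-bag calculation described above can be sketched in plain Python. This is not the function's implementation: the name `oob_error` and the input layout (per-tree prediction lists and in-bag flags) are assumptions made for illustration. Ties in the majority vote are broken in favor of the class seen first, and observations that were in-bag for every tree are skipped.

```python
from collections import Counter
from statistics import mean


def oob_error(actual, tree_predictions, in_bag, classification):
    """Out-of-bag error estimate (hypothetical helper).

    actual[i]              -- true response for observation i
    tree_predictions[t][i] -- prediction of tree t for observation i
    in_bag[t][i]           -- True if observation i was sampled to build tree t
    Returns the misclassification rate (classification) or
    the mean squared error (regression).
    """
    errors = []
    for i, truth in enumerate(actual):
        # Step 1: collect predictions only from trees where i was out-of-bag.
        votes = [preds[i] for preds, bag in zip(tree_predictions, in_bag)
                 if not bag[i]]
        if not votes:
            continue  # i was in-bag for every tree; no OOB prediction exists
        # Step 2: combine the responses into a single OOB prediction.
        if classification:
            predicted = Counter(votes).most_common(1)[0][0]  # majority vote
            errors.append(0.0 if predicted == truth else 1.0)
        else:
            predicted = mean(votes)  # average of the predicted responses
            errors.append((predicted - truth) ** 2)
    if not errors:
        raise ValueError("no observation was out-of-bag for any tree")
    # Step 3: aggregate over all observations.
    return mean(errors)


if __name__ == "__main__":
    actual = ["a", "b", "a", "b"]
    tree_predictions = [
        ["a", "b", "a", "a"],
        ["a", "b", "a", "a"],
        ["a", "b", "b", "b"],
    ]
    in_bag = [
        [True, False, False, False],
        [False, True, False, False],
        [False, False, True, True],
    ]
    print(oob_error(actual, tree_predictions, in_bag, classification=True))
```

Because every observation must be scored against every tree for which it was out-of-bag, the extra work grows with both the number of rows and the number of trees.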
- [Optional] Specifies a value to use in determining the seed for the random number generator. If you specify this value, you can specify the same value in future calls to this function and the function builds the same tree. The seed must be a LONG value. Default behavior: The function selects the seed.
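As a generic illustration of why a fixed seed makes results repeatable, the sketch below draws a bootstrap sample of row indices with Python's standard `random` module. The helper `bootstrap_indices` is hypothetical and says nothing about how the function itself derives its random numbers from the seed.

```python
import random


def bootstrap_indices(n_rows, seed):
    """Sample n_rows row indices with replacement, using a seeded generator.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    return [rng.randrange(n_rows) for _ in range(n_rows)]


if __name__ == "__main__":
    first = bootstrap_indices(10, seed=42)
    second = bootstrap_indices(10, seed=42)
    # The same seed yields the same sample, so a tree built from it is the same.
    print(first == second)
```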