| |
Methods defined here:
- __init__(self, data=None, attribute_name_columns=None, attribute_value_column=None, id_columns=None, attribute_table=None, response_table=None, response_column=None, categorical_attribute_table=None, splits_table=None, split_value=None, num_splits=10, approx_splits=True, nodesize=1, max_depth=30, weighted=False, weight_column=None, split_measure='gini', output_response_probdist=False, response_probdist_type='Laplace', categorical_encoding='graycode', attribute_table_sequence_column=None, data_sequence_column=None, categorical_attribute_table_sequence_column=None, response_table_sequence_column=None, splits_table_sequence_column=None)
- DESCRIPTION:
The Decision Tree function creates a single decision tree in a
distributed fashion, either weighted or unweighted. The model teradataml
DataFrame that this function outputs can be input to the function
DecisionTreePredict.
PARAMETERS:
data:
Optional Argument.
Specifies the name of the teradataml DataFrame that contains the
input data set.
Note: This argument is required if you omit attribute_table
and response_table.
attribute_name_columns:
Required Argument.
Specifies the names of the attribute teradataml DataFrame columns
that define the attribute.
Types: str OR list of Strings (str)
attribute_value_column:
Required Argument.
Specifies the name of the attribute teradataml DataFrame column
that defines the value.
Types: str
id_columns:
Required Argument.
Specifies the names of the columns in the response and attribute
tables that specify the ID of the instance.
Types: str OR list of Strings (str)
attribute_table:
Optional Argument.
Specifies the name of the teradataml DataFrame that contains the
attribute names and the values.
Note: This argument is required if you omit data.
response_table:
Optional Argument.
Specifies the name of the teradataml DataFrame that contains the
response values.
Note: This argument is required if you omit data.
response_column:
Required Argument.
Specifies the name of the response teradataml DataFrame column that
contains the response variable.
Types: str
categorical_attribute_table:
Optional Argument.
The name of the input teradataml DataFrame containing the categorical
attributes.
splits_table:
Optional Argument.
Specifies the name of the input teradataml DataFrame that contains
the user-specified splits. By default, the function creates new
splits.
split_value:
Optional Argument.
If you specify splits_table, this argument specifies the name of the
column that contains the split value. If approx_splits is "true",
then the default value is splits_valcol; if not, then the default
value is the attribute_value_column argument, node_column.
Types: str
num_splits:
Optional Argument.
Specifies the number of splits to consider for each variable. The
function does not consider all possible splits for all attributes.
Default Value: 10
Types: int
approx_splits:
Optional Argument.
Specifies whether to use approximate percentiles (true) or exact
percentiles (false). Internally, the function uses percentile values
as split values.
Default Value: True
Types: bool
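The idea of using percentile values as candidate split values can be sketched in plain Python. This is only an illustration of the exact-percentile case: the helper name candidate_splits and the sample values are hypothetical, and the function's in-database percentile computation is not described here.

```python
from statistics import quantiles


def candidate_splits(values, num_splits=10):
    """Return at most num_splits candidate thresholds taken from
    evenly spaced percentiles of a numeric attribute (hypothetical helper)."""
    # quantiles(n=k) returns k - 1 cut points, so request num_splits + 1 intervals.
    cuts = quantiles(values, n=num_splits + 1, method="inclusive")
    # Repeated data values can produce duplicate cut points; keep each once.
    return sorted(set(cuts))


# Made-up attribute values for illustration only.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9, 5.6]
splits = candidate_splits(petal_lengths, num_splits=4)
print(splits)
```

A smaller num_splits means fewer thresholds are evaluated per attribute, which trades split resolution for speed.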
nodesize:
Optional Argument.
Specifies the decision tree stopping criteria and the minimum size
of any particular node within each decision tree.
Default Value: 1
Types: int
max_depth:
Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a
depth past this value, the algorithm stops looking for splits.
Decision trees can grow up to (2^(max_depth+1) - 1) nodes. This
stopping criterion has the greatest effect on function performance.
The maximum value is 60.
Default Value: 30
Types: int
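To see why max_depth dominates performance, the node bound can be evaluated directly. The sketch below reads the bound as 2 raised to (max_depth + 1), minus 1, which is the node count of a full binary tree of that depth; the helper name max_nodes is hypothetical.

```python
def max_nodes(max_depth):
    """Upper bound on the number of nodes in a binary tree of the given depth."""
    return 2 ** (max_depth + 1) - 1


# Compare the depth used in the examples (5) with the default (30).
for depth in (5, 30):
    print(depth, max_nodes(depth))
```

Each additional level roughly doubles the bound, so lowering max_depth is usually the first knob to try when training is slow.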
weighted:
Optional Argument.
Specifies whether to build a weighted decision tree. If you specify
"true", then you must also specify the weight_column argument.
Default Value: False
Types: bool
weight_column:
Optional Argument.
Specifies the name of the response teradataml DataFrame column that
contains the weights of the attribute values.
Types: str
split_measure:
Optional Argument.
Specifies the impurity measurement to use while constructing the
decision tree.
Default Value: "gini"
Permitted Values: GINI, ENTROPY, CHISQUARE
Types: str
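For reference, the textbook definitions of two of the permitted impurity measures can be sketched as follows. The helper names are hypothetical, the chi-square measure is omitted, and the function's internal implementation may differ in detail.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class proportions."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


# A node with two equally frequent classes is maximally impure for two classes.
mixed = ["setosa", "setosa", "virginica", "virginica"]
print(gini(mixed), entropy(mixed))
```

Both measures are zero for a node that contains a single class, so a split is preferred when it lowers the size-weighted impurity of the child nodes.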
output_response_probdist:
Optional Argument.
Specifies whether to output the probability distribution for the
output labels.
Note: 'output_response_probdist' argument can accept input value True
only when teradataml is connected to Vantage 1.0 Maintenance
Update 2 version or later.
Default Value: False
Types: bool
response_probdist_type:
Optional Argument.
Specifies the type of algorithm to use to generate output probability
distribution for output labels. Uses one of Laplace, Frequency or
RawCount to generate Probability Estimation Trees (PET) based
distributions.
Note: This argument can only be used when output_response_probdist is
set to True.
Default Value: "Laplace"
Permitted Values: Laplace, Frequency, RawCount
Types: str
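The three distribution types can be illustrated with the usual Probability Estimation Tree definitions applied to the class counts in one leaf. This is a sketch under that assumption: the helper leaf_distribution and the counts are hypothetical, and the exact formulas used by the function are not stated here.

```python
def leaf_distribution(counts, kind="Laplace"):
    """Turn per-class counts at a leaf into an output distribution.

    Assumed textbook definitions:
      RawCount  -> the counts themselves
      Frequency -> count / total
      Laplace   -> (count + 1) / (total + number_of_classes)
    """
    total = sum(counts.values())
    num_classes = len(counts)
    if kind == "RawCount":
        return dict(counts)
    if kind == "Frequency":
        return {label: c / total for label, c in counts.items()}
    if kind == "Laplace":
        return {label: (c + 1) / (total + num_classes) for label, c in counts.items()}
    raise ValueError(f"unknown distribution type: {kind}")


# Made-up leaf counts for illustration only.
leaf_counts = {"setosa": 8, "versicolor": 2, "virginica": 0}
for kind in ("RawCount", "Frequency", "Laplace"):
    print(kind, leaf_distribution(leaf_counts, kind))
```

Under these definitions the Laplace estimate never assigns zero probability to a class absent from the leaf, which is the usual reason to prefer it for small leaves.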
categorical_encoding:
Optional Argument.
Specifies which encoding method is used for categorical variables.
Note: categorical_encoding argument support is only available
when teradataml is connected to Vantage 1.1 or later.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: str
attribute_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "attribute_table". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
categorical_attribute_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "categorical_attribute_table". The argument is
used to ensure deterministic results for functions which produce
results that vary from run to run.
Types: str OR list of Strings (str)
response_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "response_table". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
splits_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "splits_table". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of DecisionTree.
Output teradataml DataFrames can be accessed using attribute
references, such as DecisionTreeObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. model_table
2. intermediate_splits_table
3. final_response_tableto
4. output
Note: When the splits_table argument is used, the output teradataml
DataFrame intermediate_splits_table is not created. Accessing this
attribute in that case raises an AttributeError.
RAISES:
TeradataMlException
EXAMPLES:
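# Assumes an active Vantage connection, for example one created with create_context().
from teradataml import DataFrame, DecisionTree, load_example_data
# Load the data to run the example.
load_example_data("DecisionTree", ["iris_attribute_train", "iris_response_train", "iris_altinput"])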
# Create teradataml DataFrame
iris_attribute_train = DataFrame.from_table("iris_attribute_train")
iris_altinput = DataFrame.from_table("iris_altinput")
iris_response_train = DataFrame.from_table("iris_response_train")
# Example 1 - Train with separate attribute and response tables.
sdt_out1 = DecisionTree(attribute_name_columns = 'attribute',
                        attribute_value_column = 'attrvalue',
                        id_columns = 'pid',
                        attribute_table = iris_attribute_train,
                        response_table = iris_response_train,
                        response_column = 'response',
                        approx_splits = True,
                        nodesize = 100,
                        max_depth = 5,
                        weighted = False,
                        split_measure = "gini",
                        output_response_probdist = False)
# Print the result DataFrames
print(sdt_out1.model_table)
print(sdt_out1.intermediate_splits_table)
print(sdt_out1.final_response_tableto)
print(sdt_out1.output)
# Example 2 - Train with the single "data" input instead of separate tables.
sdt_out2 = DecisionTree(data = iris_altinput,
                        attribute_name_columns = 'attribute',
                        attribute_value_column = 'attrvalue',
                        id_columns = 'pid',
                        response_column = 'response',
                        num_splits = 10,
                        nodesize = 100,
                        max_depth = 5,
                        weighted = False,
                        split_measure = "gini",
                        output_response_probdist = False,
                        response_probdist_type = "Laplace")
# Print the result DataFrames
print(sdt_out2.model_table)
print(sdt_out2.intermediate_splits_table)
print(sdt_out2.final_response_tableto)
print(sdt_out2.output)
- __repr__(self)
- Returns the string representation for a DecisionTree class instance.
- get_build_time(self)
- Function to return the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Function to return the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Function to return the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Function to return the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|