Using Decision Forest Model with teradataml Package - Teradata Package for Python

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2024
Language: English (United States)
Last Update: 2024-04-09
Product Category: Teradata Vantage

This section uses the iris dataset, which has three classes. The dataset contains 150 samples, each with four features describing flower properties and a fifth column indicating the flower species.

In this example, you build a Decision Forest model based on the training dataset and apply the model to the test dataset to evaluate the performance of the model.

  1. Import the required modules.
    from teradataml import DecisionForest
    from teradataml import DecisionForestPredict
    from teradataml import load_example_data
    from teradataml.dataframe.dataframe import DataFrame
  2. If the input table "iris_input" does not already exist, create it and load the dataset.
    load_example_data("byom", "iris_input")
  3. Create a teradataml DataFrame from the loaded dataset.
    1. Create a teradataml DataFrame "iris_input" from the loaded table.
      iris_input = DataFrame("iris_input")
    2. Create two samples of input data: sample 1 has 80% of the total rows for training the model ("iris_train"), and sample 2 has 20% of the total rows for testing the model ("iris_test").
      First, sample the "iris_input" dataframe.
      iris_sample = iris_input.sample(frac=[0.8, 0.2])
    3. Create the training dataset from sample 1 by filtering on "sampleid", then drop the "sampleid" column because it is not required for training the model.
      iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis=1)
    4. Create the test dataset from sample 2 by filtering on "sampleid", then drop the "sampleid" column because it is not required for scoring.
      iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis=1)
  4. Train a new Decision Forest model on the teradataml DataFrame "iris_train", using the DecisionForest function from the teradataml package.
    You can do this with or without the formula argument.

    Example 1: Train the Decision Forest classification model using the input teradataml DataFrame and the formula argument.

    formula = "species ~ sepal_length + sepal_width + petal_length + petal_width"
     
    # Train the Decision Forest model.
    rft_model = DecisionForest(data=iris_train,
                               formula=formula,
                               tree_type="classification",
                               ntree=50,
                               tree_size=100,
                               nodesize=1,
                               variance=0.0,
                               max_depth=12,
                               maxnum_categorical=20,
                               mtry=3,
                               mtry_seed=100,
                               seed=100)
    Example 2: Train the same Decision Forest classification model (rft_model) without the formula argument, passing the input and response columns explicitly.
    rft_model = DecisionForest(data=iris_train,
                               input_columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
                               response_column="species",
                               tree_type="classification",
                               ntree=50,
                               tree_size=100,
                               nodesize=1,
                               variance=0.0,
                               max_depth=12,
                               maxnum_categorical=20,
                               mtry=3,
                               mtry_seed=100,
                               seed=100)
    Once the model is created, you can apply the model to the test dataset.
  5. Predict the iris species by applying the Decision Forest model to the teradataml DataFrame "iris_test" from the test dataset, using the DecisionForestPredict function.
    decision_forest_predict_out = DecisionForestPredict(object=rft_model,
                                                        newdata=iris_test,
                                                        id_column="id",
                                                        detailed=False,
                                                        terms=["species"])
  6. Inspect the results.
    decision_forest_predict_out.result
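
The introduction says the model is applied to the test dataset to evaluate its performance, but the steps above stop at inspecting the predictions. As a minimal sketch of that evaluation, the following plain Python helper computes the accuracy and a confusion count from rows that hold both the actual and the predicted label. The helper name evaluate_predictions, the sample rows, and the column names "species" and "prediction" are assumptions made for illustration; they are not taken from the teradataml documentation, so check the column names of your own result before using them.

```python
from collections import Counter


def evaluate_predictions(rows, actual_key="species", predicted_key="prediction"):
    """Return (accuracy, confusion) for an iterable of dict-like rows.

    confusion maps (actual, predicted) label pairs to their counts.
    Labels are compared as strings, because the actual label may be
    numeric while the predicted label may be returned as text.
    """
    confusion = Counter()
    total = 0
    correct = 0
    for row in rows:
        actual = str(row[actual_key])
        predicted = str(row[predicted_key])
        confusion[(actual, predicted)] += 1
        total += 1
        if actual == predicted:
            correct += 1
    if total == 0:
        raise ValueError("no rows to evaluate")
    return correct / total, dict(confusion)


# Hypothetical rows, invented for this sketch (not real model output).
sample_rows = [
    {"id": 1, "species": 1, "prediction": "1"},
    {"id": 2, "species": 2, "prediction": "2"},
    {"id": 3, "species": 3, "prediction": "2"},
    {"id": 4, "species": 3, "prediction": "3"},
]

accuracy, confusion = evaluate_predictions(sample_rows)
print(f"accuracy = {accuracy:.2f}")
for (actual, predicted), count in sorted(confusion.items()):
    print(f"actual={actual} predicted={predicted} count={count}")
```

If the prediction result does contain both columns, one possible way to feed it to the helper is to pull the result to the client with the DataFrame to_pandas() method and convert it with the pandas call to_dict("records"). Pulling rows to the client is reasonable here only because the test sample is small.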