Using Decision Forest Model with teradataml Package - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

This section uses the iris dataset, which has three classes. The dataset contains 150 samples, each with four features describing flower properties; a fifth column indicates the flower species.

In this example, you build a Decision Forest model based on the training dataset and apply the model to the test dataset to evaluate the performance of the model.

  1. Import the required modules.
    from teradataml import DecisionForest
    from teradataml import DecisionForestPredict
    from teradataml import load_example_data
    from teradataml.dataframe.dataframe import DataFrame
  2. If the input table "iris_input" does not already exist, create it and load the dataset.
    load_example_data("byom", "iris_input")
  3. Create a teradataml DataFrame from the loaded dataset.
    1. Create a teradataml DataFrame "iris_input" from the loaded table.
      iris_input = DataFrame("iris_input")
    2. Create two samples of input data: sample 1 has 80% of the total rows for training the model ("iris_train"), and sample 2 has 20% of the total rows for testing the model ("iris_test").
      First, sample the "iris_input" dataframe.
      iris_sample = iris_input.sample(frac=[0.8, 0.2])
    3. Create the training dataset from sample 1 by filtering on "sampleid", then drop the "sampleid" column because it is not required for training the model.
      iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
    4. Create the test dataset from sample 2 by filtering on "sampleid", then drop the "sampleid" column because it is not required for scoring.
      iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
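To make the split concrete, the following is a minimal local sketch in plain Python (not teradataml) of what an 80/20 sample tagged with a sample id conceptually looks like. The helper name `tag_samples` and the seeded shuffle are assumptions for illustration only; the actual `sample` method runs in the database and may assign rows differently.

```python
import random


def tag_samples(rows, fractions, seed=0):
    """Shuffle rows and tag each with a 1-based sample id per the fractions.

    Hypothetical helper for illustration only; not part of teradataml.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    tagged = []
    start = 0
    for sample_id, frac in enumerate(fractions, start=1):
        # Number of rows for this sample, rounded to the nearest integer.
        count = round(len(shuffled) * frac)
        for row in shuffled[start:start + count]:
            tagged.append((row, sample_id))
        start += count
    return tagged


rows = list(range(150))  # stand-in for the 150 iris rows
tagged = tag_samples(rows, [0.8, 0.2])

# Filter on the sample id, then drop it, mirroring the two steps above.
train = [row for row, sample_id in tagged if sample_id == 1]
test = [row for row, sample_id in tagged if sample_id == 2]
print(len(train), len(test))
```

With 150 rows, this yields 120 training rows and 30 test rows, and no row appears in both sets.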
  4. Train a new Decision Forest model on the teradataml DataFrame "iris_train", using the DecisionForest function from the teradataml package.
    You can do this with or without the formula argument.

    Example 1: Train the Decision Forest classification model using the input teradataml DataFrame and the formula argument.

    formula = "species ~ sepal_length + sepal_width + petal_length + petal_width"
     
    # Train the Decision Forest model.
    rft_model = DecisionForest(data=iris_train,
                               formula = formula,
                               tree_type="classification",
                               ntree=50,
                               tree_size=100,
                               nodesize=1,
                               variance=0.0,
                               max_depth=12,
                               maxnum_categorical=20,
                               mtry=3,
                               mtry_seed=100,
                               seed=100)
    Example 2: Train the same Decision Forest classification model (rft_model) without the formula argument, specifying the input and response columns directly.
    rft_model = DecisionForest(data=iris_train,
                               input_columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
                               response_column="species",
                               tree_type="classification",
                               ntree=50,
                               tree_size=100,
                               nodesize=1,
                               variance=0.0,
                               max_depth=12,
                               maxnum_categorical=20,
                               mtry=3,
                               mtry_seed=100,
                               seed=100)
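Examples 1 and 2 describe the same model: the left-hand side of the formula corresponds to response_column, and the terms on the right-hand side correspond to input_columns. The following plain-Python sketch illustrates that correspondence. The helper `split_formula` is hypothetical and is not a teradataml function; it handles only the simple additive form used here, and teradataml's own formula handling may differ.

```python
def split_formula(formula):
    """Split 'response ~ col1 + col2' into (response, [col1, col2]).

    Hypothetical helper for illustration; supports only additive terms.
    """
    response, _, predictors = formula.partition("~")
    input_columns = [term.strip() for term in predictors.split("+") if term.strip()]
    return response.strip(), input_columns


formula = "species ~ sepal_length + sepal_width + petal_length + petal_width"
response_column, input_columns = split_formula(formula)
print(response_column)
print(input_columns)
```

The printed values match the response_column and input_columns arguments passed in Example 2.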
    Once the model is created, you can apply the model to the test dataset.
  5. Predict the iris species by applying the Decision Forest model to the teradataml DataFrame "iris_test" from the test dataset, using the DecisionForestPredict function.
    decision_forest_predict_out = DecisionForestPredict(object = rft_model,
                                                        newdata = iris_test,
                                                        id_column = "id",
                                                        detailed = False,
                                                        terms = ["species"]
                                                        )
  6. Inspect the results.
    decision_forest_predict_out.result
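
To evaluate the performance of the model, as described at the start of this section, compare the predicted species with the actual species for each test row. The following is a minimal sketch in plain Python, assuming the result rows have already been brought to the client as (actual, predicted) pairs. The helper names are hypothetical, and the pairs below are made-up placeholder labels, not real model output.

```python
from collections import Counter


def accuracy(pairs):
    """Return the fraction of (actual, predicted) pairs whose labels match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return correct / len(pairs)


def confusion_counts(pairs):
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(pairs)


# Placeholder (actual, predicted) class labels for illustration only;
# these are NOT real output from DecisionForestPredict.
pairs = [(1, 1), (2, 2), (3, 2), (3, 3)]

print(accuracy(pairs))
print(confusion_counts(pairs)[(3, 2)])
```

Accuracy alone can hide which classes are confused with each other, so the per-combination counts are worth inspecting alongside it.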