RandomSearch - Example 2: Input Data Hyperparameterization for Model Trainer Function Tuning

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2024
Language: English (United States)
Last Update: 2024-04-09
Product Category: Teradata Vantage

teradataml's RandomSearch supports hyperparameterization of the training data itself: several input DataFrames can be supplied as a hyperparameter alongside the model arguments. This example tunes the DecisionForest model trainer function on a classification task, building a model that classifies iris flowers by species.

The teradataml example iris data is used to build the DecisionForest classification model.

  1. Example Setup.
    1. Load example data.
      >>> from teradataml import DataFrame, load_example_data
      >>> load_example_data("byom", "iris_input")
    2. Create teradataml DataFrame.
      >>> iris_input = DataFrame("iris_input")
    3. Create two samples of input data: sample 1 has 90% of total rows and sample 2 has 10% of total rows.
      >>> iris_sample = iris_input.sample(frac=[0.9, 0.1])
    4. Create the training dataset from sample 1 by filtering on "sampleid", then drop the "sampleid" column because it is not required for training the model.
      >>> iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
    5. Create the validation dataset from sample 2 by filtering on "sampleid", then drop the "sampleid" column because it is not required for scoring.
      >>> iris_val = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
  2. Define a parameter space and use RandomSearch for input data hyperparameterization.
    1. Create two slices of training data for this use case.
      >>> train_df1 = iris_train.iloc[:60]
      >>> train_df2 = iris_train.iloc[60:]
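    2. Define parameter space for model training. The two training DataFrames are passed as a tuple under the "data" key so that the input data is treated as a hyperparameter.
      >>> import random
      >>> params = {"data":(train_df1, train_df2),
      ...           "input_columns":["sepal_length", "sepal_width", "petal_length", "petal_width"],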
      ...           "response_column":"species",
      ...           "tree_type":"classification",
      ...           "ntree":tuple(set(round(random.uniform(20, 500)) for i in range(50))),
      ...           "tree_size":(100, 200),
      ...           "nodesize":10,
      ...           "variance":tuple(set(round(random.random(), 2) for i in range(20))),
      ...           "max_depth":tuple(set(round(random.uniform(2, 20)) for i in range(6))),
      ...           "maxnum_categorical":20,
      ...           "mtry":30,
      ...           "mtry_seed":100,
      ...           "seed":100}
    3. Define the arguments required for model prediction and evaluation.
      >>> eval_params = {"id_column": "id",
      ...                "accumulate": "species"}
    4. Import trainer function and optimizer.
      >>> from teradataml import DecisionForest, RandomSearch
    5. Initialize the RandomSearch optimizer with the model trainer function and the parameter space required for model training.
      >>> rs_obj = RandomSearch(func=DecisionForest, params=params, n_iter=5)
      Model optimization is initiated using the fit() method.
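
As a rough mental model of how a parameter space like this might be expanded, the sketch below treats tuple values as sets of candidates and every other value as fixed, draws a random subset of the candidate combinations, and pairs each drawn combination with every entry under "data". The helper name sample_param_grid, its seed argument, and the toy values are hypothetical. This is plain Python for illustration only and is not how teradataml implements RandomSearch.

```python
import itertools
import random


def sample_param_grid(params, n_iter, seed=None):
    """Hypothetical helper (not part of teradataml): expand a parameter space."""
    rng = random.Random(seed)

    # Entries under "data" are kept apart: every sampled combination is
    # paired with each of them.
    data_values = params.get("data", (None,))
    if not isinstance(data_values, tuple):
        data_values = (data_values,)

    # Assumption: tuples hold candidate values; anything else is a fixed value.
    rest = {k: v for k, v in params.items() if k != "data"}
    tuned = {k: v for k, v in rest.items() if isinstance(v, tuple)}
    fixed = {k: v for k, v in rest.items() if not isinstance(v, tuple)}

    # Enumerate all candidate combinations, then draw n_iter of them at random.
    names = list(tuned)
    combos = list(itertools.product(*(tuned[name] for name in names)))
    picked = rng.sample(combos, min(n_iter, len(combos)))

    grid = []
    for combo in picked:
        for index, data in enumerate(data_values, start=1):
            param = {**fixed, **dict(zip(names, combo)), "data": data}
            grid.append({"data_id": f"data-{index}", "param": param})
    return grid


# Strings stand in for the two training DataFrames.
toy_params = {
    "data": ("train_df1", "train_df2"),
    "input_columns": ["sepal_length", "sepal_width"],
    "ntree": (20, 50, 100),
    "tree_size": (100, 200),
    "nodesize": 10,
}
toy_grid = sample_param_grid(toy_params, n_iter=4, seed=0)
print(len(toy_grid))  # 4 sampled combinations x 2 datasets = 8 entries
```

In this sketch the list under "input_columns" is passed through unchanged as a single value, while the tuples under "ntree" and "tree_size" are varied.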
  3. Perform model optimization for DecisionForest function.

    The training DataFrames are supplied through the params argument, so no DataFrame is passed to fit(). Evaluation and prediction arguments are passed to fit(), and the progress of hyperparameter tuning is shown with a progress bar.

    >>> rs_obj.fit(**eval_params)
    The data argument is not required for the fit() method because the labeled DataFrames are passed in the params argument as a hyperparameter.
  4. View the metadata of the trained models using the models property of the "rs_obj" instance.
    >>> rs_obj.models
               MODEL_ID DATA_ID                                         PARAMETERS STATUS  ACCURACY
    0  DECISIONFOREST_1  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
    1  DECISIONFOREST_3  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
    2  DECISIONFOREST_0  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
    3  DECISIONFOREST_2  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
    4  DECISIONFOREST_5  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
    5  DECISIONFOREST_7  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
    6  DECISIONFOREST_4  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
    7  DECISIONFOREST_6  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
    8  DECISIONFOREST_9  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
    9  DECISIONFOREST_8  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8

    In total, 10 models are built because every sampled parameter combination is trained on each of the two input DataFrames.

    All models trained successfully (STATUS is PASS). If a model fails, use the get_error_log method to retrieve the corresponding error log.
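
Because DATA_ID records which training DataFrame each model used, the same table can also be read as a comparison between the two datasets. The sketch below is plain Python, with the rows copied by hand from the output above rather than read from rs_obj. It averages ACCURACY per DATA_ID, and the variable names are illustrative only.

```python
from collections import defaultdict

# Rows copied by hand from the models output: (MODEL_ID, DATA_ID, ACCURACY).
model_rows = [
    ("DECISIONFOREST_1", "data-2", 0.8),
    ("DECISIONFOREST_3", "data-2", 0.8),
    ("DECISIONFOREST_0", "data-1", 1.0),
    ("DECISIONFOREST_2", "data-1", 1.0),
    ("DECISIONFOREST_5", "data-2", 0.8),
    ("DECISIONFOREST_7", "data-2", 0.8),
    ("DECISIONFOREST_4", "data-1", 1.0),
    ("DECISIONFOREST_6", "data-1", 1.0),
    ("DECISIONFOREST_9", "data-2", 0.8),
    ("DECISIONFOREST_8", "data-1", 0.8),
]

# Group the accuracies by the dataset each model was trained on.
accuracy_by_data = defaultdict(list)
for _model_id, data_id, accuracy in model_rows:
    accuracy_by_data[data_id].append(accuracy)

# Mean accuracy per dataset, rounded to two decimals for display.
mean_accuracy = {
    data_id: round(sum(values) / len(values), 2)
    for data_id, values in accuracy_by_data.items()
}
for data_id in sorted(mean_accuracy):
    print(data_id, mean_accuracy[data_id])
```

For this particular run, models trained on data-1 average 0.96 and models trained on data-2 average 0.8.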
  5. View the best model identified by RandomSearch by retrieving the best model ID from the "rs_obj" instance.
    >>> rs_obj.best_model_id
    'DECISIONFOREST_0'
    The identified best model is stored as the default model for subsequent prediction and evaluation operations.
  6. Perform prediction on validation data using the identified best model.
    >>> rs_obj.predict(newdata=iris_val, **eval_params)
    ############ result Output ############
    
       species   id  prediction  confidence_lower  confidence_upper
    0        3  106           2               1.0               1.0
    1        3  136           2               1.0               1.0
    2        2   71           2               1.0               1.0
    3        2   61           2               1.0               1.0
    4        2   91           2               1.0               1.0
    5        1    1           1               1.0               1.0
    6        3  149           2               1.0               1.0
    7        3  128           2               1.0               1.0
    8        2   95           2               1.0               1.0
    9        2   85           2               1.0               1.0
  7. Perform evaluation using the best model.
    >>> rs_obj.evaluate()
    ############ output_data Output ############
    
       SeqNum                                              Metric  MetricValue
    0       3  Micro-Recall                                                1.0
    1       5  Macro-Precision                                             1.0
    2       6  Macro-Recall                                                1.0
    3       7  Macro-F1                                                    1.0
    4       9  Weighted-Recall                                             1.0
    5      10  Weighted-F1                                                 1.0
    6       8  Weighted-Precision                                          1.0
    7       4  Micro-F1                                                    1.0
    8       2  Micro-Precision                                             1.0
    9       1  Accuracy                                                    1.0
    
    
    ############ result Output ############
    
           Prediction  Mapping  CLASS_1  CLASS_2  Precision  Recall   F1  Support
    SeqNum
    1               2  CLASS_2        0        6        1.0     1.0  1.0        6
    0               1  CLASS_1        6        0        1.0     1.0  1.0        6
    When validation data is not passed to the evaluate() method, internally sampled test data is used for evaluation.
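
To see where the reported values come from, the metrics can be recomputed from the confusion matrix in the result output. The sketch below is plain Python with the counts copied by hand from that output. The reading of rows as predicted classes and columns as observed classes is an assumption (it makes no difference for this diagonal matrix), and the helper safe_div is hypothetical.

```python
# Confusion matrix copied from the result output.
# Assumption: outer keys are predicted classes, inner keys are observed classes.
labels = ["CLASS_1", "CLASS_2"]
confusion = {
    "CLASS_1": {"CLASS_1": 6, "CLASS_2": 0},
    "CLASS_2": {"CLASS_1": 0, "CLASS_2": 6},
}


def safe_div(numerator, denominator):
    """Hypothetical helper: return 0.0 instead of dividing by zero."""
    return numerator / denominator if denominator else 0.0


total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in labels)
accuracy = safe_div(correct, total)

per_class = {}
for label in labels:
    true_positive = confusion[label][label]
    predicted = sum(confusion[label].values())               # row total
    observed = sum(confusion[row][label] for row in labels)  # column total
    precision = safe_div(true_positive, predicted)
    recall = safe_div(true_positive, observed)
    f1 = safe_div(2 * precision * recall, precision + recall)
    per_class[label] = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": observed,
    }

# Macro averages weight every class equally; weighted averages use support.
macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
weighted_f1 = sum(m["f1"] * m["support"] for m in per_class.values()) / total

print(accuracy, macro_f1, weighted_f1)
```

With no off-diagonal counts, every per-class and averaged metric evaluates to 1.0, matching the output_data table.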
  8. View the stats report for all trained models by retrieving the model stats of the "rs_obj" instance.
    >>> rs_obj.model_stats
               MODEL_ID  ACCURACY  MICRO-PRECISION  ...  WEIGHTED-PRECISION  WEIGHTED-RECALL  WEIGHTED-F1
    0  DECISIONFOREST_1       0.8              0.8  ...               0.875              0.8      0.80543
    1  DECISIONFOREST_3       0.8              0.8  ...               0.875              0.8      0.80543
    2  DECISIONFOREST_0       1.0              1.0  ...               1.000              1.0      1.00000
    3  DECISIONFOREST_2       1.0              1.0  ...               1.000              1.0      1.00000
    4  DECISIONFOREST_5       0.8              0.8  ...               0.875              0.8      0.80543
    5  DECISIONFOREST_7       0.8              0.8  ...               0.875              0.8      0.80543
    6  DECISIONFOREST_4       1.0              1.0  ...               1.000              1.0      1.00000
    7  DECISIONFOREST_6       1.0              1.0  ...               1.000              1.0      1.00000
    8  DECISIONFOREST_9       0.8              0.8  ...               0.875              0.8      0.80543
    9  DECISIONFOREST_8       0.8              0.8  ...               0.875              0.8      0.80543
    
    [10 rows x 11 columns]
    Model stats provide additional evaluation metrics for each trained model.
  9. Update the default model with another trained model and perform predictions.
    1. Find the best model.
      >>> rs_obj.best_model_id
      'DECISIONFOREST_0'
      RandomSearch identifies 'DECISIONFOREST_0' as the best model, and it is used as the default model.
    2. Update the default trained model. The default model of the RandomSearch instance is updated using the set_model method.
      >>> rs_obj.set_model(model_id="DECISIONFOREST_4")
    3. Perform prediction using "DECISIONFOREST_4" model.
      >>> rs_obj.predict(newdata=iris_val.iloc[:5], **eval_params)
      ############ result Output ############
      
         species  id  prediction  confidence_lower  confidence_upper
      0        1  26           1               1.0               1.0
      1        1  29           1               1.0               1.0
      2        1  28           1               1.0               1.0
      3        1  13           1               1.0               1.0
      4        1   6           1               1.0               1.0
      Although the default model is updated, the known best model information remains unchanged. The best model and its corresponding information can be retrieved using the RandomSearch properties whose names start with "best_".
  10. Retrieve the identified best training data.
    >>> rs_obj.get_input_data(data_id=rs_obj.best_data_id)
    [{'data':     sepal_length  sepal_width  petal_length  petal_width  species  id
        36           5.0          3.2           1.2          0.2        1
        26           5.0          3.0           1.6          0.2        1
        5            5.0          3.6           1.4          0.2        1
        17           5.4          3.9           1.3          0.4        1
        34           5.5          4.2           1.4          0.2        1
        13           4.8          3.0           1.4          0.1        1
        53           6.9          3.1           4.9          1.5        2
        11           5.4          3.7           1.5          0.2        1
        15           5.8          4.0           1.2          0.2        1
        7            4.6          3.4           1.4          0.3        1},
     {'newdata':     sepal_length  sepal_width  petal_length  petal_width  species  id
        38              4.9          3.6           1.4          0.1        1
        62              5.9          3.0           4.2          1.5        2
        25              4.8          3.4           1.9          0.2        1
        51              7.0          3.2           4.7          1.4        2
        31              4.8          3.1           1.6          0.2        1
        29              5.2          3.4           1.4          0.2        1
        48              4.6          3.2           1.4          0.2        1
        27              5.0          3.4           1.6          0.4        1
        52              6.4          3.2           4.5          1.5        2
        57              6.3          3.3           4.7          1.6        2}]
  11. Retrieve any trained model using the RandomSearch instance.
    >>> rs_obj.get_model("DECISIONFOREST_1")
    ############ result Output ############
    
       task_index  tree_num  tree_order  classification_tree
    0           0         0           0  {"id_":1,"size_":89,"maxDepth_":5,"responseCounts_":{"2":37,"3":52},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":1.750000,"attr_":"petal_width","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.485797,"scoreImprove_":0.485797,"leftNodeSize_":37,"rightNodeSize_":52},"leftChild_":{"id_":2,"size_":37,"maxDepth_":4,"label_":"2","responseCounts_":{"2":37},"nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":{"id_":3,"size_":52,"maxDepth_":4,"label_":"3","responseCounts_":{"3":52},"nodeType_":"CLASSIFICATION_LEAF"}}
    1           1         0           0  {"id_":1,"size_":87,"maxDepth_":5,"responseCounts_":{"2":36,"3":51},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":4.650000,"attr_":"petal_length","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.485137,"scoreImprove_":0.399870,"leftNodeSize_":32,"rightNodeSize_":55},"leftChild_":{"id_":2,"size_":32,"maxDepth_":4,"label_":"2","responseCounts_":{"2":32},"nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":{"id_":3,"size_":55,"maxDepth_":4,"responseCounts_":{"3":51,"2":4},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":6.050000,"attr_":"sepal_length","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.134876,"scoreImprove_":0.030094,"leftNodeSize_":10,"rightNodeSize_":45},"leftChild_":{"id_":6,"size_":10,"maxDepth_":3,"responseCounts_":{"2":4,"3":6},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":5.050000,"attr_":"petal_length","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.480000,"scoreImprove_":0.055172,"leftNodeSize_":6,"rightNodeSize_":4},"leftChild_":{"id_":12,"size_":6,"maxDepth_":2,"label_":"3","responseCounts_":{"3":6},"nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":{"id_":13,"size_":4,"maxDepth_":2,"label_":"2","responseCounts_":{"2":4},"nodeType_":"CLASSIFICATION_LEAF"}},"rightChild_":{"id_":7,"size_":45,"maxDepth_":3,"label_":"3","responseCounts_":{"3":45},"nodeType_":"CLASSIFICATION_LEAF"}}}
    Any trained model can be retrieved using the get_model method. The best model can be retrieved using the best_model property.
  12. Retrieve the parameter grid of the "rs_obj" instance.
    >>> rs_obj.get_parameter_grid()
    [{'data_id': 'data-1',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 5,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 23,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 100,
                'tree_type': 'classification',
                'variance': 0.81}},
     {'data_id': 'data-2',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 5,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 23,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 100,
                'tree_type': 'classification',
                'variance': 0.81}},
     {'data_id': 'data-1',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 15,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 99,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 200,
                'tree_type': 'classification',
                'variance': 0.13}},
     {'data_id': 'data-2',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 15,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 99,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 200,
                'tree_type': 'classification',
                'variance': 0.13}},
     {'data_id': 'data-1',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 12,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 89,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 100,
                'tree_type': 'classification',
                'variance': 0.13}},
     {'data_id': 'data-2',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 12,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 89,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 100,
                'tree_type': 'classification',
                'variance': 0.13}},
     {'data_id': 'data-1',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 13,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 107,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 200,
                'tree_type': 'classification',
                'variance': 0.73}},
     {'data_id': 'data-2',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 13,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 107,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 200,
                'tree_type': 'classification',
                'variance': 0.73}},
     {'data_id': 'data-1',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 9,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 22,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 200,
                'tree_type': 'classification',
                'variance': 0.69}},
     {'data_id': 'data-2',
      'param': {'data': '"ALICE"."ml__select__169836486331463"',
                'input_columns': ['sepal_length',
                                  'sepal_width',
                                  'petal_length',
                                  'petal_width'],
                'max_depth': 9,
                'maxnum_categorical': 20,
                'mtry': 30,
                'mtry_seed': 100,
                'nodesize': 10,
                'ntree': 22,
                'response_column': 'species',
                'seed': 100,
                'tree_size': 200,
                'tree_type': 'classification',
                'variance': 0.69}}]
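
The grid is long, so its structure is easier to check programmatically. The sketch below is plain Python and keeps only the arguments that vary between entries, copied by hand from the grid above. It groups the entries by parameter combination and confirms that each combination appears once per data_id. The variable names are illustrative only.

```python
from collections import defaultdict

# (data_id, max_depth, ntree, tree_size, variance) copied from the grid above;
# arguments that are identical in every entry are omitted.
grid_rows = [
    ("data-1", 5, 23, 100, 0.81),
    ("data-2", 5, 23, 100, 0.81),
    ("data-1", 15, 99, 200, 0.13),
    ("data-2", 15, 99, 200, 0.13),
    ("data-1", 12, 89, 100, 0.13),
    ("data-2", 12, 89, 100, 0.13),
    ("data-1", 13, 107, 200, 0.73),
    ("data-2", 13, 107, 200, 0.73),
    ("data-1", 9, 22, 200, 0.69),
    ("data-2", 9, 22, 200, 0.69),
]

# Map each distinct parameter combination to the datasets it was paired with.
data_ids_by_combo = defaultdict(set)
for data_id, *combo in grid_rows:
    data_ids_by_combo[tuple(combo)].add(data_id)

print(len(data_ids_by_combo))  # number of distinct parameter combinations
for combo, data_ids in data_ids_by_combo.items():
    print(combo, sorted(data_ids))
```

Each of the five distinct combinations is paired with both data-1 and data-2, which accounts for the 10 entries in the grid.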