teradataml’s RandomSearch offers random sampling over a hyperparameter space, along with hyper-parameterization of input data, for hyperparameter tuning tasks. This example performs hyperparameter tuning on the DecisionForest model trainer function for a classification task: building a DecisionForest model that classifies iris flowers.
In this example, the teradataml example iris dataset is used to build the DecisionForest classification model.
- Example Setup.
- Load example data.
>>> load_example_data("byom", "iris_input")
- Create teradataml DataFrame.
>>> iris_input = DataFrame("iris_input")
- Create two samples of input data: sample 1 has 90% of total rows and sample 2 has 10% of total rows.
>>> iris_sample = iris_input.sample(frac=[0.9, 0.1])
- Create the training dataset from sample 1 by filtering on "sampleid", then drop the "sampleid" column as it is not required for training the model.
>>> iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
- Create the validation dataset from sample 2 by filtering on "sampleid", then drop the "sampleid" column as it is not required for scoring.
>>> iris_val = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
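The split above can be sketched in plain Python. This is a conceptual illustration only, not teradataml's implementation: sample(frac=[0.9, 0.1]) tags every row with a "sampleid" of 1 (the 90% sample) or 2 (the 10% sample), which the two filter-and-drop steps then use to build the train and validation sets.

```python
import random

# Conceptual sketch (standard library only, not teradataml internals) of a
# 90% / 10% split over 150 row ids, analogous to sample(frac=[0.9, 0.1]).
random.seed(42)
row_ids = list(range(150))        # stand-in for the 150 iris row ids
random.shuffle(row_ids)

cut = int(0.9 * len(row_ids))     # boundary between the two samples
iris_train_ids = row_ids[:cut]    # rows that would carry sampleid == "1"
iris_val_ids = row_ids[cut:]      # rows that would carry sampleid == "2"

print(len(iris_train_ids), len(iris_val_ids))  # → 135 15
```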
- Define a parameter space and use RandomSearch for input data hyperparameterization.
- Create two slices of training data for this use case.
>>> train_df1 = iris_train.iloc[:60]
>>> train_df2 = iris_train.iloc[60:]
- Define parameter space for model training.
>>> import random
>>> params = {"data":(train_df1, train_df2),
...           "input_columns":["sepal_length", "sepal_width", "petal_length", "petal_width"],
...           "response_column":"species",
...           "tree_type":"classification",
...           "ntree":tuple(set(round(random.uniform(20, 500)) for i in range(50))),
...           "tree_size":(100, 200),
...           "nodesize":10,
...           "variance":tuple(set(round(random.random(), 2) for i in range(20))),
...           "max_depth":tuple(set(round(random.uniform(2, 20)) for i in range(6))),
...           "maxnum_categorical":20,
...           "mtry":30,
...           "mtry_seed":100,
...           "seed":100}
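The tuple-valued entries in params define the search space: RandomSearch picks one element from each tuple per iteration, while scalar entries stay fixed. The candidate tuples themselves are built with plain Python, as this standard-library-only snippet illustrates:

```python
import random

random.seed(100)  # fix the seed so the candidate sets are reproducible

# Candidate values for "ntree": up to 50 draws from [20, 500], deduplicated.
ntree_candidates = tuple(set(round(random.uniform(20, 500)) for i in range(50)))

# Candidate values for "variance": up to 20 draws from [0, 1), rounded to 2 dp.
variance_candidates = tuple(set(round(random.random(), 2) for i in range(20)))

# Candidate values for "max_depth": up to 6 draws from [2, 20], deduplicated.
max_depth_candidates = tuple(set(round(random.uniform(2, 20)) for i in range(6)))

# set() removes duplicate draws, so each tuple can be shorter than the number
# of draws; every element stays within the sampled range.
print(len(ntree_candidates), len(variance_candidates), len(max_depth_candidates))
```

Because duplicates are dropped, the tuples can hold fewer elements than the number of draws; seeding `random` before building them makes reruns comparable.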
- Define required argument for model prediction and evaluation.
>>> eval_params = {"id_column": "id",
...                "accumulate": "species"}
- Import trainer function and optimizer.
>>> from teradataml import DecisionForest, RandomSearch
- Initialize the RandomSearch optimizer with model trainer function and parameter space required for model training.
>>> rs_obj = RandomSearch(func=DecisionForest, params=params, n_iter=4)
Model optimization is initiated using the fit method.
- Perform model optimization for the DecisionForest function.
Evaluation and prediction arguments are passed to the fit method, and hyperparameter tuning execution can be viewed using the progress bar.
>>> rs_obj.fit(**eval_params)
The data argument is not required for the fit() method, because the labeled DataFrames were passed in the params argument as a hyperparameter.
- View the metadata of the models trained during hyperparameter tuning using the models property. Retrieve the model metadata of the "rs_obj" instance.
>>> rs_obj.models
           MODEL_ID DATA_ID                                         PARAMETERS STATUS  ACCURACY
0  DECISIONFOREST_1  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
1  DECISIONFOREST_3  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
2  DECISIONFOREST_0  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
3  DECISIONFOREST_2  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
4  DECISIONFOREST_5  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
5  DECISIONFOREST_7  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
6  DECISIONFOREST_4  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
7  DECISIONFOREST_6  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       1.0
8  DECISIONFOREST_9  data-2  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
9  DECISIONFOREST_8  data-1  {'input_columns': ['sepal_length', 'sepal_widt...   PASS       0.8
In total, 10 models are built, because every sampled parameter combination is trained on each of the input DataFrames.
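The model count follows from crossing the sampled parameter combinations with the input data slices; the parameter grid output later in this example shows five distinct combinations, each paired with both "data-1" and "data-2". A minimal sketch of that arithmetic (the counts here are taken from this example's output, not computed by teradataml):

```python
from itertools import product

# Each sampled parameter combination is trained once per input DataFrame,
# so the total model count is n_combinations * n_data_slices.
n_param_combinations = 5            # distinct combinations in this run
data_ids = ["data-1", "data-2"]     # the two training slices

runs = [(combo, data_id)
        for combo, data_id in product(range(n_param_combinations), data_ids)]
print(len(runs))  # → 10, matching DECISIONFOREST_0 .. DECISIONFOREST_9
```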
All model trainings have passed. In case of failure, use the get_error_log method to retrieve the corresponding error logs.
- View the best model identified by RandomSearch. Retrieve the best model ID identified by the "rs_obj" instance.
>>> rs_obj.best_model_id
'DECISIONFOREST_0'
The identified best model is stored as the default model for future prediction and evaluation operations.
- Perform prediction on the validation data using the identified best model.
>>> rs_obj.predict(newdata=iris_val, **eval_params)
############ result Output ############

   species   id  prediction  confidence_lower  confidence_upper
0        3  106           2               1.0               1.0
1        3  136           2               1.0               1.0
2        2   71           2               1.0               1.0
3        2   61           2               1.0               1.0
4        2   91           2               1.0               1.0
5        1    1           1               1.0               1.0
6        3  149           2               1.0               1.0
7        3  128           2               1.0               1.0
8        2   95           2               1.0               1.0
9        2   85           2               1.0               1.0
- Perform evaluation on validation data using the best model.
>>> rs_obj.evaluate()
############ output_data Output ############

   SeqNum              Metric  MetricValue
0       3        Micro-Recall          1.0
1       5     Macro-Precision          1.0
2       6        Macro-Recall          1.0
3       7            Macro-F1          1.0
4       9     Weighted-Recall          1.0
5      10         Weighted-F1          1.0
6       8  Weighted-Precision          1.0
7       4            Micro-F1          1.0
8       2     Micro-Precision          1.0
9       1            Accuracy          1.0

############ result Output ############

        Prediction  Mapping  CLASS_1  CLASS_2  Precision  Recall   F1  Support
SeqNum
1                2  CLASS_2        0        6        1.0     1.0  1.0        6
0                1  CLASS_1        6        0        1.0     1.0  1.0        6
When validation data is not passed to the evaluate() method, it uses the internally sampled test data for evaluation.
- View the stats report of all trained models. Retrieve the model stats of the "rs_obj" instance.
>>> rs_obj.model_stats
           MODEL_ID  ACCURACY  MICRO-PRECISION  ...  WEIGHTED-PRECISION  WEIGHTED-RECALL  WEIGHTED-F1
0  DECISIONFOREST_1       0.8              0.8  ...               0.875              0.8      0.80543
1  DECISIONFOREST_3       0.8              0.8  ...               0.875              0.8      0.80543
2  DECISIONFOREST_0       1.0              1.0  ...               1.000              1.0      1.00000
3  DECISIONFOREST_2       1.0              1.0  ...               1.000              1.0      1.00000
4  DECISIONFOREST_5       0.8              0.8  ...               0.875              0.8      0.80543
5  DECISIONFOREST_7       0.8              0.8  ...               0.875              0.8      0.80543
6  DECISIONFOREST_4       1.0              1.0  ...               1.000              1.0      1.00000
7  DECISIONFOREST_6       1.0              1.0  ...               1.000              1.0      1.00000
8  DECISIONFOREST_9       0.8              0.8  ...               0.875              0.8      0.80543
9  DECISIONFOREST_8       0.8              0.8  ...               0.875              0.8      0.80543

[10 rows x 11 columns]
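Best-model selection over such a report can be sketched with plain Python. This is a conceptual sketch, not teradataml's internal logic; the (model_id, accuracy) pairs are copied from the stats output above:

```python
# Conceptual sketch (not teradataml internals): pick the model with the
# highest accuracy from the stats report.
model_stats = [
    ("DECISIONFOREST_1", 0.8), ("DECISIONFOREST_3", 0.8),
    ("DECISIONFOREST_0", 1.0), ("DECISIONFOREST_2", 1.0),
    ("DECISIONFOREST_5", 0.8), ("DECISIONFOREST_7", 0.8),
    ("DECISIONFOREST_4", 1.0), ("DECISIONFOREST_6", 1.0),
    ("DECISIONFOREST_9", 0.8), ("DECISIONFOREST_8", 0.8),
]

# max() with a key returns the first row with the maximum accuracy,
# which here is DECISIONFOREST_0 — the model RandomSearch reports as best.
best_id, best_acc = max(model_stats, key=lambda row: row[1])
print(best_id, best_acc)  # → DECISIONFOREST_0 1.0
```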
Model stats provide an additional evaluation metrics report.
- Update the default model with another trained model and perform predictions.
- Find the best model.
>>> rs_obj.best_model_id
'DECISIONFOREST_0'
RandomSearch identifies 'DECISIONFOREST_0' as the best model, and the same is considered the default model.
- Update the default trained model. The default model of the RandomSearch instance is updated using the set_model method.
>>> rs_obj.set_model(model_id="DECISIONFOREST_4")
- Perform prediction using "DECISIONFOREST_4" model.
>>> rs_obj.predict(newdata=iris_val.iloc[:5], **eval_params)
############ result Output ############

   species  id  prediction  confidence_lower  confidence_upper
0        1  26           1               1.0               1.0
1        1  29           1               1.0               1.0
2        1  28           1               1.0               1.0
3        1  13           1               1.0               1.0
4        1   6           1               1.0               1.0
Though the default model is updated, the known best model information remains unchanged. The best model and its corresponding information can be retrieved using the properties of RandomSearch starting with "best_".
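The relationship between the default model and the best model can be sketched as a small class. This is a conceptual illustration only (the class name and attributes are hypothetical, not teradataml's implementation): set_model changes only the default used by predict() and evaluate(), while the best_* information keeps pointing at the originally identified best model.

```python
class ModelRegistry:
    """Conceptual sketch (not teradataml's implementation) of tracking a
    mutable default model alongside an immutable best model."""

    def __init__(self, best_model_id):
        self._best_model_id = best_model_id      # fixed once tuning finishes
        self._default_model_id = best_model_id   # initially the best model

    @property
    def best_model_id(self):
        return self._best_model_id

    @property
    def default_model_id(self):
        return self._default_model_id

    def set_model(self, model_id):
        # Only the default used for predict()/evaluate() changes.
        self._default_model_id = model_id


registry = ModelRegistry("DECISIONFOREST_0")
registry.set_model("DECISIONFOREST_4")
print(registry.default_model_id)  # → DECISIONFOREST_4
print(registry.best_model_id)     # → DECISIONFOREST_0 (unchanged)
```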
- Retrieve the identified best training data.
>>> rs_obj.get_input_data(data_id=rs_obj.best_data_id)
[{'data':
     sepal_length  sepal_width  petal_length  petal_width  species
 id
 36           5.0          3.2           1.2          0.2        1
 26           5.0          3.0           1.6          0.2        1
 5            5.0          3.6           1.4          0.2        1
 17           5.4          3.9           1.3          0.4        1
 34           5.5          4.2           1.4          0.2        1
 13           4.8          3.0           1.4          0.1        1
 53           6.9          3.1           4.9          1.5        2
 11           5.4          3.7           1.5          0.2        1
 15           5.8          4.0           1.2          0.2        1
 7            4.6          3.4           1.4          0.3        1},
 {'newdata':
     sepal_length  sepal_width  petal_length  petal_width  species
 id
 38           4.9          3.6           1.4          0.1        1
 62           5.9          3.0           4.2          1.5        2
 25           4.8          3.4           1.9          0.2        1
 51           7.0          3.2           4.7          1.4        2
 31           4.8          3.1           1.6          0.2        1
 29           5.2          3.4           1.4          0.2        1
 48           4.6          3.2           1.4          0.2        1
 27           5.0          3.4           1.6          0.4        1
 52           6.4          3.2           4.5          1.5        2
 57           6.3          3.3           4.7          1.6        2}]
- Retrieve any trained model using RandomSearch instance.
>>> rs_obj.get_model("DECISIONFOREST_1")
############ result Output ############ task_index tree_num tree_order classification_tree 0 0 0 0 {"id_":1,"size_":89,"maxDepth_":5,"responseCounts_":{"2":37,"3":52},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":1.750000,"attr_":"petal_width","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.485797,"scoreImprove_":0.485797,"leftNodeSize_":37,"rightNodeSize_":52},"leftChild_":{"id_":2,"size_":37,"maxDepth_":4,"label_":"2","responseCounts_":{"2":37},"nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":{"id_":3,"size_":52,"maxDepth_":4,"label_":"3","responseCounts_":{"3":52},"nodeType_":"CLASSIFICATION_LEAF"}} 1 1 0 0 {"id_":1,"size_":87,"maxDepth_":5,"responseCounts_":{"2":36,"3":51},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":4.650000,"attr_":"petal_length","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.485137,"scoreImprove_":0.399870,"leftNodeSize_":32,"rightNodeSize_":55},"leftChild_":{"id_":2,"size_":32,"maxDepth_":4,"label_":"2","responseCounts_":{"2":32},"nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":{"id_":3,"size_":55,"maxDepth_":4,"responseCounts_":{"3":51,"2":4},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":6.050000,"attr_":"sepal_length","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.134876,"scoreImprove_":0.030094,"leftNodeSize_":10,"rightNodeSize_":45},"leftChild_":{"id_":6,"size_":10,"maxDepth_":3,"responseCounts_":{"2":4,"3":6},"nodeType_":"CLASSIFICATION_NODE","split_":{"splitValue_":5.050000,"attr_":"petal_length","type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.480000,"scoreImprove_":0.055172,"leftNodeSize_":6,"rightNodeSize_":4},"leftChild_":{"id_":12,"size_":6,"maxDepth_":2,"label_":"3","responseCounts_":{"3":6},"nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":{"id_":13,"size_":4,"maxDepth_":2,"label_":"2","responseCounts_":{"2":4},"nodeType_":"CLASSIFICATION_LEAF"}},"rightChild_":{"id_":7,"size_":45,"maxDepth_":3,"label_":"3","responseCounts_":{"3":45},"nodeType_":"CLASSIFICATION_LEAF"}}}
Any trained model can be retrieved using the get_model method. The best model can be retrieved using the best_model property.
- Retrieve the parameter grid of the "rs_obj" object.
>>> rs_obj.get_parameter_grid()
[{'data_id': 'data-1', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 5, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 23, 'response_column': 'species', 'seed': 100, 'tree_size': 100, 'tree_type': 'classification', 'variance': 0.81}}, {'data_id': 'data-2', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 5, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 23, 'response_column': 'species', 'seed': 100, 'tree_size': 100, 'tree_type': 'classification', 'variance': 0.81}}, {'data_id': 'data-1', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 15, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 99, 'response_column': 'species', 'seed': 100, 'tree_size': 200, 'tree_type': 'classification', 'variance': 0.13}}, {'data_id': 'data-2', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 15, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 99, 'response_column': 'species', 'seed': 100, 'tree_size': 200, 'tree_type': 'classification', 'variance': 0.13}}, {'data_id': 'data-1', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 12, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 89, 'response_column': 'species', 'seed': 100, 'tree_size': 100, 'tree_type': 'classification', 'variance': 0.13}}, {'data_id': 'data-2', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 
'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 12, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 89, 'response_column': 'species', 'seed': 100, 'tree_size': 100, 'tree_type': 'classification', 'variance': 0.13}}, {'data_id': 'data-1', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 13, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 107, 'response_column': 'species', 'seed': 100, 'tree_size': 200, 'tree_type': 'classification', 'variance': 0.73}}, {'data_id': 'data-2', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 13, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 107, 'response_column': 'species', 'seed': 100, 'tree_size': 200, 'tree_type': 'classification', 'variance': 0.73}}, {'data_id': 'data-1', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 9, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 22, 'response_column': 'species', 'seed': 100, 'tree_size': 200, 'tree_type': 'classification', 'variance': 0.69}}, {'data_id': 'data-2', 'param': {'data': '"ALICE"."ml__select__169836486331463"', 'input_columns': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'max_depth': 9, 'maxnum_categorical': 20, 'mtry': 30, 'mtry_seed': 100, 'nodesize': 10, 'ntree': 22, 'response_column': 'species', 'seed': 100, 'tree_size': 200, 'tree_type': 'classification', 'variance': 0.69}}]
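The grid above pairs each sampled parameter combination with every data id. That materialization step can be sketched with the standard library; the names below (`space`, `sample_combination`) are hypothetical, and the candidate values are copied from this example's grid output, not teradataml's internals:

```python
import random

random.seed(100)

# Search space: tuple-valued entries are sampled from; scalars stay fixed.
space = {
    "ntree": (23, 99, 89, 107, 22),      # values seen in the grid above
    "tree_size": (100, 200),
    "max_depth": (5, 15, 12, 13, 9),
    "nodesize": 10,                      # fixed scalar hyperparameter
}
data_ids = ["data-1", "data-2"]

def sample_combination(space):
    # Pick one element from each tuple; pass scalars through unchanged.
    return {k: random.choice(v) if isinstance(v, tuple) else v
            for k, v in space.items()}

# Each sampled combination is crossed with every input data id, mirroring
# the repeated data-1/data-2 pairs in the parameter grid above.
grid = [{"data_id": d, "param": combo}
        for combo in (sample_combination(space) for _ in range(5))
        for d in data_ids]
print(len(grid))  # → 10
```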