teradataml supports hyper-parameterization of training data for hyperparameter tuning tasks. This example builds an SVM regression model to predict house values in California.
In this example, the teradataml California housing example data is sliced into multiple datasets, and each one is used to build the SVM regression model.
- Example setup.
- Load example data from "cal_housing_ex_raw" that contains California housing data.
>>> load_example_data("teradataml", ["cal_housing_ex_raw"])
- Create teradataml DataFrame objects.
>>> data_input = DataFrame.from_table("cal_housing_ex_raw")
- Scale "target_columns" with respect to the standard deviation of each column (scale_method="USTD").
>>> fit_obj = ScaleFit(data=data_input, target_columns=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'], scale_method="USTD")
- Transform the data.
>>> transform_obj = ScaleTransform(data=data_input, object=fit_obj.output, accumulate=["id", "MedHouseVal"])
- Sample train and validation DataFrames, where 80% of the data is used for model training and 20% for model validation.
>>> train_val_sample = transform_obj.result.sample(frac=[0.8, 0.2])
>>> train_df = train_val_sample[train_val_sample.sampleid == 1].drop("sampleid", axis=1)
>>> val_df = train_val_sample[train_val_sample.sampleid == 2].drop("sampleid", axis=1)
- Create two training data samples for model optimization.
>>> train_df1 = train_df.iloc[:30]
>>> train_df2 = train_df.iloc[30:]
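As a rough illustration of the scaling step in the setup above, the sketch below standardizes one column in plain Python. It assumes that "USTD" centers a column on its mean and divides by the sample (N-1) standard deviation; consult the ScaleFit reference for the exact definition. The function name scale_by_sample_std and the input values are hypothetical and are not part of teradataml or of "cal_housing_ex_raw".

```python
import statistics


def scale_by_sample_std(values):
    """Center on the mean and divide by the sample (N-1) standard deviation.

    Assumption: this mirrors what scale_method="USTD" does to one column.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, N-1 denominator
    return [(v - mean) / std for v in values]


# Hypothetical house-age values, not taken from the example table.
house_age = [21.0, 52.0, 34.0, 16.0, 41.0]
scaled = scale_by_sample_std(house_age)

# After scaling, the column has mean 0 and sample standard deviation 1.
print([round(v, 3) for v in scaled])
```

Scaling all input columns this way puts them on a comparable range, which generally helps gradient-based trainers such as SVM converge.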
- Define a parameter space and use GridSearch for Hyperparameterization.
- Define parameter space for model training.
>>> params = {"input_columns":['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'], "response_column":"MedHouseVal", "model_type":"regression", "batch_size":(11, 50, 75), "iter_max":(100, 301), "lambda1":0.1, "alpha":0.5, "iter_num_no_change":60, "tolerance":0.01, "intercept":False, "learning_rate":"INVTIME", "initial_data":0.5, "decay_rate":0.5, "momentum":0.6, "nesterov_optimization":True, "local_sgd_iterations":1}
- Define required arguments for model prediction and evaluation.
>>> eval_params = {"id_column": "id", "accumulate": "MedHouseVal"}
- Import trainer function and optimizer.
>>> from teradataml import SVM, GridSearch
- Initialize the GridSearch optimizer with model trainer function and parameter space required for model training.
>>> gs_obj = GridSearch(func=SVM, params=params)
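To see how many models this parameter space produces, the sketch below expands a reduced copy of "params" in plain Python. It assumes that tuple values are the search dimensions and that every other value is held fixed, which is consistent with the 12 models trained in this example (3 batch sizes x 2 iteration limits x 2 training DataFrames). It illustrates the counting only and is not how GridSearch is implemented; the names search, fixed, combinations and runs are hypothetical.

```python
from itertools import product

# Reduced copy of the "params" dictionary from this example.
params = {
    "model_type": "regression",
    "batch_size": (11, 50, 75),
    "iter_max": (100, 301),
    "lambda1": 0.1,
}

# Assumption: tuple values are search dimensions, everything else is fixed.
search = {k: v for k, v in params.items() if isinstance(v, tuple)}
fixed = {k: v for k, v in params.items() if not isinstance(v, tuple)}

# Cartesian product of the search dimensions, merged with the fixed values.
combinations = [
    {**fixed, **dict(zip(search, values))}
    for values in product(*search.values())
]

# Two training DataFrames; the labels are placeholders for the two datasets.
data_ids = ["DF_0", "DF_1"]
runs = [(data_id, combo) for combo in combinations for data_id in data_ids]

print(len(combinations), len(runs))  # 6 12
```

Because the training data multiplies the grid like any other searched parameter, adding a third DataFrame to this example would raise the run count from 12 to 18.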
Model optimization is initiated using fit method.
- Pass multiple training datasets as a tuple of DataFrames to the model trainer function. The DataFrames are treated as a hyper-parameterized argument, and the tuning run is reported through detailed training logs along with a progress bar.
Parallel execution is disabled (run_parallel=False), and an early stop criterion is set (early_stop=0.85).
>>> gs_obj.fit(data=(train_df1, train_df2), run_parallel=False, early_stop=0.85, evaluation_metric="MAE", verbose=2, **eval_params)
Model_id:SVM_0 - Run time:11.608s - Status:PASS - MAE:2.2
Model_id:SVM_1 - Run time:11.624s - Status:PASS - MAE:3.11
Model_id:SVM_2 - Run time:13.068s - Status:PASS - MAE:2.2
Model_id:SVM_3 - Run time:11.549s - Status:PASS - MAE:3.11
Model_id:SVM_4 - Run time:11.965s - Status:PASS - MAE:2.187
Model_id:SVM_5 - Run time:11.586s - Status:PASS - MAE:3.11
Model_id:SVM_6 - Run time:12.506s - Status:PASS - MAE:2.187
Model_id:SVM_7 - Run time:11.695s - Status:PASS - MAE:3.11
Model_id:SVM_8 - Run time:11.491s - Status:PASS - MAE:2.187
Model_id:SVM_9 - Run time:14.294s - Status:PASS - MAE:3.11
Model_id:SVM_10 - Run time:11.422s - Status:PASS - MAE:2.187
Model_id:SVM_11 - Run time:15.748s - Status:PASS - MAE:3.11
Completed: |████████████████████████████████████████████████████████████| 100% - 12/12
All models trained successfully. In case of failure, use the get_error_log method to retrieve the corresponding error logs.
- View the metadata of the models trained during hyperparameter tuning using the models property.
- Retrieve the model metadata of the "gs_obj" instance.
>>> gs_obj.models
    MODEL_ID  DATA_ID                                      PARAMETERS  STATUS       MAE
 0     SVM_0     DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  2.199501
 1     SVM_1     DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  3.110119
 2     SVM_2     DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  2.199501
 3     SVM_3     DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  3.110119
 4     SVM_4     DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  2.187302
 5     SVM_5     DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  3.110119
 6     SVM_6     DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  2.187302
 7     SVM_7     DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  3.110119
 8     SVM_8     DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  2.187302
 9     SVM_9     DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  3.110119
10    SVM_10     DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  2.187302
11    SVM_11     DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR    PASS  3.110119
The DATA_ID column contains the unique identifier of the data used to train the corresponding model. The input data can be retrieved using the get_input_data method.
- View the best model and corresponding information identified by GridSearch.
- Retrieve the best model id identified by "gs_obj" instance.
>>> gs_obj.best_model_id
'SVM_4'
- Retrieve the best data id.
>>> gs_obj.best_data_id
'DF_0'
- Retrieve the best model of the "gs_obj" instance.
>>> gs_obj.best_model
############ output_data Output ############

   iterNum      loss       eta  bias
0        3  1.787430  0.028868   0.0
1        5  1.702996  0.022361   0.0
2        6  1.671471  0.020412   0.0
3        7  1.644866  0.018898   0.0
4        9  1.608033  0.016667   0.0
5       10  1.596019  0.015811   0.0
6        8  1.624216  0.017678   0.0
7        4  1.742452  0.025000   0.0
8        2  1.834916  0.035355   0.0
9        1  1.888336  0.050000   0.0

############ result Output ############

                         predictor   estimate                value
attribute
-5                             BIC  37.025676                 None
-9         Learning Rate (Initial)   0.050000                 None
-11                       Momentum   0.600000                 None
-14                        Epsilon   0.100000                 None
-1                   Loss Function        NaN  EPSILON_INSENSITIVE
-3          Number of Observations  24.000000                 None
 7                        Latitude  -0.538928                 None
 5                      Population   0.091047                 None
-16                         Kernel        NaN               LINEAR
-7                           Alpha   0.500000           Elasticnet
Identified best model is stored as a default model for future prediction and evaluation operations.
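The best model reported here can be reproduced by hand from the MAE values in the models output. The sketch below picks the row with the lowest MAE, since MAE is an error measure and lower is better. Four models share the lowest MAE in this run; taking the first of them is an assumption that happens to match the 'SVM_4' and 'DF_0' shown above, not a documented tie-breaking rule.

```python
# (MODEL_ID, DATA_ID, MAE) rows copied from the gs_obj.models output.
models = [
    ("SVM_0", "DF_0", 2.199501), ("SVM_1", "DF_1", 3.110119),
    ("SVM_2", "DF_0", 2.199501), ("SVM_3", "DF_1", 3.110119),
    ("SVM_4", "DF_0", 2.187302), ("SVM_5", "DF_1", 3.110119),
    ("SVM_6", "DF_0", 2.187302), ("SVM_7", "DF_1", 3.110119),
    ("SVM_8", "DF_0", 2.187302), ("SVM_9", "DF_1", 3.110119),
    ("SVM_10", "DF_0", 2.187302), ("SVM_11", "DF_1", 3.110119),
]

# min() returns the first row with the smallest MAE, so ties keep the
# earliest model (assumed tie-break, see the note above).
best_model_id, best_data_id, best_mae = min(models, key=lambda row: row[2])

print(best_model_id, best_data_id, best_mae)  # SVM_4 DF_0 2.187302
```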
- Perform prediction on validation data using the identified best model.
>>> gs_obj.predict(newdata=val_df, **eval_params)
############ result Output ############

      id  prediction  MedHouseVal
0  15749   -1.176686        3.500
1   2313   -0.130279        0.863
2   5611    0.736420        1.587
3   5300    1.108423        3.500
4   6558    0.766487        3.594
5   7114    0.762072        2.187
6    670   -1.094195        1.922
7   7581    1.010409        1.334
8  16102   -1.229020        2.841
9   8090    0.404590        1.607
- Perform evaluation on internally sampled data using the best model.
>>> gs_obj.evaluate()
############ result Output ############

        MAE       MSE      MSLE       MAPE        MPE      RMSE     RMSLE        ME        R2        EV  MPD  MGD
0  2.187302  6.385763  0.119889  83.446339  83.446339  2.527007  0.346249  4.414041 -2.361604  0.156949  NaN  NaN
When validation data is not passed to the evaluate() method, internally sampled test data is used for evaluation.
- View the stats report of all trained models.
- Retrieve the model stats of the "gs_obj" instance.
>>> gs_obj.model_stats
   MODEL_ID       MAE        MSE      MSLE        MAPE  ...        ME        R2        EV  MPD  MGD
0     SVM_0  2.199501   6.485457  0.127586   83.595306  ...  4.360486 -2.414085  0.132640  NaN  NaN
1     SVM_1  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
2     SVM_2  2.199501   6.485457  0.127586   83.595306  ...  4.360486 -2.414085  0.132640  NaN  NaN
3     SVM_3  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
4     SVM_4  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
5     SVM_5  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
6     SVM_6  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
7     SVM_7  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
8     SVM_8  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
9     SVM_9  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
10   SVM_10  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
11   SVM_11  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN

[12 rows x 13 columns]
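For readers who want to relate the MAE, MSE and RMSE columns to each other, the sketch below computes them in plain Python using their standard textbook definitions, which are assumed here to match the ones teradataml reports. The input is the ten (prediction, MedHouseVal) pairs displayed by the predict() call in this example. Those rows are only a display sample of the validation data, so the results are not expected to equal the evaluate() figures, which come from internally sampled data.

```python
import math

# (prediction, MedHouseVal) pairs copied from the predict() output shown earlier.
rows = [
    (-1.176686, 3.500), (-0.130279, 0.863), (0.736420, 1.587),
    (1.108423, 3.500), (0.766487, 3.594), (0.762072, 2.187),
    (-1.094195, 1.922), (1.010409, 1.334), (-1.229020, 2.841),
    (0.404590, 1.607),
]

errors = [prediction - actual for prediction, actual in rows]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print(round(mae, 6), round(mse, 6), round(rmse, 6))

# The evaluate() row in this example is consistent with RMSE = sqrt(MSE).
reported_mse, reported_rmse = 6.385763, 2.527007
rmse_from_reported_mse = math.sqrt(reported_mse)
print(round(rmse_from_reported_mse, 6), reported_rmse)
```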
- Update default model with other trained model and perform predictions.
- Find the best model.
>>> gs_obj.best_model_id
'SVM_4'
GridSearch identifies 'SVM_4' as the best model and uses it as the default model.
- Update the default trained model. The default model of a GridSearch instance is updated using the set_model method.
>>> gs_obj.set_model(model_id="SVM_1")
- Perform prediction using "SVM_1" model.
>>> gs_obj.predict(newdata=val_df.iloc[:5], **eval_params)
############ result Output ############

     id  prediction  MedHouseVal
0  3687   -1.067513        1.741
1  6044   -0.524250        1.109
2  5611   -1.042297        1.587
3  3593    0.300275        2.676
4   686    0.688285        1.578
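The default-model behavior described in this section can be pictured with a small stand-in class. This is a hypothetical sketch, not teradataml code: ModelStore and the toy lambda "models" are invented for illustration, and only the method names set_model, get_model and predict echo the GridSearch methods used in this example.

```python
class ModelStore:
    """Hypothetical stand-in showing how a default model drives predict()."""

    def __init__(self, models, best_model_id):
        self._models = dict(models)
        # The best model starts out as the default model.
        self._default_id = best_model_id

    def set_model(self, model_id):
        """Make another trained model the default for later predictions."""
        if model_id not in self._models:
            raise KeyError(f"unknown model id: {model_id}")
        self._default_id = model_id

    def get_model(self, model_id):
        """Return any stored model without changing the default."""
        return self._models[model_id]

    def predict(self, x):
        """Predict with whichever model is currently the default."""
        return self._models[self._default_id](x)


# Toy callables stand in for trained models.
store = ModelStore(
    {"SVM_4": lambda x: 2 * x, "SVM_1": lambda x: x + 1},
    best_model_id="SVM_4",
)

before = store.predict(3)  # uses "SVM_4"
store.set_model("SVM_1")
after = store.predict(3)   # uses "SVM_1"
print(before, after)       # 6 4
```

The point of the pattern is that predict and evaluate calls need no model argument; switching the default once changes every later call.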
- Retrieve any trained model from the GridSearch instance.
>>> gs_obj.get_model("SVM_2")
############ output_data Output ############

   iterNum      loss       eta  bias
0        3  1.749836  0.028868   0.0
1        5  1.649338  0.022361   0.0
2        6  1.616893  0.020412   0.0
3        7  1.590792  0.018898   0.0
4        9  1.554049  0.016667   0.0
5       10  1.541651  0.015811   0.0
6        8  1.569562  0.017678   0.0
7        4  1.693831  0.025000   0.0
8        2  1.808278  0.035355   0.0
9        1  1.932960  0.050000   0.0

############ result Output ############

                         predictor   estimate                value
attribute
-5                             BIC  36.383901                 None
-9         Learning Rate (Initial)   0.050000                 None
-11                       Momentum   0.600000                 None
-14                        Epsilon   0.100000                 None
-1                   Loss Function        NaN  EPSILON_INSENSITIVE
-3          Number of Observations  24.000000                 None
 7                        Latitude  -0.542485                 None
 5                      Population   0.133593                 None
-16                         Kernel        NaN               LINEAR
-7                           Alpha   0.500000           Elasticnet