GridSearch Example 2: Input Data Hyperparameterization for Model Trainer Function Tuning | Teradata Package for Python

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2024
Language: English (United States)
Last Update: 2024-04-09
Product Category: Teradata Vantage

teradataml offers hyperparameterization of training data for hyperparameter tuning tasks. This example builds an SVM regression model to predict house values in California.

In this example, the teradataml California housing example data is sliced into multiple DataFrames, each of which is used to build the SVM regression model.

  1. Example setup.
    1. Load example data from "cal_housing_ex_raw" that contains California housing data.
      >>> load_example_data("teradataml", ["cal_housing_ex_raw"])
    2. Create teradataml DataFrame objects.
      >>> data_input = DataFrame.from_table("cal_housing_ex_raw")
    3. Scale "target_columns" with respect to the unbiased standard deviation ('USTD') of each column.
      >>> fit_obj = ScaleFit(data=data_input,
                             target_columns=['MedInc', 'HouseAge', 'AveRooms',
                                             'AveBedrms', 'Population', 'AveOccup',
                                             'Latitude', 'Longitude'],
                             scale_method="USTD")
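Conceptually, the 'USTD' method scales each column by its unbiased (n-1 denominator) standard deviation. A minimal plain-Python sketch of that idea (illustration only; the in-database ScaleFit function supports further options):

```python
import statistics

def ustd_scale(values):
    # Divide by the unbiased (n-1 denominator) sample standard deviation.
    # Sketch of the idea behind ScaleFit's "USTD" method only; the
    # in-database function supports additional options (e.g. centering).
    ustd = statistics.stdev(values)
    return [v / ustd for v in values]

scaled = ustd_scale([2.0, 4.0, 6.0, 8.0])
```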
    4. Transform the data.
      >>> transform_obj = ScaleTransform(data=data_input,
                                         object=fit_obj.output,
                                         accumulate=["id", "MedHouseVal"])
    5. Sample the data into training and validation DataFrames, where 80% of the data is used for model training and 20% for model validation.
      >>> train_val_sample = transform_obj.result.sample(frac=[0.8, 0.2])
      >>> train_df = train_val_sample[train_val_sample.sampleid == 1].drop("sampleid", axis=1)
      >>> val_df = train_val_sample[train_val_sample.sampleid == 2].drop("sampleid", axis=1)
    6. Create two training data samples for model optimization.
      >>> train_df1 = train_df.iloc[:30]
      >>> train_df2 = train_df.iloc[30:]
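The positional split above can be pictured with plain Python lists, where the first 30 rows form one sample and the remainder the second:

```python
# Stand-in rows for the training sample; iloc-style positional slicing
# splits them into a first sample of 30 rows and a second with the rest.
rows = list(range(100))

sample1 = rows[:30]   # analogous to train_df.iloc[:30]
sample2 = rows[30:]   # analogous to train_df.iloc[30:]
```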
  2. Define a parameter space and use GridSearch for hyperparameterization.
    1. Define parameter space for model training.
      >>> params = {"input_columns":['MedInc', 'HouseAge', 'AveRooms',
                                       'AveBedrms', 'Population', 'AveOccup',
                                       'Latitude', 'Longitude'],
                     "response_column":"MedHouseVal",
                     "model_type":"regression",
                     "batch_size":(11, 50, 75),
                     "iter_max":(100, 301),
                     "lambda1":0.1,
                     "alpha":0.5,
                     "iter_num_no_change":60,
                     "tolerance":0.01,
                     "intercept":False,
                     "learning_rate":"INVTIME",
                     "initial_data":0.5,
                     "decay_rate":0.5,
                     "momentum":0.6,
                     "nesterov_optimization":True,
                     "local_sgd_iterations":1}
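Note that only the tuple-valued entries (batch_size and iter_max) expand the search space; the scalar entries are fixed for every run. A quick sketch of the resulting grid size, which accounts for the 12 models (SVM_0 through SVM_11) trained during fit:

```python
from itertools import product

# Only tuple-valued entries expand the search grid; scalars stay fixed.
batch_size = (11, 50, 75)
iter_max = (100, 301)

grid = list(product(batch_size, iter_max))   # 3 x 2 = 6 combinations
n_models = len(grid) * 2                     # times two input DataFrames
```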
    2. Define the required arguments for model prediction and evaluation.
      >>> eval_params = {"id_column": "id",
                          "accumulate": "MedHouseVal"}
    3. Import the trainer function and optimizer.
      >>> from teradataml import SVM, GridSearch
    4. Initialize the GridSearch optimizer with the model trainer function and the parameter space required for model training.
      >>> gs_obj = GridSearch(func=SVM, params=params)
      Model optimization is initiated using the fit method.
  3. Pass multiple training datasets as a tuple of DataFrames to the model trainer function.
    The DataFrames are treated as hyperparameters, and the hyperparameter tuning execution can be monitored through detailed training logs along with a progress bar.

    Parallel execution mode is disabled, and an early-stop criterion is set.

    >>> gs_obj.fit(data=(train_df1, train_df2),
                   run_parallel=False,
                   early_stop=0.85,
                   evaluation_metric="MAE",
                   verbose=2,
                   **eval_params)
    Model_id:SVM_0 - Run time:11.608s - Status:PASS - MAE:2.2
    Model_id:SVM_1 - Run time:11.624s - Status:PASS - MAE:3.11
    Model_id:SVM_2 - Run time:13.068s - Status:PASS - MAE:2.2
    Model_id:SVM_3 - Run time:11.549s - Status:PASS - MAE:3.11
    Model_id:SVM_4 - Run time:11.965s - Status:PASS - MAE:2.187
    Model_id:SVM_5 - Run time:11.586s - Status:PASS - MAE:3.11
    Model_id:SVM_6 - Run time:12.506s - Status:PASS - MAE:2.187
    Model_id:SVM_7 - Run time:11.695s - Status:PASS - MAE:3.11
    Model_id:SVM_8 - Run time:11.491s - Status:PASS - MAE:2.187
    Model_id:SVM_9 - Run time:14.294s - Status:PASS - MAE:3.11
    Model_id:SVM_10 - Run time:11.422s - Status:PASS - MAE:2.187
    Model_id:SVM_11 - Run time:15.748s - Status:PASS - MAE:3.11
    Completed: |████████████████████████████████████████████████████████████| 100% - 12/12
    All model trainings have passed. In case of a failure, use the get_error_log method to retrieve the corresponding error logs.
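MAE, the evaluation metric chosen above, is the mean absolute error between observed and predicted values; with early_stop=0.85, the search stops as soon as a model attains that metric value (none did here, so all 12 models were trained). A plain-Python sketch of the MAE computation, using illustrative numbers only:

```python
def mean_absolute_error(y_true, y_pred):
    # MAE: average absolute difference between observed and predicted values.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative observed/predicted pairs, not taken from the model runs.
mae = mean_absolute_error([3.5, 0.863, 1.587], [-1.18, -0.13, 0.74])
```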
  4. View the metadata of the hyperparameter-tuned models using the models property. Retrieve the model metadata of the "gs_obj" instance.
    >>> gs_obj.models
       MODEL_ID    DATA_ID                                       PARAMETERS STATUS      MAE
    0     SVM_0    DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  2.199501
    1     SVM_1    DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  3.110119
    2     SVM_2    DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  2.199501
    3     SVM_3    DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  3.110119
    4     SVM_4    DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  2.187302
    5     SVM_5    DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  3.110119
    6     SVM_6    DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  2.187302
    7     SVM_7    DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  3.110119
    8     SVM_8    DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  2.187302
    9     SVM_9    DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  3.110119
    10   SVM_10    DF_0  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  2.187302
    11   SVM_11    DF_1  {'input_columns': ['MedInc', 'HouseAge', 'AveR      PASS  3.110119
    The DATA_ID column contains the unique identifier of the data used to train the corresponding model. The input data can be retrieved using the get_input_data method.
  5. View the best model and corresponding information identified by GridSearch.
    1. Retrieve the best model id identified by "gs_obj" instance.
      >>> gs_obj.best_model_id
      'SVM_4'
    2. Retrieve the best data id.
      >>> gs_obj.best_data_id
      'DF_0'
    3. Retrieve the best model of the "gs_obj" instance.
      >>> gs_obj.best_model
      ############ output_data Output ############
      
         iterNum      loss       eta  bias
      0        3  1.787430  0.028868   0.0
      1        5  1.702996  0.022361   0.0
      2        6  1.671471  0.020412   0.0
      3        7  1.644866  0.018898   0.0
      4        9  1.608033  0.016667   0.0
      5       10  1.596019  0.015811   0.0
      6        8  1.624216  0.017678   0.0
      7        4  1.742452  0.025000   0.0
      8        2  1.834916  0.035355   0.0
      9        1  1.888336  0.050000   0.0
      
      
      ############ result Output ############
      
                                 predictor    estimate                value
       attribute
       -5                             BIC   37.025676                 None
       -9   Learning Rate (Initial)          0.050000                 None
       -11                       Momentum    0.600000                 None
       -14                        Epsilon    0.100000                 None
       -1                   Loss Function         NaN  EPSILON_INSENSITIVE
       -3    Number of Observations         24.000000                 None
        7                        Latitude   -0.538928                 None
        5                      Population    0.091047                 None
       -16                         Kernel         NaN               LINEAR
       -7                          Alpha     0.500000           Elasticnet
      The identified best model is stored as the default model for future prediction and evaluation operations.
  6. Perform prediction on validation data using the identified best model.
    >>> gs_obj.predict(newdata=val_df, **eval_params)
    ############ result Output ############
    
          id  prediction  MedHouseVal
    0  15749   -1.176686        3.500
    1   2313   -0.130279        0.863
    2   5611    0.736420        1.587
    3   5300    1.108423        3.500
    4   6558    0.766487        3.594
    5   7114    0.762072        2.187
    6    670   -1.094195        1.922
    7   7581    1.010409        1.334
    8  16102   -1.229020        2.841
    9   8090    0.404590        1.607
  7. Perform evaluation on internally sampled data using the best model.
    >>> gs_obj.evaluate()
    ############ result Output ############
    
            MAE       MSE      MSLE       MAPE        MPE      RMSE     RMSLE        ME        R2        EV  MPD  MGD
    0  2.187302  6.385763  0.119889  83.446339  83.446339  2.527007  0.346249  4.414041 -2.361604  0.156949  NaN  NaN
    When validation data is not passed to the evaluate() method, internally sampled test data is used for evaluation.
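As a sanity check, the reported metrics are internally consistent: RMSE is the square root of MSE, and RMSLE the square root of MSLE, which can be verified against the evaluate() row above:

```python
import math

# Metrics reported by evaluate() for the best model (SVM_4).
mse, rmse = 6.385763, 2.527007
msle, rmsle = 0.119889, 0.346249

# RMSE/RMSLE are the square roots of MSE/MSLE, to reported precision.
assert math.isclose(math.sqrt(mse), rmse, abs_tol=1e-5)
assert math.isclose(math.sqrt(msle), rmsle, abs_tol=1e-5)
```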
  8. View the stats report of all trained models. Retrieve the model stats of the "gs_obj" instance.
    >>> gs_obj.model_stats
       MODEL_ID       MAE        MSE      MSLE        MAPE  ...        ME        R2        EV  MPD  MGD
    0     SVM_0  2.199501   6.485457  0.127586   83.595306  ...  4.360486 -2.414085  0.132640  NaN  NaN
    1     SVM_1  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
    2     SVM_2  2.199501   6.485457  0.127586   83.595306  ...  4.360486 -2.414085  0.132640  NaN  NaN
    3     SVM_3  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
    4     SVM_4  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
    5     SVM_5  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
    6     SVM_6  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
    7     SVM_7  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
    8     SVM_8  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
    9     SVM_9  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
    10   SVM_10  2.187302   6.385763  0.119889   83.446339  ...  4.414041 -2.361604  0.156949  NaN  NaN
    11   SVM_11  3.110119  10.385778  0.079247  101.796641  ...  4.151917 -5.607787  0.546405  NaN  NaN
    
    [12 rows x 13 columns]
  9. Update the default model with another trained model and perform predictions.
    1. Find the best model.
      >>> gs_obj.best_model_id
      'SVM_4'
      GridSearch identifies 'SVM_4' as the best model, and it is used as the default model.
    2. Update the default trained model. The default model of the GridSearch instance is updated using the set_model method.
      >>> gs_obj.set_model(model_id="SVM_1")
    3. Perform prediction using "SVM_1" model.
      >>> gs_obj.predict(newdata=val_df.iloc[:5], **eval_params)
      ############ result Output ############
      
           id  prediction  MedHouseVal
      0  3687   -1.067513        1.741
      1  6044   -0.524250        1.109
      2  5611   -1.042297        1.587
      3  3593    0.300275        2.676
      4   686    0.688285        1.578
  10. Retrieve any trained model from the GridSearch instance.
    >>> gs_obj.get_model("SVM_2")
    ############ output_data Output ############
    
       iterNum      loss       eta  bias
    0        3  1.749836  0.028868   0.0
    1        5  1.649338  0.022361   0.0
    2        6  1.616893  0.020412   0.0
    3        7  1.590792  0.018898   0.0
    4        9  1.554049  0.016667   0.0
    5       10  1.541651  0.015811   0.0
    6        8  1.569562  0.017678   0.0
    7        4  1.693831  0.025000   0.0
    8        2  1.808278  0.035355   0.0
    9        1  1.932960  0.050000   0.0
    
    
    ############ result Output ############
    
                              predictor    estimate                value
    attribute
    -5                             BIC   36.383901                 None
    -9   Learning Rate (Initial)          0.050000                 None
    -11                       Momentum    0.600000                 None
    -14                        Epsilon    0.100000                 None
    -1                   Loss Function         NaN  EPSILON_INSENSITIVE
    -3    Number of Observations         24.000000                 None
     7                        Latitude   -0.542485                 None
     5                      Population    0.133593                 None
    -16                         Kernel         NaN               LINEAR
    -7                          Alpha     0.500000           Elasticnet