Using H2OPredict to Score using Externally Trained Models

Teradata® Package for Python User Guide

Product: Teradata Package for Python
Release Number: 17.10
Published: May 2022
Language: English (United States)
Last Update: 2022-08-18
dita:id: B700-4006
Product Category: Teradata Vantage

This example uses the iris_input dataset and performs a prediction on each row of the input table using a model previously trained in H2O and then loaded into the database.

  1. Set up the environment.
    1. Import required libraries.
      import tempfile
      import getpass
      from teradataml import H2OPredict, DataFrame, load_example_data, create_context, db_drop_table, remove_context, save_byom, delete_byom, retrieve_byom, list_byom
      from teradataml.options.configure import configure
    2. Create a connection to the database.
      con = create_context(host=getpass.getpass("Hostname: "),
                           username=getpass.getpass("Username: "),
                           password=getpass.getpass("Password: "))
    3. Load example data.
      load_example_data("byom", "iris_input")
      iris_input = DataFrame("iris_input")
  2. Create the training dataset and the test dataset.
    1. Create two samples of input data.
      Sample 1 holds 80% of the total rows and sample 2 holds 20%. The "sampleid" column in the output identifies the sample that each row belongs to.
      iris_sample = iris_input.sample(frac=[0.8, 0.2])
      iris_sample
    2. Create the training dataset.
      This step builds the training dataset from sample 1 by filtering on "sampleid", then drops the "sampleid" column because it is not needed to train the model.
      iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis=1)
      iris_train
    3. Create the test dataset.
      This step builds the test dataset from sample 2 by filtering on "sampleid", then drops the "sampleid" column because it is not needed for scoring.
      iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis=1)
      iris_test
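The sample, filter, and drop pattern in step 2 can be illustrated without a database connection. The following is a minimal plain-Python analogy, not how teradataml implements sample(): the helper names sample_with_ids and select_sample, the toy rows, and the fixed seed are all assumptions made for illustration.

```python
import random


def sample_with_ids(rows, fracs, seed=0):
    """Tag non-overlapping random subsets of rows with a 1-based 'sampleid'.

    Hypothetical helper: it mimics the shape of the sampled output
    (original columns plus 'sampleid'), nothing more.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    tagged, start = [], 0
    for sample_id, frac in enumerate(fracs, start=1):
        count = round(frac * len(rows))
        for row in shuffled[start:start + count]:
            tagged.append({**row, "sampleid": sample_id})
        start += count
    return tagged


def select_sample(tagged, sample_id):
    """Keep the rows of one sample and drop the helper column."""
    return [
        {key: value for key, value in row.items() if key != "sampleid"}
        for row in tagged
        if row["sampleid"] == sample_id
    ]


# Ten toy rows standing in for the input table.
rows = [{"id": i, "sepal_length": 5.0 + i / 10} for i in range(1, 11)]

tagged = sample_with_ids(rows, [0.8, 0.2])
train = select_sample(tagged, 1)  # analogous to iris_train
test = select_sample(tagged, 2)   # analogous to iris_test
```

The point of the sketch is that the two samples never share a row, and that the helper column is removed before the data is handed to training or scoring.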
  3. Train a Gradient Boosting Machine model and score the test data using H2OPredict().
    1. Import required libraries.
      import h2o
      from teradataml.analytics.byom.H2OPredict import H2OPredict
      from h2o.estimators import H2OGradientBoostingEstimator
    2. Prepare the dataset used to train the Gradient Boosting Machine model.
      h2o.init()
      # H2OFrame accepts a pandas DataFrame, so convert the teradataml DataFrame to pandas first.
      iris_train_pd = iris_train.to_pandas()
      h2o_df = h2o.H2OFrame(iris_train_pd)
      h2o_df
    3. Train the Gradient Boosting Machine model.
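      # Convert the response column to a factor so that H2O trains a classification model.
      h2o_df["species"] = h2o_df["species"].asfactor()
      response = "species"
      # Use every column except the response as a predictor.
      predictors = [col for col in h2o_df.columns if col != response]
      gbm_model = H2OGradientBoostingEstimator(nfolds=5, seed=1111, keep_cross_validation_predictions=True)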
      gbm_model.train(x=predictors, y=response, training_frame=h2o_df)
    4. Save the model in MOJO format.
      # Export the trained H2O model as a MOJO file in a temporary directory.
      temp_dir = tempfile.TemporaryDirectory()
      model_file_path = gbm_model.save_mojo(path=temp_dir.name, force=True)
    5. Save the model in Vantage.
      # Save the H2O Model in Vantage.
      save_byom(model_id="h2o_gbm_iris", model_file=model_file_path, table_name="byom_models")
    6. List the models saved in Vantage.
      list_byom("byom_models")
    7. Retrieve the model from Vantage.
      # Retrieve the model from Vantage using the model ID 'h2o_gbm_iris'.
      model = retrieve_byom("h2o_gbm_iris", "byom_models")
    8. Set "configure.byom_install_location" to the database where BYOM functions are installed.
      configure.byom_install_location = getpass.getpass("byom_install_location: ")
    9. Score the test data using the H2OPredict function with the retrieved model.
      # Score the 'iris_test' data with the retrieved model.
      result = H2OPredict(newdata=iris_test,
                          newdata_partition_column='id',
                          newdata_order_column='id',
                          modeldata=model,
                          modeldata_order_column='model_id',
                          model_output_fields=['label', 'classProbabilities'],
                          accumulate=['id', 'sepal_length', 'petal_length'],
                          overwrite_cached_models='*',
                          enable_options='stageProbabilities',
                          model_type='OpenSource'
                         )
    10. Print the equivalent SQL query and the scoring result.
      # Print the query.
      print(result.show_query())
      # Print the result.
      result.result
  4. Clean up.
    # Delete the saved model.
    delete_byom("h2o_gbm_iris", table_name="byom_models")
    # Drop the models table.
    db_drop_table("byom_models")
    # Drop the input data table.
    db_drop_table("iris_input")
    # Run remove_context() to close the connection and garbage collect internally generated objects.
    remove_context()
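The clean-up step above does not remove the temporary directory that holds the exported MOJO file. One option is to use tempfile.TemporaryDirectory as a context manager, so the directory is deleted as soon as the model has been saved to Vantage. The sketch below shows only the directory lifetime: the file name model.zip and the placeholder bytes are assumptions standing in for the output of save_mojo(), and the save_byom() call is left as a comment because it needs a live connection.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_path:
    # Hypothetical stand-in for the file that save_mojo() would write.
    model_file_path = os.path.join(tmp_path, "model.zip")
    with open(model_file_path, "wb") as model_file:
        model_file.write(b"placeholder bytes, not a real MOJO")

    # The file is available for the whole body of the with block.
    exists_inside = os.path.exists(model_file_path)

    # In the real workflow, save_byom(...) would be called here,
    # while the file still exists on disk.

# Leaving the with block deletes the directory and everything in it.
exists_after = os.path.exists(model_file_path)
```

If the walkthrough's temp_dir object is kept instead, calling temp_dir.cleanup() after save_byom() has the same effect.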