Using H2OPredict to Score Using Externally Trained Models | teradataml

Teradata® Package for Python User Guide

Product: Teradata Package for Python
Release Number: 17.00
Published: November 2021
Language: English (United States)
Last Update: 2022-01-14

Document ID: B700-4006
Product Category: Teradata Vantage

This example uses the iris_input dataset and performs a prediction on each row of the input table using a model previously trained in H2O and then loaded into the database.

Set up the environment.

Import required libraries.

```
import getpass
import tempfile

from teradataml import (H2OPredict, DataFrame, load_example_data, create_context,
                        db_drop_table, remove_context, save_byom, delete_byom,
                        retrieve_byom, list_byom)
from teradataml.options.configure import configure
```

Create the connection to the database.

```
con = create_context(host=getpass.getpass("Hostname: "),
                     username=getpass.getpass("Username: "),
                     password=getpass.getpass("Password: "))
```
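Prompting for every value suits an interactive session but not a scheduled job. The following is a minimal sketch, assuming hypothetical environment variable names (TD_HOST, TD_USER, TD_PASSWORD) that are not defined by teradataml: each setting is read from the environment, and the prompt is used only when the variable is missing.

```python
import getpass
import os


def read_setting(env_name, prompt, prompter=getpass.getpass):
    """Return the value of env_name, or prompt for it when it is unset or empty.

    env_name is a hypothetical variable name chosen for this sketch.
    """
    value = os.environ.get(env_name)
    if value:
        return value
    return prompter(prompt)


def connection_settings(prompter=getpass.getpass):
    """Collect the same three keyword arguments passed to create_context() above."""
    return {
        "host": read_setting("TD_HOST", "Hostname: ", prompter),
        "username": read_setting("TD_USER", "Username: ", prompter),
        "password": read_setting("TD_PASSWORD", "Password: ", prompter),
    }

# Possible use with teradataml (not executed here):
#     con = create_context(**connection_settings())
```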

Load example data.

```
load_example_data("byom", "iris_input")
iris_input = DataFrame("iris_input")
```

Create the training dataset and the test dataset.
1. Create two samples of input data.
  This step creates two samples of the input data: sample 1 holds 80% of the rows and sample 2 holds the remaining 20%.
```
iris_sample = iris_input.sample(frac=[0.8, 0.2])
```
```
iris_sample
```
2. Create train dataset.
  This step creates the training dataset from sample 1 by filtering on "sampleid" and then dropping the "sampleid" column, which is not needed to train the model.
```
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
```
```
iris_train
```
3. Create test dataset.
  This step creates the test dataset from sample 2 by filtering on "sampleid" and then dropping the "sampleid" column, which is not needed for scoring.
```
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
```
```
iris_test
```
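To make the role of "sampleid" concrete, the following is a local sketch in plain Python, not teradataml's implementation. It tags each row with a sample ID according to the requested fractions and then filters on that tag, mirroring the three steps above. The helper names, the fixed seed, and the 150-row stand-in input are assumptions made for illustration.

```python
import random


def tag_samples(rows, fractions, seed=0):
    """Return (row, sample_id) pairs; sample IDs start at 1.

    Rows are shuffled, then the first fractions[0] share gets ID 1,
    the next fractions[1] share gets ID 2, and so on. Rows left over
    after rounding are not tagged, so they drop out of the result.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    tagged = []
    start = 0
    for sample_id, fraction in enumerate(fractions, start=1):
        count = round(len(shuffled) * fraction)
        tagged.extend((row, sample_id) for row in shuffled[start:start + count])
        start += count
    return tagged


def select_sample(tagged, sample_id):
    """Keep the rows with the given sample ID and drop the ID itself."""
    return [row for row, row_sample_id in tagged if row_sample_id == sample_id]


rows = list(range(150))  # stand-in for an input table with 150 rows
tagged = tag_samples(rows, [0.8, 0.2])
train_rows = select_sample(tagged, 1)
test_rows = select_sample(tagged, 2)
print(len(train_rows), len(test_rows))
```

Because the two samples come from one shuffled pass, no row can land in both the training set and the test set.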

Train the Gradient Boosting Machine model and perform the prediction using H2OPredict().

Import required libraries.

```
import h2o
from teradataml.analytics.byom.H2OPredict import H2OPredict
from h2o.estimators import H2OGradientBoostingEstimator
```

Prepare the dataset for training the Gradient Boosting Machine model.

```
h2o.init()

# H2OFrame accepts a pandas DataFrame, so convert the teradataml DataFrame to pandas first.
iris_train_pd = iris_train.to_pandas()
h2o_df = h2o.H2OFrame(iris_train_pd)
h2o_df
```

Train the Gradient Boosting Machine model.

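```
# Convert the response column to a factor so that H2O treats this as classification.
response = "species"
h2o_df[response] = h2o_df[response].asfactor()

# Use every column except the response as a predictor.
predictors = [column for column in h2o_df.columns if column != response]

gbm_model = H2OGradientBoostingEstimator(nfolds=5, seed=1111, keep_cross_validation_predictions=True)

gbm_model.train(x=predictors, y=response, training_frame=h2o_df)
```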
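The nfolds=5 argument requests 5-fold cross-validation: the training rows are divided into five folds, and each fold is held out once while a model is fit on the other four. The sketch below illustrates that partitioning in plain Python. It uses a simple modulo assignment chosen for readability; how H2O actually assigns rows to folds depends on its own settings, and the function names here are hypothetical.

```python
def modulo_folds(n_rows, n_folds):
    """Assign row i to fold i % n_folds and return the folds as index lists."""
    folds = [[] for _ in range(n_folds)]
    for row_index in range(n_rows):
        folds[row_index % n_folds].append(row_index)
    return folds


def train_validation_splits(n_rows, n_folds):
    """Yield (training_indices, validation_indices) once per fold."""
    folds = modulo_folds(n_rows, n_folds)
    for held_out in range(n_folds):
        training = [
            row_index
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for row_index in fold
        ]
        yield training, folds[held_out]


# 120 rows is an illustrative size, not a value taken from the example data.
splits = list(train_validation_splits(n_rows=120, n_folds=5))
```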

Save the model in MOJO format.

```
# Save the H2O model to a MOJO file in a temporary directory.
temp_dir = tempfile.TemporaryDirectory()
model_file_path = gbm_model.save_mojo(path=temp_dir.name, force=True)
```
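Before uploading the file, it can help to confirm that it looks like a model archive. The following is a minimal sketch, assuming that a MOJO is a zip archive containing a model.ini entry; treat that layout as an assumption to verify against your H2O version. The helper name is hypothetical, and the demonstration runs on a stand-in archive rather than a real model.

```python
import os
import tempfile
import zipfile


def looks_like_mojo(path):
    """Return True when path is a zip archive that contains a model.ini entry."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "model.ini" in archive.namelist()


# Demonstration on a stand-in archive, not a real model.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.zip")
    with zipfile.ZipFile(demo_path, "w") as archive:
        archive.writestr("model.ini", "[info]\n")
    demo_ok = looks_like_mojo(demo_path)
```

In the workflow above, the check would be called as looks_like_mojo(model_file_path) before the model is saved in Vantage.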

Save the model in Vantage.

```
# Save the H2O model in Vantage.
save_byom(model_id="h2o_gbm_iris", model_file=model_file_path, table_name="byom_models")
```

List the model in Vantage.
```
list_byom("byom_models")
```

Retrieve the model from Vantage.

```
# Retrieve the model from Vantage using the model ID 'h2o_gbm_iris'.
model = retrieve_byom("h2o_gbm_iris", "byom_models")
```

Set "configure.byom_install_location" to the database where BYOM functions are installed.
```
configure.byom_install_location = getpass.getpass("byom_install_location: ")
```

Score the test data using the H2OPredict function with the retrieved model.

```
# Score the 'iris_test' data with the retrieved model.
result = H2OPredict(newdata=iris_test,
                    newdata_partition_column='id',
                    newdata_order_column='id',
                    modeldata=model,
                    modeldata_order_column='model_id',
                    model_output_fields=['label', 'classProbabilities'],
                    accumulate=['id', 'sepal_length', 'petal_length'],
                    overwrite_cached_models='*',
                    enable_options='stageProbabilities',
                    model_type='OpenSource'
                   )
```

Print the equivalent SQL query and the scoring result.

```
# Print the query.
print(result.show_query())

# Display the result.
result.result
```
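The scored output can be compared with the known species to gauge model quality. This is a sketch only. It assumes that "species" has been added to the accumulate list so that it appears in the output, and that the actual and predicted values have been brought to the client as two plain lists. The helper below is hypothetical and works on any pair of equal-length sequences; the sample values are made up and are not results from the model above.

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where actual and predicted agree.

    Values are compared as strings because a predicted label may come back
    as text even when the stored class is numeric (an assumption worth
    checking against your own output).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(str(a) == str(p) for a, p in zip(actual, predicted))
    return matches / len(actual)


# Made-up values for illustration only.
example_accuracy = accuracy([1, 2, 3, 3], ["1", "2", "3", "2"])
```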

Clean up.

```
# Delete the saved model.
delete_byom("h2o_gbm_iris", table_name="byom_models")

# Drop the models table.
db_drop_table("byom_models")

# Drop the input data table.
db_drop_table("iris_input")

# Remove the temporary directory that holds the MOJO file.
temp_dir.cleanup()

# Run remove_context() to close the connection and garbage collect internally generated objects.
remove_context()
```
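If an earlier step raises an error, the cleanup calls above are never reached. One possible pattern, shown here as a sketch rather than anything teradataml prescribes, registers the cleanup steps with contextlib.ExitStack so that they run whether or not the work succeeds. The function name run_with_cleanup is hypothetical.

```python
from contextlib import ExitStack


def run_with_cleanup(work, cleanup_steps):
    """Run work(), then run every cleanup step even if work() raises.

    ExitStack calls registered callbacks in reverse registration order,
    so the steps are registered reversed to preserve the caller's order.
    """
    with ExitStack() as stack:
        for step in reversed(cleanup_steps):
            stack.callback(step)
        return work()

# Possible use with the calls from this example (not executed here):
#     run_with_cleanup(
#         score_test_data,  # hypothetical function wrapping the scoring steps
#         [
#             lambda: delete_byom("h2o_gbm_iris", table_name="byom_models"),
#             lambda: db_drop_table("byom_models"),
#             lambda: db_drop_table("iris_input"),
#             remove_context,
#         ],
#     )
```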