Teradata® Package for Python Function Reference - 20.00

Distributed Random Forest

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00.00.03
Published: December 2024
Product Category: Teradata Vantage

H2OPredict() using a Distributed Random Forest model.

Setup

In [1]:
import tempfile
import getpass
from teradataml import create_context, DataFrame, save_byom, retrieve_byom, \
delete_byom, list_byom, remove_context, load_example_data, db_drop_table
from teradataml.options.configure import configure
from teradataml.analytics.byom.H2OPredict import H2OPredict
import h2o
In [2]:
# Create the connection.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")

con = create_context(host=host, username=username, password=password)

Load the example data and use sample() to split the input data into training and testing datasets.

In [3]:
load_example_data("byom", "iris_input")
iris_input = DataFrame("iris_input")

# Create 2 samples of the input data - sample 1 will have 80% of the total rows and sample 2 will have 20%.
iris_sample = iris_input.sample(frac=[0.8, 0.2])
WARNING: Skipped loading table iris_input since it already exists in the database.
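Conceptually, sample(frac=[0.8, 0.2]) tags each row with a sampleid of 1 or 2 so the two samples can be filtered apart. A stdlib-only sketch of the same idea (a hypothetical helper, not teradataml's actual implementation):

```python
import random

def split_ids(ids, frac_train=0.8, seed=42):
    """Shuffle row ids and cut them into two disjoint samples,
    mimicking sample(frac=[0.8, 0.2]) -> sampleid 1 and sampleid 2."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * frac_train)
    return ids[:cut], ids[cut:]

# The iris dataset has 150 rows with ids 1..150.
train_ids, test_ids = split_ids(range(1, 151))
print(len(train_ids), len(test_ids))  # 120 30
```

The two id lists are disjoint and together cover every row, which is the property the later filter-on-sampleid steps rely on.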
In [4]:
# Create the training dataset from sample 1 by filtering on "sampleid", then drop the "sampleid" column as it is not needed for training.
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis=1)
iris_train
Out[4]:
id sepal_length sepal_width petal_length petal_width species
57 6.3 3.3 4.7 1.6 2
59 6.6 2.9 4.6 1.3 2
36 5.0 3.2 1.2 0.2 1
78 6.7 3.0 5.0 1.7 2
93 5.8 2.6 4.0 1.2 2
101 6.3 3.3 6.0 2.5 3
141 6.7 3.1 5.6 2.4 3
17 5.4 3.9 1.3 0.4 1
116 6.4 3.2 5.3 2.3 3
19 5.7 3.8 1.7 0.3 1
In [5]:
# Create the test dataset from sample 2 by filtering on "sampleid", then drop the "sampleid" column as it is not needed for scoring.
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis=1)
iris_test
Out[5]:
id sepal_length sepal_width petal_length petal_width species
95 5.6 2.7 4.2 1.3 2
59 6.6 2.9 4.6 1.3 2
30 4.7 3.2 1.6 0.2 1
76 6.6 3.0 4.4 1.4 2
18 5.1 3.5 1.4 0.3 1
89 5.6 3.0 4.1 1.3 2
87 6.7 3.1 4.7 1.5 2
121 6.9 3.2 5.7 2.3 3
148 6.5 3.0 5.2 2.0 3
122 5.6 2.8 4.9 2.0 3

Prepare the dataset for creating a Distributed Random Forest model.

In [6]:
h2o.init()

# H2OFrame accepts a pandas DataFrame, so convert the teradataml DataFrame to pandas first.
iris_train_pd = iris_train.to_pandas()
h2o_df = h2o.H2OFrame(iris_train_pd)
h2o_df
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "11.0.2" 2019-01-15 LTS; Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS); Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
  Starting server from /Users/gp186005/anaconda3/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/92/yt4b9fh178x2xhc_tnhpvzsr0000gn/T/tmpurl5ttsm
  JVM stdout: /var/folders/92/yt4b9fh178x2xhc_tnhpvzsr0000gn/T/tmpurl5ttsm/h2o_gp186005_started_from_python.out
  JVM stderr: /var/folders/92/yt4b9fh178x2xhc_tnhpvzsr0000gn/T/tmpurl5ttsm/h2o_gp186005_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 15 secs
H2O_cluster_timezone: America/Los_Angeles
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.32.1.6
H2O_cluster_version_age: 1 month and 21 days
H2O_cluster_name: H2O_from_python_gp186005_ip5q0u
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 4 Gb
H2O_cluster_total_cores: 12
H2O_cluster_allowed_cores: 12
H2O_cluster_status: accepting new members, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.3 final
Parse progress: |█████████████████████████████████████████████████████████| 100%
Out[6]:
sepal_length sepal_width petal_length petal_width species
5 2 3.5 1 2
6.3 3.3 6 2.5 3
5.1 3.4 1.5 0.2 1
5.6 2.8 4.9 2 3
4.9 3.6 1.4 0.1 1
6.7 3.1 5.6 2.4 3
5.7 2.6 3.5 1 2
6.6 2.9 4.6 1.3 2
6.7 3 5 1.7 2
4.8 3 1.4 0.1 1

Train the Distributed Random Forest model.

In [7]:
# Import required libraries.
from h2o.estimators import H2ORandomForestEstimator
In [8]:
# Convert the response column to a factor and set up the predictors and response for training.
h2o_df["species"] = h2o_df["species"].asfactor()
predictors = h2o_df.columns
response = "species"
In [9]:
# Create the DRF estimator; max_depth=0 lets trees grow without a depth limit.
iris_rf = H2ORandomForestEstimator(ntrees=100, max_depth=0)
In [10]:
iris_rf.train(x=predictors, y=response, training_frame=h2o_df)
drf Model Build progress: |███████████████████████████████████████████████| 100%

Save the model in MOJO format.

In [11]:
# Save the H2O model to a file.
temp_dir = tempfile.TemporaryDirectory()
model_file_path = iris_rf.save_mojo(path=f"{temp_dir.name}", force=True)

Save the model in Vantage.

In [13]:
# Save the H2O Model in Vantage.
save_byom("h2o_rf_iris", 
          model_file_path,
          table_name="byom_models", 
          additional_columns={"description": "Random forest model generated using H2O"}
         )
Created the model table 'byom_models' as it does not exist.
Model is saved.

List the models from Vantage.

In [14]:
# List the models from "byom_models".
list_byom("byom_models")
                                model                              description
model_id                                                                      
h2o_rf_iris  b'504B03041400080808...'  Random forest model generated using H2O
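The model column stores the MOJO file as a BLOB, and the b'504B0304...' prefix shown above is the hex encoding of the ZIP local-file-header signature: a MOJO file is an ordinary ZIP archive. A quick stdlib check of that prefix:

```python
# First bytes of the stored model, exactly as shown by list_byom().
prefix = bytes.fromhex("504B0304")

# 'PK\x03\x04' is the ZIP local-file-header magic number, confirming
# that the persisted MOJO is a ZIP archive.
print(prefix)  # b'PK\x03\x04'
```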

Retrieve the model from Vantage.

In [15]:
# Retrieve the model from Vantage using the model id 'h2o_rf_iris'.
modeldata = retrieve_byom(model_id="h2o_rf_iris", table_name="byom_models")

Set "configure.byom_install_location" to the database where BYOM functions are installed.

In [16]:
configure.byom_install_location = getpass.getpass("byom_install_location: ")

Score the model.

In [17]:
result = H2OPredict(newdata=iris_test,
                    newdata_partition_column='id',
                    newdata_order_column='id',
                    modeldata=modeldata,
                    modeldata_order_column='model_id',
                    model_output_fields=['label', 'classProbabilities'],
                    accumulate=['id', 'sepal_length', 'petal_length'],
                    overwrite_cached_models='*',
                    enable_options='stageProbabilities',
                    model_type='OpenSource'
                   )
In [18]:
# Print the query.
print(result.show_query())
SELECT * FROM "mldb".H2OPredict(
	ON "MLDB"."ml__select__16344897347247" AS InputTable
	PARTITION BY "id"
	ORDER BY "id" 
	ON (select model_id,model from "MLDB"."ml__filter__16344924296580") AS ModelTable
	DIMENSION
	ORDER BY "model_id"
	USING
	Accumulate('id','sepal_length','petal_length')
	ModelOutputFields('label','classProbabilities')
	OverwriteCachedModel('*')
	EnableOptions('stageProbabilities')
) as sqlmr
In [19]:
# Print the result.
result.result
Out[19]:
id sepal_length petal_length prediction label classprobabilities
69 6.2 4.5 2 2 {"1": 7.458774870464025E-4,"2": 0.9582039807214917,"3": 0.04105014179146193}
37 5.5 1.3 1 1 {"1": 0.9896364785244597,"2": 0.010090797219753938,"3": 2.727242557863893E-4}
44 5.0 1.6 1 1 {"1": 0.999729802749471,"2": 0.0,"3": 2.7019725052900413E-4}
16 5.7 1.5 1 1 {"1": 0.9896364785244597,"2": 0.010090797219753938,"3": 2.727242557863893E-4}
81 5.5 3.8 2 2 {"1": 7.309748693940461E-4,"2": 0.9989990253874769,"3": 2.6999974312908804E-4}
18 5.1 1.4 1 1 {"1": 0.999729802749471,"2": 0.0,"3": 2.7019725052900413E-4}
24 5.1 1.7 1 1 {"1": 0.999729802749471,"2": 0.0,"3": 2.7019725052900413E-4}
25 4.8 1.9 1 1 {"1": 0.999729802749471,"2": 0.0,"3": 2.7019725052900413E-4}
34 5.5 1.4 1 1 {"1": 0.9896364785244597,"2": 0.010090797219753938,"3": 2.727242557863893E-4}
4 4.6 1.5 1 1 {"1": 0.999729802749471,"2": 0.0,"3": 2.7019725052900413E-4}
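Each classprobabilities value is a JSON object keyed by class label. A small stdlib sketch for extracting the winning label from one of the strings shown above:

```python
import json

# One classprobabilities value copied from the result above.
raw = '{"1": 7.458774870464025E-4,"2": 0.9582039807214917,"3": 0.04105014179146193}'

probs = json.loads(raw)            # JSON exponent notation parses fine
label = max(probs, key=probs.get)  # class with the highest probability
print(label, round(probs[label], 4))  # 2 0.9582
```

The extracted label matches the label column the function already emits, so this is mainly useful when the downstream consumer needs the full probability vector rather than just the prediction.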

Cleanup.

In [20]:
# Delete the saved model.
delete_byom("h2o_rf_iris", table_name="byom_models")
Model is deleted.
In [21]:
# Drop model table.
db_drop_table("byom_models")
Out[21]:
True
In [22]:
# Drop input data table.
db_drop_table("iris_input")
Out[22]:
True
In [23]:
# Run remove_context() to close the connection and garbage collect internally generated objects.
remove_context()
Out[23]:
True