H2OPredict() using PCA model.¶
Setup¶
In [1]:
import tempfile
import getpass
from teradataml import create_context, DataFrame, save_byom, retrieve_byom, \
delete_byom, list_byom, remove_context, load_example_data, db_drop_table
from teradataml.options.configure import configure
from teradataml.analytics.byom.H2OPredict import H2OPredict
import h2o
In [2]:
# Create the connection.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")
con = create_context(host=host, username=username, password=password)
Host: ········ Username: ········ Password: ········
Load the example data and use sample() to split the input data into training and testing datasets.¶
In [3]:
# Load the example data.
load_example_data("byom", "iris_input")
In [4]:
# Create teradataml DataFrames.
iris_input = DataFrame("iris_input")
In [5]:
# Create two samples of the input data: sample 1 will have 80% of the total rows and sample 2 will have 20% of the total rows.
iris_sample = iris_input.sample(frac=[0.8, 0.2])
In [6]:
# Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis=1)
iris_train
Out[6]:
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
17 | 5.4 | 3.9 | 1.3 | 0.4 | 1 |
38 | 4.9 | 3.6 | 1.4 | 0.1 | 1 |
78 | 6.7 | 3.0 | 5.0 | 1.7 | 2 |
122 | 5.6 | 2.8 | 4.9 | 2.0 | 3 |
59 | 6.6 | 2.9 | 4.6 | 1.3 | 2 |
40 | 5.1 | 3.4 | 1.5 | 0.2 | 1 |
120 | 6.0 | 2.2 | 5.0 | 1.5 | 3 |
57 | 6.3 | 3.3 | 4.7 | 1.6 | 2 |
19 | 5.7 | 3.8 | 1.7 | 0.3 | 1 |
61 | 5.0 | 2.0 | 3.5 | 1.0 | 2 |
In [7]:
# Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis=1)
iris_test
Out[7]:
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
87 | 6.7 | 3.1 | 4.7 | 1.5 | 2 |
76 | 6.6 | 3.0 | 4.4 | 1.4 | 2 |
116 | 6.4 | 3.2 | 5.3 | 2.3 | 3 |
99 | 5.1 | 2.5 | 3.0 | 1.1 | 2 |
114 | 5.7 | 2.5 | 5.0 | 2.0 | 3 |
80 | 5.7 | 2.6 | 3.5 | 1.0 | 2 |
118 | 7.7 | 3.8 | 6.7 | 2.2 | 3 |
55 | 6.5 | 2.8 | 4.6 | 1.5 | 2 |
15 | 5.8 | 4.0 | 1.2 | 0.2 | 1 |
61 | 5.0 | 2.0 | 3.5 | 1.0 | 2 |
Prepare the dataset for creating the PCA model.¶
In [8]:
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected. Warning: Your H2O cluster version is too old (5 months and 7 days)! Please download and install the latest version from http://h2o.ai/download/
H2O_cluster_uptime: | 15 mins 54 secs |
H2O_cluster_timezone: | Asia/Kolkata |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.34.0.1 |
H2O_cluster_version_age: | 5 months and 7 days !!! |
H2O_cluster_name: | H2O_from_python_pg255042_ie998m |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 6.507 Gb |
H2O_cluster_total_cores: | 16 |
H2O_cluster_allowed_cores: | 16 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://localhost:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
H2O_API_Extensions: | Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 |
Python_version: | 3.6.12 final |
In [9]:
# Since H2OFrame accepts a pandas DataFrame, convert the teradataml DataFrame to a pandas DataFrame.
# Note that the "id" column is returned as the pandas index, so it does not appear as a column in the H2OFrame below.
iris_train_pd = iris_train.to_pandas()
h2o_df = h2o.H2OFrame(iris_train_pd)
h2o_df
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5 | 2 | 3.5 | 1 | 2 |
6.3 | 3.3 | 6 | 2.5 | 3 |
5.1 | 3.4 | 1.5 | 0.2 | 1 |
5.6 | 2.8 | 4.9 | 2 | 3 |
4.9 | 3.6 | 1.4 | 0.1 | 1 |
6.7 | 3.1 | 5.6 | 2.4 | 3 |
5.7 | 2.6 | 3.5 | 1 | 2 |
5.7 | 3.8 | 1.7 | 0.3 | 1 |
6.7 | 3 | 5 | 1.7 | 2 |
5.4 | 3.9 | 1.3 | 0.4 | 1 |
Out[9]:
Train PCA Model.¶
In [10]:
# Import required libraries.
from h2o.estimators import H2OPrincipalComponentAnalysisEstimator
In [11]:
# Prepare the data for training the model: convert the response column to a factor and define the predictor columns and the response.
h2o_df["species"] = h2o_df["species"].asfactor()
predictors = h2o_df.columns
response = "species"
In [12]:
pca_model = H2OPrincipalComponentAnalysisEstimator(k=4,
                                                   use_all_factor_levels=True,
                                                   pca_method="glrm",
                                                   transform="standardize",
                                                   impute_missing=True)
In [13]:
pca_model.train(x=predictors, y=response, training_frame=h2o_df)
pca Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OPrincipalComponentAnalysisEstimator : Principal Components Analysis
Model Key: PCA_model_python_1645512818794_3

Importance of components:
 | pc1 | pc2 | pc3 | pc4 |
---|---|---|---|---|
Standard deviation | 1.621899 | 1.320243 | 0.647021 | 0.464940 |
Proportion of Variance | 0.525229 | 0.348023 | 0.083587 | 0.043161 |
Cumulative Proportion | 0.525229 | 0.873252 | 0.956839 | 1.000000 |
ModelMetricsPCA: pca
** Reported on train data. **

MSE: NaN
RMSE: NaN

Scoring history from GLRM:
 | timestamp | duration | iterations | step_size | objective |
---|---|---|---|---|---|
0 | 2022-02-22 12:39:50 | 0.062 sec | 0.0 | 0.666667 | 894.252841 | |
1 | 2022-02-22 12:39:50 | 0.062 sec | 1.0 | 0.444444 | 894.252841 | |
2 | 2022-02-22 12:39:50 | 0.062 sec | 2.0 | 0.222222 | 894.252841 | |
3 | 2022-02-22 12:39:50 | 0.078 sec | 3.0 | 0.233333 | 712.328243 | |
4 | 2022-02-22 12:39:50 | 0.078 sec | 4.0 | 0.155556 | 712.328243 | |
5 | 2022-02-22 12:39:50 | 0.078 sec | 5.0 | 0.163333 | 588.726748 | |
6 | 2022-02-22 12:39:50 | 0.078 sec | 6.0 | 0.171500 | 588.159003 | |
7 | 2022-02-22 12:39:50 | 0.078 sec | 7.0 | 0.180075 | 244.333433 | |
8 | 2022-02-22 12:39:50 | 0.078 sec | 8.0 | 0.189079 | 113.292014 | |
9 | 2022-02-22 12:39:50 | 0.078 sec | 9.0 | 0.198533 | 48.716534 | |
10 | 2022-02-22 12:39:50 | 0.078 sec | 10.0 | 0.208459 | 27.685304 | |
11 | 2022-02-22 12:39:50 | 0.078 sec | 11.0 | 0.218882 | 20.933356 | |
12 | 2022-02-22 12:39:50 | 0.078 sec | 12.0 | 0.229826 | 19.800430 | |
13 | 2022-02-22 12:39:50 | 0.084 sec | 13.0 | 0.241318 | 19.242464 | |
14 | 2022-02-22 12:39:50 | 0.084 sec | 14.0 | 0.253384 | 18.975207 | |
15 | 2022-02-22 12:39:50 | 0.084 sec | 15.0 | 0.168922 | 18.975207 | |
16 | 2022-02-22 12:39:50 | 0.084 sec | 16.0 | 0.177369 | 18.671021 | |
17 | 2022-02-22 12:39:50 | 0.084 sec | 17.0 | 0.186237 | 18.653417 | |
18 | 2022-02-22 12:39:50 | 0.084 sec | 18.0 | 0.124158 | 18.653417 | |
19 | 2022-02-22 12:39:50 | 0.084 sec | 19.0 | 0.130366 | 17.888250 |
See the whole table with table.as_data_frame()
Out[13]:
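Optionally, the trained PCA model can be checked locally before exporting it. As a minimal sketch (assuming the H2O cluster started above is still running), predict() on a PCA model returns the projection of each row onto the principal components:

# Project the training frame onto the principal components.
# For a PCA model, predict() returns one column per component (PC1..PC4 here).
projections = pca_model.predict(h2o_df)
projections.head()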
Save the model in MOJO format.¶
In [14]:
# Save the H2O model to a file in MOJO format.
temp_dir = tempfile.TemporaryDirectory()
model_file_path = pca_model.save_mojo(path=temp_dir.name, force=True)
Save the model in Vantage.¶
In [15]:
# Save the H2O Model in Vantage.
save_byom(model_id="h2o_pca_iris", model_file=model_file_path, table_name="byom_models")
Created the model table 'byom_models' as it does not exist. Model is saved.
In [16]:
# List the models from "byom_models".
list_byom("byom_models")
                                  model
model_id
h2o_pca_iris  b'504B03041400080808...'
Retrieve the model from Vantage.¶
In [17]:
# Retrieve the model from Vantage using the model id 'h2o_pca_iris'.
model = retrieve_byom(model_id="h2o_pca_iris", table_name="byom_models")
In [18]:
configure.byom_install_location = getpass.getpass("byom_install_location: ")
byom_install_location: ········
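Note: configure.byom_install_location must point to the database where the BYOM functions (such as H2OPredict) are installed in Vantage. If that database name is already known, it can be set directly instead of prompting; the name "mldb" below matches the database shown in the generated query later in this notebook, but is only an assumption for your system:

# Set the BYOM install location directly ("mldb" is an assumption;
# use the database where the BYOM functions are installed on your system).
configure.byom_install_location = "mldb"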
Score the model.¶
In [19]:
# Score the model on 'iris_test' data.
result = H2OPredict(newdata=iris_test,
                    newdata_partition_column='id',
                    newdata_order_column='id',
                    modeldata=model,
                    modeldata_order_column='model_id',
                    accumulate=['id', 'sepal_length', 'petal_length'],
                    overwrite_cached_models='*',
                    model_type='OpenSource'
                    )
In [20]:
# Print the query.
print(result.show_query())
SELECT * FROM "mldb".H2OPredict( ON "MLDB"."ml__select__1645515297986767" AS InputTable PARTITION BY "id" ORDER BY "id" ON (select model_id,model from "MLDB"."ml__filter__1645516153481333") AS ModelTable DIMENSION ORDER BY "model_id" USING Accumulate('id','sepal_length','petal_length') OverwriteCachedModel('*') ) as sqlmr
In [21]:
# Print the result.
result.result
Out[21]:
id | sepal_length | petal_length | prediction | json_report |
---|---|---|---|---|
52 | 6.4 | 4.5 | {"dimensions":[-0.1441832230496968,-0.9194340730124243,-0.615501545531632,0.668914840466421]} | |
18 | 5.1 | 1.4 | {"dimensions":[-1.3167810354416312,1.554559149811599,-0.29399982340185005,0.22075453102419862]} | |
24 | 5.1 | 1.7 | {"dimensions":[-1.1831048786134577,1.3498913361132854,-0.3301298202908464,-0.1546627605928083]} | |
28 | 5.2 | 1.5 | {"dimensions":[-1.3127184928032658,1.5368862516091986,-0.2880898531286974,0.28988895260585745]} | |
63 | 6.0 | 4.0 | {"dimensions":[-0.4296527482753505,-1.2193116326341704,-0.4195427606020329,-1.4738343592564334]} | |
23 | 4.6 | 1.0 | {"dimensions":[-1.5525177295466552,1.826668060039769,-0.23485386079713957,0.08217345112567695]} | |
36 | 5.0 | 1.2 | {"dimensions":[-1.3859266051223962,1.454377300716858,-0.242108780864958,-0.4118185383859379]} | |
59 | 6.6 | 4.6 | {"dimensions":[-0.12708038384252834,-1.118665879271585,-0.5793035674628677,0.23845396949628075]} | |
51 | 7.0 | 4.7 | {"dimensions":[0.006838305319815413,-1.0866813881782158,-0.6429165658028986,1.0606461578918103]} | |
2 | 4.9 | 1.4 | {"dimensions":[-1.3674041623918451,1.3364292790567744,-0.23401428509572839,-0.8484336521271371]} |
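The principal component scores are returned as a JSON string containing a "dimensions" array, as shown in the output above. As a minimal sketch (assuming the JSON lands in the prediction column, as displayed above), the scores can be expanded on the client side:

import json

# Bring the scored rows to the client and parse the "dimensions" array
# (the principal component scores) out of the JSON string.
scores_pd = result.result.to_pandas()
scores_pd["pc_scores"] = scores_pd["prediction"].apply(
    lambda s: json.loads(s)["dimensions"])
scores_pd.head()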
Cleanup.¶
In [22]:
# Delete the saved model from the table "byom_models", using the model id "h2o_pca_iris".
delete_byom("h2o_pca_iris", table_name="byom_models")
Model is deleted.
In [23]:
# Drop model table.
db_drop_table("byom_models")
Out[23]:
True
In [24]:
# Drop input data table.
db_drop_table("iris_input")
Out[24]:
True
In [25]:
# One must run remove_context() to close the connection and garbage collect internally generated objects.
remove_context()
Out[25]:
True