H2OPredict() using a Generalized Low Rank model.¶
Setup¶
In [1]:
# Import required libraries
import tempfile
import getpass
from teradataml import create_context, DataFrame, save_byom, retrieve_byom, \
delete_byom, list_byom, remove_context, load_example_data, db_drop_table
from teradataml.options.configure import configure
from teradataml.analytics.byom.H2OPredict import H2OPredict
import h2o
In [2]:
# Create the connection.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")
con = create_context(host=host, username=username, password=password)
Host: ········ Username: ········ Password: ········
Load example data and use sample() to split the input data into training and testing datasets.¶
In [3]:
# Load the example data.
load_example_data("byom", "iris_input")
In [4]:
# Create the teradataml DataFrames.
iris_input = DataFrame("iris_input")
In [5]:
# Create 2 samples of input data - sample 1 will have 80% of total rows and sample 2 will have 20% of total rows.
iris_sample = iris_input.sample(frac=[0.8, 0.2])
In [6]:
# Create the train dataset from sample 1 by filtering on "sampleid" and drop the "sampleid" column, as it is not required for training the model.
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
iris_train
Out[6]:
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
120 | 6.0 | 2.2 | 5.0 | 1.5 | 3 |
59 | 6.6 | 2.9 | 4.6 | 1.3 | 2 |
99 | 5.1 | 2.5 | 3.0 | 1.1 | 2 |
61 | 5.0 | 2.0 | 3.5 | 1.0 | 2 |
78 | 6.7 | 3.0 | 5.0 | 1.7 | 2 |
141 | 6.7 | 3.1 | 5.6 | 2.4 | 3 |
17 | 5.4 | 3.9 | 1.3 | 0.4 | 1 |
34 | 5.5 | 4.2 | 1.4 | 0.2 | 1 |
38 | 4.9 | 3.6 | 1.4 | 0.1 | 1 |
122 | 5.6 | 2.8 | 4.9 | 2.0 | 3 |
In [7]:
# Create the test dataset from sample 2 by filtering on "sampleid" and drop the "sampleid" column, as it is not required for scoring.
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
iris_test
Out[7]:
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
95 | 5.6 | 2.7 | 4.2 | 1.3 | 2 |
74 | 6.1 | 2.8 | 4.7 | 1.2 | 2 |
114 | 5.7 | 2.5 | 5.0 | 2.0 | 3 |
61 | 5.0 | 2.0 | 3.5 | 1.0 | 2 |
5 | 5.0 | 3.6 | 1.4 | 0.2 | 1 |
13 | 4.8 | 3.0 | 1.4 | 0.1 | 1 |
11 | 5.4 | 3.7 | 1.5 | 0.2 | 1 |
49 | 5.3 | 3.7 | 1.5 | 0.2 | 1 |
76 | 6.6 | 3.0 | 4.4 | 1.4 | 2 |
122 | 5.6 | 2.8 | 4.9 | 2.0 | 3 |
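As an optional sanity check (not part of the original flow), the shape attribute of a teradataml DataFrame can be used to confirm that the split is roughly 80/20. A minimal sketch, assuming the sampled DataFrames created above:

# Optional check: confirm the approximate 80/20 split of the sampled data.
# shape returns (number_of_rows, number_of_columns) for a teradataml DataFrame.
print("train:", iris_train.shape)
print("test: ", iris_test.shape)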
Prepare the dataset for creating a Generalized Low Rank model.¶
In [8]:
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode)
  Starting server from C:\Users\pg255042\Anaconda3\envs\teraml\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\pg255042\AppData\Local\Temp\tmp75i0b0gf
  JVM stdout: C:\Users\pg255042\AppData\Local\Temp\tmp75i0b0gf\h2o_pg255042_started_from_python.out
  JVM stderr: C:\Users\pg255042\AppData\Local\Temp\tmp75i0b0gf\h2o_pg255042_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Warning: Your H2O cluster version is too old (5 months and 7 days)! Please download and install the latest version from http://h2o.ai/download/
H2O_cluster_uptime: | 02 secs |
H2O_cluster_timezone: | Asia/Kolkata |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.34.0.1 |
H2O_cluster_version_age: | 5 months and 7 days !!! |
H2O_cluster_name: | H2O_from_python_pg255042_ie998m |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 7.052 Gb |
H2O_cluster_total_cores: | 16 |
H2O_cluster_allowed_cores: | 16 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://127.0.0.1:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
H2O_API_Extensions: | Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 |
Python_version: | 3.6.12 final |
In [9]:
# H2OFrame accepts a pandas DataFrame, so convert the teradataml DataFrame to a pandas DataFrame.
iris_train_pd = iris_train.to_pandas()
h2o_df = h2o.H2OFrame(iris_train_pd)
h2o_df
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5 | 2 | 3.5 | 1 | 2 |
6.3 | 3.3 | 6 | 2.5 | 3 |
5.1 | 3.4 | 1.5 | 0.2 | 1 |
6.6 | 2.9 | 4.6 | 1.3 | 2 |
6.7 | 3 | 5 | 1.7 | 2 |
6.7 | 3.1 | 5.6 | 2.4 | 3 |
5.7 | 2.6 | 3.5 | 1 | 2 |
5.1 | 2.5 | 3 | 1.1 | 2 |
6.6 | 3 | 4.4 | 1.4 | 2 |
5.4 | 3.9 | 1.3 | 0.4 | 1 |
Out[9]:
Train Generalized Low Rank Model.¶
In [10]:
# Import required libraries.
from h2o.estimators import H2OGeneralizedLowRankEstimator
In [11]:
# Prepare the training data: treat "species" as a categorical column and set the predictors and response.
h2o_df["species"] = h2o_df["species"].asfactor()
predictors = h2o_df.columns
response = "species"
In [12]:
glrm_model = H2OGeneralizedLowRankEstimator(k=4,
loss="quadratic",
gamma_x=0.5,
gamma_y=0.5,
max_iterations=700,
recover_svd=True,
init="SVD",
transform="standardize")
In [13]:
glrm_model.train(x=predictors, y=response, training_frame=h2o_df)
glrm Model Build progress: |█████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OGeneralizedLowRankEstimator : Generalized Low Rank Modeling
Model Key: GLRM_model_python_1645512818794_1
Model Summary:
 | number_of_iterations | final_step_size | final_objective_value |
---|---|---|---|
0 | 164.0 | 0.000054 | 2.439693 |
ModelMetricsGLRM: glrm
** Reported on train data. **
MSE: NaN
RMSE: NaN
Sum of Squared Error (Numeric): 2.4396928934183713
Misclassification Error (Categorical): 0.0
Scoring History:
 | timestamp | duration | iterations | step_size | objective |
---|---|---|---|---|---|
0 | 2022-02-22 12:23:42 | 0.194 sec | 0.0 | 0.666667 | 267.142579 | |
1 | 2022-02-22 12:23:42 | 0.202 sec | 1.0 | 0.444444 | 267.142579 | |
2 | 2022-02-22 12:23:42 | 0.204 sec | 2.0 | 0.222222 | 267.142579 | |
3 | 2022-02-22 12:23:42 | 0.204 sec | 3.0 | 0.233333 | 115.558661 | |
4 | 2022-02-22 12:23:42 | 0.204 sec | 4.0 | 0.245000 | 81.991977 | |
5 | 2022-02-22 12:23:42 | 0.212 sec | 5.0 | 0.163333 | 81.991977 | |
6 | 2022-02-22 12:23:42 | 0.214 sec | 6.0 | 0.108889 | 81.991977 | |
7 | 2022-02-22 12:23:42 | 0.214 sec | 7.0 | 0.114333 | 78.045369 | |
8 | 2022-02-22 12:23:42 | 0.214 sec | 8.0 | 0.120050 | 32.878496 | |
9 | 2022-02-22 12:23:42 | 0.224 sec | 9.0 | 0.126053 | 26.644778 | |
10 | 2022-02-22 12:23:42 | 0.224 sec | 10.0 | 0.132355 | 22.624142 | |
11 | 2022-02-22 12:23:42 | 0.224 sec | 11.0 | 0.138973 | 21.787503 | |
12 | 2022-02-22 12:23:42 | 0.224 sec | 12.0 | 0.145922 | 20.384177 | |
13 | 2022-02-22 12:23:42 | 0.232 sec | 13.0 | 0.153218 | 20.325123 | |
14 | 2022-02-22 12:23:42 | 0.234 sec | 14.0 | 0.102145 | 20.325123 | |
15 | 2022-02-22 12:23:42 | 0.234 sec | 15.0 | 0.107252 | 19.113147 | |
16 | 2022-02-22 12:23:42 | 0.234 sec | 16.0 | 0.112615 | 18.151736 | |
17 | 2022-02-22 12:23:42 | 0.234 sec | 17.0 | 0.118246 | 16.879529 | |
18 | 2022-02-22 12:23:42 | 0.234 sec | 18.0 | 0.124158 | 15.773687 | |
19 | 2022-02-22 12:23:42 | 0.234 sec | 19.0 | 0.130366 | 14.635686 |
See the whole table with table.as_data_frame()
Out[13]:
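Before exporting the model, it can be verified locally. For a GLRM model, predict() reconstructs the input frame from the learned low-rank factors rather than returning class labels; the cell below is an optional sketch under that assumption and is not part of the original example.

# Optional local check: reconstruct the training frame from the low-rank factors.
# H2O typically names the reconstructed columns reconstr_<original_column>.
reconstructed = glrm_model.predict(h2o_df)
reconstructed.head(rows=5)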
Save the model in MOJO format.¶
In [14]:
# Save the H2O model to a temporary file in MOJO format.
temp_dir = tempfile.TemporaryDirectory()
model_file_path = glrm_model.save_mojo(path=f"{temp_dir.name}", force=True)
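As an optional step (not in the original notebook), confirm the MOJO file was written before saving it to Vantage:

import os

# Optional check: the exported MOJO zip file should exist and be non-empty.
print(model_file_path, "-", os.path.getsize(model_file_path), "bytes")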
Save the model in Vantage.¶
In [15]:
# Save the H2O Model in Vantage.
save_byom(model_id="h2o_glrm_iris", model_file=model_file_path, table_name="byom_models")
Created the model table 'byom_models' as it does not exist. Model is saved.
In [16]:
# List the models from "byom_models".
list_byom("byom_models")
                                  model
model_id
h2o_glrm_iris  b'504B03041400080808...'
Retrieve the model from Vantage.¶
In [17]:
# Retrieve the model from Vantage using the model id 'h2o_glrm_iris'.
model=retrieve_byom(model_id="h2o_glrm_iris", table_name="byom_models")
In [18]:
configure.byom_install_location = getpass.getpass("byom_install_location: ")
byom_install_location: ········
Score the model.¶
In [19]:
# Score the model on 'iris_test' data.
result = H2OPredict(newdata=iris_test,
newdata_partition_column='id',
newdata_order_column='id',
modeldata=model,
modeldata_order_column='model_id',
accumulate=['id', 'sepal_length', 'petal_length'],
overwrite_cached_models='*',
model_type='OpenSource'
)
In [20]:
# Print the query.
print(result.show_query())
SELECT * FROM "mldb".H2OPredict( ON "MLDB"."ml__select__1645515524758530" AS InputTable PARTITION BY "id" ORDER BY "id" ON (select model_id,model from "MLDB"."ml__filter__1645515219191021") AS ModelTable DIMENSION ORDER BY "model_id" USING Accumulate('id','sepal_length','petal_length') OverwriteCachedModel('*') ) as sqlmr
In [21]:
# Print the result.
result.result
Out[21]:
id | sepal_length | petal_length | prediction | json_report |
---|---|---|---|---|
22 | 5.1 | 1.5 | {"dimensions":[-0.8755884190821872,0.4472017598306644,0.0532540017688253,0.2505514991946427],"reconstructed":[-1.009702009214914,1.5104019390305319,-1.3566496362877019,-1.1886252973763436,0.0]} | |
69 | 6.2 | 4.5 | {"dimensions":[0.44062561105843706,-0.7916202873653729,-0.3336130726414549,-0.4420232986958351],"reconstructed":[0.4809519782109252,-1.93648862326932,0.5148422067271403,0.049406695050708264,1.0]} | |
70 | 5.6 | 3.9 | {"dimensions":[0.13632624776477564,-0.6378740362476013,-0.0722942433882558,-0.33335869761244624],"reconstructed":[-0.27350355503007406,-1.2936172770748544,0.07635317739547576,-0.10145571777357752,1.0]} | |
38 | 4.9 | 1.4 | {"dimensions":[-0.8875034739578439,0.1932973956542112,0.008587761196810413,-0.04397949380212729],"reconstructed":[-1.0909535331053124,1.2128618884339772,-1.4153364081567559,-1.335276578823825,0.0]} | |
79 | 6.0 | 4.5 | {"dimensions":[0.26771446285926187,-0.6336090550181356,0.09822169481263057,-1.1825998615107771],"reconstructed":[0.19960990942625179,-0.3716295612296847,0.4469025366948274,0.3939021784864294,1.0]} | |
28 | 5.2 | 1.5 | {"dimensions":[-0.7745605227270265,0.2835218241950847,-0.11580179457286491,0.19887924875050728],"reconstructed":[-0.7606347036494236,0.9995229315594196,-1.254574771653111,-1.2761817446023618,0.0]} | |
32 | 5.4 | 1.5 | {"dimensions":[-0.7716361460682887,0.2706672006585394,-0.2502528329963879,0.1877470447422418],"reconstructed":[-0.5507176169053385,0.8985721493293917,-1.2721708549892874,-1.4227990192352702,0.0]} | |
49 | 5.3 | 1.5 | {"dimensions":[-0.8194097876612587,0.39619978768187836,-0.09611918939413275,0.06587828131922056],"reconstructed":[-0.6329177334723323,1.455303697358055,-1.273552156703061,-1.2628573264621934,0.0]} | |
60 | 5.2 | 3.9 | {"dimensions":[0.1025619380710769,-0.5344523996790395,0.27905786476378636,-0.3166428876496549],"reconstructed":[-0.7654455093212404,-0.8313998540458245,0.1012621944000312,0.2715011240655108,1.0]} | |
15 | 5.8 | 1.2 | {"dimensions":[-0.8718901017234719,0.7039189705709991,-0.2396379306648317,0.037308632699902235],"reconstructed":[-0.0754781649926874,2.166104713835355,-1.287416301738693,-1.3662770290526147,0.0]} |
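The JSON output returned by H2OPredict is a string per row. If the dimensions and reconstructed values are needed as regular columns, one approach, shown here as a sketch and assuming the JSON arrives in the json_report column as in the output above, is to bring the result to the client and parse it with the standard json module:

import json

# Optional post-processing (not in the original notebook): expand the JSON output
# into regular pandas columns on the client.
scores = result.result.to_pandas()

json_col = "json_report"   # assumption; use "prediction" if the JSON appears there instead
parsed = scores[json_col].apply(json.loads)
scores["dimensions"] = parsed.apply(lambda d: d.get("dimensions"))
scores["reconstructed"] = parsed.apply(lambda d: d.get("reconstructed"))
scores.head()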
Cleanup.¶
In [22]:
# Delete the saved model from the table "byom_models", using the model id "h2o_glrm_iris".
delete_byom("h2o_glrm_iris", table_name="byom_models")
Model is deleted.
In [23]:
# Drop model table.
db_drop_table("byom_models")
Out[23]:
True
In [24]:
# Drop input data table.
db_drop_table("iris_input")
Out[24]:
True
In [25]:
# One must run remove_context() to close the connection and garbage collect internally generated objects.
remove_context()
Out[25]:
True