H2OPredict() using Isolation Forest model.¶
Setup¶
In [1]:
# Import required libraries
import tempfile
import getpass
from teradataml import create_context, DataFrame, save_byom, retrieve_byom, \
delete_byom, list_byom, remove_context, load_example_data, db_drop_table
from teradataml.options.configure import configure
from teradataml.analytics.byom.H2OPredict import H2OPredict
import h2o
In [2]:
# Create the connection.
host = getpass.getpass("Host: ")
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")
con = create_context(host=host, username=username, password=password)
Host: ········ Username: ········ Password: ········
Load example data and use sample() to split the input data into training and testing datasets.¶
In [3]:
# Load the example data.
load_example_data("byom", "iris_input")
WARNING: Skipped loading table iris_input since it already exists in the database.
In [4]:
# Create the teradataml DataFrames.
iris_input = DataFrame("iris_input")
In [5]:
# Create 2 samples of input data - sample 1 will have 80% of total rows and sample 2 will have 20% of total rows.
iris_sample = iris_input.sample(frac=[0.8, 0.2])
In [6]:
# Create train dataset from sample 1 by filtering on "sampleid" and drop "sampleid" column as it is not required for training model.
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
iris_train
Out[6]:
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
116 | 6.4 | 3.2 | 5.3 | 2.3 | 3 |
139 | 6.0 | 3.0 | 4.8 | 1.8 | 3 |
34 | 5.5 | 4.2 | 1.4 | 0.2 | 1 |
40 | 5.1 | 3.4 | 1.5 | 0.2 | 1 |
120 | 6.0 | 2.2 | 5.0 | 1.5 | 3 |
122 | 5.6 | 2.8 | 4.9 | 2.0 | 3 |
59 | 6.6 | 2.9 | 4.6 | 1.3 | 2 |
99 | 5.1 | 2.5 | 3.0 | 1.1 | 2 |
80 | 5.7 | 2.6 | 3.5 | 1.0 | 2 |
17 | 5.4 | 3.9 | 1.3 | 0.4 | 1 |
In [7]:
# Create test dataset from sample 2 by filtering on "sampleid" and drop "sampleid" column as it is not required for scoring.
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
iris_test
Out[7]:
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
60 | 5.2 | 2.7 | 3.9 | 1.4 | 2 |
139 | 6.0 | 3.0 | 4.8 | 1.8 | 3 |
51 | 7.0 | 3.2 | 4.7 | 1.4 | 2 |
120 | 6.0 | 2.2 | 5.0 | 1.5 | 3 |
55 | 6.5 | 2.8 | 4.6 | 1.5 | 2 |
59 | 6.6 | 2.9 | 4.6 | 1.3 | 2 |
36 | 5.0 | 3.2 | 1.2 | 0.2 | 1 |
97 | 5.7 | 2.9 | 4.2 | 1.3 | 2 |
118 | 7.7 | 3.8 | 6.7 | 2.2 | 3 |
101 | 6.3 | 3.3 | 6.0 | 2.5 | 3 |
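In the outputs above, ids 59, 120, and 139 appear in both iris_train and iris_test, which suggests the two filters may each have drawn their own random sample instead of sharing one. The sketch below is plain Python, not teradataml: it assigns every id to exactly one sample in a single pass, which is the property a train/test split is normally expected to have. The helper name split_ids, the seed, and the id range 1 to 150 are assumptions made for illustration.

```python
import random


def split_ids(ids, train_fraction=0.8, seed=42):
    """Assign each id to sample 1 (train) or sample 2 (test), exactly once."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return {1: shuffled[:cut], 2: shuffled[cut:]}


# Hypothetical id range; the real table's ids may differ.
samples = split_ids(range(1, 151))
train_ids = set(samples[1])
test_ids = set(samples[2])

print(len(train_ids), len(test_ids))
print("overlap:", sorted(train_ids & test_ids))
```

If overlap between the two datasets matters for the evaluation, one option is to materialize the sampled DataFrame once and filter on "sampleid" from that stored copy.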
Prepare dataset for creating an Isolation Forest.¶
In [8]:
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 17.0.2+8-LTS-86, mixed mode, sharing)
Starting server from c:\users\ar255086\appdata\local\programs\python\python37\lib\site-packages\h2o\backend\bin\h2o.jar
Ice root: C:\Users\ar255086\AppData\Local\Temp\tmpbvd7h_jb
JVM stdout: C:\Users\ar255086\AppData\Local\Temp\tmpbvd7h_jb\h2o_ar255086_started_from_python.out
JVM stderr: C:\Users\ar255086\AppData\Local\Temp\tmpbvd7h_jb\h2o_ar255086_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: | 01 secs |
H2O_cluster_timezone: | Asia/Kolkata |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.36.0.3 |
H2O_cluster_version_age: | 1 month and 4 days |
H2O_cluster_name: | H2O_from_python_ar255086_tizp5y |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 7.934 Gb |
H2O_cluster_total_cores: | 16 |
H2O_cluster_allowed_cores: | 16 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://127.0.0.1:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
Python_version: | 3.7.8 final |
In [9]:
# H2OFrame accepts a pandas DataFrame, so convert the teradataml DataFrame to pandas first.
iris_train_pd = iris_train.to_pandas()
h2o_df = h2o.H2OFrame(iris_train_pd)
h2o_df
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5 | 2 | 3.5 | 1 | 2 |
6.3 | 3.3 | 6 | 2.5 | 3 |
5.1 | 3.4 | 1.5 | 0.2 | 1 |
5.6 | 2.8 | 4.9 | 2 | 3 |
4.9 | 3.6 | 1.4 | 0.1 | 1 |
5.4 | 3.9 | 1.3 | 0.4 | 1 |
5.7 | 2.6 | 3.5 | 1 | 2 |
5.7 | 3.8 | 1.7 | 0.3 | 1 |
6.7 | 3 | 5 | 1.7 | 2 |
6 | 3 | 4.8 | 1.8 | 3 |
Out[9]:
Train an isolation forest model using H2O.¶
In [10]:
# Import required libraries.
from h2o.estimators import H2OIsolationForestEstimator
In [11]:
# Mark "species" as categorical and define the columns passed to train().
h2o_df["species"] = h2o_df["species"].asfactor()
predictors = h2o_df.columns
response = "species"
In [12]:
Isolation_Forest_model = H2OIsolationForestEstimator(sample_rate = 0.1,
max_depth = 20,
ntrees = 50)
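The idea behind the estimator configured above is that unusual points need fewer random splits to be separated from the rest of the data. The sketch below is a toy, one-dimensional illustration of that idea in plain Python; it is not H2O's implementation. The function names, the sample_size argument (a stand-in for what sample_rate controls), and the synthetic data are all assumptions made for illustration.

```python
import random


def path_length(point, values, rng, depth=0, max_depth=20):
    """Count the random splits needed to separate `point` from `values` (1-D)."""
    lo, hi = min(values), max(values)
    # Stop when the point is alone, all remaining values are equal,
    # or the depth cap is reached.
    if len(values) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values on the same side of the split as the point.
    same_side = [v for v in values if (v < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, ntrees=50, sample_size=16, max_depth=20, seed=0):
    """Average the path length of `point` over `ntrees` random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(ntrees):
        sample = rng.sample(data, sample_size) + [point]
        total += path_length(point, sample, rng, max_depth=max_depth)
    return total / ntrees


# Synthetic data: a tight cluster between 5.00 and 5.99.
cluster = [5.0 + 0.01 * i for i in range(100)]
inlier_len = mean_path_length(5.5, cluster)
outlier_len = mean_path_length(20.0, cluster)
print("inlier mean path length:", inlier_len)
print("outlier mean path length:", outlier_len)
```

In this toy setup the point far from the cluster is usually cut off by the first split, so its mean path length is shorter than that of a point inside the cluster.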
In [13]:
Isolation_Forest_model.train(x=predictors, y=response, training_frame=h2o_df)
isolationforest Model Build progress: |██████████████████████████████████████████| (done) 100%
Model Details
=============
H2OIsolationForestEstimator : Isolation Forest
Model Key: IsolationForest_model_python_1647851494184_1
Model Summary:
 | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
---|---|---|---|---|---|---|---|---|---|
0 | 50.0 | 50.0 | 8034.0 | 1.0 | 9.0 | 4.84 | 1.0 | 18.0 | 8.16 |
ModelMetricsAnomaly: isolationforest
** Reported on train data. **
Anomaly Score: 2.316002830125688
Normalized Anomaly Score: 0.43456615424340295
Scoring History:
 | timestamp | duration | number_of_trees | mean_tree_path_length | mean_anomaly_score |
---|---|---|---|---|---|
0 | 2022-03-21 14:01:38 | 0.012 sec | 0.0 | NaN | NaN |
1 | 2022-03-21 14:01:38 | 0.085 sec | 1.0 | 2.061947 | 0.469027 |
2 | 2022-03-21 14:01:38 | 0.095 sec | 2.0 | 2.025210 | 0.389916 |
3 | 2022-03-21 14:01:38 | 0.103 sec | 3.0 | 2.378531 | 0.533128 |
4 | 2022-03-21 14:01:39 | 0.110 sec | 4.0 | 2.309117 | 0.517569 |
5 | 2022-03-21 14:01:39 | 0.113 sec | 5.0 | 1.849858 | 0.516714 |
6 | 2022-03-21 14:01:39 | 0.120 sec | 6.0 | 1.876891 | 0.541036 |
7 | 2022-03-21 14:01:39 | 0.123 sec | 7.0 | 1.602302 | 0.543549 |
8 | 2022-03-21 14:01:39 | 0.130 sec | 8.0 | 1.847429 | 0.511028 |
9 | 2022-03-21 14:01:39 | 0.135 sec | 9.0 | 1.982093 | 0.507326 |
10 | 2022-03-21 14:01:39 | 0.141 sec | 10.0 | 2.222943 | 0.452714 |
11 | 2022-03-21 14:01:39 | 0.146 sec | 11.0 | 2.321387 | 0.411510 |
12 | 2022-03-21 14:01:39 | 0.153 sec | 12.0 | 2.291757 | 0.395788 |
13 | 2022-03-21 14:01:39 | 0.159 sec | 13.0 | 2.281198 | 0.406280 |
14 | 2022-03-21 14:01:39 | 0.165 sec | 14.0 | 2.286250 | 0.428304 |
15 | 2022-03-21 14:01:39 | 0.172 sec | 15.0 | 2.336795 | 0.431602 |
16 | 2022-03-21 14:01:39 | 0.185 sec | 16.0 | 2.258393 | 0.423775 |
17 | 2022-03-21 14:01:39 | 0.194 sec | 17.0 | 2.214138 | 0.405702 |
18 | 2022-03-21 14:01:39 | 0.201 sec | 18.0 | 2.228324 | 0.410006 |
19 | 2022-03-21 14:01:39 | 0.209 sec | 19.0 | 2.345957 | 0.412195 |
See the whole table with table.as_data_frame()
Out[13]:
Save the model in MOJO format.¶
In [14]:
# Saving H2O Model to a file.
temp_dir = tempfile.TemporaryDirectory()
model_file_path = Isolation_Forest_model.save_mojo(path=temp_dir.name, force=True)
Save the model in Vantage.¶
In [15]:
# Save the H2O Model in Vantage.
save_byom(model_id="h2o_Isolation_Forest_iris", model_file=model_file_path, table_name="byom_models")
Created the model table 'byom_models' as it does not exist. Model is saved.
In [16]:
# List the models from "byom_models".
list_byom("byom_models")
model_id | model |
---|---|
h2o_Isolation_Forest_iris | b'504B03041400080808...' |
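The stored model value shown by list_byom begins with 504B0304, which is the hexadecimal form of the ZIP local file header signature (the bytes "PK", 0x03, 0x04). That is consistent with the MOJO file being a zip archive. The sketch below builds a small stand-in archive in memory, not a real MOJO, and prints its first four bytes in the same notation; the entry name placeholder.txt is hypothetical.

```python
import binascii
import io
import zipfile

# Build a tiny in-memory zip archive as a stand-in for a MOJO file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("placeholder.txt", "stand-in content, not a real model")

raw = buffer.getvalue()

# Render the first four bytes as upper-case hex, as in the list_byom output.
prefix_hex = binascii.hexlify(raw[:4]).decode("ascii").upper()
print(prefix_hex)  # 504B0304

# zipfile can also confirm that the bytes form a readable archive.
is_zip = zipfile.is_zipfile(io.BytesIO(raw))
print(is_zip)
```

A quick check like this on the saved file can catch the case where something other than the MOJO archive was passed to save_byom.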
Retrieve the model from Vantage.¶
In [17]:
# Retrieve the model from Vantage using the model id 'h2o_Isolation_Forest_iris'.
model = retrieve_byom(model_id="h2o_Isolation_Forest_iris", table_name="byom_models")
In [18]:
# Set the database in which the BYOM functions, such as H2OPredict, are installed.
configure.byom_install_location = getpass.getpass("byom_install_location: ")
byom_install_location: ········
Score the model.¶
In [19]:
# Score the model on 'iris_test' data.
result = H2OPredict(newdata=iris_test,
newdata_partition_column='id',
newdata_order_column='id',
modeldata=model,
modeldata_order_column='model_id',
model_output_fields=['normalizedScore'],
accumulate=['id', 'sepal_length', 'petal_length'],
overwrite_cached_models='*',
model_type='OpenSource'
)
In [20]:
# Print the query.
print(result.show_query())
SELECT * FROM "alice".H2OPredict( ON "ALICE"."ml__select__1647853163572773" AS InputTable PARTITION BY "id" ORDER BY "id" ON (select model_id,model from "ALICE"."ml__filter__1647856509161774") AS ModelTable DIMENSION ORDER BY "model_id" USING Accumulate('id','sepal_length','petal_length') ModelOutputFields('normalizedScore') OverwriteCachedModel('*') ) as sqlmr
In [21]:
# Print the result.
result.result
Out[21]:
id | sepal_length | petal_length | prediction | normalizedscore |
---|---|---|---|---|
50 | 5.0 | 1.4 | 2.5 | 0.32098765432098764 |
60 | 5.2 | 3.9 | 2.52 | 0.30864197530864196 |
73 | 6.3 | 4.9 | 2.5 | 0.32098765432098764 |
9 | 4.4 | 1.4 | 2.32 | 0.43209876543209874 |
66 | 6.7 | 4.4 | 2.04 | 0.6049382716049383 |
46 | 4.8 | 1.4 | 2.5 | 0.32098765432098764 |
53 | 6.9 | 4.9 | 2.1 | 0.5679012345679012 |
59 | 6.6 | 4.6 | 2.14 | 0.5432098765432098 |
11 | 5.4 | 1.5 | 2.42 | 0.37037037037037035 |
38 | 4.9 | 1.4 | 2.24 | 0.48148148148148145 |
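The two score columns in the result above appear to be related by a min-max normalization of total path length: if "prediction" is read as the mean path length over the 50 trees, every displayed row matches (max_total - prediction * 50) / (max_total - min_total). The constants 151 and 70 used below are inferred from the displayed rows, not read from the model, and the interpretation of "prediction" is likewise an assumption; the sketch only checks that the displayed numbers are mutually consistent.

```python
import math

NTREES = 50             # ntrees used when the model was created
MAX_TOTAL_LENGTH = 151  # inferred from the displayed rows (assumption)
MIN_TOTAL_LENGTH = 70   # inferred from the displayed rows (assumption)


def normalized_score(mean_length):
    """Min-max normalize a total path length; shorter paths score higher."""
    total_length = mean_length * NTREES
    return (MAX_TOTAL_LENGTH - total_length) / (MAX_TOTAL_LENGTH - MIN_TOTAL_LENGTH)


# (id, prediction, normalizedscore) copied from the result table.
rows = [
    (50, 2.5, 0.32098765432098764),
    (60, 2.52, 0.30864197530864196),
    (73, 2.5, 0.32098765432098764),
    (9, 2.32, 0.43209876543209874),
    (66, 2.04, 0.6049382716049383),
    (46, 2.5, 0.32098765432098764),
    (53, 2.1, 0.5679012345679012),
    (59, 2.14, 0.5432098765432098),
    (11, 2.42, 0.37037037037037035),
    (38, 2.24, 0.48148148148148145),
]

max_abs_error = 0.0
for row_id, prediction, reported in rows:
    computed = normalized_score(prediction)
    max_abs_error = max(max_abs_error, abs(computed - reported))
    print(row_id, computed, math.isclose(computed, reported, rel_tol=1e-9))
```

Under this reading, a higher normalizedscore corresponds to a shorter average path and therefore a more unusual row; among the rows displayed, id 66 has the highest score.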
Cleanup.¶
In [22]:
# Delete the saved Model from the table byom_models, using the model id h2o_Isolation_Forest_iris.
delete_byom("h2o_Isolation_Forest_iris", table_name="byom_models")
Model is deleted.
In [23]:
# Drop model table.
db_drop_table("byom_models")
Out[23]:
True
In [24]:
# Drop input data table.
db_drop_table("iris_input")
Out[24]:
True
In [25]:
# One must run remove_context() to close the connection and garbage collect internally generated objects.
remove_context()
Out[25]:
True