This use case shows the steps to use the SageMaker RandomCutForest estimator with tdapiclient.
You can download the aws-usecases.zip file in the attachment as a reference. The random-cut-forest folder in the zip file includes a Jupyter notebook file for this use case.
- Import necessary libraries.
import getpass
import os

import numpy as np
import pandas as pd
import sagemaker
from sklearn.preprocessing import LabelEncoder

from tdapiclient import create_tdapi_context, remove_tdapi_context, TDApiClient
# teradataml's LabelEncoder (imported last) is the one used below with valib.Transform.
from teradataml import (create_context, DataFrame, copy_to_sql, load_example_data,
                        configure, LabelEncoder, valib, Retain)
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
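As an optional sanity check (not part of the original notebook), you can run a trivial query through the new connection using the standard teradataml DataFrame.from_query API:

# Optional check: confirm the Vantage connection is usable.
chk = DataFrame.from_query("SELECT 1 AS connected")
print(chk.to_pandas())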
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (Please provide just the bucket name, for example: test-bucket): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
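Before continuing, you can optionally verify that the credentials can reach the bucket. This check is an assumption, not part of the original flow; it uses boto3, which is installed as a SageMaker dependency:

# Optional check: head_bucket raises a ClientError if the bucket is unreachable.
import boto3
s3 = boto3.client("s3", region_name=region)
s3.head_bucket(Bucket=s3_bucket)
print("S3 bucket reachable.")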
- Set up data.
- Load the example data.
load_example_data("seriessplitter", "ibm_stock")
data = DataFrame("ibm_stock")
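A quick peek at the loaded table (optional, not in the original notebook) confirms the columns the following steps operate on:

# Optional peek: expect the columns id, period, name, and stockprice.
print(data.head())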
- Drop unnecessary columns.
data = data.drop(['period'], axis=1)
- Encode the column 'name' using the teradataml LabelEncoder.
from teradataml import LabelEncoder
rc = LabelEncoder(values=("ibm", 1), columns=["name"])
feature_columns_names = Retain(columns=["stockprice"])
configure.val_install_location = "alice"
data = valib.Transform(data=data, label_encode=rc, index_columns="id",
                       unique_index=True, retain=feature_columns_names)
data = data.result
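To verify the transform, an optional client-side check (not in the original notebook): every row of 'name' should now carry the encoded value 1, with 'stockprice' retained alongside it.

# Optional check: pull a few rows to the client and inspect the encoded column.
print(data.to_pandas().head())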
- Drop unnecessary columns.
data = data.drop("id", axis=1)
- Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
data_sample = data.sample(frac=[0.8, 0.2])
data_sample
The output:

stockprice  name  sampleid
       552     1         2
       352     1         1
       556     1         1
       370     1         1
       475     1         1
       385     1         1
       557     1         2
       350     1         1
       596     1         2
       474     1         1
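Because sample() assigns rows randomly, you can optionally confirm that the split is close to 80/20 with a client-side count. This is a hedged sketch, not part of the original notebook:

# Optional check: the row share per sampleid should be roughly 0.8 / 0.2.
pdf = data_sample.to_pandas()
print(pdf["sampleid"].value_counts(normalize=True))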
- Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
data_train = data_sample[data_sample.sampleid == "1"].drop("sampleid", axis=1)
data_train
The output:

stockprice  name
       475     1
       557     1
       577     1
       497     1
       552     1
       556     1
       521     1
       487     1
       387     1
       385     1
- Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
data_test = data_sample[data_sample.sampleid == "2"].drop("sampleid", axis=1)
data_test
The output:

stockprice  name
       555     1
       496     1
       409     1
       551     1
       491     1
       587     1
       399     1
       542     1
       531     1
       387     1
- Create RandomCutForest instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
rcf = td_apiclient.RandomCutForest(role=exec_role_arn,
                                   instance_count=1,
                                   instance_type="ml.m5.large",
                                   num_samples_per_tree=512,
                                   num_trees=50)
- Convert the train data to a RecordSet object to be passed to the fit method.
train_set = rcf.record_set(data_train.get_values().astype('float32'))
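The float32 conversion matters because the SageMaker Random Cut Forest algorithm expects dense float32 input. An optional shape and dtype check (not in the original notebook):

# Optional check: the training matrix should be a 2-D float32 array.
arr = data_train.get_values().astype('float32')
print(arr.shape, arr.dtype)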
- Start training using the RecordSet object.
rcf.fit(train_set)
- Create a serializer and deserializer so the predictor can handle CSV input and output.

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
predictor = rcf.deploy("aws-endpoint",
                       sagemaker_kw_args={"instance_type": "ml.m5.large",
                                          "initial_instance_count": 1,
                                          "serializer": csv_ser,
                                          "deserializer": csv_dser})
- Try prediction integration using a teradataml DataFrame and the predictor object created in the previous step.
- Confirm that the predictor is correctly configured to accept CSV input.
print(predictor.cloudObj.accept)
The output:

('text/csv',)
- Try prediction with the UDF and Client options.

Prediction with the Client option:

output = predictor.predict(data_test, mode="client", content_type='csv')
output
The output:

[['0.7504941002883164'], ['1.0207224796626084'], ['1.0308394635056672'], ['0.9211458191353374'], ['0.7386602728840181'], ['0.9023601909268845'], ['0.7716498762848135'], ['0.9157308017573146'], ['1.1074548624933822'], ['0.7695934123198193'], ['0.7240618827098061'], ['0.9301838696836991'], ['0.7543856434161177'], ['0.7758886919821008'], ['0.8995296519196438'], ['0.7900451450353364'], ['0.7543856434161177'], ['0.9845105265173143'], ['0.9041053286804649'], ['0.7517221420434166'], ['0.7543856434161177'], ['0.7755021272225036'], ['0.7312255026665386'], ['0.7846481495495119'], ['1.323345181866896'], ['0.7574469092696446'], ['0.9172574024288217'], ['0.8865849688703292'], ['0.9558031346431431'], ['1.1401907300479466'], ['0.7783665988095414'], ['0.8606176638947314'], ['0.8194513516498502'], ['1.0037295269063529'], ['0.7756173809067299'], ['0.7574469092696446'], ['0.8474858343051359'], ['0.8069028137727154'], ['1.1945449394990506'], ['0.7574469092696446'], ['0.7200622617886648'], ['0.907478795887426'], ['0.7200622617886648'], ['0.7760705068507692'], ['0.724224159464554'], ['0.9545242913829999'], ['0.9420880565470928'], ['0.9065678443198703'], ['0.7157960453933286'], ['0.960482824608952'], ['0.9318529852186351'], ['0.9023601909268845'], ['0.9630174703884888'], ['0.7517221420434166'], ['0.8606176638947314'], ['0.7517221420434166'], ['0.7823662290441952'], ['0.7376855731824589'], ['1.0636259692671353'], ['0.7480312078032537'], ['0.81220691722653'], ['0.8493853404084787'], ['0.8401529182505993'], ['0.7242812649681286'], ['0.7823662290441952'], ['1.0037295269063529'], ['0.9380221499147791'], ['0.8468019836106951'], ['0.7504941002883164'], ['0.8606176638947314'], ['0.8957845289058336'], ['0.9041053286804649'], ['0.9420880565470928'], ['0.7787192480469368']]
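Random Cut Forest returns one anomaly score per record; higher scores indicate more anomalous records. A common way to flag outliers, sketched here under the assumption of a 3-sigma cutoff (the threshold choice is not part of this tutorial), is:

# Post-process the client-mode scores above; the 3-sigma cutoff is a heuristic.
import numpy as np
scores = np.array([float(row[0]) for row in output])
threshold = scores.mean() + 3 * scores.std()
print("threshold:", threshold)
print("anomalous row positions:", np.where(scores > threshold)[0])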
item = data_test.tail()

Prediction with the UDF option:

output = predictor.predict(item, mode="UDF", content_type='csv')
output
The output:

stockprice  name  Output
       590     1  {"scores":[{"score":0.9964993369}]}
       585     1  {"scores":[{"score":0.9545242914}]}
       584     1  {"scores":[{"score":0.9420421609}]}
       583     1  {"scores":[{"score":0.953565364}]}
       581     1  {"scores":[{"score":0.9211458191}]}
       578     1  {"scores":[{"score":0.9041053287}]}
       581     1  {"scores":[{"score":0.9211458191}]}
       588     1  {"scores":[{"score":0.9759149282}]}
       592     1  {"scores":[{"score":1.0207224797}]}
       596     1  {"scores":[{"score":1.0538021653}]}
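In UDF mode, the scores come back as JSON strings in the Output column. A minimal sketch for extracting them numerically, assuming the UDF-mode result converts to pandas through to_pandas() like other teradataml DataFrames:

# Parse the JSON score out of each Output value (assumes the column layout shown above).
import json
pdf = output.to_pandas()
pdf["score"] = pdf["Output"].apply(lambda s: json.loads(s)["scores"][0]["score"])
print(pdf[["stockprice", "score"]])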
- Clean up.
predictor.cloudObj.delete_model()
predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)