This use case shows the steps to use SageMaker KMeans Estimator with tdapiclient.
You can download the aws-usecases.zip file in the attachment as a reference. The kmeans folder in the zip file includes a Jupyter notebook file (ipynb) for this use case.
- Import necessary packages.
import os
import getpass

import pandas as pd

from tdapiclient import create_tdapi_context, remove_tdapi_context, TDApiClient
from teradataml import create_context, DataFrame, copy_to_sql, load_example_data
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
- Set up data.
- Load the example 'iris' data from teradataml.
load_example_data("byom", "iris_input")
iris_input = DataFrame("iris_input")
- Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
iris_sample = iris_input.sample(frac=[0.8, 0.2])
- Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
iris_train
The output:
id   sepal_length  sepal_width  petal_length  petal_width  species
120  6.0           2.2          5.0           1.5          3
118  7.7           3.8          6.7           2.2          3
15   5.8           4.0          1.2           0.2          1
61   5.0           2.0          3.5           1.0          2
19   5.7           3.8          1.7           0.3          1
80   5.7           2.6          3.5           1.0          2
59   6.6           2.9          4.6           1.3          2
38   4.9           3.6          1.4           0.1          1
40   5.1           3.4          1.5          0.2           1
99   5.1           2.5          3.0          1.1           2
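The split behaves like random row tagging: each row is assigned a "sampleid" of 1 (80% of rows) or 2 (20% of rows). The teradataml sample() runs in-database, but the idea can be sketched with a local pandas analogue (the frame below is a toy stand-in, not the real iris data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for iris_input; the real data lives in Vantage.
df = pd.DataFrame({"id": range(100), "petal_width": np.linspace(0.1, 2.5, 100)})

# Mimic sample(frac=[0.8, 0.2]): tag roughly 80% of rows as sample "1", the rest as "2".
rng = np.random.default_rng(42)
df["sampleid"] = np.where(rng.random(len(df)) < 0.8, "1", "2")

# Filter on "sampleid" and drop the column, as in the steps above.
train = df[df["sampleid"] == "1"].drop("sampleid", axis=1)
test = df[df["sampleid"] == "2"].drop("sampleid", axis=1)
```

Because the assignment is random, the split is approximately (not exactly) 80/20 for small tables.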
- Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
iris_test
The output:
id   sepal_length  sepal_width  petal_length  petal_width  species
133  6.4           2.8          5.6           2.2          3
144  6.8           3.2          5.9           2.3          3
18   5.1           3.5          1.4           0.3          1
17   5.4           3.9          1.3           0.4          1
32   5.4           3.4          1.5          0.4           1
45   5.1           3.8          1.9          0.4           1
106  7.6           3.0          6.6          2.1           3
125  6.7           3.3          5.7          2.1           3
137  6.3           3.4          5.6          2.4           3
108  7.3           2.9          6.3          1.8           3
- Define the bucket locations.
# Bucket location where your custom code is saved in tar.gz format.
custom_code_upload_location = "s3://{}/Kmeans/code".format(s3_bucket)
# Bucket location where results of model training are saved.
model_artifacts_location = "s3://{}/Kmeans/artifacts".format(s3_bucket)
- Create KMeans SageMaker estimator instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
# Create an object based on the KMeans SageMaker class.
kmeans = td_apiclient.KMeans(
    role=exec_role_arn,
    instance_count=1,
    instance_type="ml.m5.large",
    k=3,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
)
- Prepare data for KMeans.
- Convert the teradataml DataFrame to a pandas DataFrame, and then to a NumPy ndarray.
train_pd = iris_train.to_pandas()
features = train_pd.columns.drop('species')
train_data = train_pd[features].values.astype('float32')
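SageMaker's built-in KMeans expects a 2-D float32 array, which is why the astype('float32') cast matters. As a quick sanity check, the conversion can be exercised on a hypothetical two-row frame standing in for the real to_pandas() result:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of iris_train.to_pandas(); the real data comes from Vantage.
train_pd = pd.DataFrame({
    "sepal_length": [6.0, 7.7],
    "sepal_width": [2.2, 3.8],
    "petal_length": [5.0, 6.7],
    "petal_width": [1.5, 2.2],
    "species": [3, 3],
})

# Same conversion as above: keep feature columns, drop the label, cast to float32.
features = train_pd.columns.drop("species")
train_data = train_pd[features].values.astype("float32")

print(train_data.dtype, train_data.shape)  # float32 (2, 4)
```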
- Convert the NumPy ndarray to a RecordSet object to be passed to the fit method.
obj = kmeans.record_set(train_data)
type(obj)
- Start training using RecordSet objects.
kmeans.fit(obj, mini_batch_size=10)
- Create a serializer and a deserializer so that the predictor can handle CSV input and output.
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
kmeans_predictor = kmeans.deploy(
    "aws-endpoint",
    sagemaker_kw_args={
        "instance_type": "ml.m5.large",
        "initial_instance_count": 1,
        "serializer": csv_ser,
        "deserializer": csv_dser,
    },
)
- Score the model using teradataml DataFrame and the predictor object created in previous step.
- Prepare test DataFrame by dropping 'id' and 'species' columns.
iris_test = iris_test.drop("species", axis=1)
iris_test = iris_test.drop("id", axis=1)
- Show the DataFrame.
iris_test
The output:
sepal_length  sepal_width  petal_length  petal_width
7.7           3.8          6.7           2.2
6.3           3.4          5.6           2.4
6.9           3.1          4.9           1.5
6.6           2.9          4.6           1.3
6.1           2.8          4.7           1.2
5.7           2.5          5.0           2.0
6.1           2.8          4.0           1.3
6.4           2.7          5.3           1.9
6.4           3.2          5.3           2.3
5.8           4.0          1.2           0.2
- Try prediction with the UDF and Client options.

Prediction with the UDF option:
output = kmeans_predictor.predict(iris_test, mode="UDF", content_type="csv")
output
The output:
sepal_length  sepal_width  petal_length  petal_width  Output
6.9           3.1          4.9           1.5          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.04985511302948}]}
6.7           3.1          4.7           1.5          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.9353123903274536}]}
7.6           3.0          6.6           2.1          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.1450010538101196}]}
5.0           2.0          3.5           1.0          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 1.527915596961975}]}
5.7           2.9          4.2           1.3          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.3593007028102875}]}
5.6           2.7          4.2           1.3          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.38576072454452515}]}
6.4           2.8          5.6           2.2          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 0.5681478977203369}]}
5.1           3.3          1.7           0.5          {"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.35348325967788696}]}
6.7           3.0          5.0           1.7          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 0.8753966093063354}]}
5.4           3.7          1.5           0.2          {"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.4583273231983185}]}
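In UDF mode, each value in the Output column is a JSON string. To work with the predictions, you can parse out the cluster and distance with the standard json module; a sketch using one value from the output above:

```python
import json

# One value from the Output column of the UDF prediction above.
raw = '{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.04985511302948}]}'

# Each string holds a "predictions" list with one entry per scored row.
pred = json.loads(raw)["predictions"][0]
closest_cluster = int(pred["closest_cluster"])
distance = pred["distance_to_cluster"]

print(closest_cluster, distance)  # 2 1.04985511302948
```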
Prediction with the Client option:

output = kmeans_predictor.predict(iris_test, mode="client", content_type="csv")
output
The output:
[['0.0', '1.5279181003570557'], ['2.0', '1.4551581144332886'], ['1.0', '0.35409244894981384'], ['0.0', '1.2355302572250366'], ['2.0', '0.6641027331352234'], ['1.0', '0.8323636054992676'], ['2.0', '0.7607014179229736'], ['2.0', '0.22456690669059753'], ['0.0', '0.8359512090682983'], ['2.0', '0.35871627926826477'], ['1.0', '0.21502123773097992'], ['2.0', '0.5929785370826721'], ['1.0', '0.44751715660095215'], ['1.0', '0.9462401866912842'], ['1.0', '0.7068477869033813'], ['1.0', '0.15458093583583832'], ['0.0', '0.7202901244163513'], ['0.0', '0.4245430529117584'], ['0.0', '1.199622631072998'], ['1.0', '0.48908916115760803'], ['0.0', '1.5461325645446777'], ['0.0', '0.7100999355316162'], ['2.0', '0.2805166244506836'], ['0.0', '0.7256220579147339'], ['2.0', '0.6365150213241577'], ['0.0', '0.7958523035049438'], ['1.0', '0.657414436340332'], ['0.0', '0.9718656539916992'], ['2.0', '0.5427298545837402'], ['1.0', '0.3770846724510193']]
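In client mode, each scored row comes back as a [closest_cluster, distance] pair of strings (the CSV deserializer does not convert types). Converting to numbers for further analysis is straightforward; a sketch over the first few rows of the output above:

```python
# First rows of the client-mode output above; each pair is (closest_cluster, distance) as strings.
output = [
    ['0.0', '1.5279181003570557'],
    ['2.0', '1.4551581144332886'],
    ['1.0', '0.35409244894981384'],
]

# Cluster labels are whole numbers serialized as floats; distances are plain floats.
clusters = [int(float(c)) for c, _ in output]
distances = [float(d) for _, d in output]

print(clusters)  # [0, 2, 1]
```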
- Clean up.
kmeans_predictor.cloudObj.delete_model()
kmeans_predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)