This use case shows the steps to use SageMaker KMeans Estimator with tdapiclient.
You can download the aws-usecases.zip file in the attachment as a reference. The kmeans folder in the zip file includes a Jupyter notebook file (ipynb) for this use case.
- Import necessary packages.
import os
import getpass

import pandas as pd

from tdapiclient import create_tdapi_context, remove_tdapi_context, TDApiClient
from teradataml import create_context, DataFrame, copy_to_sql, load_example_data
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
- Set up data.
- Load the example 'iris' data from teradataml.
load_example_data("byom", "iris_input")
iris_input = DataFrame("iris_input")
- Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
iris_sample = iris_input.sample(frac=[0.8, 0.2])
- Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
iris_train
The output:
id   sepal_length  sepal_width  petal_length  petal_width  species
120  6.0           2.2          5.0           1.5          3
118  7.7           3.8          6.7           2.2          3
15   5.8           4.0          1.2           0.2          1
61   5.0           2.0          3.5           1.0          2
19   5.7           3.8          1.7           0.3          1
80   5.7           2.6          3.5           1.0          2
59   6.6           2.9          4.6           1.3          2
38   4.9           3.6          1.4           0.1          1
40   5.1           3.4          1.5          0.2           1
99   5.1           2.5          3.0          1.1           2
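The split behaves like random row tagging: each row is assigned a "sampleid" of 1 (80% of rows) or 2 (20% of rows). The teradataml sample() runs in-database, but the idea can be sketched with a local pandas analogue (the frame below is a toy stand-in, not the real iris data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for iris_input; the real data lives in Vantage.
df = pd.DataFrame({"id": range(100), "petal_width": np.linspace(0.1, 2.5, 100)})

# Mimic sample(frac=[0.8, 0.2]): tag roughly 80% of rows as sample "1", the rest as "2".
rng = np.random.default_rng(42)
df["sampleid"] = np.where(rng.random(len(df)) < 0.8, "1", "2")

# Filter on "sampleid" and drop the column, as in the steps above.
train = df[df["sampleid"] == "1"].drop("sampleid", axis=1)
test = df[df["sampleid"] == "2"].drop("sampleid", axis=1)
```

Because the assignment is random, the split is approximately (not exactly) 80/20 for small tables.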
- Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
iris_test
The output:
id   sepal_length  sepal_width  petal_length  petal_width  species
133  6.4           2.8          5.6           2.2          3
144  6.8           3.2          5.9           2.3          3
18   5.1           3.5          1.4           0.3          1
17   5.4           3.9          1.3           0.4          1
32   5.4           3.4          1.5          0.4           1
45   5.1           3.8          1.9          0.4           1
106  7.6           3.0          6.6          2.1           3
125  6.7           3.3          5.7          2.1           3
137  6.3           3.4          5.6          2.4           3
108  7.3           2.9          6.3          1.8           3
- Define the bucket locations.
# Bucket location where your custom code is saved in tar.gz format.
custom_code_upload_location = "s3://{}/Kmeans/code".format(s3_bucket)
# Bucket location where results of model training are saved.
model_artifacts_location = "s3://{}/Kmeans/artifacts".format(s3_bucket)
- Create KMeans SageMaker estimator instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
# Create an object based on the KMeans SageMaker class.
kmeans = td_apiclient.KMeans(
    role=exec_role_arn,
    instance_count=1,
    instance_type="ml.m5.large",
    k=3,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
)
- Prepare data for KMeans.
- Convert the teradataml DataFrame to a pandas DataFrame, and then to a NumPy ndarray.
train_pd = iris_train.to_pandas()
features = train_pd.columns.drop('species')
train_data = train_pd[features].values.astype('float32')
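SageMaker's built-in KMeans expects a 2-D float32 array, which is why the astype('float32') cast matters. As a quick sanity check, the conversion can be exercised on a hypothetical two-row frame standing in for the real to_pandas() result:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of iris_train.to_pandas(); the real data comes from Vantage.
train_pd = pd.DataFrame({
    "sepal_length": [6.0, 7.7],
    "sepal_width": [2.2, 3.8],
    "petal_length": [5.0, 6.7],
    "petal_width": [1.5, 2.2],
    "species": [3, 3],
})

# Same conversion as above: keep feature columns, drop the label, cast to float32.
features = train_pd.columns.drop("species")
train_data = train_pd[features].values.astype("float32")

print(train_data.dtype, train_data.shape)  # float32 (2, 4)
```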
- Convert the NumPy ndarray to a RecordSet object to be passed to the fit method.
obj = kmeans.record_set(train_data)
type(obj)
- Start training using RecordSet objects.
kmeans.fit(obj, mini_batch_size=10)
- Create a serializer and a deserializer so that the predictor can handle CSV input and output.
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
kmeans_predictor = kmeans.deploy(
    "aws-endpoint",
    sagemaker_kw_args={
        "instance_type": "ml.m5.large",
        "initial_instance_count": 1,
        "serializer": csv_ser,
        "deserializer": csv_dser,
    },
)
- Score the model using teradataml DataFrame and the predictor object created in previous step.
- Prepare test DataFrame by dropping 'id' and 'species' columns.
iris_test = iris_test.drop("species", axis=1)
iris_test = iris_test.drop("id", axis=1)
- Show the DataFrame.
iris_test
The output:
sepal_length  sepal_width  petal_length  petal_width
7.7           3.8          6.7           2.2
6.3           3.4          5.6           2.4
6.9           3.1          4.9           1.5
6.6           2.9          4.6           1.3
6.1           2.8          4.7           1.2
5.7           2.5          5.0           2.0
6.1           2.8          4.0           1.3
6.4           2.7          5.3           1.9
6.4           3.2          5.3           2.3
5.8           4.0          1.2           0.2
- Try prediction with the UDF and Client options.

Prediction with the UDF option:
output = kmeans_predictor.predict(iris_test, mode="UDF", content_type="csv")
output
The output:
sepal_length  sepal_width  petal_length  petal_width  Output
6.9           3.1          4.9           1.5          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.04985511302948}]}
6.7           3.1          4.7           1.5          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.9353123903274536}]}
7.6           3.0          6.6           2.1          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.1450010538101196}]}
5.0           2.0          3.5           1.0          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 1.527915596961975}]}
5.7           2.9          4.2           1.3          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.3593007028102875}]}
5.6           2.7          4.2           1.3          {"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.38576072454452515}]}
6.4           2.8          5.6           2.2          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 0.5681478977203369}]}
5.1           3.3          1.7           0.5          {"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.35348325967788696}]}
6.7           3.0          5.0           1.7          {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 0.8753966093063354}]}
5.4           3.7          1.5           0.2          {"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.4583273231983185}]}
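In UDF mode, each value in the Output column is a JSON string. To work with the predictions, you can parse out the cluster and distance with the standard json module; a sketch using one value from the output above:

```python
import json

# One value from the Output column of the UDF prediction above.
raw = '{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.04985511302948}]}'

# Each string holds a "predictions" list with one entry per scored row.
pred = json.loads(raw)["predictions"][0]
closest_cluster = int(pred["closest_cluster"])
distance = pred["distance_to_cluster"]

print(closest_cluster, distance)  # 2 1.04985511302948
```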
Prediction with the Client option:

output = kmeans_predictor.predict(iris_test, mode="client", content_type="csv")
output
The output:
[['0.0', '1.5279181003570557'], ['2.0', '1.4551581144332886'], ['1.0', '0.35409244894981384'], ['0.0', '1.2355302572250366'], ['2.0', '0.6641027331352234'], ['1.0', '0.8323636054992676'], ['2.0', '0.7607014179229736'], ['2.0', '0.22456690669059753'], ['0.0', '0.8359512090682983'], ['2.0', '0.35871627926826477'], ['1.0', '0.21502123773097992'], ['2.0', '0.5929785370826721'], ['1.0', '0.44751715660095215'], ['1.0', '0.9462401866912842'], ['1.0', '0.7068477869033813'], ['1.0', '0.15458093583583832'], ['0.0', '0.7202901244163513'], ['0.0', '0.4245430529117584'], ['0.0', '1.199622631072998'], ['1.0', '0.48908916115760803'], ['0.0', '1.5461325645446777'], ['0.0', '0.7100999355316162'], ['2.0', '0.2805166244506836'], ['0.0', '0.7256220579147339'], ['2.0', '0.6365150213241577'], ['0.0', '0.7958523035049438'], ['1.0', '0.657414436340332'], ['0.0', '0.9718656539916992'], ['2.0', '0.5427298545837402'], ['1.0', '0.3770846724510193']]
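In client mode, each scored row comes back as a [closest_cluster, distance] pair of strings (the CSV deserializer does not convert types). Converting to numbers for further analysis is straightforward; a sketch over the first few rows of the output above:

```python
# First rows of the client-mode output above; each pair is (closest_cluster, distance) as strings.
output = [
    ['0.0', '1.5279181003570557'],
    ['2.0', '1.4551581144332886'],
    ['1.0', '0.35409244894981384'],
]

# Cluster labels are whole numbers serialized as floats; distances are plain floats.
clusters = [int(float(c)) for c, _ in output]
distances = [float(d) for _, d in output]

print(clusters)  # [0, 2, 1]
```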
- Clean up.
kmeans_predictor.cloudObj.delete_model()
kmeans_predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)