This use case shows the steps to use SageMaker kNN with tdapiclient.
You can download the attached aws-usecases.zip file for reference. The knn folder in the zip file includes a Jupyter notebook (.ipynb) and a data file (.csv) containing the dataset required to run this use case.
- Import necessary libraries.
import os
import getpass

import pandas as pd

from tdapiclient import create_tdapi_context, TDApiClient, remove_tdapi_context
from teradataml import create_context, DataFrame, copy_to_sql, load_example_data
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
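Optionally, confirm that the session works before proceeding. A minimal sketch using teradataml's db_list_tables helper (assuming the logged-in user's default database contains at least one object):

from teradataml import db_list_tables

# List objects in the default database; this fails fast if the
# Vantage session was not established correctly.
print(db_list_tables())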
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
- Set up data to be used for this workflow.
- Read the wheat seeds dataset.
df = pd.read_csv("seeds.csv")
df
The output:

     Area   Perimeter  Compactness  Kernel.Length  Kernel.Width  Asymmetry.Coeff  Kernel.Groove  Type
0    15.26  14.84      0.8710       5.763          3.312         2.221            5.220          1
1    14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1
2    14.29  14.09      0.9050       5.291          3.337         2.699            4.825          1
3    13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1
4    16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1
..   ...    ...        ...          ...            ...           ...              ...            ...
194  12.19  13.20      0.8783       5.137          2.981         3.631            4.870          3
195  11.23  12.88      0.8511       5.140          2.795         4.325            5.003          3
196  13.20  13.66      0.8883       5.236          3.232         8.315            5.056          3
197  11.84  13.21      0.8521       5.175          2.836         3.598            5.044          3
198  12.30  13.34      0.8684       5.243          2.974         5.637            5.063          3

199 rows × 8 columns
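As an optional sanity check (not part of the original notebook), verify the class balance of the target column with pandas:

# The wheat seeds dataset has three varieties, labeled 1-3 in "Type".
print(df["Type"].value_counts())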
- Rename columns for creating teradataml DataFrame.
df.rename(columns={'Kernel.Length':'Kernel_Length', 'Kernel.Width':'Kernel_Width', 'Kernel.Groove':'Kernel_Groove', 'Asymmetry.Coeff':'Asymmetry_Coeff'}, inplace=True)
- Insert the DataFrame into a table.
data_table = "wheat_data"
column_types = {'Area': FLOAT, 'Perimeter': FLOAT, 'Compactness': FLOAT,
                'Kernel_Length': FLOAT, 'Kernel_Width': FLOAT,
                'Asymmetry_Coeff': FLOAT, 'Kernel_Groove': FLOAT,
                'Type': INTEGER}
copy_to_sql(df=df, table_name=data_table, if_exists="replace", types=column_types)
- Create a teradataml DataFrame using the table.
data = DataFrame(table_name=data_table)
data
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1
16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1
14.69  14.49      0.8799       5.563          3.259         3.586            5.219          1
16.44  15.25      0.888        5.884          3.505         1.969            5.533          1
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          1
13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1
14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1
15.26  14.84      0.871        5.763          3.312         2.221            5.22           1
- Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
data_sample = data.sample(frac=[0.8, 0.2])
data_sample
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type  sampleid
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1     2
16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1     1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1     1
14.69  14.49      0.8799       5.563          3.259         3.586            5.219          1     1
16.44  15.25      0.888        5.884          3.505         1.969            5.533          1     1
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1     1
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          1     1
13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1     1
14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1     1
15.26  14.84      0.871        5.763          3.312         2.221            5.22           1     1
- Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
data_train = data_sample[data_sample.sampleid == "1"].drop("sampleid", axis = 1)
data_train
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type
13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          1
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1
13.74  14.05      0.8744       5.482          3.114         2.932            4.825          1
14.59  14.28      0.8993       5.351          3.333         4.185            4.781          1
13.89  14.02      0.888        5.439          3.199         3.986            4.738          1
16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1
14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1
- Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
data_test = data_sample[data_sample.sampleid == "2"].drop("sampleid", axis = 1)
data_test
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1
13.74  14.05      0.8744       5.482          3.114         2.932            4.825          1
13.02  13.76      0.8641       5.395          3.026         3.373            4.825          1
13.94  14.17      0.8728       5.585          3.15          2.124            5.012          1
14.8   14.52      0.8823       5.656          3.288         3.112            5.309          1
13.16  13.55      0.9009       5.138          3.201         2.461            4.783          1
17.08  15.38      0.9079       5.832          3.683         2.956            5.484          1
13.89  14.02      0.888        5.439          3.199         3.986            4.738          1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1
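Optionally, verify the split. A minimal sketch, assuming the teradataml DataFrame exposes the shape property (available in recent teradataml releases):

# Roughly 80% of the 199 rows should land in the train set.
print("train:", data_train.shape)
print("test: ", data_test.shape)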
- Create kNN SageMaker instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
knn = td_apiclient.KNN(
    role=exec_role_arn,
    instance_count=1,
    instance_type="ml.m5.large",
    k=3,
    sample_size=30,
    predictor_type="classifier"
)
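In SageMaker's built-in kNN algorithm, k is the number of nearest neighbors consulted at prediction time, sample_size is the number of training points sampled to build the index, and predictor_type selects classification rather than regression.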
- Prepare data for kNN.
- Convert teradataml DataFrame to NumPy ndarray.
train_data = data_train.drop('Type', axis=1).get_values()
label_train = data_train.get_values()
label_train = label_train[:, 7]
label_train = label_train.astype('float32')
train_data = train_data.astype('float32')
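Optionally, confirm the shapes and dtypes before building the RecordSet; SageMaker's built-in kNN expects float32 features and labels:

# train_data should be (n_rows, 7) float32 and label_train (n_rows,) float32.
print(train_data.shape, train_data.dtype)
print(label_train.shape, label_train.dtype)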
- Convert NumPy ndarray to RecordSet object to be passed to fit method.
training_data_recordset = knn.record_set(train=train_data, labels=label_train)
- Start training using RecordSet objects.
knn.fit(training_data_recordset)
- Create a serializer and a deserializer so the predictor can handle CSV input and output.
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
predictor = knn.deploy("aws-endpoint",
                       sagemaker_kw_args={"instance_type": "ml.m5.large",
                                          "initial_instance_count": 1,
                                          "serializer": csv_ser,
                                          "deserializer": csv_dser})
- Score the model using the teradataml DataFrame and the predictor object created in the previous step.
- Show the content type accepted by the predictor object.
print(predictor.cloudObj.accept)
The output:('text/csv',)
- Prepare test DataFrame by dropping target variable "Type".
data_test = data_test.drop("Type", axis=1)
- Show the DataFrame.
data_test
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove
14.03  14.16      0.8796       5.438          3.201         1.717            5.001
15.88  14.9       0.8988       5.618          3.507         0.7651           5.091
12.08  13.23      0.8664       5.099          2.936         1.415            4.961
16.19  15.16      0.8849       5.833          3.421         0.903            5.307
17.08  15.38      0.9079       5.832          3.683         2.956            5.484
15.36  14.76      0.8861       5.701          3.393         1.367            5.132
12.74  13.67      0.8564       5.395          2.956         2.504            4.869
14.11  14.26      0.8722       5.52           3.168         2.688            5.219
16.44  15.25      0.888        5.884          3.505         1.969            5.533
16.14  14.99      0.9034       5.658          3.562         1.355            5.175
- Try prediction with the UDF and Client options.
Prediction with the UDF option:
output = predictor.predict(data_test, mode="UDF", content_type='csv')
output
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Output
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          {"predictions": [{"predicted_label": 1.0}]}
14.7   14.21      0.9153       5.205          3.466         1.767            4.649          {"predictions": [{"predicted_label": 1.0}]}
12.74  13.67      0.8564       5.395          2.956         2.504            4.869          {"predictions": [{"predicted_label": 1.0}]}
13.16  13.82      0.8662       5.454          2.975         0.8551           5.056          {"predictions": [{"predicted_label": 1.0}]}
16.2   15.27      0.8734       5.826          3.464         2.823            5.527          {"predictions": [{"predicted_label": 1.0}]}
17.08  15.38      0.9079       5.832          3.683         2.956            5.484          {"predictions": [{"predicted_label": 1.0}]}
14.09  14.41      0.8529       5.717          3.186         3.92             5.299          {"predictions": [{"predicted_label": 1.0}]}
13.99  13.83      0.9183       5.119          3.383         5.234            4.781          {"predictions": [{"predicted_label": 3.0}]}
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          {"predictions": [{"predicted_label": 1.0}]}
14.29  14.09      0.905        5.291          3.337         2.699            4.825          {"predictions": [{"predicted_label": 1.0}]}
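To use these predictions programmatically, the JSON in the Output column can be parsed on the client side. A minimal sketch, assuming output behaves like a teradataml DataFrame with a to_pandas() method and that the Output column holds the JSON strings shown above:

import json

# Pull the scored rows to the client and extract the numeric label
# from each row's JSON payload.
pdf = output.to_pandas()
pdf["predicted"] = pdf["Output"].apply(
    lambda s: json.loads(s)["predictions"][0]["predicted_label"])
print(pdf["predicted"].head())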
Prediction with the Client option:
output = predictor.predict(data_test, mode="client", content_type='csv')
output
The output:[['{"predictions": [{"predicted_label": 1.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}]}']]
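The CSVDeserializer splits the endpoint's single JSON response on commas, which is why the fragments above look truncated. A minimal sketch that rejoins and parses them, assuming the nested-list structure shown above:

import json

# output[0] holds the comma-split fragments of one JSON document;
# rejoining them restores valid JSON.
parsed = json.loads(",".join(output[0]))
labels = [p["predicted_label"] for p in parsed["predictions"]]
print(labels[:10])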
- Clean up.
predictor.cloudObj.delete_model()
predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)