This use case shows the steps to use SageMaker kNN with tdapiclient.
You can download the attached aws-usecases.zip file for reference. The knn folder in the zip file includes a Jupyter notebook (.ipynb) and a data file (.csv) containing the dataset required to run this use case.
- Import necessary libraries.
import os
import getpass

import pandas as pd

from tdapiclient import create_tdapi_context, TDApiClient, remove_tdapi_context
from teradataml import create_context, DataFrame, copy_to_sql, load_example_data
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
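Optionally, confirm that the session works before proceeding. A minimal sketch using teradataml's db_list_tables helper (assuming the logged-in user's default database contains at least one object):

from teradataml import db_list_tables

# List objects in the default database; this fails fast if the
# Vantage session was not established correctly.
print(db_list_tables())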
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
- Set up data to be used for this workflow.
- Read the wheat seeds dataset.
df = pd.read_csv("seeds.csv")
df
The output:

     Area   Perimeter  Compactness  Kernel.Length  Kernel.Width  Asymmetry.Coeff  Kernel.Groove  Type
0    15.26  14.84      0.8710       5.763          3.312         2.221            5.220          1
1    14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1
2    14.29  14.09      0.9050       5.291          3.337         2.699            4.825          1
3    13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1
4    16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1
..   ...    ...        ...          ...            ...           ...              ...            ...
194  12.19  13.20      0.8783       5.137          2.981         3.631            4.870          3
195  11.23  12.88      0.8511       5.140          2.795         4.325            5.003          3
196  13.20  13.66      0.8883       5.236          3.232         8.315            5.056          3
197  11.84  13.21      0.8521       5.175          2.836         3.598            5.044          3
198  12.30  13.34      0.8684       5.243          2.974         5.637            5.063          3

199 rows × 8 columns
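As an optional sanity check (not part of the original notebook), verify the class balance of the target column with pandas:

# The wheat seeds dataset has three varieties, labeled 1-3 in "Type".
print(df["Type"].value_counts())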
- Rename columns for creating teradataml DataFrame.
df.rename(columns={'Kernel.Length':'Kernel_Length', 'Kernel.Width':'Kernel_Width', 'Kernel.Groove':'Kernel_Groove', 'Asymmetry.Coeff':'Asymmetry_Coeff'}, inplace=True)
- Insert the DataFrame into a table.
data_table = "wheat_data"
column_types = {'Area': FLOAT, 'Perimeter': FLOAT, 'Compactness': FLOAT,
                'Kernel_Length': FLOAT, 'Kernel_Width': FLOAT,
                'Asymmetry_Coeff': FLOAT, 'Kernel_Groove': FLOAT,
                'Type': INTEGER}
copy_to_sql(df=df, table_name=data_table, if_exists="replace", types=column_types)
- Create a teradataml DataFrame using the table.
data = DataFrame(table_name=data_table)
data
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1
16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1
14.69  14.49      0.8799       5.563          3.259         3.586            5.219          1
16.44  15.25      0.888        5.884          3.505         1.969            5.533          1
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          1
13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1
14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1
15.26  14.84      0.871        5.763          3.312         2.221            5.22           1
- Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
data_sample = data.sample(frac=[0.8, 0.2])
data_sample
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type  sampleid
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1     2
16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1     1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1     1
14.69  14.49      0.8799       5.563          3.259         3.586            5.219          1     1
16.44  15.25      0.888        5.884          3.505         1.969            5.533          1     1
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1     1
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          1     1
13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1     1
14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1     1
15.26  14.84      0.871        5.763          3.312         2.221            5.22           1     1
- Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
data_train = data_sample[data_sample.sampleid == "1"].drop("sampleid", axis = 1)
data_train
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type
13.84  13.94      0.8955       5.324          3.379         2.259            4.805          1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          1
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1
13.74  14.05      0.8744       5.482          3.114         2.932            4.825          1
14.59  14.28      0.8993       5.351          3.333         4.185            4.781          1
13.89  14.02      0.888        5.439          3.199         3.986            4.738          1
16.14  14.99      0.9034       5.658          3.562         1.355            5.175          1
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1
14.88  14.57      0.8811       5.554          3.333         1.018            4.956          1
- Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
data_test = data_sample[data_sample.sampleid == "2"].drop("sampleid", axis = 1)
data_test
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Type
15.26  14.85      0.8696       5.714          3.242         4.543            5.314          1
13.74  14.05      0.8744       5.482          3.114         2.932            4.825          1
13.02  13.76      0.8641       5.395          3.026         3.373            4.825          1
13.94  14.17      0.8728       5.585          3.15          2.124            5.012          1
14.8   14.52      0.8823       5.656          3.288         3.112            5.309          1
13.16  13.55      0.9009       5.138          3.201         2.461            4.783          1
17.08  15.38      0.9079       5.832          3.683         2.956            5.484          1
13.89  14.02      0.888        5.439          3.199         3.986            4.738          1
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          1
14.29  14.09      0.905        5.291          3.337         2.699            4.825          1
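Optionally, verify the split. A minimal sketch, assuming the teradataml DataFrame exposes the shape property (available in recent teradataml releases):

# Roughly 80% of the 199 rows should land in the train set.
print("train:", data_train.shape)
print("test: ", data_test.shape)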
- Create kNN SageMaker instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
knn = td_apiclient.KNN(
    role=exec_role_arn,
    instance_count=1,
    instance_type="ml.m5.large",
    k=3,
    sample_size=30,
    predictor_type="classifier"
)
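In SageMaker's built-in kNN algorithm, k is the number of nearest neighbors consulted at prediction time, sample_size is the number of training points sampled to build the index, and predictor_type selects classification rather than regression.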
- Prepare data for kNN.
- Convert teradataml DataFrame to NumPy ndarray.
train_data = data_train.drop('Type', axis=1).get_values()
label_train = data_train.get_values()
label_train = label_train[:, 7]
label_train = label_train.astype('float32')
train_data = train_data.astype('float32')
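Optionally, confirm the shapes and dtypes before building the RecordSet; SageMaker's built-in kNN expects float32 features and labels:

# train_data should be (n_rows, 7) float32 and label_train (n_rows,) float32.
print(train_data.shape, train_data.dtype)
print(label_train.shape, label_train.dtype)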
- Convert NumPy ndarray to RecordSet object to be passed to fit method.
training_data_recordset = knn.record_set(train=train_data, labels=label_train)
- Start training using RecordSet objects.
knn.fit(training_data_recordset)
- Create a serializer and a deserializer so the predictor can handle CSV input and output.
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
predictor = knn.deploy("aws-endpoint",
                       sagemaker_kw_args={"instance_type": "ml.m5.large",
                                          "initial_instance_count": 1,
                                          "serializer": csv_ser,
                                          "deserializer": csv_dser})
- Score the model using the teradataml DataFrame and the predictor object created in the previous step.
- Show the content type accepted by the predictor object.
print(predictor.cloudObj.accept)
The output:('text/csv',)
- Prepare test DataFrame by dropping target variable "Type".
data_test = data_test.drop("Type", axis=1)
- Show the DataFrame.
data_test
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove
14.03  14.16      0.8796       5.438          3.201         1.717            5.001
15.88  14.9       0.8988       5.618          3.507         0.7651           5.091
12.08  13.23      0.8664       5.099          2.936         1.415            4.961
16.19  15.16      0.8849       5.833          3.421         0.903            5.307
17.08  15.38      0.9079       5.832          3.683         2.956            5.484
15.36  14.76      0.8861       5.701          3.393         1.367            5.132
12.74  13.67      0.8564       5.395          2.956         2.504            4.869
14.11  14.26      0.8722       5.52           3.168         2.688            5.219
16.44  15.25      0.888        5.884          3.505         1.969            5.533
16.14  14.99      0.9034       5.658          3.562         1.355            5.175
- Try prediction with the UDF and Client options.
Prediction with the UDF option:
output = predictor.predict(data_test, mode="UDF", content_type='csv')
output
The output:

Area   Perimeter  Compactness  Kernel_Length  Kernel_Width  Asymmetry_Coeff  Kernel_Groove  Output
16.63  15.46      0.8747       6.053          3.465         2.04             5.877          {"predictions": [{"predicted_label": 1.0}]}
14.7   14.21      0.9153       5.205          3.466         1.767            4.649          {"predictions": [{"predicted_label": 1.0}]}
12.74  13.67      0.8564       5.395          2.956         2.504            4.869          {"predictions": [{"predicted_label": 1.0}]}
13.16  13.82      0.8662       5.454          2.975         0.8551           5.056          {"predictions": [{"predicted_label": 1.0}]}
16.2   15.27      0.8734       5.826          3.464         2.823            5.527          {"predictions": [{"predicted_label": 1.0}]}
17.08  15.38      0.9079       5.832          3.683         2.956            5.484          {"predictions": [{"predicted_label": 1.0}]}
14.09  14.41      0.8529       5.717          3.186         3.92             5.299          {"predictions": [{"predicted_label": 1.0}]}
13.99  13.83      0.9183       5.119          3.383         5.234            4.781          {"predictions": [{"predicted_label": 3.0}]}
14.38  14.21      0.8951       5.386          3.312         2.462            4.956          {"predictions": [{"predicted_label": 1.0}]}
14.29  14.09      0.905        5.291          3.337         2.699            4.825          {"predictions": [{"predicted_label": 1.0}]}
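To use these predictions programmatically, the JSON in the Output column can be parsed on the client side. A minimal sketch, assuming output behaves like a teradataml DataFrame with a to_pandas() method and that the Output column holds the JSON strings shown above:

import json

# Pull the scored rows to the client and extract the numeric label
# from each row's JSON payload.
pdf = output.to_pandas()
pdf["predicted"] = pdf["Output"].apply(
    lambda s: json.loads(s)["predictions"][0]["predicted_label"])
print(pdf["predicted"].head())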
Prediction with the Client option:
output = predictor.predict(data_test, mode="client", content_type='csv')
output
The output:[['{"predictions": [{"predicted_label": 1.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 2.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 1.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}', ' {"predicted_label": 3.0}]}']]
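The CSVDeserializer splits the endpoint's single JSON response on commas, which is why the fragments above look truncated. A minimal sketch that rejoins and parses them, assuming the nested-list structure shown above:

import json

# output[0] holds the comma-split fragments of one JSON document;
# rejoining them restores valid JSON.
parsed = json.loads(",".join(output[0]))
labels = [p["predicted_label"] for p in parsed["predictions"]]
print(labels[:10])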
- Clean up.
predictor.cloudObj.delete_model()
predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)