Using SageMaker kNN with tdapiclient

Teradata Vantage™ - API Integration Guide for Cloud Machine Learning

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Vantage
Release Number: 1.4
Published: September 2023

This use case shows the steps to use SageMaker kNN with tdapiclient.

You can download the aws-usecases.zip file attached to this page as a reference. The knn folder in the zip file includes a Jupyter notebook (.ipynb) and a data file (.csv) containing the dataset required to run this use case.

  1. Import the necessary libraries.
    import os
    import getpass
    import pandas as pd
    from tdapiclient import create_tdapi_context, TDApiClient, remove_tdapi_context
    from teradataml import create_context, DataFrame, copy_to_sql
    from teradatasqlalchemy.types import *
  2. Create the connection.
    host = input("Host: ")
    username = input("Username: ")
    password = getpass.getpass("Password: ")
    td_context = create_context(host=host, username=username, password=password)
  3. Create TDAPI context and TDApiClient object.
    s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
    access_id = input("Access ID: ")
    access_key = getpass.getpass("Access Key: ")
    region = input("AWS Region: ")
    os.environ["AWS_ACCESS_KEY_ID"] = access_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
    os.environ["AWS_REGION"] = region
    tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
    td_apiclient = TDApiClient(tdapi_context)
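    If your AWS account uses temporary (STS) credentials, the AWS SDK also expects a session token. A minimal addition, assuming such credentials (the prompt text is illustrative):
    session_token = getpass.getpass("AWS Session Token: ")
    os.environ["AWS_SESSION_TOKEN"] = session_token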
  4. Set up data to be used for this workflow.
    1. Read the wheat seeds dataset.
      df = pd.read_csv("seeds.csv")
      df
      The output:
      	Area	Perimeter	Compactness	Kernel.Length	Kernel.Width	Asymmetry.Coeff	Kernel.Groove	Type
      0	15.26	14.84	0.8710	5.763	3.312	2.221	5.220	1
      1	14.88	14.57	0.8811	5.554	3.333	1.018	4.956	1
      2	14.29	14.09	0.9050	5.291	3.337	2.699	4.825	1
      3	13.84	13.94	0.8955	5.324	3.379	2.259	4.805	1
      4	16.14	14.99	0.9034	5.658	3.562	1.355	5.175	1
      ...	...	...	...	...	...	...	...	...
      194	12.19	13.20	0.8783	5.137	2.981	3.631	4.870	3
      195	11.23	12.88	0.8511	5.140	2.795	4.325	5.003	3
      196	13.20	13.66	0.8883	5.236	3.232	8.315	5.056	3
      197	11.84	13.21	0.8521	5.175	2.836	3.598	5.044	3
      198	12.30	13.34	0.8684	5.243	2.974	5.637	5.063	3
      199 rows × 8 columns
    2. Rename the columns (replacing the dots) so they are valid Vantage column names for the teradataml DataFrame.
      df.rename(columns={'Kernel.Length': 'Kernel_Length',
                         'Kernel.Width': 'Kernel_Width',
                         'Kernel.Groove': 'Kernel_Groove',
                         'Asymmetry.Coeff': 'Asymmetry_Coeff'},
                inplace=True)
    3. Insert the DataFrame into a table.
      data_table="wheat_data"
      column_types ={'Area': FLOAT, 
                     'Perimeter':FLOAT, 
                     'Compactness':FLOAT,
                     'Kernel_Length':FLOAT, 
                     'Kernel_Width':FLOAT,                
                     'Asymmetry_Coeff':FLOAT,
                     'Kernel_Groove':FLOAT, 'Type':INTEGER}
      copy_to_sql(df=df, table_name=data_table, if_exists="replace", types=column_types)
    4. Create a teradataml DataFrame using the table.
      data = DataFrame(table_name=data_table)
      data
      The output:
      Area	Perimeter	Compactness	Kernel_Length	Kernel_Width	Asymmetry_Coeff	Kernel_Groove	Type
      14.29	14.09	0.905	5.291	3.337	2.699	4.825	1
      16.14	14.99	0.9034	5.658	3.562	1.355	5.175	1
      14.38	14.21	0.8951	5.386	3.312	2.462	4.956	1
      14.69	14.49	0.8799	5.563	3.259	3.586	5.219	1
      16.44	15.25	0.888	5.884	3.505	1.969	5.533	1
      15.26	14.85	0.8696	5.714	3.242	4.543	5.314	1
      16.63	15.46	0.8747	6.053	3.465	2.04	5.877	1
      13.84	13.94	0.8955	5.324	3.379	2.259	4.805	1
      14.88	14.57	0.8811	5.554	3.333	1.018	4.956	1
      15.26	14.84	0.871	5.763	3.312	2.221	5.22	1
      
    5. Create two samples of the input data: sample 1 has 80% of the total rows and sample 2 has 20%.
      data_sample = data.sample(frac=[0.8, 0.2])
      data_sample
      The output:
      Area	Perimeter	Compactness	Kernel_Length	Kernel_Width	Asymmetry_Coeff	Kernel_Groove	Type	sampleid
      14.29	14.09	0.905	5.291	3.337	2.699	4.825	1	2
      16.14	14.99	0.9034	5.658	3.562	1.355	5.175	1	1
      14.38	14.21	0.8951	5.386	3.312	2.462	4.956	1	1
      14.69	14.49	0.8799	5.563	3.259	3.586	5.219	1	1
      16.44	15.25	0.888	5.884	3.505	1.969	5.533	1	1
      15.26	14.85	0.8696	5.714	3.242	4.543	5.314	1	1
      16.63	15.46	0.8747	6.053	3.465	2.04	5.877	1	1
      13.84	13.94	0.8955	5.324	3.379	2.259	4.805	1	1
      14.88	14.57	0.8811	5.554	3.333	1.018	4.956	1	1
      15.26	14.84	0.871	5.763	3.312	2.221	5.22	1	1
    6. Create the training dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
      data_train = data_sample[data_sample.sampleid == "1"].drop("sampleid", axis=1)
      data_train
      The output:
      Area	Perimeter	Compactness	Kernel_Length	Kernel_Width	Asymmetry_Coeff	Kernel_Groove	Type
      13.84	13.94	0.8955	5.324	3.379	2.259	4.805	1
      14.38	14.21	0.8951	5.386	3.312	2.462	4.956	1
      16.63	15.46	0.8747	6.053	3.465	2.04	5.877	1
      15.26	14.85	0.8696	5.714	3.242	4.543	5.314	1
      13.74	14.05	0.8744	5.482	3.114	2.932	4.825	1
      14.59	14.28	0.8993	5.351	3.333	4.185	4.781	1
      13.89	14.02	0.888	5.439	3.199	3.986	4.738	1
      16.14	14.99	0.9034	5.658	3.562	1.355	5.175	1
      14.29	14.09	0.905	5.291	3.337	2.699	4.825	1
      14.88	14.57	0.8811	5.554	3.333	1.018	4.956	1
    7. Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring. (A sanity check on the split sizes follows the output below.)
      data_test = data_sample[data_sample.sampleid == "2"].drop("sampleid", axis=1)
      data_test
      The output:
      Area	Perimeter	Compactness	Kernel_Length	Kernel_Width	Asymmetry_Coeff	Kernel_Groove	Type
      15.26	14.85	0.8696	5.714	3.242	4.543	5.314	1
      13.74	14.05	0.8744	5.482	3.114	2.932	4.825	1
      13.02	13.76	0.8641	5.395	3.026	3.373	4.825	1
      13.94	14.17	0.8728	5.585	3.15	2.124	5.012	1
      14.8	14.52	0.8823	5.656	3.288	3.112	5.309	1
      13.16	13.55	0.9009	5.138	3.201	2.461	4.783	1
      17.08	15.38	0.9079	5.832	3.683	2.956	5.484	1
      13.89	14.02	0.888	5.439	3.199	3.986	4.738	1
      14.38	14.21	0.8951	5.386	3.312	2.462	4.956	1
      14.29	14.09	0.905	5.291	3.337	2.699	4.825	1
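    As a quick sanity check on the split, you can compare the sizes of the two DataFrames. A minimal sketch, assuming the shape property is available in your teradataml version:
      # Roughly 80% of the rows should land in the training set.
      print("train:", data_train.shape, "test:", data_test.shape)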
      
  5. Create a kNN SageMaker instance through tdapiclient.
    exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
    knn = td_apiclient.KNN(
        role=exec_role_arn,
        instance_count=1,
        instance_type="ml.m5.large",
        k=3,
        sample_size=30,
        predictor_type="classifier"
    )
  6. Prepare data for kNN.
    1. Convert the teradataml DataFrame to NumPy ndarrays. (An alternative that selects the label column by name is sketched after this step.)
      train_data = data_train.drop('Type', axis=1).get_values()
      label_train = data_train.get_values()
      label_train = label_train[:, 7]  # column index 7 is "Type"
      label_train = label_train.astype('float32')
      train_data = train_data.astype('float32')
    2. Convert the NumPy ndarrays to a RecordSet object to be passed to the fit method.
      training_data_recordset = knn.record_set(train=train_data, labels=label_train)
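    The alternative mentioned in sub-step 1 selects the label column by name instead of relying on its positional index. A minimal sketch, assuming the standard teradataml select method:
      # Select "Type" by name so the code does not depend on column order.
      label_train = data_train.select(['Type']).get_values().flatten().astype('float32')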
  7. Start training using the RecordSet object.
    knn.fit(training_data_recordset)
  8. Create a serializer and a deserializer so the predictor can handle CSV input and output. (A brief illustration of the serializer follows this step.)
    from sagemaker.serializers import CSVSerializer
    from sagemaker.deserializers import CSVDeserializer
    csv_ser = CSVSerializer()
    csv_dser = CSVDeserializer()
    predictor = knn.deploy("aws-endpoint",
                           sagemaker_kw_args={"instance_type": "ml.m5.large",
                                              "initial_instance_count": 1,
                                              "serializer": csv_ser,
                                              "deserializer": csv_dser})
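    To see what the serializer produces, you can exercise it on one row of features. A small illustration, assuming standard sagemaker serializer behavior (the values are taken from the first row of the dataset):
    # CSVSerializer turns array-like input into the CSV payload sent to the endpoint.
    print(csv_ser.serialize([15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]))
    # Expected (assumption): 15.26,14.84,0.871,5.763,3.312,2.221,5.22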
  9. Score the model using the teradataml DataFrame and the predictor object created in the previous step.
    1. Show the content type accepted by the predictor object created in the previous step.
      print(predictor.cloudObj.accept)
      The output:
      ('text/csv',)
    2. Prepare the test DataFrame by dropping the target variable "Type".
      data_test = data_test.drop("Type", axis=1)
    3. Show the DataFrame.
      data_test
      The output:
      Area	Perimeter	Compactness	Kernel_Length	Kernel_Width	Asymmetry_Coeff	Kernel_Groove
      14.03	14.16	0.8796	5.438	3.201	1.717	5.001
      15.88	14.9	0.8988	5.618	3.507	0.7651	5.091
      12.08	13.23	0.8664	5.099	2.936	1.415	4.961
      16.19	15.16	0.8849	5.833	3.421	0.903	5.307
      17.08	15.38	0.9079	5.832	3.683	2.956	5.484
      15.36	14.76	0.8861	5.701	3.393	1.367	5.132
      12.74	13.67	0.8564	5.395	2.956	2.504	4.869
      14.11	14.26	0.8722	5.52	3.168	2.688	5.219
      16.44	15.25	0.888	5.884	3.505	1.969	5.533
      16.14	14.99	0.9034	5.658	3.562	1.355	5.175
    4. Try prediction with the UDF and Client options. (A sketch that parses the returned predictions follows this step.)
      Prediction with UDF option:
      output = predictor.predict(data_test, mode="UDF",content_type='csv')
      output
      The output:
      Area	Perimeter	Compactness	Kernel_Length	Kernel_Width	Asymmetry_Coeff	Kernel_Groove	Output
      16.63	15.46	0.8747	6.053	3.465	2.04	5.877	{"predictions": [{"predicted_label": 1.0}]}
      14.7	14.21	0.9153	5.205	3.466	1.767	4.649	{"predictions": [{"predicted_label": 1.0}]}
      12.74	13.67	0.8564	5.395	2.956	2.504	4.869	{"predictions": [{"predicted_label": 1.0}]}
      13.16	13.82	0.8662	5.454	2.975	0.8551	5.056	{"predictions": [{"predicted_label": 1.0}]}
      16.2	15.27	0.8734	5.826	3.464	2.823	5.527	{"predictions": [{"predicted_label": 1.0}]}
      17.08	15.38	0.9079	5.832	3.683	2.956	5.484	{"predictions": [{"predicted_label": 1.0}]}
      14.09	14.41	0.8529	5.717	3.186	3.92	5.299	{"predictions": [{"predicted_label": 1.0}]}
      13.99	13.83	0.9183	5.119	3.383	5.234	4.781	{"predictions": [{"predicted_label": 3.0}]}
      14.38	14.21	0.8951	5.386	3.312	2.462	4.956	{"predictions": [{"predicted_label": 1.0}]}
      14.29	14.09	0.905	5.291	3.337	2.699	4.825	{"predictions": [{"predicted_label": 1.0}]}
      Prediction with Client option:
      output = predictor.predict(data_test, mode="client",content_type='csv')
      output
      The output:
      [['{"predictions": [{"predicted_label": 1.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 1.0}',
        ' {"predicted_label": 1.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 1.0}',
        ' {"predicted_label": 1.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 2.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 1.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}',
        ' {"predicted_label": 3.0}]}']]
  10. Clean up.
    predictor.cloudObj.delete_model()
    predictor.cloudObj.delete_endpoint()
    remove_tdapi_context(tdapi_context)
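    A SageMaker endpoint accrues charges while it runs, so you may want to guarantee that these cleanup calls execute even if scoring fails. A minimal sketch using the same calls as above:
    try:
        output = predictor.predict(data_test, mode="UDF", content_type="csv")
    finally:
        # Always remove the model, the endpoint, and the TDAPI context.
        predictor.cloudObj.delete_model()
        predictor.cloudObj.delete_endpoint()
        remove_tdapi_context(tdapi_context)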