Using SageMaker KMeans Estimator with tdapiclient

Teradata Vantage™ - API Integration Guide for Cloud Machine Learning

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Vantage
Release Number: 1.4
Published: September 2023
Language: English (United States)
Last Update: 2023-09-28

This use case shows the steps to use SageMaker KMeans Estimator with tdapiclient.

You can download the attached aws-usecases.zip file as a reference. The kmeans folder in the zip file includes a Jupyter notebook file (.ipynb) for this use case.

  1. Import necessary packages.
    import os
    import getpass
    from tdapiclient import create_tdapi_context, remove_tdapi_context, TDApiClient
    from teradataml import create_context, DataFrame, copy_to_sql, load_example_data
    import pandas as pd
    from teradatasqlalchemy.types import *
  2. Create the connection.
    host = input("Host: ")
    username = input("Username: ")
    password = getpass.getpass("Password: ")
    td_context = create_context(host=host, username=username, password=password)
  3. Create TDAPI context and TDApiClient object.
    s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
    access_id = input("Access ID: ")
    access_key = getpass.getpass("Access Key: ")
    region = input("AWS Region: ")
    os.environ["AWS_ACCESS_KEY_ID"] = access_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
    os.environ["AWS_REGION"] = region
    tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
    td_apiclient = TDApiClient(tdapi_context)
  4. Set up data.
    1. Load the example 'iris' data from teradataml.
      load_example_data("byom", "iris_input")
      iris_input = DataFrame("iris_input")
    2. Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
      iris_sample = iris_input.sample(frac=[0.8, 0.2])
    3. Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
      iris_train = iris_sample[iris_sample.sampleid == "1"].drop("sampleid", axis = 1)
      iris_train
      The output:
      id	sepal_length	sepal_width	petal_length	petal_width	species
      120	6.0	2.2	5.0	1.5	3
      118	7.7	3.8	6.7	2.2	3
      15	5.8	4.0	1.2	0.2	1
      61	5.0	2.0	3.5	1.0	2
      19	5.7	3.8	1.7	0.3	1
      80	5.7	2.6	3.5	1.0	2
      59	6.6	2.9	4.6	1.3	2
      38	4.9	3.6	1.4	0.1	1
      40	5.1	3.4	1.5	0.2	1
      99	5.1	2.5	3.0	1.1	2
    4. Create the test dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
      iris_test = iris_sample[iris_sample.sampleid == "2"].drop("sampleid", axis = 1)
      iris_test
      The output:
      id	sepal_length	sepal_width	petal_length	petal_width	species
      133	6.4	2.8	5.6	2.2	3
      144	6.8	3.2	5.9	2.3	3
      18	5.1	3.5	1.4	0.3	1
      17	5.4	3.9	1.3	0.4	1
      32	5.4	3.4	1.5	0.4	1
      45	5.1	3.8	1.9	0.4	1
      106	7.6	3.0	6.6	2.1	3
      125	6.7	3.3	5.7	2.1	3
      137	6.3	3.4	5.6	2.4	3
      108	7.3	2.9	6.3	1.8	3
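The `sample(frac=[0.8, 0.2])` call in sub-step 2 tags each row with a `sampleid` of 1 or 2 rather than physically splitting the data; the filters in sub-steps 3 and 4 then select one tag each. A plain-Python sketch of the same idea (illustrative only, with a hypothetical `tag_samples` helper, not the teradataml implementation):

```python
import random

def tag_samples(row_ids, fractions, seed=42):
    """Assign each row id a 1-based sampleid according to the given fractions."""
    rng = random.Random(seed)
    ids = list(row_ids)
    rng.shuffle(ids)
    tags = {}
    start = 0
    for sampleid, frac in enumerate(fractions, start=1):
        end = start + round(len(ids) * frac)
        for rid in ids[start:end]:
            tags[rid] = sampleid
        start = end
    return tags

tags = tag_samples(range(150), [0.8, 0.2])  # 150 rows, like the iris dataset
train_ids = [rid for rid, s in tags.items() if s == 1]  # 120 rows
test_ids = [rid for rid, s in tags.items() if s == 2]   # 30 rows
```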
  5. Define the bucket locations.
    # Bucket location where your custom code is saved in tar.gz format.
    custom_code_upload_location = "s3://{}/Kmeans/code".format(s3_bucket)
    # Bucket location where results of model training are saved.
    model_artifacts_location = "s3://{}/Kmeans/artifacts".format(s3_bucket)
  6. Create KMeans SageMaker estimator instance through tdapiclient.
    exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
    # Create an object based on the SageMaker KMeans class
    kmeans = td_apiclient.KMeans(
        role=exec_role_arn,
        instance_count=1,
        instance_type="ml.m5.large",
        k=3,
        output_path=model_artifacts_location,
        code_location=custom_code_upload_location,
    )
  7. Prepare data for KMeans.
    1. Convert the teradataml DataFrame to pandas, and then to a NumPy ndarray.
      train_pd = iris_train.to_pandas()
      features = train_pd.columns.drop('species')
      train_data = train_pd[features].values.astype('float32')
    2. Convert the NumPy ndarray to a RecordSet object to be passed to the fit method.
      obj = kmeans.record_set(train_data)
      type(obj)
      type(obj)
  8. Start training using RecordSet objects.
    kmeans.fit(obj, mini_batch_size=10)
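The mini_batch_size argument reflects that SageMaker trains the web-scale (mini-batch) variant of k-means: each update draws a small batch, assigns each point to its nearest center, and nudges that center toward the point with a decaying learning rate. A pure-Python sketch of the idea (illustrative only, not the SageMaker implementation):

```python
import random

def mini_batch_kmeans(points, k=3, batch_size=10, steps=200, seed=0):
    """Toy mini-batch k-means; per-center learning rate decays as 1/count."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(steps):
        batch = [rng.choice(points) for _ in range(batch_size)]
        for p in batch:
            # Index of the closest center (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                      for d in range(len(p))))
            counts[j] += 1
            eta = 1.0 / counts[j]  # decaying per-center learning rate
            for d in range(len(p)):
                centers[j][d] = (1 - eta) * centers[j][d] + eta * p[d]
    return centers
```

With k=3 and mini_batch_size=10, the estimator above runs the managed, distributed version of this loop on the training instance.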
  9. Create a Serializer and Deserializer so the predictor can handle CSV input and output.
    from sagemaker.serializers import CSVSerializer
    from sagemaker.deserializers import CSVDeserializer
    csv_ser = CSVSerializer()
    csv_dser = CSVDeserializer()
    kmeans_predictor = kmeans.deploy("aws-endpoint",
                                     sagemaker_kw_args={"instance_type": "ml.m5.large", "initial_instance_count": 1, "serializer": csv_ser, "deserializer": csv_dser})
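CSVSerializer flattens array-like rows into CSV text for the request body, and CSVDeserializer parses the CSV response back into nested lists of strings. A stdlib-only illustration of that round trip (a sketch of the behavior, not the sagemaker classes themselves):

```python
import csv
import io

def to_csv(rows):
    """Serialize rows of values into CSV text, as a CSV serializer would."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

def from_csv(text):
    """Parse CSV text back into lists of strings, as a CSV deserializer would."""
    return [row for row in csv.reader(io.StringIO(text)) if row]

body = to_csv([[5.1, 3.5, 1.4, 0.3], [6.4, 2.8, 5.6, 2.2]])
rows = from_csv(body)  # [['5.1', '3.5', '1.4', '0.3'], ['6.4', '2.8', '5.6', '2.2']]
```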
  10. Score the model using a teradataml DataFrame and the predictor object created in the previous step.
    1. Prepare test DataFrame by dropping 'id' and 'species' columns.
      iris_test = iris_test.drop("species", axis = 1)
      iris_test = iris_test.drop("id", axis = 1)
    2. Show the DataFrame.
      iris_test
      The output:
      sepal_length	sepal_width	petal_length	petal_width
      7.7	3.8	6.7	2.2
      6.3	3.4	5.6	2.4
      6.9	3.1	4.9	1.5
      6.6	2.9	4.6	1.3
      6.1	2.8	4.7	1.2
      5.7	2.5	5.0	2.0
      6.1	2.8	4.0	1.3
      6.4	2.7	5.3	1.9
      6.4	3.2	5.3	2.3
      5.8	4.0	1.2	0.2
    3. Try prediction with UDF and Client options.
      Prediction with UDF option:
      output = kmeans_predictor.predict(iris_test, mode="UDF", content_type="csv")
      output
      The output:
      sepal_length	sepal_width	petal_length	petal_width	Output
      6.9	3.1	4.9	1.5	{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.04985511302948}]}
      6.7	3.1	4.7	1.5	{"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.9353123903274536}]}
      7.6	3.0	6.6	2.1	{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.1450010538101196}]}
      5.0	2.0	3.5	1.0	{"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 1.527915596961975}]}
      5.7	2.9	4.2	1.3	{"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.3593007028102875}]}
      5.6	2.7	4.2	1.3	{"predictions": [{"closest_cluster": 0.0, "distance_to_cluster": 0.38576072454452515}]}
      6.4	2.8	5.6	2.2	{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 0.5681478977203369}]}
      5.1	3.3	1.7	0.5	{"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.35348325967788696}]}
      6.7	3.0	5.0	1.7	{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 0.8753966093063354}]}
      5.4	3.7	1.5	0.2	{"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.4583273231983185}]}
      
      Prediction with Client option:
      output = kmeans_predictor.predict(iris_test, mode="client", content_type="csv")
      output
      The output:
      [['0.0', '1.5279181003570557'],
       ['2.0', '1.4551581144332886'],
       ['1.0', '0.35409244894981384'],
       ['0.0', '1.2355302572250366'],
       ['2.0', '0.6641027331352234'],
       ['1.0', '0.8323636054992676'],
       ['2.0', '0.7607014179229736'],
       ['2.0', '0.22456690669059753'],
       ['0.0', '0.8359512090682983'],
       ['2.0', '0.35871627926826477'],
       ['1.0', '0.21502123773097992'],
       ['2.0', '0.5929785370826721'],
       ['1.0', '0.44751715660095215'],
       ['1.0', '0.9462401866912842'],
       ['1.0', '0.7068477869033813'],
       ['1.0', '0.15458093583583832'],
       ['0.0', '0.7202901244163513'],
       ['0.0', '0.4245430529117584'],
       ['0.0', '1.199622631072998'],
       ['1.0', '0.48908916115760803'],
       ['0.0', '1.5461325645446777'],
       ['0.0', '0.7100999355316162'],
       ['2.0', '0.2805166244506836'],
       ['0.0', '0.7256220579147339'],
       ['2.0', '0.6365150213241577'],
       ['0.0', '0.7958523035049438'],
       ['1.0', '0.657414436340332'],
       ['0.0', '0.9718656539916992'],
       ['2.0', '0.5427298545837402'],
       ['1.0', '0.3770846724510193']]
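In UDF mode each row's Output column is a JSON document, while client mode returns rows of strings; both can be turned into typed values with the standard library. A hedged sketch, assuming the output shapes shown above (the helper names are illustrative, not part of tdapiclient):

```python
import json

def parse_udf_output(output_json):
    """Extract (closest_cluster, distance) from a UDF-mode Output value."""
    pred = json.loads(output_json)["predictions"][0]
    return int(pred["closest_cluster"]), pred["distance_to_cluster"]

def parse_client_output(rows):
    """Convert client-mode string pairs into (cluster, distance) tuples."""
    return [(int(float(c)), float(d)) for c, d in rows]

cluster, dist = parse_udf_output(
    '{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 1.0498}]}')
# cluster == 2
pairs = parse_client_output([['0.0', '1.5279'], ['2.0', '1.4551']])
# pairs == [(0, 1.5279), (2, 1.4551)]
```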
  11. Clean up.
    kmeans_predictor.cloudObj.delete_model()
    kmeans_predictor.cloudObj.delete_endpoint()
    remove_tdapi_context(tdapi_context)