Using SageMaker Linear Learner with tdapiclient | API Integration - Using SageMaker Linear Learner with tdapiclient - Teradata Vantage

Teradata Vantageā„¢ - API Integration Guide for Cloud Machine Learning

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Vantage
Release Number
1.4
Published
September 2023
Language
English (United States)
Last Update
2023-09-28
dita:mapPath
mgu1643999543506.ditamap
dita:ditavalPath
ayr1485454803741.ditaval
dita:id
mgu1643999543506

This use case shows the steps to use SageMaker Linear Learner with tdapiclient.

You can download the aws-usecases.zip file in the attachment as a reference. The linearlearner folder in the zip file includes a Jupyter notebook file (ipynb) and a data file (csv) containing the dataset required to run this use case.

  1. Import necessary libraries.
    import getpass
    import sagemaker
    from tdapiclient import create_tdapi_context, TDApiClient
    from teradataml import create_context, DataFrame, copy_to_sql,load_example_data
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from teradatasqlalchemy.types import *
  2. Create the connection.
    host = input("Host: ")
    username = input("Username: ")
    password = getpass.getpass("Password: ")
    td_context = create_context(host=host, username=username, password=password)
  3. Create TDAPI context and TDApiClient object.
    s3_bucket = input("S3 Bucket(Please provide just the bucket name): ")
    access_id = input("Access ID:")
    access_key = getpass.getpass("Acess Key: ")
    region = input("AWS Region: ")
    os.environ["AWS_ACCESS_KEY_ID"] = access_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
    os.environ["AWS_REGION"] = region
    tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
    td_apiclient = TDApiClient(tdapi_context)
  4. Set bucket locations.
    # Bucket location where your custom code will be saved in the tar.gz format.
    custom_code_upload_location = "s3://{}/LinearLearner/code".format(s3_bucket)
    # Bucket location where results of model training are saved.
    model_artifacts_location = "s3://{}/LinearLearner/artifacts".format(s3_bucket)
  5. Set up data to be used for this workflow.
    1. Read the wine quality dataset.
      df = pd.read_csv ("winequality-red.csv")
      df.head()
      The output:
      	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
      0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
      1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
      2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
      3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
      4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
    2. Rename columns for creating teradataml DataFrame.
      df.rename(columns={'fixed acidity':'fixed_acidity',
                               'citric acid':'citric_acid' ,
                               'residual sugar':'residual_sugar',
                               'volatile acidity': 'acidity_volatile',
                               'free sulfur dioxide': 'free_sulfur_dioxide',
                               'total sulfur dioxide':'total_sulfur_dioxide'}, 
                      inplace=True)
    3. Insert the dataframe in the table.
      data_table="wine_quality_data"
      column_types ={'fixed_acidity': FLOAT, 
                     'acidity_volatile':FLOAT, 
                     'citric_acid':FLOAT,
                     'residual_sugar':FLOAT, 
                     'chlorides':FLOAT, 
                     'free_sulfur_dioxide':FLOAT,
                     'total_sulfur_dioxide':FLOAT, 
                     'density':FLOAT, 
                     'pH':FLOAT, 
                     'sulphates':FLOAT, 
                     'alcohol':FLOAT,
                     'quality':INTEGER}
      copy_to_sql(df=df, table_name=data_table, if_exists="replace", types=column_types)
    4. Create a teradataml DataFrame using the table.
      data = DataFrame(table_name=data_table)
  6. Data cleaning.
    1. Categorize target variable that is 'quality' into 'good' or 'bad'.
      If the quality value is greater than 6.5, then the quality is considered 'good' (1); if quality value is less or equal to 6.5, it is represented as 'bad' (0).
      from teradataml.dataframe.sql_functions import case
      quality = data['quality']
      data_updated = data.assign(grade = case([(quality > 6.5, 1),
                                               (quality<6.5, 0)]))
    2. Drop unnecessary columns.
      data_updated=data_updated.drop("quality",axis=1)
      data
      The output:
      fixed_acidity	acidity_volatile	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	grade
      7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	0
      7.4	0.7	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	0
      7.4	0.66	0.0	1.8	0.075	13.0	40.0	0.9978	3.51	0.56	9.4	0
      7.9	0.6	0.06	1.6	0.069	15.0	59.0	0.9964	3.3	0.46	9.4	0
      7.8	0.58	0.02	2.0	0.073	9.0	18.0	0.9968	3.36	0.57	9.5	1
      7.5	0.5	0.36	6.1	0.071	17.0	102.0	0.9978	3.35	0.8	10.5	0
      7.3	0.65	0.0	1.2	0.065	15.0	21.0	0.9946	3.39	0.47	10.0	1
      11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	0
      7.8	0.88	0.0	2.6	0.098	25.0	67.0	0.9968	3.2	0.68	9.8	0
      7.4	0.7	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	0
      
    3. Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
      wine_sample = data_updated.sample(frac=[0.8, 0.2])
    4. Create train dataset from sample 1 by filtering on "sampleid" and drop "sampleid" column as it is not required for training model.
      wine_train = wine_sample[wine_sample.sampleid == "1"].drop("sampleid", axis = 1)
      wine_train
      The output:
      fixed_acidity	acidity_volatile	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	grade
      7.4	0.7	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	0
      7.9	0.6	0.06	1.6	0.069	15.0	59.0	0.9964	3.3	0.46	9.4	0
      7.3	0.65	0.0	1.2	0.065	15.0	21.0	0.9946	3.39	0.47	10.0	1
      7.8	0.58	0.02	2.0	0.073	9.0	18.0	0.9968	3.36	0.57	9.5	1
      5.6	0.615	0.0	1.6	0.089	16.0	59.0	0.9943	3.58	0.52	9.9	0
      7.8	0.61	0.29	1.6	0.114	9.0	29.0	0.9974	3.26	1.56	9.1	0
      6.7	0.58	0.08	1.8	0.0969999999999999	15.0	65.0	0.9959	3.28	0.54	9.2	0
      7.4	0.66	0.0	1.8	0.075	13.0	40.0	0.9978	3.51	0.56	9.4	0
      11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	0
      7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	0
      
    5. Create test dataset from sample 2 by filtering on "sampleid" and drop "sampleid" column as it is not required for scoring.
      wine_test = wine_sample[wine_sample.sampleid == "2"].drop("sampleid", axis = 1)
      wine_test
      The output:
      fixed_acidity	acidity_volatile	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	grade
      7.4	0.66	0.0	1.8	0.075	13.0	40.0	0.9978	3.51	0.56	9.4	0
      7.5	0.5	0.36	6.1	0.071	17.0	102.0	0.9978	3.35	0.8	10.5	0
      7.5	0.5	0.36	6.1	0.071	17.0	102.0	0.9978	3.35	0.8	10.5	0
      8.9	0.62	0.19	3.9	0.17	51.0	148.0	0.9986	3.17	0.93	9.2	0
      6.3	0.39	0.16	1.4	0.08	11.0	23.0	0.9955	3.34	0.56	9.3	0
      7.8	0.645	0.0	2.0	0.0819999999999999	8.0	16.0	0.9964	3.38	0.59	9.8	0
      8.1	0.56	0.28	1.7	0.368	16.0	56.0	0.9968	3.11	1.28	9.3	0
      7.9	0.6	0.06	1.6	0.069	15.0	59.0	0.9964	3.3	0.46	9.4	0
      7.4	0.7	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	0
      11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	0
    6. Convert teradataml DataFrame to NumPy ndarray.
      train_data=wine_train.drop('grade',axis=1).get_values()
      label_train=wine_train.get_values()
      label_train=label_train[:,11]
      label_train=label_train.astype('float32') 
      train_data=train_data.astype('float32')
  7. Create Linear Learner SageMaker estimator instance through tdapiclient.
    exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
    linear_learner = td_apiclient.LinearLearner(
              role=exec_role_arn,
              instance_count=1,
              instance_type="ml.m5.large",
              predictor_type='binary_classifier',
              output_path=model_artifacts_location,
              code_location=custom_code_upload_location,
              epochs=5
    )
  8. Convert NumPy ndarray to RecordSet object which will be passed to fit method.
    training_data_recordset = linear_learner.record_set(train=train_data,
                                                        labels=label_train)
  9. Start training using RecordSet objects.
    linear_learner.fit(training_data_recordset)
  10. Create Serializer and Deserializer, so predictor can handle CSV input and output.
    from sagemaker.serializers import CSVSerializer
    from sagemaker.deserializers import CSVDeserializer
    csv_ser = CSVSerializer()
    csv_dser = CSVDeserializer()
    predictor = linear_learner.deploy("aws-endpoint",
                                      sagemaker_kw_args={"instance_type": "ml.m5.large", "initial_instance_count": 1, "serializer": csv_ser, "deserializer": csv_dser})
  11. Try prediction integration using the predictor object created in previous step.
    1. Confirm that predictor is correctly configured for accepting csv input.
      print(predictor.cloudObj.accept)
      The output:
      ('text/csv',)
    2. Prepare test dataset by dropping target variable 'grade'.
      wine_test=wine_test.drop("grade",axis=1)
    3. Try prediction with UDF and Client options.
      Prediction with UDF option:
      output = predictor.predict(wine_test, mode="UDF",content_type='csv')
      output
      The output:
      fixed_acidity	acidity_volatile	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	Output
      7.8	0.61	0.29	1.6	0.114	9.0	29.0	0.9974	3.26	1.56	9.1	{"predictions": [{"score": 0.06059710681438446, "predicted_label": 0}]}
      7.4	0.59	0.08	4.4	0.086	6.0	29.0	0.9974	3.38	0.5	9.0	{"predictions": [{"score": 0.061050355434417725, "predicted_label": 0}]}
      7.9	0.43	0.21	1.6	0.106	10.0	37.0	0.9966	3.17	0.91	9.5	{"predictions": [{"score": 0.07426965236663818, "predicted_label": 0}]}
      7.8	0.645	0.0	2.0	0.0819999999999999	8.0	16.0	0.9964	3.38	0.59	9.8	{"predictions": [{"score": 0.07694132626056671, "predicted_label": 0}]}
      7.8	0.59	0.18	2.3	0.076	17.0	54.0	0.9975	3.43	0.59	10.0	{"predictions": [{"score": 0.056873418390750885, "predicted_label": 0}]}
      8.8	0.4	0.4	2.2	0.079	19.0	52.0	0.998	3.44	0.64	9.2	{"predictions": [{"score": 0.05504496023058891, "predicted_label": 0}]}
      7.8	0.645	0.0	5.5	0.086	5.0	18.0	0.9986	3.4	0.55	9.6	{"predictions": [{"score": 0.062305182218551636, "predicted_label": 0}]}
      8.9	0.62	0.19	3.9	0.17	51.0	148.0	0.9986	3.17	0.93	9.2	{"predictions": [{"score": 0.021687734872102737, "predicted_label": 0}]}
      6.7	0.58	0.08	1.8	0.0969999999999999	15.0	65.0	0.9959	3.28	0.54	9.2	{"predictions": [{"score": 0.05492998659610748, "predicted_label": 0}]}
      11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	{"predictions": [{"score": 0.07007729262113571, "predicted_label": 0}]}
      Prediction with Client option:
      item=wine_test.tail(1)
      output = predictor.predict(item, mode="client",content_type='csv')
      output
      The output:
      [['0']]
  12. Clean up.
    predictor.cloudObj.delete_model()
    predictor.cloudObj.delete_endpoint()
    remove_tdapi_context(tdapi_context)