This use case shows the steps to use SageMaker Linear Learner with tdapiclient.
You can download the aws-usecases.zip file in the attachment as a reference. The linearlearner folder in the zip file includes a Jupyter notebook file (ipynb) and a data file (csv) containing the dataset required to run this use case.
- Import necessary libraries.
import getpass
import os
import sagemaker
from tdapiclient import create_tdapi_context, remove_tdapi_context, TDApiClient
from teradataml import create_context, DataFrame, copy_to_sql, load_example_data
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (Please provide just the bucket name): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
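The step above passes the AWS credentials through environment variables. A minimal sketch of a check that can run before creating the TDAPI context is shown here; `missing_aws_vars` is a hypothetical helper and not part of tdapiclient, and the example values are placeholders.

```python
import os

# Names of the environment variables set in the step above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


# Example with a stand-in mapping instead of the real environment.
example_env = {"AWS_ACCESS_KEY_ID": "placeholder-id", "AWS_REGION": "us-west-2"}
print(missing_aws_vars(example_env))  # ['AWS_SECRET_ACCESS_KEY']
```

Calling `missing_aws_vars()` with no argument inspects the real process environment.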
- Set bucket locations.
# Bucket location where your custom code will be saved in the tar.gz format.
custom_code_upload_location = "s3://{}/LinearLearner/code".format(s3_bucket)
# Bucket location where results of model training are saved.
model_artifacts_location = "s3://{}/LinearLearner/artifacts".format(s3_bucket)
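Because the prompt asks for just the bucket name, a value that already starts with `s3://` would produce a malformed location. The following sketch builds the two locations with a guard against that; `s3_location` is a hypothetical helper and the bucket name is a placeholder.

```python
def s3_location(bucket, *parts):
    """Join a bare bucket name and key parts into an s3:// URI (hypothetical helper)."""
    if bucket.startswith("s3://"):
        raise ValueError("Provide just the bucket name, without the s3:// prefix")
    key = "/".join(p.strip("/") for p in parts if p)
    return "s3://{}/{}".format(bucket, key) if key else "s3://{}".format(bucket)


bucket = "my-example-bucket"  # placeholder bucket name
custom_code_upload_location = s3_location(bucket, "LinearLearner", "code")
model_artifacts_location = s3_location(bucket, "LinearLearner", "artifacts")
print(custom_code_upload_location)
print(model_artifacts_location)
```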
- Set up data to be used for this workflow.
- Read the wine quality dataset.
df = pd.read_csv("winequality-red.csv")
df.head()
The output:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
0  7.4            0.70              0.00         1.9             0.076      11.0                 34.0                  0.9978   3.51  0.56       9.4      5
1  7.8            0.88              0.00         2.6             0.098      25.0                 67.0                  0.9968   3.20  0.68       9.8      5
2  7.8            0.76              0.04         2.3             0.092      15.0                 54.0                  0.9970   3.26  0.65       9.8      5
3  11.2           0.28              0.56         1.9             0.075      17.0                 60.0                  0.9980   3.16  0.58       9.8      6
4  7.4            0.70              0.00         1.9             0.076      11.0                 34.0                  0.9978   3.51  0.56       9.4      5
- Rename columns for creating teradataml DataFrame.
df.rename(columns={'fixed acidity': 'fixed_acidity',
                   'citric acid': 'citric_acid',
                   'residual sugar': 'residual_sugar',
                   'volatile acidity': 'acidity_volatile',
                   'free sulfur dioxide': 'free_sulfur_dioxide',
                   'total sulfur dioxide': 'total_sulfur_dioxide'},
          inplace=True)
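The rename mapping above can also be generated from the column names instead of being typed by hand. This is a sketch under the assumption that the only goal is to remove spaces; `overrides` is a hypothetical name that keeps the one column this use case names differently (`acidity_volatile`).

```python
# Column names as they appear in winequality-red.csv.
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

# This use case renames 'volatile acidity' to 'acidity_volatile', so keep that choice.
overrides = {"volatile acidity": "acidity_volatile"}

# Replace spaces with underscores for every column that contains one.
rename_map = {
    name: overrides.get(name, name.replace(" ", "_"))
    for name in columns
    if " " in name
}
print(rename_map)
```

The resulting dictionary can be passed to the rename call, for example `df.rename(columns=rename_map, inplace=True)`.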
- Insert the dataframe in the table.
data_table = "wine_quality_data"
column_types = {'fixed_acidity': FLOAT,
                'acidity_volatile': FLOAT,
                'citric_acid': FLOAT,
                'residual_sugar': FLOAT,
                'chlorides': FLOAT,
                'free_sulfur_dioxide': FLOAT,
                'total_sulfur_dioxide': FLOAT,
                'density': FLOAT,
                'pH': FLOAT,
                'sulphates': FLOAT,
                'alcohol': FLOAT,
                'quality': INTEGER}
copy_to_sql(df=df, table_name=data_table, if_exists="replace", types=column_types)
- Create a teradataml DataFrame using the table.
data = DataFrame(table_name=data_table)
- Data cleaning.
- Categorize the target variable 'quality' as 'good' or 'bad'. If the quality value is greater than 6.5, the wine is considered 'good' (1); if the quality value is less than or equal to 6.5, it is considered 'bad' (0).
from teradataml.dataframe.sql_functions import case
quality = data['quality']
data_updated = data.assign(grade = case([(quality > 6.5, 1), (quality <= 6.5, 0)]))
- Drop unnecessary columns.
data_updated = data_updated.drop("quality", axis=1)
data_updated
The output:
fixed_acidity  acidity_volatile  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density  pH    sulphates  alcohol  grade
7.8            0.76              0.04         2.3             0.092      15.0                 54.0                  0.997    3.26  0.65       9.8      0
7.4            0.7               0.0          1.9             0.076      11.0                 34.0                  0.9978   3.51  0.56       9.4      0
7.4            0.66              0.0          1.8             0.075      13.0                 40.0                  0.9978   3.51  0.56       9.4      0
7.9            0.6               0.06         1.6             0.069      15.0                 59.0                  0.9964   3.3   0.46       9.4      0
7.8            0.58              0.02         2.0             0.073      9.0                  18.0                  0.9968   3.36  0.57       9.5      1
7.5            0.5               0.36         6.1             0.071      17.0                 102.0                 0.9978   3.35  0.8        10.5     0
7.3            0.65              0.0          1.2             0.065      15.0                 21.0                  0.9946   3.39  0.47       10.0     1
11.2           0.28              0.56         1.9             0.075      17.0                 60.0                  0.998    3.16  0.58       9.8      0
7.8            0.88              0.0          2.6             0.098      25.0                 67.0                  0.9968   3.2   0.68       9.8      0
7.4            0.7               0.0          1.9             0.076      11.0                 34.0                  0.9978   3.51  0.56       9.4      0
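The rule behind the grade column can be checked in plain Python before it is applied in the database. This is only an illustration of the threshold described above; `grade` as a function name and the sample scores are assumptions made for the example.

```python
def grade(quality, threshold=6.5):
    """Return 1 ('good') when quality is above the threshold, else 0 ('bad')."""
    return 1 if quality > threshold else 0


# Example integer quality scores.
scores = [3, 4, 5, 6, 7, 8]
print([grade(q) for q in scores])  # [0, 0, 0, 0, 1, 1]
```

Because the quality column holds integers, a threshold of 6.5 separates scores of 7 and above from scores of 6 and below.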
- Create two samples of input data: sample 1 has 80% of total rows and sample 2 has 20% of total rows.
wine_sample = data_updated.sample(frac=[0.8, 0.2])
- Create train dataset from sample 1 by filtering on "sampleid" and drop "sampleid" column as it is not required for training model.
wine_train = wine_sample[wine_sample.sampleid == "1"].drop("sampleid", axis = 1)
wine_train
The output:
fixed_acidity  acidity_volatile  citric_acid  residual_sugar  chlorides           free_sulfur_dioxide  total_sulfur_dioxide  density  pH    sulphates  alcohol  grade
7.4            0.7               0.0          1.9             0.076               11.0                 34.0                  0.9978   3.51  0.56       9.4      0
7.9            0.6               0.06         1.6             0.069               15.0                 59.0                  0.9964   3.3   0.46       9.4      0
7.3            0.65              0.0          1.2             0.065               15.0                 21.0                  0.9946   3.39  0.47       10.0     1
7.8            0.58              0.02         2.0             0.073               9.0                  18.0                  0.9968   3.36  0.57       9.5      1
5.6            0.615             0.0          1.6             0.089               16.0                 59.0                  0.9943   3.58  0.52       9.9      0
7.8            0.61              0.29         1.6             0.114               9.0                  29.0                  0.9974   3.26  1.56       9.1      0
6.7            0.58              0.08         1.8             0.0969999999999999  15.0                 65.0                  0.9959   3.28  0.54       9.2      0
7.4            0.66              0.0          1.8             0.075               13.0                 40.0                  0.9978   3.51  0.56       9.4      0
11.2           0.28              0.56         1.9             0.075               17.0                 60.0                  0.998    3.16  0.58       9.8      0
7.8            0.76              0.04         2.3             0.092               15.0                 54.0                  0.997    3.26  0.65       9.8      0
- Create test dataset from sample 2 by filtering on "sampleid" and drop "sampleid" column as it is not required for scoring.
wine_test = wine_sample[wine_sample.sampleid == "2"].drop("sampleid", axis = 1)
wine_test
The output:
fixed_acidity  acidity_volatile  citric_acid  residual_sugar  chlorides           free_sulfur_dioxide  total_sulfur_dioxide  density  pH    sulphates  alcohol  grade
7.4            0.66              0.0          1.8             0.075               13.0                 40.0                  0.9978   3.51  0.56       9.4      0
7.5            0.5               0.36         6.1             0.071               17.0                 102.0                 0.9978   3.35  0.8        10.5     0
7.5            0.5               0.36         6.1             0.071               17.0                 102.0                 0.9978   3.35  0.8        10.5     0
8.9            0.62              0.19         3.9             0.17                51.0                 148.0                 0.9986   3.17  0.93       9.2      0
6.3            0.39              0.16         1.4             0.08                11.0                 23.0                  0.9955   3.34  0.56       9.3      0
7.8            0.645             0.0          2.0             0.0819999999999999  8.0                  16.0                  0.9964   3.38  0.59       9.8      0
8.1            0.56              0.28         1.7             0.368               16.0                 56.0                  0.9968   3.11  1.28       9.3      0
7.9            0.6               0.06         1.6             0.069               15.0                 59.0                  0.9964   3.3   0.46       9.4      0
7.4            0.7               0.0          1.9             0.076               11.0                 34.0                  0.9978   3.51  0.56       9.4      0
11.2           0.28              0.56         1.9             0.075               17.0                 60.0                  0.998    3.16  0.58       9.8      0
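The idea behind the 80/20 split can be illustrated locally with the standard library. This is a plain-Python sketch of a random partition, not a description of how the teradataml sample function works internally; `split_rows` and the seed value are hypothetical.

```python
import random


def split_rows(rows, frac=0.8, seed=42):
    """Shuffle a copy of rows and cut it into two parts (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten table rows
train, test = split_rows(rows)
print(len(train), len(test))  # 8 2
```

Every row lands in exactly one of the two parts, which mirrors filtering on sampleid 1 and sampleid 2 in the steps above.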
- Convert teradataml DataFrame to NumPy ndarray.
# Features: every column except the target 'grade'.
train_data = wine_train.drop('grade', axis=1).get_values()
# Labels: 'grade' is the last column (index 11) of the full array.
label_train = wine_train.get_values()
label_train = label_train[:, 11]
label_train = label_train.astype('float32')
train_data = train_data.astype('float32')
- Create Linear Learner SageMaker estimator instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
linear_learner = td_apiclient.LinearLearner(
    role=exec_role_arn,
    instance_count=1,
    instance_type="ml.m5.large",
    predictor_type='binary_classifier',
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    epochs=5
)
- Convert NumPy ndarray to RecordSet object which will be passed to fit method.
training_data_recordset = linear_learner.record_set(train=train_data, labels=label_train)
- Start training using RecordSet objects.
linear_learner.fit(training_data_recordset)
- Create Serializer and Deserializer, so predictor can handle CSV input and output.
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer
csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
predictor = linear_learner.deploy(
    "aws-endpoint",
    sagemaker_kw_args={
        "instance_type": "ml.m5.large",
        "initial_instance_count": 1,
        "serializer": csv_ser,
        "deserializer": csv_dser,
    },
)
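To make the CSV request format concrete, the sketch below turns one feature row into a `text/csv` payload using only the standard library. It illustrates the shape of the data and is not the implementation of the SageMaker CSVSerializer; `to_csv_payload` is a hypothetical helper.

```python
import csv
import io


def to_csv_payload(rows):
    """Render rows of numbers as comma-separated lines (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue().rstrip("\n")


# One wine sample with the 11 feature columns and no target column.
rows = [[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]
payload = to_csv_payload(rows)
print(payload)  # 7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
```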
- Try prediction integration using the predictor object created in previous step.
- Confirm that the predictor is configured for CSV. Its accept type should be text/csv.
print(predictor.cloudObj.accept)
The output:
('text/csv',)
- Prepare test dataset by dropping target variable 'grade'.
wine_test = wine_test.drop("grade", axis=1)
- Try prediction with UDF and Client options.
Prediction with UDF option:
output = predictor.predict(wine_test, mode="UDF", content_type='csv')
output
The output:
fixed_acidity  acidity_volatile  citric_acid  residual_sugar  chlorides           free_sulfur_dioxide  total_sulfur_dioxide  density  pH    sulphates  alcohol  Output
7.8            0.61              0.29         1.6             0.114               9.0                  29.0                  0.9974   3.26  1.56       9.1      {"predictions": [{"score": 0.06059710681438446, "predicted_label": 0}]}
7.4            0.59              0.08         4.4             0.086               6.0                  29.0                  0.9974   3.38  0.5        9.0      {"predictions": [{"score": 0.061050355434417725, "predicted_label": 0}]}
7.9            0.43              0.21         1.6             0.106               10.0                 37.0                  0.9966   3.17  0.91       9.5      {"predictions": [{"score": 0.07426965236663818, "predicted_label": 0}]}
7.8            0.645             0.0          2.0             0.0819999999999999  8.0                  16.0                  0.9964   3.38  0.59       9.8      {"predictions": [{"score": 0.07694132626056671, "predicted_label": 0}]}
7.8            0.59              0.18         2.3             0.076               17.0                 54.0                  0.9975   3.43  0.59       10.0     {"predictions": [{"score": 0.056873418390750885, "predicted_label": 0}]}
8.8            0.4               0.4          2.2             0.079               19.0                 52.0                  0.998    3.44  0.64       9.2      {"predictions": [{"score": 0.05504496023058891, "predicted_label": 0}]}
7.8            0.645             0.0          5.5             0.086               5.0                  18.0                  0.9986   3.4   0.55       9.6      {"predictions": [{"score": 0.062305182218551636, "predicted_label": 0}]}
8.9            0.62              0.19         3.9             0.17                51.0                 148.0                 0.9986   3.17  0.93       9.2      {"predictions": [{"score": 0.021687734872102737, "predicted_label": 0}]}
6.7            0.58              0.08         1.8             0.0969999999999999  15.0                 65.0                  0.9959   3.28  0.54       9.2      {"predictions": [{"score": 0.05492998659610748, "predicted_label": 0}]}
11.2           0.28              0.56         1.9             0.075               17.0                 60.0                  0.998    3.16  0.58       9.8      {"predictions": [{"score": 0.07007729262113571, "predicted_label": 0}]}
Prediction with Client option:
item = wine_test.tail(1)
output = predictor.predict(item, mode="client", content_type='csv')
output
The output:
[['0']]
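In UDF mode, each value in the Output column is a JSON string. The sketch below extracts the score and predicted label from one such string, copied from the UDF output above; `parse_prediction` is a hypothetical helper, and it assumes every cell has the same structure as the rows shown.

```python
import json


def parse_prediction(output_cell):
    """Return (score, predicted_label) from one Output cell (hypothetical helper)."""
    payload = json.loads(output_cell)
    first = payload["predictions"][0]  # each cell shown above holds one prediction
    return first["score"], first["predicted_label"]


cell = '{"predictions": [{"score": 0.06059710681438446, "predicted_label": 0}]}'
score, label = parse_prediction(cell)
print(score, label)
```

A score close to 0 with a predicted label of 0 corresponds to the 'bad' class defined in the data cleaning step.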
- Clean up.
predictor.cloudObj.delete_model()
predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)
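If one cleanup call fails, the calls after it never run and resources may be left behind. One possible pattern, sketched here with stand-in callables rather than the real objects, is to attempt every step and report the failures afterwards; `run_cleanup` is a hypothetical helper.

```python
def run_cleanup(steps):
    """Run every (name, callable) pair and return failures as (name, exception) pairs."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources are still released
            failures.append((name, exc))
    return failures


# Stand-in callables; in the notebook these would be the three cleanup calls above.
released = []


def failing_step():
    raise RuntimeError("simulated failure")


failures = run_cleanup([
    ("delete_model", lambda: released.append("model")),
    ("delete_endpoint", failing_step),
    ("remove_context", lambda: released.append("context")),
])
print(released)
print([name for name, _ in failures])
```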