This use case shows the steps to use the SageMaker XGBoost estimator with tdapiclient.
You can download the aws-usecases.zip file in the attachment as a reference. The xgboost folder in the zip file includes the Jupyter notebook file (ipynb), Python file (py), and data file (csv) required to run this notebook.
- Import necessary libraries.
import getpass
import os

import numpy as np
import pandas as pd
import sagemaker
from tdapiclient import create_tdapi_context, remove_tdapi_context, TDApiClient
from teradataml import (create_context, DataFrame, copy_to_sql, load_example_data,
                        configure, LabelEncoder, valib, Retain)
from teradatasqlalchemy.types import *
- Create the connection.
host = input("Host: ")
username = input("Username: ")
password = getpass.getpass("Password: ")
td_context = create_context(host=host, username=username, password=password)
- Create TDAPI context and TDApiClient object.
s3_bucket = input("S3 Bucket (provide just the bucket name, for example: test-bucket): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region
tdapi_context = create_tdapi_context("aws", bucket_name=s3_bucket)
td_apiclient = TDApiClient(tdapi_context)
- Set bucket locations.
# Bucket location where your custom code is saved in the tar.gz format.
custom_code_upload_location = "s3://{}/xgboost/code".format(s3_bucket)
# Bucket location where the results of model training are saved.
model_artifacts_location = "s3://{}/xgboost/artifacts".format(s3_bucket)
- Set up data.
- Read the breast cancer dataset.
data = pd.read_csv("cancer_data.csv")
- Drop unnecessary columns.
data = data.drop(['Unnamed: 32'], axis=1)
- Rename columns for creating teradataml DataFrame.
data.rename(columns={'concave points_mean': 'concave_points_mean',
                     'concave points_se': 'concave_points_se',
                     'concave points_worst': 'concave_points_worst'},
            inplace=True)
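The renames above replace the space in each column name with an underscore so the names are valid column identifiers for the teradataml DataFrame. As a minimal sketch of the same idea, a generic helper could handle any such name (`sanitize_column` is hypothetical, not part of the notebook):

```python
# Hypothetical helper (not part of the notebook): make a column name a valid
# identifier by replacing spaces with underscores.
def sanitize_column(name: str) -> str:
    return name.replace(" ", "_")

renamed = {c: sanitize_column(c)
           for c in ["concave points_mean", "concave points_se", "concave points_worst"]}
print(renamed)
```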
- Insert the DataFrame into a table.
data_table = "cancer_data"
column_types = {
    "id": INTEGER, "diagnosis": CHAR(1),
    "radius_mean": FLOAT, "texture_mean": FLOAT, "perimeter_mean": FLOAT,
    "area_mean": FLOAT, "smoothness_mean": FLOAT, "compactness_mean": FLOAT,
    "concavity_mean": FLOAT, "concave_points_mean": FLOAT, "symmetry_mean": FLOAT,
    "fractal_dimension_mean": FLOAT,
    "radius_se": FLOAT, "texture_se": FLOAT, "perimeter_se": FLOAT,
    "area_se": FLOAT, "smoothness_se": FLOAT, "compactness_se": FLOAT,
    "concavity_se": FLOAT, "concave_points_se": FLOAT, "symmetry_se": FLOAT,
    "fractal_dimension_se": FLOAT,
    "radius_worst": FLOAT, "texture_worst": FLOAT, "perimeter_worst": FLOAT,
    "area_worst": FLOAT, "smoothness_worst": FLOAT, "compactness_worst": FLOAT,
    "concavity_worst": FLOAT, "concave_points_worst": FLOAT, "symmetry_worst": FLOAT,
    "fractal_dimension_worst": FLOAT
}
copy_to_sql(df=data, table_name=data_table, if_exists="replace", types=column_types)
- Create a teradataml DataFrame using the table.
df = DataFrame(table_name=data_table)
df
The output: the DataFrame rows, with the id and diagnosis columns followed by the 30 feature columns (radius_mean through fractal_dimension_worst).
- Prepare the dataset.
- Encode the target column using label encoder.
from teradataml import LabelEncoder
rc = LabelEncoder(values=("M", 1), columns=["diagnosis"], default=0)
feature_columns_names = Retain(columns=[
    "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
    "smoothness_mean", "compactness_mean", "concavity_mean", "concave_points_mean",
    "symmetry_mean", "fractal_dimension_mean",
    "radius_se", "texture_se", "perimeter_se", "area_se",
    "smoothness_se", "compactness_se", "concavity_se", "concave_points_se",
    "symmetry_se", "fractal_dimension_se",
    "radius_worst", "texture_worst", "perimeter_worst", "area_worst",
    "smoothness_worst", "compactness_worst", "concavity_worst", "concave_points_worst",
    "symmetry_worst", "fractal_dimension_worst"])
configure.val_install_location = "alice"
data = valib.Transform(data=df, label_encode=rc, index_columns="id",
                       unique_index=True, retain=feature_columns_names)
df = data.result
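The LabelEncoder transform above maps "M" (malignant) to 1 and everything else to the default 0. The notebook runs this in-database via valib.Transform; as an illustrative plain-Python equivalent of the mapping (`encode_diagnosis` is hypothetical):

```python
# Plain-Python equivalent of the LabelEncoder mapping: "M" -> 1, default 0.
# Illustrative only; the notebook performs this in-database.
def encode_diagnosis(value: str) -> int:
    return 1 if value == "M" else 0

labels = [encode_diagnosis(v) for v in ["M", "B", "M"]]
print(labels)  # [1, 0, 1]
```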
- Rearrange the columns so that the target column comes first; the training dataset must not include a header.
df = df.drop("id", axis=1)
df = df.select(["diagnosis",
    "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
    "smoothness_mean", "compactness_mean", "concavity_mean", "concave_points_mean",
    "symmetry_mean", "fractal_dimension_mean",
    "radius_se", "texture_se", "perimeter_se", "area_se",
    "smoothness_se", "compactness_se", "concavity_se", "concave_points_se",
    "symmetry_se", "fractal_dimension_se",
    "radius_worst", "texture_worst", "perimeter_worst", "area_worst",
    "smoothness_worst", "compactness_worst", "concavity_worst", "concave_points_worst",
    "symmetry_worst", "fractal_dimension_worst"])
df
The output: the transformed DataFrame, with the encoded diagnosis column (1 for M, 0 otherwise) followed by the 30 feature columns.
- Create three samples of the input data: sample 1 has 60% of the total rows; samples 2 and 3 each have 20%.
cancer_sample = df.sample(frac=[0.6, 0.2, 0.2])
cancer_sample
The output: the sampled DataFrame, with the same columns plus a sampleid column indicating which sample (1, 2, or 3) each row belongs to.
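The sampling above happens in-database, so the exact row assignment varies between runs. As a local sketch of the same 60/20/20 idea (the function, seed, and slicing strategy are illustrative assumptions, not how `df.sample` is implemented): shuffle the rows, then slice at the cumulative fractions.

```python
import random

# Illustrative 60/20/20 split: shuffle, then slice at cumulative fractions.
# Hypothetical helper; the notebook uses df.sample(frac=[0.6, 0.2, 0.2]) in-database.
def three_way_split(rows, fracs=(0.6, 0.2, 0.2), seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    cut1 = int(n * fracs[0])
    cut2 = cut1 + int(n * fracs[1])
    return rows[:cut1], rows[cut1:cut2], rows[cut2:]

train_rows, validate_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(validate_rows), len(test_rows))  # 60 20 20
```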
- Create the train dataset from sample 1 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
train = cancer_sample[cancer_sample.sampleid == "1"].drop("sampleid", axis = 1)
train
The output: the training rows (sampleid 1), with the encoded diagnosis column followed by the 30 feature columns.
- Create the validation dataset from sample 2 by filtering on "sampleid", and drop the "sampleid" column as it is not required for training the model.
validate = cancer_sample[cancer_sample.sampleid == "2"].drop("sampleid", axis = 1)
validate
The output: the validation rows (sampleid 2), with the encoded diagnosis column followed by the 30 feature columns.
- Create the test dataset from sample 3 by filtering on "sampleid", and drop the "sampleid" column as it is not required for scoring.
test = cancer_sample[cancer_sample.sampleid == "3"].drop("sampleid", axis = 1)
test
The output: the test rows (sampleid 3), with the encoded diagnosis column followed by the 30 feature columns.
- Encode the target column using label encoder.
- Create XGBoost SageMaker estimator instance through tdapiclient.
exec_role_arn = "arn:aws:iam::076782961461:role/service-role/AmazonSageMaker-ExecutionRole-20210112T215668"
xgboost_estimator = td_apiclient.XGBoost(
    entry_point="script.py",
    role=exec_role_arn,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.3-1",
    trainingSparkDataFormat="csv",
    trainingContentType="csv"
)
xgboost_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4,
                                      min_child_weight=6, subsample=0.8,
                                      csv_weights=1, num_round=30)
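The entry_point "script.py" ships in the zip file's xgboost folder. SageMaker forwards the values from set_hyperparameters() to the training script as command-line flags, which the script parses before training. A hedged sketch of that parsing step (the flag names and defaults below are assumptions mirroring the hyperparameters set above; the actual script.py may differ):

```python
import argparse

# Sketch of the argument parsing a SageMaker framework-mode training script
# typically performs. SageMaker passes each hyperparameter as a CLI flag;
# defaults here mirror set_hyperparameters() above (assumed, not the real file).
def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=float, default=4)
    parser.add_argument("--min_child_weight", type=float, default=6)
    parser.add_argument("--subsample", type=float, default=0.8)
    parser.add_argument("--csv_weights", type=int, default=1)
    parser.add_argument("--num_round", type=int, default=30)
    return parser.parse_args(argv)

args = parse_args(["--max_depth", "5", "--num_round", "30"])
print(args.max_depth, args.eta, args.num_round)  # 5 0.2 30
```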
- Start training the XGBoost estimator using teradataml DataFrame objects.
xgboost_estimator.fit({'train': train, 'validation': validate}, content_type="csv", wait=True)
- Create a serializer and a deserializer so the predictor can handle CSV input and output.
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()
predictor = xgboost_estimator.deploy("aws-endpoint",
                                     sagemaker_kw_args={"instance_type": "ml.m5.large",
                                                        "initial_instance_count": 1,
                                                        "serializer": csv_ser,
                                                        "deserializer": csv_dser})
- Try the prediction integration using a teradataml DataFrame and the predictor object created in the previous step.
- Confirm that the predictor is correctly configured to accept CSV input.
print(predictor.cloudObj.accept)
The output:('text/csv',)
- Prepare test dataset.
test = test.drop("diagnosis", axis=1)
item = test.head(1)
The output:radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst 8.219 20.7 53.27 203.9 0.09405 0.1305 0.1321 0.02168 0.2222 0.08261 0.1935 1.962 1.243 10.21 0.01243 0.05416 0.07753 0.01022 0.02309 0.01178 9.092 29.72 58.08 249.8 0.163 0.431 0.5381 0.07879 0.3322 0.1486
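Before the request is sent to the endpoint, the CSVSerializer turns the row into a comma-separated text/csv payload (no header, no target column). A rough plain-Python equivalent of that step (`to_csv_payload` is illustrative; the real serializer handles more cases):

```python
# Illustrative equivalent of CSV serialization for a single feature row:
# join the values into the comma-separated body a text/csv endpoint expects.
def to_csv_payload(row):
    return ",".join(str(v) for v in row)

payload = to_csv_payload([8.219, 20.7, 53.27])
print(payload)  # 8.219,20.7,53.27
```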
- Try prediction with the UDF and Client options.
Prediction with the UDF option:
output = predictor.predict(item, mode="UDF", content_type='csv')
output
The output:radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst Output 8.219 20.7 53.27 203.9 0.09405 0.1305 0.1321 0.02168 0.2222 0.08261 0.1935 1.962 1.243 10.21 0.01243 0.05416 0.07753 0.01022 0.02309 0.01178 9.092 29.72 58.08 249.8 0.163 0.431 0.5381 0.07879 0.3322 0.1486 0.06437
Prediction with the Client option:
output = predictor.predict(item, mode="client", content_type='csv')
output
The output:[['0.03782']]
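The endpoint returns a raw score (here 0.03782) rather than a class label. To map scores back to the encoded diagnosis (1 = malignant, 0 otherwise), you can apply a threshold; the 0.5 cutoff below is a conventional choice for illustration, not something the notebook prescribes:

```python
# Convert the endpoint's string score into the 0/1 diagnosis encoding.
# The 0.5 threshold is an assumed, conventional cutoff.
def to_label(score, threshold=0.5):
    return 1 if float(score) >= threshold else 0

print(to_label("0.03782"))  # 0
```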
- Clean up.
predictor.cloudObj.delete_model()
predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)