Using Open Analytics to Score Using Externally Trained Models Using Apply - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

This example uses Open Analytics and Apply to score data with an externally trained model: a scikit-learn Random Forest classifier exported to ONNX format.

This example works only on VantageCloud Lake.
  1. Set up the environment.
    1. Import required libraries.
      from teradataml import create_context, remove_context, list_base_envs, list_user_envs, create_env, remove_env, get_env, DataFrame, copy_to_sql, Apply, configure, read_csv, set_config_params
      from teradataml.options.display import display
      import pandas as pd, getpass, os
      from collections import OrderedDict
      from teradatasqlalchemy.types import BIGINT, VARCHAR, INTEGER, FLOAT
    2. Set the authentication token and the UES URL.
      set_config_params(ues_url=getpass.getpass("UES URL: "),
                        auth_token=getpass.getpass("JWT Token: "))
    3. Create the connection.
      con = create_context(host=getpass.getpass("Hostname: "),
                           username=getpass.getpass("Username: "),
                           password=getpass.getpass("Password: "))
      You can use the same JWT token instead of a password to create the context. See create_context for more details.
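
For unattended runs, the interactive prompts in the steps above could be replaced by values read from environment variables. The following is a minimal sketch under that assumption. The helper read_setting and the variable name VANTAGE_HOST are hypothetical and are not part of teradataml.

```python
import getpass
import os


def read_setting(env_name, prompt):
    """Return the value of environment variable env_name, or prompt for it.

    Hypothetical helper for illustration; it is not part of teradataml.
    """
    value = os.environ.get(env_name)
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return getpass.getpass(prompt)


# Example call (VANTAGE_HOST is an assumed variable name):
#   host = read_setting("VANTAGE_HOST", "Hostname: ")
```

The returned value can then be passed to set_config_params or create_context in place of the direct getpass.getpass call.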
  2. Generate model.
    1. Import required libraries.
      from sklearn.datasets import load_iris
      from sklearn.model_selection import train_test_split
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.linear_model import LogisticRegression
    2. Read the data from the scikit-learn package.
      iris = load_iris()
      X, y = iris.data, iris.target
    3. Train a model with Random Forests.
      X_train, X_test, y_train, y_test = train_test_split(X, y)
      clr = RandomForestClassifier()
      clr.fit(X_train, y_train)
    4. Convert the model to ONNX format and save it as the ONNX model file "rf_iris.onnx".
      from skl2onnx import convert_sklearn
      from skl2onnx.common.data_types import FloatTensorType
      initial_type = [('float_input', FloatTensorType([None, 4]))]
      onx = convert_sklearn(clr, initial_types = initial_type)
      with open("rf_iris.onnx", "wb") as f:
          f.write(onx.SerializeToString())
      print("RF model trained and saved in 'rf_iris.onnx'.")
  3. Load the test data into VantageCloud Lake and create a teradataml DataFrame for the input table.
    dfIn = pd.DataFrame(X_test, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
    copy_to_sql(dfIn, table_name = 'onnx_test_table_dataset', if_exists = 'replace')
    onnx_test_data = DataFrame.from_table("onnx_test_table_dataset")
    onnx_test_data.head(n=5)
  4. Create a Python file to score the data with the model.
    Create a file named 'sklearn_onnx_scoring.py' on the local client with the following code.
    # Score rows read from stdin with the pre-trained ONNX model.
    import csv
    import sys

    import numpy
    import onnxruntime as rt
    import pandas as pd

    # Read input data from stdin into a dataframe.
    _reader = csv.DictReader(sys.stdin.readlines(), fieldnames = ["sepal_length","sepal_width","petal_length","petal_width"])
    data = pd.DataFrame(_reader, columns = ["sepal_length","sepal_width","petal_length","petal_width"])

    # For AMPs that receive no data, exit the script instance gracefully.
    if data.empty:
        sys.exit()

    # Load the ONNX model file installed in the user environment.
    # The model is already trained, so the script does not train it again.
    sess = rt.InferenceSession("rf_iris.onnx")
    input_name = sess.get_inputs()[0].name
    label_name = sess.get_outputs()[0].name

    # Compute the predictions with ONNX Runtime.
    pred_onx = sess.run([label_name], {input_name: data.values.astype(numpy.float32)})[0]

    # Print all predictions of this script instance as one space-separated row.
    listToStr = ' '.join([str(elem) for elem in pred_onx])
    print(listToStr)
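
The input handling of the scoring script above can be tried locally before the file is installed. The following is a minimal sketch, assuming the rows arrive comma-delimited, which is what csv.DictReader expects by default. The function name parse_rows and the two sample rows are made up for illustration, and an in-memory stream stands in for stdin.

```python
import csv
import io

COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def parse_rows(stream):
    """Parse comma-delimited rows from a text stream into lists of floats."""
    reader = csv.DictReader(stream, fieldnames=COLUMNS)
    return [[float(row[name]) for name in COLUMNS] for row in reader]


# Two made-up rows standing in for what the script would read from stdin.
sample = io.StringIO("5.1,3.5,1.4,0.2\n6.2,2.9,4.3,1.3\n")
rows = parse_rows(sample)
print(rows)
```

An empty stream yields an empty list, which corresponds to the case where the script exits early because it received no data.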
  5. Create an environment and install the required files in it.
    1. List the base Python environments.
      list_base_envs()
      Assume a new Python environment is needed.
    2. Create a new Python user environment for Python 3.8.13.
      The create_env() function returns an object of class 'UserEnv'.
      demo_env = create_env(env_name = 'oaf_usecase_2c_env',
                            base_env = 'python_3.8.13',
                            desc = 'OAF Demo Use Case 2c Environment')
    3. Verify the new environment has been created.
      list_user_envs()
    4. Install the Python add-ons that the script needs, synchronously, in the user environment by using the 'demo_env' object of class "UserEnv".
      demo_env.install_lib(["skl2onnx", "scikit-learn", "onnxruntime", "pandas"])
    5. Verify the Python libraries have been installed correctly.
      demo_env.libs
    6. Install the model file and Python file to score the data inside VantageCloud Lake.
      demo_env.install_file(file_path = 'rf_iris.onnx', replace = True)
      demo_env.install_file(file_path = 'sklearn_onnx_scoring.py', replace = True)
    7. Verify the files have been installed correctly.
      demo_env.files
  6. Score the data inside VantageCloud Lake.
    1. Use Apply to create an object for the Random Forest based prediction.
      applyRF_obj = Apply(data = onnx_test_data,
                          apply_command = 'python3 sklearn_onnx_scoring.py',
                          returns = {"Predicted_Class_RF": VARCHAR(200)},
                          env_name = demo_env
                         )
    2. Run the Python script inside the remote user environment.
      applyRF_obj.execute_script()
      You can display the underlying SQL by setting 'display.print_sqlmr_query = True'.
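
Because the scoring script prints all predictions of one script instance as a single space-separated string, each value in the Predicted_Class_RF column holds several class indices. The following is a minimal sketch for decoding such a string on the client, assuming the indices follow the class order of the scikit-learn iris dataset. The function name decode_predictions and the sample string are made up for illustration.

```python
# Class names in the order used by scikit-learn's iris dataset.
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]


def decode_predictions(text):
    """Split a space-separated prediction string into class names."""
    return [IRIS_CLASSES[int(token)] for token in text.split()]


# A made-up value of the kind one output row might hold.
decoded = decode_predictions("0 2 1 1")
print(decoded)
```

Note that the output column is declared as VARCHAR(200), so a script instance that receives many rows could produce a string longer than the column allows.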
  7. Remove the environment and disconnect from VantageCloud Lake.
    1. After scoring the data, remove the environment.
      remove_env('oaf_usecase_2c_env')
    2. Verify the specified environment has been removed.
      list_user_envs()
    3. Disconnect from VantageCloud Lake.
      remove_context()
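
If scoring fails partway through, the cleanup in the last step is skipped unless it is guarded. The following is a minimal sketch of one way to guarantee cleanup, using contextlib.ExitStack from the Python standard library. The function name run_with_cleanup is hypothetical, and the lambdas are stand-ins for the real calls (for example execute_script, remove_env, and remove_context).

```python
from contextlib import ExitStack


def run_with_cleanup(work, cleanups):
    """Run work(), then run every cleanup callable even if work() raises.

    Cleanups run in the order given.
    """
    with ExitStack() as stack:
        # ExitStack calls callbacks in reverse registration order,
        # so register them reversed to run them in the order given.
        for cleanup in reversed(cleanups):
            stack.callback(cleanup)
        return work()


# Stand-ins that only record the order of the calls.
log = []
result = run_with_cleanup(
    lambda: log.append("score") or "done",
    [lambda: log.append("remove_env"), lambda: log.append("remove_context")],
)
print(log)
```

In a real session, the work callable would wrap the scoring call and the cleanup list would hold the environment removal followed by the disconnect.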