Preprocessing Input Data

Teradata Vantage™ - Bring Your Own Model User Guide

Product: Teradata Vantage
Release Number: 3.0
Published: May 2022
Last Update: 2022-06-02
Content Type: User Guide
Publication ID: B700-1111-051K
Language: English (United States)

Before you use your input data to create a model, you can transform it with R or Python functions for PMML models, or with H2O transformations for MOJO models.

PMML Models

The functions transform the input data during model training as part of a pipeline. The generated model, stored in XML format, includes the preprocessing steps. During model prediction, the transformations are applied to the input data and the transformed data is scored by the PMML or MOJO model.

PMML supports the following input data transformations:

Transformation | Description | R Function | Python Function
Normalization | Scales continuous or discrete input values to a specified range. | xform_min_max | MinMaxScaler
Discretization | Maps continuous input values to discrete values. | xform_discretize | CutTransformer
Value Mapping | Maps discrete input values to other discrete values. | xform_map | StandardScaler, LabelEncoder
Function Mapping | Maps input values to values derived from applying a function. | xform_function | FunctionTransformer
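
To make the four transformation types concrete, the following is a minimal sketch in plain Python of what each one does to a column of values. It does not use the R or scikit-learn functions listed above and produces no PMML; the helper names (min_max_scale, discretize, map_values, apply_function) and the sample data are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
import math


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical, so map each one to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def discretize(values, edges, labels):
    """Discretization: map each continuous value to the label of its bin.

    edges holds the interior cut points in ascending order, so
    len(labels) must equal len(edges) + 1. A value equal to a cut
    point falls into the bin above it.
    """
    return [labels[bisect_right(edges, v)] for v in values]


def map_values(values, mapping, default=None):
    """Value mapping: replace each discrete value with its mapped value."""
    return [mapping.get(v, default) for v in values]


def apply_function(values, func):
    """Function mapping: derive each output by applying func to the input."""
    return [func(v) for v in values]


# Hypothetical sample columns.
ages = [18, 30, 45, 60]
scaled = min_max_scale(ages)
age_bands = discretize(ages, edges=[25, 50], labels=["young", "middle", "senior"])
codes = map_values(["red", "green", "red"], {"red": 0, "green": 1})
logged = apply_function([1.0, 10.0, 100.0], math.log10)
```

In a real workflow these steps would be expressed with the R or Python functions in the table so that they can be exported with the model.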

The R functions are in the pmml package (https://cran.r-project.org/web/packages/pmml/index.html). Use the xform_wrap function to wrap your input data before passing it to an R transformation function.

R creates the PMML model using the function pmml::pmml() and inserts the transformations into the XML element LocalTransformations.

Python uses the libraries sklearn and sklearn_pandas to set up the pipeline for preprocessing transformations, and uses the DataFrameMapper class in the library sklearn_pandas to transform input data. For information about sklearn, see https://scikit-learn.org.
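
The idea of a pipeline that carries its preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the guide's own example: it uses scikit-learn's ColumnTransformer in place of the sklearn_pandas DataFrameMapper to keep dependencies small, the data and the choice of LogisticRegression are hypothetical, and the conversion of the fitted pipeline to PMML is not shown.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Hypothetical training data: column 0 is age, column 1 is income.
X_train = np.array([
    [18, 20000.0],
    [30, 45000.0],
    [45, 80000.0],
    [60, 120000.0],
    [25, 30000.0],
    [52, 95000.0],
])
y_train = np.array([0, 0, 1, 1, 0, 1])

# Per-column preprocessing: normalization for age,
# function mapping (log10) for income.
preprocess = ColumnTransformer([
    ("scale_age", MinMaxScaler(), [0]),
    ("log_income", FunctionTransformer(np.log10), [1]),
])

# The pipeline learns the preprocessing parameters and the model together.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# At prediction time the same fitted transformations are applied
# to the new rows before the classifier scores them.
X_new = np.array([[22, 25000.0], [58, 110000.0]])
predictions = pipeline.predict(X_new)
```

Because the scaler's minimum and maximum are learned during fit and stored in the pipeline, new rows are scaled with the training-time range rather than their own.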

For examples of PMML pipelines that preprocess input data, see PMML Models with Custom Transformations.

MOJO Models

H2O Driverless AI (DAI) provides a number of transformations.

The following transformers are available for regression and classification (multiclass and binary) experiments:
  • Numeric
  • Categorical
  • Time and Date
  • Time Series
  • NLP (text)
  • Image

For details on each type of transformation, refer to https://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/transformations.html.

ONNX Models

For classical machine learning models on structured data, Vantage has a large set of transformation functions in both the Vantage Analytics Library and the SQLE Advanced Analytic functions. You can use these functions to prepare the input data that the classical machine learning models expect. However, Vantage has no transformation or conversion functions that prepare tensors from unstructured data (text, images, video, and audio) for ONNX models. You must preprocess such data before loading it into Vantage so that the tensors have the shape that the ONNX model expects. As long as the data is in the form expected by your ONNX model, it can be scored by ONNXPredict.
Skl2onnx converts any Scikit-learn model or pipeline into an ONNX model or pipeline. For more information, see https://pypi.org/project/skl2onnx/.
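
For unstructured inputs, the reshaping has to happen outside Vantage before the data is loaded. The following is a minimal sketch of that kind of preprocessing for a single image, using NumPy. The helper name image_to_nchw, the 1 x 3 x 224 x 224 layout, and the mean and standard deviation values are assumptions for illustration; the actual shape, data type, and normalization are defined by your ONNX model, and loading the result into Vantage is not shown.

```python
import numpy as np


def image_to_nchw(image_hwc, mean=0.5, std=0.5):
    """Convert one H x W x C uint8 image to a 1 x C x H x W float32 tensor."""
    if image_hwc.ndim != 3:
        raise ValueError("expected an array of shape (height, width, channels)")
    # Scale pixel values from [0, 255] to [0, 1].
    scaled = image_hwc.astype(np.float32) / 255.0
    # Normalize with the (assumed) mean and standard deviation.
    normalized = (scaled - mean) / std
    # Reorder axes from height, width, channels to channels, height, width.
    chw = np.transpose(normalized, (2, 0, 1))
    # Add a leading batch dimension of size 1.
    return np.expand_dims(chw, axis=0)


# Hypothetical all-black 224 x 224 RGB image standing in for real data.
image = np.zeros((224, 224, 3), dtype=np.uint8)
tensor = image_to_nchw(image)
```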