Preprocess Input Data | BYOM | Teradata Vantage - Preprocessing Input Data

Teradata Vantage™ - Bring Your Own Model User Guide

Deployment

VantageCloud

VantageCore

Edition

Enterprise

IntelliFlex

Lake

VMware

Product

Teradata Vantage

Release Number

5.0

Published

October 2023

Language

English (United States)

Last Update

2024-04-06

Before using your input data to create a model, you can transform the data: with R or Python functions for PMML models, or with H2O transformations for MOJO models.

PMML Models

The functions transform the input data during model training as part of a pipeline. The generated model, stored in XML format, includes the preprocessing steps. During prediction, the same transformations are applied to the input data, and the PMML or MOJO model scores the transformed data.

PMML supports the following input data transformations:

Transformation	Description	R Function	Python Functions
Normalization	Scales continuous or discrete input values to a specified range.	xform_min_max	MinMaxScaler
Discretization	Maps continuous input values to discrete values.	xform_discretize	CutTransformer
Value Mapping	Maps discrete input values to other discrete values.	xform_map	StandardScaler, LabelEncoder
Function Mapping	Maps input values to values derived from applying a function.	xform_function	FunctionTransformer
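
To make the four transformation types concrete, the following is a minimal plain-Python sketch of what each one does to a column of values. The helper names (min_max, discretize, value_map, function_map) are hypothetical and chosen only for illustration; they are not the R or Python functions listed in the table, and the bin edges and labels are made-up sample values.

```python
import bisect
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def discretize(values, edges, labels):
    """Discretization: map each continuous value to the label of its bin.

    edges must be sorted ascending; a value equal to an edge falls
    into the upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return [labels[bisect.bisect_right(edges, v)] for v in values]


def value_map(values, mapping, default=None):
    """Value mapping: replace each discrete value using a lookup table."""
    return [mapping.get(v, default) for v in values]


def function_map(values, fn):
    """Function mapping: apply a function to every value."""
    return [fn(v) for v in values]


# Sample data (assumed for illustration only).
ages = [10, 20, 30]
scaled = min_max(ages)
binned = discretize([5, 15, 25], edges=[10, 20], labels=["low", "mid", "high"])
coded = value_map(["a", "b", "z"], {"a": 1, "b": 2}, default=-1)
logged = function_map([1.0, 10.0], math.log10)
```

In a PMML workflow these operations are not written by hand; the point of the sketch is only to show the input-to-output relationship that each row of the table describes.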

The R functions are in the pmml package (https://cran.r-project.org/web/packages/pmml/index.html). Use the xform_wrap function to wrap your input data before passing it to an R transformation function.

R creates the PMML model using the function pmml::pmml() and inserts the transformations into the XML element LocalTransformations.

Python uses the sklearn and sklearn_pandas libraries to set up the preprocessing pipeline, and uses the DataFrameMapper class in sklearn_pandas to transform the input data. For information about sklearn and sklearn_pandas, see https://scikit-learn.org.
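
As a rough sketch of how preprocessing and the model travel together in one pipeline, the example below uses only scikit-learn and NumPy. It is an assumption-laden illustration: ColumnTransformer stands in for sklearn_pandas DataFrameMapper so that the example has fewer dependencies, the data and column meanings are invented, and the step that exports the fitted pipeline to PMML is not shown because it requires a separate converter.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Invented sample data: column 0 is an age, column 1 is an income.
X = np.array(
    [
        [20.0, 1000.0],
        [35.0, 5000.0],
        [50.0, 20000.0],
        [65.0, 80000.0],
    ]
)
y = np.array([0, 0, 1, 1])

# Per-column preprocessing: normalization on column 0 and
# function mapping (log1p) on column 1.
# ColumnTransformer is used here in place of DataFrameMapper.
preprocess = ColumnTransformer(
    [
        ("scale_age", MinMaxScaler(), [0]),
        ("log_income", FunctionTransformer(np.log1p), [1]),
    ]
)

# The pipeline fits the transformations and the model together, so the
# same preprocessing is applied again at prediction time.
pipeline = Pipeline(
    [
        ("preprocess", preprocess),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X, y)

# Inspect the transformed features and score the raw input rows.
transformed = pipeline.named_steps["preprocess"].transform(X)
predictions = pipeline.predict(X)
```

Because the transformations are steps of the fitted pipeline rather than separate scripts, callers pass raw input rows to predict and never have to repeat the preprocessing themselves.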

For examples of PMML pipelines that preprocess input data, see PMML Models with Custom Transformations.

MOJO Models

H2O Driverless AI (DAI) provides a number of transformations.

The following transformers are available for regression and classification (multiclass and binary) experiments:

Numeric
Categorical
Time and Date
Time Series
NLP (text)
Image

For details on each type of transformation, refer to https://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/transformations.html.

ONNX Models

For classical machine learning models on structured data, Vantage provides a large set of transformation functions in both the Vantage Analytics Library and the SQLE Advanced Analytic functions. You can use these functions to prepare the input data that the classical machine learning models expect.

However, Vantage has no transformation or conversion functions to prepare tensors from unstructured data (text, images, video, and audio) for ONNX models. You must preprocess such data before loading it into Vantage so that the tensors have the shape that the ONNX model expects. As long as the data is in the form your ONNX model expects, ONNXPredict can score it.
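
The kind of preprocessing that has to happen outside Vantage can be sketched with NumPy. This example assumes a hypothetical image model that expects a float32 tensor in NCHW layout (batch, channels, height, width) with per-channel normalization; the function name to_nchw_tensor, the mean and standard deviation values, and the image size are all assumptions, so check the actual input specification of your ONNX model.

```python
import numpy as np


def to_nchw_tensor(image_hwc, mean, std):
    """Convert one RGB image (height, width, 3) with 0-255 values into a
    normalized float32 tensor of shape (1, 3, height, width)."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an array of shape (height, width, 3)")
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    # Scale pixel values to [0, 1], then normalize each channel.
    x = image_hwc.astype(np.float32) / 255.0
    x = (x - mean) / std
    # Reorder the axes from (H, W, C) to (C, H, W) and add a batch axis.
    x = np.transpose(x, (2, 0, 1))
    return np.expand_dims(x, axis=0)


# Invented example: a 4 x 5 all-white image and placeholder statistics.
image = np.full((4, 5, 3), 255, dtype=np.uint8)
tensor = to_nchw_tensor(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
```

The resulting array would then be serialized in whatever form the target table and the ONNX model input require before it is loaded into Vantage.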

skl2onnx converts scikit-learn models and pipelines into ONNX models. For more information, see https://pypi.org/project/skl2onnx/.