Teradata Package for Python Function Reference | 17.10 - Transform - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.

Teradata® Package for Python Function Reference

Product

Teradata Package for Python

Release Number

17.10

Published

April 2022

Language

English (United States)

Last Update

2022-08-19

lifecycle

Product Category

Teradata Vantage

Transform

Functions
		Transform(data, bins=None, derive=None, one_hot_encode=None, fillna=None, label_encode=None, rescale=None, retain=None, sigmoid=None, zscore=None, fallback=False, index_columns=None, unique_index=False, key_columns=None, allow_duplicates=None, nopi=None, filter=None, gen_sql_only=False) DESCRIPTION: The Variable Transformation analysis reads a teradataml DataFrame and produces an output containing transformed columns. This is useful when preparing data for input to an analytic algorithm. For example, a K-Means Clustering algorithm typically produces better results when the input columns are first converted to their Z-Score values to put all input variables on an equal footing, regardless of their magnitude. Function supports following transformations: * Binning - Binning replaces a continuous numeric column with a categorical one to produce ordinal values (for example, numeric categorical values where order is meaningful). * Derive - The Derive transformation requires the free-form transformation be specified as a formula. * One Hot Encoding - One Hot Encoding is useful when a categorical data element must be re-expressed as one or more numeric data elements, creating a binary numeric field for each categorical data value. * Missing Value Treatment or Null Replacement. * Label Encoding - Allows to re-express existing values of a categorical data column (variable) into a new coding scheme or to correct data quality problems and focus an analysis on a value. * Min-Max Scaling - Limits the upper and lower boundaries of the data in a continuous numeric column using a linear rescaling function based on maximum and minimum data values. * Retain - Allows copying of one or more columns into the final analytic data set. * Sigmoid - Provides rescaling of continuous numeric data using a type of sigmoid or s-shaped function. * ZScore - Provides rescaling of continuous numeric data using Z-Scores. PARAMETERS: data: Required Argument. Specifies the input data to perform variable transformations. Types: teradataml DataFrame bins: Optional Argument. Specifies one or more instances of Binning Transformation. Binning replaces a continuous numeric column with a categorical one to produce ordinal values (for example, numeric categorical values where order is meaningful). Check the documentation of 'teradataml.analytics.Transformation.Binning' to know more about Binning. Types: Binning OR List of Binning derive: Optional Argument. Specifies one or more instances of Derive Transformation. This argument allows user to perform a free form transformation using arithmetic formula. Check the documentation of 'teradataml.analytics.Transformation.Derive' to know more about Derive. Types: Derive OR List of Derive one_hot_encode: Optional Argument. Specifies one or more instances of OneHotEncoder Transformation. One hot encoding allows user to re-express categorical data as one or more numeric data elements, creating a binary numeric field for each categorical data value. Check the documentation of 'teradataml.analytics.Transformation.OneHotEncoder' to know more about OneHotEncoder. Types: OneHotEncoder OR List of OneHotEncoder fillna: Optional Argument. Specifies one or more instances of FillNa Transformation. This argument allows user to perform a missing value/null replacement transformation. Check the documentation of 'teradataml.analytics.Transformation.FillNa' to know more about FillNa. Types: FillNa OR List of FillNa label_encode: Optional Argument. Specifies one or more instances of LabelEncoder Transformation. This allows to re-express existing values of a categorical data column (variable) into a new coding scheme. Check the documentation of 'teradataml.analytics.Transformation.LabelEncoder' to know more about LabelEncoder. Types: LabelEncoder OR List of LabelEncoder rescale: Optional Argument. Specifies one or more instances of MinMaxScalar Transformation. This limits the upper and lower boundaries of the data in a continuous numeric column using a linear rescaling function based on maximum and minimum data values. Check the documentation of 'teradataml.analytics.Transformation.MinMaxScalar' to know more about MinMaxScalar. Types: MinMaxScalar OR List of MinMaxScalar retain: Optional Argument. Specifies one or more instances of Retain Transformation. This argument allows user to retain columns from input to output. Check the documentation of 'teradataml.analytics.Transformation.Retain' to know more about Retain. Types: Retain OR List of Retain sigmoid: Optional Argument. Specifies one or more instances of Sigmoid Transformation. This argument allows user to perform a rescaling using sigmoid transformation. Check the documentation of 'teradataml.analytics.Transformation.Sigmoid' to know more about Sigmoid. Types: Sigmoid OR List of Sigmoid zscore: Optional Argument. Specifies one or more instances of ZScore Transformation. This argument allows user to perform a rescaling using Z-Score transformation. Check the documentation of 'teradataml.analytics.Transformation.ZScore' to know more about ZScore. Types: ZScore OR List of ZScore fallback: Optional Argument. Specifies whether a mirrored copy of underlying table of output DataFrame is required or not. Default Value: False Types: bool index_columns: Optional Argument. Specifies the name(s) of the output column(s) to be used as index in output DataFrame. Types: str OR List of Strings (str) unique_index: Optional Argument. Specifies whether the underlying output table should contain a unique primary index or not. Default Value: False Types: bool key_columns: Optional Argument. Specifies the name(s) of the column(s) that can be unique key in input and output teradataml DataFrame. When null replacement is requested, i.e., "fillna" argument is used either in FillNa transformation or in combination with a Binning, Derive, OneHotEncoder, LabelEncoder, MinMaxScalar, Sigmoid, or ZScore transformation, the "key_columns" argument must be specified. Types: str OR List of Strings (str) allow_duplicates: Optional Argument. Specifies whether output should contain duplicate rows or not. Types: bool nopi: Optional Argument. Specifies whether the underlying output table should contain no index columns. When True, output table does not contain index columns. Note: When this argument is set to True, "allow_duplicates" must also be set to True. Types: bool filter: Optional Argument. Specifies the clause to filter rows selected for transformation. For example, filter = "cust_id > 0" Types: str gen_sql_only: Optional Argument. Specifies whether to generate only SQL for the function. When set to True, function SQL is generated, not executed, which can be accessed using show_query() method, otherwise SQL is just executed but not returned. Default Value: False Types: bool RETURNS: An instance of Transform. Output teradataml DataFrames can be accessed using attribute references, such as TransformObj.<attribute_name>. Output teradataml DataFrame attribute name is: result. RAISES: TeradataMlException, TypeError, ValueError EXAMPLES: # Notes: # 1. To execute Vantage Analytic Library functions, # a. import "valib" object from teradataml. # b. set 'configure.val_install_location' to the database name where Vantage # analytic library functions are installed. # 2. Datasets used in these examples can be loaded using Vantage Analytic Library # installer. # Import valib object from teradataml to execute this function. from teradataml import valib # Set the 'configure.val_install_location' variable. from teradataml import configure configure.val_install_location = "SYSLIB" # Create required teradataml DataFrame. df = DataFrame("customer") print(df) # Let's look at multiple transformations done via Transform. # Example of running individual transformations can be found in the example sections # for each Transformation. # Binning performed on column "income" with different number of bins. bin_1 = Binning(style="bins", value=3, columns="income", out_columns="inc1") bin_2 = Binning(style="bins", value=4, columns="income", out_columns="inc2") bin_3 = Binning(style="bins", value=5, columns="income", out_columns="inc3") # Derive transformation done on "age" column with different arithmetic formula. derive_1 = Derive(formula="x/10", columns="age", out_column="age1") derive_2 = Derive(formula="x*10", columns="age", out_column="age2") derive_3 = Derive(formula="x+10", columns="age", out_column="age3") # One hot encoding done for values in columns "gender" and "marital_status". dc_1 = OneHotEncoder(values=["M", "F"], columns="gender") dc_2 = OneHotEncoder(values=[1, 2, 3, 4], columns="marital_status") # Replace null values in column "street_nbr", by using 'literal' and 'mean' null style. fn_1 = FillNa(style="literal", value=0, columns="street_nbr") fn_2 = FillNa(columns="street_nbr", out_columns="street_nbr2") # Recode values in column "marital_status", using different coding schemes. rc_1 = LabelEncoder(values={1: "S", 2: "M", 3: "S", 4: "S"}, default="SAME", columns="marital_status", out_columns="mar1") rc_2 = LabelEncoder(values={1: "B", 2: "A", 3: "B", 4: "B"}, default=None, columns="marital_status", out_columns="mar2") # Values in column "age" are rescaled using MinMaxScalar using # 1. Both lower bound and upper bound, # 2. Only lower bound, # 3. Only upper bound. rs_1 = MinMaxScalar(lbound=0, ubound=100, columns="age", out_columns="age4") rs_2 = MinMaxScalar(lbound=0, ubound=None, columns="age", out_columns="age5") rs_3 = MinMaxScalar(lbound=None, ubound=100, columns="age", out_columns="age6") # Rescale values in column "nbr_children" using various sigmoid functions. sig_1 = Sigmoid(style="logit", columns="nbr_children", out_columns="nbr1") sig_2 = Sigmoid(style="modifiedlogit", columns="nbr_children", out_columns="nbr2") sig_3 = Sigmoid(style="tanh", columns="nbr_children", out_columns="nbr3") # Rescale values in column "years_with_bank" using Z-Score values. zscore = ZScore(columns="years_with_bank", out_columns="ywb") # Retain columns "income", "age", "years_with_bank" and "nbr_children" in the # transformed output. retain = Retain(columns=["income", "age", "years_with_bank", "nbr_children"]) # Example 1: Execute the Transform function. obj = valib.Transform(data=df, bins=[bin_1, bin_2, bin_3], derive=[derive_1, derive_2, derive_3], one_hot_encode=[dc_1, dc_2], fillna=[fn_1, fn_2], label_encode=[rc_1, rc_2], rescale=[rs_1, rs_2, rs_3], retain=retain, sigmoid=[sig_1, sig_2, sig_3], zscore=zscore, key_columns="cust_id") # Print the results. print(obj.result) # Example 2: Generate only SQL for the function, but do not execute the same. obj = valib.Transform(data=df, bins=[bin_1, bin_2, bin_3], derive=[derive_1, derive_2, derive_3], one_hot_encode=[dc_1, dc_2], fillna=[fn_1, fn_2], label_encode=[rc_1, rc_2], rescale=[rs_1, rs_2, rs_3], retain=retain, sigmoid=[sig_1, sig_2, sig_3], zscore=zscore, key_columns="cust_id", gen_sql_only=True) # Print the generated SQL. print(obj.show_query("sql")) # Print both generated SQL and stored procedure call. print(obj.show_query("both")) # Print the stored procedure call. print(obj.show_query()) print(obj.show_query("sp"))