| |
- Transform(data, bins=None, derive=None, one_hot_encode=None, fillna=None, label_encode=None, rescale=None, retain=None, sigmoid=None, zscore=None, fallback=False, index_columns=None, unique_index=False, key_columns=None, allow_duplicates=None, nopi=None, filter=None, gen_sql_only=False)
- DESCRIPTION:
The Variable Transformation analysis reads a teradataml DataFrame and produces an
output containing transformed columns. This is useful when preparing data for input
to an analytic algorithm. For example, a K-Means Clustering algorithm typically
produces better results when the input columns are first converted to their Z-Score
values to put all input variables on an equal footing, regardless of their magnitude.
Function supports following transformations:
* Binning - Binning replaces a continuous numeric column with a categorical one
to produce ordinal values (for example, numeric categorical values where
order is meaningful).
* Derive - The Derive transformation requires the free-form transformation be
specified as a formula.
* One Hot Encoding - One Hot Encoding is useful when a categorical data element
must be re-expressed as one or more numeric data elements,
creating a binary numeric field for each categorical data value.
* Missing Value Treatment or Null Replacement.
* Label Encoding - Allows to re-express existing values of a categorical data
column (variable) into a new coding scheme or to correct data
quality problems and focus an analysis on a value.
* Min-Max Scaling - Limits the upper and lower boundaries of the data in a
continuous numeric column using a linear rescaling function
based on maximum and minimum data values.
* Retain - Allows copying of one or more columns into the final analytic data set.
* Sigmoid - Provides rescaling of continuous numeric data using a type of sigmoid
or s-shaped function.
* ZScore - Provides rescaling of continuous numeric data using Z-Scores.
PARAMETERS:
data:
Required Argument.
Specifies the input data to perform variable transformations.
Types: teradataml DataFrame
bins:
Optional Argument.
Specifies one or more instances of Binning Transformation. Binning replaces a
continuous numeric column with a categorical one to produce ordinal values (for
example, numeric categorical values where order is meaningful).
Check the documentation of 'teradataml.analytics.Transformation.Binning' to know
more about Binning.
Types: Binning OR List of Binning
derive:
Optional Argument.
Specifies one or more instances of Derive Transformation. This argument allows
user to perform a free form transformation using arithmetic formula.
Check the documentation of 'teradataml.analytics.Transformation.Derive' to know
more about Derive.
Types: Derive OR List of Derive
one_hot_encode:
Optional Argument.
Specifies one or more instances of OneHotEncoder Transformation. One hot encoding
allows user to re-express categorical data as one or more numeric data elements,
creating a binary numeric field for each categorical data value.
Check the documentation of 'teradataml.analytics.Transformation.OneHotEncoder' to
know more about OneHotEncoder.
Types: OneHotEncoder OR List of OneHotEncoder
fillna:
Optional Argument.
Specifies one or more instances of FillNa Transformation. This argument allows
user to perform a missing value/null replacement transformation.
Check the documentation of 'teradataml.analytics.Transformation.FillNa' to know
more about FillNa.
Types: FillNa OR List of FillNa
label_encode:
Optional Argument.
Specifies one or more instances of LabelEncoder Transformation. This allows
to re-express existing values of a categorical data column (variable) into a
new coding scheme.
Check the documentation of 'teradataml.analytics.Transformation.LabelEncoder' to
know more about LabelEncoder.
Types: LabelEncoder OR List of LabelEncoder
rescale:
Optional Argument.
Specifies one or more instances of MinMaxScalar Transformation. This limits
the upper and lower boundaries of the data in a continuous numeric column using
a linear rescaling function based on maximum and minimum data values.
Check the documentation of 'teradataml.analytics.Transformation.MinMaxScalar' to
know more about MinMaxScalar.
Types: MinMaxScalar OR List of MinMaxScalar
retain:
Optional Argument.
Specifies one or more instances of Retain Transformation. This argument allows
user to retain columns from input to output.
Check the documentation of 'teradataml.analytics.Transformation.Retain' to know
more about Retain.
Types: Retain OR List of Retain
sigmoid:
Optional Argument.
Specifies one or more instances of Sigmoid Transformation. This argument allows
user to perform a rescaling using sigmoid transformation.
Check the documentation of 'teradataml.analytics.Transformation.Sigmoid' to know
more about Sigmoid.
Types: Sigmoid OR List of Sigmoid
zscore:
Optional Argument.
Specifies one or more instances of ZScore Transformation. This argument allows
user to perform a rescaling using Z-Score transformation.
Check the documentation of 'teradataml.analytics.Transformation.ZScore' to know
more about ZScore.
Types: ZScore OR List of ZScore
fallback:
Optional Argument.
Specifies whether a mirrored copy of underlying table of output DataFrame is
required or not.
Default Value: False
Types: bool
index_columns:
Optional Argument.
Specifies the name(s) of the output column(s) to be used as index in output DataFrame.
Types: str OR List of Strings (str)
unique_index:
Optional Argument.
Specifies whether the underlying output table should contain a unique primary
index or not.
Default Value: False
Types: bool
key_columns:
Optional Argument.
Specifies the name(s) of the column(s) that can be unique key in input and
output teradataml DataFrame. When null replacement is requested, i.e., "fillna"
argument is used either in FillNa transformation or in combination with a
Binning, Derive, OneHotEncoder, LabelEncoder, MinMaxScalar, Sigmoid, or ZScore
transformation, the "key_columns" argument must be specified.
Types: str OR List of Strings (str)
allow_duplicates:
Optional Argument.
Specifies whether output should contain duplicate rows or not.
Types: bool
nopi:
Optional Argument.
Specifies whether the underlying output table should contain no index columns.
When True, output table does not contain index columns.
Note:
When this argument is set to True, "allow_duplicates" must also be set to True.
Types: bool
filter:
Optional Argument.
Specifies the clause to filter rows selected for transformation.
For example,
filter = "cust_id > 0"
Types: str
gen_sql_only:
Optional Argument.
Specifies whether to generate only SQL for the function.
When set to True, function SQL is generated, not executed, which can be accessed
using show_query() method, otherwise SQL is just executed but not returned.
Default Value: False
Types: bool
RETURNS:
An instance of Transform.
Output teradataml DataFrames can be accessed using attribute references, such as
TransformObj.<attribute_name>.
Output teradataml DataFrame attribute name is: result.
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
df = DataFrame("customer")
print(df)
# Let's look at multiple transformations done via Transform.
# Example of running individual transformations can be found in the example sections
# for each Transformation.
# Binning performed on column "income" with different number of bins.
bin_1 = Binning(style="bins", value=3, columns="income", out_columns="inc1")
bin_2 = Binning(style="bins", value=4, columns="income", out_columns="inc2")
bin_3 = Binning(style="bins", value=5, columns="income", out_columns="inc3")
# Derive transformation done on "age" column with different arithmetic formula.
derive_1 = Derive(formula="x/10", columns="age", out_column="age1")
derive_2 = Derive(formula="x*10", columns="age", out_column="age2")
derive_3 = Derive(formula="x+10", columns="age", out_column="age3")
# One hot encoding done for values in columns "gender" and "marital_status".
dc_1 = OneHotEncoder(values=["M", "F"], columns="gender")
dc_2 = OneHotEncoder(values=[1, 2, 3, 4], columns="marital_status")
# Replace null values in column "street_nbr", by using 'literal' and 'mean' null style.
fn_1 = FillNa(style="literal", value=0, columns="street_nbr")
fn_2 = FillNa(columns="street_nbr", out_columns="street_nbr2")
# Recode values in column "marital_status", using different coding schemes.
rc_1 = LabelEncoder(values={1: "S", 2: "M", 3: "S", 4: "S"}, default="SAME",
columns="marital_status", out_columns="mar1")
rc_2 = LabelEncoder(values={1: "B", 2: "A", 3: "B", 4: "B"}, default=None,
columns="marital_status", out_columns="mar2")
# Values in column "age" are rescaled using MinMaxScalar using
# 1. Both lower bound and upper bound,
# 2. Only lower bound,
# 3. Only upper bound.
rs_1 = MinMaxScalar(lbound=0, ubound=100, columns="age", out_columns="age4")
rs_2 = MinMaxScalar(lbound=0, ubound=None, columns="age", out_columns="age5")
rs_3 = MinMaxScalar(lbound=None, ubound=100, columns="age", out_columns="age6")
# Rescale values in column "nbr_children" using various sigmoid functions.
sig_1 = Sigmoid(style="logit", columns="nbr_children", out_columns="nbr1")
sig_2 = Sigmoid(style="modifiedlogit", columns="nbr_children", out_columns="nbr2")
sig_3 = Sigmoid(style="tanh", columns="nbr_children", out_columns="nbr3")
# Rescale values in column "years_with_bank" using Z-Score values.
zscore = ZScore(columns="years_with_bank", out_columns="ywb")
# Retain columns "income", "age", "years_with_bank" and "nbr_children" in the
# transformed output.
retain = Retain(columns=["income", "age", "years_with_bank", "nbr_children"])
# Example 1: Execute the Transform function.
obj = valib.Transform(data=df,
bins=[bin_1, bin_2, bin_3],
derive=[derive_1, derive_2, derive_3],
one_hot_encode=[dc_1, dc_2],
fillna=[fn_1, fn_2],
label_encode=[rc_1, rc_2],
rescale=[rs_1, rs_2, rs_3],
retain=retain,
sigmoid=[sig_1, sig_2, sig_3],
zscore=zscore,
key_columns="cust_id")
# Print the results.
print(obj.result)
# Example 2: Generate only SQL for the function, but do not execute the same.
obj = valib.Transform(data=df,
bins=[bin_1, bin_2, bin_3],
derive=[derive_1, derive_2, derive_3],
one_hot_encode=[dc_1, dc_2],
fillna=[fn_1, fn_2],
label_encode=[rc_1, rc_2],
rescale=[rs_1, rs_2, rs_3],
retain=retain,
sigmoid=[sig_1, sig_2, sig_3],
zscore=zscore,
key_columns="cust_id",
gen_sql_only=True)
# Print the generated SQL.
print(obj.show_query("sql"))
# Print both generated SQL and stored procedure call.
print(obj.show_query("both"))
# Print the stored procedure call.
print(obj.show_query())
print(obj.show_query("sp"))
|