Description
The Variable Transformation analysis reads the input data in tbl_teradata
object and produces an output containing transformed columns. This is useful
when preparing data for input to an analytic algorithm. For example,
a K-Means Clustering algorithm typically produces better results when the
input columns are first converted to their Z-Score values to put all input
variables on an equal footing, regardless of their magnitude.
Function supports the following transformations:
Binning - replaces a continuous numeric column with a categorical one to produce ordinal values (for example, numeric categorical values where order is meaningful).
Derive - allows you to enter free-form formulas based on columns within a table. It requires the free-form transformation to be specified as a formula using operators, arguments, and functions.
One Hot Encoding - re-expresses a categorical data element as one or more numeric data elements and creates a binary numeric field for each categorical data value.
Null Replacement - defines how missing values are treated.
Label Encoding - allows to re-express existing values of a categorical data column (variable) into a new coding scheme or to correct data quality problems and focus an analysis on a value.
Min-Max Scaling - limits the upper and lower boundaries of the data in a continuous numeric column using a linear rescaling function based on maximum and minimum data values.
Retain - allows copying of one or more columns into the final analytic data set.
Sigmoid - provides rescaling of continuous numeric data using a type of sigmoid or s-shaped function.
ZScore - provides rescaling of continuous numeric data using Z-Scores.
Usage
td_transform_valib(data, bins=NULL, derive=NULL, one.hot.encode=NULL,
fillna=NULL, label.encode=NULL, rescale=NULL,
retain=NULL, sigmoid=NULL, zscore=NULL, ...)
Arguments
data |
Required Argument. |
bins |
Optional Argument. |
derive |
Optional Argument. |
one.hot.encode |
Optional Argument. |
fillna |
Optional Argument. |
label.encode |
Optional Argument. |
rescale |
Optional Argument. |
retain |
Optional Argument. |
sigmoid |
Optional Argument. |
zscore |
Optional Argument. |
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
Function returns result based on transformation object passed. Named list member can be referenced directly with the "$" operator using name: result.
Other Arguments
fallback
Optional Argument.
Specifies whether a mirrored copy of underlying table of
output tbl_teradata is required or not.
Default Value: FALSE
Types: logical
index.columns
Optional Argument.
Specifies the name(s) of the output column(s) to be
used as index in output tbl_teradata.
Types: character OR vector of Strings (character)
unique.index
Optional Argument.
Specifies whether the underlying output table should
contain a unique primary index or not.
Default Value: FALSE
Types: logical
key.columns
Optional Argument.
Specifies the name(s) of the column(s) that can be
unique key in input and output tbl_teradata. When
null replacement is requested, i.e., "fillna"
argument is used either in tdFillNa
transformation or in combination with a
tdBinning
, tdDerive
,
tdOneHotEncoder
, td_LabelEncoder
,
tdMinMaxScalar
, tdSigmoid
,
td_ZScore
transformation, the "key.columns"
argument must be specified.
Types: character OR vector of Strings (character)
allow.duplicates
Optional Argument.
Specifies whether output should contain
duplicate rows or not.
Types: logical
nopi
Optional Argument.
Specifies whether the underlying output table should contain
no index columns. When TRUE, output table does not contain
index columns.
Note:
When this argument is set to TRUE, "allow.duplicates" must also be set to TRUE.
Types: logical
filter
Optional Argument.
Specifies the clause to filter rows selected for
transformation.
For example,
filter = "cust_id > 0"
Types: character
See Also
tdFillNa
, tdBinning
, tdDerive
,
tdOneHotEncoder
, tdLabelEncoder
,
tdMinMaxScalar
, tdSigmoid
, tdZScore
Examples
Notes:
# 1. To run any transformation, user needs to use td_transform_valib()
# function.
# 2. To do so set option 'val.install.location' to the database name
# where Vantage analytic library functions are installed.
# 3. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create object(s) of class "tbl_teradata".
customer <- tbl(con, "customer")
# Example of running individual transformations can be found in the example
# sections for each Transformation.
# Binning performed on column 'income' with different number of bins.
bin_1 <- tdBinning(style="bins", value=3, columns=list("income"="inc1"))
bin_2 <- tdBinning(style="bins", value=4, columns=list("income"="inc2"))
bin_3 <- tdBinning(style="bins", value=5, columns=list("income"="inc3"))
# Derive transformation done on 'age' column with different arithmetic
# formula.
derive_1 <- tdDerive(formula="x/10", columns="age", out.column="age1")
derive_2 <- tdDerive(formula="x*10", columns="age", out.column="age2")
derive_3 <- tdDerive(formula="x+10", columns="age", out.column="age3")
# One hot encoding done for values in columns 'gender' and 'marital_status'.
dc_1 <- tdOneHotEncoder(values=list("M", "F"), column="gender")
dc_2 <- tdOneHotEncoder(values=list(1, 2, 3, 4), column="marital_status")
# Replace null values in column 'street_nbr', by using 'literal' and 'mean'
# null style.
fillna_1 <- tdFillNa(style="literal", value=0, columns="street_nbr")
fillna_2 <- tdFillNa(columns=list("street_nbr"="street_nbr2"))
# Recode values in column 'marital_status', using different coding schemes.
rc_1 <- tdLabelEncoder(values=list("1"="S", "2"="M", "3"="S", "4"="S"),
default="SAME", columns=list("marital_status"="mar1"))
rc_2 <- tdLabelEncoder(values=list("1"="B", "2"="A", "3"="B", "4"="B"),
default=NULL, columns=list("marital_status"="mar2"))
# Values in column 'age' are rescaled using MinMaxScalar using
# 1. Both lower bound and upper bound,
# 2. Only lower bound,
# 3. Only upper bound.
rs_1 <- tdMinMaxScalar(lbound=0, ubound=100, columns=list("age"="age4"))
rs_2 <- tdMinMaxScalar(lbound=0, ubound=None, columns=list("age"="age5"))
rs_3 <- tdMinMaxScalar(lbound=None, ubound=100, columns=list("age"="age6"))
# Rescale values in column "nbr_children" using various sigmoid functions.
sig_1 <- tdSigmoid(style="logit", columns=list("nbr_children"="nbr1"))
sig_2 <- tdSigmoid(style="modifiedlogit", columns=list("nbr_children"="nbr2"))
sig_3 <- tdSigmoid(style="tanh", columns=list("nbr_children"="nbr3"))
# Rescale values in column 'years_with_bank' using Z-Score values.
zscore <- tdZScore(columns=list("years_with_bank"="ywb"))
# Retain columns 'income', 'age', 'years_with_bank' and 'nbr_children' in the
# transformed output.
retain <- tdRetain(columns=list("income", "age", "years_with_bank",
"nbr_children"))
# Execute the td_transform_valib() function.
obj <- td_transform_valib(data=customer,
bins=c(bin_1, bin_2, bin_3),
derive=c(derive_1, derive_2, derive_3),
one.hot.encode=c(dc_1, dc_2),
fillna=c(fillna_1, fillna_2),
label.encode=c(rc_1, rc_2),
rescale=c(rs_1, rs_2),
sigmoid=c(sig_1, sig_2, sig_3),
retain=retain,
zscore=zscore,
key.columns="cust_id")
# Print the results.
print(obj$result)