td_pca_valib - Teradata Package for R Function Reference, 17.00

Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Release Date
July 2021
Content Type
Programming Reference
Publication ID
B700-4007-090K
Language
English (United States)

Description

Factor Analysis is one of the most fundamental types of statistical analysis, and Principal Components Analysis (PCA) is arguably the most common variety of Factor Analysis. In PCA, a set of variables (represented by columns) is reduced to a smaller number of factors that account for most of the variance in the variables. This is useful for reducing the number of variables by converting them to factors, or for gaining insight into the nature of the variables when they are used in further data analysis.

Some of the key features of PCA are outlined below.

  1. One or more group by columns can optionally be specified so that an input matrix is built for each combination of group by column values, and a separate PCA model is then built for each matrix.

  2. A Near Dependency Report is available to identify two or more columns that may be collinear. This report is requested by setting the argument "near.dep.report" to TRUE; if desired, the arguments "cond.ind.threshold" and "variance.prop.threshold" can also be set to adjust its thresholds.

  3. Both orthogonal and oblique factor rotations are available. Refer to the "rotation.type" argument.

  4. There are three Prime Factor reports available. Refer to the "load.report", "vars.report", and "vars.load.report" arguments.

Usage

td_pca_valib(data, columns, ...)

Arguments

data

Required Argument.
Specifies the input data containing the columns on which to perform PCA.
Types: tbl_teradata

columns

Required Argument.
Specifies the name(s) of the column(s) representing the variables used in building a PCA model. Alternatively, it accepts one of the pre-defined strings below to specify all columns or all numeric columns.
Permitted Values:

  1. Name(s) of the column(s) in "data".

  2. Pre-defined strings:

    1. 'all' - all columns

    2. 'allnumeric' - all numeric columns

Types: character OR vector of Strings (character)

...

Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

Function returns an object of class "td_pca_valib", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Other Arguments

exclude.columns

Optional Argument.
Specifies the name(s) of the column(s) to exclude from the PCA analysis.
If 'all' or 'allnumeric' is used in the "columns" argument, this argument can be used to exclude specific columns from the analysis.
Types: character OR vector of Strings (character)

cond.ind.threshold

Optional Argument.
Required when the argument "near.dep.report" is set to TRUE.
Specifies the condition index threshold parameter to generate Near Dependency Report.
Default Value: 30
Types: numeric

min.eigen

Optional Argument.
Specifies the minimum eigenvalue a factor must have to be included.
Default Value: 1.0
Types: numeric
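
The effect of a minimum eigenvalue cutoff can be illustrated locally in base R. This is a minimal sketch that does not call Vantage: it uses the built-in mtcars data, and it assumes a factor is kept when its correlation-matrix eigenvalue is at least the threshold. The exact comparison that td_pca_valib applies is not stated in this reference.

```r
# Local illustration only; td_pca_valib runs in-database and is not called here.
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Eigenvalues of the correlation matrix, largest first.
eig <- eigen(cor(x), symmetric = TRUE)$values

# Assumption: keep factors whose eigenvalue is >= the threshold.
min_eigen <- 1.0
retained <- which(eig >= min_eigen)

print(eig)
print(retained)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the largest one is never below 1, so the default threshold of 1.0 always keeps at least one factor in this setting.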

load.report

Optional Argument.
Specifies whether to generate the Prime Factor Loadings Report, in which rows are variables and columns are factors, matching each variable with the factor for which it has the largest absolute loading value.
When set to TRUE, Prime Factor Loadings Report is generated and added in the XML result string.
Default Value: FALSE
Types: logical
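
The idea of a prime factor, the factor with the largest absolute loading for a variable, can be sketched in base R. The loadings matrix below is hypothetical and made up for illustration; it is not output from td_pca_valib, and the report layout produced by the function is not reproduced here.

```r
# Hypothetical loadings: rows are variables, columns are factors.
loadings <- matrix(c( 0.82, -0.10,
                     -0.15,  0.77,
                     -0.64,  0.35),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("age", "years_with_bank", "nbr_children"),
                                   c("factor1", "factor2")))

# For each variable, find the column with the largest absolute loading.
prime_idx <- apply(abs(loadings), 1, which.max)
prime_factor <- colnames(loadings)[prime_idx]
names(prime_factor) <- rownames(loadings)

print(prime_factor)
```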

vars.load.report

Optional Argument.
Specifies whether to generate the Prime Factor Variables with Loadings Report, which is equivalent to the Prime Factor Variables Report with the addition of the loading values that determined the relationship between factors and variables. The absolute size of a loading value indicates the strength of the relationship, and its sign indicates the direction, i.e., either a positive or negative correlation.
When set to TRUE, Prime Factor Variables with Loadings Report is generated and added in the XML result string.
Default Value: FALSE
Types: logical

vars.report

Optional Argument.
Specifies whether to generate the Prime Factor Variables Report, in which rows are variables and columns are factors, matching variables with their prime factors and, if a threshold is used, possibly with factors other than their prime factors. (A threshold is specified either as a percent with the "percent.threshold" argument or as a loading with the "load.threshold" argument.)
When set to TRUE, Prime Factor Variables Report is generated and added in the XML result string.
Default Value: FALSE
Types: logical

gamma

Optional Argument.
Required when the argument "rotation.type" is set to 'orthomax' or 'orthomin'.
Specifies the gamma value to be set when 'orthomax' or 'orthomin' is used in "rotation.type" argument.
Note:

  • This argument is ignored for values of "rotation.type" other than 'orthomax' and 'orthomin'.

Types: numeric
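
The role of gamma can be illustrated with one common form of the orthomax criterion, in which gamma weights the penalty term. This is a sketch under an assumption: the rotation equation used by the Vantage Analytic Library is not shown in this reference, so the textbook form is used, and the function name orthomax_criterion is hypothetical. The oblique 'orthomin' family is not covered.

```r
# One common form of the orthomax criterion (assumed, not taken from VAL):
#   sum over factors j of [ sum_i l_ij^4 - (gamma / p) * (sum_i l_ij^2)^2 ]
# where p is the number of variables (rows of the loadings matrix).
orthomax_criterion <- function(loadings, gamma) {
  p <- nrow(loadings)
  sum(colSums(loadings^4) - (gamma / p) * colSums(loadings^2)^2)
}

# Hypothetical loadings with perfectly simple structure.
L <- diag(2)

crit_varimax   <- orthomax_criterion(L, gamma = 1.0)  # gamma used by 'varimax'
crit_quartimax <- orthomax_criterion(L, gamma = 0.0)  # gamma used by 'quartimax'
```

A rotation searches for the rotated loadings that maximize this value, so different gamma settings favor different loading patterns.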

group.columns

Optional Argument.
Specifies the name(s) of the input column(s) dividing the input "data" into partitions, one for each combination of values in the group by columns. For each partition or combination of values, a separate factor model is built. The default case is no group by columns.
Types: character OR vector of Strings (character)
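
Building one model per partition can be pictured with a local base R sketch. It uses the built-in iris data with Species standing in for a group by column; this is an analogy only, and it does not show how td_pca_valib partitions data in-database.

```r
# Split the numeric columns into one partition per group value.
by_group <- split(iris[, 1:4], iris$Species)

# Compute a separate eigen decomposition (here, just the eigenvalues of the
# correlation matrix) for each partition.
eigen_by_group <- lapply(by_group, function(g) {
  eigen(cor(g), symmetric = TRUE)$values
})

print(eigen_by_group)
```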

matrix.input

Optional Argument.
Specifies whether the input tbl_teradata is an extended sum-of-squares-and-cross-products (ESSCP) matrix built by td_matrix_valib(). Using a pre-built matrix avoids building a matrix internally each time this function is run, providing a significant performance improvement.
When set to TRUE, the columns specified with the "columns" argument may be a subset of the columns in the matrix and may be specified in any order. The columns must, however, all be present in the matrix. Further, if group by columns are specified in the td_matrix_valib() call, these same group by columns must be specified in this function.
Note:

  • If the input tbl_teradata "data" represents a saved matrix, set this argument to TRUE to get predictable results.

Default Value: FALSE
Types: logical
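
Why a subset of matrix columns is enough can be sketched locally. The example below assumes the "extension" in an ESSCP matrix is a constant column of ones added before taking cross-products, so that the matrix carries the row count and column sums alongside the raw cross-products. The stored layout produced by td_matrix_valib() is not shown in this reference, and the helper cov_from_esscp is hypothetical.

```r
x <- as.matrix(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Assumed ESSCP construction: prepend a constant column, then cross-products.
xa <- cbind(const = 1, x)
esscp <- crossprod(xa)  # equivalent to t(xa) %*% xa

# Recover the covariance of any subset of columns, in any order,
# without going back to the raw data.
cov_from_esscp <- function(esscp, cols) {
  n  <- esscp["const", "const"]   # number of rows
  s  <- esscp["const", cols]      # column sums
  ss <- esscp[cols, cols]         # raw sums of squares and cross-products
  (ss - tcrossprod(s) / n) / (n - 1)
}

cov_sub <- cov_from_esscp(esscp, c("wt", "mpg"))
```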

matrix.type

Optional Argument.
Specifies the type of matrix to process, which affects the scaling of measures and scores.
Permitted Values: 'correlation', 'covariance'
Default Value: 'correlation'
Types: character
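
The scaling difference between the two matrix types can be seen in a local base R sketch using the built-in mtcars data. It only compares eigenvalues of the two matrices; it does not reproduce td_pca_valib output.

```r
x <- mtcars[, c("mpg", "disp", "hp")]

# Correlation matrix: every variable is standardized, so each contributes
# one unit of variance and the eigenvalues sum to the number of variables.
eig_cor <- eigen(cor(x), symmetric = TRUE)$values

# Covariance matrix: variables keep their original scale, so the eigenvalues
# sum to the total of the raw variances.
eig_cov <- eigen(cov(x), symmetric = TRUE)$values

print(eig_cor)
print(eig_cov)
```

With 'covariance', variables measured on larger scales dominate the leading factors; with 'correlation', all variables are weighted equally.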

near.dep.report

Optional Argument.
Specifies whether to include in the output an XML report showing columns that are collinear. The report is included in the XML output only if collinearity is detected. Two threshold arguments are available for this report, "cond.ind.threshold" and "variance.prop.threshold".
Default Value: FALSE
Types: logical
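
A condition index can be sketched locally to show what the "cond.ind.threshold" default of 30 is compared against. This assumes the usual Belsley-style diagnostic (singular values of the column-scaled data); the procedure the Vantage Analytic Library uses is not described in this reference, and condition_indices is a hypothetical helper.

```r
# Condition indices: largest singular value divided by each singular value,
# after scaling every column to unit length.
condition_indices <- function(x) {
  x <- as.matrix(x)
  xs <- sweep(x, 2, sqrt(colSums(x^2)), "/")
  d <- svd(xs, nu = 0, nv = 0)$d
  max(d) / d
}

# Synthetic data in which x3 is almost an exact linear combination of x1 and x2.
k <- 1:20
x <- cbind(x1 = k, x2 = k^2, x3 = k + 2 * k^2 + 1e-3 * sin(k))

ci <- condition_indices(x)
flagged <- ci > 30  # 30 is the documented default threshold

print(ci)
```

A large condition index signals a near dependency; the variance proportions then indicate which columns take part in it.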

rotation.type

Optional Argument.
Specifies which of several schemes to use for rotating factors, possibly giving better results. Both orthogonal and oblique rotations are provided. The gamma value in the rotation equation takes a different value for each rotation type, with f representing the number of factors and v the number of variables. Refer to the table below:

--------------- ------------------ -------------------- ------------------------
rotation.type   gamma value        Orthogonal/Oblique   Notes
--------------- ------------------ -------------------- ------------------------
equamax         f/2                orthogonal           -
orthomax        Set by user        orthogonal           -
parsimax        v(f-1)/(v+f+2)     orthogonal           -
quartimax       0.0                orthogonal           -
varimax         1.0                orthogonal           -
biquartimin     0.5                oblique              least oblique rotation
covarimin       2.0                oblique              -
orthomin        Set by user        oblique              -
quartimin       0.0                oblique              most oblique rotation
--------------- ------------------ -------------------- ------------------------

Types: character
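
The gamma lookup in the table can be written as a small helper. This is a sketch only: rotation_gamma is a hypothetical function, not part of the package, and the parsimax expression is transcribed from the table with the denominator assumed to be the whole of v+f+2.

```r
# Hypothetical helper returning the gamma value for a rotation type.
# f = number of factors, v = number of variables.
rotation_gamma <- function(rotation_type, f, v, user_gamma = NULL) {
  switch(rotation_type,
    equamax     = f / 2,
    parsimax    = v * (f - 1) / (v + f + 2),  # as given in the table (assumed grouping)
    quartimax   = 0.0,
    varimax     = 1.0,
    biquartimin = 0.5,
    covarimin   = 2.0,
    quartimin   = 0.0,
    orthomax    = ,  # falls through to orthomin: gamma is set by the user
    orthomin    = {
      if (is.null(user_gamma)) {
        stop("'gamma' must be supplied for rotation type '", rotation_type, "'")
      }
      user_gamma
    },
    stop("unknown rotation type: ", rotation_type)
  )
}

g_equamax  <- rotation_gamma("equamax", f = 4, v = 10)
g_orthomax <- rotation_gamma("orthomax", f = 4, v = 10, user_gamma = 0.7)
```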

load.threshold

Optional Argument.
Specifies a threshold factor loading value. If this argument is specified, a factor that is not a prime factor may be associated with a variable.
Notes:

  1. This argument and the argument "percent.threshold" cannot both be specified.

  2. This argument is used when the argument "vars.report" is set to TRUE; ignored otherwise.

Types: numeric
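
How a loading threshold can associate a variable with more than its prime factor is sketched below. The rule used here is an assumption made for illustration (a factor is associated when its absolute loading is at least the threshold, and the prime factor is always kept); the exact rule td_pca_valib applies is not stated in this reference. The loadings matrix is hypothetical.

```r
# Hypothetical loadings: rows are variables, columns are factors.
loadings <- matrix(c( 0.82, -0.10,
                     -0.15,  0.77,
                     -0.64,  0.35),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("age", "years_with_bank", "nbr_children"),
                                   c("factor1", "factor2")))

load_threshold <- 0.3

# Assumed rule: associate every factor whose absolute loading meets the threshold.
assoc <- abs(loadings) >= load_threshold

# Always keep each variable's prime factor (largest absolute loading).
prime_idx <- apply(abs(loadings), 1, which.max)
assoc[cbind(seq_len(nrow(loadings)), prime_idx)] <- TRUE

print(assoc)
```

Here nbr_children is associated with factor2 as well as its prime factor, factor1, because its loading of 0.35 clears the threshold.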

percent.threshold

Optional Argument.
Specifies a threshold percent. If this argument is specified, a factor that is not a prime factor may be associated with a variable.
Notes:

  1. This argument and the argument "load.threshold" cannot both be specified.

  2. This argument is used when the argument "vars.report" is set to TRUE; ignored otherwise.

Types: numeric

variance.prop.threshold

Optional Argument.
Required when the argument "near.dep.report" is set to TRUE.
Specifies the variance proportion threshold parameter to generate Near Dependency Report.
Default Value: 0.5
Types: numeric

Examples

# Notes:
#   1. To execute Vantage Analytic Library functions, set option 'val.install.location' to
#      the database name where Vantage analytic library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)

# Example 1: Generate Near Dependency Report.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    cond.ind.threshold=3,
                    near.dep.report=TRUE,
                    variance.prop.threshold=.3)

# Print the results.
print(obj$result)

# Example 2: Run PCA on two group by columns. The result contains one row
#            for each group by column combination.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    group.columns=c("gender", "marital_status"))

# Print the results.
print(obj$result)

# Example 3: Run PCA by taking input from a pre-built matrix. Both the Matrix Build
#            and PCA Analysis are shown. Note that only a subset of matrix columns
#            is used.
mat_obj <- td_matrix_valib(data=df,
                           columns=c("income", "age", "years_with_bank", "nbr_children"),
                           type="esscp")
obj <- td_pca_valib(data=mat_obj$result,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    matrix.input=TRUE)

# Print the results.
print(obj$result)

# Example 4: Run PCA by taking input from a pre-built matrix with group by columns.
#            Both the Matrix Build and PCA Analysis are shown. Note that only a
#            subset of matrix columns is used.
mat_obj <- td_matrix_valib(data=df,
                           columns=c("income", "age", "years_with_bank", "nbr_children"),
                           group.columns="gender",
                           type="esscp")
obj <- td_pca_valib(data=mat_obj$result,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    matrix.input=TRUE,
                    group.columns="gender")

# Print the results.
print(obj$result)

# Example 5: Run PCA with 'varimax' rotation.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    rotation.type="varimax")

# Print the results.
print(obj$result)

# Example 6: Run PCA with Prime Factor reports requested. The "percent.threshold"
#            argument applies to "vars.report" argument.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    load.report=TRUE,
                    vars.load.report=TRUE,
                    vars.report=TRUE,
                    percent.threshold=0.9)

# Print the results.
print(obj$result)