Description
Factor Analysis is one of the most fundamental types of statistical analysis, and
Principal Components Analysis (PCA) is arguably the most common variety of Factor
Analysis. In PCA Analysis, a set of variables (denoted by columns) is reduced to a
smaller number of factors that account for most of the variance in the variables.
This can be useful in reducing the number of variables by converting them to factors,
or in gaining insight into the nature of the variables when they are used for further
data analysis.
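The reduction described above can be tried locally. The following is a minimal sketch in base R, assuming the built-in mtcars data set as a stand-in for a database table; it does not call Vantage and is not the function's implementation.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# scale. = TRUE standardizes the columns, which is equivalent to
# working from the correlation matrix.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)
print(round(cum_share, 3))

# Scores on the first two components can stand in for the four columns.
scores <- pca$x[, 1:2]
```

The running total shows how many components are needed to cover a chosen share of the variance.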
Some of the key features of PCA Analysis are outlined below.
One or more group by columns can optionally be specified so that an input matrix is built for each combination of group by column values, and subsequently a separate PCA Analysis model is built for each matrix.
A Near Dependency Report is available to identify two or more columns that may be collinear. Request this report by setting the argument "near.dep.report" to TRUE and, if desired, adjusting the arguments "cond.ind.threshold" and "variance.prop.threshold".
Both orthogonal and oblique factor rotations are available. Refer to the "rotation.type" parameter.
There are three Prime Factor reports available. Refer to the "load.report", "vars.report", and "vars.load.report" arguments.
Usage
td_pca_valib(data, columns, ...)
Arguments
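data |
Required Argument.
Specifies the input data containing the columns to analyze.
Types: tbl_teradata |
columns |
Required Argument.
Specifies the name(s) of the column(s) to analyze. The values 'all' and
'allnumeric' can also be used, optionally together with the
"exclude.columns" argument.
Types: character OR vector of Strings (character) |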
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
Function returns an object of class "td_pca_valib",
which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Other Arguments
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the
PCA analysis.
If 'all' or 'allnumeric' is used in the "columns" argument,
this argument can be used to exclude specific columns from the
analysis.
Types: character OR vector of Strings (character)
cond.ind.threshold
Optional Argument.
Required when the argument "near.dep.report" is set to TRUE.
Specifies the condition index threshold parameter to generate
Near Dependency Report.
Default Value: 30
Types: numeric
min.eigen
Optional Argument.
Specifies the minimum eigenvalue a factor must have to be included.
Default Value: 1.0
Types: numeric
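As a rough local illustration of an eigenvalue cutoff, the sketch below filters the eigenvalues of a correlation matrix in base R. The variable names are hypothetical, and treating the cutoff as inclusive (>=) is an assumption; the source does not say how the function handles a tie.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
eig <- eigen(cor(x), symmetric = TRUE)

# Mirrors the documented default of 1.0. Whether the comparison in the
# function is inclusive is an assumption made for this sketch.
min_eigen <- 1.0
keep <- eig$values >= min_eigen

# Eigenvectors of the factors that pass the cutoff.
retained_vectors <- eig$vectors[, keep, drop = FALSE]
print(eig$values)
print(sum(keep))
```

For a correlation matrix the eigenvalues sum to the number of variables, so a cutoff of 1.0 keeps factors that explain at least as much variance as one standardized variable.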
load.report
Optional Argument.
Specifies whether to generate Prime Factor Loadings Report in which
rows are variables and columns are factors, matching each variable with
the factor on which it has the largest absolute loading value.
When set to TRUE, Prime Factor Loadings Report is generated and added
to the XML result string.
Default Value: FALSE
Types: logical
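The idea of a prime factor can be sketched locally: for each variable, pick the factor with the largest absolute loading. This is a base R illustration with hypothetical names, assuming loadings are computed as eigenvectors scaled by the square roots of their eigenvalues; it does not reproduce the XML report.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
eig <- eigen(cor(x), symmetric = TRUE)

# Loadings for the first k factors (eigenvector times sqrt(eigenvalue)).
k <- 2
loadings <- eig$vectors[, 1:k] %*% diag(sqrt(eig$values[1:k]), nrow = k)
dimnames(loadings) <- list(colnames(x), paste0("Factor", 1:k))

# Prime factor of each variable: the column with the largest |loading|.
prime_idx <- apply(abs(loadings), 1, which.max)
prime <- data.frame(
  variable     = rownames(loadings),
  prime_factor = colnames(loadings)[prime_idx],
  loading      = loadings[cbind(seq_len(nrow(loadings)), prime_idx)]
)
print(prime)
```

The sign of the loading column gives the direction of the relationship, which is the extra information the loadings variant of the report adds.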
vars.load.report
Optional Argument.
Specifies whether to generate Prime Factor Variables with Loadings
Report, which is equivalent to Prime Factor Variables Report with the
addition of the loading values that determine the relationship
between factors and variables. The absolute size of a loading
value indicates the strength of the relationship, and its sign indicates
the direction, i.e., either a positive or negative correlation.
When set to TRUE, Prime Factor Variables with Loadings Report is
generated and added to the XML result string.
Default Value: FALSE
Types: logical
vars.report
Optional Argument.
Specifies whether to generate Prime Factor Variables Report in which
rows are variables and columns are factors, matching variables with
their prime factors and, if a threshold is used, possibly with factors
other than their prime factors. (A threshold is specified either as a
percent with the "percent.threshold" argument or as a loading with
the "load.threshold" argument.)
When set to TRUE, Prime Factor Variables Report is generated and added
to the XML result string.
Default Value: FALSE
Types: logical
gamma
Optional Argument.
Required when the argument "rotation.type" is set to 'orthomax' or
'orthomin'.
Specifies the gamma value to be set when 'orthomax' or 'orthomin' is used in
"rotation.type" argument.
Note:
This argument is ignored for values of "rotation.type" other than 'orthomax' and 'orthomin'.
Types: numeric
group.columns
Optional Argument.
Specifies the name(s) of the input column(s) dividing the input
"data" into partitions, one for each combination of values in the
group by columns. For each partition or combination of values, a
separate factor model is built. The default case is no group by
columns.
Types: character OR vector of Strings (character)
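The partitioning behaviour can be pictured with a local analogue: split the rows by every combination of the grouping columns and fit one model per partition. The sketch below uses base R and mtcars, with "am" and "vs" as hypothetical group by columns; it is an illustration, not the function's implementation.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
cols <- c("mpg", "disp", "hp", "wt")

# One partition for each combination of the two grouping columns.
parts <- split(mtcars, list(am = mtcars$am, vs = mtcars$vs), drop = TRUE)

# A separate eigen-decomposition of the correlation matrix per partition.
models <- lapply(parts, function(p) {
  eigen(cor(p[, cols]), symmetric = TRUE)$values
})
print(names(models))
print(lapply(models, round, 3))
```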
matrix.input
Optional Argument.
Specifies whether the input tbl_teradata is an extended
sum-of-squares-and-cross-products (ESSCP) matrix built by
td_matrix_valib(). Use of this feature avoids building a matrix
internally each time this function is run, providing
a significant performance improvement.
When set to TRUE, the columns specified with the "columns" argument
may be a subset of the columns in the matrix and may be specified
in any order. The columns must, however, all be present in the
matrix. Further, if group by columns are specified in the
td_matrix_valib() call, these same group by columns must
be specified in this function.
Note:
If the input tbl_teradata "data" represents a saved matrix, set this argument to TRUE to get predictable results.
Default Value: FALSE
Types: logical
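To see why a pre-built matrix is sufficient input, the sketch below builds a textbook extended SSCP matrix in base R and recovers the covariance and correlation matrices from it without touching the rows again. The layout used here (a leading column of ones) is an assumption for illustration; the source does not describe how td_matrix_valib() stores its matrix.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])

# Extended SSCP: a leading column of ones makes the matrix carry the row
# count and the column sums next to the cross products. (Assumed layout.)
xe <- cbind(intercept = 1, x)
esscp <- crossprod(xe)

n    <- esscp[1, 1]    # number of rows
sums <- esscp[1, -1]   # column sums
sscp <- esscp[-1, -1]  # raw sums of squares and cross products

# Covariance and correlation recovered from the matrix alone.
cov_from_esscp <- (sscp - tcrossprod(sums) / n) / (n - 1)
cor_from_esscp <- cov2cor(cov_from_esscp)

# A subset of the matrix columns, in any order, can be analyzed.
sub <- c("disp", "wt", "mpg")
cor_sub <- cor_from_esscp[sub, sub]
print(round(cor_sub, 3))
```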
matrix.type
Optional Argument.
Specifies the type of matrix to process, which affects the scaling of
measures and scores.
Permitted Values: 'correlation', 'covariance'
Default Value: 'correlation'
Types: character
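The practical difference between the two matrix types can be checked locally: a correlation matrix is unaffected by the units of a column, while a covariance matrix is not. This is a base R sketch with hypothetical variable names, not a call into the function.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Same data with one column expressed in different units.
x_rescaled <- transform(x, wt = wt * 1000)

ev_cor          <- eigen(cor(x), symmetric = TRUE)$values
ev_cor_rescaled <- eigen(cor(x_rescaled), symmetric = TRUE)$values
ev_cov          <- eigen(cov(x), symmetric = TRUE)$values
ev_cov_rescaled <- eigen(cov(x_rescaled), symmetric = TRUE)$values

# Correlation-based eigenvalues do not change; covariance-based ones do.
print(isTRUE(all.equal(ev_cor, ev_cor_rescaled)))
print(isTRUE(all.equal(ev_cov, ev_cov_rescaled)))
```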
near.dep.report
Optional Argument.
Specifies whether to produce an XML report showing columns that
are collinear as part of the output. The report is included
in the XML output only if collinearity is detected. Two threshold
arguments are available for this report, "cond.ind.threshold" and
"variance.prop.threshold".
Default Value: FALSE
Types: logical
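For intuition about the two thresholds, the sketch below computes condition indices and variance proportions in base R, in the style of the Belsley-Kuh-Welsch collinearity diagnostics. Whether the Near Dependency Report uses exactly this calculation is an assumption; the names and the flagging rule are hypothetical.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("disp", "hp", "wt", "cyl")]
eig <- eigen(cor(x), symmetric = TRUE)

# Condition index of each component relative to the largest eigenvalue.
cond_index <- sqrt(max(eig$values) / eig$values)

# Variance proportions: rows are variables, columns are components.
phi <- sweep(eig$vectors^2, 2, eig$values, "/")
var_prop <- phi / rowSums(phi)
dimnames(var_prop) <- list(colnames(x), paste0("comp", seq_along(eig$values)))

# Hypothetical thresholds; the values below are the ones used in Example 1.
cond_ind_threshold <- 3
variance_prop_threshold <- 0.3

# Assumed rule: a component signals a near dependency when its condition
# index exceeds the threshold and two or more variables have a variance
# proportion above the second threshold.
flagged <- lapply(which(cond_index > cond_ind_threshold), function(j) {
  rownames(var_prop)[var_prop[, j] > variance_prop_threshold]
})
flagged <- Filter(function(v) length(v) >= 2, flagged)
print(round(cond_index, 2))
print(flagged)
```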
rotation.type
Optional Argument.
Specifies the rotation type among various schemes for rotating
factors for possibly better results. Both orthogonal and oblique
rotations are provided. The gamma value in the rotation equation depends
on the rotation type, with f representing the
number of factors and v the number of variables. Refer to the table below:
rotation.type | gamma value  | Orthogonal/Oblique | Notes                  |
------------- | ------------ | ------------------ | ---------------------- |
equamax       | f/2          | orthogonal         | -                      |
orthomax      | Set by user  | orthogonal         | -                      |
parsimax      | v(f-1)/v+f+2 | orthogonal         | -                      |
quartimax     | 0.0          | orthogonal         | -                      |
varimax       | 1.0          | orthogonal         | -                      |
biquartimin   | 0.5          | oblique            | least oblique rotation |
covarimin     | 2.0          | oblique            | -                      |
orthomin      | Set by user  | oblique            | -                      |
quartimin     | 0.0          | oblique            | most oblique rotation  |
Types: character
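An orthogonal rotation can be tried locally with stats::varimax(), which corresponds to the varimax row of the table. The sketch below is a base R illustration, assuming loadings computed as scaled eigenvectors; it is not the rotation code used by the function.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
eig <- eigen(cor(x), symmetric = TRUE)

# Unrotated loadings for two factors.
loadings <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]), nrow = 2)
rownames(loadings) <- colnames(x)

# Varimax rotation from the stats package.
rot <- varimax(loadings)
rotated <- unclass(rot$loadings)

# An orthogonal rotation leaves each variable's communality unchanged.
print(round(cbind(before = rowSums(loadings^2),
                  after  = rowSums(rotated^2)), 4))
```

The rotation redistributes the loadings across factors but does not change how much of each variable the retained factors explain.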
load.threshold
Optional Argument.
Specifies a threshold factor loading value. If this argument is
specified, a factor that is not a prime factor may be associated
with a variable.
Notes:
This argument and the argument "percent.threshold" cannot both be specified.
This argument is used when the argument "vars.report" is set to TRUE; ignored otherwise.
Types: numeric
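One plausible reading of a loading threshold is sketched below in base R: a variable is always tied to its prime factor, and additionally to any factor whose absolute loading reaches the threshold. The exact rule the function applies is not stated in the source, so the comparison and the names used here are assumptions.

```r
# Local illustration with base R only; mtcars is a stand-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
eig <- eigen(cor(x), symmetric = TRUE)
loadings <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]), nrow = 2)
dimnames(loadings) <- list(colnames(x), c("Factor1", "Factor2"))

# Hypothetical threshold value chosen for the sketch.
load_threshold <- 0.4

# TRUE where a factor is the variable's prime factor (largest |loading|).
prime_idx <- apply(abs(loadings), 1, which.max)
is_prime <- col(loadings) == prime_idx

# Assumed rule: prime factor, or any factor with |loading| >= threshold.
associated <- is_prime | abs(loadings) >= load_threshold
print(associated)
```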
percent.threshold
Optional Argument.
Specifies a threshold percent. If this argument is specified, a
factor that is not a prime factor may be associated with a
variable.
Notes:
This argument and the argument "load.threshold" cannot both be specified.
This argument is used when the argument "vars.report" is set to TRUE; ignored otherwise.
Types: numeric
variance.prop.threshold
Optional Argument.
Required when the argument "near.dep.report" is set to
TRUE.
Specifies the variance proportion threshold parameter to
generate Near Dependency Report.
Default Value: 0.5
Types: numeric
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option 'val.install.location' to
# the database name where Vantage analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)
# Example 1: Generate Near Dependency Report.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    cond.ind.threshold=3,
                    near.dep.report=TRUE,
                    variance.prop.threshold=.3)
# Print the results.
print(obj$result)
# Example 2: Run PCA on two group by columns. The result contains one row
# for each group by column combination.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    group.columns=c("gender", "marital_status"))
# Print the results.
print(obj$result)
# Example 3: Run PCA by taking input from a pre-built matrix. Both the Matrix Build
# and PCA Analysis are shown. Note that only a subset of matrix columns
# is used.
mat_obj <- td_matrix_valib(data=df,
                           columns=c("income", "age", "years_with_bank", "nbr_children"),
                           type="esscp")
obj <- td_pca_valib(data=mat_obj$result,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    matrix.input=TRUE)
# Print the results.
print(obj$result)
# Example 4: Run PCA by taking input from a pre-built matrix with group by columns.
# Both the Matrix Build and PCA Analysis are shown. Note that only a
# subset of matrix columns is used.
mat_obj <- td_matrix_valib(data=df,
                           columns=c("income", "age", "years_with_bank", "nbr_children"),
                           group.columns="gender",
                           type="esscp")
obj <- td_pca_valib(data=mat_obj$result,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    matrix.input=TRUE,
                    group.columns="gender")
# Print the results.
print(obj$result)
# Example 5: Run PCA with 'varimax' rotation.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    rotation.type="varimax")
# Print the results.
print(obj$result)
# Example 6: Run PCA with Prime Factor reports requested. The "percent.threshold"
# argument applies to the report requested with the "vars.report" argument.
obj <- td_pca_valib(data=df,
                    columns=c("age", "years_with_bank", "nbr_children"),
                    load.report=TRUE,
                    vars.load.report=TRUE,
                    vars.report=TRUE,
                    percent.threshold=0.9)
# Print the results.
print(obj$result)