| |
- PCA(data, columns, exclude_columns=None, cond_ind_threshold=30, min_eigen=1.0, load_report=False, vars_load_report=False, vars_report=False, gamma=None, group_columns=None, matrix_input=False, matrix_type='correlation', near_dep_report=False, rotation_type=None, load_threshold=None, percent_threshold=None, variance_prop_threshold=0.5)
- DESCRIPTION:
Factor Analysis is one of the most fundamental types of statistical analysis, and
Principal Components Analysis (PCA) is arguably the most common variety of Factor
Analysis. In PCA Analysis, a set of variables (denoted by columns) is reduced to a
smaller number of factors that account for most of the variance in the variables.
This can be useful in reducing the number of variables by converting them to factors,
or in gaining insight into the nature of the variables when they are used for further
data analysis.
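The reduction described above can be illustrated with a small NumPy sketch. It is
not the VALIB implementation: it assumes a correlation matrix is analyzed, that
factors are retained when their eigenvalue is at least "min_eigen" (whether the
library compares with >= or > is an assumption), and it uses a hypothetical helper
name, pca_sketch, with synthetic data.

```python
import numpy as np

def pca_sketch(x, min_eigen=1.0):
    """Illustrative PCA on a correlation matrix (hypothetical helper, not VALIB code)."""
    corr = np.corrcoef(x, rowvar=False)          # variables are columns
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= min_eigen                  # assumed retention rule
    # Loadings: eigenvectors scaled by the square root of their eigenvalues.
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, loadings

# Synthetic data: two strongly related columns and one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
x = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
eigvals, loadings = pca_sketch(x)
print(eigvals)
print(loadings)
```

The eigenvalues of a correlation matrix sum to the number of variables, so an
eigenvalue of 1.0 corresponds to a factor that explains as much variance as one
original variable.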
Some of the key features of PCA Analysis are outlined below.
1. One or more group by columns can optionally be specified so that an input matrix
is built for each combination of group by column values, and subsequently a
separate PCA Analysis model is built for each matrix.
2. A Near Dependency Report is available to identify two or more columns that may be
collinear. This report can be requested by setting the argument "near_dep_report"
to 'True' and, if desired, adjusting the arguments "cond_ind_threshold" and
"variance_prop_threshold".
3. Both orthogonal and oblique factor rotations are available. Refer to the
"rotation_type" parameter.
4. There are three Prime Factor reports available. Refer to the "load_report",
"vars_report", and "vars_load_report" arguments.
PARAMETERS:
data:
Required Argument.
Specifies the input data containing the columns to perform PCA analysis.
Types: teradataml DataFrame
columns:
Required Argument.
Specifies the name(s) of the column(s) representing the variables used in
building a PCA analysis model. Alternatively, it accepts one of the pre-defined
strings below to select all columns or all numeric columns.
Permitted Values:
* Name(s) of the column(s) in "data".
* Pre-defined strings:
* 'all' - all columns
* 'allnumeric' - all numeric columns
Types: str OR list of Strings (str)
exclude_columns:
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the PCA analysis.
If 'all' or 'allnumeric' is used in the "columns" argument, this argument
can be used to exclude specific columns from the analysis.
Types: str OR list of Strings (str)
cond_ind_threshold:
Optional Argument. Required when the argument "near_dep_report" is set to 'True'.
Specifies the condition index threshold parameter to generate Near Dependency Report.
Default Value: 30
Types: float
min_eigen:
Optional Argument.
Specifies the minimum eigenvalue a factor must have to be included.
Default Value: 1.0
Types: float
load_report:
Optional Argument.
Specifies whether to generate the Prime Factor Loadings Report, in which rows are
variables and columns are factors, matching each variable with the factor on which
it has the largest absolute loading value.
When set to 'True', Prime Factor Loadings Report is generated and added in the
XML result string.
Default Value: False
Types: bool
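The matching rule described for "load_report" can be sketched as follows. The
loading values and the helper name prime_factors are hypothetical and only
illustrate picking, for each variable, the factor with the largest absolute
loading; the actual report is produced by the library in its XML output.

```python
import numpy as np

def prime_factors(variables, loadings):
    """Map each variable to the factor with the largest absolute loading."""
    prime = np.argmax(np.abs(loadings), axis=1)   # one factor index per variable
    return {name: f"Factor {idx + 1}" for name, idx in zip(variables, prime)}

variables = ["age", "years_with_bank", "nbr_children"]
# Hypothetical loading matrix: rows are variables, columns are factors.
loadings = np.array([[0.81, -0.12],
                     [0.77, 0.30],
                     [-0.05, 0.92]])
prime = prime_factors(variables, loadings)
print(prime)
```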
vars_load_report:
Optional Argument.
Specifies whether to generate the Prime Factor Variables with Loadings Report,
equivalent to the Prime Factor Variables Report with the addition of the loading values
that determine the relationship between factors and variables. The absolute
size of a loading value indicates the strength of the relationship, and its sign
indicates the direction, i.e., a positive or negative correlation.
When set to 'True', Prime Factor Variables with Loadings Report is generated
and added in the XML result string.
Default Value: False
Types: bool
vars_report:
Optional Argument.
Specifies whether to generate the Prime Factor Variables Report, in which rows are
variables and columns are factors, matching variables with their prime factors
and, if a threshold is used, possibly with factors other than their prime factors. (Either a
threshold percent is specified with the "percent_threshold" argument or a
threshold loading is specified with the "load_threshold" argument.)
When set to 'True', Prime Factor Variables Report is generated and added in
the XML result string.
Default Value: False
Types: bool
gamma:
Optional Argument. Required when the argument "rotation_type" is set to
'orthomax' or 'orthomin'.
Specifies the gamma value to be set when 'orthomax' or 'orthomin' is used in
"rotation_type" argument.
Note:
This argument is ignored for values of "rotation_type" other than
'orthomax' and 'orthomin'.
Types: float
group_columns:
Optional Argument.
Specifies the name(s) of the input column(s) dividing the input DataFrame
"data" into partitions, one for each combination of values in the group by
columns. For each partition or combination of values, a separate factor model
is built. The default case is no group by columns.
Types: str OR list of Strings (str)
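The partitioning described for "group_columns" can be pictured with the sketch
below. It runs locally on a handful of made-up rows rather than in the database,
and the helper name correlation_by_group is hypothetical. It builds one
correlation matrix per combination of group values, which is the kind of input a
separate model would then be fitted on.

```python
from collections import defaultdict
import numpy as np

def correlation_by_group(rows, group_idx, value_idx):
    """Build one correlation matrix per combination of group-by values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in group_idx)               # group-by combination
        groups[key].append([float(row[i]) for i in value_idx])
    return {key: np.corrcoef(np.array(vals), rowvar=False)
            for key, vals in groups.items()}

# Made-up rows: (gender, marital_status, age, years_with_bank).
rows = [
    ("F", 1, 30, 2), ("F", 1, 45, 9), ("F", 1, 52, 7),
    ("M", 2, 28, 1), ("M", 2, 39, 6), ("M", 2, 61, 12),
]
matrices = correlation_by_group(rows, group_idx=[0, 1], value_idx=[2, 3])
for key, matrix in matrices.items():
    print(key)
    print(matrix)
```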
matrix_input:
Optional Argument.
Specifies whether the input DataFrame is an extended
sum-of-squares-and-cross-products (ESSCP) matrix built by the Matrix() VALIB
function. Using a pre-built matrix avoids rebuilding the matrix internally each time
this function is run, which can improve performance significantly.
When set to 'True', the columns specified with the "columns" argument may be
a subset of the columns in the matrix and may be specified in any order. The
columns must, however, all be present in the matrix. Further, if group by
columns are specified in the Matrix() call, these same group by columns must
be specified in this function.
Note:
If the input DataFrame "data" represents a saved matrix, set this argument
to 'True' to get predictable results.
Default Value: False
Types: bool
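To show why a pre-built matrix is sufficient input, the sketch below builds an
extended sum-of-squares-and-cross-products matrix with NumPy and recovers the
correlation matrix of a subset of its columns. The layout used here (a constant
column of ones placed first) is an assumption for illustration and may differ
from the table that Matrix() actually produces; build_esscp and
correlation_from_esscp are hypothetical names.

```python
import numpy as np

def build_esscp(x):
    """ESSCP of x: cross-products of the data augmented with a column of ones."""
    n = x.shape[0]
    z = np.hstack([np.ones((n, 1)), x])     # assumed layout: constant column first
    return z.T @ z

def correlation_from_esscp(esscp, cols):
    """Recover the correlation matrix for a subset of columns, in any order."""
    n = esscp[0, 0]                         # row count sits in the corner
    idx = np.asarray(cols) + 1              # shift past the constant row/column
    sums = esscp[0, idx]                    # column sums
    sscp = esscp[np.ix_(idx, idx)]          # raw sums of squares and cross-products
    cov = (sscp - np.outer(sums, sums) / n) / (n - 1)
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 4))                # synthetic data with four columns
esscp = build_esscp(x)
corr_subset = correlation_from_esscp(esscp, cols=[1, 2, 3])
print(corr_subset)
```

Because counts, sums, and cross-products are all stored in the matrix, any subset
of its columns can be analyzed without another pass over the raw data.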
matrix_type:
Optional Argument.
Specifies the type of matrix to process, which affects the scaling of measures and scores.
Permitted Values: 'correlation', 'covariance'
Default Value: 'correlation'
Types: str
near_dep_report:
Optional Argument.
Specifies whether to include in the output an XML report showing columns that are
collinear. The report is included in the XML output only
if collinearity is detected.
Two threshold arguments are available for this report, "cond_ind_threshold"
and "variance_prop_threshold".
Default Value: False
Types: bool
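How the two thresholds might interact can be sketched with the condition index
and variance proportion diagnostics of Belsley, Kuh and Welsch. Whether the
library follows exactly this formulation, and whether it compares with >= or >,
is an assumption. The helper name near_dependencies and the synthetic data are
hypothetical.

```python
import numpy as np

def near_dependencies(x, cond_ind_threshold=30.0, variance_prop_threshold=0.5):
    """Flag components with a high condition index shared by two or more columns."""
    scaled = x / np.linalg.norm(x, axis=0)            # scale columns to unit length
    _, d, vt = np.linalg.svd(scaled, full_matrices=False)
    cond_idx = d.max() / d                            # one condition index per component
    phi = (vt.T ** 2) / d ** 2                        # phi[j, k]: variable j, component k
    props = phi / phi.sum(axis=1, keepdims=True)      # variance proportions, rows sum to 1
    report = []
    for k in range(len(d)):
        if cond_idx[k] >= cond_ind_threshold:
            involved = np.flatnonzero(props[:, k] >= variance_prop_threshold)
            if len(involved) >= 2:                    # a dependency needs two or more columns
                report.append((float(cond_idx[k]), involved.tolist()))
    return report

# Synthetic data: column 1 is almost a copy of column 0; column 2 is unrelated.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
x = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
report = near_dependencies(x)
print(report)
```

Lowering either threshold makes the check more sensitive, so more components and
more columns are reported.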
rotation_type:
Optional Argument.
Specifies the rotation type among various schemes for rotating factors for
possibly better results. Both orthogonal and oblique rotations are provided.
Gamma value in the rotation equation assumes a different value for each
rotation type, with f representing the number of factors and v the number of
variables. Refer to the table below:
Table: Gamma values of different rotation types
--------------------------------------------------------------------------------------
| rotation_type | gamma value | Orthogonal/Oblique | Notes |
--------------------------------------------------------------------------------------
| equamax | f/2 | orthogonal | |
| orthomax | Set by user | orthogonal | |
| parsimax | v(f-1)/v+f+2 | orthogonal | |
| quartimax | 0.0 | orthogonal | |
| varimax | 1.0 | orthogonal | |
| biquartimin | 0.5 | oblique | least oblique rotation |
| covarimin | 2.0 | oblique | |
| orthomin | Set by user | oblique | |
| quartimin | 0.0 | oblique | most oblique rotation |
--------------------------------------------------------------------------------------
Types: str
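The role of gamma in the orthogonal rotations can be illustrated with a common
SVD-based orthomax iteration, sketched below with NumPy. This is not the
library's implementation and covers only the orthogonal family (for example,
gamma 1.0 for varimax and 0.0 for quartimax). The helper name orthomax_rotate and
the loading values are hypothetical.

```python
import numpy as np

def orthomax_rotate(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate a loading matrix with an orthogonal orthomax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.sum(rotated ** 2, axis=0)             # column sums of squares
        target = rotated ** 3 - (gamma / p) * rotated @ np.diag(col_ss)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                                 # product of orthogonal matrices
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):         # stop when no longer improving
            break
        criterion = new_criterion
    return loadings @ rotation, rotation

# Hypothetical unrotated loadings: rows are variables, columns are factors.
loadings = np.array([[0.70, 0.50],
                     [0.65, 0.55],
                     [0.60, -0.45],
                     [0.55, -0.60]])
rotated, rotation = orthomax_rotate(loadings, gamma=1.0)   # varimax
print(rotated)
```

An orthogonal rotation changes how variance is distributed across factors but
leaves each variable's communality (its row sum of squared loadings) unchanged.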
load_threshold:
Optional Argument.
Specifies a threshold factor loading value. If this argument is specified,
a factor that is not a prime factor may be associated with a variable.
Notes:
1. This argument and the argument "percent_threshold" cannot both be specified.
2. This argument is used when the argument "vars_report" is set to 'True';
ignored otherwise.
Types: float
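One plausible reading of the loading threshold is sketched below: each variable
keeps its prime factor, and any other factor whose absolute loading reaches the
threshold is associated with it as well. The exact rule applied by the library is
not stated here, so the comparison, the helper name associated_factors, and the
loading values are assumptions for illustration.

```python
import numpy as np

def associated_factors(loadings, load_threshold=None):
    """List, per variable, the prime factor plus factors meeting the threshold."""
    abs_load = np.abs(loadings)
    prime = np.argmax(abs_load, axis=1)                   # prime factor per variable
    result = []
    for row, p in zip(abs_load, prime):
        factors = {int(p)}
        if load_threshold is not None:
            # Assumed rule: include factors whose absolute loading reaches the threshold.
            factors.update(int(i) for i in np.flatnonzero(row >= load_threshold))
        result.append(sorted(factors))
    return result

# Hypothetical loading matrix: rows are variables, columns are factors.
loadings = np.array([[0.81, -0.12],
                     [0.62, 0.58],
                     [-0.05, 0.92]])
without_threshold = associated_factors(loadings)
with_threshold = associated_factors(loadings, load_threshold=0.5)
print(without_threshold)
print(with_threshold)
```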
percent_threshold:
Optional Argument.
Specifies a threshold percent. If this argument is specified, a factor that is
not a prime factor may be associated with a variable.
Notes:
1. This argument and the argument "load_threshold" cannot both be specified.
2. This argument is used when the argument "vars_report" is set to 'True';
ignored otherwise.
Types: float
variance_prop_threshold:
Optional Argument. Required when the argument "near_dep_report" is set to 'True'.
Specifies the variance proportion threshold parameter to generate Near
Dependency Report.
Default Value: 0.5
Types: float
RETURNS:
An instance of PCA.
Output teradataml DataFrames can be accessed using attribute references, such as
PCAObj.<attribute_name>.
Output teradataml DataFrame attribute name is: result
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import the valib object and DataFrame from teradataml to execute this function.
from teradataml import DataFrame, valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create the required teradataml DataFrame.
df = DataFrame("customer")
print(df)
# Example 1: Generate Near Dependency Report.
obj = valib.PCA(data=df,
columns=["age", "years_with_bank", "nbr_children"],
cond_ind_threshold=3,
near_dep_report=True,
variance_prop_threshold=.3)
# Print the results.
print(obj.result)
# Example 2: Run PCA on two group by columns. The result DataFrame contains one row
# for each group by column combination.
obj = valib.PCA(data=df,
columns=["age", "years_with_bank", "nbr_children"],
group_columns=["gender", "marital_status"])
# Print the results.
print(obj.result)
# Example 3: Run PCA by taking input from a pre-built matrix. Both the Matrix Build
# and PCA Analysis are shown. Note that only a subset of matrix columns
# is used.
mat_obj = valib.Matrix(data=df,
columns=["income", "age", "years_with_bank", "nbr_children"],
type="esscp")
obj = valib.PCA(data=mat_obj.result,
columns=["age", "years_with_bank", "nbr_children"],
matrix_input=True)
# Print the results.
print(obj.result)
# Example 4: Run PCA by taking input from a pre-built matrix with group by columns.
# Both the Matrix Build and PCA Analysis are shown. Note that only a
# subset of matrix columns is used.
mat_obj = valib.Matrix(data=df,
columns=["income", "age", "years_with_bank", "nbr_children"],
group_columns="gender",
type="esscp")
obj = valib.PCA(data=mat_obj.result,
columns=["age", "years_with_bank", "nbr_children"],
matrix_input=True,
group_columns="gender")
# Print the results.
print(obj.result)
# Example 5: Run PCA with 'varimax' rotation.
obj = valib.PCA(data=df,
columns=["age", "years_with_bank", "nbr_children"],
rotation_type="varimax")
# Print the results.
print(obj.result)
# Example 6: Run PCA with Prime Factor reports requested. The "percent_threshold"
# argument applies to "vars_report" argument.
obj = valib.PCA(data=df,
columns=["age", "years_with_bank", "nbr_children"],
load_report=True,
vars_load_report=True,
vars_report=True,
percent_threshold=0.9)
# Print the results.
print(obj.result)
|