Teradata Package for Python Function Reference - PCA

PCA

Functions
PCA(data, columns, exclude_columns=None, cond_ind_threshold=30, min_eigen=1.0,
    load_report=False, vars_load_report=False, vars_report=False, gamma=None,
    group_columns=None, matrix_input=False, matrix_type='correlation',
    near_dep_report=False, rotation_type=None, load_threshold=None,
    percent_threshold=None, variance_prop_threshold=0.5)

    DESCRIPTION:
        Factor Analysis is one of the most fundamental types of statistical analysis,
        and Principal Components Analysis (PCA) is arguably the most common variety of
        Factor Analysis. In PCA, a set of variables (denoted by columns) is reduced to
        a smaller number of factors that account for most of the variance in the
        variables. This can be useful in reducing the number of variables by converting
        them to factors, or in gaining insight into the nature of the variables when
        they are used for further data analysis.

        Some of the key features of PCA are outlined below.
            1. One or more group by columns can optionally be specified so that an input
               matrix is built for each combination of group by column values, and
               subsequently a separate PCA model is built for each matrix.
            2. A Near Dependency Report is available to identify two or more columns that
               may be collinear. This report can be requested by setting the argument
               "near_dep_report" to 'True' and, if desired, setting the arguments
               "cond_ind_threshold" and "variance_prop_threshold".
            3. Both orthogonal and oblique factor rotations are available. Refer to the
               "rotation_type" argument.
            4. There are three Prime Factor reports available. Refer to the "load_report",
               "vars_report", and "vars_load_report" arguments.

    PARAMETERS:
        data:
            Required Argument.
            Specifies the input data containing the columns on which to perform PCA.
            Types: teradataml DataFrame

        columns:
            Required Argument.
            Specifies the name(s) of the column(s) representing the variables used in
            building a PCA model.
            It also accepts the pre-defined strings below to select all columns or all
            numeric columns.
            Permitted Values:
                * Name(s) of the column(s) in "data".
                * Pre-defined strings:
                    * 'all' - all columns
                    * 'allnumeric' - all numeric columns
            Types: str OR list of Strings (str)

        exclude_columns:
            Optional Argument.
            Specifies the name(s) of the column(s) to exclude from the PCA analysis.
            If 'all' or 'allnumeric' is used in the "columns" argument, this argument
            can be used to exclude specific columns from the analysis.
            Types: str OR list of Strings (str)

        cond_ind_threshold:
            Optional Argument.
            Required when the argument "near_dep_report" is set to 'True'.
            Specifies the condition index threshold used to generate the Near
            Dependency Report.
            Default Value: 30
            Types: float

        min_eigen:
            Optional Argument.
            Specifies the minimum eigenvalue a factor must have to be included.
            Default Value: 1.0
            Types: float

        load_report:
            Optional Argument.
            Specifies whether to generate the Prime Factor Loadings Report, in which
            rows are variables and columns are factors, matching each variable with the
            factor on which it has the largest absolute loading value.
            When set to 'True', the Prime Factor Loadings Report is generated and added
            to the XML result string.
            Default Value: False
            Types: bool

        vars_load_report:
            Optional Argument.
            Specifies whether to generate the Prime Factor Variables with Loadings
            Report, which is equivalent to the Prime Factor Variables Report with the
            addition of the loading values that determined the relationship between
            factors and variables. The absolute size of a loading value indicates the
            strength of the relationship, and its sign indicates the direction, i.e.,
            either a positive or a negative correlation.
            When set to 'True', the Prime Factor Variables with Loadings Report is
            generated and added to the XML result string.
            Default Value: False
            Types: bool

        vars_report:
            Optional Argument.
            Specifies whether to generate the Prime Factor Variables Report, in which
            rows are variables and columns are factors, matching variables with their
            prime factors and, if a threshold is used, possibly with factors other than
            their prime factors. (Either a threshold percent is specified with the
            "percent_threshold" argument or a threshold loading is specified with the
            "load_threshold" argument.)
            When set to 'True', the Prime Factor Variables Report is generated and added
            to the XML result string.
            Default Value: False
            Types: bool

        gamma:
            Optional Argument.
            Required when the argument "rotation_type" is set to 'orthomax' or
            'orthomin'.
            Specifies the gamma value to use when 'orthomax' or 'orthomin' is used in
            the "rotation_type" argument.
            Note:
                This argument is ignored for values of "rotation_type" other than
                'orthomax' and 'orthomin'.
            Types: float

        group_columns:
            Optional Argument.
            Specifies the name(s) of the input column(s) dividing the input DataFrame
            "data" into partitions, one for each combination of values in the group by
            columns. For each partition or combination of values, a separate factor
            model is built. By default, no group by columns are used.
            Types: str OR list of Strings (str)

        matrix_input:
            Optional Argument.
            Specifies whether the input DataFrame is an extended
            sum-of-squares-and-cross-products (ESSCP) matrix built by the Matrix() VALIB
            function. Using a pre-built matrix avoids building a matrix internally each
            time this function is run, providing a significant performance improvement.
            When set to 'True', the columns specified with the "columns" argument may be
            a subset of the columns in the matrix and may be specified in any order. The
            columns must, however, all be present in the matrix. Further, if group by
            columns are specified in the Matrix() call, these same group by columns must
            be specified in this function.
            Note:
                If the input DataFrame "data" represents a saved matrix, set this
                argument to 'True' to get predictable results.
            Default Value: False
            Types: bool

        matrix_type:
            Optional Argument.
            Specifies the type of matrix to process, which affects measure and score
            scaling.
            Permitted Values: 'correlation', 'covariance'
            Default Value: 'correlation'
            Types: str

        near_dep_report:
            Optional Argument.
            Specifies whether to produce an XML report showing collinear columns as part
            of the output. The report is included in the XML output only if collinearity
            is detected. Two threshold arguments are available for this report,
            "cond_ind_threshold" and "variance_prop_threshold".
            Default Value: False
            Types: bool

        rotation_type:
            Optional Argument.
            Specifies the scheme used to rotate the factors, which may give better
            results. Both orthogonal and oblique rotations are provided. The gamma value
            in the rotation equation takes a different value for each rotation type,
            with f representing the number of factors and v the number of variables.
            Refer to the table below:
            Table: Gamma values of different rotation types
            ------------------------------------------------------------------------------
            | rotation_type | gamma value  | Orthogonal/Oblique | Notes                  |
            ------------------------------------------------------------------------------
            | equamax       | f/2          | orthogonal         |                        |
            | orthomax      | Set by user  | orthogonal         |                        |
            | parsimax      | v(f-1)/v+f+2 | orthogonal         |                        |
            | quartimax     | 0.0          | orthogonal         |                        |
            | varimax       | 1.0          | orthogonal         |                        |
            | biquartimin   | 0.5          | oblique            | least oblique rotation |
            | covarimin     | 1.0          | oblique            |                        |
            | orthomin      | Set by user  | oblique            |                        |
            | quartimin     | 0.0          | oblique            | most oblique rotation  |
            ------------------------------------------------------------------------------
            Types: str

        load_threshold:
            Optional Argument.
            Specifies a threshold factor loading value. If this argument is specified,
            a factor that is not a prime factor may be associated with a variable.
            Notes:
                1. This argument and the argument "percent_threshold" cannot both be
                   specified.
                2. This argument is used when the argument "vars_report" is set to
                   'True'; ignored otherwise.
            Types: float

        percent_threshold:
            Optional Argument.
            Specifies a threshold percent. If this argument is specified, a factor that
            is not a prime factor may be associated with a variable.
            Notes:
                1. This argument and the argument "load_threshold" cannot both be
                   specified.
                2. This argument is used when the argument "vars_report" is set to
                   'True'; ignored otherwise.
            Types: float

        variance_prop_threshold:
            Optional Argument.
            Required when the argument "near_dep_report" is set to 'True'.
            Specifies the variance proportion threshold used to generate the Near
            Dependency Report.
            Default Value: 0.5
            Types: float

    RETURNS:
        An instance of PCA.
        Output teradataml DataFrames can be accessed using attribute references, such as
        PCAObj.<attribute_name>.
        Output teradataml DataFrame attribute name is:
            result

    RAISES:
        TeradataMlException, TypeError, ValueError

    EXAMPLES:
        # Notes:
        #   1. To execute Vantage Analytic Library functions,
        #       a. import "valib" object from teradataml.
        #       b. set 'configure.val_install_location' to the database name where
        #          Vantage analytic library functions are installed.
        #   2. Datasets used in these examples can be loaded using Vantage Analytic
        #      Library installer.

        # Import valib object from teradataml to execute this function.
        from teradataml import valib

        # Set the 'configure.val_install_location' variable.
        from teradataml import configure
        configure.val_install_location = "SYSLIB"

        # Import DataFrame and create the required teradataml DataFrame.
        from teradataml import DataFrame
        df = DataFrame("customer")
        print(df)

        # Example 1: Generate Near Dependency Report.
        obj = valib.PCA(data=df,
                        columns=["age", "years_with_bank", "nbr_children"],
                        cond_ind_threshold=3,
                        near_dep_report=True,
                        variance_prop_threshold=.3)

        # Print the results.
        print(obj.result)

        # Example 2: Run PCA on two group by columns. The result DataFrame contains one
        #            row for each group by column combination.
        obj = valib.PCA(data=df,
                        columns=["age", "years_with_bank", "nbr_children"],
                        group_columns=["gender", "marital_status"])

        # Print the results.
        print(obj.result)

        # Example 3: Run PCA by taking input from a pre-built matrix. Both the Matrix
        #            Build and PCA Analysis are shown. Note that only a subset of matrix
        #            columns is used.
        mat_obj = valib.Matrix(data=df,
                               columns=["income", "age", "years_with_bank", "nbr_children"],
                               type="esscp")

        obj = valib.PCA(data=mat_obj.result,
                        columns=["age", "years_with_bank", "nbr_children"],
                        matrix_input=True)

        # Print the results.
        print(obj.result)

        # Example 4: Run PCA by taking input from a pre-built matrix with group by
        #            columns. Both the Matrix Build and PCA Analysis are shown. Note that
        #            only a subset of matrix columns is used.
        mat_obj = valib.Matrix(data=df,
                               columns=["income", "age", "years_with_bank", "nbr_children"],
                               group_columns="gender",
                               type="esscp")

        obj = valib.PCA(data=mat_obj.result,
                        columns=["age", "years_with_bank", "nbr_children"],
                        matrix_input=True,
                        group_columns="gender")

        # Print the results.
        print(obj.result)

        # Example 5: Run PCA with 'varimax' rotation.
        obj = valib.PCA(data=df,
                        columns=["age", "years_with_bank", "nbr_children"],
                        rotation_type="varimax")

        # Print the results.
        print(obj.result)

        # Example 6: Run PCA with Prime Factor reports requested. The
        #            "percent_threshold" argument applies to "vars_report" argument.
        obj = valib.PCA(data=df,
                        columns=["age", "years_with_bank", "nbr_children"],
                        load_report=True,
                        vars_load_report=True,
                        vars_report=True,
                        percent_threshold=0.9)

        # Print the results.
        print(obj.result)
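
The argument rules stated under PARAMETERS can be checked on the client side before calling valib.PCA. The following is only a minimal sketch and is not part of teradataml: the helper name build_pca_kwargs is hypothetical, and it encodes just three rules from this reference ("gamma" is required for 'orthomax' and 'orthomin' and ignored otherwise, "load_threshold" and "percent_threshold" cannot both be specified, and both thresholds are used only when "vars_report" is 'True').

```python
# Hypothetical helper (not part of teradataml): assembles keyword arguments for
# valib.PCA and checks the argument rules stated in this reference.

ROTATIONS_NEEDING_GAMMA = {"orthomax", "orthomin"}


def build_pca_kwargs(columns, rotation_type=None, gamma=None,
                     vars_report=False, load_threshold=None,
                     percent_threshold=None):
    """Return a dict of keyword arguments for valib.PCA(data=..., **kwargs)."""
    # "load_threshold" and "percent_threshold" cannot both be specified.
    if load_threshold is not None and percent_threshold is not None:
        raise ValueError(
            '"load_threshold" and "percent_threshold" cannot both be specified.')

    # "gamma" is required for the two user-set rotation types.
    if rotation_type in ROTATIONS_NEEDING_GAMMA and gamma is None:
        raise ValueError(
            '"gamma" is required when "rotation_type" is %r.' % rotation_type)

    kwargs = {"columns": columns}

    if rotation_type is not None:
        kwargs["rotation_type"] = rotation_type
        # "gamma" is ignored for other rotation types, so pass it only when used.
        if rotation_type in ROTATIONS_NEEDING_GAMMA:
            kwargs["gamma"] = gamma

    if vars_report:
        kwargs["vars_report"] = True
        # The thresholds are used only when "vars_report" is True; this sketch
        # drops them otherwise.
        if load_threshold is not None:
            kwargs["load_threshold"] = load_threshold
        if percent_threshold is not None:
            kwargs["percent_threshold"] = percent_threshold

    return kwargs


kwargs = build_pca_kwargs(columns=["age", "years_with_bank", "nbr_children"],
                          rotation_type="orthomax",
                          gamma=1.0,
                          vars_report=True,
                          percent_threshold=0.9)
print(kwargs)
```

With a connection and the "customer" DataFrame from the examples above, the resulting dictionary could then be passed on as valib.PCA(data=df, **kwargs).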