Teradata Package for Python Function Reference | 17.10 - Association - Teradata Package for Python

Association

Functions
Association(data, group_column=None, item_column=None, combinations=11,
            description_data=None, description_identifier=None, description_column=None,
            group_count=None, hierarchy_data=None, low_level_column=None,
            high_level_column=None, left_lookup_data=None, left_lookup_column=None,
            right_lookup_data=None, right_lookup_column=None, min_confidence=None,
            min_lift=None, min_support=None, min_zscore=None, order_prob=None,
            process_type=None, reduced_data=None, relaxed_order=None,
            sequence_column=None, filter=None, no_support_results=True,
            support_result_prefix='ml__valib_association', gen_sql_only=False)

DESCRIPTION:
    Association Rules provide various measures concerning items residing in groups.
    The measures, support, confidence, lift and Z Score, help to determine the
    likelihood that one or more items exist in a group, given that another one or more
    items exist in the same group. The classic example of this type of study is market
    basket analysis, in which the groups are shopping carts and the items are the
    products purchased in the shopping carts. An association rule might indicate the
    likelihood that a given shopping cart contains oranges, given that it also
    contains apples.

    Association rules consist of a left part and a right part. The left part consists
    of one or more items that are given to reside in a group, and the right part is
    the consequence that one or more items also reside in the given group.

    The measures are defined as follows:
        * Support: Percentage of groups containing the items on the left (left-side
          support), on the right (right-side support), or on both sides of a rule
          (rule support).
        * Confidence: Percentage of groups containing the left-side items that also
          contain the right-side items.
        * Lift: A measure of how much the probability is raised that the right-side
          items occur in a group given that the left-side items occur in the group.
        * Z Score: A statistical measure of how much the expected and actual numbers
          of groups containing all the items in the rule differ. (Zero means expected
          and actual are the same.)

    A sequence analysis may optionally be requested, in which the "sequence_column"
    argument defines a sequence of items, ordering the items on each side of each
    rule, with left-side items preceding the right-side items. The "relaxed_order"
    option can be set to True so that the items on the left side and on the right
    side can be in any order, provided that all left-side items precede all
    right-side items.

    An output teradataml DataFrame is created for each requested rule combination
    (1-to-1, 2-to-1, and so on).

PARAMETERS:
    data:
        Required Argument.
        Specifies the input data to perform Association analysis.
        Types: teradataml DataFrame

    group_column:
        Required Argument.
        Specifies the name of the column representing groups in the association rules.
        Types: str

    item_column:
        Required Argument.
        Specifies the name of the column representing items in the association rules.
        Types: str

    combinations:
        Optional Argument.
        Specifies the combinations of the number of items on the left side and the
        number of items on the right side of the requested association rules. More
        than one combination can be requested. One output DataFrame is generated for
        each combination specified, so the number of output DataFrames depends on the
        number of combinations. Each output DataFrame is named "result_{combination}".
        For example, combinations = [11, 21] produces an analysis of 1-to-1 and
        2-to-1 rules, resulting in two output DataFrames, 'result_11' and 'result_21'.
        Note:
            The sum of the sizes of the left and right sides of a combination must be
            less than or equal to 5.
        Default Value: 11
        Types: int OR list of Integers (int)

    description_data:
        Optional Argument.
        Specifies the teradataml DataFrame containing description data, which is
        joined with the result.
        Note:
            Arguments "description_data", "description_identifier" and
            "description_column", if used, must be used together.
        Types: teradataml DataFrame

    description_identifier:
        Optional Argument.
        Specifies the name of the column in the "description_data" that is joined
        with the result.
        Note:
            Arguments "description_data", "description_identifier" and
            "description_column", if used, must be used together.
        Types: str

    description_column:
        Optional Argument.
        Specifies the name of the column in the "description_data" that contains
        descriptive item names.
        Note:
            Arguments "description_data", "description_identifier" and
            "description_column", if used, must be used together.
        Types: str

    group_count:
        Optional Argument.
        Specifies the count of the number of groups in the input data. By default it
        is calculated by the function. This is useful when you are processing a
        reduced input set saved in a previous run, so that calculations can be based
        on the number of groups in the original input set and not the reduced set.
        Types: int

    hierarchy_data:
        Optional Argument.
        Specifies the hierarchy data that can be joined with the input data in order
        to reduce the amount of input data and compute association rules at a
        different hierarchical level. Use of this argument has an impact on the data
        saved in the reduced input when the argument "reduced_data" is requested.
        When this option is utilized, listwise deletion is automatically performed,
        ignoring rows that contain a null group, item, or sequence column value.
        Note:
            Arguments "hierarchy_data", "low_level_column" and "high_level_column",
            if used, must be used together.
        Types: teradataml DataFrame

    low_level_column:
        Optional Argument.
        Specifies the lowest level item column in the "hierarchy_data" to be matched
        with the item column in the input data.
        Note:
            Arguments "hierarchy_data", "low_level_column" and "high_level_column",
            if used, must be used together.
        Types: str

    high_level_column:
        Optional Argument.
        Specifies the higher-level item column in the "hierarchy_data".
        Note:
            Arguments "hierarchy_data", "low_level_column" and "high_level_column",
            if used, must be used together.
        Types: str

    left_lookup_data:
        Optional Argument.
        Specifies left-side lookup data, which restricts the reported rules to those
        whose left-side items appear in the lookup dataset.
        Note:
            Arguments "left_lookup_data" and "left_lookup_column", if used, must be
            used together.
        Types: teradataml DataFrame

    left_lookup_column:
        Optional Argument.
        Specifies the name of the column to match with left-side items in rules.
        Note:
            Arguments "left_lookup_data" and "left_lookup_column", if used, must be
            used together.
        Types: str

    right_lookup_data:
        Optional Argument.
        Specifies right-side lookup data, which restricts the reported rules to those
        whose right-side items appear in the lookup dataset.
        Note:
            Arguments "right_lookup_data" and "right_lookup_column", if used, must be
            used together.
        Types: teradataml DataFrame

    right_lookup_column:
        Optional Argument.
        Specifies the name of the column to match with right-side items in rules.
        Note:
            Arguments "right_lookup_data" and "right_lookup_column", if used, must be
            used together.
        Types: str

    min_confidence:
        Optional Argument.
        Specifies the minimum value that the confidence measure of an association
        rule must have before it is included in a result. The range of valid values
        is 0 to 1 inclusive.
        Types: float

    min_lift:
        Optional Argument.
        Specifies the minimum value that the lift measure of an association rule must
        have before it is included in a result. The range of valid values is 0 to any
        positive numeric value.
        Types: float

    min_support:
        Optional Argument.
        Specifies the minimum value that the support measure of an association rule
        must have before it is included in a result. When this argument is utilized,
        the size of the input data is reduced, potentially impacting the use of the
        "reduced_data" argument.
        Use of this argument also causes listwise deletion to be performed, skipping
        any input rows that have a null group, item or sequence column value. The
        range of valid values is 0 to 1 inclusive.
        Types: float

    min_zscore:
        Optional Argument.
        Specifies the minimum value that the Z Score measure of an association rule
        must have before it is included in a result.
        Types: float

    order_prob:
        Optional Argument.
        Specifies the probability of correct ordering for sequential analysis. When
        sequence analysis is being performed, by default the algorithm determines
        the ordering probabilities. The value must be greater than 0 and at most 1.
        (Setting it to 1 effectively ignores this principle in lift and Z Score
        calculations.)
        Types: float, int

    process_type:
        Optional Argument.
        Specifies the type of processing.
        Permitted Values:
            * 'all' - All processing is performed, from building support tables to
              calculating final affinities.
            * 'support' - The single item support result DataFrame is built and then
              processing is halted. This allows the user to view the support result
              and decide what the minimum support value should be, thus reducing the
              amount of processing performed. The single item support output
              DataFrame is named 'support_1_item' and the underlying output table is
              named 'ml__valib_association_1_ITEM_SUPPORT'. If
              "support_result_prefix" is specified, it replaces
              'ml__valib_association' with the provided value in the support table
              name.
            * 'recalculate' - The final affinity tables are calculated based on
              support tables already present. This requires that the
              "no_support_results" parameter was set to False in a previous run so
              that the support tables are available for recalculating the final
              affinities.
        Default Value: 'all'
        Types: str

    reduced_data:
        Optional Argument.
        Specifies the reduced input data. If input to the analysis is reduced by
        using "min_support", "hierarchy_data", or "filter", the resulting reduced
        input data can be saved for further analysis.
        Note:
            1. This is not affected by the use of a left-side or right-side lookup
               argument or the "min_confidence", "min_lift", or "min_zscore"
               arguments.
            2. If further analysis is performed on this data, it may be appropriate
               to use the "group_count" argument.
        Types: teradataml DataFrame

    relaxed_order:
        Optional Argument.
        Use this option in conjunction with sequence analysis, that is, when a
        sequence column is specified. Relaxed ordering occurs when the items on the
        left side of an association rule may occur in any order (via the sequence
        column), and the same is the case with the right-side items, provided that
        all left-side items precede all right-side items.
        Types: bool

    sequence_column:
        Optional Argument.
        Specifies the name of the column providing sequencing of input items if
        sequence analysis is desired. This might typically be a column of type date
        or timestamp. By default, sequence analysis is not performed.
        Types: str

    filter:
        Optional Argument.
        Specifies the clause to filter rows selected for analysis within Association
        Rules.
        For example, filter = "cust_id > 0"
        Note:
            Single quotes within the parameter value must be doubled. For example,
            the expression channel <> ' ' must be passed as
            filter = "channel <> '' ''", so that the expression ends with
            quote-quote-blank-quote-quote.
        Types: str

    no_support_results:
        Optional Argument.
        Specifies whether the intermediate support results are required or not. By
        default, support results are not presented in the output.
        Notes:
            1. If "no_support_results" is False and "support_result_prefix" is not
               used, then the generated underlying support tables are overwritten,
               if they already exist.
            2. When set to False, the support results generated depend on the
               "combinations" parameter.
        Default Value: True
        Types: bool

    support_result_prefix:
        Optional Argument.
        Specifies a string to be used as a prefix for the underlying table names of
        the support tables, which can be accessed using output DataFrames.
        Notes:
            1. Teradata recommends using this when the function is to be executed
               with "process_type" as 'recalculate'. Make sure to use the same value
               for both function calls.
            2. If "no_support_results" is False and this is not used, then the
               generated underlying support tables are overwritten, if they already
               exist.
        Default Value: 'ml__valib_association'
        Types: str

    gen_sql_only:
        Optional Argument.
        Specifies whether to generate only SQL for the function. When set to True,
        the function SQL is generated but not executed, and can be accessed using
        the show_query() method. Otherwise, the SQL is executed but not returned.
        Default Value: False
        Types: bool

RETURNS:
    An instance of Association.
    Output teradataml DataFrames can be accessed using attribute references, such as
    AssociationObj.<attribute_name>.
    Note:
        Output DataFrames generated can be categorized as follows:
            1. Affinity output DataFrames
            2. Support output DataFrames
        The attribute names of these output DataFrames can be found using two
        attributes:
            1. affinity_outputs: Lists the attribute names of the output DataFrames
               containing the affinity results. For example, this can be accessed as
               AssociationObj.affinity_outputs
            2. support_outputs: Lists the attribute names of the output DataFrames
               containing the support results. For example, this can be accessed as
               AssociationObj.support_outputs
        The output DataFrames generated by the function depend upon the following:
            1. When "process_type" is set to 'support':
                a. No affinity outputs are generated.
                b. The function generates only two support output DataFrames, named:
                    i. support_1_item
                    ii. group_count
            2. When "no_support_results" is set to True and "process_type" is set to
               a value other than 'support':
                a. Only affinity outputs are generated, based on the number of
                   combinations requested by the user.
                b. No support outputs are generated.
            3. When "no_support_results" is set to False and "process_type" is set to
               a value other than 'support':
                a. Affinity outputs are generated, based on the number of
                   combinations requested.
                b. Support outputs are also generated depending on the combination(s)
                   requested.

RAISES:
    TeradataMlException, TypeError, ValueError

EXAMPLES:
    # Notes:
    #   1. To execute Vantage Analytic Library functions,
    #       a. import "valib" object from teradataml.
    #       b. set 'configure.val_install_location' to the database name where Vantage
    #          analytic library functions are installed.
    #   2. Datasets used in these examples can be loaded using Vantage Analytic Library
    #      installer.

    # Import valib object from teradataml to execute this function.
    from teradataml import valib

    # Set the 'configure.val_install_location' variable.
    from teradataml import configure
    configure.val_install_location = "SYSLIB"

    # Create required teradataml DataFrame.
    from teradataml import DataFrame
    df = DataFrame("credit_tran")
    print(df)

    # Example 1: Perform Association analysis using default values.
    obj = valib.Association(data=df,
                            group_column="cust_id",
                            item_column="channel")

    # Print the affinity result. Only affinity result for default combination 11 is
    # produced.
    print(obj.result_11)

    # Example 2: Requests a 1-to-1 and a 2-to-1 analysis, while also requesting 0.1
    #            minimum support. Rows with blank channel column are also filtered.
    #            Note: The blank channel value requires double single quotes,
    #                  that is quote-quote-blank-quote-quote.
    obj = valib.Association(data=df,
                            group_column="cust_id",
                            item_column="channel",
                            min_support=0.1,
                            filter="channel <> '' ''",
                            combinations=[11, 21])

    # Executing above call will return two affinity results. Let's check the names of
    # the output DataFrames.
    print(obj.affinity_outputs)

    # Print the results.
    print(obj.result_11)
    print(obj.result_21)

    # Example 3: Request sequence analysis by specifying a "sequence_column" parameter.
    #            Also include optional parameters "min_support" and "filter".
    obj = valib.Association(data=df,
                            group_column="cust_id",
                            item_column="channel",
                            min_support=0.1,
                            filter="channel <> '' ''",
                            sequence_column="tran_date")

    # Executing above call will return one affinity result. Let's check the name of
    # the output DataFrame.
    print(obj.affinity_outputs)

    # Print the results.
    print(obj.result_11)

    # Example 4: Let's generate the support results first and then calculate the final
    #            affinity results.
    # To generate the support results, we set "no_support_results" to False and use
    # 'test_prefix' as a prefix for the table names for the support tables generated.
    obj = valib.Association(data=df,
                            group_column="cust_id",
                            item_column="channel",
                            no_support_results=False,
                            support_result_prefix="test_prefix")

    # Print the results.
    print(obj)

    # Let's look at the attribute names of the support results generated.
    print(obj.support_outputs)

    # Print the individual support results.
    print(obj.support_result_01)
    print(obj.support_result_11)

    # Let's look at the attribute names of the affinity results generated.
    print(obj.affinity_outputs)

    # Print the affinity results.
    print(obj.result_11)

    # Re-run the Association function with "process_type" set to 'recalculate' for
    # recalculation of affinity results.
    # To recalculate we shall use the support results generated by the previous
    # function call.
    obj = valib.Association(data=df,
                            group_column="cust_id",
                            item_column="channel",
                            process_type="recalculate",
                            support_result_prefix="test_prefix")

    # Print the affinity results.
    print(obj.result_11)

    # Example 5: Generate only SQL for the function, but do not execute the same.
    obj = valib.Association(data=df,
                            group_column="cust_id",
                            item_column="channel",
                            no_support_results=False,
                            support_result_prefix="test_prefix",
                            gen_sql_only=True)

    # Print the generated SQL.
    print(obj.show_query("sql"))

    # Print both generated SQL and stored procedure call.
    print(obj.show_query("both"))

    # Print the stored procedure call.
    print(obj.show_query())
    print(obj.show_query("sp"))
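
The support, confidence and lift measures defined in the DESCRIPTION can be illustrated without a Vantage connection. The following is a minimal pure-Python sketch for 1-to-1 rules, not the Vantage Analytic Library implementation. The helper name one_to_one_rules and the toy shopping-cart rows are hypothetical, the measures are expressed as fractions between 0 and 1, and lift is assumed to follow the conventional definition (confidence divided by right-side support). The Z Score is omitted because its exact formula is not given in this reference.

```python
from itertools import permutations


def one_to_one_rules(rows):
    """Compute support, confidence and lift for every 1-to-1 rule.

    rows: iterable of (group, item) pairs, playing the role of the
    "group_column" and "item_column" values. Hypothetical helper for
    illustration only; it does not call teradataml.
    """
    # Collect the distinct items seen in each group.
    groups = {}
    for group, item in rows:
        groups.setdefault(group, set()).add(item)
    n_groups = len(groups)

    # Single-item support: fraction of groups that contain the item.
    items = sorted({item for members in groups.values() for item in members})
    support = {
        item: sum(item in members for members in groups.values()) / n_groups
        for item in items
    }

    rules = {}
    for left, right in permutations(items, 2):
        # Rule support: fraction of groups containing both sides.
        both = sum(left in m and right in m for m in groups.values()) / n_groups
        if both == 0:
            continue  # the two items never occur together, so no rule
        confidence = both / support[left]
        # Assumption: conventional lift, confidence over right-side support.
        lift = confidence / support[right]
        rules[(left, right)] = {
            "support": both,
            "confidence": confidence,
            "lift": lift,
        }
    return rules


# Toy data: four shopping carts (groups) and the products (items) in them.
rows = [
    (1, "apples"), (1, "oranges"),
    (2, "apples"), (2, "oranges"),
    (3, "apples"),
    (4, "bread"),
]
rules = one_to_one_rules(rows)
rule = rules[("apples", "oranges")]
print(f"support={rule['support']:.2f} "
      f"confidence={rule['confidence']:.2f} "
      f"lift={rule['lift']:.2f}")
```

On this toy data, two of the four carts contain both apples and oranges, so the rule apples -> oranges has support 0.50. Three carts contain apples, giving a confidence of about 0.67, and half of all carts contain oranges, giving a lift of about 1.33. A lift above 1 indicates that oranges are more likely in carts that contain apples than in carts overall.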