- Association(data, group_column=None, item_column=None, combinations=11, description_data=None, description_identifier=None, description_column=None, group_count=None, hierarchy_data=None, low_level_column=None, high_level_column=None, left_lookup_data=None, left_lookup_column=None, right_lookup_data=None, right_lookup_column=None, min_confidence=None, min_lift=None, min_support=None, min_zscore=None, order_prob=None, process_type=None, reduced_data=None, relaxed_order=None, sequence_column=None, filter=None, no_support_results=True, support_result_prefix='ml__valib_association', gen_sql_only=False, charset=None)
- DESCRIPTION:
Association Rules provide various measures concerning items residing in groups. The
measures, support, confidence, lift and Z Score, help to determine the likelihood that
one or more items exist in a group, given that another one or more items exist in the
same group. The classic example of this type of study is market basket analysis, in
which the groups are shopping carts and the items are the products purchased in the
shopping carts. An association rule might indicate the likelihood that a given shopping
cart contains oranges, given that it also contains apples.
Association rules consist of a left part and a right part. The left part consists of
one or more items that are given to reside in a group, and the right part is the
consequence that one or more items also reside in the given group. The measures are
defined as follows:
* Support - Percentage of groups containing the items on the left (left-side support),
on the right (right-side support), or on both sides of a rule (rule support).
* Confidence - Percentage of groups containing the left-side items that also contain
the right-side items.
* Lift - A measure of how much the probability that the right-side items occur in a
group is raised, given that the left-side items occur in the group.
* Z Score - A statistical measure of how much the expected and actual numbers of
groups containing all the items in the rule differ. (Zero means expected and actual
are the same.)
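As a rough illustration of these definitions, the sketch below computes the four measures for one rule over a few in-memory groups. It is plain Python and does not call Vantage. The helper name rule_measures and the sample carts are hypothetical, and the Z Score uses a normal approximation to the binomial, which is an assumption and may differ from the formula the Vantage Analytic Library applies.

```python
from math import sqrt


def rule_measures(groups, left, right):
    """Compute support, confidence, lift and a Z score for the rule left -> right.

    groups: iterable of item collections, one per group (e.g. one per cart).
    left, right: collections of items on each side of the rule.
    """
    groups = [set(g) for g in groups]
    left, right = set(left), set(right)
    n = len(groups)

    # Count the groups that contain each side, and both sides together.
    n_left = sum(1 for g in groups if left <= g)
    n_right = sum(1 for g in groups if right <= g)
    n_rule = sum(1 for g in groups if (left | right) <= g)

    left_support = n_left / n
    right_support = n_right / n
    rule_support = n_rule / n

    # Confidence: share of left-side groups that also contain the right side.
    confidence = n_rule / n_left if n_left else 0.0
    # Lift: confidence relative to the baseline right-side support.
    lift = confidence / right_support if right_support else 0.0

    # Assumption: expected count under independence with binomial variance.
    p = left_support * right_support
    expected = n * p
    variance = n * p * (1 - p)
    zscore = (n_rule - expected) / sqrt(variance) if variance > 0 else 0.0

    return {
        "left_support": left_support,
        "right_support": right_support,
        "rule_support": rule_support,
        "confidence": confidence,
        "lift": lift,
        "zscore": zscore,
    }


# Hypothetical shopping carts.
carts = [
    {"apples", "oranges"},
    {"apples", "oranges", "milk"},
    {"apples"},
    {"milk"},
]
measures = rule_measures(carts, {"apples"}, {"oranges"})
print(measures)
```

A lift above 1 and a positive Z score both indicate that the two sides occur together more often than independence would predict.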
A sequence analysis may be optionally requested, wherein there is a sequence of items
defined by a "sequence_column" argument, ordering the items on each side of each rule,
with left-side items preceding the right-side items. Setting the "relaxed_order"
option to True allows the items within the left side and within the right side to
occur in any order, provided that all left-side items precede all right-side items.
An output teradataml DataFrame is created for each requested rule combination (1-to-1,
2-to-1, and so on).
PARAMETERS:
data:
Required Argument.
Specifies the input data to perform Association analysis.
Types: teradataml DataFrame
group_column:
Required Argument.
Specifies the name of the column representing groups in the association rules.
Types: str
item_column:
Required Argument.
Specifies the name of the column representing items in the association rules.
Types: str
combinations:
Optional Argument.
Specifies the combinations of the number of items on the left side and the number of
items on the right side of the requested association rules. More than one combination
can be requested. One output DataFrame is generated for each combination specified,
so the number of output DataFrames equals the number of combinations.
Each output DataFrame is named "result_{combination}".
For example,
combinations = [11, 21]
produces an analysis of 1-to-1 and 2-to-1 rules, resulting in two output DataFrames,
'result_11' and 'result_21'.
Note:
The sum of the left-side and right-side sizes of a combination must be less than
or equal to 5.
Default Value: 11
Types: int OR list of Integers (int)
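The naming and size rules for "combinations" can be sketched in plain Python. The helpers split_combination and expected_result_names are hypothetical and are not part of teradataml. They assume, based on the 11 and 21 examples, that the tens digit is the left-side size and the units digit is the right-side size.

```python
def split_combination(code):
    """Split a combination code such as 21 into (left_size, right_size)."""
    left, right = divmod(int(code), 10)
    if left < 1 or right < 1:
        raise ValueError(f"combination {code} needs at least one item per side")
    if left + right > 5:
        raise ValueError(f"combination {code} exceeds the total size limit of 5")
    return left, right


def expected_result_names(combinations=11):
    """Return the output attribute names expected for the given combinations."""
    if isinstance(combinations, (list, tuple)):
        codes = list(combinations)
    else:
        codes = [combinations]
    names = []
    for code in codes:
        split_combination(code)  # validate before building the name
        names.append(f"result_{code}")
    return names


print(expected_result_names([11, 21]))
```

Checking the codes locally like this is only a convenience; the function itself remains the authority on which combinations it accepts.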
description_data:
Optional Argument.
Specifies the teradataml DataFrame containing description data, which is joined with
the result.
Note:
Arguments "description_data", "description_identifier" and "description_column",
if used, must be used together.
Types: teradataml DataFrame
description_identifier:
Optional Argument.
Specifies the name of the column in the "description_data" that is joined with the
result.
Note:
Arguments "description_data", "description_identifier" and "description_column",
if used, must be used together.
Types: str
description_column:
Optional Argument.
Specifies the name of the column in the "description_data" that contains descriptive
item names.
Note:
Arguments "description_data", "description_identifier" and "description_column",
if used, must be used together.
Types: str
group_count:
Optional Argument.
Specifies the count of the number of groups in the input data. By default it is
calculated by the function. This is useful when you are processing a reduced input
set saved in a previous run, so that calculations can be based on the number of
groups in the original input set and not the reduced set.
Types: int
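To see why the original group count matters, consider the arithmetic below. The numbers and the helper name support are made up for illustration; the point is only that dividing by the reduced set's group count inflates the support.

```python
def support(groups_with_item, group_count):
    """Fraction of groups that contain the item."""
    return groups_with_item / group_count


groups_with_item = 200        # hypothetical: groups containing the item
original_group_count = 1000   # hypothetical: groups in the original input
reduced_group_count = 400     # hypothetical: groups left in the reduced input

# Support relative to the original input, as "group_count" makes possible.
print(support(groups_with_item, original_group_count))
# Support relative to the reduced input only, which overstates the measure.
print(support(groups_with_item, reduced_group_count))
```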
hierarchy_data:
Optional Argument.
Specifies the hierarchy data that can be joined with the input data in order to
reduce the amount of input data and compute association rules at a different
hierarchical level. Use of this argument has an impact on the data saved in the
reduced input when the argument "reduced_data" is requested. When this option is
utilized, listwise deletion is automatically performed, ignoring rows that contain
a null group, item, or sequence column value.
Note:
Arguments "hierarchy_data", "low_level_column" and "high_level_column",
if used, must be used together.
Types: teradataml DataFrame
low_level_column:
Optional Argument.
Specifies the lowest level item column in the "hierarchy_data" to be matched with
the item column in the input data.
Note:
Arguments "hierarchy_data", "low_level_column" and "high_level_column",
if used, must be used together.
Types: str
high_level_column:
Optional Argument.
Specifies the higher-level item column in the "hierarchy_data".
Note:
Arguments "hierarchy_data", "low_level_column" and "high_level_column",
if used, must be used together.
Types: str
left_lookup_data:
Optional Argument.
Specifies the left-side lookup data, used to restrict the rules reported to only
those with left-side items that appear in the lookup data.
Note:
Arguments "left_lookup_data" and "left_lookup_column", if used, must be used
together.
Types: teradataml DataFrame
left_lookup_column:
Optional Argument.
Specifies the name of the column to match with left-side items in rules.
Note:
Arguments "left_lookup_data" and "left_lookup_column", if used, must be used
together.
Types: str
right_lookup_data:
Optional Argument.
Specifies the right-side lookup data, used to restrict the rules reported to only
those with right-side items that appear in the lookup data.
Note:
Arguments "right_lookup_data" and "right_lookup_column", if used, must be used
together.
Types: teradataml DataFrame
right_lookup_column:
Optional Argument.
Specifies the name of the column to match with right-side items in rules.
Note:
Arguments "right_lookup_data" and "right_lookup_column", if used, must be used
together.
Types: str
min_confidence:
Optional Argument.
Specifies the minimum value that the confidence measure of an association rule must
have before it is included in a result. The range of valid values is 0 to 1 inclusive.
Types: float
min_lift:
Optional Argument.
Specifies the minimum value that the lift measure of an association rule must have
before it is included in a result. The range of valid values is 0 to any positive
numeric value.
Types: float
min_support:
Optional Argument.
Specifies the minimum value that the support measure of an association rule must
have before it is included in a result. When this argument is utilized, the size
of the input data is reduced, potentially impacting the use of the "reduced_data"
argument. Use of this argument also causes listwise deletion to be performed,
skipping any input rows that have a null group, item or sequence column value.
The range of valid values is 0 to 1 inclusive.
Types: float
min_zscore:
Optional Argument.
Specifies the minimum value that the Z Score measure of an association rule must
have before it is included in a result.
Types: float
order_prob:
Optional Argument.
Specifies the probability of correct ordering for sequential analysis. When sequence
analysis is being performed, by default the algorithm determines the ordering
probabilities. The value should be greater than 0 and at most 1 (setting it to 1
effectively ignores this principle in lift and Z Score calculations).
Types: float, int
process_type:
Optional Argument.
Specifies the type of processing.
Permitted Values:
* 'all' - All processing is performed, from building support tables to
calculating final affinities.
* 'support' - The single item support result DataFrame is built and then
processing is halted. This allows the user to view the support result and decide
what the minimum support value should be, thus reducing the amount of processing
performed. The single item support output DataFrame is named as
'support_1_item' and underlying output table is named as
'ml__valib_association_1_ITEM_SUPPORT'. If "support_result_prefix" is specified,
it replaces 'ml__valib_association' with the provided value in the support
table name.
* 'recalculate' - The final affinity tables are calculated based on support
tables already present. This requires that the "no_support_results" parameter
was set to False in a previous run so that the support tables are available for
recalculating the final affinities.
Default Value: 'all'
Types: str
reduced_data:
Optional Argument.
Specifies the reduced input data. If the input to the analysis is reduced by using
"min_support", "hierarchy_data", or "filter", the resulting reduced input data
can be saved for further analysis.
Note:
1. This is not affected by the use of a left-side or right-side lookup argument,
or by the "min_confidence", "min_lift", or "min_zscore" arguments.
2. If further analysis is performed on this data, it may be appropriate to use
the "group_count" argument.
Types: teradataml DataFrame
relaxed_order:
Optional Argument.
Specifies whether relaxed ordering is used. Use this option in conjunction with
sequence analysis, that is, when a sequence column is specified. With relaxed
ordering, the items on the left side of an association rule may occur in any order
(as determined by the sequence column), and likewise the right-side items, provided
that all left-side items precede all right-side items.
Types: bool
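The difference between strict and relaxed ordering can be sketched as follows. This is one reading of the description above, not the library's implementation: the helper rule_holds is hypothetical, it assumes each item occurs at most once per group, and it treats the order in which the rule lists its items as the required order for the strict case.

```python
def rule_holds(sequence, left, right, relaxed_order=False):
    """Check whether an ordered group satisfies the rule left -> right.

    sequence: items of one group, already sorted by the sequence column.
    left, right: non-empty lists of items, in the order the rule states them.
    """
    position = {item: index for index, item in enumerate(sequence)}
    if any(item not in position for item in list(left) + list(right)):
        return False

    left_pos = [position[item] for item in left]
    right_pos = [position[item] for item in right]

    # In both modes, every left-side item must precede every right-side item.
    if max(left_pos) >= min(right_pos):
        return False
    if relaxed_order:
        return True
    # Strict mode: items on each side must also appear in the stated order.
    return left_pos == sorted(left_pos) and right_pos == sorted(right_pos)


events = ["B", "A", "C"]  # hypothetical group, ordered by the sequence column
print(rule_holds(events, ["A", "B"], ["C"]))
print(rule_holds(events, ["A", "B"], ["C"], relaxed_order=True))
```

With the sample group, the rule fails under strict ordering because B occurs before A, but holds under relaxed ordering because both A and B still precede C.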
sequence_column:
Optional Argument.
Specifies the name of the column providing sequencing of input items if sequence
analysis is desired. This might typically be a column of type date or timestamp.
By default, sequence analysis is not performed.
Types: str
filter:
Optional Argument.
Specifies the clause to filter rows selected for analysis within Association Rules.
For example,
filter = "cust_id > 0"
Note:
Single quotes within the filter value must be doubled. For example, the expression
channel <> ' ' must be written as channel <> '' '', so that the expression
ends with quote-quote-blank-quote-quote.
Types: str
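The quote-doubling rule in the note can be applied mechanically before passing the clause to "filter". The helper double_single_quotes below is hypothetical and is not part of teradataml; it simply replaces every single quote with two.

```python
def double_single_quotes(expression):
    """Double every single quote in a filter expression."""
    return expression.replace("'", "''")


# The ordinary SQL expression, with a blank between the quotes.
plain = "channel <> ' '"
filter_clause = double_single_quotes(plain)
print(filter_clause)  # channel <> '' ''
```

The resulting string can then be passed as the "filter" argument. Apply the helper only once, since running it again would double the quotes a second time.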
no_support_results:
Optional Argument.
Specifies whether to suppress the intermediate support results. By default (True),
support results are not presented in the output.
Notes:
1. If "no_support_results" is False and "support_result_prefix" is not used,
then the generated underlying support tables are overwritten, if they already
exist.
2. When set to False, the support results generated depend on the "combinations"
parameter.
Default Value: True
Types: bool
support_result_prefix:
Optional Argument.
Specifies a string to be used as the prefix of the underlying table names
for the support tables, which can be accessed using the output DataFrames.
Notes:
1. Teradata recommends using this when the function is to be executed with
"process_type" as 'recalculate'. Make sure to use the same value for both
function calls.
2. If "no_support_results" is False and this is not used, then the generated
underlying support tables are overwritten, if they already exist.
Default Value: 'ml__valib_association'
Types: str
gen_sql_only:
Optional Argument.
Specifies whether to generate only SQL for the function.
When set to True, the function SQL is generated but not executed, and can be accessed
using the show_query() method. Otherwise, the SQL is executed but not returned.
Default Value: False
Types: bool
charset:
Optional Argument.
Specifies the character set for the table name and column names.
If this argument is not set, the function uses the default value set by
the VAL library.
Permitted Values:
* 'UTF8'
* 'ASCII'
Types: str
RETURNS:
An instance of Association.
Output teradataml DataFrames can be accessed using attribute references, such as
AssociationObj.<attribute_name>.
Note:
Output DataFrames generated can be categorized as follows:
1. Affinity output DataFrames
2. Support output DataFrames
Attribute names of these output DataFrames can be obtained using two attributes:
1. affinity_outputs:
This prints the attribute names of the output DataFrames containing the
affinity results.
For example, this can be accessed as
AssociationObj.affinity_outputs
2. support_outputs:
This prints the attribute names of the output DataFrames containing the
support results.
For example, this can be accessed as
AssociationObj.support_outputs
The output DataFrames generated by the function depend on the following:
1. When "process_type" is set to 'support':
a. No affinity outputs are generated.
b. Function generates only two support output DataFrames named as:
i. support_1_item
ii. group_count
2. When "no_support_results" is set to True and "process_type" is set to values
other than 'support':
a. Only affinity outputs are generated, based on the number of combinations
requested by the user.
b. No support outputs are generated.
3. When "no_support_results" is set to False and "process_type" is set to values
other than 'support':
a. Affinity outputs are generated, based on the number of combinations requested.
b. Support outputs are also generated depending on the combination(s) requested.
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
from teradataml import DataFrame
df = DataFrame("credit_tran")
print(df)
# Example 1: Perform Association analysis using default values.
obj = valib.Association(data=df, group_column="cust_id", item_column="channel")
# Print the affinity result. Only affinity result for default combination 11 is produced.
print(obj.result_11)
# Example 2: Requests a 1-to-1 and a 2-to-1 analysis, while also requesting 0.1 minimum
# support. Rows with blank channel column are also filtered.
# Note: The blank channel value requires double single quotes,
# that is quote-quote-blank-quote-quote.
obj = valib.Association(data=df,
                        group_column="cust_id",
                        item_column="channel",
                        min_support=0.1,
                        filter="channel <> '' ''",
                        combinations=[11, 21])
# Executing the above call returns two affinity results. Let's check the names of the
# output DataFrames.
print(obj.affinity_outputs)
# Print the results.
print(obj.result_11)
print(obj.result_21)
# Example 3: Request sequence analysis by specifying a "sequence_column" parameter.
# Also include optional parameters "min_support" and "filter".
obj = valib.Association(data=df,
                        group_column="cust_id",
                        item_column="channel",
                        min_support=0.1,
                        filter="channel <> '' ''",
                        sequence_column="tran_date")
# Executing the above call returns one affinity result. Let's check the name of
# the output DataFrame.
print(obj.affinity_outputs)
# Print the results.
print(obj.result_11)
# Example 4: Let's generate the support results first and then calculate the final
# affinity results.
# To generate the support results, we set "no_support_results" to False and use
# 'test_prefix' as a prefix for the table names for the support tables generated.
obj = valib.Association(data=df,
                        group_column="cust_id",
                        item_column="channel",
                        no_support_results=False,
                        support_result_prefix="test_prefix")
# Print the results.
print(obj)
# Let's look at the attribute names of the support results generated.
print(obj.support_outputs)
# Print the individual support results.
print(obj.support_result_01)
print(obj.support_result_11)
# Let's look at the attribute names of the affinity results generated.
print(obj.affinity_outputs)
# Print the affinity results.
print(obj.result_11)
# Re-run the Association function with "process_type" set to 'recalculate' for
# recalculation of affinity results.
# To recalculate, we use the support results generated by the previous function call.
obj = valib.Association(data=df,
                        group_column="cust_id",
                        item_column="channel",
                        process_type="recalculate",
                        support_result_prefix="test_prefix")
# Print the affinity results.
print(obj.result_11)
# Example 5: Generate only the SQL for the function, without executing it.
obj = valib.Association(data=df,
                        group_column="cust_id",
                        item_column="channel",
                        no_support_results=False,
                        support_result_prefix="test_prefix",
                        gen_sql_only=True)
# Print the generated SQL.
print(obj.show_query("sql"))
# Print both generated SQL and stored procedure call.
print(obj.show_query("both"))
# Print the stored procedure call.
print(obj.show_query())
print(obj.show_query("sp"))