Description
Association Rules provide various measures concerning items residing in
groups. The measures, support, confidence, lift and Z Score, help to
determine the likelihood that one or more items exist in a group, given that
another one or more items exist in the same group. The classic example of
this type of study is market basket analysis, in which the groups are
shopping carts and the items are the products purchased in the shopping
carts. An association rule might indicate the likelihood that a given
shopping cart contains oranges, given that it also contains apples.
Association rules consist of a left part and a right part. The left part
consists of one or more items that are given to reside in a group, and the
right part is the consequence that one or more items also reside in the given
group. The measures are defined as follows:
Support-Percentage of groups containing the items on the left (left-side support), on the right (right-side support), or on both sides of a rule (rule support).
Confidence-Percentage of groups containing the left-side items that also contain the right-side items.
Lift-A measure of how much the probability is raised that the right-side items occur in a group given that the left-side items occur in the group.
Z Score-A statistical measure of how much the expected and actual values of the number of groups containing all the items in the rule varies. (Zero means expected and actual are the same.)
A sequence analysis may be optionally requested, wherein there is a sequence
of items defined by a "sequence.column" argument, ordering the items on each
side of each rule, with left-side items preceding the rights-side items. An
option called "relaxed.order" is provided, that can be set to TRUE so that
items on the left side and the right side can be in any order provided that
all left-side items precede all right-side items. An output tbl_teradata is
created for each requested rule combination (1-to-1, 2-to-1, and so on).
Usage
td_association_valib(data, group.column, item.column, ...)
Arguments
data |
Required Argument. |
group.column |
Required Argument. |
item.column |
Required Argument. |
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
Function returns an object of class "td_association_valib"
which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using names as described in the note below.
Note:
Output objects of class tbl_teradata generated can be categorized as follows:
Affinity outputs.
Support outputs.
Attribute names of these output objects of class tbl_teradata can be found out using two attributes:
affinity.outputs:
This prints the attribute names of the output objects of class tbl_teradata containing the affinity results.
For example, this can be accessed as
AssociationObj$affinity.outputs
support.outputs:
This prints the attribute names of the output objects of class tbl_teradata containing the support results.
For example, this can be accessed as
AssociationObj$support.outputs
Output objects of class tbl_teradata generated by the function depend upon following:
When "process.type" is set to 'support':
No affinity outputs are generated.
Function generates only two support output objects of class tbl_teradata named as:
support.1.item
group.count
When "no.support.results" is set to TRUE and "process.type" is set to values other than 'support':
Only affinity outputs are generated, based on the number of combinations requested by the user.
No support outputs are generated.
When "no.support.results" is set to FALSE and "process.type" is set to values other than 'support':
Affinity outputs are generated, based on the number of combinations requested.
Support outputs are also generated depending on the combination(s) requested.
Other Arguments
combinations
Optional Argument.
Specifies the combinations of number of items on
left side and number of items on right side of
requested association rules. More than one
combination can also be requested. For each
combination specified, one output tbl_teradata is
generated, i.e., number of outputs objects of class
tbl_teradata generated depends on the number of
combinations. Corresponding output tbl_teradata is
named as "result.<combination>".
For example,
combinations=c(11, 21)
above combinations produces an analysis of 1-to-1
and 2-to-1 rules. This will result in two output
objects of class tbl_teradata 'result.11' and
'result.21'.
Note:
If you add the sizes of the left and right sides of a combination, the sum must be less than or equal to 5.
Default Value: 11
Types: integer OR list of Integers (int)
description.data
Optional Argument.
Specifies the tbl_teradata containing
description data, which is joined with the
result.
Note:
Arguments "description.data", "description.identifier" and "description.column", if used, must be used together.
Types: tbl_teradata
description.identifier
Optional Argument.
Specifies the name of the column in the
"description.data" that is joined with
the result.
Note:
Arguments "description.data", "description.identifier" and "description.column", if used, must be used together.
Types: character
description.column
Optional Argument.
Specifies the name of
the column in the "description.data" that
contains descriptive item names.
Note:
Arguments "description.data", "description.identifier" and "description.column", if used, must be used together.
Types: character
group.count
Optional Argument.
Specifies the count of the number of groups in the
input data. By default it is calculated by the
function. This is useful when you are processing a
reduced input set saved in a previous run, so that
calculations can be based on the number of groups
in the original input set and not the reduced set.
Types: integer
hierarchy.data
Optional Argument.
Specifies the hierarchy data that can be joined
with the input data in order to reduce the amount
of input data and compute association rules at a
different hierarchical level. Use of this argument
has an impact on the data saved in the reduced
input when the argument "reduced.data" is
requested. When this option is utilized, listwise
deletion is automatically performed, ignoring rows
that contain a null group, item, or sequence
column value.
Note:
Arguments "hierarchy.data", "low.level.column" and "high.level.column", if used, must be used together.
Types: tbl_teradata
low.level.column
Optional Argument.
Specifies the lowest level item column in the
"hierarchy.data" to be matched with the item
column in the input data.
Note:
Arguments "hierarchy.data", "low.level.column" and "high.level.column", if used, must be used together.
Types: character
high.level.column
Optional Argument.
Specifies the higher-level item column in the
"hierarchy.data".
Note:
Arguments "hierarchy.data", "low.level.column" and "high.level.column", if used, must be used together.
Types: character
left.lookup.data
Optional Argument.
Specifies a left-side lookup data that can be
specified to reduce the rules reported to only
those with left-side items that appear in the
lookup dataset.
Note:
Arguments "left.lookup.data" and "left.lookup.column", if used, must be used together.
Types: tbl_teradata
left.lookup.column
Optional Argument.
Specifies the name of the column to match with
left-side items in rules.
Note:
Arguments "left.lookup.data" and "left.lookup.column", if used, must be used together.
Types: character
right.lookup.data
Optional Argument.
Specifies a right-side lookup data that can be
specified to reduce the rules reported to only
those with right-side items that appear in the
lookup dataset.
Note:
Arguments "right.lookup.data" and "right.lookup.column", if used, must be used together.
Types: tbl_teradata
right.lookup.column
Optional Argument.
Specifies the name of the column to match
with right-side items in rules.
Note:
Arguments "right.lookup.data" and "right.lookup.column", if used, must be used together.
Types: character
min.confidence
Optional Argument.
Specifies the minimum value that the confidence
measure of an association rule must have before
it is included in a result. The range of valid
values is 0 to 1 inclusive.
Types: numeric
min.lift
Optional Argument.
Specifies the minimum value that the lift measure of an
association rule must have before it is included in a
result. The range of valid values is 0 to any positive
numeric value.
Types: numeric
min.support
Optional Argument.
Specifies the minimum value that the support measure
of an association rule must have before it is
included in a result. When this argument is utilized,
the size of the input data is reduced, potentially
impacting the use of the "reduced.data" argument.
Use of this argument also causes listwise deletion
to be performed, skipping any input rows that have a
null group, item or sequence column value. The range
of valid values is 0 to 1 inclusive.
Types: numeric
min.zscore
Optional Argument.
Specifies the minimum value that the Z Score measure
of an association rule must have before it is included
in a result.
Types: numeric
order.prob
Optional Argument.
Specifies the probability of correct ordering for
sequential analysis. When sequence analysis is being
performed, by default the algorithm determines
ordering probabilities. Value should be non-zero
between 0 and 1 (Setting it to 1 effectively ignores
this principle in lift and Z Score calculations).
Types: numeric
process.type
Optional Argument.
Specifies the type of processing.
Permitted Values:
'all' - All processing is performed, from building support tables to calculating final affinities.
'support' - The single item support result tbl_teradata is built and then processing is halted. This allows user to view the support result and decide what the minimum support value should be, thus reducing the amount of processing performed. The single item support output tbl_teradata is named as 'support.1.item' and underlying output table is named as 'ml__valib_association_1_ITEM_SUPPORT'. If "support.result.prefix" is specified, it replaces 'ml__valib_association' with the provided value in the support table name.
'recalculate' - The final affinity tables are calculated based on support tables already present. This requires that the "no.support.results" parameter was set to FALSE in a previous run so that the support tables are available for recalculating the final affinities.
Default Value: 'all'
Types: character
reduced.data
Optional Argument.
Specified the reduced input data. If input to the
analysis is reduced by using the "min.support", a
"hierarchy.data", or a "filter", the resulting
reduced input data can be saved for further
analysis.
Notes:
This is not affected by the use of a left-side or right-side lookup argument or a "min.confidence", "min.lift", or "min.zscore" arguments.
If further analysis is performed on this data, it may be appropriate to use the "group.count" argument.
Types: tbl_teradata
relaxed.order
Optional Argument.
Use this option in conjunction with sequence
analysis, that is, when a sequence column is
specified. Relaxed ordering occurs when the items
on the left side of an association rule may occur
in any order (via the sequence column), and the
same is the case with the right-side items,
provided that all left-side items precede all
right-side items.
Types: logical
sequence.column
Optional Argument.
Specifies the name of the column providing
sequencing of input items if sequence analysis is
desired. This might typically be a column of type
date or timestamp. By default, sequence analysis
is not performed.
Types: character
filter
Optional Argument.
Specifies the clause to filter rows selected for analysis
within Association Rules.
For example,
filter = "cust_id > 0"
Note:
Single quotes within the parameter value must be doubled, such as in where=channel <> ” ”. (Ordinarily, the expression would be where=channel <> ' '. Instead, the expression ends with quote-quote-blank-quote-quote).
Types: character
no.support.results
Optional Argument.
Specifies whether the intermediate support
results are required or not. By default,
support results are not presented in the
output.
Notes:
If "no.support.results" is FALSE and "support.result.prefix" is not used, then the generated underlying support tables are overwritten, if they already exist.
When set to TRUE, support results generated depends on "combinations" parameter.
Default Value: TRUE
Types: logical
support.result.prefix
Optional Argument.
Specifies a string that should be used to
as prefix for the underlying table name
for the support tables which can be
accessed using output objects of class
tbl_teradata.
Notes:
Teradata recommends using this when function is to be executed with "process.type" as 'recalculate'. Make sure to use the same value for both the function calls.
If "no.support.results" is FALSE and this is not used, then the generated underlying support tables are overwritten, if they already exist.
Default Value: 'ml__valib_association'
Types: character
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option
# 'val.install.location' to the database name where Vantage analytic
# library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "credit_tran")
print(df)
# Example 1: Perform Association analysis using default values.
obj <- td_association_valib(data=df, group.column="cust_id",
item.column="channel")
# Print the affinity result. Only affinity result for default combination 11
# is produced.
print(obj$result.11)
# Example 2: Requests a 1-to-1 and a 2-to-1 analysis, while also requesting
# 0.1 minimum support. Rows with blank channel column are also
# filtered.
# Note: The blank channel value requires double single quotes,
# that is quote-quote-blank-quote-quote.
obj <- td_association_valib(data=df,
group.column=c("cust_id"),
item.column="channel",
min.support=0.1,
filter="channel <>''''",
combinations=c(11, 21))
# Executing above call will return two affinity results. Let's check the
# names of the output objects of class tbl_teradata.
print(obj$affinity.outputs)
# Print the results.
print(obj$result.11)
print(obj$result.21)
# Example 3: Request sequence analysis by specifying a "sequence.column"
# parameter. Also include optional parameters "min.support" and
# "filter".
obj <- td_association_valib(data=df,
group.column=c("cust_id"),
item.column="channel",
min.support=0.1,
filter="channel <>''''",
sequence.column="tran_date")
# Executing above call will return one affinity result. Let's check the name
# of the output tbl_teradata.
print(obj$affinity.outputs)
# Print the results.
print(obj$result.11)
# Example 4: Let's generate the support results first and then calculate the
# final affinity results.
# To generate the support results, we set "no.support.results" to FALSE and
# use 'test_prefix' as a prefix for the table names for the support tables
# generated.
obj <- td_association_valib(data=df,
group.column=c("cust_id"),
item.column="channel",
no.support.results=FALSE,
support.result.prefix="test_prefix")
# Print the results.
print(obj)
# Let's look at the attribute names of the support results generated.
print(obj$support.outputs)
# Print the individual support results.
print(obj$support.result.01)
print(obj$support.result.11)
# Let's look at the attribute names of the affinity results generated.
print(obj$affinity.outputs)
# Print the affinity results.
print(obj$result.11)
# Re-run the Association function with "process.type" set to 'recalculate'
# for recalculation of affinity results.
# To recalculate we shall use the support results generated by the previous
# function call.
obj <- td_association_valib(data=df,
group.column=c("cust_id"),
item.column="channel",
process.type="recalculate",
support.result.prefix="test_prefix")
# Print the affinity results.
print(obj$result.11)