| |
Methods defined here:
- __init__(self, data=None, min_support=None, time_column=None, path_filters=None, groupby_columns=None, item_column=None, item_definition_table=None, path_column=None, max_length=2147483647, min_length=1, closed_pattern=False, item_definition_columns=None, partition_columns=None, data_sequence_column=None, item_definition_table_sequence_column=None)
- DESCRIPTION:
The FrequentPaths function takes a teradataml DataFrame of sequences and
outputs a teradataml DataFrame of subsequences (patterns) that
frequently appear in the input teradataml DataFrame and, optionally,
a teradataml DataFrame of sequence-pattern pairs.
PARAMETERS:
data:
Required Argument.
Specifies the input teradataml DataFrame that contains the
input sequences. Each row is one item in a sequence.
Note: The function ignores rows that contain any NULL values.
min_support:
Required Argument.
Determines the threshold for whether a sequential pattern is
frequent. The min_support must be a positive float number.
If min_support is in the range (0,1), then it is a relative threshold.
If N is the total number of input sequences, then the threshold is
T=N*min_support.
For example, if there are 1000 sequences in the input
teradataml DataFrame and min_support is 0.05, then the threshold is 50.
If min_support is in the range (1,+∞), then it is an absolute threshold.
Regardless of N, T=min_support. For example, if min_support is 50, then the
threshold is 50, regardless of N.
A pattern is frequent if its support value is at least T.
Because the function outputs only frequent patterns, min_support controls
the number of output patterns. If min_support is small, processing time
increases exponentially; therefore, teradataml recommends starting with
a larger value: for example, 5% of the total number of sequences
if you know N, and 0.05 otherwise.
If you specify a relative min_support and groupby_columns, then the function
calculates N and T for each group.
If you specify a relative min_support and path_filters, then N is the
number of sequences that meet the constraints of the filters.
Types: float
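The threshold rules above can be sketched in plain Python. This is only an illustration of the arithmetic, not the function's implementation; the helper names support_threshold and thresholds_by_group are hypothetical, and treating a min_support of exactly 1 as absolute is an assumption, since the text above does not cover that boundary.

```python
from collections import Counter


def support_threshold(min_support, n_sequences):
    """Return the threshold T for a given min_support and sequence count N."""
    if min_support <= 0:
        raise ValueError("min_support must be a positive number")
    if min_support < 1:
        # Relative threshold: T = N * min_support.
        return n_sequences * min_support
    # Absolute threshold: T = min_support, regardless of N.
    # Assumption of this sketch: exactly 1 is treated as absolute.
    return float(min_support)


def thresholds_by_group(group_of_sequence, min_support):
    """Compute T per group, as described for a relative min_support
    combined with groupby_columns.

    group_of_sequence maps a sequence id to its group key.
    """
    sequences_per_group = Counter(group_of_sequence.values())
    return {
        group: support_threshold(min_support, n)
        for group, n in sequences_per_group.items()
    }


# 1000 sequences with a relative min_support of 0.05.
print(support_threshold(0.05, 1000))
# An absolute min_support of 50 ignores N.
print(support_threshold(50, 1000))
# Three sequences in group "A" and one in group "B", relative min_support 0.5.
print(thresholds_by_group({"s1": "A", "s2": "A", "s3": "B", "s4": "A"}, 0.5))
```

With path_filters, the same calculation would apply with N counted after filtering.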
time_column:
Optional Argument. Required when item_column or item_definition_columns
is specified.
Specifies the input teradataml DataFrame column that
determines the order of items in a sequence. Items in the same
sequence that have the same timestamp belong to the same set.
Types: str
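As a rough illustration of how the time column turns rows into item sets, the sketch below groups the rows of one sequence by timestamp. It runs on plain Python tuples rather than a teradataml DataFrame, and build_sequence is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter


def build_sequence(rows):
    """Turn (timestamp, item) rows of one sequence into ordered item sets.

    Rows that share a timestamp end up in the same item set.
    """
    ordered = sorted(rows, key=itemgetter(0))
    return [
        frozenset(item for _, item in same_time_rows)
        for _, same_time_rows in groupby(ordered, key=itemgetter(0))
    ]


# Items "a" and "c" share timestamp 1, so they form one item set.
rows = [(2, "e"), (1, "a"), (1, "c"), (3, "f"), (3, "d")]
sequence = build_sequence(rows)
print(len(sequence))
```

The result corresponds to the sequence written as "(a, c), e, (f, d)" in the path_filters description.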
path_filters:
Optional Argument.
Specifies the filters to use on the input teradataml DataFrame
sequences. Only input teradataml DataFrame sequences that satisfy all
constraints of at least one filter are input to the function. Each
filter has one or more constraints, which are separated by spaces.
Each constraint has this syntax:
constraint (item [symbol ...]).
By default, symbol is comma (,). If you specify symbol, it applies to
all filters. The constraint is one of the following:
• STW (start-with constraint): The first item set of the sequence
must contain at least one of the listed items.
For example, STW(c,d) requires the first item set of the sequence to
contain c or d. Sequence "(a, c), e, (f, d)" meets this constraint
because the first item set, (a,c), contains c.
• EDW (end-with constraint): The last item set of the sequence must contain
at least one of the listed items.
For example, EDW(f,g) requires the last item set of the sequence to contain
f or g. Sequence "(a, b), e, (f, d)" meets this constraint because the last
item set, (f,d), contains f.
• CTN (containing constraint): The sequence must contain at least one of the listed items.
For example, CTN(a,b) requires the sequence to contain a or b. The
sequence "(a,c), d, (e,f)" meets this constraint but the sequence "d,
(e,f)" does not.
Constraints in the same filter must be different.
For example, the filter "STW(c,d) EDW(g,k) CTN(e)" is valid, but
"STW(c,d) STW(e,h)" is invalid.
This example specifies a separator and uses it in two filters:
path_filters=["Separator(#)", "STW(c#d) EDW(g#k) CTN(e)", "CTN(h#k)"]
Types: str OR list of Strings (str)
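The filter semantics described above can be approximated in a few lines of plain Python. This is a sketch of the documented rules only, not the parser the function uses; the names parse_filter, satisfies_filter, and passes are hypothetical, and sequences are modeled as lists of Python sets.

```python
import re

# Matches one constraint such as "STW(c,d)" or "EDW (g#k)".
CONSTRAINT_RE = re.compile(r"(STW|EDW|CTN)\s*\(([^)]*)\)")


def parse_filter(filter_text, separator=","):
    """Parse one filter string into {constraint_name: set_of_items}."""
    constraints = {}
    for name, body in CONSTRAINT_RE.findall(filter_text):
        if name in constraints:
            # Constraints in the same filter must be different.
            raise ValueError(f"constraint {name} appears more than once")
        constraints[name] = {
            item.strip() for item in body.split(separator) if item.strip()
        }
    return constraints


def satisfies_filter(sequence, constraints):
    """True if the sequence (a non-empty list of sets) meets every constraint."""
    checks = {
        "STW": lambda items: bool(sequence[0] & items),
        "EDW": lambda items: bool(sequence[-1] & items),
        "CTN": lambda items: any(item_set & items for item_set in sequence),
    }
    return all(checks[name](items) for name, items in constraints.items())


def passes(sequence, filters, separator=","):
    """A sequence is kept if it satisfies all constraints of at least one filter."""
    return any(
        satisfies_filter(sequence, parse_filter(text, separator))
        for text in filters
    )


# The sequence "(a, c), e, (f, d)" from the STW example above.
seq = [{"a", "c"}, {"e"}, {"f", "d"}]
print(passes(seq, ["STW(c,d)"]))
print(passes(seq, ["STW(c#d) EDW(g#k) CTN(e)", "CTN(h#k)"], separator="#"))
```

Here the separator is passed as a separate argument for simplicity, whereas the function itself takes it as a "Separator(...)" entry in path_filters.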
groupby_columns:
Optional Argument.
Specifies the input teradataml DataFrame columns by which to group the
input teradataml DataFrame sequences. If you specify this argument,
then the function operates on each group separately and copies each
column mentioned in the argument to the output teradataml DataFrame.
Types: str OR list of Strings (str)
item_column:
Optional Argument. Required if you specify neither item_definition_columns
nor path_column.
Specifies the input teradataml DataFrame columns that contain the items.
Types: str OR list of Strings (str)
item_definition_table:
Optional Argument. Required if you specify neither item_column nor path_column.
Specifies the item definition teradataml DataFrame.
path_column:
Optional Argument. Required if you specify neither item_column nor
item_definition_columns.
Specifies the input teradataml DataFrame column that
contains paths in the form of sequence strings. A sequence string has
this syntax: "[item [, ...]]". In the sequence string syntax, you must
type the outer brackets. The sequence strings in this column
can be generated by the nPath function. If you specify this argument,
then each item set can have only one item.
Types: str
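To make the sequence string syntax concrete, the sketch below parses a string of the form "[item [, ...]]" into a list of single-item sets. The helper parse_path is hypothetical and assumes that items do not themselves contain commas.

```python
def parse_path(path_text):
    """Parse a sequence string such as "[A, B, C]" into single-item sets."""
    text = path_text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        # The outer brackets are part of the required syntax.
        raise ValueError("sequence string must be enclosed in brackets")
    inner = text[1:-1]
    # With path_column, each item set holds exactly one item.
    return [frozenset([item.strip()]) for item in inner.split(",") if item.strip()]


path = parse_path("[A, B, C]")
print(len(path))
```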
max_length:
Optional Argument.
Specifies the maximum length of the output sequential patterns. The
length of a pattern is its number of sets.
Default Value: 2147483647
Types: int
min_length:
Optional Argument.
Specifies the minimum length of the output sequential patterns.
Default Value: 1
Types: int
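Because the length of a pattern counts item sets rather than individual items, a short sketch may help. The helper within_length is hypothetical and only mirrors the min_length and max_length bounds described above on plain Python lists of sets.

```python
def within_length(pattern, min_length=1, max_length=2147483647):
    """True if the number of item sets in the pattern is within the bounds."""
    return min_length <= len(pattern) <= max_length


patterns = [
    [{"a"}],                       # length 1
    [{"a", "c"}, {"e"}],           # length 2, although it holds three items
    [{"a"}, {"e"}, {"f", "d"}],    # length 3
]

# Keep only patterns made of exactly two item sets.
kept = [p for p in patterns if within_length(p, min_length=2, max_length=2)]
print(len(kept))
```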
closed_pattern:
Optional Argument.
Specifies whether to output only closed patterns.
Default Value: False
Types: bool
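The text above does not define a closed pattern. In the usual sequential pattern mining sense, a pattern is closed when no proper super-pattern has the same support. The sketch below illustrates that common definition on plain Python lists of sets; it is an assumption that the function follows the same rule, and the helpers contains, support, and closed_only are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence.

    Each item set of the pattern must be a subset of a distinct item set of
    the sequence, in the same order.
    """
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, sequences):
    """Number of sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences)


def closed_only(patterns, sequences):
    """Keep patterns that have no proper super-pattern with equal support."""
    supports = [support(p, sequences) for p in patterns]
    closed = []
    for i, p in enumerate(patterns):
        absorbed = any(
            q != p and contains(q, p) and supports[j] == supports[i]
            for j, q in enumerate(patterns)
        )
        if not absorbed:
            closed.append(p)
    return closed


sequences = [
    [{"a"}, {"b"}, {"c"}],
    [{"a"}, {"b"}],
    [{"a"}, {"c"}],
]
frequent = [[{"a"}], [{"b"}], [{"a"}, {"b"}], [{"c"}], [{"a"}, {"c"}]]
print(len(closed_only(frequent, sequences)))
```

In this toy input, the pattern "b" appears in exactly the same sequences as "a, b", so only the longer pattern is kept.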
item_definition_columns:
Optional Argument. Required if you specify neither item_column
nor path_column.
Specifies the names of the index, definition, and item columns of
the input argument "item_definition_table".
Types: str
partition_columns:
Required Argument.
Specifies the names of the columns that comprise the partition key of
the input sequences.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
item_definition_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "item_definition_table". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of FrequentPaths.
Output teradataml DataFrames can be accessed using attribute
references, such as FrequentPathsObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. subsequence_data
2. seq_pattern_table
3. output
RAISES:
TeradataMlException
EXAMPLES:
# Load the data to run the example.
load_example_data("FrequentPaths", ["bank_web_url", "ref_url", "bank_web_clicks1", "bank_web_clicks2", "sequence_table"])
# Create teradataml DataFrame.
bank_web_url = DataFrame.from_table("bank_web_url")
ref_url = DataFrame.from_table("ref_url")
bank_web_clicks1 = DataFrame.from_table("bank_web_clicks1")
bank_web_clicks2 = DataFrame.from_table("bank_web_clicks2")
sequence_table = DataFrame.from_table("sequence_table")
# Example 1 : Running FrequentPaths function with item_column argument.
# data: bank_web_clicks1, which has web clickstream data from a set of users with multiple sessions.
# The user action information (the 'page' column) is used as item_column to select sequences.
frequentpaths_out1 = FrequentPaths(data=bank_web_clicks1,
                                   partition_columns='session_id',
                                   time_column='datestamp',
                                   item_column='page',
                                   min_support=2.0,
                                   max_length=2147483647,
                                   min_length=1,
                                   closed_pattern=False,
                                   data_sequence_column='datestamp'
                                   )
# Print the result DataFrame.
print(frequentpaths_out1.subsequence_data)
print(frequentpaths_out1.seq_pattern_table)
print(frequentpaths_out1.output)
# Example 2 : Running FrequentPaths function with item_definition_table argument.
# data: bank_web_url, which has the URL of each page browsed by the customer.
# item_definition_table : ref_url, which has the definitions of the browser pages
frequentpaths_out2 = FrequentPaths(data=bank_web_url,
                                   item_definition_table=ref_url,
                                   partition_columns='session_id',
                                   time_column='datestamp',
                                   min_support=2.0,
                                   item_definition_columns='[page_id:pagedef:page]',
                                   max_length=2147483647,
                                   min_length=1,
                                   closed_pattern=False,
                                   data_sequence_column='datestamp'
                                   )
# Print the result DataFrame.
print(frequentpaths_out2.subsequence_data)
print(frequentpaths_out2.seq_pattern_table)
print(frequentpaths_out2.output)
# Example 3 : Running FrequentPaths function with groupby_columns argument.
# The FrequentPaths function operates on each group (customer) separately.
frequentpaths_out3 = FrequentPaths(data=bank_web_clicks2,
                                   partition_columns='session_id',
                                   time_column='datestamp',
                                   item_column='page',
                                   groupby_columns='customer_id',
                                   min_support=2.0,
                                   max_length=2147483647,
                                   min_length=1,
                                   closed_pattern=False,
                                   data_sequence_column='datestamp'
                                   )
# Print the result DataFrame.
print(frequentpaths_out3.subsequence_data)
print(frequentpaths_out3.seq_pattern_table)
print(frequentpaths_out3.output)
# Example 4 : Running FrequentPaths function with path_filters argument.
frequentpaths_out4 = FrequentPaths(data=bank_web_clicks1,
                                   partition_columns='session_id',
                                   time_column='datestamp',
                                   item_column='page',
                                   min_support=2.0,
                                   max_length=2147483647,
                                   path_filters='STW(account summary) EDW(account history)',
                                   min_length=1,
                                   closed_pattern=False,
                                   data_sequence_column='datestamp'
                                   )
# Print the result DataFrame.
print(frequentpaths_out4.subsequence_data)
print(frequentpaths_out4.seq_pattern_table)
print(frequentpaths_out4.output)
# Example 5 : Using NPath output to run FrequentPaths function to select sequences.
# data: npath_output, which the example creates by passing the teradataml DataFrame
# "sequence_table" to the NPath function.
npath_output = NPath(data1=sequence_table,
                     data1_partition_column='id',
                     data1_order_column='datestamp',
                     result=['FIRST(id OF itemA) AS id',
                             'Accumulate (item OF ANY(itemA, itemAny, itemC)) AS path'],
                     mode='nonoverlapping',
                     pattern='itemA.itemAny*.itemC',
                     symbols=["item='A' AS itemA", "item='C' AS itemC", "TRUE AS itemAny"])
# Pass the NPath function output to the FrequentPaths function.
frequentpaths_out5 = FrequentPaths(data=npath_output.result,
                                   partition_columns='id',
                                   path_column='path',
                                   min_support=2.0,
                                   max_length=2147483647,
                                   min_length=1,
                                   closed_pattern=False
                                   )
# Print the result DataFrame.
print(frequentpaths_out5.subsequence_data)
print(frequentpaths_out5.seq_pattern_table)
print(frequentpaths_out5.output)
- __repr__(self)
- Returns the string representation for a FrequentPaths class instance.
- get_build_time(self)
- Function to return the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Function to return the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Function to return the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Function to return the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|