Teradata Package for Python Function Reference | 17.10 - FrequentPaths

teradataml.analytics.mle.FrequentPaths = class FrequentPaths(builtins.object)

Methods defined here:

__init__(self, data=None, min_support=None, time_column=None, path_filters=None, groupby_columns=None, item_column=None, item_definition_table=None, path_column=None, max_length=2147483647, min_length=1, closed_pattern=False, item_definition_columns=None, partition_columns=None, data_sequence_column=None, item_definition_table_sequence_column=None):
    DESCRIPTION:
        The FrequentPaths function takes a teradataml DataFrame of sequences and outputs
        a teradataml DataFrame of subsequences (patterns) that frequently appear in the
        input teradataml DataFrame and, optionally, a teradataml DataFrame of
        sequence-pattern pairs.

    PARAMETERS:
        data:
            Required Argument.
            Specifies the input teradataml DataFrame that contains the input sequences.
            Each row is one item in a sequence.
            Note:
                The function ignores rows that contain any NULL values.

        min_support:
            Required Argument.
            Determines the threshold for whether a sequential pattern is frequent.
            The min_support must be a positive float number.
            • If min_support is in the range (0,1), then it is a relative threshold.
              If N is the total number of input sequences, then the threshold is
              T=N*min_support. For example, if there are 1000 sequences in the input
              teradataml DataFrame and min_support is 0.05, then the threshold is 50.
            • If min_support is in the range (1,+∞), then it is an absolute threshold.
              Regardless of N, T=min_support. For example, if min_support is 50, then
              the threshold is 50, regardless of N.
            A pattern is frequent if its support value is at least T. Because the
            function outputs only frequent patterns, min_support controls the number
            of output patterns. If min_support is small, processing time increases
            exponentially; therefore, teradataml recommends starting the trial with a
            larger value, for example, 5% of the total sequence number if you know N
            and 0.05 otherwise.
            If you specify a relative min_support and groupby_columns, then the
            function calculates N and T for each group.
            If you specify a relative min_support and path_filters, then N is the
            number of sequences that meet the constraints of the filters.
            Types: float

        time_column:
            Optional Argument. Required when item_column or item_definition_columns
            is specified.
            Specifies the input teradataml DataFrame column that determines the order
            of items in a sequence. Items in the same sequence that have the same
            timestamp belong to the same set.
            Types: str

        path_filters:
            Optional Argument.
            Specifies the filters to use on the input teradataml DataFrame sequences.
            Only input sequences that satisfy all constraints of at least one filter
            are input to the function. Each filter has one or more constraints, which
            are separated by spaces. Each constraint has this syntax:
                constraint (item [symbol ...])
            By default, symbol is comma (,). If you specify symbol, it applies to all
            filters. The constraint is one of the following:
            • STW (start-with constraint): The first item set of the sequence must
              contain at least one of the listed items. For example, STW(c,d) requires
              the first item set of the sequence to contain c or d. Sequence
              "(a, c), e, (f, d)" meets this constraint because the first item set,
              (a,c), contains c.
            • EDW (end-with constraint): The last item set of the sequence must
              contain at least one of the listed items. For example, EDW(f,g) requires
              the last item set of the sequence to contain f or g. Sequence
              "(a, b), e, (f, d)" meets this constraint because the last item set,
              (f,d), contains f.
            • CTN (containing constraint): The sequence must contain at least one of
              the listed items. For example, CTN(a,b) requires the sequence to contain
              a or b. The sequence "(a,c), d, (e,f)" meets this constraint but the
              sequence "d, (e,f)" does not.
            Each constraint type can appear at most once in a filter. For example, the
            filter "STW(c,d) EDW(g,k) CTN(e)" is valid, but "STW(c,d) STW(e,h)" is
            invalid.
            The following argument value specifies a separator and uses it in two
            filters:
                path_filters=["Separator(#)", "STW(c#d) EDW (g#k) CTN(e)", "CTN(h#k)"]
            Types: str OR list of strs

        groupby_columns:
            Optional Argument.
            Specifies the input teradataml DataFrame columns by which to group the
            input teradataml DataFrame sequences. If you specify this argument, then
            the function operates on each group separately and copies each column
            mentioned in the argument to the output teradataml DataFrame.
            Types: str OR list of Strings (str)

        item_column:
            Optional Argument. Required if you specify neither
            item_definition_columns nor path_column.
            Specifies the input teradataml DataFrame columns that contain the items.
            Types: str OR list of Strings (str)

        item_definition_table:
            Optional Argument. Required if you specify neither item_column nor
            path_column.
            Specifies the item definition teradataml DataFrame.

        path_column:
            Optional Argument. Required if you specify neither item_column nor
            item_definition_columns.
            Specifies the input teradataml DataFrame column that contains paths in
            the form of sequence strings. A sequence string has this syntax:
                "[item [, ...]]"
            In the sequence string syntax, you must type the outer brackets. The
            sequence strings in this column can be generated by the nPath function.
            If you specify this argument, then each item set can have only one item.
            Types: str

        max_length:
            Optional Argument.
            Specifies the maximum length of the output sequential patterns. The
            length of a pattern is its number of sets.
            Default Value: 2147483647
            Types: int

        min_length:
            Optional Argument.
            Specifies the minimum length of the output sequential patterns.
            Default Value: 1
            Types: int

        closed_pattern:
            Optional Argument.
            Specifies whether to output only closed patterns.
            Default Value: False
            Types: bool

        item_definition_columns:
            Optional Argument. Required if you specify neither item_column nor
            path_column.
            Specifies the names of the index, definition, and item columns of the
            input argument "item_definition_table".
            Types: str

        partition_columns:
            Required Argument.
            Specifies the names of the columns that comprise the partition key of the
            input sequences.
            Types: str OR list of Strings (str)

        data_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each row of the
            input argument "data". The argument is used to ensure deterministic
            results for functions which produce results that vary from run to run.
            Types: str OR list of Strings (str)

        item_definition_table_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each row of the
            input argument "item_definition_table". The argument is used to ensure
            deterministic results for functions which produce results that vary from
            run to run.
            Types: str OR list of Strings (str)

    RETURNS:
        Instance of FrequentPaths.
        Output teradataml DataFrames can be accessed using attribute references,
        such as FrequentPathsObj.<attribute_name>.
        Output teradataml DataFrame attribute names are:
            1. subsequence_data
            2. seq_pattern_table
            3. output

    RAISES:
        TeradataMlException

    EXAMPLES:
        # Import the names used in the examples. The examples assume that a
        # connection to Vantage already exists (for example, one created with
        # create_context()).
        from teradataml import DataFrame, load_example_data
        from teradataml.analytics.mle import FrequentPaths, NPath

        # Load the data to run the example.
        load_example_data("FrequentPaths", ["bank_web_url", "ref_url",
                                            "bank_web_clicks1", "bank_web_clicks2",
                                            "sequence_table"])

        # Create teradataml DataFrame.
        bank_web_url = DataFrame.from_table("bank_web_url")
        ref_url = DataFrame.from_table("ref_url")
        bank_web_clicks1 = DataFrame.from_table("bank_web_clicks1")
        bank_web_clicks2 = DataFrame.from_table("bank_web_clicks2")
        sequence_table = DataFrame.from_table("sequence_table")

        # Example 1 : Running FrequentPaths function with item_column argument.
        # data: bank_web_clicks1, which has web clickstream data from a set of users
        # with multiple sessions.
        # We are using users action information as item_column to run FrequentPaths
        # function to select sequences.
        frequentpaths_out1 = FrequentPaths(data=bank_web_clicks1,
                                           partition_columns='session_id',
                                           time_column='datestamp',
                                           item_column='page',
                                           min_support=2.0,
                                           max_length=2147483647,
                                           min_length=1,
                                           closed_pattern=False,
                                           data_sequence_column='datestamp'
                                           )

        # Print the result DataFrame.
        print(frequentpaths_out1.subsequence_data)
        print(frequentpaths_out1.seq_pattern_table)
        print(frequentpaths_out1.output)

        # Example 2 : Running FrequentPaths function with item_definition_table argument.
        # data: bank_web_url, which has the URL of each page browsed by the customer.
        # item_definition_table : ref_url, which has the definitions of the browser pages.
        frequentpaths_out2 = FrequentPaths(data=bank_web_url,
                                           item_definition_table=ref_url,
                                           partition_columns='session_id',
                                           time_column='datestamp',
                                           min_support=2.0,
                                           item_definition_columns='[page_id:pagedef:page]',
                                           max_length=2147483647,
                                           min_length=1,
                                           closed_pattern=False,
                                           data_sequence_column='datestamp'
                                           )

        # Print the result DataFrame.
        print(frequentpaths_out2.subsequence_data)
        print(frequentpaths_out2.seq_pattern_table)
        print(frequentpaths_out2.output)

        # Example 3 : Running FrequentPaths function with groupby_columns argument.
        # FrequentPaths function operates on each group (customer) separately.
        frequentpaths_out3 = FrequentPaths(data=bank_web_clicks2,
                                           partition_columns='session_id',
                                           time_column='datestamp',
                                           item_column='page',
                                           groupby_columns='customer_id',
                                           min_support=2.0,
                                           max_length=2147483647,
                                           min_length=1,
                                           closed_pattern=False,
                                           data_sequence_column='datestamp'
                                           )

        # Print the result DataFrame.
        print(frequentpaths_out3.subsequence_data)
        print(frequentpaths_out3.seq_pattern_table)
        print(frequentpaths_out3.output)

        # Example 4 : Running FrequentPaths function with path_filters argument.
        frequentpaths_out4 = FrequentPaths(data=bank_web_clicks1,
                                           partition_columns='session_id',
                                           time_column='datestamp',
                                           item_column='page',
                                           min_support=2.0,
                                           max_length=2147483647,
                                           path_filters='STW(account summary) EDW(account history)',
                                           min_length=1,
                                           closed_pattern=False,
                                           data_sequence_column='datestamp'
                                           )

        # Print the result DataFrame.
        print(frequentpaths_out4.subsequence_data)
        print(frequentpaths_out4.seq_pattern_table)
        print(frequentpaths_out4.output)

        # Example 5 : Using NPath output to run FrequentPaths function to select sequences.
        # data: npath_output, which the example creates by inputting the teradataml DataFrame
        # "sequence_table", to the NPath function.
        npath_output = NPath(data1=sequence_table,
                             data1_partition_column='id',
                             data1_order_column='datestamp',
                             result=['FIRST(id OF itemA) AS id',
                                     'Accumulate (item OF ANY(itemA, itemAny, itemC)) AS path'],
                             mode='nonoverlapping',
                             pattern='itemA.itemAny*.itemC',
                             symbols=["item='A' AS itemA",
                                      "item='C' AS itemC",
                                      "TRUE AS itemAny"])

        # Passing NPath function output to run FrequentPaths function.
        frequentpaths_out5 = FrequentPaths(data=npath_output.result,
                                           partition_columns='id',
                                           path_column='path',
                                           min_support=2.0,
                                           max_length=2147483647,
                                           min_length=1,
                                           closed_pattern=False
                                           )

        # Print the result DataFrame.
        print(frequentpaths_out5.subsequence_data)
        print(frequentpaths_out5.seq_pattern_table)
        print(frequentpaths_out5.output)
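
The min_support rules described for __init__ can be sketched in plain Python. This is only an illustration of the documented arithmetic and does not call teradataml. The helper names support_threshold, is_frequent, and group_thresholds are hypothetical, and treating a value of exactly 1 as an absolute threshold is an assumption, because the description covers only values below 1 and values above 1.

```python
def support_threshold(min_support, n_sequences):
    """Return the threshold T for a given min_support and sequence count N."""
    if min_support <= 0:
        raise ValueError("min_support must be a positive number")
    if min_support < 1:
        # Relative threshold: T = N * min_support.
        return n_sequences * min_support
    # Absolute threshold: T = min_support, regardless of N.
    # Assumption: a value of exactly 1 is treated as absolute here.
    return min_support


def is_frequent(support, min_support, n_sequences):
    """A pattern is frequent if its support value is at least T."""
    return support >= support_threshold(min_support, n_sequences)


def group_thresholds(min_support, group_sizes):
    """With groupby_columns, N and T are calculated separately for each group."""
    return {group: support_threshold(min_support, n)
            for group, n in group_sizes.items()}


# Relative threshold: 1000 sequences and min_support=0.05 give a threshold of 50.
print(support_threshold(0.05, 1000))

# Absolute threshold: min_support=50 gives a threshold of 50 for any N.
print(support_threshold(50, 1000))

# Hypothetical group sizes, showing one threshold per group.
print(group_thresholds(0.05, {"customer_1": 200, "customer_2": 1000}))
```

The sketch makes the cost trade-off visible: lowering min_support lowers T, so more patterns qualify as frequent and more work is needed to find them.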
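
The path_filters semantics described for __init__ (a sequence is kept if it satisfies all constraints of at least one filter) can also be illustrated without a database. The following is a minimal sketch, not the function's actual implementation. It assumes a sequence is modelled as a non-empty list of Python sets (one set per item set); the names parse_filter, satisfies_filter, and passes_path_filters are hypothetical; and the separator is passed as an ordinary argument instead of being parsed from a "Separator(...)" entry.

```python
import re

# Matches one constraint such as "STW(c,d)". Optional whitespace before "(" is
# tolerated because the documented example writes "EDW (g#k)".
CONSTRAINT_RE = re.compile(r"(STW|EDW|CTN)\s*\(([^)]*)\)")


def parse_filter(filter_text, separator=","):
    """Parse one filter string into {constraint name: set of items}."""
    constraints = {}
    for name, body in CONSTRAINT_RE.findall(filter_text):
        if name in constraints:
            # Constraints in the same filter must be different.
            raise ValueError(f"constraint {name} appears twice in {filter_text!r}")
        constraints[name] = {item.strip() for item in body.split(separator)}
    if not constraints:
        raise ValueError(f"no constraint found in {filter_text!r}")
    return constraints


def satisfies_filter(sequence, constraints):
    """Return True if the sequence meets every constraint of one parsed filter."""
    for name, items in constraints.items():
        if name == "STW":    # first item set contains at least one listed item
            ok = bool(sequence[0] & items)
        elif name == "EDW":  # last item set contains at least one listed item
            ok = bool(sequence[-1] & items)
        else:                # CTN: any item set contains a listed item
            ok = any(item_set & items for item_set in sequence)
        if not ok:
            return False
    return True


def passes_path_filters(sequence, filters, separator=","):
    """Return True if the sequence satisfies at least one of the filters."""
    return any(satisfies_filter(sequence, parse_filter(text, separator))
               for text in filters)


# The sequences used in the constraint descriptions.
print(passes_path_filters([{"a", "c"}, {"e"}, {"f", "d"}], ["STW(c,d)"]))  # True
print(passes_path_filters([{"a", "b"}, {"e"}, {"f", "d"}], ["EDW(f,g)"]))  # True
print(passes_path_filters([{"d"}, {"e", "f"}], ["CTN(a,b)"]))              # False
```

Matching constraints with a regular expression, instead of splitting the filter on spaces, keeps items that themselves contain spaces intact, such as "account summary" in Example 4.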

__repr__(self): Returns the string representation for a FrequentPaths class instance.

get_build_time(self): Returns the build time of the algorithm in seconds. When the model object is created using retrieve_model(), the value returned is the one saved in the Model Catalog.

get_prediction_type(self): Returns the prediction type of the algorithm. When the model object is created using retrieve_model(), the value returned is the one saved in the Model Catalog.

get_target_column(self): Returns the target column of the algorithm. When the model object is created using retrieve_model(), the value returned is the one saved in the Model Catalog.

show_query(self): Returns the underlying SQL query. When the model object is created using retrieve_model(), None is returned.