| |
Methods defined here:
- __init__(self, data=None, root_node=None, node_id=None, parent_id=None, allow_cycles=False, starts_with=None, mode=None, output=None, max_distance=5, logging=False, result=None, data_sequence_column=None, data_partition_column='1', data_order_column=None)
- DESCRIPTION:
The NTree function is a hierarchical analysis SQL-MapReduce function
that can build and traverse tree structures on all worker machines.
The function reads the data only once from the disk and creates the trees in memory.
PARAMETERS:
data:
Required Argument.
Specifies the input teradataml DataFrame that contains the input table.
data_partition_column:
Optional Argument.
Specifies Partition By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Default Value: 1
Types: str OR list of Strings (str)
data_order_column:
Optional Argument.
Specifies Order By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
root_node:
Required Argument.
Specifies the Boolean SQL expression that defines the root nodes of the
trees (for example, parent_id IS NULL).
Types: str
node_id:
Required Argument.
Specifies the SQL expression whose value uniquely identifies a node
in the input teradataml DataFrame (for example, order_id).
Note: A node can appear multiple times in the data set, with
different parents.
Types: str
parent_id:
Required Argument.
Specifies the SQL expression whose value identifies the parent node.
Types: str
allow_cycles:
Optional Argument.
Specifies whether trees can contain cycles. If not, a cycle in the
data set causes the function to throw an exception. For information
about cycles, refer to "Cycles in NTree".
Default Value: False
Types: bool
starts_with:
Required Argument.
Specifies the node from which to start tree traversal. The value must
be "root", "leaf", or a SQL expression that identifies a node.
Types: str
mode:
Required Argument.
Specifies the direction of tree traversal from the start
node - up to the root node or down to the leaf nodes.
Permitted Values: UP, DOWN
Types: str
output:
Required Argument.
Specifies when to output a tuple - at every node along the
traversal path ("all") or only at the end of the traversal
path ("end").
Permitted Values: END, ALL
Default Value: end
Types: str
max_distance:
Optional Argument.
Specifies the maximum tree depth.
Default Value: 5
Types: int
logging:
Optional Argument.
Specifies whether the function prints log messages.
Default Value: False
Types: bool
result:
Required Argument.
Specifies aggregate operations to perform during tree traversal. The
function reports the result of each aggregate operation in the output
table. The syntax of each aggregate operation is:
operation (expression) [ ALIAS alias ]
operation is one of PATH, SUM, LEVEL, MAX, MIN, IS_CYCLE, AVG, or
PROPAGATE.
expression is a SQL expression. If operation is LEVEL or
IS_CYCLE, then expression must be *.
alias is the name of the output teradataml DataFrame column that
contains the result of the operation. The default value is the string
"operation(expression)" without the quotation marks. For example,
PATH(node_name).
Note: The function ignores alias if it is the same as an input
teradataml DataFrame column name.
For the path from the Starts_With node to the last traversed node,
the operations do the following:
1. PATH: Outputs the value of expression for each node, separating
values with "->".
2. SUM: Computes the value of expression for each node and outputs the
sum of these values.
3. LEVEL: Outputs the number of hops.
4. MAX: Computes the value of expression for each node and outputs the
highest of these values.
5. MIN: Computes the value of expression for each node and outputs the
lowest of these values.
6. IS_CYCLE: Outputs the cycle (if any).
7. AVG: Computes the value of expression for each node and outputs the
average of these values.
8. PROPAGATE: Evaluates expression with the value of the starts_with
node and propagates the result to every node.
Types: str
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identify each row of
the input argument "data". The argument is used to ensure
deterministic results for functions whose results can otherwise vary
from run to run.
Types: str OR list of Strings (str)
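The aggregate operations listed for the "result" argument can be easier to follow on a small in-memory tree. The following is a minimal sketch in plain Python, not teradataml code and not how NTree is implemented. The rows, the salary column, and the helper names traverse_down and aggregates are hypothetical, and it assumes that LEVEL counts the start node as 0 hops.

```python
# Illustrative sketch only: plain Python, no database connection,
# and not the implementation used by NTree.
# Hypothetical rows of (emp_id, emp_name, mgr_id, salary).
rows = [
    (100, "Don", None, 90),
    (200, "Pat", 100, 70),
    (300, "Donna", 100, 60),
    (400, "Kim", 200, 50),
]


def traverse_down(rows, start_id):
    """Yield each path (a list of rows) from the start node downward."""
    by_id = {}
    children = {}
    for row in rows:
        by_id[row[0]] = row
        children.setdefault(row[2], []).append(row)
    stack = [[by_id[start_id]]]
    while stack:
        path = stack.pop()
        yield path
        for child in children.get(path[-1][0], []):
            stack.append(path + [child])


def aggregates(path):
    """Compute PATH, SUM, LEVEL, MAX, MIN and AVG for one traversal path."""
    salaries = [row[3] for row in path]
    return {
        "PATH(emp_name)": "->".join(row[1] for row in path),
        "SUM(salary)": sum(salaries),
        # Assumption: the start node itself is 0 hops away.
        "LEVEL(*)": len(path) - 1,
        "MAX(salary)": max(salaries),
        "MIN(salary)": min(salaries),
        "AVG(salary)": sum(salaries) / len(salaries),
    }


# One result per node reached, loosely analogous to output="all".
results = [aggregates(path) for path in traverse_down(rows, 100)]
for result in results:
    print(result)
```

Keeping only the paths whose last node has no children would loosely correspond to output="end".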
RETURNS:
Instance of NTree.
Output teradataml DataFrames can be accessed using attribute
references, such as NTreeObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data
load_example_data("ntree", ["employee_table", "emp_table_by_dept"])
# Create teradataml DataFrame objects.
employee_table = DataFrame.from_table("employee_table")
emp_table_by_dept = DataFrame.from_table("emp_table_by_dept")
# Example 1 - This example finds the employees who report to employee
# 100 (either directly or indirectly) by traversing the tree
# of employees from employee 100 downward.
ntree_out1 = NTree(data=employee_table,
                   root_node='mgr_id is NULL',
                   node_id='emp_id',
                   parent_id='mgr_id',
                   starts_with='emp_id=100',
                   mode='down',
                   output='end',
                   result='PATH(emp_name) AS path'
                   )
# Print the result DataFrame
print(ntree_out1)
# Example 2 - This example finds the reporting structure by department.
ntree_out2 = NTree(data=emp_table_by_dept,
                   data_partition_column='department',
                   root_node="mgr_id = 'none'",
                   node_id='id',
                   parent_id='mgr_id',
                   starts_with='id=10',
                   mode='down',
                   output='all',
                   result='PATH(name) AS path, PATH(id) AS path2'
                   )
# Print the result DataFrame
print(ntree_out2)
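The examples above traverse downward. As a complement, the sketch below walks upward from a node toward the root and shows one way the ideas behind allow_cycles and max_distance could interact. It is plain Python under stated assumptions, not teradataml code: the parent_of and cyclic mappings and the walk_up helper are hypothetical, and how NTree itself applies max_distance or reports cycles is not specified here.

```python
# Illustrative sketch only: plain Python, no database connection,
# and not the implementation used by NTree.
# Hypothetical child -> parent links; None marks a root node.
parent_of = {400: 200, 200: 100, 100: None}
cyclic = {1: 2, 2: 3, 3: 1}


def walk_up(parent_of, start, allow_cycles=False, max_distance=5):
    """Return (path, is_cycle) for the walk from start toward the root.

    Assumption: max_distance limits the number of hops from the start node.
    """
    path = [start]
    seen = {start}
    node = parent_of.get(start)
    while node is not None and len(path) - 1 < max_distance:
        if node in seen:
            if not allow_cycles:
                # Mirrors the documented behavior of raising on a cycle.
                raise ValueError(f"cycle detected at node {node}")
            return path, True
        path.append(node)
        seen.add(node)
        node = parent_of.get(node)
    return path, False


print(walk_up(parent_of, 400))
print(walk_up(cyclic, 1, allow_cycles=True))
```

Tracking visited nodes in a set keeps the cycle check constant-time per hop, which is why the sketch stops as soon as it revisits a node instead of relying on the hop limit alone.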
- __repr__(self)
- Returns the string representation of an NTree class instance.
- get_build_time(self)
- Returns the build time of the algorithm, in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|