Methods defined here:
- __init__(self, data=None, id_column=None, items_column=None, hash_num=None, key_groups=None, seed_table=None, input_format='integer', mincluster_size=3, maxcluster_size=5, delimiter=' ', data_sequence_column=None, seed_table_sequence_column=None)
- DESCRIPTION:
The MinHash function uses transaction history to cluster similar
items or users together. For example, the function can cluster items
that are frequently bought together or users that bought the same
items.
PARAMETERS:
data:
Required Argument.
Specifies the name of the input teradataml DataFrame.
id_column:
Required Argument.
Specifies the name of the input teradataml DataFrame column that
contains the values to be hashed into the same cluster. Typically
these values are customer identifiers.
Types: str
items_column:
Required Argument.
Specifies the name of the input column that contains the values
for hashing.
Types: str
hash_num:
Required Argument.
Specifies the number of hash functions to generate. The value of
hash_num determines the number and size of the clusters generated.
Types: int
key_groups:
Required Argument.
Specifies the number of key groups to generate. The
number of key groups must be a divisor of hash_num. A
large number of key groups decreases the probability that multiple
users will be assigned to the same cluster identifier.
Types: int
seed_table:
Optional Argument.
Specifies the name of the teradataml DataFrame that contains the
seeds to use for hashing. Typically, this teradataml DataFrame was
created by an earlier MinHash call and is accessed through that
call's 'save_seed_to' attribute.
input_format:
Optional Argument.
Specifies the format of the values to be hashed (the values in
items_column).
Default Value: "integer"
Permitted Values: "bigint", "integer", "hex", "string"
Types: str
mincluster_size:
Optional Argument.
Specifies the minimum cluster size.
Default Value: 3
Types: int
maxcluster_size:
Optional Argument.
Specifies the maximum cluster size.
Default Value: 5
Types: int
delimiter:
Optional Argument.
Specifies the delimiter used between hashed values (typically
customer identifiers) in the output. The default value is the space
character.
Default Value: " "
Types: str
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
seed_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "seed_table". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
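The constraints stated above for key_groups (it must be a divisor of
hash_num) and input_format (one of four permitted values) can be checked
before a MinHash call is issued. The following is a minimal sketch in plain
Python. The helper name check_minhash_args is hypothetical and is not part
of teradataml, and the requirement that both counts be positive is an
assumption.

```python
# Hypothetical helper, NOT part of teradataml: it only mirrors the
# constraints documented for hash_num, key_groups and input_format.
PERMITTED_INPUT_FORMATS = ("bigint", "integer", "hex", "string")


def check_minhash_args(hash_num, key_groups, input_format="integer"):
    """Raise ValueError if the documented argument constraints are violated."""
    # Assumption: both counts are expected to be positive integers.
    if hash_num <= 0 or key_groups <= 0:
        raise ValueError("hash_num and key_groups must be positive integers")
    # Documented constraint: key_groups must be a divisor of hash_num.
    if hash_num % key_groups != 0:
        raise ValueError("key_groups must be a divisor of hash_num")
    # Documented constraint: input_format has four permitted values.
    if input_format not in PERMITTED_INPUT_FORMATS:
        raise ValueError("input_format must be one of %s"
                         % (PERMITTED_INPUT_FORMATS,))


# The values used in the EXAMPLES section satisfy the divisor constraint.
check_minhash_args(hash_num=1002, key_groups=3)
```

A call such as check_minhash_args(1000, 3) raises ValueError, because 3
does not divide 1000.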
RETURNS:
Instance of MinHash.
Output teradataml DataFrames can be accessed using attribute
references, such as MinHashObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. output_table
2. save_seed_to
3. output
Note: When the seed_table argument is used, the output teradataml
DataFrame save_seed_to is not created. Attempting to access this
attribute results in an INFO message stating that it was not created.
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("minhash", "salesdata")
# Create teradataml DataFrame objects.
salesdata = DataFrame.from_table("salesdata")
# Example 1 - Create clusters of users based on items purchased.
MinHash_out1 = MinHash(data = salesdata,
                       id_column = "userid",
                       items_column = "itemid",
                       hash_num = 1002,
                       key_groups = 3
                       )
# Print the results
print(MinHash_out1.output_table)
print(MinHash_out1.save_seed_to)
print(MinHash_out1.output)
# Example 2 - Use the previously generated seed table as input.
MinHash_out2 = MinHash(data = salesdata,
                       id_column = "userid",
                       items_column = "itemid",
                       hash_num = 1002,
                       key_groups = 3,
                       seed_table = MinHash_out1.save_seed_to
                       )
# Print the results
print(MinHash_out2.output_table)
print(MinHash_out2.output)
# Note: When the seed_table argument is used, the output teradataml
# DataFrame save_seed_to is not created. Attempting to access this
# attribute results in an INFO message stating that it was not created.
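For intuition about the technique named in the DESCRIPTION, the general
MinHash idea can be sketched in plain Python, independent of teradataml.
This is an illustration under stated assumptions, not the in-database
implementation: the linear hash family, the function names, and the toy
purchase data are all hypothetical. Each user's item set is reduced to a
signature of hash_num minimum hash values, and the fraction of matching
signature positions estimates how similar two users' item sets are. Reusing
the same seed reproduces the same hash functions, which is loosely analogous
to passing a saved seed table to a later call.

```python
import random

# Large prime used as the modulus of the illustrative hash family.
PRIME = 2_147_483_647  # 2**31 - 1


def make_hash_params(hash_num, seed=0):
    """Draw (a, b) pairs for hash_num functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(hash_num)]


def signature(items, params):
    """Return the MinHash signature: per function, the minimum hash value
    over the (non-empty) item set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimated_similarity(sig1, sig2):
    """Fraction of positions where two signatures agree; this estimates
    the Jaccard similarity of the underlying item sets."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


# Made-up transaction history: user id -> set of integer item ids.
purchases = {
    "u1": {1, 2, 3, 4},
    "u2": {1, 2, 3, 4},
    "u3": {100, 200, 300},
}

params = make_hash_params(hash_num=12, seed=42)
sigs = {user: signature(items, params) for user, items in purchases.items()}

# Users with identical item sets get identical signatures.
same = estimated_similarity(sigs["u1"], sigs["u2"])
# Users with disjoint item sets share no signature positions here, because
# each hash function is one-to-one on item ids smaller than PRIME.
different = estimated_similarity(sigs["u1"], sigs["u3"])
```

Users whose signatures agree in many positions are candidates for the same
cluster. How the function combines hash values into key groups and cluster
identifiers is not shown here.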
- __repr__(self)
- Returns the string representation for a MinHash class instance.