Teradata Python Package Function Reference - MinHash

teradataml.analytics.mle.MinHash = class MinHash(builtins.object)

Methods defined here:

__init__(self, data=None, id_column=None, items_column=None, hash_num=None, key_groups=None, seed_table=None, input_format='integer', mincluster_size=3, maxcluster_size=5, delimiter=' ', data_sequence_column=None, seed_table_sequence_column=None):
    DESCRIPTION:
        The MinHash function uses transaction history to cluster similar
        items or users together. For example, the function can cluster
        items that are frequently bought together or users that bought
        the same items.

    PARAMETERS:
        data:
            Required Argument.
            Specifies the name of the input teradataml DataFrame.

        id_column:
            Required Argument.
            Specifies the name of the input teradataml DataFrame column
            that contains the values to be hashed into the same cluster.
            Typically these values are customer identifiers.
            Types: str

        items_column:
            Required Argument.
            Specifies the name of the input column that contains the
            values for hashing.
            Types: str

        hash_num:
            Required Argument.
            Specifies the number of hash functions to generate. The value
            of hash_num determines the number and size of the clusters
            generated.
            Types: int

        key_groups:
            Required Argument.
            Specifies the number of key groups to generate. The number of
            key groups must be a divisor of hash_num. A large number of
            key groups decreases the probability that multiple users are
            assigned to the same cluster identifier.
            Types: int

        seed_table:
            Optional Argument.
            Specifies the name of the teradataml DataFrame that contains
            the seeds to use for hashing. Typically, this teradataml
            DataFrame was created by an earlier MinHash call and is
            accessed through that call's 'save_seed_to' attribute.

        input_format:
            Optional Argument.
            Specifies the format of the values to be hashed (the values
            in items_column).
            Default Value: "integer"
            Permitted Values: bigint, integer, hex, string
            Types: str

        mincluster_size:
            Optional Argument.
            Specifies the minimum cluster size.
            Default Value: 3
            Types: int

        maxcluster_size:
            Optional Argument.
            Specifies the maximum cluster size.
            Default Value: 5
            Types: int

        delimiter:
            Optional Argument.
            Specifies the delimiter used between hashed values (typically
            customer identifiers) in the output.
            Default Value: " " (the space character)
            Types: str

        data_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each
            row of the input argument "data". The argument is used to
            ensure deterministic results for functions which produce
            results that vary from run to run.
            Types: str OR list of Strings (str)

        seed_table_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each
            row of the input argument "seed_table". The argument is used
            to ensure deterministic results for functions which produce
            results that vary from run to run.
            Types: str OR list of Strings (str)

    RETURNS:
        Instance of MinHash.
        Output teradataml DataFrames can be accessed using attribute
        references, such as MinHashObj.<attribute_name>.
        Output teradataml DataFrame attribute names are:
            1. output_table
            2. save_seed_to
            3. output

        Note:
            When the seed_table argument is used, the output teradataml
            DataFrame save_seed_to is not created. Attempting to access
            this attribute produces an INFO message stating this.

    RAISES:
        TeradataMlException

    EXAMPLES:
        # Import the names used below. An active connection to Vantage
        # (created with create_context()) is assumed.
        from teradataml import DataFrame, load_example_data
        from teradataml.analytics.mle import MinHash

        # Load example data.
        load_example_data("minhash", "salesdata")

        # Create teradataml DataFrame objects.
        salesdata = DataFrame.from_table("salesdata")

        # Example 1 - Create clusters of users based on items purchased.
        # key_groups (3) is a divisor of hash_num (1002), as required.
        MinHash_out1 = MinHash(data = salesdata,
                               id_column = "userid",
                               items_column = "itemid",
                               hash_num = 1002,
                               key_groups = 3
                               )

        # Print the results.
        print(MinHash_out1.output_table)
        print(MinHash_out1.save_seed_to)
        print(MinHash_out1.output)

        # Example 2 - Use the previously generated seed table as input.
        MinHash_out2 = MinHash(data = salesdata,
                               id_column = "userid",
                               items_column = "itemid",
                               hash_num = 1002,
                               key_groups = 3,
                               seed_table = MinHash_out1.save_seed_to
                               )

        # Print the results.
        print(MinHash_out2.output_table)
        print(MinHash_out2.output)

        # Note: When the seed_table argument is used, the output teradataml
        # DataFrame save_seed_to is not created. Attempting to access this
        # attribute produces an INFO message stating this.
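The relationship between hash_num and key_groups described under PARAMETERS can be illustrated with a small, self-contained sketch of the general MinHash technique. This is not the Teradata implementation and does not call teradataml. All names here (check_arguments, minhash_signature, cluster_keys, cluster_users) are hypothetical, the seeded hash is an arbitrary choice, and the way key groups are formed (concatenating key_groups consecutive minimum hash values into one cluster key) is an assumption made for illustration only.

```python
import hashlib
from collections import defaultdict


def check_arguments(hash_num, key_groups):
    """Hypothetical helper: enforce the documented divisor constraint."""
    if hash_num <= 0 or key_groups <= 0:
        raise ValueError("hash_num and key_groups must be positive")
    if hash_num % key_groups != 0:
        raise ValueError("key_groups must be a divisor of hash_num")


def _seeded_hash(seed, item):
    """Hash one item with one seed (illustrative choice of hash function)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, seeds):
    """Return one minimum hash value per seed for a non-empty set of items."""
    return [min(_seeded_hash(seed, item) for item in items) for seed in seeds]


def cluster_keys(signature, key_groups):
    """Assumption: every key_groups consecutive values form one cluster key."""
    return [
        tuple(signature[i:i + key_groups])
        for i in range(0, len(signature), key_groups)
    ]


def cluster_users(transactions, hash_num, key_groups):
    """Group users whose item sets produce an identical cluster key."""
    check_arguments(hash_num, key_groups)
    seeds = range(hash_num)
    buckets = defaultdict(set)
    for user, items in transactions.items():
        signature = minhash_signature(items, seeds)
        for position, key in enumerate(cluster_keys(signature, key_groups)):
            buckets[(position, key)].add(user)
    return dict(buckets)


if __name__ == "__main__":
    # Toy transaction history: user id -> set of purchased item ids.
    history = {"u1": {1, 2, 3}, "u2": {1, 2, 3}, "u3": {7, 8, 9}}
    for key, users in cluster_users(history, hash_num=6, key_groups=3).items():
        print(key[0], sorted(users))
```

Under this reading, a larger key_groups value means more hash values must match before two users share a key, which is consistent with the statement that a large number of key groups decreases the probability that multiple users receive the same cluster identifier.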

__repr__(self): Returns the string representation for a MinHash class instance.