Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT: no
Product Category: Teradata Vantage

MinHash

Description

The MinHash function uses transaction history to cluster similar items or users together. For example, the function can cluster items that are frequently bought together, or users who bought the same items.
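To make the idea concrete, here is a minimal base-R sketch of MinHash clustering on a toy purchase history. It is an illustration only, not the Vantage implementation: the data, the linear hash family, the prime `p`, and the reading of "key.groups" as the number of hash keys concatenated into one cluster identifier are all assumptions made for this sketch.

```r
# Toy transaction history: which user bought which item (hypothetical data).
sales <- data.frame(
  userid = c(1, 1, 1, 2, 2, 2, 3, 3),
  itemid = c(10, 11, 12, 10, 11, 12, 50, 51)
)

hash.num   <- 6  # number of hash functions
key.groups <- 3  # hash keys concatenated per cluster identifier (assumed meaning)
stopifnot(hash.num %% key.groups == 0)

# Random linear hash functions h(x) = (a * x + b) mod p, with p prime (assumed family).
set.seed(42)
p <- 10007
a <- sample.int(p - 1, hash.num)
b <- sample.int(p - 1, hash.num)

# MinHash signature of an item set: the minimum of each hash function over the items.
signature <- function(items) {
  sapply(seq_len(hash.num), function(i) min((a[i] * items + b[i]) %% p))
}

baskets    <- split(sales$itemid, sales$userid)
signatures <- lapply(baskets, signature)

# Concatenate consecutive runs of `key.groups` hash keys into cluster identifiers.
cluster_ids <- lapply(signatures, function(s) {
  groups <- split(s, rep(seq_len(hash.num / key.groups), each = key.groups))
  vapply(groups, paste, character(1), collapse = "-")
})

# Users 1 and 2 bought identical items, so every cluster identifier matches.
same_12 <- identical(cluster_ids[["1"]], cluster_ids[["2"]])

# Users 1 and 3 share no items, so no identifier matches position by position.
shared_13 <- sum(cluster_ids[["1"]] == cluster_ids[["3"]])
```

Users whose item sets overlap only partially would share some identifiers and not others, which is why the number of hash functions and the grouping affect how many clusters form and how large they are.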

Usage

  td_minhash_mle (
      data = NULL,
      id.column = NULL,
      items.column = NULL,
      hash.num = NULL,
      key.groups = NULL,
      seed.table = NULL,
      input.format = "integer",
      mincluster.size = 3,
      maxcluster.size = 5,
      delimiter = " ",
      data.sequence.column = NULL,
      seed.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the input data.

id.column

Required Argument.
Specifies the name of the column in "data" that contains the values to be hashed into the same cluster.
Typically these values are customer identifiers.
Types: character

items.column

Required Argument.
Specifies the name of the column in "data" that contains the values to use for hashing.
Types: character

hash.num

Required Argument.
Specifies the number of hash functions to generate. This argument determines the number and size of clusters generated.
Types: integer

key.groups

Required Argument.
Specifies the number of key groups to generate. The value of "key.groups" must be a divisor of "hash.num". A large value of "key.groups" decreases the probability that multiple users will be assigned to the same cluster identifier.
Types: integer
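The divisor requirement can be checked in plain R before calling the function. The snippet below is a small sketch; the helper name `is_valid_key_groups` is hypothetical and not part of the package.

```r
hash.num <- 1002

# A candidate is valid when hash.num divides evenly by it.
# (Hypothetical helper, not part of the Teradata Package for R.)
is_valid_key_groups <- function(key.groups, hash.num) {
  hash.num %% key.groups == 0
}

ok_3 <- is_valid_key_groups(3, hash.num)  # TRUE:  1002 = 3 * 334
ok_4 <- is_valid_key_groups(4, hash.num)  # FALSE: 1002 / 4 = 250.5

# Every divisor of hash.num, i.e. every value that passes the check.
candidates <- seq_len(hash.num)
valid <- candidates[hash.num %% candidates == 0]
```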

seed.table

Optional Argument.
Specifies the tbl_teradata that contains the seeds to be used for hashing. Typically, this is the "save.seed.to" tbl_teradata that was created by an earlier call to td_minhash_mle.
Note: When this argument is specified, the "save.seed.to" output tbl_teradata is not created in the current call to td_minhash_mle.

input.format

Optional Argument.
Specifies the format of the values in argument "items.column".
Default Value: "integer"
Permitted Values: bigint, integer, hex, string
Types: character

mincluster.size

Optional Argument.
Specifies the minimum cluster size.
Default Value: 3
Types: integer

maxcluster.size

Optional Argument.
Specifies the maximum cluster size.
Default Value: 5
Types: integer

delimiter

Optional Argument.
Specifies the delimiter used between hashed values (typically customer identifiers) in the output.
Default Value: " "
Types: character
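Because the hashed values are joined with this delimiter, a delimited string can be split back into individual identifiers in base R. The string below is a made-up stand-in for one such value; the actual output column names and contents are not shown on this page.

```r
# Hypothetical delimited string of customer identifiers, using the default delimiter " ".
clustered <- "101 205 317"
delimiter <- " "

# fixed = TRUE treats the delimiter literally rather than as a regular expression.
members <- strsplit(clustered, delimiter, fixed = TRUE)[[1]]
n_members <- length(members)
```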

data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "data". This argument ensures deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

seed.table.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "seed.table". This argument ensures deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_minhash_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. save.seed.to

  3. output

Examples

  
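    # Load the Teradata Package for R (tdplyr) and dplyr,
    # which provides tbl(), filter() and %>%.
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection.
    con <- td_get_context()$connection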
    
    # Load example data.
    loadExampleData("minhash_example", "salesdata")
    
    # Create object(s) of class "tbl_teradata".
    salesdata <- tbl(con, "salesdata")
    
    # Example 1 - Create clusters of users based on items purchased.
    td_minhash_out1 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 1002,
                                      key.groups = 3
                                      )

    # Example 2 - Use the previously generated seed table as input.
    # Select a subset of the seed table to restrict the number of clusters.
    td_minhash_out2 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 99,
                                      key.groups = 3,
                                      seed.table = td_minhash_out1$save.seed.to %>% filter(index < 99)
                                      )