Teradata Package for R Function Reference | 17.00 - MinHash

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Product Category: Teradata Vantage


The MinHash function uses transaction history to cluster similar items or users together. For example, the function can cluster items that are frequently bought together or users that bought the same items.
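As a rough illustration of the idea behind MinHash (not the internal implementation of td_minhash_mle), the base R sketch below computes a small MinHash signature for each user's item set. Users with similar item sets tend to agree on more signature entries. The data in `baskets` and the helper names `make_hashes`, `minhash_signature`, and `estimate_similarity` are hypothetical and exist only for this example.

```r
# Illustrative MinHash sketch in base R; not the td_minhash_mle implementation.
set.seed(42)

# Hypothetical transaction history: user -> integer item ids.
baskets <- list(
  u1 = c(1L, 2L, 3L, 4L),
  u2 = c(1L, 2L, 3L, 4L),
  u3 = c(7L, 8L, 9L)
)

# Build n random hash functions of the form h(x) = (a * x + b) mod p, p prime.
make_hashes <- function(n, p = 10007) {
  a <- sample.int(p - 1, n)   # multipliers in 1..p-1
  b <- sample.int(p, n) - 1   # offsets in 0..p-1
  lapply(seq_len(n), function(i) {
    force(i)
    function(x) (as.numeric(a[i]) * x + b[i]) %% p
  })
}

# The signature of a set holds, per hash function, the minimum hash over its items.
minhash_signature <- function(items, hashes) {
  vapply(hashes, function(h) min(h(items)), numeric(1))
}

hashes <- make_hashes(6)
signatures <- lapply(baskets, minhash_signature, hashes = hashes)

# The fraction of matching signature entries estimates the Jaccard similarity.
estimate_similarity <- function(s1, s2) mean(s1 == s2)

estimate_similarity(signatures$u1, signatures$u2)  # 1: identical item sets
estimate_similarity(signatures$u1, signatures$u3)  # 0: the sets share no items
```

In this toy setting, u1 and u2 would land in the same cluster and u3 would not. Real data needs many more hash functions to give a stable estimate.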


  td_minhash_mle (
      data = NULL,
      id.column = NULL,
      items.column = NULL,
      hash.num = NULL,
      key.groups = NULL,
      seed.table = NULL,
      input.format = "integer",
      mincluster.size = 3,
      maxcluster.size = 5,
      delimiter = " ",
      data.sequence.column = NULL,
      seed.table.sequence.column = NULL
  )



data
Required Argument.
Specifies the tbl_teradata containing the input data.


id.column
Required Argument.
Specifies the name of the column in "data" that contains the values to be hashed into the same cluster.
Typically these values are customer identifiers.
Types: character


items.column
Required Argument.
Specifies the name of the column in "data" that contains the values to use for hashing.
Types: character


hash.num
Required Argument.
Specifies the number of hash functions to generate. This argument determines the number and size of clusters generated.
Types: integer


key.groups
Required Argument.
Specifies the number of key groups to generate. The value of "key.groups" must be a divisor of "hash.num". A large "key.groups" value decreases the probability that multiple users are assigned to the same cluster identifier.
Types: integer
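
Because "key.groups" must divide "hash.num" evenly, it can be useful to check a candidate pair locally before calling the function. The following is a minimal base R sketch; the helpers `is_valid_pair` and `valid_key_groups` are hypothetical and are not part of the Teradata Package for R.

```r
# Hypothetical helpers (not part of tdplyr): check the divisor constraint locally.
is_valid_pair <- function(hash.num, key.groups) {
  hash.num %% key.groups == 0
}

# List every value of key.groups that divides a given hash.num.
valid_key_groups <- function(hash.num) {
  candidates <- seq_len(hash.num)
  candidates[hash.num %% candidates == 0]
}

is_valid_pair(1002, 3)   # TRUE: 1002 = 3 * 334
is_valid_pair(1002, 4)   # FALSE: 1002 is not a multiple of 4
valid_key_groups(99)     # 1 3 9 11 33 99
```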


seed.table
Optional Argument.
Specifies the tbl_teradata that contains the seeds to be used for hashing. Typically, this is the "save.seed.to" tbl_teradata that was created by an earlier call to td_minhash_mle.
Note: When this argument is specified, the "save.seed.to" output tbl_teradata is not created in the current call to td_minhash_mle.


input.format
Optional Argument.
Specifies the format of the values in argument "items.column".
Default Value: "integer"
Permitted Values: bigint, integer, hex, string
Types: character


mincluster.size
Optional Argument.
Specifies the minimum cluster size.
Default Value: 3
Types: integer


maxcluster.size
Optional Argument.
Specifies the maximum cluster size.
Default Value: 5
Types: integer


delimiter
Optional Argument.
Specifies the delimiter used between hashed values (typically customer identifiers) in the output.
Default Value: " "
Types: character
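
Once a delimited list of identifiers has been collected into R, it can be split back into individual values with base R. The sketch below assumes the default delimiter " " and uses a made-up string, `cluster_members`, as a stand-in; the actual output column names and layout are not shown on this page, so treat the value as hypothetical.

```r
# Hypothetical delimited value, using the default delimiter " ".
cluster_members <- "101 205 317"

# Split on the literal delimiter (fixed = TRUE avoids regex interpretation).
ids <- strsplit(cluster_members, " ", fixed = TRUE)[[1]]

ids          # "101" "205" "317"
length(ids)  # 3
```

Using `fixed = TRUE` matters if a delimiter such as "|" or "." is chosen, since those characters have special meaning in regular expressions.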


data.sequence.column
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


seed.table.sequence.column
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "seed.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


The function returns an object of class "td_minhash_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. save.seed.to

  3. output


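    # Load the packages used below (dplyr provides tbl(), filter() and %>%).
    library(tdplyr)
    library(dplyr)
    # Get the current context/connection.
    con <- td_get_context()$connection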
    # Load example data.
    loadExampleData("minhash_example", "salesdata")
    # Create object(s) of class "tbl_teradata".
    salesdata <- tbl(con, "salesdata")
    # Example 1 - Create clusters of users based on items purchased.
    td_minhash_out1 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 1002,
                                      key.groups = 3)

    # Example 2 - Use the previously generated seed table as input.
    # Select a subset of the seed table to restrict the number of clusters.
    td_minhash_out2 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 99,
                                      key.groups = 3,
                                      seed.table = td_minhash_out1$save.seed.to %>% filter(index < 99))