Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT
Product Category: Teradata Vantage

MinHash

Description

The MinHash function uses transaction history to cluster similar items or users together. For example, the function can cluster items that are frequently bought together or users that bought the same items.
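
To make the idea concrete, the following is a minimal base-R sketch of the general MinHash technique on a toy transaction table. It is not the in-database implementation behind `td_minhash_mle`. The data, the hash coefficients, and the names `minhash_signature`, `signature_keys`, and `key.size` are all hypothetical and chosen only for illustration. Each user's item set is reduced to a signature of minimum hash values, the signature is cut into keys, and users that share a key land in the same candidate cluster.

```r
# Toy transaction history (hypothetical data): one row per purchase.
sales <- data.frame(
  userid = c("u1", "u1", "u1", "u2", "u2", "u2", "u3", "u3"),
  itemid = c(1L, 2L, 3L, 1L, 2L, 3L, 7L, 8L),
  stringsAsFactors = FALSE
)

# Illustrative hash family h_k(x) = (a[k] * x + b[k]) mod p.
# The coefficients are arbitrary choices for this sketch.
p <- 101L
a <- c(3L, 5L, 7L, 11L)
b <- c(1L, 2L, 4L, 8L)

# MinHash signature: for each hash function, the smallest hash value
# over all items in the set.
minhash_signature <- function(items) {
  vapply(seq_along(a),
         function(k) min((a[k] * items + b[k]) %% p),
         integer(1))
}

# Cut a signature into consecutive keys of `key.size` hash values each.
# The position of the key is part of the key, so keys from different
# positions never collide.
key.size <- 2L
signature_keys <- function(sig) {
  groups <- split(sig, ceiling(seq_along(sig) / key.size))
  paste0(names(groups), ":",
         vapply(groups, paste, character(1), collapse = "-"))
}

# One signature per user, then one set of keys per user.
sigs <- lapply(split(sales$itemid, sales$userid), minhash_signature)
user.keys <- lapply(sigs, signature_keys)

# Long (user, key) table; users that share a key form a candidate cluster.
pairs <- data.frame(
  userid = rep(names(user.keys), lengths(user.keys)),
  key = unlist(user.keys, use.names = FALSE),
  stringsAsFactors = FALSE
)
clusters <- split(pairs$userid, pairs$key)
clusters <- Filter(function(members) length(members) >= 2, clusters)
print(clusters)
```

In this toy data, users with identical item sets produce identical signatures and therefore share every key, while a user with a disjoint item set does not share a key with them.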

Usage

  td_minhash_mle (
      data = NULL,
      id.column = NULL,
      items.column = NULL,
      hash.num = NULL,
      key.groups = NULL,
      seed.table = NULL,
      input.format = "integer",
      mincluster.size = 3,
      maxcluster.size = 5,
      delimiter = " ",
      data.sequence.column = NULL,
      seed.table.sequence.column = NULL
  )

Arguments

`data`	Required Argument. Specifies the tbl_teradata containing the input data.
`id.column`	Required Argument. Specifies the name of the column in "data" that contains the values to be hashed into the same cluster. Typically these values are customer identifiers. Types: character
`items.column`	Required Argument. Specifies the name of the column in "data" that contains the values to use for hashing. Types: character
`hash.num`	Required Argument. Specifies the number of hash functions to generate. This argument determines the number and size of clusters generated. Types: integer
`key.groups`	Required Argument. Specifies the number of key groups to generate. The "key.groups" must be a divisor of "hash.num". A large value in "key.groups" decreases the probability that multiple users will be assigned to the same cluster identifier. Types: integer
`seed.table`	Optional Argument. Specifies the tbl_teradata that contains the seeds to be used for hashing. Typically, this is the "save.seed.to" tbl_teradata that was created by an earlier call to `td_minhash_mle`. Note: When this argument is specified, the "save.seed.to" output tbl_teradata is not created in the current call to `td_minhash_mle`.
`input.format`	Optional Argument. Specifies the format of the values in argument "items.column". Default Value: "integer". Permitted Values: bigint, integer, hex, string. Types: character
`mincluster.size`	Optional Argument. Specifies the minimum cluster size. Default Value: 3. Types: integer
`maxcluster.size`	Optional Argument. Specifies the maximum cluster size. Default Value: 5. Types: integer
`delimiter`	Optional Argument. Specifies the delimiter used between hashed values (typically customer identifiers) in the output. Default Value: " " (a single space). Types: character
`data.sequence.column`	Optional Argument. Specifies the column(s) that uniquely identify each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`seed.table.sequence.column`	Optional Argument. Specifies the column(s) that uniquely identify each row of the input argument "seed.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
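
The divisor constraint between "hash.num" and "key.groups" can be checked locally before the function is called. The sketch below is a hypothetical helper, `check_minhash_args`, which is not part of the Teradata Package for R. The check that "mincluster.size" does not exceed "maxcluster.size" is an assumed sanity check that this reference does not state.

```r
# Hypothetical helper (not part of the Teradata Package for R):
# validate argument values before calling td_minhash_mle.
check_minhash_args <- function(hash.num, key.groups,
                               mincluster.size = 3, maxcluster.size = 5) {
  stopifnot(
    hash.num > 0,
    key.groups > 0,
    hash.num %% key.groups == 0,         # "key.groups" must be a divisor of "hash.num"
    mincluster.size <= maxcluster.size   # assumed sanity check, not stated in the reference
  )
  invisible(TRUE)
}

# Values used in the examples of this page: both satisfy the constraint.
check_minhash_args(hash.num = 1002, key.groups = 3)
check_minhash_args(hash.num = 99, key.groups = 3)

# 3 does not divide 100, so the check raises an error, caught here.
ok <- tryCatch(check_minhash_args(hash.num = 100, key.groups = 3),
               error = function(e) FALSE)
```

Failing fast on the client avoids a round trip to the database for a call that would be rejected anyway.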

Value

The function returns an object of class "td_minhash_mle", which is a named list containing objects of class "tbl_teradata".
Members of the named list can be referenced directly with the "$" operator using the following names:

output.table
save.seed.to
output

Examples

    # Load the Teradata Package for R (tdplyr) and dplyr,
    # which provides tbl(), filter() and %>%.
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection.
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("minhash_example", "salesdata")

    # Create object(s) of class "tbl_teradata".
    salesdata <- tbl(con, "salesdata")

    # Example 1 - Create clusters of users based on items purchased.
    td_minhash_out1 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 1002,
                                      key.groups = 3
                                      )

    # Example 2 - Use the previously generated seed table as input.
    # Select a subset of the seed table to restrict the number of clusters.
    td_minhash_out2 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 99,
                                      key.groups = 3,
                                      seed.table = td_minhash_out1$save.seed.to %>% filter(index < 99)
                                      )