Teradata R Package Function Reference | 17.00 - MinHash

Teradata® R Package Function Reference

Product name: Teradata R Package
Release: 17.00
Created date: September 2020
Category: Programming Reference
Feature number: B700-4007-090K

Description

The MinHash function uses transaction history to cluster similar items or users together. For example, the function can cluster items that are frequently bought together, or users who bought the same items.
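
The idea behind MinHash can be sketched locally in base R. This is only an illustration under stated assumptions, not the algorithm that td_minhash_mle runs in-database: the baskets, the linear hash family, the prime modulus, and the helper names (minhash_signature, estimated_similarity) are all hypothetical. Users whose signatures agree in many positions bought similar item sets.

```r
# Illustrative only: a local MinHash sketch, not Teradata's implementation.
# Hypothetical baskets: each user maps to a set of integer item IDs.
baskets <- list(
  u1 = c(1, 2, 3, 4),
  u2 = c(1, 2, 3, 4),
  u3 = c(1, 2, 3, 5),
  u4 = c(7, 8, 9)
)

hash_num <- 200    # number of hash functions (loosely analogous to "hash.num")
prime    <- 10007  # prime modulus, larger than any item ID used here

set.seed(1)
# Random coefficients for hash functions h_k(x) = (a_k * x + b_k) mod prime.
a <- sample.int(prime - 1, hash_num)
b <- sample.int(prime - 1, hash_num)

# Signature: for every hash function, the smallest hash value over the items.
minhash_signature <- function(items) {
  vapply(seq_len(hash_num),
         function(k) min((a[k] * items + b[k]) %% prime),
         numeric(1))
}

signatures <- lapply(baskets, minhash_signature)

# The share of matching signature positions approximates how much two
# baskets overlap (their Jaccard similarity).
estimated_similarity <- function(x, y) {
  mean(signatures[[x]] == signatures[[y]])
}

estimated_similarity("u1", "u2")  # identical baskets
estimated_similarity("u1", "u3")  # baskets that share most items
estimated_similarity("u1", "u4")  # disjoint baskets
```

Identical baskets always produce identical signatures, and with this hash family disjoint baskets never share a signature value; partially overlapping baskets fall in between.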

Usage

  td_minhash_mle (
      data = NULL,
      id.column = NULL,
      items.column = NULL,
      hash.num = NULL,
      key.groups = NULL,
      seed.table = NULL,
      input.format = "integer",
      mincluster.size = 3,
      maxcluster.size = 5,
      delimiter = " ",
      data.sequence.column = NULL,
      seed.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the input data.

id.column

Required Argument.
Specifies the name of the column in "data" that contains the values to be hashed into the same cluster.
Typically these values are customer identifiers.
Types: character

items.column

Required Argument.
Specifies the name of the column in "data" that contains the values to use for hashing.
Types: character

hash.num

Required Argument.
Specifies the number of hash functions to generate. This argument determines the number and size of clusters generated.
Types: integer

key.groups

Required Argument.
Specifies the number of key groups to generate. The value of "key.groups" must be a divisor of "hash.num". A large "key.groups" value decreases the probability that multiple users will be assigned to the same cluster identifier.
Types: integer
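
The divisor rule can be checked before calling the function. The helper below is a hypothetical convenience written for this illustration; it is not part of the Teradata R package.

```r
# Hypothetical helper (not part of the Teradata R package): returns TRUE when
# key.groups divides hash.num evenly.
check_key_groups <- function(hash.num, key.groups) {
  stopifnot(hash.num > 0, key.groups > 0)
  hash.num %% key.groups == 0
}

check_key_groups(1002, 3)  # TRUE:  1002 = 3 * 334
check_key_groups(99, 3)    # TRUE:  99 = 3 * 33
check_key_groups(100, 3)   # FALSE: 100 divided by 3 leaves remainder 1
```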

seed.table

Optional Argument.
Specifies the tbl_teradata that contains the seeds to be used for hashing. Typically, this is the "save.seed.to" tbl_teradata returned by an earlier call to td_minhash_mle.
Note: When this argument is specified, the current call to td_minhash_mle does not create the "save.seed.to" output tbl_teradata.

input.format

Optional Argument.
Specifies the format of the values in argument "items.column".
Default Value: "integer"
Permitted Values: bigint, integer, hex, string
Types: character

mincluster.size

Optional Argument.
Specifies the minimum cluster size.
Default Value: 3
Types: integer

maxcluster.size

Optional Argument.
Specifies the maximum cluster size.
Default Value: 5
Types: integer

delimiter

Optional Argument.
Specifies the delimiter used between hashed values (typically customer identifiers) in the output.
Default Value: " "
Types: character
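
Once results have been brought into R, a delimited string of hashed values can be split back into individual identifiers. The sketch below assumes the clustered values arrive as character strings such as "101 102 105"; the sample strings and variable names are hypothetical, because this page does not show the output columns.

```r
# Hypothetical cluster member strings, standing in for values read from the output.
clustered_users <- c("101 102 105", "203 204 207 208")

# Use the same delimiter that was passed to td_minhash_mle (default: a space).
delimiter <- " "

# Split each string on the delimiter and convert the pieces to integers.
members <- lapply(strsplit(clustered_users, delimiter, fixed = TRUE), as.integer)

lengths(members)  # number of identifiers in each cluster
```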

data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "data". The argument ensures deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

seed.table.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "seed.table". The argument ensures deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_minhash_mle", which is a named list containing objects of class "tbl_teradata".
Members of the named list can be referenced directly with the "$" operator, using the following names:

  1. output.table

  2. save.seed.to

  3. output

Examples

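    # Load the packages the examples rely on (assumes tdplyr and dplyr are
    # installed and a Teradata connection context has already been created).
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection.
    con <- td_get_context()$connection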
    
    # Load example data.
    loadExampleData("minhash_example", "salesdata")
    
    # Create object(s) of class "tbl_teradata".
    salesdata <- tbl(con, "salesdata")
    
    # Example 1 - Create clusters of users based on items purchased.
    td_minhash_out1 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 1002,
                                      key.groups = 3
                                      )

    # Example 2 - Use the previously generated seed table as input.
    # Select a subset of the seed table to restrict the number of clusters.
    # Note: filter() and %>% are dplyr functions.
    td_minhash_out2 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 99,
                                      key.groups = 3,
                                      seed.table = td_minhash_out1$save.seed.to %>% filter(index < 99)
                                      )