Teradata R Package Function Reference - MinHash

Teradata® R Package Function Reference

Product: Teradata R Package
Release Number: 16.20
Published: February 2020
Language: English (United States)
Last Update: 2020-02-28
dita:id: B700-4007
lifecycle: previous
Product Category: Teradata Vantage

Description

The MinHash (td_minhash_mle) function uses transaction history to cluster similar items or users together. For example, the function can cluster items that are frequently bought together or users that bought the same items.
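
The general MinHash idea behind this kind of clustering can be sketched in base R. This is an illustration of the generic technique only, not the td_minhash_mle implementation: the baskets, the hash family, and every name below are assumptions made for the example. Users with identical item sets always receive identical signatures, and users with no items in common never match under this particular hash family.

```r
# Generic MinHash sketch in base R; NOT the td_minhash_mle implementation.
# All names and data here are hypothetical.
set.seed(42)

# Items bought by each user (integer item identifiers).
baskets <- list(
  u1 = c(1, 2, 3, 4),
  u2 = c(1, 2, 3, 4),
  u3 = c(7, 8, 9)
)

n_hash <- 6       # plays the role of "hash.num" in this sketch
prime  <- 10007   # modulus for the illustrative hash family
a <- sample(1:(prime - 1), n_hash)  # nonzero multipliers
b <- sample(0:(prime - 1), n_hash)  # offsets

# One MinHash value per hash function: the minimum hash over the user's items.
minhash_signature <- function(items) {
  vapply(seq_len(n_hash),
         function(k) min((a[k] * items + b[k]) %% prime),
         numeric(1))
}

signatures <- lapply(baskets, minhash_signature)

# The fraction of matching signature positions estimates the Jaccard
# similarity of two item sets.
estimated_similarity <- function(x, y) {
  mean(signatures[[x]] == signatures[[y]])
}

print(estimated_similarity("u1", "u2"))  # identical baskets: 1
print(estimated_similarity("u1", "u3"))  # disjoint baskets: 0
```

Users whose signatures agree are candidates for the same cluster; how td_minhash_mle turns hash values into cluster identifiers is controlled by the arguments described below.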

Usage

  td_minhash_mle (
      data = NULL,
      id.column = NULL,
      items.column = NULL,
      hash.num = NULL,
      key.groups = NULL,
      seed.table = NULL,
      input.format = "integer",
      mincluster.size = 3,
      maxcluster.size = 5,
      delimiter = " ",
      data.sequence.column = NULL,
      seed.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the input table, supplied as a tbl_teradata object (the Examples create it with tbl()).

id.column

Required Argument.
Specifies the name of the column in "data" that contains the values to be hashed into the same cluster.
Typically these values are customer identifiers.

items.column

Required Argument.
Specifies the name of the input column that contains the values to use for hashing.

hash.num

Required Argument.
Specifies the number of hash functions to generate. This argument determines the number and size of clusters generated.

key.groups

Required Argument.
Specifies the number of key groups to generate. The value of "key.groups" must be a divisor of "hash.num" (for example, 3 is a divisor of 1002). A large "key.groups" value decreases the probability that multiple users will be assigned to the same cluster identifier.
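
The divisor requirement can be checked before calling the function. The following is a minimal sketch in base R; the helpers is_valid_key_groups and valid_key_groups are hypothetical and are not part of the Teradata R Package.

```r
# Hypothetical helpers; not part of the Teradata R Package.

# TRUE when key.groups divides hash.num evenly.
is_valid_key_groups <- function(hash.num, key.groups) {
  stopifnot(hash.num > 0, key.groups > 0)
  hash.num %% key.groups == 0
}

# All values of key.groups that are divisors of a given hash.num.
valid_key_groups <- function(hash.num) {
  candidates <- seq_len(hash.num)
  candidates[hash.num %% candidates == 0]
}

is_valid_key_groups(1002, 3)  # TRUE
is_valid_key_groups(100, 3)   # FALSE
valid_key_groups(99)          # 1 3 9 11 33 99
```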

seed.table

Optional Argument.
Specifies the name of the tbl_teradata that contains the seeds to be used for hashing. Typically, this is the "save.seed.to" table created by an earlier call to td_minhash_mle.
Note: When this argument is specified, the "save.seed.to" output table is not created in the current call to td_minhash_mle.

input.format

Optional Argument.
Specifies the format of the values in argument "items.column".
Default Value: "integer"
Permitted Values: bigint, integer, hex, string

mincluster.size

Optional Argument.
Specifies the minimum cluster size.
Default Value: 3

maxcluster.size

Optional Argument.
Specifies the maximum cluster size.
Default Value: 5

delimiter

Optional Argument.
Specifies the delimiter used between hashed values (typically customer identifiers) in the output.
Default Value: " "
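
Once the output has been collected into local R, a delimited string of identifiers can be split back into individual values. The sketch below uses base R only; the string "101 205 317" is a made-up value for illustration, and the actual output columns and contents may differ.

```r
# Hypothetical delimited value; real output contents may differ.
cluster_members <- "101 205 317"
delimiter <- " "  # the default value of the "delimiter" argument

# fixed = TRUE treats the delimiter as a literal string, not a regular expression.
ids <- strsplit(cluster_members, delimiter, fixed = TRUE)[[1]]
ids  # "101" "205" "317"
```

Choosing a delimiter that cannot occur inside the identifiers themselves keeps this split unambiguous.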

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

seed.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "seed.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_minhash_mle", which is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. save.seed.to

  3. output
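
Access with the "$" operator works as it does for any named list. The sketch below builds a local stand-in with the same class and member names so that it runs without a Vantage connection; the empty data frames are placeholders and do not represent the real tbl objects or their columns.

```r
# Local stand-in for illustration only; a real result comes from td_minhash_mle()
# and holds Teradata tbl objects rather than empty data frames.
result <- structure(
  list(
    output.table = data.frame(),
    save.seed.to = data.frame(),
    output       = data.frame()
  ),
  class = "td_minhash_mle"
)

names(result)                   # "output.table" "save.seed.to" "output"
seed_table <- result$save.seed.to
class(seed_table)               # "data.frame" for this stand-in
```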

Examples

    # Load the required packages (tdplyr is the Teradata R Package;
    # dplyr provides %>% and filter()).
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection.
    # Assumes a context was already created, for example with td_create_context().
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("minhash_example", "salesdata")

    # Create remote tibble objects.
    salesdata <- tbl(con, "salesdata")

    # Example 1 - Create clusters of users based on items purchased.
    # hash.num (1002) is a multiple of key.groups (3), as required.
    td_minhash_out <- td_minhash_mle(data = salesdata,
                                     id.column = "userid",
                                     items.column = "itemid",
                                     hash.num = 1002,
                                     key.groups = 3
                                     )

    # Example 2 - Use the previously generated seed table as input.
    # Select a subset of the seed table to restrict the number of clusters.
    td_minhash_out1 <- td_minhash_mle(data = salesdata,
                                      id.column = "userid",
                                      items.column = "itemid",
                                      hash.num = 99,
                                      key.groups = 3,
                                      seed.table = td_minhash_out$save.seed.to %>% filter(index < 99)
                                      )