Teradata R Package Function Reference - 16.20 - HMMUnsupervised - Teradata R Package

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

The HMMUnsupervisedLearner (td_hmm_unsupervised_mle) function is available on the SQL-Graph platform. The function can train multiple hidden Markov models (HMMs) simultaneously. Each model is learned from a set of sequences, and each sequence represents a vertex.

Usage

  td_hmm_unsupervised_mle (
      vertices = NULL,
      model.key = NULL,
      sequence.key = NULL,
      observed.key = NULL,
      hidden.states.num = NULL,
      max.iter.num = 10,
      epsilon = NULL,
      skip.column = NULL,
      init.methods = NULL,
      init.params = NULL,
      vertices.sequence.column = NULL,
      vertices.partition.column = NULL,
      vertices.order.column = NULL
  )

Arguments

vertices

Required Argument.
Specifies the input vertex table.

vertices.partition.column

Required Argument.
Specifies the Partition By columns for vertices. Values for this argument can be provided as a vector if multiple columns are used for partitioning.

vertices.order.column

Required Argument.
Specifies the Order By columns for vertices. Values for this argument can be provided as a vector if multiple columns are used for ordering.

model.key

Required Argument.
Specifies the name of the column that contains the model attribute. It must match one of the columns specified in the "vertices.partition.column" argument. The values in the column can be integers or strings.

sequence.key

Required Argument.
Specifies the name of the column that contains the sequence attribute. It must match one of the columns specified in the "vertices.partition.column" argument. A sequence must contain more than two observation symbols.
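The sequence-length requirement can be checked on a local copy of the data before the function is called. The following is a minimal sketch in base R. The data frame vertices_df and its values are hypothetical, and the check assumes that the requirement refers to the number of observation rows in each sequence.

```r
# Hypothetical local copy of the vertex data. The column names follow the
# example at the end of this page; the values are made up for illustration.
vertices_df <- data.frame(
  model_id    = c(1, 1, 1, 1, 1),
  seq_id      = c(10, 10, 10, 11, 11),
  observed_id = c("yes", "no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Count the observation rows in each (model, sequence) pair.
seq_lengths <- aggregate(
  observed_id ~ model_id + seq_id,
  data = vertices_df,
  FUN = length
)
names(seq_lengths)[3] <- "n_obs"

# Sequences with two or fewer observations do not meet the requirement
# (assumption: the requirement counts rows, not distinct symbols).
too_short <- seq_lengths[seq_lengths$n_obs <= 2, ]
print(too_short)
```

Any rows left in too_short identify sequences to extend or remove before training.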

observed.key

Required Argument.
Specifies the name of the column that contains the observed symbols. The function scans the input tbl_teradata to find all possible observed symbols.
Note: Observed symbols are case-sensitive.

hidden.states.num

Required Argument.
Specifies the number of hidden states.
Note: The number of hidden states can influence model quality and performance, so choose the number appropriately.
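One way to see how the number of hidden states affects model size is to count the free parameters of a discrete HMM. The sketch below uses the standard count for M hidden states and N observed symbols. The helper hmm_free_params is hypothetical and is not part of the Teradata R package.

```r
# Number of free parameters in a discrete HMM with m hidden states and
# n observed symbols. Each probability row loses one degree of freedom
# because it must sum to 1.
hmm_free_params <- function(m, n) {
  initial    <- m - 1        # initial state probabilities
  transition <- m * (m - 1)  # state transition probabilities
  emission   <- m * (n - 1)  # emission probabilities
  initial + transition + emission
}

# Parameter counts for 2 to 5 hidden states and 2 observed symbols.
param_counts <- sapply(2:5, hmm_free_params, n = 2)
print(param_counts)
```

The count grows quadratically with the number of hidden states, so larger values need more training data and more computation per iteration.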

max.iter.num

Optional Argument.
Specifies the maximum number of iterations that the training process runs before the function completes.
Default Value: 10

epsilon

Optional Argument.
Specifies the threshold value used to determine the convergence of HMM training. If the parameter value difference is less than the threshold, the training process converges. There is no default value. If you do not specify epsilon, the "max.iter.num" argument determines when the training process stops.
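The interaction between "epsilon" and "max.iter.num" can be sketched as a generic stopping rule. This is an illustration only, not the function's internal code: update_fn is a hypothetical stand-in for one training step, and the largest absolute parameter change is an assumed difference measure, since the source does not state which one is used.

```r
# Iterate until the largest parameter change is below epsilon, or until
# max_iter iterations have run. With epsilon = NULL, only max_iter applies.
run_until_converged <- function(params, update_fn, max_iter = 10, epsilon = NULL) {
  iterations <- 0
  converged <- FALSE
  while (iterations < max_iter) {
    new_params <- update_fn(params)
    iterations <- iterations + 1
    delta <- max(abs(new_params - params))  # assumed difference measure
    params <- new_params
    if (!is.null(epsilon) && delta < epsilon) {
      converged <- TRUE
      break
    }
  }
  list(params = params, iterations = iterations, converged = converged)
}

# Toy update that halves the remaining distance to 1 on every step.
halve_gap <- function(p) p + (1 - p) / 2

with_eps <- run_until_converged(0, halve_gap, max_iter = 10, epsilon = 0.01)
no_eps   <- run_until_converged(0, halve_gap, max_iter = 10)
```

In this toy run, the call with epsilon stops early once the change drops below 0.01, while the call without epsilon always runs all 10 iterations.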

skip.column

Optional Argument.
Specifies the name of the column whose values determine whether the function skips the row. The function skips the row if the value is "true", "yes", "y", or "1". The function does not skip the row if the value is "false", "f", "no", "n", "0", or NULL.
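The skip rule can be mirrored locally to preview which rows would be skipped. The helper should_skip is hypothetical. It assumes exact, case-sensitive matching, which the description above does not confirm, and it treats a database NULL as NA in R.

```r
# Values that cause a row to be skipped, as listed in the description.
skip_values <- c("true", "yes", "y", "1")

# Returns TRUE for rows that would be skipped. NA (a database NULL) and
# every other value return FALSE.
should_skip <- function(value) {
  as.character(value) %in% skip_values
}

flags <- c("true", "no", "1", NA, "f", "y")
skipped <- should_skip(flags)
print(skipped)
```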

init.methods

Optional Argument.
Specifies the method that the function uses to generate the initial parameters for the initial state probabilities, state transition probabilities, and emission probabilities. The possibilities are:

  1. random (default): The initial parameters are drawn from a uniform distribution.

  2. flat: The probabilities are equal. Each cell holds the same probability in the matrix or vector.

  3. input: The function takes the initial parameters from the "init.params" argument.

The names of these methods are case-insensitive. An optional seed number can follow the method name; the seed is meaningful only when the specified method is random. To specify a seed, pass it as the second element of the vector, for example: c('random','25').
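The difference between the random and flat methods can be illustrated in base R. This is a sketch of the idea only. The helpers flat_matrix and random_matrix are hypothetical, the row rescaling is an assumption made so that each row is a valid probability distribution, and R's random number generator will not reproduce the values that the function generates for the same seed.

```r
# "flat": every cell in a row holds the same probability.
flat_matrix <- function(rows, cols) {
  matrix(1 / cols, nrow = rows, ncol = cols)
}

# "random": draw from a uniform distribution, then rescale each row so that
# it sums to 1 (the rescaling is an assumption of this sketch).
random_matrix <- function(rows, cols, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  draws <- matrix(runif(rows * cols), nrow = rows, ncol = cols)
  draws / rowSums(draws)
}

# Transition matrices for three hidden states.
flat_transition   <- flat_matrix(3, 3)
random_transition <- random_matrix(3, 3, seed = 25)
```

Fixing the seed makes the random start repeatable, which is why the seed matters only for the random method.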

init.params

Optional Argument.
When the "init.methods" argument has the value "input", this argument specifies the initial parameters for the models. The first parameter specifies the initial state probabilities, the second parameter specifies the state transition probabilities, and the third parameter specifies the emission probabilities. For example, if the "hidden.states.num" argument specifies three (M) hidden states and the input contains two (N) observed symbols ("yes" and "no"), then the init.params values are:

  1. init_state_probability_vector (the initial state probabilities): Vector of size M. Example: "0.3333333333 0.3333333333 0.3333333333"

  2. state_transition_probability_matrix (the state transition probabilities): Matrix of dimensions M x M. Example: "0.3333333333 0.3333333333 0.3333333333; 0.3333333333 0.3333333333 0.3333333333; 0.3333333333 0.3333333333 0.3333333333"

  3. observation_emission_probability_matrix (the emission probabilities): Matrix of dimensions M x N. Example: "no:0.25 yes:0.75; no:0.35 yes:0.65; no:0.45 yes:0.55"

For the above example, specify "init.params" as follows:

    c("0.3333333333 0.3333333333 0.3333333333",
      "0.3333333333 0.3333333333 0.3333333333; 0.3333333333 0.3333333333 0.3333333333; 0.3333333333 0.3333333333 0.3333333333",
      "no:0.25 yes:0.75; no:0.35 yes:0.65; no:0.45 yes:0.55")

In the initial state probabilities, state transition probabilities, and emission probabilities, the probabilities in each row must sum to 1.0 after rounding. The observed symbols are case-sensitive. The number of states and the number of observed symbols must be consistent with the "hidden.states.num" argument and the observed symbols in the input table; otherwise, the function displays error messages.
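Writing these strings by hand is error-prone, so they can be generated from ordinary R vectors and matrices. The following is a minimal sketch; the helpers format_probs, format_matrix, format_emission, and check_rows are hypothetical and not part of the Teradata R package, and the tolerance used for the row-sum check is an assumption.

```r
# Format a probability vector as "p1 p2 p3" with fixed decimal places.
format_probs <- function(p, digits = 10) {
  paste(formatC(p, digits = digits, format = "f"), collapse = " ")
}

# Format a matrix row by row, separating rows with "; ".
format_matrix <- function(m, digits = 10) {
  paste(apply(m, 1, format_probs, digits = digits), collapse = "; ")
}

# Format an emission matrix as "symbol:p symbol:p; ...". The column names
# supply the observed symbols, which are case-sensitive.
format_emission <- function(m, digits = 2) {
  rows <- apply(m, 1, function(p) {
    paste(paste0(colnames(m), ":", formatC(p, digits = digits, format = "f")),
          collapse = " ")
  })
  paste(rows, collapse = "; ")
}

# Stop if any row does not sum to 1 within a tolerance (assumed value).
check_rows <- function(m, tol = 1e-6) {
  stopifnot(all(abs(rowSums(m) - 1) < tol))
}

# Three hidden states and two observed symbols, as in the example above.
initial    <- rep(1 / 3, 3)
transition <- matrix(1 / 3, nrow = 3, ncol = 3)
emission   <- matrix(c(0.25, 0.75,
                       0.35, 0.65,
                       0.45, 0.55),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(NULL, c("no", "yes")))

check_rows(matrix(initial, nrow = 1))
check_rows(transition)
check_rows(emission)

init_params <- c(format_probs(initial),
                 format_matrix(transition),
                 format_emission(emission))
print(init_params)
```

The resulting character vector has the same three-element shape as the example above and could be passed as "init.params" together with init.methods = "input".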

vertices.sequence.column

Optional Argument.
Specifies the column, or vector of columns, that uniquely identifies each row of the input argument "vertices". This argument ensures deterministic results for functions whose results can otherwise vary from run to run.

Value

The function returns an object of class "td_hmm_unsupervised_mle", which is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.initialstate.table

  2. output.statetransition.table

  3. output.emission.table

  4. output

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("hmmunsupervised_example", "loan_prediction")
    
    # Create remote tibble objects.
    loan_prediction <- tbl(con, "loan_prediction")
    
    # Example 1 - Build an HMM Unsupervised model on the loan prediction dataset
    td_hmm_unsupervised_out <- td_hmm_unsupervised_mle(
        vertices = loan_prediction,
        vertices.partition.column = c("model_id", "seq_id"),
        vertices.order.column = c("seq_vertex_id"),
        model.key = "model_id",
        sequence.key = "seq_id",
        observed.key = "observed_id",
        hidden.states.num = 3,
        init.methods = c("random", "25")
    )