Teradata R Package Function Reference - 16.20 - NaiveBayesTextClassifier - Teradata R Package

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

The td_naivebayes_textclassifier_mle function takes training data as input and outputs a model table.

Usage

  td_naivebayes_textclassifier_mle (
      data = NULL,
      data.partition.column = NULL,
      token.column = NULL,
      doc.id.columns = NULL,
      doc.category.column = NULL,
      model.type = "MULTINOMIAL",
      categories.data = NULL,
      category.column = "[0:0]",
      prediction.categories = NULL,
      stopwords.data = NULL,
      stopwords.column = NULL,
      stopwords.list = NULL,
      data.sequence.column = NULL,
      stopwords.data.sequence.column = NULL,
      categories.data.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the table defining the training tokens.

data.partition.column

Specifies the Partition By columns for 'data'.
If multiple columns are used for partitioning, provide them as a vector.

token.column

Required Argument.
Specifies the name of the tbl_teradata 'data' column that contains the tokens to be classified.

doc.id.columns

Optional Argument.
Specifies the names of the tbl_teradata 'data' columns that contain the document identifier.

doc.category.column

Required Argument.
Specifies the name of the tbl_teradata 'data' column that contains the document category.

model.type

Optional Argument.
Specifies the model type of the text classifier.
Default Value: "MULTINOMIAL"
Permitted Values: MULTINOMIAL, BERNOULLI
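As a rough illustration of how the two model types differ, the sketch below estimates per-category token likelihoods in base R on a small made-up token table. It uses the textbook add-one (Laplace) smoothed estimates, which is an assumption: the exact formulas used by the function may differ. The data and all variable names are hypothetical.

```r
# Toy training tokens, one row per token occurrence (hypothetical data).
tokens <- data.frame(
  doc_id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  token    = c("fire", "fire", "truck", "fire", "road", "road", "road", "sun"),
  category = c("crash", "crash", "crash", "crash", "crash",
               "no_crash", "no_crash", "no_crash"),
  stringsAsFactors = FALSE
)

vocab <- sort(unique(tokens$token))
cats  <- sort(unique(tokens$category))

# Multinomial: every occurrence of a token in a category is counted.
tf <- unclass(table(factor(tokens$token, levels = vocab),
                    factor(tokens$category, levels = cats)))
multinomial <- sweep(tf + 1, 2, colSums(tf) + length(vocab), "/")

# Bernoulli: a token is counted at most once per document.
doc_tokens <- unique(tokens)
df <- unclass(table(factor(doc_tokens$token, levels = vocab),
                    factor(doc_tokens$category, levels = cats)))
docs_per_cat <- sapply(cats, function(k) {
  length(unique(tokens$doc_id[tokens$category == k]))
})
bernoulli <- sweep(df + 1, 2, docs_per_cat + 2, "/")

print(multinomial)
print(bernoulli)
```

In this toy data, the token "fire" in category "crash" gets 4/9 under the multinomial estimate (which uses occurrence counts) and 3/4 under the Bernoulli estimate (which uses document presence only).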

categories.data

Optional Argument.
Specifies the table defining allowed categories.

category.column

Optional Argument.
Specifies the name of the 'categories.data' column that contains the prediction categories. The default value is the first column of 'categories.data'.
Default Value: "[0:0]"

prediction.categories

Optional Argument.
Specifies the prediction categories.
Note: Specify either this argument or 'categories.data', but not both.

stopwords.data

Optional Argument.
Specifies the table defining stop words.

stopwords.column

Optional Argument.
Specifies the name of the 'stopwords.data' column that contains the stop words. The default value is the first column of 'stopwords.data'.

stopwords.list

Optional Argument.
Specifies words to ignore (such as a, an, and the).
Note: Specify either this argument or 'stopwords.data', but not both.
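The two either/or notes (for 'prediction.categories' and for 'stopwords.list') can be checked on the client before the function is called. The helper below is a minimal sketch of that idea. It is hypothetical and not part of the Teradata R Package: the name train_text_classifier and its 'trainer' parameter are assumptions made for illustration.

```r
# Hypothetical helper (not part of the Teradata R Package): checks the
# documented either/or rules, then forwards all arguments to the trainer.
# The default trainer is only looked up when the call is forwarded.
train_text_classifier <- function(...,
                                  trainer = td_naivebayes_textclassifier_mle) {
  args <- list(...)
  # Use [[ ]] for exact name matching; `$` would partially match names such
  # as "categories.data.sequence.column".
  if (!is.null(args[["prediction.categories"]]) &&
      !is.null(args[["categories.data"]])) {
    stop("Specify either 'prediction.categories' or 'categories.data', not both.")
  }
  if (!is.null(args[["stopwords.list"]]) &&
      !is.null(args[["stopwords.data"]])) {
    stop("Specify either 'stopwords.list' or 'stopwords.data', not both.")
  }
  do.call(trainer, args)
}
```

With an active connection and the token_table object from the Examples section, a call could pass data = token_table, the required column arguments, and stopwords.list = c("a", "an", "the"). Supplying 'stopwords.data' in the same call would stop with an error before anything is sent to the database.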

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

stopwords.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "stopwords.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

categories.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "categories.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_naivebayes_textclassifier_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: result

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("naivebayes_textclassifier_example", "token_table")
    
    # Create remote tibble objects.
    token_table <- tbl(con, "token_table")
    
    # Example 1 - Train a Bernoulli model on the token table, partitioned by category.
    naivebayes_textclassifier_out <- td_naivebayes_textclassifier_mle(
                                           data = token_table,
                                           data.partition.column = c("category"),
                                           token.column = "token",
                                           doc.id.columns = c("doc_id"),
                                           doc.category.column = "category",
                                           model.type = "BERNOULLI"
                                           )