Description
The NaiveBayesTextClassifierTrainer function takes training data (tokens) as input
and outputs the trained model as a tbl_teradata.
Usage
td_naivebayes_textclassifier_mle (
data = NULL,
data.partition.column = NULL,
token.column = NULL,
doc.id.columns = NULL,
doc.category.column = NULL,
model.type = "MULTINOMIAL",
categories.data = NULL,
category.column = "[0:0]",
prediction.categories = NULL,
stopwords.data = NULL,
stopwords.column = NULL,
stopwords.list = NULL,
data.sequence.column = NULL,
stopwords.data.sequence.column = NULL,
categories.data.sequence.column = NULL,
data.order.column = NULL,
stopwords.data.order.column = NULL,
categories.data.order.column = NULL
)
Arguments
data |
Required Argument.
Specifies the tbl_teradata defining the training tokens.
|
data.partition.column |
Required Argument.
Specifies Partition By columns for "data".
Values to this argument can be provided as a vector, if multiple
columns are used for partition.
Types: character OR vector of Strings (character)
|
data.order.column |
Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
token.column |
Required Argument.
Specifies the name of the column in "data" tbl_teradata, that contains the tokens
to be classified.
Types: character
|
doc.id.columns |
Optional Argument. Required when "model.type" is 'BERNOULLI'.
Specifies the names of the columns, in "data" tbl_teradata, that contain the
document identifier.
Note: Do not provide this argument when "model.type" is 'MULTINOMIAL';
doing so raises an exception.
Types: character OR vector of Strings (character)
|
doc.category.column |
Required Argument.
Specifies the name of the column in "data" tbl_teradata, that contains the
document category.
Types: character
|
model.type |
Optional Argument.
Specifies the model type of the text classifier.
Default Value: "MULTINOMIAL"
Permitted Values: MULTINOMIAL, BERNOULLI
Types: character
|
categories.data |
Optional Argument.
Specifies the tbl_teradata defining allowed categories.
|
categories.data.order.column |
Optional Argument.
Specifies Order By columns for "categories.data".
Values to this argument can be provided as a vector, if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
category.column |
Optional Argument.
Specifies the name of the column in "categories.data" tbl_teradata, that contains the
prediction categories. The default value is the first column of "categories.data"
tbl_teradata.
Default Value: "[0:0]"
Types: character
|
prediction.categories |
Optional Argument.
Specifies the prediction categories.
Note: Specify either this argument or the "categories.data" argument, but not both.
Types: character OR vector of characters
|
stopwords.data |
Optional Argument.
Specifies the tbl_teradata defining stop words.
|
stopwords.data.order.column |
Optional Argument.
Specifies Order By columns for "stopwords.data".
Values to this argument can be provided as a vector, if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
stopwords.column |
Optional Argument.
Specifies the name of the column in "stopwords.data" tbl_teradata, that contains the
stop words. The default value is the first column of "stopwords.data" tbl_teradata.
Types: character
|
stopwords.list |
Optional Argument.
Specifies words to ignore (such as a, an, and the).
Note: Specify either this argument or the "stopwords.data" argument, but not both.
Types: character OR vector of characters
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
stopwords.data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "stopwords.data". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
Types: character OR vector of Strings (character)
|
categories.data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "categories.data". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
Types: character OR vector of Strings (character)
|
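The constraints among these arguments can be summarized in a short pre-flight check. The sketch below is illustrative only: check_nb_args is a hypothetical helper, not part of tdplyr. It encodes only the rules stated above, does not contact the database, and is not a substitute for the function's own validation.

```r
# Hypothetical helper (not part of tdplyr): checks the argument
# combinations described in the Arguments section before calling
# td_naivebayes_textclassifier_mle(). Returns a character vector of
# problems; an empty vector means no documented rule is violated.
check_nb_args <- function(model.type = "MULTINOMIAL",
                          doc.id.columns = NULL,
                          categories.data = NULL,
                          prediction.categories = NULL,
                          stopwords.data = NULL,
                          stopwords.list = NULL) {
  problems <- character(0)

  # Case is normalized here for convenience; whether the real function
  # is case-sensitive is not stated in the text above.
  model.type <- toupper(model.type)

  # Only the two documented model types are permitted.
  if (!model.type %in% c("MULTINOMIAL", "BERNOULLI")) {
    problems <- c(problems, "model.type must be MULTINOMIAL or BERNOULLI")
  }
  # doc.id.columns is required for BERNOULLI and disallowed for MULTINOMIAL.
  if (identical(model.type, "BERNOULLI") && is.null(doc.id.columns)) {
    problems <- c(problems, "doc.id.columns is required for BERNOULLI")
  }
  if (identical(model.type, "MULTINOMIAL") && !is.null(doc.id.columns)) {
    problems <- c(problems, "doc.id.columns is not allowed for MULTINOMIAL")
  }
  # Categories and stop words may each come from one source only.
  if (!is.null(categories.data) && !is.null(prediction.categories)) {
    problems <- c(problems,
                  "give categories.data or prediction.categories, not both")
  }
  if (!is.null(stopwords.data) && !is.null(stopwords.list)) {
    problems <- c(problems,
                  "give stopwords.data or stopwords.list, not both")
  }
  problems
}

# A consistent combination yields an empty vector.
check_nb_args(model.type = "BERNOULLI", doc.id.columns = "doc_id")

# An inconsistent combination yields one message per violated rule.
check_nb_args(model.type = "MULTINOMIAL", doc.id.columns = "doc_id")
```

Collecting every violation in one vector, rather than stopping at the first, makes it easier to correct a call in one pass.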
Value
Function returns an object of class "td_naivebayes_textclassifier_mle",
which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("naivebayes_textclassifier_example", "token_table")
# Create object(s) of class "tbl_teradata".
token_table <- tbl(con, "token_table")
# Example 1 - Bernoulli model; "doc.id.columns" is required for this model type.
naivebayes_textclassifier_out <- td_naivebayes_textclassifier_mle(
data = token_table,
data.partition.column = c("category"),
token.column = "token",
doc.id.columns = c("doc_id"),
doc.category.column = "category",
model.type = "BERNOULLI"
)
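# Example 2 - A sketch of a multinomial model, assuming the same "token_table"
# columns as Example 1 and an active Teradata connection. "doc.id.columns" is
# omitted because it must not be provided when "model.type" is 'MULTINOMIAL'.
# The stop words are illustrative, and the name "naivebayes_textclassifier_out2"
# is chosen here only to keep the two results apart.
naivebayes_textclassifier_out2 <- td_naivebayes_textclassifier_mle(
data = token_table,
data.partition.column = c("category"),
token.column = "token",
doc.category.column = "category",
model.type = "MULTINOMIAL",
stopwords.list = c("a", "an", "the")
)
# The model tbl_teradata is the "result" member of the returned named list.
naivebayes_textclassifier_out2$result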