Description
The TextClassifierTrainer function trains a machine learning
classifier for text classification and installs the model file on
ML Engine. The model file can then be used as input to the
td_text_classifier_mle function.
Usage
td_text_classifier_trainer_mle (
data = NULL,
text.column = NULL,
category.column = NULL,
classifier.type = "MaxEnt",
classifier.parameters = NULL,
nlp.parameters = NULL,
feature.selection = NULL,
model.file = NULL,
data.sequence.column = NULL
)
Arguments
data |
Required Argument.
Specifies the tbl_teradata that contains the documents to
use to train the model.
|
text.column |
Required Argument.
Specifies the name of the column that contains the text of the
training documents.
Types: character
|
category.column |
Required Argument.
Specifies the name of the column that contains the category of the
training documents.
Types: character
|
classifier.type |
Required Argument.
Specifies the classifier type of the model, KNN algorithm or Maximum
Entropy model.
Default Value: "MaxEnt"
Permitted Values: MaxEnt, KNN
Types: character
|
classifier.parameters |
Optional Argument.
Applies only if the classifier type of the model is KNN. Specifies
parameters for the classifier. The name must be "compress" and the
value must be in the range (0, 1). The n training documents are
clustered into value*n groups (for example, if there are 100 training
documents, then classifier.parameters = "compress:0.6" clusters them
into 60 groups), and the model uses the center of each group as the
feature vector.
Types: character OR vector of characters
|
nlp.parameters |
Optional Argument.
Specifies Natural Language Processing (NLP) parameters for
preprocessing the text data and producing tokens:
tokenDictFile:token_file - token_file is the name
of an ML Engine file in which each line contains a phrase,
followed by a space, followed by the token for the phrase
(and nothing else).
stopwordsFile:stopword_file - stopword_file is the
name of an ML Engine file in which each line contains exactly
one stop word (a word to ignore during tokenization, such as
a, an, or the).
useStem:true|false - Specifies whether the function
stems the tokens. The default value is "false".
stemIgnoreFile:stem_ignore_file - stem_ignore_file is
the name of an ML Engine file in which each line contains
exactly one word to ignore during stemming. Specifying this parameter
with "useStem:false" causes an exception.
useBgram:true|false - Specifies whether the function uses
Bigram, which considers the proximity of adjacent tokens when analyzing
them. The default value is "false".
language:en|zh_CN|zh_TW - Specifies the language
of the input text - English (en), Simplified Chinese (zh_CN),
or Traditional Chinese (zh_TW). The default value is "en". For the
values zh_CN and zh_TW, the function ignores the parameters useStem
and stemIgnoreFile.
Example: nlp.parameters = c("tokenDictFile:token_dict.txt",
"stopwordsFile:fileName",
"useStem:true",
"stemIgnoreFile:fileName",
"useBgram:true",
"language:zh_CN")
Types: character OR vector of characters
|
feature.selection |
Optional Argument.
Specifies the feature selection method, DF (document frequency), in
the form "DF:[min:max]". The values min and max must be in the
range (0, 1). The function selects only the tokens that appear in at
least min*n documents and at most max*n documents, where n is the
number of training documents. For example,
feature.selection = "DF:[0.1:0.9]" causes the function to select only
the tokens that appear in at least 10% but no more than 90% of the
training documents. If min exceeds max, the function uses min as max
and max as min.
Types: character
|
model.file |
Required Argument.
Specifies the name of the model file to be generated.
Types: character
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". This argument is used to ensure
deterministic results for functions that produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
Value
Function returns an object of class "td_text_classifier_trainer_mle",
which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("text_classifier_trainer_example", "texttrainer_input")
# Create a remote tibble object.
texttrainer_input <- tbl(con, "texttrainer_input")
# Example - The function outputs a binary model file with the name
# specified by "model.file" argument.
td_text_classifier_trainer_mle(data = texttrainer_input,
                               text.column = 'content',
                               category.column = 'category',
                               classifier.type = 'knn',
                               model.file = 'knn.bin',
                               classifier.parameters = 'compress:0.9',
                               nlp.parameters = c('useStem:true', 'stopwordsFile:stopwords.txt'),
                               feature.selection = 'DF:[0.1:0.99]',
                               data.sequence.column = 'id')
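# Example - Train a Maximum Entropy (default) classifier and score new
# documents with the installed model file. This is a sketch only: the
# td_text_classifier_mle argument names used below ("newdata",
# "text.column", "accumulate", "model.file") are assumptions based on
# common tdplyr conventions; consult the td_text_classifier_mle page
# for the exact signature.

```r
# Train with only the required arguments; classifier.type defaults
# to "MaxEnt", so classifier.parameters is not needed here.
td_text_classifier_trainer_mle(data = texttrainer_input,
                               text.column = 'content',
                               category.column = 'category',
                               model.file = 'maxent.bin')

# Apply the installed model file to documents (argument names assumed).
scored <- td_text_classifier_mle(newdata = texttrainer_input,
                                 text.column = 'content',
                                 model.file = 'maxent.bin',
                                 accumulate = 'category')

# The result is a named list; access the output tbl with "$result".
scored$result
```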