NaiveBayesTextClassifierTrainer Arguments - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™
TokenColumn
Specify the name of the token_table column that contains the tokens to classify.
ModelType
[Optional] Specify the model type of the text classifier.
Default: 'Multinomial'. See the sections that follow this table.
DocIDColumn
[Required if ModelType is 'Bernoulli', unnecessary otherwise.] Specify the names of the token_table columns that contain the document identifier.
DocCategoryColumn
Specify the name of the token_table column that contains the document category.
CategoryColumn
[Optional] Use only if you specify categories_table. Specify the name of the categories_table column that contains the prediction categories to use in the model.
Default: First column of categories_table
If you omit both categories_table and CategoryColumn, the function uses all categories specified by DocCategoryColumn.
Categories
[Optional] Specify the prediction categories to use in the model.
Default: All categories specified by DocCategoryColumn.
StopWordsColumn
[Optional] Specify the name of the stop_words_table column that contains the stop words.
Default: First column of stop_words_table
StopWords
[Optional] Specify either this argument or the stop_words_table, but not both.

Specify words to ignore (such as a, an, and the).
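
The dependencies between the arguments above can be summarized as a small check. The following is a minimal sketch in Python, not part of the product: the function name check_trainer_args and its parameters are hypothetical, and treating ModelType as case-insensitive is an assumption. It encodes only the rules stated in the argument descriptions.

```python
def check_trainer_args(model_type="Multinomial", doc_id_column=None,
                       category_column=None, has_categories_table=False,
                       stop_words=None, has_stop_words_table=False):
    """Return a list of rule violations (empty when the arguments are consistent).

    Hypothetical helper: it mirrors the documented argument rules only and
    does not call or emulate the actual function.
    """
    problems = []
    # Assumption: model type names are compared case-insensitively.
    kind = model_type.lower()
    if kind not in ("multinomial", "bernoulli"):
        problems.append("ModelType must be 'Multinomial' or 'Bernoulli'")
    # DocIDColumn is required only for the Bernoulli model.
    if kind == "bernoulli" and not doc_id_column:
        problems.append("DocIDColumn is required when ModelType is 'Bernoulli'")
    # CategoryColumn is meaningful only together with categories_table.
    if category_column and not has_categories_table:
        problems.append("CategoryColumn applies only when categories_table is supplied")
    # StopWords and stop_words_table are mutually exclusive.
    if stop_words and has_stop_words_table:
        problems.append("Specify StopWords or stop_words_table, not both")
    return problems


# Defaults describe a Multinomial model, which needs no DocIDColumn.
print(check_trainer_args())
# A Bernoulli model without DocIDColumn violates one rule.
print(check_trainer_args(model_type="Bernoulli"))
```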

Multinomial (default) Model Formula



p(C_i | D) Probability that new document is classified to category i
TC Total token count (including duplicate tokens)
T_j Token j in document D
TC_i Token count in category i (including duplicate tokens)
TC_ij Count of token j in category i (including duplicate tokens)
|V| Number of unique tokens in training set V
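
The symbols above are consistent with a Laplace-smoothed multinomial score of the form (TC_i / TC) multiplied, for each token occurrence j in D, by (TC_ij + 1) / (TC_i + |V|). That form is an assumption inferred from the symbol list, not a confirmed statement of the function's internals. The Python sketch below computes it in log space on a toy token table; the function names and the sample rows are hypothetical.

```python
import math
from collections import Counter


def train_multinomial(rows):
    """rows: (category, token) pairs, one per token occurrence."""
    rows = list(rows)
    tc_ij = Counter(rows)                   # TC_ij: count of token j in category i
    tc_i = Counter(cat for cat, _ in rows)  # TC_i: token count in category i
    vocab = {tok for _, tok in rows}        # V: unique tokens in the training set
    return tc_ij, tc_i, vocab


def multinomial_log_scores(doc_tokens, tc_ij, tc_i, vocab):
    """Log of the assumed unnormalized score for each category."""
    tc = sum(tc_i.values())                 # TC: total token count
    scores = {}
    for cat, n in tc_i.items():
        score = math.log(n / tc)            # prior term TC_i / TC (assumed)
        for tok in doc_tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            score += math.log((tc_ij[(cat, tok)] + 1) / (n + len(vocab)))
        scores[cat] = score
    return scores


# Toy training data (hypothetical): five token occurrences in two categories.
rows = [("sports", "ball"), ("sports", "game"), ("sports", "ball"),
        ("tech", "code"), ("tech", "game")]
model = train_multinomial(rows)
scores = multinomial_log_scores(["ball", "game"], *model)
print(max(scores, key=scores.get))
```

Working in log space avoids numeric underflow when a document contains many tokens; the category with the largest log score is the predicted one.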

Bernoulli Model Formula



p(C_i | D) Probability that new document is classified to category i
DC Total document count
DC_i Document count in category i
V Set of unique tokens in the training set
T_k Token in V that is not in document D
DC_ij Count of documents in category i that contain token j
|C| Number of unique categories in category set C
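
The symbols above suggest a smoothed Bernoulli score: a prior DC_i / DC, a factor p_ij = (DC_ij + 1) / (DC_i + |C|) for each vocabulary token present in D, and a factor (1 - p_ik) for each vocabulary token T_k absent from D. Using |C| in the smoothing denominator is an assumption inferred from its appearance in the symbol list; textbook Bernoulli naive Bayes uses 2 there. The Python sketch below makes that term a parameter; all names and sample rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_bernoulli(rows):
    """rows: (doc_id, category, token) triples; the doc_id is what DocIDColumn supplies."""
    doc_cat = {}
    doc_tokens = defaultdict(set)
    for doc_id, cat, tok in rows:
        doc_cat[doc_id] = cat
        doc_tokens[doc_id].add(tok)         # presence only, duplicates ignored
    dc_i = Counter(doc_cat.values())        # DC_i: document count in category i
    dc_ij = Counter((doc_cat[d], t)         # DC_ij: documents in i containing token j
                    for d, toks in doc_tokens.items() for t in toks)
    vocab = set().union(*doc_tokens.values())
    return dc_ij, dc_i, vocab


def bernoulli_log_scores(doc_tokens, dc_ij, dc_i, vocab, denom_add=None):
    """Log of the assumed unnormalized score for each category.

    denom_add defaults to |C| (an assumption); pass 2 for the textbook form.
    With a single category and denom_add=1, p can reach 1 and log(1 - p) fails.
    """
    present = set(doc_tokens)
    dc = sum(dc_i.values())                 # DC: total document count
    if denom_add is None:
        denom_add = len(dc_i)               # |C|
    scores = {}
    for cat, n in dc_i.items():
        score = math.log(n / dc)
        for tok in vocab:                   # tokens outside V are ignored
            p = (dc_ij[(cat, tok)] + 1) / (n + denom_add)
            score += math.log(p if tok in present else 1 - p)
        scores[cat] = score
    return scores


# Toy training data (hypothetical): three documents in two categories.
rows = [(1, "sports", "ball"), (1, "sports", "game"), (2, "sports", "ball"),
        (3, "tech", "code"), (3, "tech", "game")]
model = train_bernoulli(rows)
scores = bernoulli_log_scores(["ball"], *model)
print(max(scores, key=scores.get))
```

Unlike the multinomial sketch, every vocabulary token contributes to the score, including tokens absent from the document, which is why this model needs a document identifier to count documents rather than token occurrences.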