NaiveBayesTextClassifierTrainer Arguments - Aster Analytics

Teradata Aster® Analytics Foundation User Guide Update 2

Product
Aster Analytics
Release Number
7.00.02
Published
September 2017
Language
English (United States)
Last Update
2018-04-17
Product Category
Software
TokenColumn
Specifies the name of the token_table column that contains the tokens to be classified.
ModelType
[Optional] Specifies the model type of the text classifier, either 'Multinomial' or 'Bernoulli'. Default: 'Multinomial'. The formulas for the two model types follow this table.
DocIDColumn
[Required if ModelType is 'Bernoulli', unnecessary otherwise.] Specifies the names of the token_table columns that contain the document identifier.
DocCategoryColumn
Specifies the name of the token_table column that contains the document category.
CategoryColumn
[Optional] Specifies the name of the categories_table column that contains the prediction categories. Default: First column of categories_table.
Categories
[Optional] Specify either this argument or the categories_table, but not both.

Specifies the prediction categories.

StopWordsColumn
[Optional] Specifies the name of the stop_words_table column that contains the stop words. Default: First column of stop_words_table.
StopWords
[Optional] Specify either this argument or the stop_words_table, but not both.

Specifies words to ignore (such as a, an, and the).
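
The constraints stated in the argument descriptions above (DocIDColumn required only for 'Bernoulli'; Categories and categories_table mutually exclusive; StopWords and stop_words_table mutually exclusive) can be sketched as a small pre-flight check. This is an illustrative Python helper, not part of Aster Analytics: the function name, its parameter names, and the case-insensitive comparison of ModelType are all assumptions made for the example.

```python
def check_trainer_args(model_type="Multinomial", doc_id_columns=None,
                       categories=None, has_categories_table=False,
                       stop_words=None, has_stop_words_table=False):
    """Return a list of rule violations; an empty list means the combination is valid.

    Hypothetical helper: it mirrors the rules in the argument descriptions,
    it does not call or emulate the actual function.
    """
    problems = []
    mode = model_type.lower()  # assumption: value compared case-insensitively
    if mode not in ("multinomial", "bernoulli"):
        problems.append("ModelType must be 'Multinomial' or 'Bernoulli'")
    if mode == "bernoulli" and not doc_id_columns:
        problems.append("DocIDColumn is required when ModelType is 'Bernoulli'")
    if categories and has_categories_table:
        problems.append("Specify Categories or categories_table, not both")
    if stop_words and has_stop_words_table:
        problems.append("Specify StopWords or stop_words_table, not both")
    return problems
```

For example, `check_trainer_args(model_type="Bernoulli")` reports the missing DocIDColumn, while the all-defaults call returns an empty list.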

The Multinomial (default) model formula is:



p(C_i | D) Probability that new document is classified to category i
TC Total token count (including duplicate tokens)
T_j Count of token j in category i (including duplicate tokens)
TC_i Token count in category i (including duplicate tokens)
TC_ij Count of token j in category i (including duplicate tokens)
|V| Number of unique tokens in training set V
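
The formula itself is not reproduced in this text, so the following Python sketch shows only how the symbols above fit a textbook Laplace-smoothed multinomial model. It rests on these assumptions:

- The score is p(C_i | D) ∝ (TC_i / TC) × Π_j ((TC_ij + 1) / (TC_i + |V|))^(T_j).
- The exponent T_j is the count of token j in the document being scored. The symbol list describes T_j with the same wording as TC_ij, so this reading is a guess.
- The function names and the input layout are hypothetical.

The sketch is not the function's confirmed implementation.

```python
import math
from collections import Counter

def train_multinomial(rows, stop_words=()):
    """rows: iterable of (token, category) pairs, one pair per token occurrence."""
    stop = set(stop_words)
    tc_ij = {}     # category -> Counter of token counts (TC_ij)
    vocab = set()  # V, the unique tokens in the training set
    for token, category in rows:
        if token in stop:
            continue  # stop words are ignored
        tc_ij.setdefault(category, Counter())[token] += 1
        vocab.add(token)
    tc_i = {c: sum(cnt.values()) for c, cnt in tc_ij.items()}  # TC_i
    return {"tc_ij": tc_ij, "tc_i": tc_i, "tc": sum(tc_i.values()), "v": len(vocab)}

def multinomial_log_scores(model, doc_tokens):
    """Unnormalized log p(C_i | D) for each category, under the assumed formula."""
    t_j = Counter(doc_tokens)  # assumed meaning of T_j: token counts in document D
    scores = {}
    for c, tc_i in model["tc_i"].items():
        score = math.log(tc_i / model["tc"])  # token-count prior TC_i / TC
        for token, count in t_j.items():
            p = (model["tc_ij"][c][token] + 1) / (tc_i + model["v"])
            score += count * math.log(p)
        scores[c] = score
    return scores

rows = [("goal", "sports"), ("match", "sports"), ("goal", "sports"),
        ("vote", "politics"), ("match", "politics")]
model = train_multinomial(rows)
scores = multinomial_log_scores(model, ["goal", "goal"])
best = max(scores, key=scores.get)  # "sports" for this toy data
```

Working in log space avoids underflow when a document contains many tokens; the scores are comparable across categories but are not normalized probabilities.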

The Bernoulli model formula is:



p(C_i | D) Probability that new document is classified to category i
DC Total document count
DC_i Document count in category i
|V| Number of unique tokens in training set V
T_k Token in V that is not in document D
DC_ij Document count in category i that contains token j
|C| Number of unique categories in category set C
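
As with the multinomial case, the formula is not reproduced in this text. The Python sketch below shows how the symbols above fit a textbook Bernoulli model, under these assumptions:

- The score is p(C_i | D) ∝ (DC_i / DC) × Π_{j in D} p_ij × Π_{k not in D} (1 − p_ik), with p_ij = (DC_ij + 1) / (DC_i + s).
- The smoothing term s defaults to the textbook value 2. The presence of |C| in the symbol list suggests the function may use |C| there instead; that cannot be confirmed from this text, so s is left as a parameter.
- Each document belongs to exactly one category.
- The function names and the input layout are hypothetical.

```python
import math

def train_bernoulli(rows):
    """rows: iterable of (doc_id, token, category) triples.

    Assumes each document belongs to exactly one category.
    """
    docs_in_cat = {}  # category -> set of doc ids (gives DC_i)
    docs_with = {}    # (category, token) -> set of doc ids (gives DC_ij)
    vocab = set()     # V, the unique tokens in the training set
    for doc_id, token, category in rows:
        docs_in_cat.setdefault(category, set()).add(doc_id)
        docs_with.setdefault((category, token), set()).add(doc_id)
        vocab.add(token)
    dc_i = {c: len(d) for c, d in docs_in_cat.items()}
    dc_ij = {k: len(d) for k, d in docs_with.items()}
    return {"dc_i": dc_i, "dc_ij": dc_ij, "dc": sum(dc_i.values()), "vocab": vocab}

def bernoulli_log_scores(model, doc_tokens, denom_add=2):
    """Unnormalized log p(C_i | D); denom_add is the assumed smoothing term s."""
    present = set(doc_tokens)
    scores = {}
    for c, dc_i in model["dc_i"].items():
        score = math.log(dc_i / model["dc"])  # document-count prior DC_i / DC
        for token in model["vocab"]:
            p = (model["dc_ij"].get((c, token), 0) + 1) / (dc_i + denom_add)
            # Tokens in D contribute p; tokens T_k absent from D contribute 1 - p.
            score += math.log(p if token in present else 1.0 - p)
        scores[c] = score
    return scores

rows = [(1, "goal", "sports"), (1, "match", "sports"),
        (2, "goal", "sports"), (3, "vote", "politics")]
model = train_bernoulli(rows)
scores = bernoulli_log_scores(model, ["goal"])
best = max(scores, key=scores.get)  # "sports" for this toy data
```

Unlike the multinomial sketch, this one needs a document identifier on every row, which is consistent with DocIDColumn being required only for the 'Bernoulli' model type. Note that passing a `denom_add` below 2 can make p equal 1 and the absent-token term log(0).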