NaiveBayesTextClassifierTrainer Arguments

Teradata Aster® Analytics Foundation User Guide Update 2

Product: Aster Analytics
Release Number: 7.00.02
Published: September 2017
Language: English (United States)
Last Update: 2018-04-17

Product Category: Software

TokenColumn: Specifies the name of the token_table column that contains the tokens to be classified.
ModelType: [Optional] Specifies the model type of the text classifier, either 'Multinomial' or 'Bernoulli'. Default: 'Multinomial'. The two model types are described after this list.
DocIDColumn: [Required if ModelType is 'Bernoulli', unnecessary otherwise.] Specifies the names of the token_table columns that contain the document identifier.
DocCategoryColumn: Specifies the name of the token_table column that contains the document category.
CategoryColumn: [Optional] Specifies the name of the categories_table column that contains the prediction categories. Default: First column of categories_table.
Categories: [Optional] Specifies the prediction categories. Specify either this argument or categories_table, but not both.
StopWordsColumn: [Optional] Specifies the name of the stop_words_table column that contains the stop words. Default: First column of stop_words_table.
StopWords: [Optional] Specifies words to ignore (such as a, an, and the). Specify either this argument or stop_words_table, but not both.
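
The constraints in the argument descriptions above can be checked before a call is built. The following is a minimal Python sketch, not part of Aster Analytics: the helper name validate_trainer_args and its parameters are hypothetical, and treating ModelType as case-insensitive is an assumption.

```python
def validate_trainer_args(model_type="Multinomial", doc_id_column=None,
                          categories=None, has_categories_table=False,
                          stop_words=None, has_stop_words_table=False):
    """Return a list of problems found in a hypothetical argument set.

    An empty list means the rules stated in the argument descriptions hold.
    """
    problems = []
    # Assumption: the model type is compared case-insensitively.
    kind = model_type.lower()
    if kind not in ("multinomial", "bernoulli"):
        problems.append("ModelType must be 'Multinomial' or 'Bernoulli'")
    # DocIDColumn is required only for the Bernoulli model.
    if kind == "bernoulli" and not doc_id_column:
        problems.append("DocIDColumn is required when ModelType is 'Bernoulli'")
    # Categories and categories_table are mutually exclusive.
    if categories and has_categories_table:
        problems.append("Specify Categories or categories_table, not both")
    # StopWords and stop_words_table are mutually exclusive.
    if stop_words and has_stop_words_table:
        problems.append("Specify StopWords or stop_words_table, not both")
    return problems
```

For example, validate_trainer_args(model_type="Bernoulli") reports the missing DocIDColumn, while the default argument set reports no problems.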

The Multinomial (default) model formula uses the following symbols:

p(C_i | D)	Probability that new document is classified to category i
TC	Total token count (including duplicate tokens)
T_j	Count of token j in category i (including duplicate tokens)
TC_i	Token count in category i (including duplicate tokens)
TC_ij	Count of token j in category i (including duplicate tokens)
|V|	Number of unique tokens in training set V
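
The formula itself is not reproduced in this copy. A standard Laplace-smoothed multinomial form that uses only the symbols above is

p(C_i | D) proportional to (TC_i / TC) * product over tokens j in D of (TC_ij + 1) / (TC_i + |V|)

Whether the function applies exactly this prior and smoothing is an assumption. The Python sketch below illustrates that assumed form; the function name multinomial_log_scores and its input layout are hypothetical, and j is assumed to range over the tokens of document D.

```python
import math
from collections import Counter


def multinomial_log_scores(training_tokens, document_tokens):
    """Score a document against each category (assumed Laplace-smoothed form).

    training_tokens: iterable of (category, token) pairs, one per token occurrence.
    document_tokens: list of tokens in the new document D.
    Returns {category: unnormalized log score}.
    """
    pairs = list(training_tokens)
    tc_ij = Counter(pairs)                        # TC_ij: count of token j in category i
    tc_i = Counter(cat for cat, _ in pairs)       # TC_i: token count in category i
    tc = len(pairs)                               # TC: total token count
    vocab_size = len({tok for _, tok in pairs})   # |V|: unique tokens in training set

    scores = {}
    for cat, count in tc_i.items():
        log_score = math.log(count / tc)          # assumed prior: TC_i / TC
        for tok in document_tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            log_score += math.log((tc_ij[(cat, tok)] + 1) / (count + vocab_size))
        scores[cat] = log_score
    return scores
```

Working in log space avoids underflow when a document has many tokens; the category with the largest score is the predicted one.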

The Bernoulli model formula uses the following symbols:

p(C_i | D)	Probability that new document is classified to category i
DC	Total document count
DC_i	Document count in category i
|V|	Number of unique tokens in training set V
T_k	Token in V that is not in document D
DC_ij	Document count in category i that contains token j
|C|	Number of unique categories in category set C
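
The formula itself is not reproduced in this copy. A standard smoothed Bernoulli form that uses only the symbols above is

p(C_i | D) proportional to ((DC_i + 1) / (DC + |C|)) * product over tokens T_j in D of (DC_ij + 1) / (DC_i + 2) * product over tokens T_k in V not in D of (1 - (DC_ik + 1) / (DC_i + 2))

Whether the function applies exactly this smoothing is an assumption. The Python sketch below illustrates that assumed form; the function name bernoulli_log_scores and its input layout are hypothetical, and each training document is assumed to belong to exactly one category.

```python
import math
from collections import defaultdict


def bernoulli_log_scores(training_rows, document_tokens):
    """Score a document against each category (assumed smoothed Bernoulli form).

    training_rows: iterable of (doc_id, category, token) rows.
    document_tokens: tokens in the new document D.
    Returns {category: unnormalized log score}.
    """
    docs_in_cat = defaultdict(set)     # category -> document ids (gives DC_i)
    docs_with_tok = defaultdict(set)   # (category, token) -> document ids (gives DC_ij)
    vocab = set()                      # V: unique tokens in the training set
    for doc_id, cat, tok in training_rows:
        docs_in_cat[cat].add(doc_id)
        docs_with_tok[(cat, tok)].add(doc_id)
        vocab.add(tok)

    # DC; assumes each document belongs to exactly one category.
    dc = sum(len(ids) for ids in docs_in_cat.values())
    n_cats = len(docs_in_cat)          # |C|
    present = set(document_tokens) & vocab

    scores = {}
    for cat, ids in docs_in_cat.items():
        dc_i = len(ids)
        log_score = math.log((dc_i + 1) / (dc + n_cats))
        for tok in vocab:
            # Smoothed probability that a document in this category contains tok.
            p = (len(docs_with_tok.get((cat, tok), ())) + 1) / (dc_i + 2)
            # Tokens in D contribute p; tokens of V absent from D contribute 1 - p.
            log_score += math.log(p if tok in present else 1.0 - p)
        scores[cat] = log_score
    return scores
```

Unlike the multinomial sketch, token repetition within a document does not matter here, only presence or absence, which is why a document identifier is needed to count documents rather than tokens.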