NaiveBayesTextClassifierTrainer Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
Product Category
Teradata Vantage™
TokenColumn
Specify the name of the InputTable column that contains the tokens to classify.
ModelType
[Optional] Specify the model type of the text classifier: 'Multinomial' or 'Bernoulli'.
Default: 'Multinomial'. The formula for each model type is given in the sections that follow.
DocIDColumn
[Required if ModelType is 'Bernoulli'; otherwise unnecessary.] Specify the names of the InputTable columns that contain the document identifier.
DocCategoryColumn
Specify the name of the InputTable column that contains the document category.
CategoryColumn
[Optional] Use only if you specify CategoriesTable. Specify the name of the CategoriesTable column that contains the prediction categories to use in the model.
Default: First column of CategoriesTable
If you omit both CategoriesTable and CategoryColumn, the function uses all categories specified by DocCategoryColumn.
Categories
[Optional] Specify the prediction categories to use in the model.
Default: All categories specified by DocCategoryColumn.
StopWordsColumn
[Optional] Specify the name of the StopWords table column that contains the stop words.
Default: First column of StopWords table
StopWordsList
[Optional] Specify words to ignore (such as a, an, and the).
Specify either this syntax element or the StopWords table, but not both.
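
The effect of Categories and StopWordsList on the training data can be pictured with a short Python sketch. This is an illustration only, not the function's implementation: the helper name filter_training_rows, the (category, token) row layout, and the use of exact string matching are all assumptions made for the example.

```python
def filter_training_rows(rows, categories=None, stop_words=None):
    """Keep only rows that would contribute to the model.

    rows       : iterable of (doc_category, token) pairs (hypothetical layout)
    categories : categories to keep; None means all categories found in rows
    stop_words : tokens to ignore; None means no tokens are ignored
    """
    keep = None if categories is None else set(categories)
    ignore = set() if stop_words is None else set(stop_words)
    return [
        (category, token)
        for category, token in rows
        if (keep is None or category in keep) and token not in ignore
    ]
```

With categories left as None, every category present in the input is used, which mirrors the documented default of using all categories specified by DocCategoryColumn.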

Multinomial (default) Model Formula


Formula for multinomial (default) model, used by Machine Learning Engine function NaiveBayesTextClassifierTrainer
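
The formula itself did not survive as text. The following is a reconstruction from the expressions listed here, assuming the usual add-one (Laplace) smoothed multinomial naive Bayes form, with T_j ranging over the tokens of document D. Treat it as a sketch consistent with the listed symbols rather than the verbatim original.

```latex
p(C_i \mid D) \;\propto\; \frac{TC_i}{TC} \prod_{T_j \in D} \frac{TC_{ji} + 1}{TC_i + |V|}
```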
Expression   Description
p(C_i|D)     Probability that new document D is classified to category i
TC           Total token count (including duplicate tokens)
T_j          Token j in document D
TC_i         Token count in category i (including duplicate tokens)
TC_ji        Count of token j in category i (including duplicate tokens)
|V|          Number of unique tokens in training set V
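
The quantities above can be computed from (category, token) rows with a few lines of Python. This is a minimal sketch, assuming an add-one smoothed score of the form (TC_i / TC) multiplied by (TC_ji + 1) / (TC_i + |V|) for each token occurrence in the document. The function names and the row layout are hypothetical and are not part of NaiveBayesTextClassifierTrainer.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(rows):
    """rows: iterable of (doc_category, token), one row per token occurrence."""
    tc_ji = defaultdict(Counter)            # TC_ji: count of token j in category i
    for category, token in rows:
        tc_ji[category][token] += 1
    tc_i = {c: sum(counts.values()) for c, counts in tc_ji.items()}   # TC_i
    tc = sum(tc_i.values())                                            # TC
    vocab = {t for counts in tc_ji.values() for t in counts}           # V
    return tc_ji, tc_i, tc, vocab


def log_score_multinomial(doc_tokens, category, model):
    """Log of the unnormalized p(C_i | D) under the assumed smoothed formula."""
    tc_ji, tc_i, tc, vocab = model
    score = math.log(tc_i[category] / tc)   # prior term TC_i / TC
    for token in doc_tokens:
        # A token never seen in this category has TC_ji = 0; the +1 keeps the
        # factor nonzero.
        score += math.log((tc_ji[category][token] + 1) / (tc_i[category] + len(vocab)))
    return score
```

Working in log space avoids underflow when a document has many tokens. The category with the largest score is the predicted one.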

Bernoulli Model Formula


Formula for Bernoulli model, used by Machine Learning Engine function NaiveBayesTextClassifierTrainer
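
The formula itself did not survive as text. The following is a reconstruction from the expressions listed here, assuming an add-one smoothed Bernoulli naive Bayes form in which |C| appears in the denominator, T_j ranges over the tokens in document D, and T_k ranges over the tokens in V that are not in D. Treat it as a sketch consistent with the listed symbols rather than the verbatim original.

```latex
p(C_i \mid D) \;\propto\; \frac{DC_i}{DC}
  \prod_{T_j \in D} \frac{DC_{ji} + 1}{DC_i + |C|}
  \prod_{T_k \in V,\; T_k \notin D} \left( 1 - \frac{DC_{ki} + 1}{DC_i + |C|} \right)
```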
Expression   Description
p(C_i|D)     Probability that new document D is classified to category i
DC           Total document count
DC_i         Document count in category i
V            Number of unique tokens in training set V
T_k          Token in V that is not in document D
DC_ji        Document count in category i that contains token j
DC_ki        Document count in category i that contains token k
|C|          Number of unique categories in category set C
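
The Bernoulli quantities depend on which documents contain a token, which is why a document identifier is needed for this model type. The Python sketch below assumes a smoothed per-token probability of (DC_ji + 1) / (DC_i + |C|), applied as-is for tokens present in the document and as its complement for vocabulary tokens that are absent. The function names, the (doc_id, category, token) row layout, and the assumption that each document belongs to exactly one category are hypothetical and for illustration only.

```python
import math
from collections import defaultdict


def train_bernoulli(rows):
    """rows: iterable of (doc_id, doc_category, token)."""
    docs_in_cat = defaultdict(set)                          # category -> doc ids
    docs_with_token = defaultdict(lambda: defaultdict(set)) # category -> token -> doc ids
    for doc_id, category, token in rows:
        docs_in_cat[category].add(doc_id)
        docs_with_token[category][token].add(doc_id)
    dc_i = {c: len(ids) for c, ids in docs_in_cat.items()}  # DC_i
    dc = sum(dc_i.values())                                 # DC (one category per document)
    dc_ji = {c: {t: len(ids) for t, ids in tokens.items()}  # DC_ji
             for c, tokens in docs_with_token.items()}
    vocab = {t for tokens in dc_ji.values() for t in tokens}
    return dc_ji, dc_i, dc, vocab


def log_score_bernoulli(doc_tokens, category, model):
    """Log of the unnormalized p(C_i | D) under the assumed smoothed formula.

    Tokens outside the training vocabulary are ignored. With a single
    category, a token present in every document gives probability 1, so the
    complement term would be log(0); the sketch assumes at least two categories.
    """
    dc_ji, dc_i, dc, vocab = model
    n_categories = len(dc_i)                                # |C|
    present = set(doc_tokens)
    score = math.log(dc_i[category] / dc)                   # prior term DC_i / DC
    for token in vocab:
        p = (dc_ji[category].get(token, 0) + 1) / (dc_i[category] + n_categories)
        score += math.log(p if token in present else 1.0 - p)
    return score
```

Unlike the multinomial sketch, repeated tokens in a document make no difference here, because only presence or absence of each vocabulary token is scored.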