NaiveBayesTextClassifierTrainer2 Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10
Product Category
Teradata Vantage™
TextColumn
TokenColumn
Specify the name of the InputTable column that contains the text or tokens to classify.
DocCategoryColumn
Specify the name of the InputTable column that contains the document category.
ModelType
[Optional] Specify the model type of the text classifier. For descriptions of model types, see the sections that follow this table.
Default: 'Multinomial'
DocIDColumn
[Required if ModelType is 'Bernoulli'.] Specify the name of the InputTable column that contains the document identifier.
IsTokenized
[Optional] Specify whether the input data is already tokenized. With IsTokenized ('true'), the function does not tokenize the input data. Specifying IsTokenized ('true') with untokenized input data may result in an ambiguous or meaningless model.
Default: 'true'
ConvertToLowerCase
[Optional with IsTokenized ('false'), disallowed otherwise.] Specify whether to convert all letters in the input text to lowercase.
Default: 'false'
StemTokens
[Optional with IsTokenized ('false'), disallowed otherwise.] Specify whether to stem the tokens as part of text tokenization.
Default: 'true'
NullHandling
[Optional] Specify whether to remove null values from input data before processing. If the input data contains no null values, NullHandling ('false') improves performance.
Default: 'false'
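
The dependencies among these syntax elements can be checked before a call is submitted. The following Python sketch is illustrative only: `check_trainer_arguments` is a hypothetical helper, not part of Teradata Vantage, and it encodes only the rules stated above (DocIDColumn is required for the Bernoulli model; ConvertToLowerCase and StemTokens are allowed only with IsTokenized ('false')). Treating model type names as case-insensitive and rejecting any other model type are assumptions.

```python
from typing import List, Optional


def check_trainer_arguments(
    model_type: str = "Multinomial",
    doc_id_column: Optional[str] = None,
    is_tokenized: bool = True,
    convert_to_lower_case: Optional[bool] = None,
    stem_tokens: Optional[bool] = None,
) -> List[str]:
    """Return rule violations; an empty list means the combination is allowed.

    None means "argument not specified", so the documented default applies.
    """
    problems = []
    # Assumption: model type names are compared case-insensitively.
    kind = model_type.lower()
    if kind not in ("multinomial", "bernoulli"):
        problems.append(f"unknown ModelType: {model_type!r}")
    # DocIDColumn is required only for the Bernoulli model.
    if kind == "bernoulli" and not doc_id_column:
        problems.append("DocIDColumn is required when ModelType is 'Bernoulli'")
    # ConvertToLowerCase and StemTokens are disallowed unless IsTokenized ('false').
    if is_tokenized:
        if convert_to_lower_case is not None:
            problems.append("ConvertToLowerCase requires IsTokenized ('false')")
        if stem_tokens is not None:
            problems.append("StemTokens requires IsTokenized ('false')")
    return problems
```

For example, `check_trainer_arguments(model_type="Bernoulli")` reports one violation (the missing DocIDColumn), and `check_trainer_arguments(is_tokenized=False, stem_tokens=True)` reports none.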

Multinomial (default) Model Formula


Formula for multinomial (default) model, used by Machine Learning Engine function NaiveBayesTextClassifierTrainer2
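
Written with the symbols defined in the table that follows, a Laplace-smoothed multinomial naive Bayes score commonly takes the form below. This is the textbook formulation and is assumed, not confirmed by this page, to match what the function computes. In particular, the token-count prior TCi / TC is an assumption, chosen because the table defines only token counts. The product runs over every token occurrence in D, duplicates included.

```latex
p(C_i \mid D) \;\propto\; \frac{TC_i}{TC} \prod_{T_j \in D} \frac{TC_{ji} + 1}{TC_i + |V|}
```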
Expression Description
p(Ci|D) Probability that new document D is classified to category i
TC Total token count (including duplicate tokens)
Tj Token j in document D
TCi Token count in category i (including duplicate tokens)
TCji Count of token j in category i (including duplicate tokens)
|V| Number of unique tokens in training set V
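
The quantities in this table can be computed from tokenized training data as in the following Python sketch. It is a minimal illustration, not the function's implementation: `train_multinomial` and `multinomial_log_score` are hypothetical names, the add-one smoothing and the token-count prior TCi / TC are textbook assumptions, and scores are kept in log space to avoid underflow on long documents.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(docs):
    """docs: iterable of (category, tokens) pairs. Returns (TC, TCi, TCji, V)."""
    tc_ji = defaultdict(Counter)  # TCji: count of token j in category i
    for category, tokens in docs:
        tc_ji[category].update(tokens)  # duplicates are counted
    tc_i = {c: sum(counts.values()) for c, counts in tc_ji.items()}  # TCi
    tc = sum(tc_i.values())  # TC: total token count
    vocab = {t for counts in tc_ji.values() for t in counts}  # V: unique tokens
    return tc, tc_i, tc_ji, vocab


def multinomial_log_score(tokens, category, tc, tc_i, tc_ji, vocab):
    """Unnormalized log p(Ci|D) with add-one smoothing (assumed form)."""
    score = math.log(tc_i[category] / tc)  # assumed token-count prior
    for t in tokens:
        # A token unseen in the category still gets a nonzero probability.
        score += math.log((tc_ji[category][t] + 1) / (tc_i[category] + len(vocab)))
    return score


training = [("sports", ["ball", "goal", "ball"]), ("tech", ["code", "bug"])]
tc, tc_i, tc_ji, vocab = train_multinomial(training)
sports = multinomial_log_score(["ball"], "sports", tc, tc_i, tc_ji, vocab)
tech = multinomial_log_score(["ball"], "tech", tc, tc_i, tc_ji, vocab)
```

With this toy data, TC is 5, TCi is 3 for "sports" and 2 for "tech", and |V| is 4, so the document ["ball"] scores higher for "sports" than for "tech".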

Bernoulli Model Formula


Formula for Bernoulli model, used by Machine Learning Engine function NaiveBayesTextClassifierTrainer2
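
Written with the symbols defined in the table that follows, a textbook Bernoulli naive Bayes score takes the form below. It is assumed, not confirmed by this page, to resemble what the function computes. The smoothing constants (+1 and +2) are the textbook choice. The table also defines |C|; this sketch leaves it out because the page does not say where it enters (a smoothed prior such as (DCi + 1) / (DC + |C|) is one common use).

```latex
p(C_i \mid D) \;\propto\; \frac{DC_i}{DC}
  \prod_{T_j \in D} \frac{DC_{ji} + 1}{DC_i + 2}
  \prod_{T_k \in V,\; T_k \notin D} \left( 1 - \frac{DC_{ki} + 1}{DC_i + 2} \right)
```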
Expression Description
p(Ci|D) Probability that new document D is classified to category i
DC Total document count
DCi Document count in category i
|V| Number of unique tokens in training set V
Tk Token in V that is not in document D
DCji Document count in category i that contains token j
DCki Document count in category i that contains token k
|C| Number of unique categories in category set C
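
The document counts in this table depend on knowing which tokens belong to the same document, which is why the Bernoulli model needs a document identifier. The following Python sketch illustrates this under stated assumptions: the input is taken to be one (doc_id, category, token) row per token, `train_bernoulli` and `bernoulli_log_score` are hypothetical names, and the +1 / +2 smoothing is the textbook choice rather than a confirmed detail of the function (it does not use |C|).

```python
import math
from collections import Counter, defaultdict


def train_bernoulli(rows):
    """rows: iterable of (doc_id, category, token). Returns (DC, DCi, DCji, V)."""
    doc_tokens = defaultdict(set)  # distinct tokens per document
    doc_category = {}
    for doc_id, category, token in rows:
        doc_tokens[doc_id].add(token)  # repeats within a document collapse
        doc_category[doc_id] = category
    dc = len(doc_category)  # DC: total document count
    dc_i = Counter(doc_category.values())  # DCi: documents per category
    dc_ji = defaultdict(Counter)  # DCji: documents in category i containing token j
    for doc_id, tokens in doc_tokens.items():
        dc_ji[doc_category[doc_id]].update(tokens)
    vocab = set().union(*doc_tokens.values())  # V: unique tokens
    return dc, dc_i, dc_ji, vocab


def bernoulli_log_score(tokens, category, dc, dc_i, dc_ji, vocab):
    """Unnormalized log p(Ci|D); tokens outside V are ignored."""
    present = set(tokens)
    score = math.log(dc_i[category] / dc)  # unsmoothed prior (assumption)
    for t in vocab:
        p = (dc_ji[category][t] + 1) / (dc_i[category] + 2)  # textbook smoothing
        # Tokens absent from D contribute (1 - p), unlike the multinomial model.
        score += math.log(p if t in present else 1.0 - p)
    return score


rows = [
    (1, "sports", "ball"), (1, "sports", "goal"), (1, "sports", "ball"),
    (2, "sports", "ball"), (3, "tech", "code"),
]
dc, dc_i, dc_ji, vocab = train_bernoulli(rows)
sports = bernoulli_log_score(["ball"], "sports", dc, dc_i, dc_ji, vocab)
tech = bernoulli_log_score(["ball"], "tech", dc, dc_i, dc_ji, vocab)
```

In this toy data, "ball" appears twice in document 1 but counts once toward DCji, so DCji for "ball" in "sports" is 2, and the document ["ball"] scores higher for "sports" than for "tech".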