- TokenColumn
- Specify the name of the token_table column that contains the tokens to classify.
- ModelType
- [Optional] Specify the model type of the text classifier: 'Multinomial' (default) or 'Bernoulli'.
- DocIDColumn
- [Required if ModelType is 'Bernoulli', unnecessary otherwise.] Specify the names of the token_table columns that contain the document identifier.
- DocCategoryColumn
- Specify the name of the token_table column that contains the document category.
- CategoryColumn
- [Optional] Use only if you specify categories_table. Specify the name of the categories_table column that contains the prediction categories to use in the model.
- Categories
- [Optional] Specify the prediction categories to use in the model.
- StopWordsColumn
- [Optional] Specify the name of the stop_words_table column that contains the stop words.
- StopWords
- [Optional] Specify words to ignore (such as a, an, and the). Specify either this argument or the stop_words_table, but not both.
Multinomial (default) Model Formula
| Symbol | Description |
|---|---|
| $p(C_i \mid D)$ | Probability that new document is classified to category i |
| $TC$ | Total token count (including duplicate tokens) |
| $T_j$ | Count of token j in category i (including duplicate tokens) |
| $TC_i$ | Token count in category i (including duplicate tokens) |
| $TC_{ij}$ | Count of token j in category i (including duplicate tokens) |
| $\lvert V \rvert$ | Number of unique tokens in training set V |
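
The formula itself is not reproduced in this section. As a rough illustration of how the symbols above fit together, the sketch below assumes the usual Laplace-smoothed multinomial form, normalized over all categories:

$$p(C_i \mid D) \propto \frac{TC_i}{TC} \prod_{j \in D} \frac{TC_{ij} + 1}{TC_i + \lvert V \rvert}$$

This form, the token-count prior, and the function names `train_multinomial` and `classify_multinomial` are assumptions made for illustration; they are not taken from the function's implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(rows):
    """Build counts from (token, category) pairs, one pair per token occurrence.

    Hypothetical helper: mirrors the TokenColumn / DocCategoryColumn inputs.
    """
    tc_ij = defaultdict(Counter)              # TC_ij: count of token j in category i
    for token, category in rows:
        tc_ij[category][token] += 1
    tc_i = {c: sum(cnt.values()) for c, cnt in tc_ij.items()}  # TC_i
    vocab = {t for cnt in tc_ij.values() for t in cnt}         # training set V
    tc = sum(tc_i.values())                                    # TC
    return tc_ij, tc_i, tc, len(vocab)


def classify_multinomial(doc_tokens, model):
    """Return p(C_i | D) for every category, under the assumed formula."""
    tc_ij, tc_i, tc, v_size = model
    log_scores = {}
    for c in tc_i:
        score = math.log(tc_i[c] / tc)        # assumed prior: TC_i / TC
        for t in doc_tokens:                  # duplicate tokens count each time
            score += math.log((tc_ij[c][t] + 1) / (tc_i[c] + v_size))
        log_scores[c] = score
    # Normalize in log space so the probabilities sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


rows = [("goal", "sports"), ("match", "sports"), ("goal", "sports"),
        ("vote", "politics"), ("election", "politics")]
model = train_multinomial(rows)
probs = classify_multinomial(["goal"], model)
print(max(probs, key=probs.get))
```

Working in log space avoids underflow when a document contains many tokens.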
Bernoulli Model Formula
| Symbol | Description |
|---|---|
| $p(C_i \mid D)$ | Probability that new document is classified to category i |
| $DC$ | Total document count |
| $DC_i$ | Document count in category i |
| $V$ | Number of unique tokens in training set V |
| $T_k$ | Token in V that is not in document D |
| $DC_{ij}$ | Document count in category i that contains token j |
| $\lvert C \rvert$ | Number of unique categories in category set C |
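
The Bernoulli formula is likewise not reproduced here. The sketch below assumes the common smoothed Bernoulli form, in which tokens present in D contribute their estimated presence probability and tokens $T_k$ of V absent from D contribute the complement:

$$p(C_i \mid D) \propto \frac{DC_i + 1}{DC + \lvert C \rvert} \prod_{j \in D} \frac{DC_{ij} + 1}{DC_i + 2} \prod_{T_k \in V,\ T_k \notin D} \left(1 - \frac{DC_{ik} + 1}{DC_i + 2}\right)$$

The smoothing constants and the helper names `train_bernoulli` and `classify_bernoulli` are assumptions for illustration, not the documented implementation. The sketch also shows why a document identifier is needed for this model type: counts are per document, not per token occurrence.

```python
import math
from collections import Counter, defaultdict


def train_bernoulli(rows):
    """Build document counts from (doc_id, token, category) triples.

    Hypothetical helper: mirrors DocIDColumn / TokenColumn / DocCategoryColumn.
    """
    docs = {}                                  # doc_id -> (category, token set)
    for doc_id, token, category in rows:
        docs.setdefault(doc_id, (category, set()))[1].add(token)
    dc_i = Counter()                           # DC_i: documents per category
    dc_ij = defaultdict(Counter)               # DC_ij: documents in i containing j
    vocab = set()                              # training set V
    for category, tokens in docs.values():
        dc_i[category] += 1
        dc_ij[category].update(tokens)         # each token counted once per document
        vocab |= tokens
    return dc_i, dc_ij, vocab


def classify_bernoulli(doc_tokens, model):
    """Return p(C_i | D) for every category, under the assumed formula."""
    dc_i, dc_ij, vocab = model
    dc = sum(dc_i.values())                    # DC
    present = set(doc_tokens)
    log_scores = {}
    for c, n in dc_i.items():
        score = math.log((n + 1) / (dc + len(dc_i)))   # smoothed prior with |C|
        for t in vocab:                        # tokens outside V are ignored
            p_present = (dc_ij[c][t] + 1) / (n + 2)
            score += math.log(p_present if t in present else 1 - p_present)
        log_scores[c] = score
    # Normalize in log space so the probabilities sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


rows = [(1, "goal", "sports"), (1, "match", "sports"),
        (2, "goal", "sports"), (3, "vote", "politics")]
model = train_bernoulli(rows)
probs = classify_bernoulli(["goal"], model)
print(max(probs, key=probs.get))
```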