- TokenColumn
- Specifies the name of the token_table column that contains the tokens to be classified.
- ModelType
- [Optional] Specifies the model type of the text classifier: 'Multinomial' or 'Bernoulli'. Default: 'Multinomial'. The formulas for the two model types follow the argument descriptions.
- DocIDColumn
- [Required if ModelType is 'Bernoulli', unnecessary otherwise.] Specifies the names of the token_table columns that contain the document identifier.
- DocCategoryColumn
- Specifies the name of the token_table column that contains the document category.
- CategoryColumn
- [Optional] Specifies the name of the categories_table column that contains the prediction categories. Default: First column of categories_table.
- Categories
- [Optional] Specifies the prediction categories. Specify either this argument or the categories_table, but not both.
- StopWordsColumn
- [Optional] Specifies the name of the stop_words_table column that contains the stop words. Default: First column of stop_words_table.
- StopWords
- [Optional] Specifies words to ignore (such as a, an, and the). Specify either this argument or the stop_words_table, but not both.
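The constraints between these arguments can be checked before a call is built. The following Python sketch is illustrative only: `validate_trainer_args` and its parameter names are hypothetical and are not part of the function's interface, and exact, case-sensitive matching of the ModelType values is an assumption.

```python
from typing import List, Optional, Sequence


def validate_trainer_args(
    token_column: str,
    doc_category_column: str,
    model_type: str = "Multinomial",
    doc_id_columns: Optional[Sequence[str]] = None,
    categories: Optional[Sequence[str]] = None,
    has_categories_table: bool = False,
    stop_words: Optional[Sequence[str]] = None,
    has_stop_words_table: bool = False,
) -> List[str]:
    """Return a list of rule violations; an empty list means the combination is valid.

    Hypothetical helper: it only encodes the rules stated in the argument
    descriptions and does not call the classifier.
    """
    problems = []
    if not token_column:
        problems.append("TokenColumn must name a token_table column")
    if not doc_category_column:
        problems.append("DocCategoryColumn must name a token_table column")
    # Assumption: the two model type names are matched exactly.
    if model_type not in ("Multinomial", "Bernoulli"):
        problems.append("ModelType must be 'Multinomial' or 'Bernoulli'")
    # DocIDColumn is required only for the Bernoulli model.
    if model_type == "Bernoulli" and not doc_id_columns:
        problems.append("DocIDColumn is required when ModelType is 'Bernoulli'")
    # Categories and categories_table are mutually exclusive.
    if categories is not None and has_categories_table:
        problems.append("Specify Categories or categories_table, not both")
    # StopWords and stop_words_table are mutually exclusive.
    if stop_words is not None and has_stop_words_table:
        problems.append("Specify StopWords or stop_words_table, not both")
    return problems
```

For example, `validate_trainer_args("token", "category", model_type="Bernoulli")` reports the missing DocIDColumn, while the same call with the default model type reports nothing.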
The Multinomial (default) model formula uses the following terms:
Symbol | Description |
---|---|
p(C_i \| D) | Probability that new document D is classified into category i |
TC | Total token count (including duplicate tokens) |
T_j | Token j in document D |
TC_i | Token count in category i (including duplicate tokens) |
TC_ij | Count of token j in category i (including duplicate tokens) |
\|V\| | Number of unique tokens in training set V |
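As a rough illustration of how these quantities combine, the Python sketch below scores a document with a textbook multinomial naive Bayes rule using add-one (Laplace) smoothing: the unnormalized score for category i is (TC_i / TC) multiplied, over each token j in the document, by (TC_ij + 1) / (TC_i + |V|). This form is an assumption inferred from the symbols listed above, not a confirmed statement of the function's formula; in particular, the token-based prior TC_i / TC is a guess. The function names `train_multinomial` and `classify_multinomial` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(rows):
    """Collect counts from (category, token) pairs, one pair per token occurrence."""
    tc_ij = defaultdict(Counter)          # TC_ij: count of token j in category i
    for category, token in rows:
        tc_ij[category][token] += 1
    tc_i = {c: sum(counts.values()) for c, counts in tc_ij.items()}  # TC_i
    tc = sum(tc_i.values())                                          # TC
    vocab = {t for counts in tc_ij.values() for t in counts}         # V
    return tc_ij, tc_i, tc, vocab


def classify_multinomial(model, doc_tokens):
    """Return {category: p(C_i | D)} under the assumed smoothed formula."""
    tc_ij, tc_i, tc, vocab = model
    log_scores = {}
    for c in tc_i:
        # Assumed prior: share of all training tokens that belong to category c.
        score = math.log(tc_i[c] / tc)
        for t in doc_tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            score += math.log((tc_ij[c][t] + 1) / (tc_i[c] + len(vocab)))
        log_scores[c] = score
    # Normalize in log space so the probabilities sum to 1.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {c: math.exp(s - top) / total for c, s in log_scores.items()}


rows = [("sports", "ball"), ("sports", "goal"), ("sports", "ball"),
        ("tech", "chip"), ("tech", "code")]
probs = classify_multinomial(train_multinomial(rows), ["ball"])
```

Working in log space avoids numeric underflow when a document contains many tokens.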
The Bernoulli model formula uses the following terms:
Symbol | Description |
---|---|
p(C_i \| D) | Probability that new document D is classified into category i |
DC | Total document count |
DC_i | Document count in category i |
V | Number of unique tokens in training set V |
T_k | Token in V that is not in document D |
DC_ij | Count of documents in category i that contain token j |
\|C\| | Number of unique categories in category set C |
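The Python sketch below shows one way these document counts could combine in a Bernoulli naive Bayes rule. It is a textbook-style approximation, not the function's confirmed formula. It assumes a smoothed prior (DC_i + 1) / (DC + |C|), which is only a guess at where |C| enters, and a smoothed per-token probability (DC_ij + 1) / (DC_i + 2). Tokens of V present in the document contribute that probability; tokens of V absent from the document (the T_k terms) contribute its complement. The function names are hypothetical.

```python
import math
from collections import defaultdict


def train_bernoulli(rows):
    """Collect document counts from (doc_id, category, token) triples."""
    doc_category = {}
    doc_tokens = defaultdict(set)
    for doc_id, category, token in rows:
        doc_category[doc_id] = category
        doc_tokens[doc_id].add(token)     # presence only; duplicates are ignored
    dc_i = defaultdict(int)                           # DC_i
    dc_ij = defaultdict(lambda: defaultdict(int))     # DC_ij
    for doc_id, category in doc_category.items():
        dc_i[category] += 1
        for token in doc_tokens[doc_id]:
            dc_ij[category][token] += 1
    vocab = set().union(*doc_tokens.values())         # V
    return dict(dc_i), dc_ij, len(doc_category), vocab


def classify_bernoulli(model, doc_tokens):
    """Return {category: p(C_i | D)} under the assumed smoothed formula."""
    dc_i, dc_ij, dc, vocab = model
    present = set(doc_tokens)             # tokens outside V are ignored
    log_scores = {}
    for c, n_docs in dc_i.items():
        # Assumed prior, smoothed with the number of categories |C|.
        score = math.log((n_docs + 1) / (dc + len(dc_i)))
        for t in vocab:
            p = (dc_ij[c].get(t, 0) + 1) / (n_docs + 2)
            # Present tokens contribute p; absent tokens (T_k) contribute 1 - p.
            score += math.log(p if t in present else 1.0 - p)
        log_scores[c] = score
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {c: math.exp(s - top) / total for c, s in log_scores.items()}


rows = [(1, "sports", "ball"), (1, "sports", "goal"),
        (2, "sports", "ball"), (3, "tech", "chip")]
probs = classify_bernoulli(train_bernoulli(rows), ["ball"])
```

The sketch needs a document identifier for every token row, which matches the requirement that DocIDColumn be supplied when ModelType is 'Bernoulli'.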