Argument | Category | Description |
---|---|---|
TokenColumn | Required | Specifies the name of the token_table column that contains the tokens to be classified. |
ModelType | Optional | Specifies the model type of the text classifier. The default value is 'Multinomial'. The formulas for the two model types follow this table. |
DocIDColumn | Required if ModelType is 'Bernoulli', unnecessary otherwise | Specifies the names of the token_table columns that contain the document identifier. |
DocCategoryColumn | Required | Specifies the name of the token_table column that contains the document category. |
CategoryColumn | Optional | Specifies the name of the categories_table column that contains the prediction categories. The default value is the first column of categories_table. |
Categories | Optional | Specifies the prediction categories. Specify either this argument or the categories_table, but not both. |
StopWordsColumn | Optional | Specifies the name of the stop_words_table column that contains the stop words. The default value is the first column of stop_words_table. |
StopWords | Optional | Specifies words to ignore (such as a, an, and the). Specify either this argument or the stop_words_table, but not both. |
The Multinomial (default) model formula is:
Symbol | Description |
---|---|
$p(C_i \mid D)$ | Probability that new document is classified to category i |
$TC$ | Total token count (including duplicate tokens) |
$T_j$ | Count of token j in category i (including duplicate tokens) |
$TC_i$ | Token count in category i (including duplicate tokens) |
$TC_{ij}$ | Count of token j in category i (including duplicate tokens) |
$\lvert V \rvert$ | Number of unique tokens in training set V |
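The quantities in this legend can be illustrated with a minimal Python sketch. It assumes the textbook multinomial naive Bayes form with add-one (Laplace) smoothing: p(C_i | D) is proportional to (TC_i / TC) multiplied, over each token j in D, by (TC_ij + 1) / (TC_i + |V|). That form is an assumption consistent with the symbols listed above and is not confirmed to be the exact formula the function uses. The helper names `train_multinomial` and `multinomial_posteriors` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(rows):
    """rows: iterable of (category, token) pairs, one pair per token occurrence."""
    tc_ij = defaultdict(Counter)              # TC_ij: count of token j in category i
    for category, token in rows:
        tc_ij[category][token] += 1
    tc_i = {c: sum(cnt.values()) for c, cnt in tc_ij.items()}   # TC_i
    tc = sum(tc_i.values())                                     # TC
    vocab = {t for cnt in tc_ij.values() for t in cnt}          # V (unique tokens)
    return tc_ij, tc_i, tc, vocab


def multinomial_posteriors(doc_tokens, model):
    """Return normalized p(C_i | D) under the assumed smoothed formula."""
    tc_ij, tc_i, tc, vocab = model
    log_scores = {}
    for c in tc_i:
        score = math.log(tc_i[c] / tc)        # assumed prior: TC_i / TC
        for t in doc_tokens:
            if t in vocab:                    # tokens unseen in training are skipped
                score += math.log((tc_ij[c][t] + 1) / (tc_i[c] + len(vocab)))
        log_scores[c] = score
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

Working in log space keeps the product of many small probabilities from underflowing; the final normalization makes the scores sum to 1 across categories.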
The Bernoulli model formula is:
Symbol | Description |
---|---|
$p(C_i \mid D)$ | Probability that new document is classified to category i |
$DC$ | Total document count |
$DC_i$ | Document count in category i |
$\lvert V \rvert$ | Number of unique tokens in training set V |
$T_k$ | Token in V that is not in document D |
$DC_{ij}$ | Document count in category i that contains token j |
$\lvert C \rvert$ | Number of unique categories in category set C |
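The Bernoulli quantities can be sketched the same way. The code below assumes the textbook Bernoulli naive Bayes form: a smoothed prior (DC_i + 1) / (DC + |C|), a factor (DC_ij + 1) / (DC_i + 2) for each vocabulary token j present in D, and one minus that factor for each token T_k in V that is absent from D. This is an assumption consistent with the symbols above, not a confirmed reproduction of the function's formula. The helper names `train_bernoulli` and `bernoulli_posteriors` are hypothetical. Note that document identifiers are needed to count documents, which matches the requirement for DocIDColumn when ModelType is 'Bernoulli'.

```python
import math
from collections import defaultdict


def train_bernoulli(rows):
    """rows: iterable of (doc_id, category, token) triples."""
    docs = {}                                  # doc_id -> (category, set of tokens)
    for doc_id, category, token in rows:
        docs.setdefault(doc_id, (category, set()))[1].add(token)
    dc_i = defaultdict(int)                    # DC_i: documents per category
    dc_ij = defaultdict(lambda: defaultdict(int))  # DC_ij: docs in i containing j
    for category, tokens in docs.values():
        dc_i[category] += 1
        for t in tokens:
            dc_ij[category][t] += 1
    vocab = {t for _, tokens in docs.values() for t in tokens}   # V
    return dict(dc_i), dc_ij, len(docs), vocab  # len(docs) is DC


def bernoulli_posteriors(doc_tokens, model):
    """Return normalized p(C_i | D) under the assumed smoothed formula."""
    dc_i, dc_ij, dc, vocab = model
    present = set(doc_tokens) & vocab
    n_categories = len(dc_i)                   # |C|
    log_scores = {}
    for c, n_docs in dc_i.items():
        score = math.log((n_docs + 1) / (dc + n_categories))    # assumed prior
        for t in vocab:
            p = (dc_ij[c].get(t, 0) + 1) / (n_docs + 2)
            # Tokens absent from D (the T_k terms) contribute 1 - p.
            score += math.log(p if t in present else 1.0 - p)
        log_scores[c] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

Unlike the multinomial sketch, every vocabulary token contributes to the score here, whether or not it appears in the document, and repeated occurrences of a token within one document count once.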