Argument | Description |
---|---|
TextColumn, TokenColumn | Specify the name of the InputTable column that contains the text or tokens to classify. |
DocCategoryColumn | Specify the name of the InputTable column that contains the document category. |
ModelType | [Optional] Specify the model type of the text classifier. For descriptions of model types, see the sections that follow this table. |
DocIDColumn | [Required if ModelType is 'Bernoulli'.] Specify the name of the InputTable column that contains the document identifier. |
IsTokenized | [Optional] Specify whether the input data is already tokenized. With IsTokenized ('true'), the function does not tokenize the input data. Specifying IsTokenized ('true') with untokenized input data may result in an ambiguous or meaningless model. |
ConvertToLowerCase | [Optional with IsTokenized ('false'), disallowed otherwise.] Specify whether to convert all letters in the input text to lowercase. |
StemTokens | [Optional with IsTokenized ('false'), disallowed otherwise.] Specify whether to stem the tokens as part of text tokenization. |
NullHandling | [Optional] Specify whether to remove null values from input data before processing. If the input data contains no null values, NullHandling ('false') improves performance. |
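
The dependencies between these arguments can be summarized in a short Python sketch. It is only an illustration of the two rules stated above (DocIDColumn is required for the Bernoulli model; ConvertToLowerCase and StemTokens are disallowed when the input is already tokenized). The helper name `check_arguments` and its parameters are hypothetical and are not part of the function's interface.

```python
def check_arguments(provided, is_tokenized, model_type="Multinomial"):
    """Return a list of rule violations for a set of supplied argument names.

    Hypothetical helper: it encodes only the two dependency rules stated
    in the argument descriptions, nothing else.
    """
    problems = []

    # DocIDColumn is required when ModelType is 'Bernoulli'.
    if model_type.lower() == "bernoulli" and "DocIDColumn" not in provided:
        problems.append("DocIDColumn is required when ModelType is 'Bernoulli'")

    # ConvertToLowerCase and StemTokens are tokenization options, so they
    # are disallowed when the input is already tokenized.
    if is_tokenized:
        for name in ("ConvertToLowerCase", "StemTokens"):
            if name in provided:
                problems.append(f"{name} is disallowed with IsTokenized ('true')")

    return problems
```

For example, `check_arguments({"TextColumn", "DocCategoryColumn"}, is_tokenized=False, model_type="Bernoulli")` reports the missing DocIDColumn.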
Multinomial (default) Model Formula
Expression | Description |
---|---|
p(Ci|D) | Probability that new document D is classified to category i |
TC | Total token count (including duplicate tokens) |
Tj | Token j in document D |
TCi | Token count in category i (including duplicate tokens) |
TCji | Count of token j in category i (including duplicate tokens) |
|V| | Number of unique tokens in training set V |
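
The formula itself is not reproduced here, so the following Python sketch is an assumption rather than the function's exact computation. It uses the textbook multinomial Naive Bayes form with add-one smoothing, built only from the quantities in the table: a prior of TCi / TC and a per-token term of (TCji + 1) / (TCi + |V|). The function names are hypothetical, and skipping tokens that never appear in training is also an assumption.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(docs):
    """docs: iterable of (category, tokens). Returns the counts from the table."""
    tc_ji = defaultdict(Counter)                 # TCji: count of token j in category i
    for category, tokens in docs:
        tc_ji[category].update(tokens)           # duplicates are counted
    tc_i = {c: sum(cnt.values()) for c, cnt in tc_ji.items()}   # TCi
    tc = sum(tc_i.values())                                      # TC
    vocab = {t for cnt in tc_ji.values() for t in cnt}           # V (|V| = len(vocab))
    return tc_ji, tc_i, tc, vocab


def log_scores_multinomial(tokens, model):
    """Unnormalized log p(Ci|D) for each category i (assumed textbook form)."""
    tc_ji, tc_i, tc, vocab = model
    scores = {}
    for c in tc_i:
        score = math.log(tc_i[c] / tc)           # assumed prior: TCi / TC
        for t in tokens:
            if t in vocab:                       # assumption: ignore unseen tokens
                score += math.log((tc_ji[c][t] + 1) / (tc_i[c] + len(vocab)))
        scores[c] = score
    return scores


def normalize(scores):
    """Turn log scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

Working in log space avoids underflow when a document contains many tokens; `normalize` converts the scores back to probabilities only at the end.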
Bernoulli Model Formula
Expression | Description |
---|---|
p(Ci|D) | Probability that new document D is classified to category i |
DC | Total document count |
DCi | Document count in category i |
|V| | Number of unique tokens in training set V |
Tk | Token in V that is not in document D |
DCji | Number of documents in category i that contain token j |
DCki | Number of documents in category i that contain token k |
|C| | Number of unique categories in category set C |
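
As with the multinomial model, the formula is not reproduced here, so this Python sketch is an assumption based on the textbook Bernoulli Naive Bayes form. It assumes a smoothed prior of (DCi + 1) / (DC + |C|), a term of (DCji + 1) / (DCi + 2) for each token in D, and one minus that quantity (computed with DCki) for each token Tk in V that is not in D. The function names are hypothetical. The input is one row per token, which also illustrates why a document identifier is needed: tokens must be grouped by document before documents can be counted.

```python
import math
from collections import Counter, defaultdict


def train_bernoulli(rows):
    """rows: iterable of (doc_id, category, token), one row per token."""
    doc_tokens = defaultdict(set)                # doc_id -> distinct tokens
    doc_category = {}
    for doc_id, category, token in rows:
        doc_tokens[doc_id].add(token)            # duplicates within a doc are ignored
        doc_category[doc_id] = category
    dc = len(doc_category)                       # DC
    dc_i = Counter(doc_category.values())        # DCi
    dc_ji = defaultdict(Counter)                 # DCji: docs in category i with token j
    for doc_id, tokens in doc_tokens.items():
        dc_ji[doc_category[doc_id]].update(tokens)
    vocab = set().union(*doc_tokens.values())    # V
    return dc_ji, dc_i, dc, vocab


def log_scores_bernoulli(tokens, model):
    """Unnormalized log p(Ci|D) for each category i (assumed textbook form)."""
    dc_ji, dc_i, dc, vocab = model
    present = set(tokens) & vocab
    scores = {}
    for c in dc_i:
        # Assumed smoothed prior; |C| = len(dc_i).
        score = math.log((dc_i[c] + 1) / (dc + len(dc_i)))
        for t in vocab:
            p = (dc_ji[c][t] + 1) / (dc_i[c] + 2)
            # Tokens in D contribute p; tokens Tk absent from D contribute 1 - p.
            score += math.log(p if t in present else 1.0 - p)
        scores[c] = score
    return scores
```

Unlike the multinomial sketch, every token in V contributes to the score, so the absence of a token is itself evidence for or against a category.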