TextClassifierTrainer Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
OutputModelFile
Specify the name for the model file to create.
ConvertToLowerCase
[Optional] Specify whether to convert input text to lowercase.
Default: 'true'
TextColumn
Specify the name of the InputTable column that contains the text of the training documents.
CategoryColumn
Specify the name of the InputTable column that contains the category of the training documents.
ModelType
Specify the model type, k-nearest neighbors (KNN) or Maximum Entropy (MaxEnt):
ModelType Description
KNN: TextClassifierTrainer classifies text document by plurality vote of its neighbors, assigning document to class most common among its k nearest neighbors. TextClassifierTrainer chooses best k parameter and TextClassifier uses k to predict classes.
Depends on KNNModelParameters:
KNNModelParameters Description
Omitted: TextClassifierTrainer function does the following:
  1. Internally divides data into training sets and validation sets.
  2. Trains multiple models on training sets, using different k_value and p_value for each model.
  3. Uses each model to predict label of each document in validation sets.
  4. Uses actual labels in validation sets to calculate precision of predicted labels.
  5. Installs model whose k_value and p_value produced maximum precision on ML Engine for TextClassifier function to use to predict label of test documents.
Specifies one k_value, one p_value, or one of each: TextClassifierTrainer function uses specified value or values to train model on training sets and installs model on ML Engine for TextClassifier function to use to predict label of test documents.
Specifies more than one k_value or more than one p_value: TextClassifierTrainer function does the following:
  1. Internally divides data into training sets and validation sets.
  2. For each specified value, trains a model on each training set.
  3. Uses each model to predict label of each document in validation sets.
  4. Uses actual labels in validation sets to calculate precision of predicted labels.
  5. Installs model whose values produced maximum precision on ML Engine for TextClassifier function to use to predict label of test documents.
MaxEnt: Entropy is amount of information conveyed by event. Using principle of maximum entropy (http://en.wikipedia.org/wiki/Maximum_entropy_method), TextClassifierTrainer selects model that has largest entropy from all models that fit training data.

Maximum entropy does not assume features are conditionally independent of each other. This matters in text classification, where features are usually words that are not independent. In contrast, NaiveBayesTextClassifierTrainer2 (ML Engine) assumes each word is independent of every other word.
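
The KNN selection steps described above (train on training sets, predict labels in validation sets, keep the k_value and p_value with maximum precision) can be sketched in Python. This is an illustration only, not the ML Engine implementation. It assumes a single training/validation split, whitespace tokenization, cosine similarity between token counts, a vote weight of similarity raised to p_value, and precision measured as the share of correct validation labels. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math

def vectorize(text):
    """Turn text into token counts (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(train, doc, k, p):
    """Label doc by weighted plurality vote of its k most similar neighbors."""
    neighbors = sorted(train, key=lambda item: cosine(item[0], doc), reverse=True)[:k]
    votes = Counter()
    for vector, label in neighbors:
        votes[label] += cosine(vector, doc) ** p  # assumed weighting scheme
    return votes.most_common(1)[0][0]

def select_model(train, validation, k_values, p_values):
    """Return the (k, p) pair with the highest share of correct validation labels."""
    best, best_score = None, -1.0
    for k, p in product(k_values, p_values):
        correct = sum(predict(train, doc, k, p) == label for doc, label in validation)
        score = correct / len(validation)
        if score > best_score:  # first pair wins ties
            best, best_score = (k, p), score
    return best, best_score

train = [(vectorize(text), label) for text, label in [
    ("stocks fell on weak earnings", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the final match", "sports"),
    ("striker scores in the match", "sports"),
]]
validation = [
    (vectorize("earnings lift bank stocks"), "finance"),
    (vectorize("final match goes to the home team"), "sports"),
]

best, score = select_model(train, validation, k_values=[1, 3], p_values=[0.0, 1.0, 2.0])
print(best, score)
```

When one k_value and one p_value are specified, the loop in select_model runs once, which corresponds to training a single model without comparison.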

KNNModelParameters
[Optional] Applies only if ModelType is KNN. Specify parameters for the classifier:
Parameter Description
compress: c_value must be in range (0, 1). Function clusters n training documents into c_value*n groups.

For example, if there are 100 training documents, then KNNModelParameters ('compress:0.6') clusters them into 60 groups, and model uses center of each group as feature vector.

kvalues: k_value must be INTEGER value in range [1, max(classes, ceil(sqrt(rows)))], where:
  • classes is number of classes in training table
  • rows is number of rows in training table

k_value specifies number of nearest neighbors to consider when deciding label of unseen document.

Function selects best specified k_value for deciding label of unseen document.

power: p_value must be DOUBLE PRECISION value in range [0, 10]. p_value specifies power to apply to weight corresponding to each vote considered when deciding label of unseen document.
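
The ranges in the preceding table can be checked before calling the function. The following Python sketch computes the upper bound for k_value and the number of groups that compress produces. The helper names are hypothetical, and rounding c_value*n to the nearest integer is an assumption; the documentation does not state how a fractional result is handled.

```python
import math

def max_k(classes, rows):
    """Upper bound of k_value: max(classes, ceil(sqrt(rows)))."""
    return max(classes, math.ceil(math.sqrt(rows)))

def compressed_groups(c_value, n):
    """Number of groups for compress:c_value (assumed: rounded to nearest integer)."""
    return round(c_value * n)

def check_knn_parameters(classes, rows, k_values=(), p_values=(), c_value=None):
    """Raise ValueError if a value falls outside its documented range."""
    upper = max_k(classes, rows)
    for k in k_values:
        if not isinstance(k, int) or not 1 <= k <= upper:
            raise ValueError(f"k_value {k!r} must be an integer in [1, {upper}]")
    for p in p_values:
        if not 0 <= p <= 10:
            raise ValueError(f"p_value {p!r} must be in [0, 10]")
    if c_value is not None and not 0 < c_value < 1:
        raise ValueError(f"c_value {c_value!r} must be in (0, 1)")

# 100 training rows in 4 classes: k_value may range from 1 to 10.
print(max_k(classes=4, rows=100))
# compress:0.6 on 100 training documents gives 60 groups.
print(compressed_groups(0.6, 100))
check_knn_parameters(classes=4, rows=100, k_values=[3, 5], p_values=[2.0], c_value=0.6)
```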
NLPParameters
[Optional] Specify natural language processing (NLP) parameters for preprocessing the text data and producing tokens:
name:value Description
tokenDictFile:token_file token_file is name of ML Engine file in which each line contains a phrase, followed by a space, followed by the token for the phrase (and nothing else).
stopwordsFile:stopword_file stopword_file is name of ML Engine file in which each line contains exactly one stop word (a word to ignore during tokenization, such as a, an, or the).
useStem:{ 'true' | 'false' } Specifies whether function stems tokens.

Default: 'false'

stemIgnoreFile:stem_ignore_file stem_ignore_file is name of ML Engine file in which each line contains exactly one word to ignore during stemming.

Specifying this parameter with useStem:'false' causes an exception.

useBgram:{ 'true' | 'false' } Specifies whether function uses Bigram, which considers proximity of adjacent tokens when analyzing them.

Default: 'false'

language:{ 'en' | 'zh_CN' | 'zh_TW' } Specifies input text language: English, Simplified Chinese, or Traditional Chinese, respectively.

For zh_CN and zh_TW, function ignores useStem and stemIgnoreFile.

Default: 'en'

Example:
NLPParameters ('tokenDictFile:token_dict.txt', 
'stopwordsFile:fileName', 
'useStem:true', 
'stemIgnoreFile:fileName', 
'useBgram:true', 
'language:zh_CN')
If ConvertToLowerCase is 'false', the function treats stop words as case-sensitive.
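
The interactions among these parameters can be sketched in Python. The sketch models three documented rules: stemIgnoreFile with useStem 'false' causes an exception, zh_CN and zh_TW ignore useStem and stemIgnoreFile, and stop words are case-sensitive when ConvertToLowerCase is 'false'. The function names and whitespace tokenization are assumptions for illustration, as is the choice to skip the stemIgnoreFile check for Chinese text.

```python
CHINESE = ("zh_CN", "zh_TW")

def effective_nlp_parameters(params):
    """Apply the documented rules to a dict of NLPParameters name:value pairs."""
    language = params.get("language", "en")              # Default: 'en'
    use_stem = params.get("useStem", "false") == "true"  # Default: 'false'
    result = dict(params, language=language)
    if language in CHINESE:
        # For zh_CN and zh_TW, useStem and stemIgnoreFile are ignored.
        result.pop("useStem", None)
        result.pop("stemIgnoreFile", None)
    elif "stemIgnoreFile" in params and not use_stem:
        raise ValueError("stemIgnoreFile requires useStem:'true'")
    return result

def remove_stop_words(text, stop_words, convert_to_lower_case=True):
    """Drop stop words; matching is case-sensitive if text is not lowercased."""
    if convert_to_lower_case:
        text = text.lower()
    return [token for token in text.split() if token not in stop_words]

params = {"stopwordsFile": "fileName", "useStem": "true",
          "stemIgnoreFile": "fileName", "language": "zh_CN"}
print(effective_nlp_parameters(params))

# With lowercasing, "The" matches the stop word "the"; without it, "The" is kept.
print(remove_stop_words("The model reads the text", {"the"}))
print(remove_stop_words("The model reads the text", {"the"}, convert_to_lower_case=False))
```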
FeatureSelectionLimits
[Optional] Specify the feature selection method, DF (document frequency). The values min and max must be in the range (0, 1). The function selects only the tokens that appear in at least min*n documents and at most max*n documents, where n is the number of training documents. For example, FeatureSelectionLimits ('DF:[0.1:0.9]') causes the function to select only the tokens that appear in at least 10% but no more than 90% of the training documents. If min exceeds max, the function swaps the two values.
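
The document frequency rule can be sketched in Python as follows. The function name and the representation of documents as token lists are assumptions for illustration; the sketch counts each token once per document and swaps the limits if they are reversed.

```python
from collections import Counter

def select_by_document_frequency(documents, low, high):
    """Keep tokens that appear in at least low*n and at most high*n documents."""
    if low > high:
        low, high = high, low  # reversed limits are swapped
    n = len(documents)
    frequency = Counter()
    for tokens in documents:
        frequency.update(set(tokens))  # count each token once per document
    return {token for token, count in frequency.items() if low * n <= count <= high * n}

# 10 documents: "the" appears in all 10, "market" in 5, "zebra" in 2.
documents = [["the", "market"]] * 5 + [["the", "zebra"]] * 2 + [["the"]] * 3
selected = select_by_document_frequency(documents, 0.1, 0.9)
print(sorted(selected))
```

Here "the" is dropped because it appears in more than 90% of the documents, which is the typical effect of the upper limit on very common words.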
Punctuation
[Optional] Specify the punctuation characters to remove from the input text. The string 'punctuation_characters' is a regular expression (see Regular Expressions in Syntax Elements).
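
Because the value is a regular expression, a character class is a natural way to list the characters to remove. The Python sketch below illustrates the idea with a hypothetical pattern. It assumes that matches are deleted rather than replaced with spaces, and Python regular expression syntax may differ in details from the syntax that the function accepts.

```python
import re

def strip_punctuation(text, punctuation_characters):
    """Remove every match of the punctuation regular expression from text."""
    return re.sub(punctuation_characters, "", text)

# Hypothetical pattern: a character class listing four punctuation marks.
cleaned = strip_punctuation("Hello, world! Ready?", r"[.,!?]")
print(cleaned)
```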