TextClassifierTrainer Syntax Elements

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

9.02

9.01

2.0

1.3

Published

February 2022

Language

English (United States)

Last Update

2022-02-10

dita:id

B700-4003

OutputModelFile

Specify the name for the model file to create.

ConvertToLowerCase

[Optional] Specify whether to convert input text to lowercase.

Default: 'true'

TextColumn

Specify the name of the InputTable column that contains the text of the training documents.

CategoryColumn

Specify the name of the InputTable column that contains the category of the training documents.

ModelType

Specify the model type, k-nearest neighbors (KNN) or Maximum Entropy (MaxEnt):

ModelType

Description

KNN

TextClassifierTrainer classifies a text document by a plurality vote of its neighbors, assigning the document to the class most common among its k nearest neighbors. TextClassifierTrainer chooses the best k parameter, and TextClassifier uses that k to predict classes.

Depends on KNNModelParameters:

KNNModelParameters	Description
Omitted	The TextClassifierTrainer function does the following: Internally divides the data into training sets and validation sets. Trains multiple models on the training sets, using a different k_value and p_value for each model. Uses each model to predict the label of each document in the validation sets. Uses the actual labels in the validation sets to calculate the precision of the predicted labels. Installs the model whose k_value and p_value produced the maximum precision on ML Engine, for the TextClassifier function to use to predict the labels of test documents.
Specifies one k_value, one p_value, or one of each	The TextClassifierTrainer function uses the specified value or values to train a model on the training sets and installs the model on ML Engine, for the TextClassifier function to use to predict the labels of test documents.
Specifies more than one k_value or more than one p_value	The TextClassifierTrainer function does the following: Internally divides the data into training sets and validation sets. For each specified value, trains a model on each training set. Uses each model to predict the label of each document in the validation sets. Uses the actual labels in the validation sets to calculate the precision of the predicted labels. Installs the model whose values produced the maximum precision on ML Engine, for the TextClassifier function to use to predict the labels of test documents.
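
The selection procedure described in the table can be sketched in Python. This is an illustration only, not the ML Engine implementation. It assumes a single training/validation split, cosine similarity between term-count vectors as the vote weight, and the fraction of correctly predicted validation labels as a stand-in for precision. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, doc, k, p):
    """Weighted plurality vote among the k most similar training documents."""
    scored = sorted(
        ((cosine(vec, doc), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, label in scored:
        # Assumption: the vote weight is the similarity raised to the power p.
        # With p = 0, every neighbor casts one equal vote.
        votes[label] += similarity ** p
    return max(votes, key=votes.get)


def select_k_and_p(train, validation, k_values, p_values):
    """Return (score, k, p) for the first combination with the highest score."""
    best = None
    for k, p in product(k_values, p_values):
        hits = sum(knn_predict(train, doc, k, p) == label
                   for doc, label in validation)
        score = hits / len(validation)
        if best is None or score > best[0]:
            best = (score, k, p)
    return best
```

Passing a single k and a single p to select_k_and_p corresponds to the second row of the table: there is nothing to compare, so the specified values are used as given.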

MaxEnt

Entropy is the amount of information conveyed by an event. Using the principle of maximum entropy (http://en.wikipedia.org/wiki/Maximum_entropy_method), TextClassifierTrainer selects the model that has the largest entropy from all models that fit the training data.

Maximum entropy does not assume that features are conditionally independent of each other. This matters in text classification, where the features are usually words, which are not independent of one another. In contrast, NaiveBayesTextClassifierTrainer2 (ML Engine) assumes that each word is independent of every other word.
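
As a small illustration of the entropy measure itself (not of how TextClassifierTrainer trains a MaxEnt model), the following Python sketch computes Shannon entropy in bits for two hypothetical class distributions. Among distributions over the same four classes, the uniform one has the larger entropy.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # hypothetical distribution over 4 classes
skewed = [0.7, 0.1, 0.1, 0.1]       # hypothetical distribution over 4 classes

print(entropy(uniform))  # 2.0
print(entropy(skewed) < entropy(uniform))  # True
```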

KNNModelParameters

[Optional] Applies only if the classifier type of the model is KNN. Specify parameters for the classifier:

Parameter	Description
compress	c_value must be in the range (0, 1). The function clusters the n training documents into c_value*n groups. For example, if there are 100 training documents, then KNNModelParameters ('compress:0.6') clusters them into 60 groups, and the model uses the center of each group as the feature vector.
kvalues	k_value must be an INTEGER value in the range [1, max(classes, ceil(sqrt(rows)))], where classes is the number of classes in the training table and rows is the number of rows in the training table. k_value specifies the number of nearest neighbors to consider when deciding the label of an unseen document. The function selects the best specified k_value for deciding the label of an unseen document.
power	p_value must be a DOUBLE PRECISION value in the range [0, 10]. p_value specifies the power to apply to the weight corresponding to each vote considered when deciding the label of an unseen document.
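
The ranges in the table can be checked before calling the function. The following Python sketch is a convenience illustration with hypothetical helper names. It assumes that c_value*n is rounded to the nearest integer, which the table does not state.

```python
import math


def max_k_value(classes, rows):
    """Upper bound for k_value: max(classes, ceil(sqrt(rows)))."""
    return max(classes, math.ceil(math.sqrt(rows)))


def validate_knn_parameters(classes, rows, k_values=(), p_values=(), c_value=None):
    """Raise ValueError if any value falls outside its documented range."""
    upper = max_k_value(classes, rows)
    for k in k_values:
        if not (isinstance(k, int) and 1 <= k <= upper):
            raise ValueError(f"k_value {k!r} must be an integer in [1, {upper}]")
    for p in p_values:
        if not 0 <= p <= 10:
            raise ValueError(f"p_value {p!r} must be in [0, 10]")
    if c_value is not None and not 0 < c_value < 1:
        raise ValueError(f"c_value {c_value!r} must be in (0, 1)")


def compressed_group_count(n_documents, c_value):
    """Number of groups for 'compress:c_value' (rounding is an assumption)."""
    return round(c_value * n_documents)


print(max_k_value(classes=3, rows=100))   # 10
print(compressed_group_count(100, 0.6))   # 60
```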

NLPParameters

[Optional] Specify natural language processing (NLP) parameters for preprocessing the text data and producing tokens:

name:value	Description
tokenDictFile:token_file	token_file is the name of an ML Engine file in which each line contains a phrase, followed by a space, followed by the token for the phrase (and nothing else).
stopwordsFile:stopword_file	stopword_file is the name of an ML Engine file in which each line contains exactly one stop word (a word to ignore during tokenization, such as a, an, or the).
useStem:{ 'true' | 'false' }	Specifies whether the function stems tokens. Default: 'false'
stemIgnoreFile:stem_ignore_file	stem_ignore_file is the name of an ML Engine file in which each line contains exactly one word to ignore during stemming. Specifying this parameter with useStem:'false' causes an exception.
useBgram:{ 'true' | 'false' }	Specifies whether the function uses Bigram, which considers the proximity of adjacent tokens when analyzing them. Default: 'false'
language:{ 'en' | 'zh_CN' | 'zh_TW' }	Specifies the language of the input text: English, Simplified Chinese, or Traditional Chinese, respectively. For zh_CN and zh_TW, the function ignores useStem and stemIgnoreFile. Default: 'en'

Example:

NLPParameters ('tokenDictFile:token_dict.txt', 
'stopwordsFile:fileName', 
'useStem:true', 
'stemIgnoreFile:fileName', 
'useBgram:true', 
'language:zh_CN')

If ConvertToLowerCase is 'false', the function treats stop words as case-sensitive.
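
The interaction between ConvertToLowerCase, stop words, and bigrams can be sketched in Python. This is a simplified illustration, not the ML Engine tokenizer: it assumes whitespace tokenization, assumes a bigram is represented as two adjacent tokens joined by a space, and omits stemming and the token dictionary. The function name and arguments are hypothetical.

```python
def preprocess(text, stop_words, convert_to_lower_case=True, use_bigram=False):
    """Tokenize text, drop stop words, and optionally append bigrams."""
    stop_words = set(stop_words)
    if convert_to_lower_case:
        # Lowercasing both sides makes stop-word matching case-insensitive.
        text = text.lower()
        stop_words = {word.lower() for word in stop_words}
    # Without lowercasing, 'The' does not match the stop word 'the'.
    tokens = [token for token in text.split() if token not in stop_words]
    if use_bigram:
        bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
        tokens = tokens + bigrams
    return tokens


print(preprocess("The cat sat", {"the"}))
# ['cat', 'sat']
print(preprocess("The cat sat", {"the"}, convert_to_lower_case=False))
# ['The', 'cat', 'sat']
```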

FeatureSelectionLimits

[Optional] Specify the feature selection method, DF (document frequency). The values min and max must be in the range (0, 1). The function selects only the tokens that appear in at least min*n documents and at most max*n documents, where n is the number of training documents. For example, FeatureSelection ('DF:[0.1:0.9]') causes the function to select only the tokens that appear in at least 10% but no more than 90% of the training documents. If min exceeds max, the function uses min as max and max as min.
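
The document-frequency rule can be sketched in Python as follows. This is an illustration with a hypothetical function name, assuming each training document is already a list of tokens. It includes the swap that applies when min exceeds max.

```python
from collections import Counter


def select_tokens_by_df(documents, min_frac, max_frac):
    """Keep tokens that appear in at least min_frac*n and at most max_frac*n documents."""
    if min_frac > max_frac:
        # Mirror the documented behavior: use min as max and max as min.
        min_frac, max_frac = max_frac, min_frac
    n = len(documents)
    # Count each token once per document, however often it occurs there.
    document_frequency = Counter(
        token for document in documents for token in set(document)
    )
    return {
        token
        for token, count in document_frequency.items()
        if min_frac * n <= count <= max_frac * n
    }


docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
print(select_tokens_by_df(docs, 0.3, 0.9))  # {'b'}
```

In this example, the token a appears in every document and the tokens c and d appear in only one document each, so only b falls inside the selected range.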

Punctuation

[Optional] Specify the punctuation characters to remove from the input text. The string 'punctuation_characters' is a regular expression (see Regular Expressions in Syntax Elements).
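
The effect of removing punctuation with a regular expression can be illustrated in Python. The pattern '[.,!?]' is a hypothetical example value, and Python's re module is only assumed to behave like the regular expression engine of the function for a simple character class such as this one.

```python
import re


def remove_punctuation(text, punctuation_characters):
    """Delete every match of the regular expression from the text."""
    return re.sub(punctuation_characters, "", text)


print(remove_punctuation("Hello, world!", r"[.,!?]"))  # Hello world
```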