1.1 - 8.10 - TextClassifierTrainer Syntax Elements - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.1, 8.10
Release Date: October 2019
Content Type: Programming Reference
Publication ID: B700-4003-079K
Language: English (United States)

OutputModelFile
Specify the name for the model file to create.
TextColumn
Specify the name of the InputTable column that contains the text of the training documents.
CategoryColumn
Specify the name of the InputTable column that contains the category of the training documents.
ModelType
Specify the classifier type of the model: either the KNN algorithm or a maximum entropy model.
KNNModelParameters
[Optional] Applies only if the classifier type of the model is KNN. Specify parameters for the classifier in name:value form, for example compress:value.

The compress value must be in the range (0, 1). The n training documents are clustered into value*n groups. For example, with 100 training documents, KNNModelParameters('compress:0.6') clusters them into 60 groups. The model uses the center of each group as the feature vector.
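
The group-count arithmetic can be illustrated with a minimal Python sketch. The helper name knn_group_count is hypothetical and is not part of ML Engine. Rounding to the nearest integer is an assumption, because this page does not say how a fractional value*n is handled.

```python
def knn_group_count(n_documents: int, compress: float) -> int:
    """Return the number of groups for n training documents at a compress value."""
    # The compress value must lie strictly between 0 and 1.
    if not 0 < compress < 1:
        raise ValueError("compress must be in the open range (0, 1)")
    # Assumption: round to the nearest integer; the documentation does not
    # state how a fractional product is handled.
    return round(compress * n_documents)


# 100 training documents with compress:0.6, as in the example above.
print(knn_group_count(100, 0.6))
```

A smaller compress value yields fewer groups, and therefore fewer feature vectors in the model.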

NLPParameters
[Optional] Specify natural language processing (NLP) parameters, each in name:value form, for preprocessing the text data and producing tokens:

tokenDictFile:token_file
token_file is the name of an ML Engine file in which each line contains a phrase, followed by a space, followed by the token for the phrase (and nothing else).

stopwordsFile:stopword_file
stopword_file is the name of an ML Engine file in which each line contains exactly one stop word (a word to ignore during tokenization, such as a, an, or the).

useStem:{ 'true' | 'false' }
Specifies whether the function stems tokens.
Default: 'false'

stemIgnoreFile:stem_ignore_file
stem_ignore_file is the name of an ML Engine file in which each line contains exactly one word to ignore during stemming.
Specifying this parameter with useStem:'false' causes an exception.

useBgram:{ 'true' | 'false' }
Specifies whether the function uses the bigram model, which considers the proximity of adjacent tokens when analyzing them.
Default: 'false'

language:{ 'en' | 'zh_CN' | 'zh_TW' }
Specifies the input text language: English, Simplified Chinese, or Traditional Chinese, respectively.
For zh_CN and zh_TW, the function ignores useStem and stemIgnoreFile.
Default: 'en'

Example:
NLPParameters ('tokenDictFile:token_dict.txt', 
'stopwordsFile:fileName', 
'useStem:true', 
'stemIgnoreFile:fileName', 
'useBgram:true', 
'language:zh_CN')
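
The constraints on these parameters can be checked before a query is submitted. The following Python sketch is one way to do that, under stated assumptions: build_nlp_parameters is a hypothetical client-side helper, not an ML Engine API, and the file name in the usage line is a placeholder. It enforces only the rules stated on this page and renders a clause in the same shape as the example above.

```python
ALLOWED_LANGUAGES = {"en", "zh_CN", "zh_TW"}
BOOLEAN_PARAMS = {"useStem", "useBgram"}
FILE_PARAMS = {"tokenDictFile", "stopwordsFile", "stemIgnoreFile"}


def build_nlp_parameters(params: dict) -> str:
    """Validate name:value pairs and render an NLPParameters clause string."""
    for name, value in params.items():
        if name in BOOLEAN_PARAMS:
            if value not in ("true", "false"):
                raise ValueError(f"{name} must be 'true' or 'false'")
        elif name == "language":
            if value not in ALLOWED_LANGUAGES:
                raise ValueError(f"unsupported language: {value}")
        elif name not in FILE_PARAMS:
            raise ValueError(f"unknown NLP parameter: {name}")
    # Documented rule: stemIgnoreFile with useStem:'false' causes an exception.
    # This sketch checks only an explicit useStem:'false'.
    if "stemIgnoreFile" in params and params.get("useStem") == "false":
        raise ValueError("stemIgnoreFile cannot be combined with useStem:false")
    pairs = ", ".join(f"'{name}:{value}'" for name, value in params.items())
    return f"NLPParameters ({pairs})"


# Hypothetical file name, for illustration only.
print(build_nlp_parameters({"useStem": "true", "stemIgnoreFile": "stem_ignore.txt"}))
```

The sketch does not model the case where useStem is omitted, nor the documented behavior that useStem and stemIgnoreFile are ignored for zh_CN and zh_TW.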
FeatureSelectionLimits
[Optional] Specify the feature selection method, DF (document frequency), in the form 'DF:[min:max]'. The values min and max must be in the range (0, 1). The function selects only the tokens that appear in at least min*n documents and at most max*n documents, where n is the number of training documents. For example, FeatureSelectionLimits ('DF:[0.1:0.9]') causes the function to select only the tokens that appear in at least 10% but no more than 90% of the training documents. If min exceeds max, the function swaps the two values.
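
The document-frequency rule can be sketched in Python as follows. This is an illustration of the rule as described here, not the function's actual implementation. The name select_tokens_by_df is hypothetical, and the input is assumed to be already tokenized (one list of tokens per training document).

```python
from collections import Counter


def select_tokens_by_df(documents, df_min, df_max):
    """Return tokens that appear in at least df_min*n and at most df_max*n documents."""
    for bound in (df_min, df_max):
        if not 0 < bound < 1:
            raise ValueError("min and max must be in the open range (0, 1)")
    # If min exceeds max, the two values are swapped.
    if df_min > df_max:
        df_min, df_max = df_max, df_min
    n = len(documents)
    # Count each token once per document, however often it occurs there.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {t for t, count in doc_freq.items() if df_min * n <= count <= df_max * n}


docs = [["the", "cat"], ["the", "dog"], ["the", "cat"], ["the", "bird"]]
# With n = 4 and DF:[0.3:0.9], a token must appear in 1.2 to 3.6 documents.
print(select_tokens_by_df(docs, 0.3, 0.9))
```

In this toy input, "the" appears in every document and is dropped as too common, while "dog" and "bird" appear in only one document each and are dropped as too rare.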