TextTokenizer Syntax Elements

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10

TextColumn

Specify the name of the InputTable column that contains the text to tokenize.

InputLanguage

[Optional] Specify the language of the text in text_column:

Option	Description
'en' (Default)	English
'zh_CN'	Simplified Chinese
'zh_TW'	Traditional Chinese
'jp'	Japanese

InputModelFile

[Optional] Specify the name of the model file that the function uses for tokenizing. The model must be a conditional random fields (CRF) model, and input_model_file must already be installed on ML Engine. If you omit this syntax element, or if input_model_file is not installed, the function uses white space to separate English words and an embedded dictionary to tokenize Chinese text.

If you specify InputLanguage('jp'), the function ignores this syntax element.
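
The fallback rules for InputModelFile can be read as a small decision table. The following Python sketch only models the rules stated above and is not ML Engine code. The function name choose_tokenizer, its parameters, the returned labels, and the file name in the usage lines are hypothetical, and the sketch assumes that a named model file is used whenever it is installed and the language is not Japanese.

```python
SUPPORTED_LANGUAGES = ("en", "zh_CN", "zh_TW", "jp")


def choose_tokenizer(input_language="en", input_model_file=None, installed_files=()):
    """Return a label for the tokenizing method implied by the rules above.

    Hypothetical helper for illustration; the labels are not ML Engine values.
    """
    if input_language not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported InputLanguage: " + repr(input_language))
    if input_language == "jp":
        # InputModelFile is ignored for Japanese text.
        return "japanese (InputModelFile ignored)"
    if input_model_file is not None and input_model_file in installed_files:
        # Assumption: a named, installed CRF model file is the one used.
        return "crf model: " + input_model_file
    if input_language == "en":
        # No usable model file: English words are separated on white space.
        return "white space"
    # 'zh_CN' and 'zh_TW' fall back to the embedded dictionary.
    return "embedded dictionary"


# Usage with a hypothetical model file name.
installed = {"example_crf_model.bin"}
print(choose_tokenizer("zh_CN", "example_crf_model.bin", installed))
print(choose_tokenizer("zh_CN", "missing_model.bin", installed))
print(choose_tokenizer("jp", "example_crf_model.bin", installed))
```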

OutputDelimiter

[Optional] Specify the delimiter, a string, for separating tokens in the output.

Default: '/' (slash)

OutputByWord

[Optional] Specify whether to output one token in each row.

Default: 'false' (Output one line of text in each row.)
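
OutputDelimiter and OutputByWord together determine the shape of the output. The following Python sketch models the two shapes described above; it is not ML Engine code, and it does not reproduce the actual output columns. The function name format_tokens and its parameters are hypothetical, and the sketch assumes that the delimiter plays no role when each token is written to its own row.

```python
def format_tokens(tokens, output_delimiter="/", output_by_word=False):
    """Return the output rows for one input text, as a list of strings.

    Hypothetical helper that mirrors the documented defaults:
    OutputDelimiter '/' and OutputByWord 'false'.
    """
    if output_by_word:
        # One token in each output row.
        return list(tokens)
    # One line of text in each output row, tokens joined by the delimiter.
    return [output_delimiter.join(tokens)]


tokens = ["Teradata", "Vantage", "tokenizes", "text"]
print(format_tokens(tokens))                        # one row, '/' between tokens
print(format_tokens(tokens, output_delimiter="|"))  # one row, '|' between tokens
print(format_tokens(tokens, output_by_word=True))   # four rows, one token each
```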

Accumulate

[Optional] Specify the names of the InputTable columns to copy to the output table.

UserDictionaryFile

[Optional] Specify the name of the user dictionary file to use to correct the results produced by the model. If you specify both this syntax element and a dictionary table (Dict), the function uses the union of user_dictionary_file and Dict as its dictionary. TextTokenizer Input describes the format of user_dictionary_file and Dict.

If the function finds more than one matching term, it selects the longest term for the first match.
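
The union and longest-term rules can be illustrated with forward maximum matching, a common dictionary-segmentation technique. This page does not say how ML Engine resolves competing matches beyond the sentence above, so the following Python sketch is one plausible reading and not the function's implementation. The function name longest_match_tokenize, the sample terms, and the single-character fallback are assumptions.

```python
def longest_match_tokenize(text, dictionary):
    """Forward maximum matching: at each position, take the longest dictionary term.

    Hypothetical helper; characters that start no dictionary term are
    emitted as single-character tokens (an assumption of this sketch).
    """
    max_len = max((len(term) for term in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback when no dictionary term starts here
        # Try the longest candidate first and shrink until a term matches.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical terms: one set stands in for user_dictionary_file, one for Dict.
user_file_terms = {"北京"}
dict_table_terms = {"北京大学"}
dictionary = user_file_terms | dict_table_terms  # union of both sources

# Both "北京" and "北京大学" match at the start; the longer term is selected.
print(longest_match_tokenize("北京大学生", dictionary))
```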