TextTokenizer Input - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10

Input Table Schema

Column             Data Type  Description
text_column        VARCHAR    Text to tokenize.
accumulate_column  Any        [Column appears once for each specified accumulate_column.] Column to copy to output table.

Dict Schema

The Dict table is optional.

Column  Data Type  Description
entry   VARCHAR    Dictionary entry.

Dictionary Table and User Dictionary File Format

This table describes the format of both the dictionary table (Dict) and the user dictionary file (specified by the UserDictionaryFile syntax element).

Language             Format
Chinese and English  One dictionary word on each line.
Japanese             A dictionary entry consists of the following comma-separated words:

                     word: The original word.
                     tokenized_word: The tokenized form of the word.
                     reading: The reading of word in Katakana.
                     pos: The part-of-speech of the word.

                     For example:

                     成田空港,成田空港,ナリタクウコウ,カスタム名詞
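
The four-field Japanese entry format can be checked before a dictionary is loaded. The following is a minimal Python sketch, not part of the product: the names JapaneseDictEntry, parse_japanese_entry, and format_japanese_entry are hypothetical helpers, and the sketch assumes that fields contain no embedded commas and that surrounding whitespace can be trimmed, neither of which this page specifies.

```python
from typing import NamedTuple


class JapaneseDictEntry(NamedTuple):
    """One Japanese dictionary entry (hypothetical helper type).

    Field names follow the format description above.
    """
    word: str            # the original word
    tokenized_word: str  # the tokenized form of the word
    reading: str         # the reading of word in Katakana
    pos: str             # the part of speech of the word


def parse_japanese_entry(line: str) -> JapaneseDictEntry:
    """Split one comma-separated dictionary line into its four fields.

    Assumption: fields never contain commas, so a plain split is enough.
    """
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) != 4:
        raise ValueError(
            f"expected 4 comma-separated fields, got {len(fields)}: {line!r}"
        )
    if any(not field for field in fields):
        raise ValueError(f"empty field in entry: {line!r}")
    return JapaneseDictEntry(*fields)


def format_japanese_entry(entry: JapaneseDictEntry) -> str:
    """Join the four fields back into one dictionary line."""
    return ",".join(entry)


# The example entry from the format description above.
entry = parse_japanese_entry("成田空港,成田空港,ナリタクウコウ,カスタム名詞")
print(entry.reading)  # ナリタクウコウ
```

Running each line of a Japanese dictionary through a check like this before loading it catches entries with missing or extra fields. Chinese and English dictionaries need no such parsing, because each line holds a single word.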