1.1 - 8.10 - TextTokenizer Input - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.1, 8.10
Published: October 2019
Content Type: Programming Reference
Publication ID: B700-4003-079K
Language: English (United States)

Input Table Schema

Column | Data Type | Description
text_column | VARCHAR | Text to tokenize.
accumulate_column | Any | [Column appears once for each specified accumulate_column.] Column to copy to output table.

Dict Schema

This table is optional.

Column | Data Type | Description
entry | VARCHAR | Dictionary entry.
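
As a minimal sketch of what the Dict table holds, the following Python models each dictionary entry as a one-column row keyed by entry. The helper name to_dict_rows is hypothetical, and trimming whitespace and skipping blank entries are assumptions made for illustration; the source does not specify either behavior.

```python
from typing import Dict, Iterable, List


def to_dict_rows(entries: Iterable[str]) -> List[Dict[str, str]]:
    """Turn dictionary entries into rows with a single 'entry' column.

    Hypothetical helper: trimming whitespace and skipping blank
    entries are assumptions, not documented TextTokenizer behavior.
    """
    rows = []
    for raw in entries:
        entry = raw.strip()
        if entry:  # ignore empty lines
            rows.append({"entry": entry})
    return rows


# Build rows for two dictionary words, with one blank line in between.
rows = to_dict_rows(["Teradata", "", "Vantage"])
```

Each resulting row has the same shape as one row of the Dict table: one VARCHAR value in the entry column.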

Dictionary Table and User Dictionary File Format

This table describes the format of both the dictionary table (Dict) and the user dictionary file (specified by the UserDictionaryFile syntax element).

Language | Format
Chinese and English | One dictionary word on each line.
Japanese | A dictionary entry consists of the following comma-separated words:

    word: The original word.
    tokenized_word: The tokenized form of the word.
    reading: The reading of the word in Katakana.
    pos: The part of speech of the word.

    For example:

    成田空港,成田空港,ナリタクウコウ,カスタム名詞
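
The Japanese entry format above can be illustrated with a short Python sketch that splits one entry into its four fields. The names JapaneseEntry and parse_japanese_entry are hypothetical, and rejecting entries that do not have exactly four fields is an assumption; the source does not say how TextTokenizer handles malformed entries.

```python
from typing import NamedTuple


class JapaneseEntry(NamedTuple):
    """One Japanese dictionary entry (field names follow the format above)."""
    word: str            # the original word
    tokenized_word: str  # the tokenized form of the word
    reading: str         # the reading of the word in Katakana
    pos: str             # the part of speech of the word


def parse_japanese_entry(line: str) -> JapaneseEntry:
    """Split a comma-separated entry into its four fields.

    Assumption: an entry must have exactly four fields; anything
    else raises ValueError in this sketch.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(
            f"expected 4 comma-separated fields, got {len(fields)}"
        )
    return JapaneseEntry(*fields)


entry = parse_japanese_entry("成田空港,成田空港,ナリタクウコウ,カスタム名詞")
print(entry.reading)  # ナリタクウコウ
```

For Chinese and English dictionaries no such parsing is needed, since each line is a single dictionary word.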