TextTokenizer Input - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product	Teradata Vantage
Release Number	8.10, 1.1
Published	October 2019
Language	English (United States)
Last Update	2019-12-31

dita:id	B700-4003

Input Table Schema

Column	Data Type	Description
text_column	VARCHAR	Text to tokenize.
accumulate_column	Any	[Column appears once for each specified accumulate_column.] Column to copy to output table.

Dict Schema

This table is optional.

Column	Data Type	Description
entry	VARCHAR	Dictionary entry.

Dictionary Table and User Dictionary File Format

This table describes the format of both the dictionary table (Dict) and the user dictionary file (specified by the UserDictionaryFile syntax element).

Language	Format
Chinese and English	One dictionary word on each line.
Japanese	A dictionary entry consists of the following comma-separated fields, in this order: word (the original word), tokenized_word (the tokenized form of the word), reading (the reading of the word in Katakana), pos (the part of speech of the word). For example: 成田空港,成田空港,ナリタクウコウ,カスタム名詞
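
The two dictionary formats described above can be sketched in a few lines of Python. This is only an illustration for preparing or checking dictionary content before loading it, not part of the TextTokenizer function itself. The helper names `parse_japanese_entry` and `format_word_list` are hypothetical, and the sketch assumes that fields never contain commas or quoting, which the format description does not address.

```python
from typing import NamedTuple


class JapaneseEntry(NamedTuple):
    """One Japanese dictionary entry; field names follow the format table."""
    word: str            # the original word
    tokenized_word: str  # the tokenized form of the word
    reading: str         # the reading of the word in Katakana
    pos: str             # the part of speech of the word


def parse_japanese_entry(line: str) -> JapaneseEntry:
    """Split one comma-separated entry into its four fields (hypothetical helper).

    Assumption: no field contains a comma, so a plain split is sufficient.
    """
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != 4 or not all(fields):
        raise ValueError(f"expected 4 non-empty comma-separated fields: {line!r}")
    return JapaneseEntry(*fields)


def format_word_list(words: list[str]) -> str:
    """Render Chinese or English dictionary words, one word per line (hypothetical helper)."""
    cleaned = [w.strip() for w in words if w.strip()]
    return "".join(w + "\n" for w in cleaned)


# The Japanese example entry from the format table.
entry = parse_japanese_entry("成田空港,成田空港,ナリタクウコウ,カスタム名詞")
print(entry.word, entry.reading, entry.pos)

# A Chinese or English dictionary is simply one word per line.
print(format_word_list(["tokenizer", "Teradata"]), end="")
```

Rejecting entries that do not have exactly four non-empty fields is a design choice of this sketch; it makes malformed lines visible before they reach the dictionary table or file.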
