TextTokenizer Example 1: Chinese Tokenization

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.00, 1.0
Published: May 2019
Language: English (United States)
Last Update: 2019-11-22

Input

Input Table: cn_input
id	txt
t1	我从小就不由自主地认为自己长大以后一定得成为一个象我父亲一样的画家, 可能是父母潜移默化的影响。
t2	中华人民共和国辽宁省铁岭市靠山屯村支书赵本山。

Dictionary Table (dict): cn_dict
txt
辽宁省铁岭市靠山屯村
赵本山
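
The two tables above could be set up with statements along the following lines. This is only a sketch: the column types, lengths, and the UNICODE character set are assumptions for illustration and are not taken from this reference.

```sql
-- Assumed schema: column types and lengths are illustrative only.
CREATE MULTISET TABLE cn_input (
  id  VARCHAR(10),
  txt VARCHAR(1000) CHARACTER SET UNICODE
);

INSERT INTO cn_input VALUES ('t1', '我从小就不由自主地认为自己长大以后一定得成为一个象我父亲一样的画家, 可能是父母潜移默化的影响。');
INSERT INTO cn_input VALUES ('t2', '中华人民共和国辽宁省铁岭市靠山屯村支书赵本山。');

-- Dictionary table: one user-defined word per row (assumed single-column layout, as shown above).
CREATE MULTISET TABLE cn_dict (
  txt VARCHAR(100) CHARACTER SET UNICODE
);

INSERT INTO cn_dict VALUES ('辽宁省铁岭市靠山屯村');
INSERT INTO cn_dict VALUES ('赵本山');
```

The session character set must be able to carry the Chinese literals (for example, a UTF-8 session) for these inserts to store the text intact.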

SQL Call 1

SELECT * FROM TextTokenizer (
  ON cn_input AS "input" PARTITION BY ANY
  ON cn_dict AS dict DIMENSION
  USING
  InputLanguage ('zh_CN')
  OutputDelimiter (' ')
  OutputByWord ('false')
  Accumulate ('id')
  TextColumn ('txt')
) AS dt ORDER BY id;

Output 1

id	token
t1	我从小就不由自主地认为自己长大以后一定得成为一个象我父亲一样的画家 , 可能是父母潜移默化的影响。
t2	中华人民共和国辽宁省铁岭市靠山屯村支书赵本山。

SQL Call 2

SELECT * FROM TextTokenizer (
  ON cn_input AS "input" PARTITION BY ANY
  ON cn_dict AS dict DIMENSION
  USING
  InputLanguage ('zh_CN')
  OutputByWord ('true')
  Accumulate ('id')
  TextColumn ('txt')
) AS dt ORDER BY id;

Output 2

id	sn	token
t1	1	我
t1	2	从小
t1	3	就
t1	4	不由自主
...	...	...
t2	1	中华人民共和国
t2	2	辽宁省
...	...	...
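
Because OutputByWord ('true') returns one row per token, the call can be wrapped in ordinary SQL. The following sketch counts tokens per input row; it assumes the output columns are named id, sn, and token as in Output 2, and that the function call may be aggregated like any other derived table.

```sql
-- Sketch: number of tokens produced for each input row.
-- Assumes the TextTokenizer output can be grouped like a derived table.
SELECT id, COUNT(*) AS token_count
FROM TextTokenizer (
  ON cn_input AS "input" PARTITION BY ANY
  ON cn_dict AS dict DIMENSION
  USING
  InputLanguage ('zh_CN')
  OutputByWord ('true')
  Accumulate ('id')
  TextColumn ('txt')
) AS dt
GROUP BY id
ORDER BY id;
```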