TextTokenizer Example 3: English Tokenization

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.0, 8.00
Release Date: May 2019
Content Type: Programming Reference
Publication ID: B700-4003-098K
Language: English (United States)

Input

The input table, complaints, is a log of vehicle complaints. The category column indicates whether the car was in a crash (crash or no_crash).

complaints
doc_id | text_data | category
1 | consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. | crash
2 | when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. | crash
3 | consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. | no_crash
... | ... | ...

SQL Call

SELECT * FROM TextTokenizer (
  ON complaints AS "input" PARTITION BY ANY
  USING
  InputLanguage ('en')
  OutputDelimiter (' ')
  OutputByWord ('true')
  Accumulate ('doc_id')
  TextColumn ('text_data')
) AS dt ORDER BY doc_id, sn, token;

Output

doc_id sn token
1 1 consumer
1 2 was
1 3 driving
1 4 approximately
1 5 45
1 6 mph
1 7 hit
1 8 a
1 9 deer
1 10 with
1 11 the
1 12 front
... ... ...
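
The output above has one row per token, with sn giving the token's position within its document. As a rough illustration of that shape only, the following Python sketch splits the opening words of the doc_id 1 complaint on whitespace and numbers them from 1. The function name tokenize_rows is hypothetical, and a plain whitespace split is an assumption made for this sketch: it is not how TextTokenizer is implemented, and it may differ from the function's results on punctuation, hyphens, and apostrophes.

```python
# Illustrative sketch only: approximates the (doc_id, sn, token) output shape
# with a plain whitespace split. This is NOT the TextTokenizer implementation.

def tokenize_rows(doc_id, text):
    """Return one (doc_id, sn, token) tuple per whitespace-separated word."""
    # sn starts at 1, matching the serial numbers in the Output table.
    return [(doc_id, sn, token)
            for sn, token in enumerate(text.split(), start=1)]


# Opening words of the doc_id 1 complaint from the Input table.
sample = "consumer was driving approximately 45 mph hit a deer with the front"

rows = tokenize_rows(1, sample)
for row in rows:
    print(*row)
```

For this punctuation-free fragment, the sketch yields the same twelve doc_id, sn, token rows listed in the Output table. Text with punctuation would need the real function to see how tokens are actually split.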