1.0 - 8.00 - LDA Example - Teradata Vantage

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.0, 8.00
Release Date: May 2019
Content Type: Programming Reference
Publication ID: B700-4003-098K
Language: English (United States)

Input

The input (training) table is a log of vehicle complaints. The category column indicates whether the car has been in a crash.

InputTable: complaints
doc_id | text_data | category
1 | consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. | crash
2 | when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. | crash
3 | consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. | no_crash
... | ... | ...

The stop words file, stopwords.txt, contains:

a
an
in
is
to
into
was
the
and
this
with
they
but
will

To create a tokenized, filtered input table for the LDA function, apply the TextParser function to the training table:

CREATE MULTISET TABLE complaints_traintoken AS (
  SELECT * FROM TextParser (
    ON complaints
    USING
    TextColumn ('text_data')
    ToLowerCase ('true')
    Stemming ('false')
    ListPositions ('true')
    StopWords ('stopwords.txt')
    RemoveStopWords ('true')
    Accumulate ('doc_id', 'category')
  ) AS dt
) WITH DATA;

This query returns the following table:

SELECT * FROM complaints_traintoken ORDER BY doc_id;
complaints_traintoken
doc_id category token frequency position
1 crash consumer 1 0
1 crash driving 1 2
1 crash approximately 1 3
1 crash 45 1 4
1 crash mph 1 5
1 crash hit 2 6,26
1 crash deer 1 8
1 crash front 1 11
1 crash bumper 1 12
1 crash then 1 14
1 crash ran 1 15
1 crash embankment 1 18
1 crash head-on 1 19
1 crash passenger's 1 20
1 crash side 2 21,32
... ... ... ... ...
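
The position values in this output (for example, hit at 6,26 and side at 21,32) are consistent with numbering every whitespace-separated word from 0 before the stop words are dropped. The Python sketch below mimics that behavior for document 1. It is an illustration only, not the TextParser implementation: the tokenize helper, the DOC1 and STOP_WORDS names, and the punctuation-stripping rule are assumptions made for this example.

```python
from collections import defaultdict

# Stop words copied from the stop words list shown earlier.
STOP_WORDS = {"a", "an", "in", "is", "to", "into", "was", "the",
              "and", "this", "with", "they", "but", "will"}

# text_data of doc_id 1 from the complaints table.
DOC1 = (
    "consumer was driving approximately 45 mph hit a deer with the front "
    "bumper and then ran into an embankment head-on passenger's side air "
    "bag did deploy hit windshield and deployed outward. driver's side "
    "airbag cover opened but did not inflate it was still folded causing "
    "injuries."
)

def tokenize(text, stop_words=STOP_WORDS):
    """Return (token, frequency, positions) rows for one document.

    Assumption: words are split on whitespace, lowercased, and stripped of
    leading/trailing punctuation; positions count stop words too.
    """
    positions = defaultdict(list)
    for pos, raw in enumerate(text.lower().split()):
        token = raw.strip(".,;:!?()")
        if token and token not in stop_words:
            positions[token].append(pos)
    return [(tok, len(p), ",".join(map(str, p)))
            for tok, p in positions.items()]

for token, frequency, position in tokenize(DOC1)[:15]:
    print(token, frequency, position)
```

Because the position counter advances on stop words as well, "was" (position 1) leaves a gap between consumer (0) and driving (2), which is the same gap visible in the table.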

SQL Call

SELECT * FROM LDA (
  ON complaints_traintoken AS InputTable
  OUT TABLE ModelTable (ldamodel)
  OUT TABLE OutputTable (ldaout1)
  USING
  TopicNum (5)
  DocIDColumn ('doc_id')
  WordColumn ('token')
  CountColumn ('frequency')
  MaxIterNum (30)
  ConvergenceDelta (1e-3)
  Seed (2)
) AS dt;

Output

message
Outputtable is created successfully.
Training converged after 7 iterate steps with delta 1.4006041962568355E-4
There are 20 documents with 520 words in the training set, the perplexity is 92.070615
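
Perplexity is a common measure of how well a topic model predicts the words in a corpus; lower values indicate a better fit. The Python sketch below uses the usual textbook definition, exp(-(1/N) * sum(log p(w))), over N word occurrences. Whether the LDA function computes its reported value in exactly this way is an assumption, and the probabilities are hypothetical values chosen for illustration, not output of the function.

```python
import math

def perplexity(word_probs):
    """Textbook perplexity: exp of the negative mean log-probability.

    word_probs holds the model's probability for each word occurrence.
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical model that assigns every word occurrence probability 1/50.
# A uniform model over 50 choices has a perplexity of about 50.
uniform_probs = [1 / 50] * 520
print(perplexity(uniform_probs))
```

Read this way, a perplexity near 92 means the model is, on average, about as uncertain as a uniform choice among roughly 92 words.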

This query returns the following table:

SELECT * FROM ldaout1 ORDER BY docid, topicid;
ldaout1
docid topicid topicweight
1 0 0.00442669824335036
1 1 0.00364972124026978
1 2 0.00313760355154859
1 3 0.985083744464884
1 4 0.00370223249994785
2 0 0.00333404274412358
2 1 0.00272130082493761
2 2 0.00322554604431533
2 3 0.986818406648743
... ... ...
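
In this excerpt, the five weights for document 1 sum to approximately 1, and topic 3 carries almost all of the weight for both documents shown. The Python sketch below checks this on the rows copied from the excerpt. The helper names are assumptions made for this example, and the row for document 2, topic 4 is elided in the excerpt, so it is left out rather than guessed.

```python
from collections import defaultdict

# (docid, topicid, topicweight) rows copied from the ldaout1 excerpt.
# The row for docid 2, topicid 4 is elided in the excerpt and not filled in.
ROWS = [
    (1, 0, 0.00442669824335036),
    (1, 1, 0.00364972124026978),
    (1, 2, 0.00313760355154859),
    (1, 3, 0.985083744464884),
    (1, 4, 0.00370223249994785),
    (2, 0, 0.00333404274412358),
    (2, 1, 0.00272130082493761),
    (2, 2, 0.00322554604431533),
    (2, 3, 0.986818406648743),
]

def dominant_topics(rows):
    """Map each docid to its (topicid, topicweight) with the largest weight."""
    best = {}
    for docid, topicid, weight in rows:
        if docid not in best or weight > best[docid][1]:
            best[docid] = (topicid, weight)
    return best

def weight_totals(rows):
    """Sum the topic weights per docid."""
    totals = defaultdict(float)
    for docid, _topicid, weight in rows:
        totals[docid] += weight
    return dict(totals)

for docid, (topicid, weight) in sorted(dominant_topics(ROWS).items()):
    print(docid, topicid, weight)
```

Document 2's dominant topic can be read off even with one row missing, because a weight above 0.5 cannot be exceeded by any remaining topic.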