LDA Example - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.00

1.0

Published

May 2019

Language

English (United States)

Last Update

2019-11-22

dita:id

B700-4003

Input

The input (training) table, complaints, is a log of vehicle complaints. The category column indicates whether the vehicle was involved in a crash (crash) or not (no_crash).

InputTable: complaints
doc_id	text_data	category
1	consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries.	crash
2	when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries.	crash
3	consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal.	no_crash
...	...	...

The stop words file, stopwords.txt, contains:

a
an
in
is
to
into
was
the
and
this
with
they
but
will

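To create a tokenized, filtered input table for the LDA function, apply the TextParser function to the training table. The query lowercases the text, removes the stop words listed in stopwords.txt, and carries the doc_id and category columns through to the output: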

CREATE MULTISET TABLE complaints_traintoken AS (
  SELECT * FROM TextParser (
    ON complaints
    USING
    TextColumn ('text_data')
    ToLowerCase ('true')
    Stemming ('false')
    ListPositions ('true')
    StopWords ('stopwords.txt')
    RemoveStopWords ('true')
    Accumulate ('doc_id', 'category')
  ) AS dt
) WITH DATA;

This query returns the following table:

SELECT * FROM complaints_traintoken ORDER BY doc_id;

complaints_traintoken
doc_id	category	token	frequency	position
1	crash	consumer	1	0
1	crash	driving	1	2
1	crash	approximately	1	3
1	crash	45	1	4
1	crash	mph	1	5
1	crash	hit	2	6,26
1	crash	deer	1	8
1	crash	front	1	11
1	crash	bumper	1	12
1	crash	then	1	14
1	crash	ran	1	15
1	crash	embankment	1	18
1	crash	head-on	1	19
1	crash	passenger's	1	20
1	crash	side	2	21,32
...	...	...	...	...
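
Before training, it can help to confirm that stop-word removal left meaningful tokens at the top of the vocabulary. The following query is a minimal sketch, not part of the original example; it uses only the complaints_traintoken columns shown above, and the aliases total_count and doc_count are illustrative names.

```sql
-- List the ten most frequent tokens across all documents.
-- total_count: occurrences of the token in the whole training set.
-- doc_count:   number of documents that contain the token.
SELECT TOP 10
  token,
  SUM(frequency)         AS total_count,
  COUNT(DISTINCT doc_id) AS doc_count
FROM complaints_traintoken
GROUP BY token
ORDER BY total_count DESC, token;
```

If words such as "the" or "was" appear in the result, the StopWords and RemoveStopWords arguments are worth rechecking.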

SQL Call

SELECT * FROM LDA (
  ON complaints_traintoken AS InputTable
  OUT TABLE ModelTable (ldamodel)
  OUT TABLE OutputTable (ldaout1)
  USING
  TopicNum (5)
  DocIDColumn ('doc_id')
  WordColumn ('token')
  CountColumn ('frequency')
  MaxIterNum (30)
  ConvergenceDelta (1e-3)
  Seed (2)
) AS dt;

Output

message
Outputtable is created successfully. Training converged after 7 iterate steps with delta 1.4006041962568355E-4 There are 20 documents with 520 words in the training set, the perplexity is 92.070615
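
The message reports the number of documents and words in the training set. One way to compare those figures with the input is sketched below. It assumes that the word count reported by LDA corresponds to the sum of the frequency column, which this example does not confirm; the aliases are illustrative names.

```sql
-- Count the documents and total token occurrences in the LDA input table.
-- Assumption: LDA counts "words" as the sum of CountColumn values.
SELECT
  COUNT(DISTINCT doc_id) AS document_count,
  SUM(frequency)         AS word_count
FROM complaints_traintoken;
```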

This query returns the following table:

SELECT * FROM ldaout1 ORDER BY docid, topicid;

ldaout1
docid	topicid	topicweight
1	0	0.00442669824335036
1	1	0.00364972124026978
1	2	0.00313760355154859
1	3	0.985083744464884
1	4	0.00370223249994785
2	0	0.00333404274412358
2	1	0.00272130082493761
2	2	0.00322554604431533
2	3	0.986818406648743
...	...	...
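
Each row of ldaout1 gives the weight of one topic in one document. In the rows shown, topic 3 dominates documents 1 and 2, and the five weights for document 1 sum to approximately 1. The queries below are a sketch of how to check this for every document and how to compare dominant topics with the category column of the complaints table. They are not part of the original example, and the aliases weight_sum, dominant_topic, and doc_count are illustrative names.

```sql
-- Check that the topic weights of each document sum to approximately 1.
SELECT docid, SUM(topicweight) AS weight_sum
FROM ldaout1
GROUP BY docid
ORDER BY docid;

-- Keep only the highest-weighted topic for each document.
SELECT docid, topicid AS dominant_topic, topicweight
FROM ldaout1
QUALIFY ROW_NUMBER() OVER (PARTITION BY docid ORDER BY topicweight DESC) = 1
ORDER BY docid;

-- Count documents per category and dominant topic.
SELECT c.category, t.dominant_topic, COUNT(*) AS doc_count
FROM (
  SELECT docid, topicid AS dominant_topic
  FROM ldaout1
  QUALIFY ROW_NUMBER() OVER (PARTITION BY docid ORDER BY topicweight DESC) = 1
) AS t
JOIN complaints AS c
  ON c.doc_id = t.docid
GROUP BY c.category, t.dominant_topic
ORDER BY c.category, t.dominant_topic;
```

LDA is unsupervised, so topics need not align with the crash and no_crash labels; the last query only shows how the two happen to relate in this data.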