LDA Example - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
dita:mapPath
ima1540829771750.ditamap
dita:ditavalPath
jsj1481748799576.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

Input

The input (training) table is log of vehicle complaints. The category column indicates whether the car has been in a crash.

InputTable: complaints
doc_id text_data category
1 consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. crash
2 when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. crash
3 consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. no_crash
... ... ...

The stop words table, stopwords.text, contains:

a
an
in
is
to
into
was
the
and
this
with
they
but
will

To create a tokenized, filtered input file for the LDA function, apply the function TextParser (ML Engine) to the training table:

CREATE MULTISET TABLE complaints_traintoken AS (
  SELECT * FROM TextParser (
    ON complaints
    USING
    TextColumn ('text_data')
    ConvertToLowerCase ('true')
    StemTokens ('false')
    ListPositions ('true')
    StopWordsList ('stopwords.txt')
    RemoveStopWords ('true')
    Accumulate ('doc_id', 'category')
  ) AS dt
) WITH DATA;

This query returns the following table:

SELECT * FROM complaints_traintoken ORDER BY doc_id;
complaints_traintoken
doc_id category token frequency position
1 crash consumer 1 0
1 crash driving 1 2
1 crash approximately 1 3
1 crash 45 1 4
1 crash mph 1 5
1 crash hit 2 6,26
1 crash deer 1 8
1 crash front 1 11
1 crash bumper 1 12
1 crash then 1 14
1 crash ran 1 15
1 crash embankment 1 18
1 crash head-on 1 19
1 crash passenger's 1 20
1 crash side 2 21,32
... ... ... ... ...

SQL Call

SELECT * FROM LDA (
  ON complaints_traintoken AS InputTable
  OUT TABLE ModelTable (ldamodel)
  OUT TABLE OutputTable (ldaout1)
  USING
  TopicNum (5)
  DocIDColumn ('doc_id')
  WordColumn ('token')
  CountColumn ('frequency')
  MaxIterNum (30)
  StopThreshold (1e-3)
  Seed (2)
) AS dt;

Output

 message                                                                                
 -------------------------------------------------------------------------------------- 
 Outputtable  is created successfully.                                                 
 Training converged after 7 iterate steps with delta 2.1420824926263047E-4             
 There are 20 documents with 520 words in the training set, the perplexity is 92.016370
SELECT * FROM ldaout1 ORDER BY docid, topicid;
 docid topicid topicweight           
 ----- ------- --------------------- 
     1       0  0.004426429226493859
     1       1  0.003656317158244291
     1       2   0.00313758179610592
     1       3    0.9850922760656984
     1       4  0.003687395753457743
     2       0  0.003333337974833768
     2       1 0.0026715880002218914
     2       2  0.003225645026950738
     2       3    0.9867880696484165
     2       4  0.003981359349576945
     3       0  0.006268581445977316
     3       1    0.9815551630033772
     3       2  0.004417571386232821
     3       3  0.003794628444713808
     3       4  0.003964055719699033
     4       0  0.010128051883887499
     4       1 0.0046614669429246676
     4       2    0.9758978544852696
     4       3  0.004290931289345636
     4       4  0.005021695398572808
     5       0 0.0034977306955634277
     5       1  0.002731383851625112
     5       2    0.9884815205074873
     5       3 0.0026545373864259193
     5       4  0.002634827558898238
     6       0  0.004910519510246177
     6       1    0.9818335529855468
     6       2 0.0047878536635098294
     6       3 0.0040427451392959415
     6       4   0.00442532870140121
     7       0    0.9820521861311798
     7       1  0.004062894585215896
     7       2  0.005145745286342357
     7       3  0.004845509758029061
     7       4 0.0038936642392328386
     8       0    0.9531957714532828
     8       1   0.01266691837738069
     8       2   0.01103344176680826
     8       3  0.012735425645888912
     8       4  0.010368442756639208
     9       0  0.007084998039507431
     9       1    0.9674762748316937
     9       2  0.009365407141105342
     9       3  0.009745662251613264
     9       4  0.006327657736080222
    10       0  0.007116616036670699
    10       1  0.004744801805922887
    10       2    0.9776204960574149
    10       3  0.005743722972250647
    10       4  0.004774363127740984
    11       0  0.004760046611134511
    11       1  0.004469826914975781
    11       2 0.0045464562180395195
    11       3  0.004484861188135865
    11       4    0.9817388090677145
    12       0  0.007265823196617461
    12       1  0.007964358825664323
    12       2  0.008114234945516906
    12       3  0.015620513156623533
    12       4    0.9610350698755779
    13       0   0.00426428406119381
    13       1  0.003902223649627741
    13       2  0.004234437785553248
    13       3  0.004507123835691691
    13       4    0.9830919306679337
    14       0    0.9650541555578738
    14       1    0.0074391614441562
    14       2  0.011274607436926492
    14       3  0.006893717562493168
    14       4  0.009338357998550267
    15       0  0.003897590296516705
    15       1    0.9832303090018338
    15       2  0.004202391731834754
    15       3  0.004037833753883089
    15       4  0.004631875215931837
    16       0  0.005524228354277319
    16       1  0.011413630319077005
    16       2    0.9701109801988259
    16       3  0.006336964722200615
    16       4  0.006614196405619184
    17       0 0.0030109395355278363
    17       1  0.003122351795972726
    17       2 0.0031409531638286127
    17       3    0.9876271092587618
    17       4 0.0030986462459095198
    18       0   0.03489774179008564
    18       1  0.031696122828725706
    18       2    0.8692684529329384
    18       3   0.03166369087468641
    18       4   0.03247399157356371
    19       0   0.02324181954633177
    19       1  0.013579883713331695
    19       2   0.02394012985969003
    19       3   0.01381761167230265
    19       4    0.9254205552083438
    20       0  0.007582282026547196
    20       1    0.2972918395246684
    20       2  0.009009173996895745
    20       3  0.008822007937856105
    20       4    0.6772946965140326

Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.