Input
The input (training) table is log of vehicle complaints. The category column indicates whether the car has been in a crash.
doc_id | text_data | category |
---|---|---|
1 | consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. | crash |
2 | when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. | crash |
3 | consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. | no_crash |
... | ... | ... |
The stop words table, stopwords.text, contains:
a an in is to into was the and this with they but will
To create a tokenized, filtered input file for the LDA function, apply the function TextParser (ML Engine) to the training table:
CREATE MULTISET TABLE complaints_traintoken AS ( SELECT * FROM TextParser ( ON complaints USING TextColumn ('text_data') ConvertToLowerCase ('true') StemTokens ('false') ListPositions ('true') StopWordsList ('stopwords.txt') RemoveStopWords ('true') Accumulate ('doc_id', 'category') ) AS dt ) WITH DATA;
This query returns the following table:
SELECT * FROM complaints_traintoken ORDER BY doc_id;
doc_id | category | token | frequency | position |
---|---|---|---|---|
1 | crash | consumer | 1 | 0 |
1 | crash | driving | 1 | 2 |
1 | crash | approximately | 1 | 3 |
1 | crash | 45 | 1 | 4 |
1 | crash | mph | 1 | 5 |
1 | crash | hit | 2 | 6,26 |
1 | crash | deer | 1 | 8 |
1 | crash | front | 1 | 11 |
1 | crash | bumper | 1 | 12 |
1 | crash | then | 1 | 14 |
1 | crash | ran | 1 | 15 |
1 | crash | embankment | 1 | 18 |
1 | crash | head-on | 1 | 19 |
1 | crash | passenger's | 1 | 20 |
1 | crash | side | 2 | 21,32 |
... | ... | ... | ... | ... |
SQL Call
SELECT * FROM LDA ( ON complaints_traintoken AS InputTable OUT TABLE ModelTable (ldamodel) OUT TABLE OutputTable (ldaout1) USING TopicNum (5) DocIDColumn ('doc_id') WordColumn ('token') CountColumn ('frequency') MaxIterNum (30) StopThreshold (1e-3) Seed (2) ) AS dt;
Output
message -------------------------------------------------------------------------------------- Outputtable is created successfully. Training converged after 7 iterate steps with delta 2.1420824926263047E-4 There are 20 documents with 520 words in the training set, the perplexity is 92.016370
SELECT * FROM ldaout1 ORDER BY docid, topicid;
docid topicid topicweight ----- ------- --------------------- 1 0 0.004426429226493859 1 1 0.003656317158244291 1 2 0.00313758179610592 1 3 0.9850922760656984 1 4 0.003687395753457743 2 0 0.003333337974833768 2 1 0.0026715880002218914 2 2 0.003225645026950738 2 3 0.9867880696484165 2 4 0.003981359349576945 3 0 0.006268581445977316 3 1 0.9815551630033772 3 2 0.004417571386232821 3 3 0.003794628444713808 3 4 0.003964055719699033 4 0 0.010128051883887499 4 1 0.0046614669429246676 4 2 0.9758978544852696 4 3 0.004290931289345636 4 4 0.005021695398572808 5 0 0.0034977306955634277 5 1 0.002731383851625112 5 2 0.9884815205074873 5 3 0.0026545373864259193 5 4 0.002634827558898238 6 0 0.004910519510246177 6 1 0.9818335529855468 6 2 0.0047878536635098294 6 3 0.0040427451392959415 6 4 0.00442532870140121 7 0 0.9820521861311798 7 1 0.004062894585215896 7 2 0.005145745286342357 7 3 0.004845509758029061 7 4 0.0038936642392328386 8 0 0.9531957714532828 8 1 0.01266691837738069 8 2 0.01103344176680826 8 3 0.012735425645888912 8 4 0.010368442756639208 9 0 0.007084998039507431 9 1 0.9674762748316937 9 2 0.009365407141105342 9 3 0.009745662251613264 9 4 0.006327657736080222 10 0 0.007116616036670699 10 1 0.004744801805922887 10 2 0.9776204960574149 10 3 0.005743722972250647 10 4 0.004774363127740984 11 0 0.004760046611134511 11 1 0.004469826914975781 11 2 0.0045464562180395195 11 3 0.004484861188135865 11 4 0.9817388090677145 12 0 0.007265823196617461 12 1 0.007964358825664323 12 2 0.008114234945516906 12 3 0.015620513156623533 12 4 0.9610350698755779 13 0 0.00426428406119381 13 1 0.003902223649627741 13 2 0.004234437785553248 13 3 0.004507123835691691 13 4 0.9830919306679337 14 0 0.9650541555578738 14 1 0.0074391614441562 14 2 0.011274607436926492 14 3 0.006893717562493168 14 4 0.009338357998550267 15 0 0.003897590296516705 15 1 0.9832303090018338 15 2 0.004202391731834754 15 3 0.004037833753883089 15 4 0.004631875215931837 16 0 0.005524228354277319 16 1 0.011413630319077005 16 2 0.9701109801988259 16 3 0.006336964722200615 16 4 0.006614196405619184 17 0 0.0030109395355278363 17 1 0.003122351795972726 17 2 0.0031409531638286127 17 3 0.9876271092587618 17 4 0.0030986462459095198 18 0 0.03489774179008564 18 1 0.031696122828725706 18 2 0.8692684529329384 18 3 0.03166369087468641 18 4 0.03247399157356371 19 0 0.02324181954633177 19 1 0.013579883713331695 19 2 0.02394012985969003 19 3 0.01381761167230265 19 4 0.9254205552083438 20 0 0.007582282026547196 20 1 0.2972918395246684 20 2 0.009009173996895745 20 3 0.008822007937856105 20 4 0.6772946965140326
Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.