Input
The input (training) table, complaints, is a log of vehicle complaints. The category column indicates whether the car has been in a crash.
doc_id | text_data | category |
---|---|---|
1 | consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. | crash |
2 | when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. | crash |
3 | consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. | no_crash |
... | ... | ... |
The stop words table, stopwords.txt, contains:
a an in is to into was the and this with they but will
To create a tokenized, filtered input file for the LDA function, apply the function TextParser to the training table:
CREATE MULTISET TABLE complaints_traintoken AS (
  SELECT * FROM TextParser (
    ON complaints
    USING
    TextColumn ('text_data')
    ToLowerCase ('true')
    Stemming ('false')
    ListPositions ('true')
    StopWords ('stopwords.txt')
    RemoveStopWords ('true')
    Accumulate ('doc_id', 'category')
  ) AS dt
) WITH DATA;
This query returns the following table:
SELECT * FROM complaints_traintoken ORDER BY doc_id;
doc_id | category | token | frequency | position |
---|---|---|---|---|
1 | crash | consumer | 1 | 0 |
1 | crash | driving | 1 | 2 |
1 | crash | approximately | 1 | 3 |
1 | crash | 45 | 1 | 4 |
1 | crash | mph | 1 | 5 |
1 | crash | hit | 2 | 6,26 |
1 | crash | deer | 1 | 8 |
1 | crash | front | 1 | 11 |
1 | crash | bumper | 1 | 12 |
1 | crash | then | 1 | 14 |
1 | crash | ran | 1 | 15 |
1 | crash | embankment | 1 | 18 |
1 | crash | head-on | 1 | 19 |
1 | crash | passenger's | 1 | 20 |
1 | crash | side | 2 | 21,32 |
... | ... | ... | ... | ... |
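To make the frequency and position columns easier to read, here is a minimal Python sketch of the same idea. It is not the TextParser implementation: it assumes tokens are split on whitespace, surrounding punctuation is stripped, and positions are zero-based word offsets counted before stop words are removed. The function name parse_text is hypothetical.

```python
import string
from collections import defaultdict

# Stop words copied from the stop words list shown above.
STOP_WORDS = set("a an in is to into was the and this with they but will".split())

def parse_text(text, stop_words=STOP_WORDS):
    """Return {token: (frequency, [positions])} for one document.

    Assumption: positions are zero-based offsets in the whitespace-split,
    lowercased text, counted before stop words are dropped.
    """
    positions = defaultdict(list)
    for pos, raw in enumerate(text.lower().split()):
        token = raw.strip(string.punctuation)  # "outward." -> "outward"
        if token and token not in stop_words:
            positions[token].append(pos)
    return {tok: (len(p), p) for tok, p in positions.items()}

# Text of doc_id 1 from the input table.
doc1 = (
    "consumer was driving approximately 45 mph hit a deer with the front "
    "bumper and then ran into an embankment head-on passenger's side air bag "
    "did deploy hit windshield and deployed outward. driver's side airbag "
    "cover opened but did not inflate it was still folded causing injuries."
)

tokens = parse_text(doc1)
print(tokens["hit"])   # frequency and positions for "hit"
print(tokens["side"])  # frequency and positions for "side"
```

Under these assumptions the sketch yields the same frequency and position values for "hit" and "side" as the rows for doc_id 1 above, and stop words such as "was" and "the" leave gaps in the position sequence.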
SQL Call
SELECT * FROM LDA (
  ON complaints_traintoken AS InputTable
  OUT TABLE ModelTable (ldamodel)
  OUT TABLE OutputTable (ldaout1)
  USING
  TopicNum (5)
  DocIDColumn ('doc_id')
  WordColumn ('token')
  CountColumn ('frequency')
  MaxIterNum (30)
  ConvergenceDelta (1e-3)
  Seed (2)
) AS dt;
Output
message |
---|
Outputtable is created successfully. Training converged after 7 iterate steps with delta 1.4006041962568355E-4 There are 20 documents with 520 words in the training set, the perplexity is 92.070615 |
This query returns the following table:
SELECT * FROM ldaout1 ORDER BY docid, topicid;
docid | topicid | topicweight |
---|---|---|
1 | 0 | 0.00442669824335036 |
1 | 1 | 0.00364972124026978 |
1 | 2 | 0.00313760355154859 |
1 | 3 | 0.985083744464884 |
1 | 4 | 0.00370223249994785 |
2 | 0 | 0.00333404274412358 |
2 | 1 | 0.00272130082493761 |
2 | 2 | 0.00322554604431533 |
2 | 3 | 0.986818406648743 |
... | ... | ... |
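One way to read this output is to take the topic with the largest topicweight as the dominant topic of each document. The Python sketch below does that for the rows shown above. It is illustrative only: the rows are copied by hand rather than read from ldaout1, and the topic 4 row for document 2 is elided in the listing, so it is left out here as well.

```python
from collections import defaultdict

# (docid, topicid, topicweight) rows copied from the output table above.
rows = [
    (1, 0, 0.00442669824335036),
    (1, 1, 0.00364972124026978),
    (1, 2, 0.00313760355154859),
    (1, 3, 0.985083744464884),
    (1, 4, 0.00370223249994785),
    (2, 0, 0.00333404274412358),
    (2, 1, 0.00272130082493761),
    (2, 2, 0.00322554604431533),
    (2, 3, 0.986818406648743),
    # The row for document 2, topic 4 is not shown in the source listing.
]

# Group the weights by document: {docid: {topicid: weight}}.
weights = defaultdict(dict)
for docid, topicid, weight in rows:
    weights[docid][topicid] = weight

# Dominant topic = topic with the largest weight in each document.
dominant = {docid: max(topics, key=topics.get) for docid, topics in weights.items()}

# Document 1 is listed in full, so its weights can be summed.
doc1_total = sum(weights[1].values())

print(dominant)
print(doc1_total)
```

For the rows shown, both documents are assigned to topic 3, and the five weights of document 1 sum to 1 up to floating-point rounding, which is consistent with each document's weights forming a distribution over the five topics.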