The training table is a log of vehicle complaints. The category column indicates whether the car has been in a crash.
doc_id | text_data | category |
---|---|---|
1 | consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. | crash |
2 | when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. | crash |
3 | consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. | no_crash |
... | ... | ... |
The stop words table, stopwords.txt, contains:
a an in is to into was the and this with they but will
To generate a tokenized, filtered input file for the LDATrainer function, apply the function Text_Parser to the training table:
SELECT * FROM Text_Parser (
  ON complaints
  TextColumn ('text_data')
  ToLowerCase ('true')
  Stemming ('false')
  Punctuation ('\[.,?\!\]')
  ListPositions ('true')
  StopWords ('stopwords.txt')
  RemoveStopWords ('true')
  Accumulate ('doc_id', 'category')
) ORDER BY doc_id;
The following query, run against the table complaints_traintoken that holds the Text_Parser output, returns the output shown in the table below:
SELECT * FROM complaints_traintoken ORDER BY doc_id;
doc_id | category | token | frequency | position |
---|---|---|---|---|
1 | crash | consumer | 1 | 0 |
1 | crash | driving | 1 | 2 |
1 | crash | approximately | 1 | 3 |
1 | crash | 45 | 1 | 4 |
1 | crash | mph | 1 | 5 |
1 | crash | hit | 2 | 6,26 |
1 | crash | deer | 1 | 8 |
1 | crash | front | 1 | 11 |
1 | crash | bumper | 1 | 12 |
1 | crash | then | 1 | 14 |
1 | crash | ran | 1 | 15 |
1 | crash | embankment | 1 | 18 |
1 | crash | head-on | 1 | 19 |
1 | crash | passenger's | 1 | 20 |
1 | crash | side | 2 | 21,32 |
... | ... | ... | ... | ... |
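To make the frequency and position columns concrete, here is a minimal Python sketch that approximates the tokenization for doc_id 1. It is not the Text_Parser implementation. It assumes that the punctuation characters . , ? ! are replaced with spaces, that text is split on whitespace, and that positions are counted before stop words are removed. The function name parse_text is hypothetical.

```python
import re

# Stop words as listed in the stop words table above.
STOP_WORDS = set("a an in is to into was the and this with they but will".split())

# Assumption: the characters . , ? ! are treated as punctuation and
# replaced with spaces before splitting on whitespace.
PUNCTUATION = re.compile(r"[.,?!]")


def parse_text(text, stop_words=STOP_WORDS):
    """Return {token: [positions]} for one document (hypothetical helper).

    Positions are zero-based word indexes in the lowercased text, counted
    before stop words are dropped, so removed words leave gaps.
    """
    words = PUNCTUATION.sub(" ", text.lower()).split()
    positions = {}
    for index, word in enumerate(words):
        if word in stop_words:
            continue
        positions.setdefault(word, []).append(index)
    return positions


doc_1 = (
    "consumer was driving approximately 45 mph hit a deer with the front "
    "bumper and then ran into an embankment head-on passenger's side air bag "
    "did deploy hit windshield and deployed outward. driver's side airbag "
    "cover opened but did not inflate it was still folded causing injuries."
)

tokens = parse_text(doc_1)

# Print token, frequency, and comma-separated positions, mirroring the
# token, frequency, and position columns of the output table.
for token, where in tokens.items():
    print(token, len(where), ",".join(str(p) for p in where))
```

Under these assumptions the sketch matches the rows shown for doc_id 1: for example, hit occurs twice, at positions 6 and 26, and the stop word was at position 1 is skipped, which is why driving appears at position 2.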