Using Text Analysis with Teradata Python Package

Teradata® Python Package User Guide

Brand: Teradata Vantage
Product: Teradata Python Package
Release: 16.20
Category: User Guide
Document ID: B700-4006-098K

This section works with a log of vehicle complaints, each labeled as crash-related ("crash") or not crash-related ("no_crash"). Use this log to build a Naïve Bayes Text Classifier model, and then apply the model to new log data to predict whether each complaint is associated with a crash.

It is assumed that the training and test datasets are already in the Teradata Database.
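
If the tables are not yet in the database, one possible way to load them is from a pandas DataFrame with the teradataml copy_to_sql function. The following is only a sketch under stated assumptions: the helper name load_complaints, the two-row excerpt, and the connection values are illustrative and not part of this guide, and if_exists="replace" overwrites any existing table of the same name.

```python
# Excerpt of the "complaints" rows shown in this section (doc_id, text_data, category).
SAMPLE_COMPLAINTS = [
    {
        "doc_id": 12,
        "text_data": "rear ended another vehicle at 65 to 70mph and neither driver's side "
                     "or passenger's side airbags deployed. dealer has vehicle.",
        "category": "crash",
    },
    {"doc_id": 18, "text_data": "sunroof is leaking.", "category": "no_crash"},
]


def load_complaints(rows, host, username, password, table_name="complaints"):
    """Hypothetical helper: copy rows into a Teradata table.

    host, username, and password are placeholders supplied by the caller.
    """
    # Imported lazily so this file can be read without a database client installed.
    import pandas as pd
    from teradataml import create_context, copy_to_sql, remove_context

    create_context(host=host, username=username, password=password)
    try:
        # if_exists="replace" drops and recreates the table if it already exists.
        copy_to_sql(df=pd.DataFrame(rows), table_name=table_name, if_exists="replace")
    finally:
        remove_context()
```

The test table "nbayes_test" could be loaded the same way by passing rows with only doc_id and text_data and a different table_name.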

The following table "complaints" contains the training dataset.

doc_id text_data category
1 consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. crash
2 when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. crash
3 consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. no_crash
4 transfer case was repaired under recall. after the work was completed noise was heard intermittently. consumer took vehicle back to dealer. the dealer re-inspected vehicle and informed the owner that the driveshaft was hitting the transfer case. the manufacturer has been notified. no_crash
5 transmission would start to slip when traveling just 10mph. the rpms would be over 3 thousand. had vehicle checked at dealership & was informed transmission was stuck & that it's a factory defect almost blew up. also speedometer does not keep accurate speeds. if speed is increased, it would fail to work. this was referred to mechanic by manufacturer. no_crash
6 due to the defective ignition cable which burned the coil the vehicle stalled unexpectedly which could have resulted in a crash. also dealer replaced the r&r drive belts/speed control cable and performed vehicle tune up. please provide further information. no_crash
7 when switch is turned on windshield wipers would not work properly. would have to jiggle switch & then wipers would move. wipers do turn off/on by themselves. recall 97v017000. no_crash
8 consumer was driving in a rain storm when the windshield wipers stopped this happened periodically. no_crash
9 at 66900 miles transmission has malfunctioned and will not shift into first gear. repairs were made at owner's expense wants reimbursement. *ml no_crash
10 when truck was sitting on an incline it rolled on its own. manufacturer was aware of the problem. problem has not been corrected. the truck is owned by walnut hill recker manufactured in 1998. no_crash
11 car engine raced while slowing to park. car lurched forward and crashed into a fence and a building. car had been in shop approximately one week prior to incident for high idle condition. crash
12 rear ended another vehicle at 65 to 70mph and neither driver's side or passenger's side airbags deployed. dealer has vehicle. crash
13 while vehicle was parked for an hour a fire started on the left side of the engine compartment. owners son smelled smoke owner saw fire coming from around drivers side front wheel. referenced in ea02-025 no_crash
14 after vehicle was repaired under recall 99v029000 ignition switch the airbag light stayed on . the dealer and the manufacturer has been notified. no_crash
15 electrical control module is shortening out causing the vehicle to stall. engine will become totally inoperative. consumer had to change alternator/ battery and starter and module replaced 4 times but defect still occurring cannot determine what is causing the problem. no_crash
16 at 68000 miles power steering broke off the housing pump causing total loss of power steering which also caused the vehicle to shut down. no_crash
17 on two occasions dual airbags did not deploy. consumer rear-ended another vehicle at approximately 50 mph and at 80 mph hit a truck head-on upon impact air bags did not deploy. driver sustained injuries. dealer did not determine why air bags did not deploy. crash
18 sunroof is leaking. no_crash
19 motor and the frame separated from vehicle. manufacturer will be notified. no_crash
20 rear front wheel bearing broke causing vehicle to pull to the left when slowing down. consumer had brake's replaced about four times and still dealer can't determine the problem. no_crash

The following table "nbayes_test" contains the test dataset.

doc_id text_data
1 ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE TO STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4 TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING THE PROBLEM.
2 ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.
3 WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT IS AROUND THE GAS PEDAL.
4 THERE IS A KNOCKING NOISE COMING FROM THE CATALYTIC CONVERTER, AND THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE STEERING.
5 CONSUMER WAS MAKING A TURN, DRIVING AT APPROX 5-10 MPH WHEN CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT DEPLOY. ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION, TO THE FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.
6 WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE VEHICLE- WHEELE COULD COME OFF.
7 DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE PROVIDE FURTHER INFORMATION AND VIN#.
8 THE AIR BAG WARNING LIGHT HAS COME ON, INDICATING AIRBAGS ARE INOPERATIVE. THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS REOCCURRED.
9 CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT OTHER VEHICLE AND STARTED TO SPIN AROUND, COULDN'T STOP, RESULTING IN A CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.
10 WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISSION MADE A STRANGE NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED THE VEHICLE.

This example shows the steps to build a Naïve Bayes Text Classifier model and then apply the model to the new log data.

  1. Import the required modules.
    from teradataml.analytics.mle.NaiveBayesTextClassifier import NaiveBayesTextClassifier
    from teradataml.analytics.sqle.NaiveBayesTextClassifierPredict import NaiveBayesTextClassifierPredict
    from teradataml.analytics.mle.TextTokenizer import TextTokenizer
    from teradataml.dataframe.dataframe import DataFrame
  2. Create a teradataml DataFrame from the training dataset.
    complaints = DataFrame.from_table("complaints")
    1. Create tokens from the training dataset.
      text_tokenizer_out = TextTokenizer(data=complaints,
                                         text_column='text_data',
                                         output_byword = True,
                                         accumulate=['doc_id','category'])
    2. Create a teradataml DataFrame "tddf_nbayes_tokens" consisting of the tokens from the training dataset, converted to lowercase with str.lower() so that token matching ignores case.
      tddf_nbayes_tokens = text_tokenizer_out.result.assign(drop_columns = True,
                                                            doc_id = text_tokenizer_out.result.doc_id,
                                                            token = text_tokenizer_out.result.token.str.lower(),
                                                            category = text_tokenizer_out.result.category)
  3. Train a new Naïve Bayes Text Classifier on the token DataFrame built from the training dataset, using the NaiveBayesTextClassifier function from the teradataml package.
    nb_textclassifier_model = NaiveBayesTextClassifier(data = tddf_nbayes_tokens,
                                                       data_partition_column = "category",
                                                       token_column = "token",
                                                       doc_category_column = "category")
    Next, apply the model to the test data.
  4. Create a teradataml DataFrame from the test dataset.
    nbayes_test = DataFrame.from_table("nbayes_test")
    1. Create tokens from the test dataset.
      text_tokenizer_out_2 = TextTokenizer(data=nbayes_test,
                                           text_column='text_data',
                                           output_byword = True,
                                           accumulate='doc_id')
    2. Create a teradataml DataFrame "tddf_nbayes_tokens_test" with the tokens from the test dataset, converted to lowercase with str.lower() so that they match the lowercase training tokens.
      tddf_nbayes_tokens_test = text_tokenizer_out_2.result.assign(drop_columns = True,
                                                                   doc_id = text_tokenizer_out_2.result.doc_id,
                                                                   token = text_tokenizer_out_2.result.token.str.lower())
      
  5. Predict the category ('crash' or 'no_crash') of each test document by applying the Naïve Bayes Text Classifier model to the token DataFrame built from the test dataset, using the NaiveBayesTextClassifierPredict function.
    nb_textclassifier_pred = NaiveBayesTextClassifierPredict(newdata = tddf_nbayes_tokens_test,
                                                             object = nb_textclassifier_model,
                                                             newdata_partition_column = "doc_id",
                                                             input_token_column = "token",
                                                             doc_id_columns = ["doc_id"])
  6. Inspect the results.
    nb_textclassifier_pred
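
To make the data flow in the steps above more concrete, the following plain-Python sketch mirrors the same stages (tokenize by word, lowercase, train, predict) on a few of the training rows shown earlier. It does not use teradataml. It is a generic multinomial Naïve Bayes with add-one smoothing, which is an assumption made for illustration: this section does not state which model type or smoothing the teradataml functions use, so scores and predictions may differ. The function names tokenize, train, and predict are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (stand-in for TextTokenizer plus str.lower())."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(docs):
    """Count documents per category and tokens per category from (text, category) pairs."""
    doc_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in docs:
        doc_counts[category] += 1
        for token in tokenize(text):
            token_counts[category][token] += 1
            vocabulary.add(token)
    return doc_counts, token_counts, vocabulary


def predict(model, text):
    """Return the category with the highest log-probability score for the text."""
    doc_counts, token_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    tokens = tokenize(text)
    scores = {}
    for category, n_docs in doc_counts.items():
        total_tokens = sum(token_counts[category].values())
        score = math.log(n_docs / total_docs)  # log prior of the category
        for token in tokens:
            # Add-one smoothing keeps unseen tokens from zeroing out a category.
            likelihood = (token_counts[category][token] + 1) / (total_tokens + len(vocabulary))
            score += math.log(likelihood)
        scores[category] = score
    return max(scores, key=scores.get)


# A few rows copied from the "complaints" table above (doc_id 11, 12, 8, 18, 19).
TRAINING_DOCS = [
    ("car engine raced while slowing to park. car lurched forward and crashed into a fence "
     "and a building. car had been in shop approximately one week prior to incident for "
     "high idle condition.", "crash"),
    ("rear ended another vehicle at 65 to 70mph and neither driver's side or passenger's "
     "side airbags deployed. dealer has vehicle.", "crash"),
    ("consumer was driving in a rain storm when the windshield wipers stopped this happened "
     "periodically.", "no_crash"),
    ("sunroof is leaking.", "no_crash"),
    ("motor and the frame separated from vehicle. manufacturer will be notified.", "no_crash"),
]

# Row 2 of the "nbayes_test" table above.
TEST_DOC = ("ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY DURING A CRASH "
            "AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.")

model = train(TRAINING_DOCS)
print(predict(model, TEST_DOC))
```

Lowercasing both the training and the test tokens matters here for the same reason as in the steps above: the test rows are uppercase, and without case folding none of their tokens would match the lowercase training tokens.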