Text Analysis with teradataml Package - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

This example investigates a log of vehicle complaints, each categorized as crash-related or not crash-related. Use this log to build a Naïve Bayes Text Classifier model, and then apply the model to new log data to predict whether each complaint is associated with a crash.

The following steps build the model and then apply it to the new log data.

  1. Import the required modules and load the example datasets.
    from teradataml import NaiveBayesTextClassifierTrainer
    from teradataml import NaiveBayesTextClassifierPredict
    from teradataml import TextParser
    from teradataml import load_example_data
    from teradataml.dataframe.dataframe import DataFrame
    # Load the data to run the example.
    load_example_data("TextTokenizer","complaints")
  2. Create a teradataml DataFrame from the training dataset.
    complaints = DataFrame.from_table("complaints")
  3. Create tokens from the training dataset.
    text_tokenizer_out = TextParser(data=complaints,
                                    text_column="text_data",
                                    remove_stopwords=True,
                                    accumulate=["doc_id", "category"])
  4. Create a teradataml DataFrame "tddf_nbayes_tokens" consisting of the tokens from the training dataset, converting each token to lowercase with lower() so that matching ignores case.
    tddf_nbayes_tokens = text_tokenizer_out.result.assign(drop_columns = True,
                                                          doc_id = text_tokenizer_out.result.doc_id,
                                                          token = text_tokenizer_out.result.token.str.lower(),
                                                          category = text_tokenizer_out.result.category)
  5. Train a new Naïve Bayes Text Classifier model on the token DataFrame built from the training dataset, using the NaiveBayesTextClassifierTrainer function from the teradataml package.
    nb_textclassifier_model = NaiveBayesTextClassifierTrainer(data = tddf_nbayes_tokens,
                                                             data_partition_column = "category",
                                                             token_column = "token",
                                                             doc_category_column = "category")
  6. Get the test data, or create sample test data and copy it to a table.
    import pandas as pd
    nbayes_test = {"doc_id": range(1, 11),
    "text_data":
    ["ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE TO STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4 TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING THE PROBLEM.",
    "ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.",
    "WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT IS AROUND THE GAS PEDAL.",
    "THERE IS A KNOCKING NOISE COMING FROM THE CATALYTIC CONVERTER, AND THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE STEERING.",
    "CONSUMER WAS MAKING A TURN, DRIVING AT APPROX 5-10 MPH WHEN CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT DEPLOY. ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION, TO THE FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.",
    "WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE VEHICLE- WHEELE COULD COME OFF.",
    "DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE PROVIDE FURTHER INFORMATION AND VIN#.",
    "THE AIR BAG WARNING LIGHT HAS COME ON, INDICATING AIRBAGS ARE INOPERATIVE. THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS REOCCURRED.",
    "CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT OTHER VEHICLE AND STARTED TO SPIN AROUND, COULDN'T STOP, RESULTING IN A CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.",
    "WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISSION MADE A STRANGE NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED THE VEHICLE."]}
    nbayes_test = pd.DataFrame(nbayes_test)
    from teradataml import copy_to_sql
    copy_to_sql(nbayes_test, table_name="nbayes_test")
  Next, apply the model to the test data.
  7. Create a teradataml DataFrame from the test dataset.
    nbayes_test = DataFrame.from_table("nbayes_test")
  8. Create tokens from the test dataset.
    text_tokenizer_out_2 = TextParser(data=nbayes_test,
                                      text_column="text_data",
                                      remove_stopwords=True,
                                      accumulate="doc_id")
  9. Create a teradataml DataFrame "tddf_nbayes_tokens_test" with the tokens from the test dataset, converting each token to lowercase with lower() so that matching ignores case.
    tddf_nbayes_tokens_test = text_tokenizer_out_2.result.assign(drop_columns = True,
                                                                 doc_id = text_tokenizer_out_2.result.doc_id,
                                                                 token = text_tokenizer_out_2.result.token.str.lower())
  10. Predict the categories ('crash' or 'no crash') by applying the Naïve Bayes Text Classifier model to the teradataml DataFrame from the test dataset, using the NaiveBayesTextClassifierPredict function.
    nb_textclassifier_pred = NaiveBayesTextClassifierPredict(newdata = tddf_nbayes_tokens_test,
                                                             object = nb_textclassifier_model.result,
                                                             newdata_partition_column = "doc_id",
                                                             input_token_column = "token",
                                                             doc_id_columns = "doc_id")
  11. Inspect the results.
    nb_textclassifier_pred
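
To build intuition for what the tokenize, train, and predict steps above compute, the following is a minimal local sketch of a multinomial Naïve Bayes text classifier in plain Python. It does not use teradataml and is not Teradata's implementation. The stop-word list, the four training sentences, the "no_crash" label spelling, and the helper names (tokenize, train, predict) are all hypothetical and chosen only for illustration. Add-one (Laplace) smoothing is an assumption made here, not something the steps above specify.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list for illustration; not the list TextParser uses.
STOPWORDS = {"the", "a", "to", "and", "at", "is", "was", "in", "of", "did", "not"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def train(rows):
    """Count tokens and documents per category from (category, text) pairs."""
    token_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in rows:
        doc_counts[category] += 1
        token_counts[category].update(tokenize(text))
    return token_counts, doc_counts


def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    token_counts, doc_counts = model
    vocab = {t for counts in token_counts.values() for t in counts}
    total_docs = sum(doc_counts.values())
    tokens = tokenize(text)
    scores = {}
    for category in doc_counts:
        total_tokens = sum(token_counts[category].values())
        # Start from the log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        for token in tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the score.
            count = token_counts[category][token] + 1
            score += math.log(count / (total_tokens + len(vocab)))
        scores[category] = score
    return max(scores, key=scores.get)


# Made-up training sentences, not taken from the complaints dataset.
train_rows = [
    ("crash", "Vehicle hit another car and the airbags did not deploy upon impact"),
    ("crash", "Crash at low speed, frontal impact, air bags failed to deploy"),
    ("no_crash", "Engine stalls and the battery was replaced by the dealer"),
    ("no_crash", "Knocking noise from the transmission, dealer repaired the vehicle"),
]

model = train(train_rows)
print(predict(model, "Upon impact the airbags did not deploy"))  # crash
print(predict(model, "The dealer replaced the battery"))         # no_crash
```

The sketch works on raw sentences for brevity, whereas the teradataml functions above operate on a token table with one row per document and token. The scoring idea is the same: sum the log prior and the per-token log likelihoods for each category, then pick the largest total.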