NaiveBayesTextClassifierTrainer Example - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:id
B700-4003
Product Category
Teradata Vantage™

Input

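  • tokens: The token rows produced by applying the TextTokenizer function to the training table complaints, a log of vehicle complaints. The SQL call in this example generates these rows in a subquery rather than reading them from a stored table.

    In complaints, the category column labels each complaint as crash or no_crash, depending on whether the vehicle was in a crash.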

complaints
| doc_id | text_data | category |
|--------|-----------|----------|
| 1 | consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries. | crash |
| 2 | when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries. | crash |
| 3 | consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal. | no_crash |
| ... | ... | ... |

SQL Call

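This call creates the model table, complaints_tokens_model, by calling NaiveBayesTextClassifierTrainer. The innermost subquery builds the tokens input by applying TextTokenizer to the table complaints and lowercasing each token. NaiveBayesTextClassifierInternal then processes those rows, partitioned by category, and passes its result to NaiveBayesTextClassifierTrainer.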

CREATE MULTISET TABLE complaints_tokens_model AS (
  SELECT * FROM NaiveBayesTextClassifierTrainer (
    ON (
      SELECT * FROM NaiveBayesTextClassifierInternal (
        ON (
          SELECT doc_id, lower(token) AS token, category
          FROM TextTokenizer (
            ON complaints PARTITION BY ANY
            USING
            TextColumn ('text_data')
            OutputByWord ('true')
            Accumulate ('doc_id', 'category')
          ) AS dt1
        ) AS "input" PARTITION BY category
        USING
        TokenColumn ('token')
        ModelType ('Bernoulli')
        DocIDColumns ('doc_id')
        DocCategoryColumn ('category')
      ) AS dt2
    ) PARTITION BY 1
  ) AS dt3
) WITH DATA;
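
To make the shape of the tokens input concrete, the following Python sketch mimics what the innermost subquery of the call above produces: one (doc_id, token, category) row per word, with the token lowercased. It is only an illustration. The whitespace split is an assumption and does not reproduce TextTokenizer's actual tokenization rules, and the names tokenize and token_rows are hypothetical. The sample text is a short excerpt of doc_id 3 from complaints.

```python
# Short excerpt of doc_id 3 from the complaints table (not the full text).
complaints = [
    (3, "head gasket leaks", "no_crash"),
]

def tokenize(text):
    """Assumed tokenizer: lowercase, then split on whitespace.

    TextTokenizer applies its own rules; this is a simplification.
    """
    return text.lower().split()

def token_rows(rows):
    """Yield one (doc_id, token, category) tuple per token occurrence."""
    for doc_id, text, category in rows:
        for token in tokenize(text):
            yield doc_id, token, category

rows = list(token_rows(complaints))
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the subquery aliased as "input" in the SQL call, with doc_id and category carried along by the Accumulate argument.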

Output

This query returns the following table:

SELECT * FROM complaints_tokens_model;
complaints_tokens_model
| token | category | prob |
|-------|----------|------|
| ASTER_NAIVE_BAYES_TEXT_MODEL_TYPE | BERNOULLI | 1 |
| been | crash | 0.285714285714286 |
| been | no_crash | 0.235294117647059 |
| accurate | no_crash | 0.117647058823529 |
| joints | no_crash | 0.117647058823529 |
| shift | no_crash | 0.117647058823529 |
| about | crash | 0.285714285714286 |
| about | no_crash | 0.117647058823529 |
| bag | crash | 0.285714285714286 |
| ... | ... | ... |
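
As a rough aid to reading the prob column, the sketch below checks that the values shown are numerically consistent with an add-one (Laplace) smoothed Bernoulli estimate, (n + 1) / (N + 2), where n is the number of documents in a category that contain the token and N is the number of documents in that category. The source states neither the formula nor the document counts, so the formula, the category sizes (5 crash, 15 no_crash) and the per-token counts are all assumptions chosen because they reproduce the table. Other counts can give the same ratios.

```python
def bernoulli_prob(docs_with_token, docs_in_category):
    """Add-one smoothed estimate of P(token present | category)."""
    return (docs_with_token + 1) / (docs_in_category + 2)

# ASSUMED category sizes; the source does not state them.
ASSUMED_DOCS = {"crash": 5, "no_crash": 15}

# (token, category, ASSUMED docs containing the token, prob shown in the table)
checks = [
    ("been", "crash", 1, 0.285714285714286),
    ("been", "no_crash", 3, 0.235294117647059),
    ("joints", "no_crash", 1, 0.117647058823529),
]

for token, category, n, shown in checks:
    p = bernoulli_prob(n, ASSUMED_DOCS[category])
    print(f"{token:8s} {category:9s} {p:.15f} (table: {shown})")
```

Under these assumptions, 2/7, 4/17 and 2/17 match the three distinct prob values in the table to the precision shown. This is a consistency check on the numbers, not a statement of how the function computes them.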