NaiveBayesTextClassifierTrainer Example

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.00, 1.0
Published: May 2019
Language: English (United States)
Last Update: 2019-11-22

Input

tokens: Created by applying the TextTokenizer function to the training table complaints, a log of vehicle complaints.
In complaints, the category column indicates whether the vehicle was in a crash (crash) or not (no_crash).

complaints
doc_id	text_data	category
1	consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries.	crash
2	when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries.	crash
3	consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal.	no_crash
...	...	...
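
The shape of the tokens input (one row per word, carrying doc_id and category) can be illustrated outside the database. The following is a minimal Python sketch, not the TextTokenizer implementation: the tokenization rule (runs of letters, digits, or apostrophes) and the name tokenize_rows are assumptions made for illustration, and the two rows are short excerpts of the complaints rows shown above.

```python
import re

# Short excerpts of two complaints rows, used only for illustration.
complaints = [
    (1, "hit a deer with the front bumper", "crash"),
    (3, "head gasket leaks", "no_crash"),
]

def tokenize_rows(rows):
    """Yield one (doc_id, token, category) tuple per word, lowercased."""
    for doc_id, text, category in rows:
        # Assumption: a token is a run of letters, digits, or apostrophes.
        # The real TextTokenizer rules may differ.
        for token in re.findall(r"[A-Za-z0-9']+", text):
            yield doc_id, token.lower(), category

tokens = list(tokenize_rows(complaints))
for row in tokens:
    print(row)
```

Each output tuple corresponds to one row with the columns doc_id, token, and category, which is the layout the trainer reads through its TokenColumn, DocIDColumns, and DocCategoryColumn arguments.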

SQL Call

This call creates the model table, complaints_tokens_model, by calling NaiveBayesTextClassifierTrainer. The trainer's input, tokens, is produced inline by applying TextTokenizer to the table complaints and lowercasing each token.

CREATE MULTISET TABLE complaints_tokens_model AS (
  SELECT * FROM NaiveBayesTextClassifierTrainer (
    ON (
      SELECT * FROM NaiveBayesTextClassifierInternal (
        ON (
          SELECT doc_id, lower(token) AS token, category
          FROM TextTokenizer (
            ON complaints PARTITION BY ANY
            USING
            TextColumn ('text_data')
            OutputByWord ('true')
            Accumulate ('doc_id', 'category')
          ) AS dt1
        ) AS "input" PARTITION BY category
        USING
        TokenColumn ('token')
        ModelType ('Bernoulli')
        DocIDColumns ('doc_id')
        DocCategoryColumn ('category')
      ) AS dt2
    ) PARTITION BY 1
  ) AS dt3
) WITH DATA;

Output

This query returns the following table:

SELECT * FROM complaints_tokens_model;

complaints_tokens_model
token	category	prob
ASTER_NAIVE_BAYES_TEXT_MODEL_TYPE	BERNOULLI	1
been	crash	0.285714285714286
been	no_crash	0.235294117647059
accurate	no_crash	0.117647058823529
joints	no_crash	0.117647058823529
shift	no_crash	0.117647058823529
about	crash	0.285714285714286
about	no_crash	0.117647058823529
bag	crash	0.285714285714286
...	...	...
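
The prob values shown are simple fractions: 0.285714285714286 is 2/7, 0.235294117647059 is 4/17, and 0.117647058823529 is 2/17. The Python sketch below shows one estimator that yields fractions of this form. It assumes that a Bernoulli model stores, for each token and category, an add-one smoothed share of the category's documents that contain the token. This page does not confirm that the function uses this estimator. The function name bernoulli_prob and the document counts are hypothetical.

```python
from fractions import Fraction

def bernoulli_prob(docs_with_token, docs_in_category):
    """Add-one smoothed share of a category's documents containing a token."""
    # Assumption: one pseudo-count for "token present" and one for "absent".
    return Fraction(docs_with_token + 1, docs_in_category + 2)

# Hypothetical counts, chosen only to illustrate the arithmetic.
p_crash = bernoulli_prob(docs_with_token=1, docs_in_category=5)
p_no_crash = bernoulli_prob(docs_with_token=3, docs_in_category=15)

print(p_crash, float(p_crash))
print(p_no_crash, float(p_no_crash))
```

Under these hypothetical counts the results are 2/7 and 4/17, which match the form of the values listed for the token been.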