Using the Naïve Bayes Model with Aster R

Teradata Aster® R User Guide Update 3

Product: Aster R
Release Number: 7.00.02.01
Published: December 2017
Language: English (United States)
Last Update: 2018-04-13

This section uses the datasets "Pima.tr" and "Pima.te" from the R package "MASS". They are predefined training (200 rows) and test (332 rows) subsets of a dataset of seven biomarker measurements taken from 532 women of Pima Indian heritage, each of whom was also tested for diabetes. An eighth column, "type", records whether diabetes was present ("Yes") or not ("No").
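Before loading the data into the Aster Database, it can help to inspect both subsets locally. The following is a minimal sketch in plain R that assumes only that the "MASS" package is installed; it does not call any Aster R function, and the variable names `n_train` and `n_test` are introduced here for illustration.

```r
library(MASS)  # provides Pima.tr (training) and Pima.te (test)

# Row counts of each subset; together they cover all 532 women.
n_train <- nrow(Pima.tr)
n_test  <- nrow(Pima.te)
c(train = n_train, test = n_test, total = n_train + n_test)

# Seven predictor columns followed by the class label "type".
names(Pima.tr)

# Class balance of the label in each subset.
table(Pima.tr$type)
table(Pima.te$type)
```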

In this example, users will build a Naïve Bayes classifier based on the training dataset, apply it to classify the rows in the test dataset, and create a confusion matrix to evaluate the model's performance.

Create tables in the Aster Database to hold the data.

library(MASS)  # provides the Pima.tr and Pima.te data frames

ta.create(Pima.tr,
    table="Pima_train",
    schemaName="public",
    tableType="dimension",
    row.names=TRUE,
    colTypes=NULL
    )

ta.create(Pima.te,
    table="Pima_test", 
    schemaName="public",
    tableType="dimension", 
    row.names=TRUE,
    colTypes=NULL
    )

Create virtual data frames.

tadf_Pima.tr<-ta.data.frame('Pima_train')

tadf_Pima.te<-ta.data.frame('Pima_test')

Create the Naïve Bayes model using the training dataset.

nbmodel<-aa.naivebayes.train(
    formula = (type ~ npreg + glu + bp + skin + bmi + ped + age),
    data = tadf_Pima.tr
    )
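To make the idea behind the model concrete, here is a local sketch of a Naïve Bayes classifier in base R. It assumes Gaussian class-conditional densities for every predictor, which may not be how `aa.naivebayes.train` models the data, so its numbers need not match the Aster output. The helper names `train_gaussian_nb` and `score_gaussian_nb` are hypothetical and exist only for this illustration.

```r
library(MASS)  # provides Pima.tr and Pima.te

predictors <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")

# Hypothetical helper: per-class prior, mean and standard deviation of each predictor.
train_gaussian_nb <- function(data, target, predictors) {
  classes <- levels(data[[target]])
  model <- lapply(classes, function(cl) {
    rows <- data[data[[target]] == cl, predictors, drop = FALSE]
    list(prior = nrow(rows) / nrow(data),
         mean  = sapply(rows, mean),
         sd    = sapply(rows, sd))
  })
  setNames(model, classes)
}

# Hypothetical helper: log prior plus the sum of per-predictor Gaussian log densities.
score_gaussian_nb <- function(model, newdata, predictors) {
  scores <- matrix(NA_real_, nrow = nrow(newdata), ncol = length(model),
                   dimnames = list(NULL, names(model)))
  for (cl in names(model)) {
    p <- model[[cl]]
    loglik <- rep(log(p$prior), nrow(newdata))
    for (v in predictors) {
      loglik <- loglik +
        dnorm(newdata[[v]], mean = p$mean[[v]], sd = p$sd[[v]], log = TRUE)
    }
    scores[, cl] <- loglik
  }
  scores
}

local_model  <- train_gaussian_nb(Pima.tr, "type", predictors)
local_scores <- score_gaussian_nb(local_model, Pima.te, predictors)

# Predicted class = column with the highest score in each row.
local_pred <- colnames(local_scores)[max.col(local_scores, ties.method = "first")]

table(observed = Pima.te$type, predicted = local_pred)
```

The "naïve" part is the sum inside `score_gaussian_nb`: each predictor contributes its log density independently of the others, given the class.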

Run the model on the test dataset.

pred<-aa.naivebayes.predict(
      object = nbmodel, 
      newdata = tadf_Pima.te, 
      id.col = "row_names"
    )

> pred
[[1]]
  row_names prediction loglik_No loglik_Yes
1         1        Yes -23.24780  -20.95173
2         2         No -19.89531  -24.76983
3         3         No -20.42996  -25.61973
4         4         No -21.52287  -26.25105
5         5        Yes -28.65697  -24.39327
6         6        Yes -24.45018  -23.56059
7         7         No -24.60451  -25.18178
8         8         No -28.19846  -31.36756
…         …          …         …          …
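The rows shown above suggest that the "prediction" column is simply the class whose log-likelihood column is larger. The sketch below checks that reading against the eight displayed rows only. It is an assumption drawn from this output, not a statement about the internals of `aa.naivebayes.predict`, and `pred_head` is a hypothetical local copy of those rows.

```r
# First eight rows of the prediction output, copied from the listing above.
pred_head <- data.frame(
  row_names  = 1:8,
  prediction = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "No"),
  loglik_No  = c(-23.24780, -19.89531, -20.42996, -21.52287,
                 -28.65697, -24.45018, -24.60451, -28.19846),
  loglik_Yes = c(-20.95173, -24.76983, -25.61973, -26.25105,
                 -24.39327, -23.56059, -25.18178, -31.36756),
  stringsAsFactors = FALSE
)

# Assumption: the predicted class is the one with the larger log-likelihood.
derived <- ifelse(pred_head$loglik_Yes > pred_head$loglik_No, "Yes", "No")

# TRUE for the eight rows shown.
all(derived == pred_head$prediction)
```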

Create a virtual data frame containing the "prediction" column from the prediction output of the previous step and the "type" column from the input "Pima_test" table.

predicted_values<-as.ta.data.frame(pred[[1]])

joined_table<-ta.join(predicted_values, tadf_Pima.te, type="inner", by="row_names")
ConfMatInput<-joined_table[,c("prediction","type")]

> ConfMatInput
    prediction type
1          Yes  Yes
2           No   No
3           No   No
4           No  Yes
5          Yes  Yes
6          Yes  Yes
7           No  Yes
8           No   No
9          Yes   No
…           …     …
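The join-and-select pattern used here has a direct local counterpart in base R. The sketch below rebuilds the first four rows shown above as ordinary data frames and uses `merge()` in place of `ta.join()`. The names `predicted_local`, `test_local`, `joined_local`, and `conf_input_local` are hypothetical stand-ins for illustration.

```r
# Hypothetical local stand-ins built from the first four rows shown above.
predicted_local <- data.frame(
  row_names  = 1:4,
  prediction = c("Yes", "No", "No", "No"),
  stringsAsFactors = FALSE
)
test_local <- data.frame(
  row_names = 1:4,
  type      = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Inner join on the shared key (merge() keeps only matching keys by default),
# then keep the two columns needed for the confusion matrix.
joined_local     <- merge(predicted_local, test_local, by = "row_names")
conf_input_local <- joined_local[, c("prediction", "type")]

# Cross-tabulate observed against predicted classes.
table(observed = conf_input_local$type, predicted = conf_input_local$prediction)
```

Joining on "row_names" matters because it guarantees that each prediction is compared with the label of the same test row, regardless of the order in which rows are returned.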

Review the confusion matrix to analyze the model’s performance.

Use the function aa.confusion.matrix() to create tables in the Aster Database.
```
aa.confusion.matrix(ConfMatInput,
                   reference = 'type',
                   prediction = 'prediction',
                   output.tablename.prefix = "NBexample"
 )
```
The function creates three tables in the Aster Database, named from the prefix in lowercase: "nbexample_1", "nbexample_2", and "nbexample_3".

Use the function ta.pull() to bring the tables into the R environment.

nb1<-ta.pull("nbexample_1")
nb2<-ta.pull("nbexample_2")
nb3<-ta.pull("nbexample_3")

Examine the results.

> nb1
  observation/predict    No Yes
1                  No   185  38
2                 Yes    43  66

> nb2
                   key            value
1             Accuracy            0.756
2               95% CI (0.7062, 0.8013)
3      Null Error Rate           0.3283
4  P-Value [Acc > NIR]           0.0005
5                Kappa           0.4403
6 Mcnemar Test P-Value           0.6567

> nb3
               measure     No    Yes
1          Sensitivity 0.8296 0.6055
2          Specificity 0.6055 0.8296
3       Pos Pred Value 0.8114 0.6346
4       Neg Pred Value 0.6346 0.8114
5           Prevalence 0.6717 0.3283
6       Detection Rate 0.5572 0.1988
7 Detection Prevalence 0.6867 0.3133
8    Balanced Accuracy 0.7176 0.7176
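Most of the figures in "nb2" and "nb3" can be recomputed by hand from the four counts in "nb1", which is a useful sanity check on how to read the tables. The sketch below uses base R only and standard textbook definitions. Whether `aa.confusion.matrix()` uses exactly these formulas is an assumption, although the rounded results should land close to the reported values. All variable names are introduced here for illustration.

```r
# Confusion matrix from nb1: rows are observed classes, columns are predictions.
cm <- matrix(c(185, 43, 38, 66), nrow = 2,
             dimnames = list(observed  = c("No", "Yes"),
                             predicted = c("No", "Yes")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Per-class measures, treating each class in turn as the "positive" class.
sensitivity    <- diag(cm) / rowSums(cm)  # correct / observed in the class
pos_pred_value <- diag(cm) / colSums(cm)  # correct / predicted as the class
prevalence     <- rowSums(cm) / n
detection_rate <- diag(cm) / n
detection_prev <- colSums(cm) / n

# With two classes, the specificity of one class is the sensitivity of the other.
specificity  <- setNames(rev(unname(sensitivity)), names(sensitivity))
balanced_acc <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals imply by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# McNemar's test (continuity-corrected) on the off-diagonal counts.
mcnemar_p <- mcnemar.test(cm)$p.value

round(c(accuracy = accuracy, kappa = cohen_kappa, mcnemar_p = mcnemar_p), 4)
round(rbind(sensitivity, specificity, pos_pred_value,
            prevalence, detection_rate, detection_prev, balanced_acc), 4)
```

Read this way, the model identifies about 83% of the non-diabetic women correctly but only about 61% of the diabetic ones, which the single accuracy figure of 0.756 does not show.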