Using the Naïve Bayes Model with Aster R

Teradata Aster® R User Guide Update 3

prodname: Aster R
vrm_release: 7.00.02.01
created_date: December 2017
category: Programming Reference, User Guide
featnum: B700-1033-700K

This section uses the datasets "Pima.tr" and "Pima.te" from the R package "MASS". They are predefined training (200 rows) and test (332 rows) subsets of a dataset of seven biomarker measurements from 532 women of Pima Indian heritage, each of whom was also tested for diabetes. An eighth column, "type", records whether diabetes was present.
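
Before loading the data into the Aster Database, it can help to inspect the two data frames locally. The following is a minimal sketch that assumes the MASS package is installed in the local R library. It uses only base R functions and does not touch the Aster Database.

```r
# Local inspection only; assumes the MASS package is installed.
data(Pima.tr, package = "MASS")   # training subset
data(Pima.te, package = "MASS")   # test subset

# Row and column counts of each subset
dim(Pima.tr)
dim(Pima.te)

# Column names: seven measurements plus the outcome column "type"
names(Pima.tr)

# Class balance of the outcome in each subset
table(Pima.tr$type)
table(Pima.te$type)
```

The class balance of the test subset is worth noting, because it is the baseline against which the accuracy of any classifier on these data is judged.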

In this example, you build a Naïve Bayes classifier from the training dataset, use it to classify the rows of the test dataset, and create a confusion matrix to evaluate the model's performance.

  1. Create tables in the Aster Database to hold the data.
    # row.names=TRUE keeps the R row names; Steps 4 and 5 use the
    # "row_names" column as the row identifier.
    ta.create(Pima.tr,
        table="Pima_train",
        schemaName="public",
        tableType="dimension",
        row.names=TRUE,
        colTypes=NULL
        )
    ta.create(Pima.te,
        table="Pima_test",
        schemaName="public",
        tableType="dimension",
        row.names=TRUE,
        colTypes=NULL
        )
  2. Create virtual data frames.
    tadf_Pima.tr<-ta.data.frame('Pima_train')
    tadf_Pima.te<-ta.data.frame('Pima_test')
  3. Create the Naïve Bayes model using the training dataset.
    nbmodel<-aa.naivebayes.train(
        formula = (type ~ npreg + glu + bp + skin + bmi + ped + age),
        data = tadf_Pima.tr
        )
  4. Run the model on the test dataset.
    pred<-aa.naivebayes.predict(
          object = nbmodel,
          newdata = tadf_Pima.te,
          id.col = "row_names"
        )
    > pred
    [[1]]
      row_names prediction loglik_No loglik_Yes
    1         1        Yes -23.24780  -20.95173
    2         2         No -19.89531  -24.76983
    3         3         No -20.42996  -25.61973
    4         4         No -21.52287  -26.25105
    5         5        Yes -28.65697  -24.39327
    6         6        Yes -24.45018  -23.56059
    7         7         No -24.60451  -25.18178
    8         8         No -28.19846  -31.36756
    …         …          …         …          …
  5. Create a data frame containing the "prediction" column from the output of Step 4 and the "type" column from the input "Pima_test" table.
    predicted_values<-as.ta.data.frame(pred[[1]])
    joined_table<-ta.join(predicted_values, tadf_Pima.te, type="inner", by="row_names")
    ConfMatInput<-joined_table[,c("prediction","type")]
    > ConfMatInput
        prediction type
    1          Yes  Yes
    2           No   No
    3           No   No
    4           No  Yes
    5          Yes  Yes
    6          Yes  Yes
    7           No  Yes
    8           No   No
    9          Yes   No
    …           …     …
  6. Review the confusion matrix to analyze the model’s performance.
    1. Use the function aa.confusion.matrix() to create tables in the Aster Database.
      aa.confusion.matrix(ConfMatInput,
                         reference = 'type',
                         prediction = 'prediction',
                         output.tablename.prefix = "NBexample"
       )

      The function creates three tables in the Aster Database, named with the prefix in lowercase: "nbexample_1", "nbexample_2", and "nbexample_3".

    2. Use the function ta.pull() to bring the tables into the R environment.
      nb1<-ta.pull("nbexample_1")
      nb2<-ta.pull("nbexample_2")
      nb3<-ta.pull("nbexample_3")
    3. Examine the results.
      > nb1
        observation/predict  No Yes
      1                  No 185  38
      2                 Yes  43  66

      > nb2
                         key            value
      1             Accuracy            0.756
      2               95% CI (0.7062, 0.8013)
      3      Null Error Rate           0.3283
      4  P-Value [Acc > NIR]           0.0005
      5                Kappa           0.4403
      6 Mcnemar Test P-Value           0.6567

      > nb3
                     measure     No    Yes
      1          Sensitivity 0.8296 0.6055
      2          Specificity 0.6055 0.8296
      3       Pos Pred Value 0.8114 0.6346
      4       Neg Pred Value 0.6346 0.8114
      5           Prevalence 0.6717 0.3283
      6       Detection Rate 0.5572 0.1988
      7 Detection Prevalence 0.6867 0.3133
      8    Balanced Accuracy 0.7176 0.7176
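
To see how the summary figures in "nb2" and "nb3" follow from the counts in "nb1", the measures can be recomputed locally. The following is a minimal base R sketch, not part of the Aster R API. The variable names are illustrative, and the counts are copied by hand from the "nb1" output above, with rows as observed classes and columns as predicted classes.

```r
# Counts copied from the nb1 output (rows = observed, columns = predicted).
cm <- matrix(c(185, 38,
                43, 66),
             nrow = 2, byrow = TRUE,
             dimnames = list(observed  = c("No", "Yes"),
                             predicted = c("No", "Yes")))
classes <- rownames(cm)
n <- sum(cm)

# Accuracy: share of rows on the diagonal
accuracy <- sum(diag(cm)) / n

# Per-class measures, treating each class in turn as the "positive" class
correct        <- setNames(unname(diag(cm)), classes)
sensitivity    <- correct / rowSums(cm)    # correct / observed in class
pos_pred_value <- correct / colSums(cm)    # correct / predicted as class
prevalence     <- rowSums(cm) / n

# With two classes, the specificity of one class is the sensitivity of the other
specificity  <- setNames(rev(unname(sensitivity)), classes)
balanced_acc <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals alone imply
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)
round(rbind(sensitivity, specificity, pos_pred_value,
            prevalence, balanced_acc), 4)
round(cohen_kappa, 4)
```

The rounded values agree with the Accuracy and Kappa rows of "nb2" and with the corresponding rows of "nb3". The gap between the two sensitivities shows that the model identifies the "No" class more reliably than the "Yes" class, which the overall accuracy alone does not reveal.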