Using the Naïve Bayes Model with Teradata R Package

Teradata® R Package User Guide

Product: Teradata R Package
Release: 17.00
Created: November 2020
Category: User Guide
Feature number: B700-4005-090K

This example uses the datasets "Pima.tr" and "Pima.te" found in the R package "MASS". These datasets are predefined training and test subsets from a dataset consisting of seven biomarker measurements from 532 women of Pima Indian heritage. These women were also tested for the presence of diabetes. An eighth column indicates whether diabetes was present or not.
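
As a quick local check before anything is copied to the database, the two subsets can be inspected in plain R. This is a minimal sketch that uses only the "MASS" package; the variable names n_train, n_test, and col_names are illustrative and do not appear elsewhere in this guide.

```r
library(MASS)  # provides the Pima.tr and Pima.te data frames

# Row counts of the predefined training and test subsets.
n_train <- nrow(Pima.tr)
n_test  <- nrow(Pima.te)

# Seven biomarker columns plus the "type" column (diabetes yes/no).
col_names <- names(Pima.tr)

print(c(train = n_train, test = n_test, total = n_train + n_test))
print(col_names)

# Class balance of the response in the training subset.
print(table(Pima.tr$type))
```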

In this example, you will build a Naïve Bayes classifier based on the training dataset, apply it to classify the rows in the test dataset, and create a confusion matrix to evaluate the model's performance.
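
To make the workflow concrete, the following sketch runs the same three stages (fit, classify, tabulate) locally in base R with a Gaussian Naïve Bayes classifier. It is an illustration only: the helper functions fit_gnb() and predict_gnb() are hypothetical names written for this sketch, and the Gaussian likelihood is an assumption that is not necessarily what the Teradata analytic functions compute, so the numbers need not match the database results.

```r
library(MASS)  # provides Pima.tr and Pima.te

predictors <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")

# Fit (hypothetical helper): class priors plus the per-class mean and
# standard deviation of every predictor.
fit_gnb <- function(x, y) {
  y <- factor(y)
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),
    mean   = sapply(x, function(col) tapply(col, y, mean)),
    sd     = sapply(x, function(col) tapply(col, y, sd))
  )
}

# Predict (hypothetical helper): sum the log prior and the Gaussian
# log-likelihoods per class, then pick the class with the highest score.
predict_gnb <- function(model, x) {
  scores <- sapply(seq_along(model$levels), function(k) {
    loglik <- rep(log(model$prior[k]), nrow(x))
    for (p in colnames(model$mean)) {
      loglik <- loglik +
        dnorm(x[[p]], model$mean[k, p], model$sd[k, p], log = TRUE)
    }
    loglik
  })
  best <- max.col(scores, ties.method = "first")
  factor(model$levels[best], levels = model$levels)
}

local_model <- fit_gnb(Pima.tr[predictors], Pima.tr$type)
local_pred  <- predict_gnb(local_model, Pima.te[predictors])

# Confusion matrix and overall accuracy on the test subset.
local_cm <- table(prediction = local_pred, response = Pima.te$type)
print(local_cm)
print(mean(local_pred == Pima.te$type))
```

The steps below perform the equivalent work in the database, which is the approach this guide documents.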

  1. Load the "MASS" package and perform preliminary tasks: add a rowID column, which the Naïve Bayes predict function uses to identify rows, and convert the strings in the type column to lowercase.
    library(MASS)
    PimaTr <- Pima.tr
    PimaTr$rowID <- seq.int(nrow(Pima.tr))
    PimaTr$type <- tolower(PimaTr$type)
    
    PimaTe <- Pima.te
    PimaTe$rowID <- seq.int(nrow(Pima.te))
    PimaTe$type <- tolower(PimaTe$type)
  2. Create tables in the database to hold the data.
    copy_to(con, PimaTr, name="Pima_train", overwrite=FALSE)
    
    copy_to(con, PimaTe, name="Pima_test", overwrite=FALSE)
  3. Create R table objects (tibbles) that reference the database tables created in Step 2.
    tddf_Pima.tr <- tbl(con, "Pima_train")
    
    tddf_Pima.te <- tbl(con, "Pima_test")
  4. Create the Naïve Bayes model from the training dataset using the td_naivebayes_mle() tdplyr analytic function.
    nbmodel <- td_naivebayes_mle(
      formula = (type ~ npreg + glu + bp + skin + bmi + ped + age),
      data = tddf_Pima.tr
    )
  5. Run the model on the test dataset using the td_naivebayes_predict_sqle() tdplyr analytic function.
    pred <- td_naivebayes_predict_sqle(
      formula = (type ~ npreg + glu + bp + skin + bmi + ped + age),
      modeldata = nbmodel,
      newdata = tddf_Pima.te,
      id.col = "rowID",
      responses = c("yes", "no")
    )
    To assess the predictions, build a confusion matrix as shown in the following steps.
  6. Store the observed response and the predicted values in a data frame df.
    1. Join pred$result and tddf_Pima.te on the "rowID" column to bring the predicted values and the observed response together.
    2. Create the df data frame by selecting only these columns from the joined data.
    df <- inner_join(pred$result, tddf_Pima.te, by="rowID") %>% dplyr::select(prediction, response = type)

    Because of the order in which the R packages are loaded, the select() function of the "MASS" package masks the dplyr select() function used with tdplyr. An unqualified call to select() therefore resolves to the "MASS" version.

    To call a masked function, qualify the call with the package name, followed by "::", before the function name.

    In this step, dplyr::select() explicitly invokes the original dplyr function.

  7. Create the "confusionMatrix_tbl" table in the database to hold the confusion matrix.
    copy_to(con, df, name="confusionMatrix_tbl")
  8. Create an R table from the existing database table with the tbl() function.
    tddf_confusionMatrix_tbl <- tbl(con, "confusionMatrix_tbl")
  9. Invoke the td_confusion_matrix_mle() tdplyr analytic function to analyze the performance of the model.
    In the following command, use the tibble "tddf_confusionMatrix_tbl" returned from the tbl() function in Step 8 as the input table, and use the columns "response" and "prediction" as columns with reference and prediction values.
    cmResult <- td_confusion_matrix_mle(
      data = tddf_confusionMatrix_tbl,
      reference = "response",
      prediction = "prediction"
    )
    
    The confusion matrix analysis creates three output tables in the database, plus a fourth table that reports whether the analysis ran successfully.

    The td_confusion_matrix_mle() function returns these tables as tibble objects in a named list:

    • "counttable" contains the confusion matrix counts;
    • "stattable" contains analysis statistics;
    • "accuracytable" contains accuracy measures for each response;
    • "output" reports the status of the analysis.
  10. Examine the results by printing the output tibbles.
    print(cmResult$counttable)
    
    print(cmResult$stattable)
    
    print(cmResult$accuracytable)
    
    print(cmResult$output)
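
As a cross-check on the database output, the same kind of counts can be reproduced locally with base R's table() function. The sketch below uses a small made-up data frame, local_df, as a stand-in for a locally retrieved copy of df; the six rows are hypothetical and are not results from this example.

```r
# Hypothetical stand-in for a local copy of df (prediction and response columns).
local_df <- data.frame(
  prediction = c("yes", "no", "no", "yes", "no", "yes"),
  response   = c("yes", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Fix the level order so both dimensions of the table line up.
lv <- c("no", "yes")
cm <- table(
  prediction = factor(local_df$prediction, levels = lv),
  response   = factor(local_df$response,   levels = lv)
)
print(cm)

# Overall accuracy: correctly classified rows over all rows.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity and specificity, treating "yes" as the positive class.
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])
specificity <- cm["no", "no"]   / sum(cm[, "no"])

print(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity))
```

With real data, replace local_df with the rows retrieved from the "confusionMatrix_tbl" table; how you retrieve them (for example with as.data.frame()) is an assumption that depends on your setup.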