In this example, you will build a Naïve Bayes classifier based on the training dataset, apply it to classify the rows in the test dataset, and create a confusion matrix to evaluate the model's performance.
Load the "MASS" package and perform preliminary tasks: add a rowID column, which the Naïve Bayes predict function uses as its ID column, and convert the strings in the type column to lowercase.
library(MASS)  # provides the Pima.tr and Pima.te datasets
PimaTr <- Pima.tr
PimaTr$rowID <- seq.int(nrow(Pima.tr))
PimaTr$type <- tolower(PimaTr$type)
PimaTe <- Pima.te
PimaTe$rowID <- seq.int(nrow(Pima.te))
PimaTe$type <- tolower(PimaTe$type)
Create tables in the database to hold the data.
copy_to(con, PimaTr, name = "Pima_train", overwrite = FALSE)
copy_to(con, PimaTe, name = "Pima_test", overwrite = FALSE)
Create R tables from the database tables created in Step 2.
tddf_Pima.tr <- tbl(con, "Pima_train")
tddf_Pima.te <- tbl(con, "Pima_test")
Create the Naïve Bayes model from the training dataset using the td_naivebayes_mle() tdplyr analytic function.
nbmodel <- td_naivebayes_mle(
  formula = (type ~ npreg + glu + bp + skin + bmi + ped + age),
  data = tddf_Pima.tr
)
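To make the idea behind this step concrete, the following is a minimal local sketch of a Gaussian Naïve Bayes classifier in base R, fitted on Pima.tr and applied to Pima.te without a database connection. It is an illustration only: the helper names fit_gaussian_nb() and predict_gaussian_nb() are hypothetical, and the Gaussian likelihood is an assumption, not a description of how td_naivebayes_mle() is implemented.

```r
library(MASS)  # ships with R; provides Pima.tr and Pima.te

predictors <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")

# Fit: class priors plus per-class mean and standard deviation of each predictor.
# (Hypothetical helper, not part of tdplyr.)
fit_gaussian_nb <- function(data, response, predictors) {
  classes <- levels(factor(data[[response]]))
  lapply(setNames(classes, classes), function(cl) {
    rows <- data[data[[response]] == cl, predictors, drop = FALSE]
    list(prior = nrow(rows) / nrow(data),
         mean  = sapply(rows, mean),
         sd    = sapply(rows, sd))
  })
}

# Predict: choose the class with the highest log prior plus summed
# per-predictor Gaussian log-likelihoods. (Hypothetical helper.)
predict_gaussian_nb <- function(model, newdata, predictors) {
  x <- as.matrix(newdata[, predictors, drop = FALSE])
  scores <- sapply(model, function(m) {
    loglik <- sapply(seq_along(predictors), function(j) {
      dnorm(x[, j], mean = m$mean[j], sd = m$sd[j], log = TRUE)
    })
    log(m$prior) + rowSums(loglik)
  })
  colnames(scores)[max.col(scores, ties.method = "first")]
}

local_model <- fit_gaussian_nb(Pima.tr, "type", predictors)
local_pred  <- predict_gaussian_nb(local_model, Pima.te, predictors)
head(local_pred)
```

The sketch works on the original "Yes"/"No" factor levels of Pima.tr, so its labels differ in case from the lowercase values stored in the database tables.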
Run the model on the test dataset using the td_naivebayes_predict_sqle() tdplyr analytic function.
pred <- td_naivebayes_predict_sqle(
  formula = (type ~ npreg + glu + bp + skin + bmi + ped + age),
  modeldata = nbmodel,
  newdata = tddf_Pima.te,
  id.col = "rowID",
  responses = c("yes", "no")
)
To assess the model's predictions, build a confusion matrix as shown in the following steps.
Store the observed response and the predicted values in a data frame df.
- Join the prediction result ("pred$result") and the "tddf_Pima.te" dataset on the "rowID" column to bring the response and predicted value columns together.
- Create the df data frame by selecting only these columns from the joined data.
df <- inner_join(pred$result, tddf_Pima.te, by = "rowID") %>%
  dplyr::select(prediction, response = type)
Because of the order in which the R packages are loaded, the select() function of the "MASS" package masks the select() function that tdplyr uses from dplyr, and tdplyr cannot re-import the original dplyr function.
To call a masked function, qualify the function name with its package name.
In this step, dplyr::select() explicitly invokes the original dplyr function.
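The masking behavior can be checked locally. The sketch below assumes that dplyr and MASS are both installed and that MASS is attached after dplyr; it uses the local Pima.tr data frame, not the database tables.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(MASS)  # attached last, so its select() masks dplyr's select()
})

# Which package does an unqualified select() resolve to?
environmentName(environment(select))

# The fully qualified call reaches dplyr regardless of load order.
picked <- dplyr::select(Pima.tr, glu, response = type)
names(picked)
```

If the first call reports "MASS", any unqualified select() in a pipeline runs the MASS function, which is why the join step uses dplyr::select().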
Create the "confusionMatrix_tbl" table in the database to hold the confusion matrix.
copy_to(con, df, name="confusionMatrix_tbl")
Create an R table from the existing database table with the tbl() function.
tddf_confusionMatrix_tbl <- tbl(con, "confusionMatrix_tbl")
Invoke the td_confusion_matrix_mle() tdplyr analytic function to analyze the performance of the model.
In the following command, use the tibble "tddf_confusionMatrix_tbl" returned from the tbl() function in Step 8 as the input table, and use the columns "response" and "prediction" as columns with reference and prediction values.
cmResult <- td_confusion_matrix_mle(
  data = tddf_confusionMatrix_tbl,
  reference = "response",
  prediction = "prediction"
)
The confusion matrix analysis creates three output tables in the database and a fourth table that declares whether the analysis has run successfully.
These tables are stored by the td_confusion_matrix_mle() function in a named list as tibble objects:
- "counttable" has the confusion matrix counts;
- "stattable" contains analysis statistics;
- "accuracytable" hosts parameters related to the model accuracy for each response;
- "output" reports the status of the analysis.
Examine the results by invoking the output tibbles.
print(cmResult$counttable)
print(cmResult$stattable)
print(cmResult$accuracytable)
print(cmResult$output)
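For readers who want to see what the count table and accuracy figures represent, here is a small base R sketch of a confusion matrix. The two label vectors are toy values invented for illustration, not output from the Pima model, and the calculation is a generic one that is not claimed to match every statistic td_confusion_matrix_mle() reports.

```r
# Toy labels invented for illustration; not results from the Pima model.
response   <- c("yes", "no", "no",  "yes", "no", "yes", "no", "no")
prediction <- c("yes", "no", "yes", "yes", "no", "no",  "no", "no")

# Cross-tabulate predicted against observed labels (rows = prediction).
lv <- c("no", "yes")
counts <- table(prediction = factor(prediction, levels = lv),
                response   = factor(response,   levels = lv))
print(counts)

# Summary measures derived from the counts.
accuracy    <- sum(diag(counts)) / sum(counts)                  # share of correct predictions
sensitivity <- counts["yes", "yes"] / sum(counts[, "yes"])      # true-positive rate
specificity <- counts["no", "no"]  / sum(counts[, "no"])        # true-negative rate
```

With these toy vectors, 6 of the 8 predictions are correct, so accuracy is 0.75.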