This section uses the datasets "Pima.tr" and "Pima.te" from the R package "MASS". They are predefined training and test subsets (200 and 332 rows, respectively) of a dataset containing seven biomarker measurements from 532 women of Pima Indian heritage. Each woman was also tested for diabetes, and an eighth column, "type", records whether diabetes was present ("Yes") or not ("No").
In this example, users will build a Naïve Bayes classifier based on the training dataset, apply it to classify the rows in the test dataset, and create a confusion matrix to evaluate the model's performance.
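Before loading the data into the Aster Database, it can help to confirm the shape of the two subsets locally. The following is a minimal sketch in plain R; it uses only the "MASS" package and no Aster functions, and the variable name n_total is chosen here for illustration.

```r
library(MASS)  # provides the Pima.tr and Pima.te data frames

dim(Pima.tr)    # training subset: rows x columns
dim(Pima.te)    # test subset: rows x columns
names(Pima.tr)  # seven measurements plus the "type" outcome column

# Together the two subsets cover all 532 women.
n_total <- nrow(Pima.tr) + nrow(Pima.te)
n_total
```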
-
Create tables in the Aster Database to hold the data.
ta.create(Pima.tr, table="Pima_train", schemaName="public", tableType="dimension", row.names=TRUE, colTypes=NULL )
ta.create(Pima.te, table="Pima_test", schemaName="public", tableType="dimension", row.names=TRUE, colTypes=NULL )
-
Create virtual data frames.
tadf_Pima.tr<-ta.data.frame('Pima_train')
tadf_Pima.te<-ta.data.frame('Pima_test')
-
Create the Naïve Bayes model using the training dataset.
nbmodel<-aa.naivebayes.train( formula = (type ~ npreg + glu + bp + skin + bmi + ped + age), data = tadf_Pima.tr )
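To illustrate what a Naïve Bayes model estimates, the following sketch fits a Gaussian Naïve Bayes classifier locally in base R. It is not the Aster implementation: the Gaussian assumption for each numeric predictor and the names class_stats, loglik, and local_pred are assumptions made for illustration, and the results are not expected to match aa.naivebayes.train() exactly.

```r
library(MASS)  # Pima.tr (training) and Pima.te (test)

features <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")
classes  <- levels(Pima.tr$type)  # "No", "Yes"

# Training: per-class prior plus mean and standard deviation of each predictor.
class_stats <- lapply(classes, function(k) {
  x <- Pima.tr[Pima.tr$type == k, features]
  list(prior = nrow(x) / nrow(Pima.tr),
       mean  = colMeans(x),
       sd    = apply(x, 2, sd))
})
names(class_stats) <- classes

# Scoring: log prior plus the sum of per-predictor Gaussian log densities
# (the "naive" independence assumption turns the joint density into a sum).
loglik <- sapply(classes, function(k) {
  s  <- class_stats[[k]]
  ll <- rep(log(s$prior), nrow(Pima.te))
  for (f in features) {
    ll <- ll + dnorm(Pima.te[[f]], mean = s$mean[[f]], sd = s$sd[[f]], log = TRUE)
  }
  ll
})

# Predict the class with the larger score for each test row.
local_pred <- classes[max.col(loglik, ties.method = "first")]
head(data.frame(prediction = local_pred, loglik))
```

The sketch runs entirely in local memory, whereas the Aster function trains inside the database on the virtual data frame.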
-
Run the model on the test dataset.
pred<-aa.naivebayes.predict( object = nbmodel, newdata = tadf_Pima.te, id.col = "row_names" )
[[1]]
  row_names prediction loglik_No loglik_Yes
1         1        Yes -23.24780  -20.95173
2         2         No -19.89531  -24.76983
3         3         No -20.42996  -25.61973
4         4         No -21.52287  -26.25105
5         5        Yes -28.65697  -24.39327
6         6        Yes -24.45018  -23.56059
7         7         No -24.60451  -25.18178
8         8         No -28.19846  -31.36756
… … … … …
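In the rows shown, the predicted class is the one with the larger log-likelihood. The following sketch checks that reading against the eight displayed rows in plain R. The rule is inferred from the displayed output rather than taken from the function's documentation, and the names shown and derived are illustrative.

```r
# The eight rows displayed in the prediction output, copied by hand.
shown <- data.frame(
  row_names  = 1:8,
  prediction = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "No"),
  loglik_No  = c(-23.24780, -19.89531, -20.42996, -21.52287,
                 -28.65697, -24.45018, -24.60451, -28.19846),
  loglik_Yes = c(-20.95173, -24.76983, -25.61973, -26.25105,
                 -24.39327, -23.56059, -25.18178, -31.36756),
  stringsAsFactors = FALSE
)

# Choose the class whose log-likelihood is larger (closer to zero).
derived <- ifelse(shown$loglik_Yes > shown$loglik_No, "Yes", "No")

# TRUE when the derived labels agree with the "prediction" column.
all(derived == shown$prediction)
```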
-
Create a data frame containing the "prediction" column from the output of Step 4 and the "type" column from the input "Pima_test" table.
predicted_values<-as.ta.data.frame(pred[[1]])
joined_table<-ta.join(predicted_values, tadf_Pima.te, type="inner", by="row_names")
ConfMatInput<-joined_table[,c("prediction","type")]
> ConfMatInput
  prediction type
1        Yes  Yes
2         No   No
3         No   No
4         No  Yes
5        Yes  Yes
6        Yes  Yes
7         No  Yes
8         No   No
9        Yes   No
… … …
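A confusion matrix is a cross-tabulation of these two columns. As a small local illustration, the sketch below tabulates only the nine rows displayed above using base R's table(); it does not call any Aster function, and the names shown and small_cm are illustrative.

```r
# The nine displayed rows of ConfMatInput, copied by hand.
shown <- data.frame(
  prediction = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes"),
  type       = c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No")
)

# Rows are the observed classes, columns are the predicted classes.
small_cm <- table(observation = shown$type, predict = shown$prediction)
small_cm
```

The diagonal cells count the rows where the prediction agrees with the observed class.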
-
Review the confusion matrix to analyze the model’s performance.
-
Use the function aa.confusion.matrix() to create tables in the Aster Database.
aa.confusion.matrix(ConfMatInput, reference = 'type', prediction = 'prediction', output.tablename.prefix = "NBexample" )
The function creates three tables in the Aster Database: "nbexample_1", "nbexample_2", and "nbexample_3".
-
Use the function ta.pull() to bring the tables into the R environment.
nb1<-ta.pull("nbexample_1")
nb2<-ta.pull("nbexample_2")
nb3<-ta.pull("nbexample_3")
-
Examine the results.
> nb1
  observation/predict  No Yes
1                  No 185  38
2                 Yes  43  66
> nb2
                   key            value
1             Accuracy            0.756
2               95% CI (0.7062, 0.8013)
3      Null Error Rate           0.3283
4  P-Value [Acc > NIR]           0.0005
5                Kappa           0.4403
6 Mcnemar Test P-Value           0.6567
> nb3
               measure     No    Yes
1          Sensitivity 0.8296 0.6055
2          Specificity 0.6055 0.8296
3       Pos Pred Value 0.8114 0.6346
4       Neg Pred Value 0.6346 0.8114
5           Prevalence 0.6717 0.3283
6       Detection Rate 0.5572 0.1988
7 Detection Prevalence 0.6867 0.3133
8    Balanced Accuracy 0.7176 0.7176
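Several of the reported statistics can be recomputed from the four counts in "nb1". The sketch below does so in base R as a cross-check. The variable names are illustrative, and two definitions are assumptions that happen to reproduce the reported values: the null error rate is taken as the share of the less common observed class, and the McNemar test uses R's default continuity correction.

```r
# Counts from nb1: rows are observed classes, columns are predicted classes.
cm <- matrix(c(185, 38,
                43, 66),
             nrow = 2, byrow = TRUE,
             dimnames = list(observation = c("No", "Yes"),
                             predict     = c("No", "Yes")))
n <- sum(cm)

accuracy        <- sum(diag(cm)) / n
null_error_rate <- min(rowSums(cm)) / n    # assumed definition, see above

# Per-class measures, treating each class in turn as the "positive" class.
sensitivity    <- diag(cm) / rowSums(cm)   # correct / observed in class
pos_pred_value <- diag(cm) / colSums(cm)   # correct / predicted as class
# With two classes, specificity for one class is sensitivity for the other.
specificity    <- setNames(rev(unname(sensitivity)), names(sensitivity))
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# McNemar test on the two off-diagonal cells (continuity-corrected by default).
mcnemar_p <- mcnemar.test(cm)$p.value

round(c(accuracy = accuracy, null_error_rate = null_error_rate,
        kappa = cohen_kappa, mcnemar_p = mcnemar_p), 4)
round(rbind(sensitivity, specificity, pos_pred_value, balanced_accuracy), 4)
```

The model identifies the "No" class more reliably (sensitivity 0.8296) than the "Yes" class (0.6055), which the overall accuracy of 0.756 alone does not show.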