This example uses the dataset "fgl" found in the R package "MASS". This dataset includes nine different measurements on 214 samples of different types of glass. A tenth column indicates the type of glass, classifying the samples into one of six types.
- Load the "MASS" package and perform preliminary tasks: add a "rowID" column to uniquely identify the data rows, and create the "fgl_with_rowids" dataset from the "rowID" column and the "fgl" dataset.
library(MASS)
fgl_with_rowids <- cbind(rownames(fgl), fgl)
newColNames <- c("rowID", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "type")
colnames(fgl_with_rowids) <- newColNames
The final statement names the columns of the resulting dataset.
- Split the data by glass type using the split() function, as the first step in dividing it into training and test datasets.
glass_types <- split(fgl_with_rowids, fgl_with_rowids$type)
Splitting the table by the "type" column makes sure that the training set has representatives of each of the six types of glass.
- Use "glass_types" to get the data for each individual type as a separate data frame.
WinF <- glass_types[[1]]
WinNF <- glass_types[[2]]
Veh <- glass_types[[3]]
Con <- glass_types[[4]]
Tabl <- glass_types[[5]]
Head <- glass_types[[6]]
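The positional indices above depend on the order of the factor levels of "type". As a quick check, here is a minimal sketch (assuming only base R and the MASS package; the variable names "group_sizes" and "WinF_by_name" are illustrative) that lists the group names and sizes and shows the equivalent lookup by name:

```r
library(MASS)

# Split by glass type; the list elements are named after the factor levels.
glass_types <- split(fgl, fgl$type)

# Order of the groups, which the positional indices [[1]] to [[6]] rely on.
print(names(glass_types))

# Number of observations in each group.
group_sizes <- sapply(glass_types, nrow)
print(group_sizes)

# Looking a group up by name avoids depending on the level order.
WinF_by_name <- glass_types[["WinF"]]
stopifnot(identical(WinF_by_name, glass_types[[1]]))
```

Indexing by name is slightly more verbose, but it keeps working if the level order ever changes.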
- Divide the observations for each type into training and test subsets. In this example, use 70% of the observations as training data and the remaining 30% as test data.
WinF_train_indices <- sample(1:nrow(WinF), 0.7*nrow(WinF))
WinF.test <- WinF[-WinF_train_indices,]
WinF.train <- WinF[WinF_train_indices,]

WinNF_train_indices <- sample(1:nrow(WinNF), 0.7*nrow(WinNF))
WinNF.test <- WinNF[-WinNF_train_indices,]
WinNF.train <- WinNF[WinNF_train_indices,]

Veh_train_indices <- sample(1:nrow(Veh), 0.7*nrow(Veh))
Veh.test <- Veh[-Veh_train_indices,]
Veh.train <- Veh[Veh_train_indices,]

Con_train_indices <- sample(1:nrow(Con), 0.7*nrow(Con))
Con.test <- Con[-Con_train_indices,]
Con.train <- Con[Con_train_indices,]

Tabl_train_indices <- sample(1:nrow(Tabl), 0.7*nrow(Tabl))
Tabl.test <- Tabl[-Tabl_train_indices,]
Tabl.train <- Tabl[Tabl_train_indices,]

Head_train_indices <- sample(1:nrow(Head), 0.7*nrow(Head))
Head.test <- Head[-Head_train_indices,]
Head.train <- Head[Head_train_indices,]
- Combine the training and test subsets for each type to create the training and test datasets "fgl.tr" and "fgl.te", respectively.
fgl.tr <- rbind(WinNF.train, Con.train, Tabl.train, Veh.train, WinF.train, Head.train)
fgl.te <- rbind(WinNF.test, Con.test, Tabl.test, Veh.test, WinF.test, Head.test)
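The per-type split and recombination above can also be written more compactly. The following is a sketch only, not the method used in this example: the helper "split_one", its "train_fraction" argument, and the fixed seed are assumptions added for illustration, and the seed is there only to make the random split reproducible.

```r
library(MASS)

set.seed(42)  # assumption: any fixed seed makes the random split reproducible

# Add the row names as an explicit ID column, as in the first step.
fgl_with_rowids <- cbind(rowID = rownames(fgl), fgl)

# One data frame per glass type.
by_type <- split(fgl_with_rowids, fgl_with_rowids$type)

# Hypothetical helper: split one data frame into training and test parts.
split_one <- function(df, train_fraction = 0.7) {
  n_train <- floor(train_fraction * nrow(df))
  train_idx <- sample(seq_len(nrow(df)), n_train)
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

# Apply the same split to every type, then stack the pieces.
parts <- lapply(by_type, split_one)
fgl.tr <- do.call(rbind, unname(lapply(parts, `[[`, "train")))
fgl.te <- do.call(rbind, unname(lapply(parts, `[[`, "test")))

# Every type should be represented in the training data.
print(table(fgl.tr$type))
```

Because sampling is done within each type, the training set keeps roughly the same class proportions as the full dataset.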
- Save the training and test datasets into the database using the copy_to() function. Here "con" is the existing database connection object.
copy_to(con, fgl.tr, name="fgl_train", overwrite=FALSE)
copy_to(con, fgl.te, name="fgl_test", overwrite=FALSE)
- Create tbl objects that reference the database tables using the tbl() function.
tddf_fgl.tr <- tbl(con, "fgl_train")
tddf_fgl.te <- tbl(con, "fgl_test")
- Create two different Decision Forest models from the training dataset using the td_decision_forest_mle tdplyr analytic function.
glass_rf_list_1 <- td_decision_forest_mle(
    formula = (type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe),
    tree.type = "classification",
    data = tddf_fgl.tr,
    ntree = 5)

glass_rf_list_2 <- td_decision_forest_mle(
    formula = (type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe),
    tree.type = "classification",
    data = tddf_fgl.tr,
    ntree = 6,
    mtry = 3)
Both models use the same formula, which expresses the glass type as a function of all nine independent variables.
The difference is that the second model requests one more decision tree (ntree = 6 instead of 5) and is required (mtry = 3) to randomly sample three variables from the input at each split.
- Predict on the test dataset for each model using the td_decision_forest_predict_sqle tdplyr analytic function.
td_decision_forest_predict_sqle(
    object = glass_rf_list_1,
    newdata = tddf_fgl.te,
    id.column = "rowID"
)

td_decision_forest_predict_sqle(
    object = glass_rf_list_2,
    newdata = tddf_fgl.te,
    id.column = "rowID"
)
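To compare the two models, the predictions can be joined back to the known glass types on the "rowID" column. The sketch below works on local data frames only and uses hand-written, made-up values. The column name "prediction" and the step that brings the predictions into a local data frame are assumptions, not confirmed tdplyr behavior, so check the actual output of the predict function before adapting it.

```r
# Illustrative, hand-written values only (not real model output).
actual <- data.frame(
  rowID = c("1", "2", "3", "4"),
  type  = c("WinF", "WinNF", "Head", "Con"),
  stringsAsFactors = FALSE
)
predicted <- data.frame(
  rowID      = c("1", "2", "3", "4"),
  prediction = c("WinF", "WinF", "Head", "Con"),  # hypothetical column name
  stringsAsFactors = FALSE
)

# Join actual and predicted labels on the ID column.
compared <- merge(actual, predicted, by = "rowID")

# Confusion matrix (actual vs. predicted) and overall accuracy.
confusion <- table(actual = compared$type, predicted = compared$prediction)
accuracy <- mean(compared$type == compared$prediction)

print(confusion)
print(accuracy)
```

Running the same comparison for each model's predictions gives a simple basis for choosing between them.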