Using the Decision Forest Model with Teradata R Package

Teradata® R Package User Guide

prodname: Teradata R Package
vrm_release: 17.00
created_date: November 2020
category: User Guide
featnum: B700-4005-090K

This example uses the dataset "fgl" found in the R package "MASS". This dataset includes nine different measurements on 214 samples of different types of glass. A tenth column indicates the type of glass, classifying the samples into one of six types.
  1. Load the "MASS" package and perform preliminary tasks: add a "rowID" column that uniquely identifies each data row, creating the "fgl_with_rowids" dataset from the row names and the "fgl" dataset.
    library(MASS)
    fgl_with_rowids <- cbind(rownames(fgl), fgl)
    newColNames <- c("rowID", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "type")
    
    colnames(fgl_with_rowids) <- newColNames
    
    The colnames() assignment names the columns of the resulting dataset, including the new "rowID" column.
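
Because "rowID" later serves as the id.column when predicting, it can be worth confirming that it really identifies each row uniquely. The following is a minimal sketch, assuming the MASS package is installed; the checks and the name has_unique_ids are illustrative additions, not part of the original procedure.

```r
library(MASS)

# Rebuild the dataset as in step 1.
fgl_with_rowids <- cbind(rownames(fgl), fgl)
colnames(fgl_with_rowids) <- c("rowID", "RI", "Na", "Mg", "Al", "Si",
                               "K", "Ca", "Ba", "Fe", "type")

# anyDuplicated() returns 0 when every value is distinct.
has_unique_ids <- anyDuplicated(fgl_with_rowids$rowID) == 0

dim(fgl_with_rowids)   # rows and columns of the new dataset
has_unique_ids
```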
  2. Split the data by glass type using the split() function.
    Splitting the table by the "type" column makes it possible to ensure that the training set has representatives of each of the six types of glass.
    glass_types <- split(fgl_with_rowids, fgl_with_rowids$type)
  3. Use "glass_types" to extract the data for each individual type into separate data frames.
    WinF <- glass_types[[1]]
    WinNF <- glass_types[[2]]
    Veh <- glass_types[[3]]
    Con <- glass_types[[4]]
    Tabl <- glass_types[[5]]
    Head <- glass_types[[6]]
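
The positional indexes above rely on the order of the factor levels in "type". As a hedged alternative, the list returned by split() can be indexed by name instead. This sketch assumes the MASS package is installed and, for brevity, splits "fgl" directly rather than "fgl_with_rowids"; type_counts is an illustrative name.

```r
library(MASS)

# split() names each list element after a factor level of "type".
glass_types <- split(fgl, fgl$type)
names(glass_types)

# Selecting by name does not depend on the order of the factor levels.
WinF <- glass_types[["WinF"]]
Head <- glass_types[["Head"]]

# Number of observations available for each glass type.
type_counts <- sapply(glass_types, nrow)
type_counts
```

Checking the per-type counts at this point shows how few rows the rarest types contribute, which is why the split in the next step is done type by type.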
    
  4. Divide the observations for each type into training and test subsets.
    In this example, use 70% of the observations as training data and the remaining 30% as test data.
    WinF_train_indices <- sample(1:nrow(WinF), 0.7*nrow(WinF))
    WinF.test <- WinF[-WinF_train_indices,]
    WinF.train <- WinF[WinF_train_indices,]
    
    WinNF_train_indices <- sample(1:nrow(WinNF), 0.7*nrow(WinNF))
    WinNF.test <- WinNF[-WinNF_train_indices,]
    WinNF.train <- WinNF[WinNF_train_indices,]
    
    Veh_train_indices <- sample(1:nrow(Veh), 0.7*nrow(Veh))
    Veh.test <- Veh[-Veh_train_indices,]
    Veh.train <- Veh[Veh_train_indices,]
    
    Con_train_indices <- sample(1:nrow(Con), 0.7*nrow(Con))
    Con.test <- Con[-Con_train_indices,]
    Con.train <- Con[Con_train_indices,]
    
    Tabl_train_indices <- sample(1:nrow(Tabl), 0.7*nrow(Tabl))
    Tabl.test <- Tabl[-Tabl_train_indices,]
    Tabl.train <- Tabl[Tabl_train_indices,]
    
    Head_train_indices <- sample(1:nrow(Head), 0.7*nrow(Head))
    Head.test <- Head[-Head_train_indices,]
    Head.train <- Head[Head_train_indices,]
    
  5. Combine the training and test subsets for each type to create the training and test datasets "fgl.tr" and "fgl.te", respectively.
    fgl.tr <- rbind(WinNF.train, Con.train, Tabl.train, Veh.train, WinF.train, Head.train)
    
    fgl.te <- rbind(WinNF.test, Con.test, Tabl.test, Veh.test, WinF.test, Head.test)
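
The per-type sampling and recombination in steps 2 through 5 can also be written more compactly. The following is a sketch under stated assumptions, not the guide's own method: it assumes the MASS package is installed, the helper split_one_type is hypothetical, and the seed is added only so that the illustration is repeatable.

```r
library(MASS)

set.seed(42)  # assumed seed, only to make the sketch reproducible

fgl_with_rowids <- data.frame(rowID = rownames(fgl), fgl)

# Hypothetical helper: draw floor(frac * n) training rows from one type
# and keep the remaining rows of that type as test data.
split_one_type <- function(df, frac = 0.7) {
  all_idx   <- seq_len(nrow(df))
  train_idx <- sample(all_idx, floor(frac * nrow(df)))
  test_idx  <- setdiff(all_idx, train_idx)
  list(train = df[train_idx, ], test = df[test_idx, ])
}

# Apply the helper to each glass type separately (stratified split).
parts <- lapply(split(fgl_with_rowids, fgl_with_rowids$type), split_one_type)

# Stack the per-type pieces back into one training and one test dataset.
fgl.tr <- do.call(rbind, lapply(parts, `[[`, "train"))
fgl.te <- do.call(rbind, lapply(parts, `[[`, "test"))

c(train = nrow(fgl.tr), test = nrow(fgl.te))
```

Using floor() makes the rounding of the training size explicit, and setdiff() keeps the test subset correct even when a type is so small that no training rows are drawn.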
    
  6. Save the training and test datasets into the database using the copy_to() function, where "con" is an existing connection to the database.
    copy_to(con, fgl.tr, name="fgl_train", overwrite=FALSE)
    
    copy_to(con, fgl.te, name="fgl_test", overwrite=FALSE)
    
  7. Create R table objects that reference the database tables using the tbl() function.
    tddf_fgl.tr <- tbl(con, "fgl_train")
    
    tddf_fgl.te <- tbl(con, "fgl_test")
    
  8. Create two different Decision Forest models from the training dataset using the td_decision_forest_mle tdplyr analytic function.
    glass_rf_list_1 <- td_decision_forest_mle(
      formula = (type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe),
      tree.type = "classification",
      data = tddf_fgl.tr,
      ntree = 5)
    
    glass_rf_list_2 <- td_decision_forest_mle(
      formula = (type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe),
      tree.type = "classification",
      data = tddf_fgl.tr,
      ntree = 6,
      mtry = 3)
    

    Both models use the same formula, which represents the glass type as a function of all nine independent variables.

    The second model differs in that it requests one additional tree (ntree = 6 instead of 5) and sets mtry = 3, so that three variables are randomly sampled at each split.
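
To make the formula and the mtry setting more concrete, the sketch below builds the same formula from the column names and draws three of the nine predictors at random. It runs locally in base R and only mimics the idea behind mtry; it does not show how td_decision_forest_mle samples variables internally. The seed and the names glass_formula and split_candidates are assumptions made for the illustration.

```r
predictors <- c("RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe")

# reformulate() assembles a formula with "type" as the response and
# every predictor on the right-hand side.
glass_formula <- reformulate(predictors, response = "type")
glass_formula

# Conceptual picture of mtry = 3: three of the nine predictors are
# candidates at a given split.
set.seed(1)  # assumed seed for a repeatable illustration
split_candidates <- sample(predictors, 3)
split_candidates
```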

  9. Predict on the test dataset for each model using the td_decision_forest_predict_sqle tdplyr analytic function.
    td_decision_forest_predict_sqle(
      object = glass_rf_list_1,
      newdata = tddf_fgl.te,
      id.column = "rowID"
    )
    
    td_decision_forest_predict_sqle(
      object = glass_rf_list_2,
      newdata = tddf_fgl.te,
      id.column = "rowID"
    )
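
One way to compare the two models is to bring each prediction result into local R and match it against the known glass types by "rowID". The following is a rough sketch with made-up values: both data frames and the "prediction" column name are hypothetical stand-ins, so the actual output columns of td_decision_forest_predict_sqle should be checked before adapting it.

```r
# Hypothetical stand-in for the known labels of the test rows.
actual <- data.frame(rowID = c("1", "2", "3", "4"),
                     type  = c("WinF", "WinF", "Head", "Con"),
                     stringsAsFactors = FALSE)

# Hypothetical stand-in for one model's collected predictions.
predicted <- data.frame(rowID      = c("3", "1", "4", "2"),
                        prediction = c("Head", "WinF", "Veh", "WinNF"),
                        stringsAsFactors = FALSE)

# Join on rowID so the row order of the two tables does not matter.
joined <- merge(actual, predicted, by = "rowID")

# Confusion table and share of correctly classified rows.
confusion <- table(actual = joined$type, predicted = joined$prediction)
accuracy  <- mean(joined$type == joined$prediction)

confusion
accuracy
```

Repeating the same calculation for the second model's predictions gives two accuracy values that can be compared directly.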