This example shows how to create a model and use it to make predictions on a new dataset. It uses the "Boston" dataset from the MASS package, split into separate training and test datasets: the training dataset is used to create the model, and the test dataset is used to evaluate its performance.
-
Add row identifiers to the dataset.
library(MASS)
Boston.wi <- data.frame(id = as.integer(row.names(Boston)), Boston)
-
Divide the dataset into training and test datasets.
train <- sample(1:nrow(Boston.wi), 400)
Boston.wi.train <- Boston.wi[train, ]
Boston.wi.test <- Boston.wi[-train, ]
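Because sample() draws rows at random, the split, and therefore the output shown in later steps, differs from run to run. A minimal sketch of a reproducible split follows, assuming a fixed seed is acceptable. The seed value 42 is an arbitrary choice and is not part of the original example.

```r
library(MASS)

# Same row-identifier step as above.
Boston.wi <- data.frame(id = as.integer(row.names(Boston)), Boston)

# Assumption: fix the random seed so the same 400 rows are drawn on every run.
set.seed(42)
train <- sample(1:nrow(Boston.wi), 400)

Boston.wi.train <- Boston.wi[train, ]
Boston.wi.test <- Boston.wi[-train, ]

# Sanity checks: the training set has 400 rows, the test set has the rest,
# and no id appears in both sets.
stopifnot(nrow(Boston.wi.train) == 400)
stopifnot(nrow(Boston.wi.test) == nrow(Boston.wi) - 400)
stopifnot(length(intersect(Boston.wi.train$id, Boston.wi.test$id)) == 0)
```

With a fixed seed, the same ids land in the test dataset each time, which makes the predictions easier to compare across runs.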
-
Create the virtual data frames.
ta.dropTable("boston_tr", schemaName = "public")
ta.dropTable("boston_te", schemaName = "public")
tadf.boston.train <- ta.create(Boston.wi.train, table = 'boston_tr', schemaName = 'public',
                               tableType = 'fact', partitionKey = 'id')
tadf.boston.test <- ta.create(Boston.wi.test, table = 'boston_te', schemaName = 'public',
                              tableType = 'fact', partitionKey = 'id')
-
Create an R function to build a linear regression model using the columns 'lstat', 'crim', 'rad', and 'zn' to predict 'medv'.
lm_model <- function(tadf) {
  model <- lm(data = tadf, medv ~ lstat + crim + rad + zn)
  return(model)
}
-
Create the predict function.
predict_lm <- function(tadf, model) {
  out <- predict(model, newdata = tadf)
  return(out)
}
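Before passing these functions to aa.apply(), it can help to check them locally on an ordinary R data frame. The sketch below assumes only base R and the MASS package. It does not call aa.apply(), and the names local_train, local_test, local_model, and local_pred are hypothetical.

```r
library(MASS)

lm_model <- function(tadf) {
  model <- lm(data = tadf, medv ~ lstat + crim + rad + zn)
  return(model)
}

predict_lm <- function(tadf, model) {
  out <- predict(model, newdata = tadf)
  return(out)
}

# Illustrative local split (first 400 rows versus the rest),
# not the random split used in the example.
local_train <- Boston[1:400, ]
local_test <- Boston[401:nrow(Boston), ]

local_model <- lm_model(local_train)
local_pred <- predict_lm(local_test, local_model)

# One prediction per test row; the model has an intercept plus four predictors.
stopifnot(length(local_pred) == nrow(local_test))
stopifnot(length(coef(local_model)) == 5)
```

If both checks pass locally, any later failure is more likely to come from the data transfer or the aa.apply() arguments than from the model code itself.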
-
Use the aa.apply() function to create the model.
boston_model <- aa.apply(tadf.boston.train, FUN = lm_model, out.format = list(type = "object"))
Because the output is a model, the output type is "object".
-
Use the aa.apply() function to apply predict_lm, the predict function created in Step 5, to the test dataset.
aa_predict <- aa.apply(tadf.boston.test, FUN = predict_lm,
                       FUN.args = list(boston_model[1]),
                       out.format = list(columns = c("id", "medv_pred"),
                                         columnTypes = c("integer", "numeric")))
The first ten rows of the output are shown here.
> aa_predict
    id medv_pred
1  113 19.503383
2  161 29.071735
3  221 25.407370
4  260 28.196376
5  284 32.731455
6  364 21.083813
7  464 24.901682
8  488 23.922381
9  165 23.459021
10 272 28.461022
-
Compare the predicted and observed values for 'medv'.
-
Create a data frame containing only the 'id' and observed 'medv' values.
aa_obs <- tadf.boston.test[, c("id", "medv")]
-
Use the ta.join() function to create a table containing the observed ('medv') and predicted ('medv_pred') values for each 'id' in the original test dataset.
ta.join(aa_obs, aa_predict, by="id")
The first few rows of output are shown here.
   medv medv_pred x.id y.id
1  22.2 28.548877   63   63
2  18.5 24.704342  115  115
3  24.4 29.219659  251  251
4  21.6 26.948964  314  314
5  23.1 27.910150  322  322
6   5.0  9.131461  406  406
7  20.4 21.392737  108  108
8  15.6 20.316601  156  156
9  50.0 31.023614  164  164
10 30.5 30.666012  192  192
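One common way to summarize this comparison is the root mean squared error (RMSE) or the mean absolute error (MAE) between the two columns. The sketch below is an illustration only: it copies the ten rows shown above into a plain local data.frame rather than working on the joined virtual data frame, and the names joined, rmse, and mae are hypothetical.

```r
# Local copy of the ten observed/predicted pairs shown in the output above.
joined <- data.frame(
  medv = c(22.2, 18.5, 24.4, 21.6, 23.1, 5.0, 20.4, 15.6, 50.0, 30.5),
  medv_pred = c(28.548877, 24.704342, 29.219659, 26.948964, 27.910150,
                9.131461, 21.392737, 20.316601, 31.023614, 30.666012)
)

# Residual = observed minus predicted.
residuals <- joined$medv - joined$medv_pred

# RMSE penalizes large misses more heavily than MAE does.
rmse <- sqrt(mean(residuals^2))
mae <- mean(abs(residuals))

print(c(rmse = rmse, mae = mae))
```

Ten rows are too few to judge the model. The same two lines of arithmetic apply to the full set of test rows once they are available as a local data frame.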
-
Create a dataframe containing only the 'id' and observed 'medv' values.