Description
The RandomSample (td_random_sample_mle) function takes a data set and
uses a specified sampling method to output one or more random samples.
Each sample has exactly the number of rows specified.
Usage
td_random_sample_mle (
    data = NULL,
    num.sample = NULL,
    weight.column = NULL,
    sampling.mode = "Basic",
    distance = "EUCLIDEAN",
    input.columns = NULL,
    as.categories = NULL,
    category.weights = NULL,
    categorical.distance = "OVERLAP",
    seed.column = NULL,
    seed = NULL,
    over.sampling.rate = 1,
    iteration.num = 5,
    setid.as.first.column = TRUE,
    data.sequence.column = NULL
)
Arguments
data |
Required Argument. |
num.sample |
Required Argument. |
weight.column |
Optional Argument. |
sampling.mode |
Optional Argument.
Briefly, at each iteration, the probability that a row is selected is proportional to the value in the 'weight.column' multiplied by the distance from the nearest row in the set of selected rows (as in kmeans++). However, the kmeans|| algorithm oversamples at each iteration, significantly reducing the required number of iterations; therefore, the resulting set of rows might have more than k data points. Each row in the resulting set is then weighted by the number of rows in the tbl_teradata that are closer to that row than to any other selected row, and the rows are clustered to produce exactly k rows.
Tip: For optimal performance, use "kmeans++" when the desired sample size is less than 15 and "kmeans||" otherwise.
Default Value: "Basic"
Permitted Values: Basic, KMeans++, KMeans||
|
distance |
Required Argument for kmeans++ and kmeans|| sampling. |
input.columns |
Required Argument for kmeans++ and kmeans|| sampling. |
as.categories |
Required Argument for kmeans++ and kmeans|| sampling. It specifies the names of the input tbl_teradata columns that contain numerical variables to treat as categorical variables. |
category.weights |
Required Argument for kmeans++ and kmeans|| sampling. |
categorical.distance |
Required Argument for kmeans++ and kmeans|| sampling. It specifies the function for computing the distance between categorical variables.
|
seed.column |
Optional Argument. |
seed |
Optional Argument. Specifies the random seed used to initialize the algorithm. |
over.sampling.rate |
Optional Argument. |
iteration.num |
Optional Argument. |
setid.as.first.column |
Optional Argument. |
data.sequence.column |
Optional Argument. |
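The selection rule described under 'sampling.mode' (a row is picked with probability proportional to its weight multiplied by its distance from the nearest already-selected row) can be illustrated with a small base R sketch. This is an illustration only and not the Teradata implementation: the helper name kmeanspp_sample is hypothetical, it handles numeric columns with Euclidean distance only, it omits the oversampling and reclustering steps of kmeans||, and it uses the built-in mtcars data set purely as stand-in data.

```r
# Illustrative sketch only; NOT the algorithm as implemented by Teradata.
# kmeanspp_sample() is a hypothetical helper that returns k row indices of x.
kmeanspp_sample <- function(x, k, weights = rep(1, nrow(x)), seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k <= n, length(weights) == n, all(weights > 0))

  # First row: drawn with probability proportional to its weight.
  selected <- sample.int(n, size = 1, prob = weights)

  # Euclidean distance from every row to the nearest selected row so far.
  nearest <- sqrt(rowSums(sweep(x, 2, x[selected, ])^2))

  while (length(selected) < k) {
    # Selection probability: weight times distance to nearest selected row.
    # Already-selected rows have distance 0, so they are never drawn again.
    prob <- weights * nearest
    if (all(prob == 0)) break  # every remaining row duplicates a selected one
    nxt <- sample.int(n, size = 1, prob = prob)
    selected <- c(selected, nxt)

    # Update nearest-selected distances with the newly added row.
    d_new <- sqrt(rowSums(sweep(x, 2, x[nxt, ])^2))
    nearest <- pmin(nearest, d_new)
  }
  selected
}

# Usage: pick 5 spread-out rows from mtcars, weighted by car weight.
idx <- kmeanspp_sample(mtcars[, c("mpg", "hp", "wt")], k = 5,
                       weights = mtcars$wt, seed = 1)
picked <- mtcars[idx, ]
```

Because a selected row has zero distance to itself, the sketch never returns the same row twice; this is one reason distance-weighted selection tends to produce samples that cover the data more evenly than basic sampling.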
Value
The function returns an object of class "td_random_sample_mle", which is a
named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection

# Load example data.
loadExampleData("randomsample_example", "fs_input", "fs_input1")

# Create remote tibble objects. The input tables have observations
# of 11 variables for different models of cars.
fs_input <- tbl(con, "fs_input")
fs_input1 <- tbl(con, "fs_input1")

# Example 1 - This example uses basic sampling to select 3 sample
# sets of sizes 2, 3 and 1 rows, weighted by car weight.
td_random_sample_out1 <- td_random_sample_mle(data = fs_input,
                                              num.sample = c(2,3,1),
                                              weight.column = "wt"
                                              )

# Example 2 - This example uses KMeans++ sampling with the Manhattan
# distance metric, and treats the numeric variables cyl, gear, and
# carb as categorical variables.
td_random_sample_out2 <- td_random_sample_mle(data = fs_input,
                                              num.sample = 10,
                                              sampling.mode = "KMeans++",
                                              distance = "manhattan",
                                              input.columns = c('mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'),
                                              as.categories = c("cyl","gear","carb"),
                                              category.weights = c(1000,10,100,100,100),
                                              seed.column = c("model"),
                                              seed = 1
                                              )

# Example 3 - This example uses KMeans|| sampling with the Manhattan
# distance metric for the numerical variables and the Hamming
# distance metric for the categorical variables.
td_random_sample_out3 <- td_random_sample_mle(data = fs_input1,
                                              num.sample = 20,
                                              sampling.mode = "KMeans||",
                                              distance = "MANHATTAN",
                                              input.columns = c('mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'),
                                              as.categories = c("cyl","gear","carb"),
                                              category.weights = c(1000,10,100,100,100),
                                              categorical.distance = "HAMMING",
                                              seed.column = c("model"),
                                              seed = 1,
                                              iteration.num = 2
                                              )