Description
The RandomSample (td_random_sample_mle) function takes a data set and
uses a specified sampling method to output one or more random samples.
Each sample has exactly the number of rows specified.
Usage
td_random_sample_mle (
    data = NULL,
    num.sample = NULL,
    weight.column = NULL,
    sampling.mode = "Basic",
    distance = "EUCLIDEAN",
    input.columns = NULL,
    as.categories = NULL,
    category.weights = NULL,
    categorical.distance = "OVERLAP",
    seed.column = NULL,
    seed = NULL,
    over.sampling.rate = 1,
    iteration.num = 5,
    setid.as.first.column = TRUE,
    data.sequence.column = NULL
)
Arguments
data 
Required Argument. 
num.sample 
Required Argument. 
weight.column 
Optional Argument. 
sampling.mode 
Optional Argument.
Briefly, at each iteration, the probability that a row is selected is proportional to the value in the 'weight.column' multiplied by the distance from the nearest row in the set of selected rows (as in kmeans++). However, the kmeans algorithm oversamples at each iteration, significantly reducing the required number of iterations; therefore, the resulting set of rows might have more than k data points. Each row in the resulting set is then weighted by the number of rows in the tbl_teradata that are closer to that row than to any other selected row, and the rows are clustered to produce exactly k rows.
Tip: For optimal performance, use "kmeans++" when the desired sample size is less than 15 and "kmeans" otherwise.
Default Value: "Basic"
Permitted Values: Basic, KMeans++, KMeans
distance 
Required Argument for kmeans++ and kmeans sampling. 
input.columns 
Required Argument for kmeans++ and kmeans sampling. 
as.categories 
Required Argument for kmeans++ and kmeans sampling. It specifies the names of the input tbl_teradata columns that contain numerical variables to treat as categorical variables. 
category.weights 
Required Argument for kmeans++ and kmeans sampling. 
categorical.distance 
Required Argument for kmeans++ and kmeans sampling. It specifies the function for computing the distance between categorical variables.

seed.column 
Optional Argument. 
seed 
Optional Argument. Specifies the random seed used to initialize the algorithm. 
over.sampling.rate 
Optional Argument. 
iteration.num 
Optional Argument. 
setid.as.first.column 
Optional Argument. 
data.sequence.column 
Optional Argument. 
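The selection rule described under sampling.mode (a row's selection probability is proportional to its weight multiplied by its distance to the nearest already-selected row) can be illustrated locally in base R. This is only a sketch under stated assumptions: the toy points, the weights, and the choice of Euclidean distance are made up for illustration, and the code is not the implementation used by td_random_sample_mle.

```r
# Illustrative sketch only; not the td_random_sample_mle implementation.
# Toy data: three numeric points with per-row weights (made-up values).
points  <- matrix(c(0, 0,
                    3, 4,
                    6, 8), ncol = 2, byrow = TRUE)
weights <- c(1, 2, 1)

# Assume row 1 has already been selected in an earlier iteration.
selected <- 1

# Euclidean distance from every row to its nearest selected row.
nearest_dist <- apply(points, 1, function(p) {
  diffs <- sweep(points[selected, , drop = FALSE], 2, p)  # selected rows minus p
  min(sqrt(rowSums(diffs^2)))
})

# Selection probability is proportional to weight * distance,
# so an already-selected row (distance 0) cannot be picked again.
score <- weights * nearest_dist
prob  <- score / sum(score)

set.seed(1)
next_row <- sample(seq_len(nrow(points)), size = 1, prob = prob)
```

In this toy case row 2 is closer to the selected row than row 3 but carries twice the weight, so the two rows end up equally likely to be chosen next.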
Value
Function returns an object of class "td_random_sample_mle", which is a
named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result.
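The access pattern can be shown without a database connection by using a stand-in object. In the sketch below, mock_out is a hypothetical plain list built locally to mimic the shape described above; a real return value would hold a remote Teradata tbl in result, not a data.frame.

```r
# Stand-in for a returned object: a named list with a single member "result".
# This is a local mock for illustration, not output of td_random_sample_mle.
mock_out <- structure(
  list(result = data.frame(set_id = c(0, 0), model = c("a", "b"))),
  class = "td_random_sample_mle"
)

# Reference the named list member with the "$" operator.
sampled <- mock_out$result
```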
Examples
# Get the current context/connection
con <- td_get_context()$connection

# Load example data.
loadExampleData("randomsample_example", "fs_input", "fs_input1")

# Create remote tibble objects. The input tables have observations
# of 11 variables for different models of cars.
fs_input <- tbl(con, "fs_input")
fs_input1 <- tbl(con, "fs_input1")

# Example 1 - This example uses basic sampling to select 3 sample
# sets of sizes 2, 3 and 1 rows, weighted by car weight.
td_random_sample_out1 <- td_random_sample_mle(data = fs_input,
    num.sample = c(2,3,1),
    weight.column = "wt"
    )

# Example 2 - This example uses KMeans++ sampling with the Manhattan
# distance metric, and treats the numeric variables cyl, gear, and
# carb as categorical variables.
td_random_sample_out2 <- td_random_sample_mle(data = fs_input,
    num.sample = 10,
    sampling.mode = "KMeans++",
    distance = "manhattan",
    input.columns = c('mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'),
    as.categories = c("cyl","gear","carb"),
    category.weights = c(1000,10,100,100,100),
    seed.column = c("model"),
    seed = 1
    )

# Example 3 - This example uses KMeans sampling with the Manhattan
# distance metric for the numerical variables and the Hamming
# distance metric for the categorical variables.
td_random_sample_out3 <- td_random_sample_mle(data = fs_input1,
    num.sample = 20,
    sampling.mode = "KMeans",
    distance = "MANHATTAN",
    input.columns = c('mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'),
    as.categories = c("cyl","gear","carb"),
    category.weights = c(1000,10,100,100,100),
    categorical.distance = "HAMMING",
    seed.column = c("model"),
    seed = 1,
    iteration.num = 2
    )
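For readers without a Teradata connection, the idea behind Example 1 (several sample sets of given sizes, weighted by car weight) can be approximated locally. The sketch below is a rough analogue in base R on the built-in mtcars data, which is assumed here to resemble fs_input. Drawing without replacement within a set and the set_id column name are assumptions made for illustration, not documented behavior of td_random_sample_mle.

```r
# Local illustration only: base R on the built-in mtcars data,
# not a call to td_random_sample_mle and not its implementation.
set.seed(1)
sizes <- c(2, 3, 1)   # same sizes as num.sample in Example 1

samples <- lapply(seq_along(sizes), function(i) {
  # Assumption: rows are drawn without replacement within a sample set,
  # with probability proportional to the "wt" column.
  idx <- sample(nrow(mtcars), size = sizes[i], prob = mtcars$wt)
  # "set_id" is a hypothetical column name identifying the sample set.
  cbind(set_id = i, mtcars[idx, ])
})

# Stack the sample sets into one data frame, set identifier first.
local_result <- do.call(rbind, samples)
```

Each sample set contains exactly the requested number of rows, which mirrors the guarantee stated in the Description.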