Description
The RandomSample function takes a data set and uses a specified sampling method to output one or more random samples. Each sample has exactly the number of rows specified.
Usage
td_random_sample_mle(
    data = NULL,
    num.sample = NULL,
    weight.column = NULL,
    sampling.mode = "Basic",
    distance = "EUCLIDEAN",
    input.columns = NULL,
    as.categories = NULL,
    category.weights = NULL,
    categorical.distance = "OVERLAP",
    seed.column = NULL,
    seed = NULL,
    over.sampling.rate = 1.0,
    iteration.num = 5,
    setid.as.first.column = TRUE,
    data.sequence.column = NULL
)
Arguments
data |
Required Argument. |
num.sample |
Required Argument. |
weight.column |
Optional Argument. |
sampling.mode |
Optional Argument.
Briefly, at each iteration, the probability that a row is selected is
proportional to the value in the "weight.column" multiplied by the
distance from the nearest row in the set of selected rows (as in
kmeans++). However, the kmeans|| algorithm oversamples at each
iteration, significantly reducing the required number of iterations;
therefore, the resulting set of rows might have more than k data
points. Each row in the resulting set is then weighted by the number
of rows in the tbl_teradata that are closer to that row than to any
other selected row, and the rows are clustered to produce exactly k
rows. Tip: For optimal performance, use "kmeans++" when the desired
sample size is less than 15 and "kmeans||" otherwise. |
distance |
Required Argument for kmeans++ and kmeans|| sampling. |
input.columns |
Required Argument for kmeans++ and kmeans|| sampling. |
as.categories |
Required Argument for kmeans++ and kmeans|| sampling. |
category.weights |
Required Argument for kmeans++ and kmeans|| sampling. |
categorical.distance |
Required Argument for kmeans++ and kmeans|| sampling. |
seed.column |
Optional Argument. |
seed |
Optional Argument. |
over.sampling.rate |
Optional Argument. |
iteration.num |
Optional Argument. |
setid.as.first.column |
Optional Argument. |
data.sequence.column |
Optional Argument. |
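The selection rule described under "sampling.mode" (probability proportional to the weight multiplied by the distance to the nearest already-selected row) can be illustrated locally with a minimal base R sketch. This is not the function's implementation: the helper name weighted_kmeanspp_sample is hypothetical, it runs on an in-memory data frame rather than a tbl_teradata, it uses plain Euclidean distance on numeric columns only, and it covers only the one-row-per-iteration kmeans++ style step, not the oversampling and re-clustering of kmeans||. The built-in mtcars data set is used purely as an assumed stand-in for car data.

```r
# Hypothetical helper, not part of tdplyr: weighted, distance-proportional
# row selection (kmeans++-style seeding) on a local data frame.
weighted_kmeanspp_sample <- function(df, k, weight_col, input_cols, seed = 1) {
  set.seed(seed)
  x <- as.matrix(df[, input_cols, drop = FALSE])
  w <- df[[weight_col]]
  n <- nrow(x)

  # Euclidean distance from every row to row i.
  dist_to <- function(i) sqrt(rowSums(sweep(x, 2, x[i, ])^2))

  # First row: probability proportional to the weight alone.
  selected <- sample.int(n, 1, prob = w)
  # Distance from each row to its nearest selected row so far.
  nearest <- dist_to(selected)

  while (length(selected) < k) {
    # Selection probability: weight multiplied by distance to the nearest
    # selected row. Already-selected rows have distance 0, so they cannot
    # be drawn again.
    nxt <- sample.int(n, 1, prob = w * nearest)
    selected <- c(selected, nxt)
    nearest <- pmin(nearest, dist_to(nxt))
  }
  df[selected, , drop = FALSE]
}

# Draw 5 rows from mtcars, weighted by car weight ("wt").
out <- weighted_kmeanspp_sample(mtcars, k = 5, weight_col = "wt",
                                input_cols = c("mpg", "disp", "hp"))
print(out)
```

Because rows far from the current selection are favored, the sample tends to spread across the input space instead of concentrating where the data is densest.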
Value
The function returns an object of class "td_random_sample_mle", which is a
named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("randomsample_example", "fs_input", "fs_input1")
# Create object(s) of class "tbl_teradata". The input tbl_teradata objects contain
# observations of 11 variables for different models of cars.
fs_input <- tbl(con, "fs_input")
fs_input1 <- tbl(con, "fs_input1")
# Example 1: This example uses basic sampling to select 3 sample
# sets of sizes 2, 3 and 1 rows, weighted by car weight.
td_random_sample_out1 <- td_random_sample_mle(data = fs_input,
                                              num.sample = c(2, 3, 1),
                                              weight.column = "wt"
                                              )
# Example 2: This example uses KMeans++ sampling with the Manhattan
# distance metric, and treats the numeric variables cyl, gear, and
# carb as categorical variables.
td_random_sample_out2 <- td_random_sample_mle(data = fs_input,
                                              num.sample = 10,
                                              sampling.mode = "KMeans++",
                                              distance = "manhattan",
                                              input.columns = c('mpg','cyl','disp','hp',
                                                                'drat','wt','qsec','vs',
                                                                'am','gear','carb'),
                                              as.categories = c("cyl","gear","carb"),
                                              category.weights = c(1000,10,100,100,100),
                                              seed.column = c("model"),
                                              seed = 1
                                              )
# Example 3: This example uses KMeans|| sampling with the Manhattan
# distance metric for the numerical variables and the Hamming
# distance metric for the categorical variables.
td_random_sample_out3 <- td_random_sample_mle(data = fs_input1,
                                              num.sample = 20,
                                              sampling.mode = "KMeans||",
                                              distance = "MANHATTAN",
                                              input.columns = c('mpg','cyl','disp','hp',
                                                                'drat','wt','qsec','vs',
                                                                'am','gear','carb'),
                                              as.categories = c("cyl","gear","carb"),
                                              category.weights = c(1000,10,100,100,100),
                                              categorical.distance = "HAMMING",
                                              seed.column = c("model"),
                                              seed = 1,
                                              iteration.num = 2
                                              )