Description
The RandomSample function takes a data set and uses a specified
sampling method to output one or more random samples. Each sample
has exactly the number of rows specified.
Usage
td_random_sample_mle (
data = NULL,
num.sample = NULL,
weight.column = NULL,
sampling.mode = "Basic",
distance = "EUCLIDEAN",
input.columns = NULL,
as.categories = NULL,
category.weights = NULL,
categorical.distance = "OVERLAP",
seed.column = NULL,
seed = NULL,
over.sampling.rate = 1.0,
iteration.num = 5,
setid.as.first.column = TRUE,
data.sequence.column = NULL
)
Arguments
data |
Required Argument.
Specifies the name of the tbl_teradata that contains the data set
from which to take samples.
|
num.sample |
Required Argument.
Specifies the sample sizes for the sample sets. For each value in this
argument, the function selects one sample set with that number of rows.
For example, specifying c(2,4,5) creates 3 sample sets of sizes 2, 4,
and 5 respectively. Similarly, specifying 10 creates one sample set of
size 10.
Types: integer OR vector of integers
|
weight.column |
Optional Argument.
Specifies the name of the column that contains weights for weighted
sampling. The "weight.column" must have a numeric SQL data
type. By default, rows have equal weight.
Types: character
|
sampling.mode |
Optional Argument.
Specifies the sampling mode and can be one of the following:
"Basic": Each input tbl_teradata ('data') row
has a probability of being selected that is proportional to its
weight. The weight of each row is in "weight.column".
"KMeans++": One row is selected in each of k iterations,
where k is the number of desired output rows. The first row is
selected randomly. In subsequent iterations, the probability of
a row being selected is proportional to the value in the
"weight.column" multiplied by the distance from the nearest row
in the set of selected rows. The distance is calculated using
the methods specified by the "distance" and "categorical.distance"
arguments.
"KMeans||": Enhanced version of kmeans++ that exploits parallel
architecture to accelerate the sampling process. The algorithm is
described in the paper "Scalable K-Means++" by Bahmani et al.
Briefly, at each iteration, the probability that a row is selected is
proportional to the value in the "weight.column" multiplied by the
distance from the nearest row in the set of selected rows (as in
kmeans++). However, the kmeans|| algorithm oversamples at each
iteration, significantly reducing the required number of iterations;
therefore, the resulting set of rows might have more than k data
points. Each row in the resulting set is then weighted by the number
of rows in the tbl_teradata that are closer to that row than to any
other selected row, and the rows are clustered to produce exactly k
rows.
Tip: For optimal performance, use "KMeans++" when the desired
sample size is less than 15 and "KMeans||" otherwise.
Default Value: "Basic"
Permitted Values: Basic, KMeans++, KMeans||
Types: character
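The "KMeans++" selection rule described above can be sketched locally in
base R. This is only an illustrative, in-memory sketch and not the Vantage
implementation: the helper name kmeanspp_sample_sketch is hypothetical, the
first row is assumed to be drawn uniformly, only Euclidean distance on
numeric columns is used, and R's built-in mtcars data stands in for the
input tbl_teradata.

```r
# Hypothetical helper: illustrates weight * distance-to-nearest selection.
# It is NOT part of tdplyr and does not run on Vantage.
kmeanspp_sample_sketch <- function(x, k, weights = rep(1, nrow(x))) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k <= n, length(weights) == n, all(weights > 0))

  selected <- integer(k)
  # First row: selected randomly (uniformly here, which is an assumption).
  selected[1] <- sample.int(n, 1)

  # Euclidean distance from every row to the nearest selected row so far.
  nearest <- sqrt(rowSums(sweep(x, 2, x[selected[1], ])^2))

  for (i in seq_len(k - 1) + 1) {
    # Selection probability is proportional to weight * distance. Rows that
    # are already selected have distance 0, so they cannot be drawn again.
    # (If every remaining row had distance 0, sample.int() would fail.)
    prob <- weights * nearest
    selected[i] <- sample.int(n, 1, prob = prob)

    # Update each row's distance to its nearest selected row.
    d_new <- sqrt(rowSums(sweep(x, 2, x[selected[i], ])^2))
    nearest <- pmin(nearest, d_new)
  }
  selected
}

# Local usage: pick 5 rows of mtcars, weighted by car weight.
set.seed(1)
idx <- kmeanspp_sample_sketch(mtcars[, c("mpg", "disp", "hp")], k = 5,
                              weights = mtcars$wt)
mtcars[idx, ]
```

Because each pick favors rows far from those already chosen, the selected
rows tend to be spread across the data rather than concentrated in one
region.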
|
distance |
Required Argument for kmeans++ and kmeans|| sampling.
Specifies the function for computing the distance between
numerical variables. The following functions can be specified:
"euclidean": The distance between two variables is defined in
"Euclidean Distance".
"manhattan": The distance between two variables
is defined in "Manhattan Distance".
Default Value: "EUCLIDEAN"
Permitted Values: MANHATTAN, EUCLIDEAN
Types: character
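As a quick local illustration of the two numeric metrics, the following
base R sketch assumes the standard definitions of Euclidean and Manhattan
distance. The helper names are hypothetical and are not part of tdplyr.

```r
# Hypothetical helpers using the standard textbook definitions.
euclidean_distance <- function(a, b) sqrt(sum((a - b)^2))
manhattan_distance <- function(a, b) sum(abs(a - b))

p <- c(1, 2, 3)
q <- c(4, 6, 3)
euclidean_distance(p, q)  # 5
manhattan_distance(p, q)  # 7
```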
|
input.columns |
Required Argument for kmeans++ and kmeans|| sampling.
Specifies the names of the input tbl_teradata columns to use
to calculate the distance between numerical variables.
Types: character OR vector of Strings (character)
|
as.categories |
Required Argument for kmeans++ and kmeans|| sampling.
Specifies the names of the input tbl_teradata columns that
contain numerical variables to treat as categorical variables.
Types: character OR vector of Strings (character)
|
category.weights |
Required Argument for kmeans++ and kmeans|| sampling.
Specifies the weights (numeric values) of the categorical variables,
including those that the "as.categories" argument specifies. Specify
the weights in the order (from left to right) that the variables appear
in the input tbl_teradata. When calculating the distance between two
rows, distances between categorical values are scaled by these weights.
Types: numeric OR vector of numerics
|
categorical.distance |
Required Argument for kmeans++ and kmeans|| sampling.
Specifies the function for computing the distance between
categorical variables.
"overlap": The distance between two variables is 0 if they are the
same and 1 if they are different.
"hamming": The distance between two variables is the Hamming
distance between the strings that represent them. The strings must
have equal length.
Default Value: "OVERLAP"
Permitted Values: OVERLAP, HAMMING
Types: character
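The two categorical metrics described above can be sketched in base R as
follows. The helper names are hypothetical, and the sketch compares single
values locally rather than reproducing how the function evaluates them on
Vantage.

```r
# Hypothetical helpers; not part of tdplyr.
# Overlap: 0 if the two values are the same, 1 if they differ.
overlap_distance <- function(a, b) as.numeric(a != b)

# Hamming: number of character positions at which two equal-length
# strings differ.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

overlap_distance("4", "4")      # 0
overlap_distance("4", "6")      # 1
hamming_distance("101", "110")  # 2
```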
|
seed.column |
Optional Argument.
Specifies the names of the input tbl_teradata columns by which to partition
the input. Function calls that use the same input "data", "seed", and
"seed.column" output the same result. If you specify "seed.column", you
must also specify "seed".
Note: Ideally, the number of distinct values in the "seed.column" is the
same as the number of workers in the cluster. A very large number of
distinct values in the "seed.column" degrades function performance.
Types: character OR vector of Strings (character)
|
seed |
Optional Argument.
Specifies the random seed with which to initialize the algorithm (a
numeric value). If you specify this argument, then you must also specify
"seed.column".
Types: numeric
|
over.sampling.rate |
Optional Argument.
For kmeans|| sampling, specifies the oversampling rate (a numeric
value greater than 0.0). The function multiplies this rate by each
value in "num.sample".
Default Value: 1.0
Types: numeric
|
iteration.num |
Optional Argument.
For kmeans|| sampling, specifies the number of iterations (an
integer value greater than 0).
Default Value: 5
Types: integer
|
setid.as.first.column |
Optional Argument.
Specifies whether the generated set_id values are included as the
first column in the output.
Note: "setid.as.first.column" argument support is only available
when tdplyr is connected to Vantage 1.1 or later versions.
Default Value: TRUE
Types: logical
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identify each row
of the input argument "data". The argument is used to ensure
deterministic results for functions that produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
Value
Function returns an object of class "td_random_sample_mle", which is a
named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("randomsample_example", "fs_input", "fs_input1")
# Create object(s) of class "tbl_teradata". The input tbl_teradata objects
# contain observations of 11 variables for different models of cars.
fs_input <- tbl(con, "fs_input")
fs_input1 <- tbl(con, "fs_input1")
# Example 1: This example uses basic sampling to select 3 sample
# sets of sizes 2, 3 and 1 rows, weighted by car weight.
td_random_sample_out1 <- td_random_sample_mle(data = fs_input,
num.sample = c(2,3,1),
weight.column = "wt"
)
# Example 2: This example uses KMeans++ sampling with the Manhattan
# distance metric, and treats the numeric variables cyl, gear, and
# carb as categorical variables.
td_random_sample_out2 <- td_random_sample_mle(data = fs_input,
num.sample = 10,
sampling.mode = "KMeans++",
distance = "manhattan",
input.columns = c('mpg','cyl','disp','hp',
'drat','wt','qsec','vs',
'am','gear','carb'),
as.categories = c("cyl","gear","carb"),
category.weights = c(1000,10,100,100,100),
seed.column = c("model"),
seed = 1
)
# Example 3: This example uses KMeans|| sampling with the Manhattan
# distance metric for the numerical variables and the Hamming
# distance metric for the categorical variables.
td_random_sample_out3 <- td_random_sample_mle(data = fs_input1,
num.sample = 20,
sampling.mode = "KMeans||",
distance = "MANHATTAN",
input.columns = c('mpg','cyl','disp','hp',
'drat','wt','qsec','vs',
'am','gear','carb'),
as.categories = c("cyl","gear","carb"),
category.weights = c(1000,10,100,100,100),
categorical.distance = "HAMMING",
seed.column = c("model"),
seed = 1,
iteration.num = 2
)