Teradata R Package Function Reference - 16.20 - RandomSample - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

The RandomSample (td_random_sample_mle) function takes a data set and uses a specified sampling method to output one or more random samples. Each sample has exactly the number of rows specified.

Usage

  td_random_sample_mle (
      data = NULL,
      num.sample = NULL,
      weight.column = NULL,
      sampling.mode = "Basic",
      distance = "EUCLIDEAN",
      input.columns = NULL,
      as.categories = NULL,
      category.weights = NULL,
      categorical.distance = "OVERLAP",
      seed.column = NULL,
      seed = NULL,
      over.sampling.rate = 1,
      iteration.num = 5,
      setid.as.first.column = TRUE,
      data.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the name of the tbl_teradata that contains the data set from which to take samples.

num.sample

Required Argument.
Specifies the sample sizes for the sample sets. For each value in this argument, the function selects one sample set with that number of rows. For example, c(2,4,5) creates three sample sets of 2, 4, and 5 rows, respectively, and 10 creates one sample set of 10 rows.

weight.column

Optional Argument.
Specifies the name of the column that contains weights for weighted sampling. The 'weight.column' must have a numeric SQL data type. By default, rows have equal weight.
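
The relationship between 'num.sample' and 'weight.column' can be sketched in base R, without a Vantage connection. This is only an illustration, not the function's implementation: it assumes sampling without replacement, uses the built-in mtcars data set in place of a tbl_teradata, and numbers the hypothetical set_id column from 1.

```r
# Illustrative base-R sketch; not the tdplyr implementation.
set.seed(42)                  # fixed seed so the sketch is repeatable
cars <- mtcars                # built-in data set standing in for 'data'
num_sample <- c(2, 4, 5)      # one sample set per requested size

# Draw each sample set without replacement (an assumption), with selection
# probability driven by the 'wt' column, as 'weight.column' would do.
sample_sets <- lapply(seq_along(num_sample), function(i) {
  idx <- sample(nrow(cars), size = num_sample[i], prob = cars$wt)
  cbind(set_id = i, cars[idx, , drop = FALSE])
})

# Number of rows in each sample set.
set_sizes <- sapply(sample_sets, nrow)
print(set_sizes)
```

Each element of sample_sets has exactly the requested number of rows, mirroring the one-sample-set-per-value behavior described for 'num.sample'.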

sampling.mode

Optional Argument.
Specifies the sampling mode and can be one of the following:

  1. "Basic" (default): Each input tbl_teradata ('data') row has a probability of being selected that is proportional to its weight. The weight of each row is in 'weight.column'.

  2. "KMeans++": One row is selected in each of k iterations, where k is the number of desired output rows. The first row is selected randomly. In subsequent iterations, the probability of a row being selected is proportional to the value in the 'weight.column' multiplied by the distance from the nearest row in the set of selected rows. The distance is calculated using the methods specified by the 'distance' and 'categorical.distance' arguments.

  3. "KMeans||": Enhanced version of kmeans++ that exploits parallel architecture to accelerate the sampling process. The algorithm is described in the paper "Scalable K-Means++" by Bahmani et al. (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

Briefly, at each iteration, the probability that a row is selected is proportional to the value in the 'weight.column' multiplied by the distance from the nearest row in the set of selected rows (as in kmeans++). However, the kmeans|| algorithm oversamples at each iteration, which significantly reduces the required number of iterations; as a result, the intermediate set of rows might have more than k data points. Each row in that set is then weighted by the number of rows in the tbl_teradata that are closer to that row than to any other selected row, and the rows are clustered to produce exactly k rows.

Tip: For optimal performance, use "KMeans++" when the desired sample size is less than 15 and "KMeans||" otherwise.
Default Value: "Basic"
Permitted Values: Basic, KMeans++, KMeans||
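
The KMeans++ selection rule described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the function's implementation: the helper name kmeanspp_sample is hypothetical, only numeric columns and Euclidean distance are handled, and the selection probability uses the plain distance as worded above (the classic k-means++ formulation uses the squared distance).

```r
# Illustrative KMeans++-style sampling in base R (hypothetical helper).
kmeanspp_sample <- function(x, k, weights = rep(1, nrow(x))) {
  x <- as.matrix(x)
  n <- nrow(x)
  selected <- integer(k)
  selected[1] <- sample.int(n, 1)        # first row: chosen uniformly at random
  # Euclidean distance from every row to the first selected row.
  nearest <- sqrt(rowSums(sweep(x, 2, x[selected[1], ])^2))
  for (i in seq_len(k)[-1]) {
    p <- weights * nearest               # row weight times distance to nearest selected row
    selected[i] <- sample.int(n, 1, prob = p)
    d_new <- sqrt(rowSums(sweep(x, 2, x[selected[i], ])^2))
    nearest <- pmin(nearest, d_new)      # keep the distance to the nearest selected row
  }
  selected
}

set.seed(1)
rows <- kmeanspp_sample(mtcars[, c("mpg", "disp", "hp")], k = 5,
                        weights = mtcars$wt)
print(mtcars[rows, c("mpg", "disp", "hp")])
```

Because an already selected row has distance 0 to itself, its selection probability drops to 0, so the sketch never picks the same row twice. Rows far from the current selection are favored, which is what spreads the sample across the data.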

distance

Required Argument for kmeans++ and kmeans|| sampling.
For kmeans++ and kmeans|| sampling, specifies the function for computing the distance between numerical variables.
Default Value: "EUCLIDEAN"
Permitted Values: MANHATTAN, EUCLIDEAN
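
For reference, the two permitted numeric distance metrics follow their standard definitions, sketched here in base R. The helper names and the two sample rows are illustrative only.

```r
# Standard definitions of the two numeric distance metrics.
euclidean <- function(a, b) sqrt(sum((a - b)^2))
manhattan <- function(a, b) sum(abs(a - b))

# Two illustrative rows with two numeric variables each.
a <- c(mpg = 21.0, hp = 110)
b <- c(mpg = 24.0, hp = 106)

euclidean(a, b)   # sqrt(3^2 + 4^2) = 5
manhattan(a, b)   # 3 + 4 = 7
```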

input.columns

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the names of the input tbl_teradata columns to use to calculate the distance between numerical variables.

as.categories

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the names of the input tbl_teradata columns that contain numerical variables to treat as categorical variables.

category.weights

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the weights (numeric values) of the categorical variables, including those specified by the 'as.categories' argument. Specify the weights in the order (from left to right) that the variables appear in the input table. When calculating the distance between two rows, distances between categorical values are scaled by these weights.

categorical.distance

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the function for computing the distance between categorical variables.

  1. "OVERLAP": The distance between two variables is 0 if they are the same and 1 if they are different.

  2. "HAMMING": The distance between two variables is the Hamming distance between the strings that represent them. The strings must have equal length.


Default Value: "OVERLAP"
Permitted Values: OVERLAP, HAMMING
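
The two categorical distance metrics can be sketched in base R as follows. The helper names are hypothetical, and the last step, which combines per-variable distances as a weighted sum using 'category.weights', is an assumption about how the scaling is applied rather than documented behavior.

```r
# Overlap distance: 0 if the values match, 1 otherwise (vectorized).
overlap_distance <- function(a, b) as.numeric(a != b)

# Hamming distance: number of positions at which two equal-length strings differ.
hamming_distance <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  stopifnot(nchar(a) == nchar(b))      # strings must have equal length
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

overlap_distance("sedan", "sedan")      # 0
overlap_distance("sedan", "coupe")      # 1
hamming_distance("karolin", "kathrin")  # 3

# Assumed weighted-sum scaling of categorical distances by category weights.
cat_weights <- c(cyl = 1000, gear = 10)
row1 <- c(cyl = "6", gear = "4")
row2 <- c(cyl = "8", gear = "4")
weighted <- sum(cat_weights * overlap_distance(row1, row2))
print(weighted)                         # only cyl differs, so 1000
```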

seed.column

Optional Argument.
Specifies the names of the input tbl_teradata columns by which to partition the input. Function calls that use the same input 'data', 'seed', and 'seed.column' output the same result. If you specify 'seed.column', you must also specify 'seed'.
Note: Ideally, the number of distinct values in the 'seed.column' is the same as the number of workers in the cluster. A very large number of distinct values in the 'seed.column' degrades function performance.

seed

Optional Argument.
Specifies the random seed used to initialize the algorithm.

over.sampling.rate

Optional Argument.
For kmeans|| sampling, specifies the oversampling rate (a numeric value greater than 0.0). The function multiplies this rate by each value in 'num.sample'.
Default Value: 1

iteration.num

Optional Argument.
For kmeans|| sampling, specifies the number of iterations (a numeric value greater than 0).
Default Value: 5

setid.as.first.column

Optional Argument.
Specifies whether the generated set_id values are included as the first column in the output.
Note: Support for the 'setid.as.first.column' argument is available only when tdplyr is connected to Vantage 1.1 or later.
Default Value: TRUE
Types: logical

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_random_sample_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("randomsample_example", "fs_input", "fs_input1")
    
    # Create remote tibble objects. The input tables have observations 
    # of 11 variables for different models of cars.
    fs_input <- tbl(con, "fs_input")
    fs_input1 <- tbl(con, "fs_input1")
    
    # Example 1 - This example uses basic sampling to select 3 sample 
    # sets of sizes 2, 3 and 1 rows, weighted by car weight.
    td_random_sample_out1 <- td_random_sample_mle(data = fs_input,
                                              num.sample = c(2,3,1),
                                              weight.column = "wt"
                                             )
    
    # Example 2 - This example uses KMeans++ sampling with the Manhattan
    # distance metric, and treats the numeric variables cyl, gear, and
    # carb as categorical variables.
    td_random_sample_out2 <- td_random_sample_mle(data = fs_input,
                                              num.sample = 10,
                                              sampling.mode = "KMeans++",
                                              distance = "MANHATTAN",
                                              input.columns = c('mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'),
                                              as.categories = c("cyl","gear","carb"),
                                              category.weights = c(1000,10,100,100,100),
                                              seed.column = c("model"),
                                              seed = 1
                                             )
    
    # Example 3 - This example uses KMeans|| sampling with the Manhattan 
    # distance metric for the numerical variables and the Hamming 
    # distance metric for the categorical variables.
    td_random_sample_out3 <- td_random_sample_mle(data = fs_input1,
                                              num.sample = 20,
                                              sampling.mode = "KMeans||",
                                              distance = "MANHATTAN",
                                              input.columns = c('mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'),
                                              as.categories = c("cyl","gear","carb"),
                                              category.weights = c(1000,10,100,100,100),
                                              categorical.distance = "HAMMING",
                                              seed.column = c("model"),
                                              seed = 1,
                                              iteration.num = 2
                                             )