RandomSample | Teradata R Package Function Reference 17.00

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 17.00
created_date: September 2020
category: Programming Reference
featnum: B700-4007-090K

Description

The RandomSample function takes a data set and uses a specified sampling method to output one or more random samples. Each sample has exactly the number of rows specified.

Usage

  td_random_sample_mle (
      data = NULL,
      num.sample = NULL,
      weight.column = NULL,
      sampling.mode = "Basic",
      distance = "EUCLIDEAN",
      input.columns = NULL,
      as.categories = NULL,
      category.weights = NULL,
      categorical.distance = "OVERLAP",
      seed.column = NULL,
      seed = NULL,
      over.sampling.rate = 1.0,
      iteration.num = 5,
      setid.as.first.column = TRUE,
      data.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the name of the tbl_teradata that contains the data set from which to take samples.

num.sample

Required Argument.
Specifies the sample sizes for the sample sets. For each value in this argument, the function selects one sample set with that number of rows. For example, c(2,4,5) creates 3 sample sets of sizes 2, 4, and 5 respectively, and 10 creates one sample set of size 10.
Types: integer OR vector of integers

weight.column

Optional Argument.
Specifies the name of the column that contains weights for weighted sampling. The "weight.column" must have a numeric SQL data type. By default, rows have equal weight.
Types: character
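
The effect of "num.sample" and "weight.column" can be illustrated locally in base R, without a Vantage connection. This is only a sketch of weighted sampling, not the function's implementation: the data frame cars, its values, the set_id numbering, and the choice to sample without replacement are all assumptions made for the illustration.

```r
# Local stand-in for the input table: 10 rows with a numeric weight column.
# (Hypothetical data, not the fs_input example table.)
set.seed(42)
cars <- data.frame(id = 1:10,
                   wt = c(2.6, 2.9, 2.3, 3.2, 3.4, 3.5, 3.6, 3.2, 3.1, 3.4))

num_sample <- c(2, 4, 5)   # one sample set per requested size

# For each requested size, draw that many rows, favouring rows with
# larger weights. set_id labels the sample set (numbering is illustrative).
sample_sets <- lapply(seq_along(num_sample), function(i) {
  rows <- sample(nrow(cars), size = num_sample[i], prob = cars$wt)
  cbind(set_id = i, cars[rows, ])
})

set_sizes <- sapply(sample_sets, nrow)   # number of rows in each sample set
```

Each element of sample_sets has exactly the requested number of rows, mirroring the statement that each sample has exactly the number of rows specified.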

sampling.mode

Optional Argument.
Specifies the sampling mode and can be one of the following:

  1. "Basic": Each input tbl_teradata ('data') row has a probability of being selected that is proportional to its weight. The weight of each row is in "weight.column".

  2. "KMeans++": One row is selected in each of k iterations, where k is the number of desired output rows. The first row is selected randomly. In subsequent iterations, the probability of a row being selected is proportional to the value in the "weight.column" multiplied by the distance from the nearest row in the set of selected rows. The distance is calculated using the methods specified by the "distance" and "categorical.distance" arguments.

  3. "KMeans||": An enhanced version of KMeans++ that exploits parallel architecture to accelerate the sampling process. The algorithm is described in the paper "Scalable K-Means++" by Bahmani et al.

     Briefly, at each iteration, the probability that a row is selected is proportional to the value in the "weight.column" multiplied by the distance from the nearest row in the set of selected rows (as in KMeans++). However, the KMeans|| algorithm oversamples at each iteration, which significantly reduces the required number of iterations; as a result, the intermediate set of rows might have more than k data points. Each row in that set is then weighted by the number of rows in the tbl_teradata that are closer to it than to any other selected row, and the rows are clustered to produce exactly k rows.

Tip: For optimal performance, use "KMeans++" when the desired sample size is less than 15 and "KMeans||" otherwise.
Default Value: "Basic"
Permitted Values: Basic, KMeans++, KMeans||
Types: character
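
The KMeans++ selection rule described above can be sketched in base R as follows. This is a local illustration under stated assumptions, not the function's implementation: kmeanspp_sample is a hypothetical helper, only numeric columns and the Euclidean distance are handled, and the first row is drawn uniformly.

```r
# Hypothetical helper: choose k row indices of the numeric matrix x.
# w holds the per-row weights (equal weights by default).
kmeanspp_sample <- function(x, k, w = rep(1, nrow(x))) {
  # Euclidean distance from every row of x to row i
  dist_to <- function(i) sqrt(rowSums(sweep(x, 2, x[i, ])^2))

  chosen  <- sample.int(nrow(x), 1)   # first row: selected at random
  nearest <- dist_to(chosen)          # distance to the nearest selected row

  for (j in seq_len(k - 1)) {
    # Selection probability is proportional to weight * distance to the
    # nearest selected row, so already selected rows have probability 0.
    # (This fails if every remaining row duplicates a selected row.)
    nxt     <- sample.int(nrow(x), 1, prob = w * nearest)
    chosen  <- c(chosen, nxt)
    nearest <- pmin(nearest, dist_to(nxt))
  }
  chosen
}

set.seed(1)
x   <- matrix(rnorm(40), ncol = 2)    # 20 rows, 2 numeric columns
idx <- kmeanspp_sample(x, k = 5)      # indices of the 5 selected rows
```

Because each iteration favours rows far from those already selected, the sample tends to spread across the data rather than cluster in dense regions.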

distance

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the function for computing the distance between numerical variables. The following functions can be specified:

  1. "euclidean": The distance between two variables is defined in "Euclidean Distance".

  2. "manhattan": The distance between two variables is defined in "Manhattan Distance".


Default Value: "EUCLIDEAN"
Permitted Values: MANHATTAN, EUCLIDEAN
Types: character
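
For reference, the two numerical distance options can be written in base R as below, assuming the standard definitions of Euclidean and Manhattan distance. The helper names euclidean_dist and manhattan_dist are illustrative and are not part of the package.

```r
# Standard definitions, applied to two numeric vectors of equal length.
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))   # straight-line distance
manhattan_dist <- function(a, b) sum(abs(a - b))        # sum of absolute differences

p <- c(0, 0)
q <- c(3, 4)

d_euclidean <- euclidean_dist(p, q)   # 5
d_manhattan <- manhattan_dist(p, q)   # 7
```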

input.columns

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the names of the input tbl_teradata columns to use to calculate the distance between numerical variables.
Types: character OR vector of Strings (character)

as.categories

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the names of the input tbl_teradata columns that contain numerical variables to treat as categorical variables.
Types: character OR vector of Strings (character)

category.weights

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the weights (numeric values) of the categorical variables, including those that the "as.categories" argument specifies. Specify the weights in the order (from left to right) that the variables appear in the input tbl_teradata. When calculating the distance between two rows, distances between categorical values are scaled by these weights.
Types: numeric OR vector of numerics

categorical.distance

Required Argument for kmeans++ and kmeans|| sampling.
Specifies the function for computing the distance between categorical variables.

  1. "overlap": The distance between two variables is 0 if they are the same and 1 if they are different.

  2. "hamming": The distance between two variables is the Hamming distance between the strings that represent them. The strings must have equal length.


Default Value: "OVERLAP"
Permitted Values: OVERLAP, HAMMING
Types: character
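
The two categorical distance options, and the scaling applied by "category.weights", can be sketched in base R as follows. The helper names, the example values, and the weights are assumptions made for illustration; this reference does not state how the function combines the per-column distances, so the sketch stops at the scaled per-column values.

```r
# Overlap distance: 0 if the values are the same, 1 if they differ.
overlap_dist <- function(a, b) as.numeric(a != b)

# Hamming distance between two equal-length strings: the number of
# character positions at which they differ.
hamming_dist <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

d_overlap <- overlap_dist("4", "6")               # 1
d_hamming <- hamming_dist("karolin", "kathrin")   # 3

# Scaling per-column overlap distances by hypothetical category weights.
weights <- c(cyl = 10, gear = 100)
row1    <- c(cyl = "4", gear = "3")
row2    <- c(cyl = "6", gear = "3")
scaled  <- weights * overlap_dist(row1, row2)     # cyl = 10, gear = 0
```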

seed.column

Optional Argument.
Specifies the names of the input tbl_teradata columns by which to partition the input. Function calls that use the same input "data", "seed", and "seed.column" output the same result. If you specify "seed.column", you must also specify "seed".
Note: Ideally, the number of distinct values in the "seed.column" is the same as the number of workers in the cluster. A very large number of distinct values in the "seed.column" degrades function performance.
Types: character OR vector of Strings (character)

seed

Optional Argument.
Specifies the random seed with which to initialize the algorithm (a numeric value). If you specify this argument, then you must also specify "seed.column".
Types: numeric

over.sampling.rate

Optional Argument.
For kmeans|| sampling, specifies the oversampling rate (a numeric value greater than 0.0). The function multiplies this rate by each value in "num.sample".
Default Value: 1.0
Types: numeric
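
As a small worked example of that multiplication, assuming two sample sets and a rate of 1.5 (both values chosen only for illustration):

```r
num_sample         <- c(10, 20)   # requested sample sizes
over_sampling_rate <- 1.5         # hypothetical rate, must be greater than 0

# One product per requested sample size
per_set <- over_sampling_rate * num_sample   # 15 30
```

How the function uses each product internally is not stated in this reference.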

iteration.num

Optional Argument.
For kmeans|| sampling, specifies the number of iterations (an integer value greater than 0).
Default Value: 5
Types: integer

setid.as.first.column

Optional Argument.
Specifies whether the generated set_id values are included as the first column in the output.
Note: "setid.as.first.column" argument support is only available when tdplyr is connected to Vantage 1.1 or later versions.
Default Value: TRUE
Types: logical

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_random_sample_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("randomsample_example", "fs_input", "fs_input1")

    # Create object(s) of class "tbl_teradata". The input tbl_teradata objects
    # contain observations of 11 variables for different models of cars.
    fs_input <- tbl(con, "fs_input")
    fs_input1 <- tbl(con, "fs_input1")

    # Example 1: This example uses basic sampling to select 3 sample
    # sets of sizes 2, 3 and 1 rows, weighted by car weight.
    td_random_sample_out1 <- td_random_sample_mle(data = fs_input,
                                                  num.sample = c(2,3,1),
                                                  weight.column = "wt"
                                                 )

    # Example 2: This example uses KMeans++ sampling with the Manhattan
    # distance metric, and treats the numeric variables cyl, gear, and
    # carb as categorical variables.
    td_random_sample_out2 <- td_random_sample_mle(data = fs_input,
                                                  num.sample = 10,
                                                  sampling.mode = "KMeans++",
                                                  distance = "manhattan",
                                                  input.columns = c('mpg','cyl','disp','hp',
                                                                    'drat','wt','qsec','vs',
                                                                    'am','gear','carb'),
                                                  as.categories = c("cyl","gear","carb"),
                                                  category.weights = c(1000,10,100,100,100),
                                                  seed.column = c("model"),
                                                  seed = 1
                                                 )

    # Example 3: This example uses KMeans|| sampling with the Manhattan
    # distance metric for the numerical variables and the Hamming
    # distance metric for the categorical variables.
    td_random_sample_out3 <- td_random_sample_mle(data = fs_input1,
                                                  num.sample = 20,
                                                  sampling.mode = "KMeans||",
                                                  distance = "MANHATTAN",
                                                  input.columns = c('mpg','cyl','disp','hp',
                                                                    'drat','wt','qsec','vs',
                                                                    'am','gear','carb'),
                                                  as.categories = c("cyl","gear","carb"),
                                                  category.weights = c(1000,10,100,100,100),
                                                  categorical.distance = "HAMMING",
                                                  seed.column = c("model"),
                                                  seed = 1,
                                                  iteration.num = 2
                                                 )