Teradata R Package Function Reference | 17.00 - Sampling - Teradata R Package

Teradata® R Package Function Reference

Product: Teradata R Package
Release: 17.00
Created: September 2020
Category: Programming Reference
Feature number: B700-4007-090K

Description

The Sampling function (td_sampling_mle) draws rows randomly from the input tbl_teradata.

Usage

  td_sampling_mle (
      data = NULL,
      data.partition.column = "ANY",
      data.order.column = NULL,
      summary.data = NULL,
      summary.data.order.column = NULL,
      stratum.column = NULL,
      strata = NULL,
      sample.fraction = NULL,
      approx.sample.size = NULL,
      seed = 0,
      data.sequence.column = NULL,
      summary.data.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the data to be sampled.

data.partition.column

Optional Argument.
Specifies Partition By columns for "data".
Values to this argument can be provided as a vector, if multiple columns are used for partition.
Default Value: ANY
Types: character OR vector of Strings (character)

data.order.column

Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

summary.data

Optional Argument.
Specifies the summary tbl_teradata containing the stratum count information.

summary.data.order.column

Optional Argument.
Specifies Order By columns for "summary.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

stratum.column

Optional Argument.
Specifies the name of the column that contains the sample conditions. If the function has only one input tbl_teradata, then the sampling condition column is in "data". If the function has two input tbl_teradata objects, "data" and "summary.data", then the sampling condition column is in "summary.data".
Types: character

strata

Optional Argument.
Specifies the sample conditions that appear in the condition column specified by "stratum.column". If "strata" specifies a condition that does not appear in the condition column, then the function issues an error message.
Types: character OR vector of characters
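
The check described for "strata" can be sketched in base R. This is only an illustration of the rule, not the function's implementation: the helper check_strata and the scores data frame are hypothetical names introduced for this example.

```r
# Hypothetical helper: verify that every requested stratum appears
# in the condition column before any sampling is attempted.
check_strata <- function(condition_values, strata) {
  absent <- setdiff(strata, unique(condition_values))
  if (length(absent) > 0) {
    stop("strata not found in condition column: ",
         paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

# Illustrative data only; not the package's example table.
scores <- data.frame(
  id = 1:6,
  stratum = c("fair", "fair", "very good", "very good",
              "excellent", "excellent"),
  stringsAsFactors = FALSE
)

# Both conditions exist, so the check passes silently.
check_strata(scores$stratum, c("fair", "excellent"))

# "poor" is not in the condition column, so the check raises an error.
outcome <- tryCatch(
  check_strata(scores$stratum, c("fair", "poor")),
  error = function(e) conditionMessage(e)
)
print(outcome)
```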

sample.fraction

Optional Argument.
Specifies one or more fractions to use in sampling the data. (Syntax options that do not use "sample.fraction" require "approx.sample.size".) If you specify only one fraction, then the function uses that fraction for all strata defined by the sample conditions. If you specify more than one fraction, then the function uses each fraction for sampling the corresponding stratum defined by the "strata" argument.
Note: For conditional sampling with variable sample sizes, specify one fraction for each condition that you specify with the "strata" argument.
Types: numeric OR vector of numerics
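
The pairing of fractions with strata can be illustrated with a small base R sketch. It assumes each row is kept independently with the probability assigned to its stratum, which this reference does not confirm as the function's actual algorithm. The helper sample_by_fraction and the scores data frame are hypothetical.

```r
# Hypothetical helper: keep each row with the probability assigned to
# its stratum (an assumed selection scheme, for illustration only).
sample_by_fraction <- function(df, stratum_col, strata, fractions, seed = 0) {
  # A single fraction applies to every stratum.
  if (length(fractions) == 1) fractions <- rep(fractions, length(strata))
  stopifnot(length(fractions) == length(strata))
  prob <- setNames(fractions, strata)

  set.seed(seed)
  # Only rows whose condition is listed in `strata` are candidates.
  candidates <- df[df[[stratum_col]] %in% strata, , drop = FALSE]
  row_prob <- unname(prob[as.character(candidates[[stratum_col]])])
  keep <- runif(nrow(candidates)) < row_prob
  candidates[keep, , drop = FALSE]
}

# Illustrative data only: 100 rows per stratum.
scores <- data.frame(
  id = 1:300,
  stratum = rep(c("fair", "very good", "excellent"), each = 100),
  stringsAsFactors = FALSE
)

# One fraction per stratum, in the same order as `strata`.
sampled <- sample_by_fraction(scores, "stratum",
                              strata = c("fair", "very good", "excellent"),
                              fractions = c(0.2, 0.3, 0.4),
                              seed = 2)
print(table(sampled$stratum))
```

Under this assumption the per-stratum counts vary around 20, 30, and 40 from seed to seed rather than matching them exactly.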

approx.sample.size

Optional Argument.
Specifies one or more approximate sample sizes to use in sampling the data. (Syntax options that do not use "approx.sample.size" require "sample.fraction".) Each sample size is approximate because the function maps the size to sample fractions and then generates the sample data. If you specify only one size, then it represents the total sample size for the entire population. If you also specify the "strata" argument, then the function generates sample units for each stratum in proportion to the stratum size. If you specify more than one size, then each size corresponds to a stratum, and the function uses each size to generate sample units for the corresponding stratum.
Note: For conditional sampling with argument "approx.sample.size", specify one size for each condition that you specify with the "strata" argument.
Types: integer OR vector of integers
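
The mapping from sizes to fractions can be sketched as follows. The exact formula the function uses is not given in this reference, so the arithmetic below (size divided by row count) is an assumption. The helper sizes_to_fractions and the stratum counts are hypothetical.

```r
# Hypothetical helper: convert approximate sample sizes into
# per-stratum sampling fractions (assumed mapping: size / row count).
sizes_to_fractions <- function(stratum_counts, sizes) {
  if (length(sizes) == 1) {
    # One size is the total for the whole population. Proportional
    # allocation gives every stratum the same fraction.
    fractions <- rep(sizes / sum(stratum_counts), length(stratum_counts))
  } else {
    # One size per stratum, matched by position.
    stopifnot(length(sizes) == length(stratum_counts))
    fractions <- sizes / stratum_counts
  }
  # A fraction cannot exceed 1 (the whole stratum).
  setNames(pmin(fractions, 1), names(stratum_counts))
}

# Illustrative stratum counts, such as a summary table might hold.
counts <- c(excellent = 20, fair = 50, "very good" = 30)

# Total sample of about 10 rows out of 100: fraction 0.1 in every stratum.
print(sizes_to_fractions(counts, 10))

# About 5, 10, and 5 rows from the respective strata.
print(sizes_to_fractions(counts, c(5, 10, 5)))
```

This also shows why a summary input with stratum counts is useful: the per-stratum fractions cannot be derived from the sizes without knowing how many rows each stratum holds.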

seed

Optional Argument.
Specifies the random seed used to initialize the algorithm.
Default Value: 0
Types: numeric
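
The effect of a fixed seed can be illustrated with base R's own random number generator. This is an analogy only: the function's generator runs in the database and is not the same as set.seed. The helper draw is hypothetical.

```r
# Hypothetical helper: select row positions with a given probability,
# after initializing base R's generator with a seed.
draw <- function(n, fraction, seed) {
  set.seed(seed)
  which(runif(n) < fraction)
}

first  <- draw(100, 0.2, seed = 2)
second <- draw(100, 0.2, seed = 2)

# The same seed reproduces the same selection.
print(identical(first, second))
```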

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)
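
Why a unique ordering helps can be sketched in base R: when random draws are matched to rows by position, the same seed picks the same rows only if the rows arrive in the same order. The data and the helpers pick and pick_ordered are hypothetical, and the sketch makes no claim about how the function applies the sequence columns internally.

```r
# Illustrative rows; `id` plays the role of a unique sequence column.
rows <- data.frame(id = 1:10, value = letters[1:10],
                   stringsAsFactors = FALSE)

# The same rows in a different arrival order.
shuffled <- rows[c(4, 9, 1, 7, 10, 2, 6, 3, 8, 5), ]

# Select ids by matching random draws to rows by position.
pick <- function(df, fraction, seed) {
  set.seed(seed)
  df[runif(nrow(df)) < fraction, "id"]
}

# Sorting on the unique key first makes the selection independent
# of the order in which the rows arrived.
pick_ordered <- function(df, fraction, seed) {
  pick(df[order(df$id), ], fraction, seed)
}

print(identical(pick_ordered(rows, 0.5, 2), pick_ordered(shuffled, 0.5, 2)))
```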

summary.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "summary.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_sampling_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Load the packages used in these examples.
    library(dplyr)
    library(tdplyr)

    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("sampling_example", "students", "score_category")

    # Create object(s) of class "tbl_teradata".
    students <- tbl(con, "students")
    score_category <- tbl(con, "score_category")

    # Example 1: This example selects a sample of approximately 20%
    # of the rows in the students tbl_teradata.
    td_sampling_out1 <- td_sampling_mle(data = students,
                                        sample.fraction = 0.2,
                                        seed = 2
                                        )

    # Example 2: This example applies sampling rates 20%, 30%, and 40%
    # to categories fair, very good, and excellent, respectively, and rounds
    # the number sampled to the nearest integer.
    td_sampling_out2 <- td_sampling_mle(data = score_category,
                                        data.partition.column = "stratum",
                                        stratum.column = "stratum",
                                        strata = c("fair", "very good", "excellent"),
                                        sample.fraction = c(0.2, 0.3, 0.4),
                                        seed = 2
                                        )

    # Example 3: This example demonstrates conditional sampling with
    # approximate sample sizes.
    # score_summary groups the score_category tbl_teradata by the stratum
    # column and holds the row count of each stratum.
    score_summary <- score_category %>% select(stratum) %>% count(stratum) %>%
                      mutate(stratum_count = as.integer(n)) %>% select(-n)

    td_sampling_out3 <- td_sampling_mle(data = score_category,
                                        summary.data = score_summary,
                                        stratum.column = "stratum",
                                        strata = c("excellent", "fair", "very good"),
                                        approx.sample.size = c(5, 10, 5),
                                        seed = 2
                                        )