Teradata Package for R Function Reference | 17.00 - td_sample - Teradata Package for R - Look here for syntax, methods and examples for the functions included in the Teradata Package for R.

Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Published
July 2021
Language
English (United States)
Last Update
2023-08-08
dita:id
B700-4007
NMT
no
Product Category
Teradata Vantage
Selects one or more samples of rows

Description

This function reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in either of the three ways mentioned below:

  1. Specifying list of numbers (number of rows in each sample)

  2. Specifying list of fractions (proportion of the total number of rows in each sample)

  3. Specifying list of numbers/fractions based on conditions (stratified sampling)

Stratified random sampling is a sampling method that divides a heterogeneous population of interest into homogeneous subgroups, or strata, and then takes a random sample from each of those subgroups. The arguments 'when_then' and 'case_else' help in stratified sampling.
Usage notes for the arguments 'n', 'case_else' and each 'then' element in the argument 'when_then':

  1. No more than 16 samples can be requested per count or fraction list i.e. the arguments cannot take a list of elements with more than 16 samples.

  2. Sum of elements in the list containing fraction values should not be greater than 1 and each value should be greater than 0.

  3. If the list contains a float value greater than 1, e.g. c(3, 2.4), then the floor value is considered for sampling i.e. first sample contains 3 rows and second sample contains 2 rows.

Note :

  1. A new column 'sampleid' is added to the sampled data to determine the sample set each row belongs to in the sample. If the parent tbl object already has the column 'sampleid', this column is removed in the sampled data. This case occurs when multiple or consecutive sample operations are performed. To retain the 'sampleid' columns across multiple td_sample operations, the column must be renamed using mutate function.

  2. If the number of samples requested exceed the number of rows available, the sample size is reduced to the number of remaining rows when the argument 'with.replacement' is set to FALSE.

Usage

td_sample(
  df = NULL,
  n = NULL,
  with.replacement = FALSE,
  randomize = FALSE,
  when_then = NULL,
  case_else = c()
)

Arguments

df

Required Argument.
Specifies the teradata_tbl object from which rows are sampled.

n

Optional Argument.
Required if the argument 'when_then' is not specified.
Specifies the number of rows or proportion of rows to be sampled.
Types: numeric OR vector of numerics
Examples:

  1. c(1, 2) - to get 2 samples, one containing 1 row and the other containing 2 rows

  2. 2 - to get one sample containing two rows

  3. c(0.3, 0.5) - to get 2 samples, one containing 30% of rows and the other containing 50% of rows of all samples

with.replacement

Optional Argument.
Specifies if sampling should be done with replacement or not.
If this argument is FALSE, sampling is done without replacement.
Default value : FALSE.
Types: logical

randomize

Optional Argument.
Specifies whether rows are sampled randomly across AMPs (RANDOMIZED ALLOCATION) or proportionate to the number of qualified rows per AMP (PROPORTIONAL ALLOCATION). If this argument is FALSE, sampling is done proportionate to the number of qualified rows per AMP.
Note : The proportional allocation option does not provide a simple random sample of the entire population. It provides a random sample stratified by AMPs, but it is much faster, especially for very large samples.
Default value : FALSE.
Types: logical

when_then

Optional Argument.
Required when the argument 'n' is not specified.
Specifies the string conditions ('when' element) and the number of samples ('then' element) required to generate sample rows for each 'when' condition.
Default value : NULL
Types : Named list of numbers OR named list of vector of numbers.
Example: Suppose sampling is to be performed on a teradata_tbl object having columns 'col1' and 'col2'.
when_then <- list("col1 < 20" = 2, "col2 <> 'value'" = c(1, 3), "col1 = 30" = c(2, 1))
The equivalent when - then clauses for this example argument is:

  1. WHEN col1 < 20 THEN 2

  2. WHEN col2 <> 'value' THEN 1, 3

  3. WHEN col1 = 30 THEN 2, 1

case_else

Optional Argument.
Specifies the number of samples to be sampled from rows where none of the conditions in 'when_then' are met.
Default value : c()
Types : numeric OR vector of numerics
Examples : The equivalent else clause for the argument value:

  1. c(1,2) is ELSE 1, 2

  2. 2 is ELSE 2

  3. c(0.3, 0.5) is ELSE 0.3, 0.5

Value

A 'tbl' object containing the sampled data.

See Also

sample, td_sampling

Examples


# Get remote data source connection.
con <- td_get_context()$connection

# Creates the table "antiselect_input" if it is not present already.
loadExampleData("antiselect_example", "antiselect_input")

# Creates a teradata_tbl object.
df <- tbl(con, "antiselect_input") # Table contain 7 rows in total.

# Example 1: Get two samples of 3 rows and 2 rows each.
td_sample(df = df, n = c(3,2))

# Example 2: Get a sample of 3 rows. Note that all the rows have sampleid = 1.
td_sample(df = df, n = 3)

# Example 3: Get 50% of total rows. Here, it is 50% of 7 rows.
td_sample(df = df, n = 0.5)

# Example 4: Get 10 rows from a tbl object of 7 rows using with.replacement = TRUE.
#            'randomize = TRUE' will ensure sampling is done across AMPs in large datasets.
td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)

# Example 5: Get 5 rows which satisfy the condition 'orderid < 300' from a tbl object.
#            Here, only three rows are returned as the total number of rows which satisfy this
#            condition is 3. If with.replacement = TRUE is specified, then 5 rows will be
#            returned.
td_sample(df, when_then = list("orderid < 300" = 5))

# Example 6: Get 4 rows (1 row in first sample and 3 rows in second sample) which satisfy the
#            condition 'orderid < 300' from a tbl object.
#            Here, only 2 rows have sampleid = 2 as the total number of rows which satisfy
#            this condition is 3. If with.replacement = TRUE is specified, then 3 rows having
#            sampleid = 2 will be returned.
td_sample(df, when_then = list("orderid < 300" = c(1,3)))

# Example 7: Using stratified sampling with multiple conditions : 4 rows (1 row in first sample
#            and 3 rows in second sample) when orderid < 300 and 2 rows when priority != "high".
td_sample(df, when_then = list("orderid < 300" = c(1,3), "priority <> 'high'" = 2))

# Example 8: Using 'case_else' argument for stratified sampling : 2 rows when orderid < 300 and
#            3 rows from the remaining rows (rows which doesn't satisfy orderid < 300).
td_sample(df, when_then = list("orderid < 300" = 2), case_else = 3)