td_sample() | Teradata R Package - 17.00 - td_sample() - Teradata R Package

Teradata® R Package User Guide

prodname
Teradata R Package
vrm_release
17.00
created_date
November 2020
category
User Guide
featnum
B700-4005-090K
The td_sample() function reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in either of the three ways mentioned below:
  • Specifying list of numbers (number of rows in each sample).
  • Specifying list of fractions (proportion of the total number of rows in each sample).
  • Specifying list of numbers or fractions based on conditions (stratified sampling).

For more information about the function and its arguments, Users can use the command help(td_sample) from R console.

Example Setup

  • Load example dataset.
    > loadExampleData("antiselect_example", "antiselect_input")
  • Create object(s) of class "tbl_teradata".
    > df <- tbl(con, "antiselect_input")
  • Reduce the number of columns.
    > df <- df %>% select(rowids, orderid, priority, quantity)
    

Example 1: Get two samples of 3 rows and 2 rows each

In this example, the column 'sampleid' takes 2 values. Each sampleid value represents one sample.

> df1 <- td_sample(df = df, n = c(3, 2))
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1     49     293 high                49        1
2     86     515 not specified       21        2
3     50     293 high                27        1
4     85     515 not specified       19        1
5     97     613 high                12        2

Example 2: Get a sample of 3 rows

In this example, all the rows have 'sampleid' set to 1.

> df1 <- td_sample(df = df, n = 3)
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1     85     515 not specified       19        1
2      1       3 low                  6        1
3     50     293 high                27        1

Example 3: Get two samples each containing 30% and 50% of rows

> df1 <- td_sample(df = df, n = c(0.3, 0.5))
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1     80     483 high                30        2
2     97     613 high                12        1
3     85     515 not specified       19        2
4      1       3 low                  6        2
5     50     293 high                27        1
6     49     293 high                49        2

Example 4: Get 10 rows from a tbl_teradata object of 7 rows setting with.replacement = TRUE

In this example, setting 'randomize' to TRUE ensures that sampling is done across AMPs in large datasets.

> df1 <- td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
   rowids orderid priority quantity sampleid
    <int>   <int> <chr>       <int>    <int>
 1     97     613 high           12        1
 2     97     613 high           12        1
 3     80     483 high           30        1
 4     49     293 high           49        1
 5     49     293 high           49        1
 6     49     293 high           49        1
 7     49     293 high           49        1
 8     97     613 high           12        1
 9     97     613 high           12        1
10     97     613 high           12        1
Note that 10 rows are returned and there are duplicates for some rows.

Example 5: Using stratified sampling with multiple conditions

In this example, multiple conditions are used: get 4 rows (1 row in first sample and 3 rows in second sample) when 'orderid' is less than 300 and 2 rows when 'priority' is not equal to 'high'.

Only 3 rows are returned when 'orderid' is less than 300, because actual tbl_teradata contains only 3 rows satisfying this condition.
> df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3), "priority <> 'high'" = 2))
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1      1       3 low                  6        2
2     50     293 high                27        1
3     49     293 high                49        2
4     86     515 not specified       21        3
5     85     515 not specified       19        3