td_sample() | Teradata R Package - td_sample() - Teradata Package for R

Teradata® Package for R User Guide

Product
Teradata Package for R
Release Number
17.00
Published
July 2021
Language
English (United States)
Last Update
2023-08-08
dita:mapPath
yih1585763700215.ditamap
dita:ditavalPath
ayr1485454803741.ditaval
dita:id
B700-4005
Product Category
Teradata Vantage
The td_sample() function reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in one of the following three ways:
  • Specifying list of numbers (number of rows in each sample).
  • Specifying list of fractions (proportion of the total number of rows in each sample).
  • Specifying list of numbers or fractions based on conditions (stratified sampling).

For more information about the function and its arguments, Users can use the command help(td_sample) from R console.

Example Setup

  • Load example dataset.
    > loadExampleData("antiselect_example", "antiselect_input")
  • Create object(s) of class "tbl_teradata".
    > df <- tbl(con, "antiselect_input")
  • Reduce the number of columns.
    > df <- df %>% select(rowids, orderid, priority, quantity)
    

Example 1: Get two samples of 3 rows and 2 rows each

In this example, the column 'sampleid' takes 2 values. Each sampleid value represents one sample.

> df1 <- td_sample(df = df, n = c(3, 2))
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1     49     293 high                49        1
2     86     515 not specified       21        2
3     50     293 high                27        1
4     85     515 not specified       19        1
5     97     613 high                12        2

Example 2: Get a sample of 3 rows

In this example, all the rows have 'sampleid' set to 1.

> df1 <- td_sample(df = df, n = 3)
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1     85     515 not specified       19        1
2      1       3 low                  6        1
3     50     293 high                27        1

Example 3: Get two samples each containing 30% and 50% of rows

> df1 <- td_sample(df = df, n = c(0.3, 0.5))
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1     80     483 high                30        2
2     97     613 high                12        1
3     85     515 not specified       19        2
4      1       3 low                  6        2
5     50     293 high                27        1
6     49     293 high                49        2

Example 4: Get 10 rows from a tbl_teradata object of 7 rows setting with.replacement = TRUE

In this example, setting 'randomize' to TRUE ensures that sampling is done across AMPs in large datasets.

> df1 <- td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
   rowids orderid priority quantity sampleid
    <int>   <int> <chr>       <int>    <int>
 1     97     613 high           12        1
 2     97     613 high           12        1
 3     80     483 high           30        1
 4     49     293 high           49        1
 5     49     293 high           49        1
 6     49     293 high           49        1
 7     49     293 high           49        1
 8     97     613 high           12        1
 9     97     613 high           12        1
10     97     613 high           12        1
Note that 10 rows are returned and there are duplicates for some rows.

Example 5: Using stratified sampling with multiple conditions

In this example, multiple conditions are used: get 4 rows (1 row in first sample and 3 rows in second sample) when 'orderid' is less than 300 and 2 rows when 'priority' is not equal to 'high'.

Only 3 rows are returned when 'orderid' is less than 300, because actual tbl_teradata contains only 3 rows satisfying this condition.
> df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3), "priority <> 'high'" = 2))
> df1
# Source:   SQL [?? x 5]
# Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB]
  rowids orderid priority      quantity sampleid
   <int>   <int> <chr>            <int>    <int>
1      1       3 low                  6        2
2     50     293 high                27        1
3     49     293 high                49        2
4     86     515 not specified       21        3
5     85     515 not specified       19        3