The td_sample() function reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in one of the following three ways:
- Specifying list of numbers (number of rows in each sample).
- Specifying list of fractions (proportion of the total number of rows in each sample).
- Specifying list of numbers or fractions based on conditions (stratified sampling).
For more information about the function and its arguments, Users can use the command help(td_sample) from R console.
Example Setup
- Load example dataset.
> loadExampleData("antiselect_example", "antiselect_input")
- Create object(s) of class "tbl_teradata".
> df <- tbl(con, "antiselect_input")
- Reduce the number of columns.
> df <- df %>% select(rowids, orderid, priority, quantity)
Example 1: Get two samples of 3 rows and 2 rows each
In this example, the column 'sampleid' takes 2 values. Each sampleid value represents one sample.
> df1 <- td_sample(df = df, n = c(3, 2))
> df1 # Source: SQL [?? x 5] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB] rowids orderid priority quantity sampleid <int> <int> <chr> <int> <int> 1 49 293 high 49 1 2 86 515 not specified 21 2 3 50 293 high 27 1 4 85 515 not specified 19 1 5 97 613 high 12 2
Example 2: Get a sample of 3 rows
In this example, all the rows have 'sampleid' set to 1.
> df1 <- td_sample(df = df, n = 3)
> df1 # Source: SQL [?? x 5] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB] rowids orderid priority quantity sampleid <int> <int> <chr> <int> <int> 1 85 515 not specified 19 1 2 1 3 low 6 1 3 50 293 high 27 1
Example 3: Get two samples each containing 30% and 50% of rows
> df1 <- td_sample(df = df, n = c(0.3, 0.5))
> df1 # Source: SQL [?? x 5] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB] rowids orderid priority quantity sampleid <int> <int> <chr> <int> <int> 1 80 483 high 30 2 2 97 613 high 12 1 3 85 515 not specified 19 2 4 1 3 low 6 2 5 50 293 high 27 1 6 49 293 high 49 2
Example 4: Get 10 rows from a tbl_teradata object of 7 rows setting with.replacement = TRUE
In this example, setting 'randomize' to TRUE ensures that sampling is done across AMPs in large datasets.
> df1 <- td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)
> df1 # Source: SQL [?? x 5] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB] rowids orderid priority quantity sampleid <int> <int> <chr> <int> <int> 1 97 613 high 12 1 2 97 613 high 12 1 3 80 483 high 30 1 4 49 293 high 49 1 5 49 293 high 49 1 6 49 293 high 49 1 7 49 293 high 49 1 8 97 613 high 12 1 9 97 613 high 12 1 10 97 613 high 12 1Note that 10 rows are returned and there are duplicates for some rows.
Example 5: Using stratified sampling with multiple conditions
In this example, multiple conditions are used: get 4 rows (1 row in first sample and 3 rows in second sample) when 'orderid' is less than 300 and 2 rows when 'priority' is not equal to 'high'.
Only 3 rows are returned when 'orderid' is less than 300, because actual tbl_teradata contains only 3 rows satisfying this condition.
> df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3), "priority <> 'high'" = 2))
> df1 # Source: SQL [?? x 5] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] [TDAPUSER@<hostname>/TDAPUSERDB] rowids orderid priority quantity sampleid <int> <int> <chr> <int> <int> 1 1 3 low 6 2 2 50 293 high 27 1 3 49 293 high 49 2 4 86 515 not specified 21 3 5 85 515 not specified 19 3