Introduction to tdplyr functions
- td_nrow
- td_sample

Introduction to tdplyr functions

This section introduces some useful tdplyr functions and their usage. Each function has a section in this vignette. Each section can be read in isolation, except where references are made to other sections or other vignettes. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using these tdplyr specific functions with Teradata Vantage.

Once the connection is established using td_create_context, you can load the example table antiselect_inputand create tbl_teradata using the following statements:

loadExampleData("antiselect_example", "antiselect_input")

# Create tbl_teradata object.
df <- tbl(con, "antiselect_input")

# Create a tibble from tbl_teradata object.
df1 <- as_tibble(df)

# Create a R dataframe from tbl_teradata object.
df2 <- as.data.frame(df)

td_nrow

The function td_nrow retrieves number of rows in a tibble or data.frame object or tbl_teradata object.

# Print the tbl_teradata object.
df
#> # Source:   table<antiselect_input> [?? x 13]
#> # Database: Teradata
#>   rowids orderid orderdate priority quantity  sales discount shipmode
#>    <int>   <int> <chr>     <chr>       <int>  <dbl>    <dbl> <chr>   
#> 1      1       3 2010-10-~ low             6 2.62e2     0.04 regular~
#> 2     85     515 2010-08-~ not spe~       19 3.94e2     0.08 regular~
#> 3     86     515 2010-08-~ not spe~       21 1.47e2     0.05 regular~
#> 4     49     293 2012-10-~ high           49 1.01e4     0.07 deliver~
#> 5     97     613 2011-06-~ high           12 9.35e1     0.03 regular~
#> 6     50     293 2012-10-~ high           27 2.45e2     0.01 regular~
#> # ... with more rows, and 5 more variables: custname <chr>,
#> #   province <chr>, region <chr>, custsegment <chr>, prodcat <chr>

# Get number of rows of tbl_teradata object.
td_nrow(df)
#> [1] 7

# Get number of rows of tibble object.
td_nrow(df1)
#> [1] 7

# Get number of rows of R dataframe.
td_nrow(df2)
#> [1] 7

Notes

The nrow function of R works for dataframes and tibble objects but not for tbl_teradata objects. So, to get the number of rows in a tbl_teradata object, you should use the function td_nrow.
Please refer to dplyr function tally which is used to get number of the rows in R data.frames.

td_sample

The function td_sample reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in either of the three ways mentioned below:

Specifying list of numbers (number of rows in each sample)
Specifying list of fractions (proportion of the total number of rows in each sample)
Specifying list of numbers/fractions based on conditions (stratified sampling)

For more information about the function and its arguments, one can use the command help(td_sample) from R console.

# Reducing the number of columns.
df <- df %>% select(rowids, orderid, priority, quantity)

# Example 1: Get two samples of 3 rows and 2 rows each. Note that the column 'sampleid' takes 2 values. Each sampleid value represents one sample.
td_sample(df = df, n = c(3, 2))
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1     97     613 high                12        2
#> 2     86     515 not specified       21        1
#> 3     85     515 not specified       19        1
#> 4     50     293 high                27        1
#> 5     80     483 high                30        2

# Example 2: Get a sample of 3 rows. Note that all the rows have sampleid = 1.
td_sample(df = df, n = 3)
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1     85     515 not specified       19        1
#> 2     97     613 high                12        1
#> 3     50     293 high                27        1

# Example 3: Get 50% of total rows.
td_sample(df = df, n = 0.5)
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1     97     613 high                12        1
#> 2     86     515 not specified       21        1
#> 3      1       3 low                  6        1
#> 4     80     483 high                30        1

# Example 4: Get two samples each containing 30% and 50% of rows.
td_sample(df = df, n = c(0.3, 0.5))
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        2
#> 2     85     515 not specified       19        1
#> 3     86     515 not specified       21        2
#> 4     97     613 high                12        2
#> 5     50     293 high                27        2
#> 6     80     483 high                30        1

# Example 5: Get 10 rows from a tbl_teradata object of 7 rows without setting with.replacement. Default value of the argument 'with.replacement' is FALSE. Note that only 7 rows are returned as the table doesn't contain 10 rows.
df1 <- td_sample(df = df, n = 10)

# Print tbl_teradata object.
df1
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        1
#> 2     85     515 not specified       19        1
#> 3     86     515 not specified       21        1
#> 4     49     293 high                49        1
#> 5     97     613 high                12        1
#> 6     50     293 high                27        1
#> # ... with more rows

# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 7

# Example 6: Get 10 rows from a tbl_teradata object of 7 rows setting with.replacement to TRUE. 'randomize = TRUE'ensures sampling is done across AMPs in large datasets. Note that 10 rows are returned and there are duplicates for some rows.
df1 <- td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)

# Print tbl_teradata object.
df1
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1     50     293 high           27        1
#> 2     97     613 high           12        1
#> 3     97     613 high           12        1
#> 4     97     613 high           12        1
#> 5     97     613 high           12        1
#> 6     80     483 high           30        1
#> # ... with more rows

# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 10

# Example 7 : Get 5 rows which satisfy the condition 'orderid < 300' from a tbl_teradata object. Here, only three rows are returned as the total number of rows which satisfy this condition is 3. 
df1 <- td_sample(df, when_then = list("orderid < 300" = 5))

# Print tbl_teradata object.
df1
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1     49     293 high           49        1
#> 2      1       3 low             6        1
#> 3     50     293 high           27        1

# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 3

# If with.replacement is set to TRUE, then 5 rows will be returned.
df1 <- td_sample(df, when_then = list("orderid < 300" = 5), with.replacement = TRUE)

# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 5

# Example 8: Get 4 rows (1 row in first sample and 3 rows in second sample) which satisfy the condition 'orderid < 300' from a tbl_teradata object. Here, only 2 rows have sampleid = 2 as the total number of rows which satisfy this condition is 3.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3)))

# Print tbl_teradata object.
df1
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1     49     293 high           49        2
#> 2      1       3 low             6        1
#> 3     50     293 high           27        2

# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 3

# If with.replacement is set to TRUE, then 3 rows having sampleid = 2 will be returned.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3)), with.replacement = TRUE)

# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 4

# Example 9: Using stratified sampling with multiple conditions : 4 rows (1 row in first sample and 3 rows in second sample) when orderid < 300 and 2 rows when priority != "high". Note that only 3 rows are returned when orderid < 300.
td_sample(df, when_then = list("orderid < 300" = c(1, 3), "priority <> 'high'" = 2))
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1     85     515 not specified       19        3
#> 2     49     293 high                49        2
#> 3     86     515 not specified       21        3
#> 4      1       3 low                  6        2
#> 5     50     293 high                27        1

# Example 10: Using 'case_else' argument for stratified sampling : 2 rows when orderid < 300 and 3 rows from the remaining rows (rows which doesn't satisfy orderid < 300).
td_sample(df, when_then = list("orderid < 300" = 2), case_else = 3)
#> # Source:   SQL [?? x 5]
#> # Database: Teradata
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1     85     515 not specified       19        2
#> 2     49     293 high                49        1
#> 3     86     515 not specified       21        2
#> 4      1       3 low                  6        1
#> 5     80     483 high                30        2

Notes

No more than 16 samples can be requested per count or fraction list i.e. the arguments cannot take a list of elements with more than 16 samples.
Sum of elements in the list containing fraction values should not be greater than 1 and each value should be greater than 0.
If the list contains a float value greater than 1, e.g. c(3, 2.4), then the floor value is considered for sampling i.e. first sample contains 3 rows and second sample contains 2 rows.
A new column ‘sampleid’ is added to the sampled data to determine the sample set each row belongs to in the sample. If the parent tbl object already has the column ‘sampleid’, this column is removed in the sampled data. This case occurs when multiple or consecutive sample operations are performed. To retain the ‘sampleid’ columns across multiple td_sample operations, the column must be renamed using mutate function.
If the number of samples requested exceed the number of rows available, the sample size is reduced to the number of remaining rows when the argument with.replacement is set to FALSE.

Teradata R Package Function Reference - Introduction to tdplyr Functions - Teradata R Package - Look here for syntax, methods and examples for the functions included in the Teradata R Package.

Teradata® R Package Function Reference

Miscellaneous tdplyr functions

Introduction to tdplyr functions

td_nrow

Notes

td_sample

Notes