Teradata R Package Function Reference | 17.00 - 17.00 - Introduction to tdplyr Functions - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
17.00
created_date
September 2020
category
Programming Reference
featnum
B700-4007-090K

Introduction to tdplyr functions

This section introduces some useful tdplyr functions and their usage. Each function has a section in this vignette. Each section can be read in isolation, except where references are made to other sections or other vignettes. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using these tdplyr specific functions with Teradata Vantage.

Once the connection is established using td_create_context, you can load the example table antiselect_inputand create tbl_teradata using the following statements:

loadExampleData("antiselect_example", "antiselect_input")

# Create tbl_teradata object.
df <- tbl(con, "antiselect_input")

# Create a tibble from tbl_teradata object.
df1 <- as_tibble(df)

# Create a R dataframe from tbl_teradata object.
df2 <- as.data.frame(df)

td_nrow

The function td_nrow retrieves number of rows in a tibble or data.frame object or tbl_teradata object.

Note: The tbl_teradata objects are arranged to display ordered result while printing the output.

# Print the tbl_teradata object.
df %>% arrange(rowids)
#> # Source:     table<antiselect_input> [?? x 13]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid orderdate priority quantity  sales discount shipmode custname
#>    <int>   <int> <chr>     <chr>       <int>  <dbl>    <dbl> <chr>    <chr>   
#> 1      1       3 2010-10-~ low             6   262.     0.04 regular~ muhamme~
#> 2     49     293 2012-10-~ high           49 10123.     0.07 deliver~ barry f~
#> 3     50     293 2012-10-~ high           27   245.     0.01 regular~ barry f~
#> 4     80     483 2011-07-~ high           30  4966.     0.08 regular~ clay ro~
#> 5     85     515 2010-08-~ not spe~       19   394.     0.08 regular~ carlos ~
#> 6     86     515 2010-08-~ not spe~       21   147.     0.05 regular~ carlos ~
#> # ... with more rows, and 4 more variables: province <chr>, region <chr>,
#> #   custsegment <chr>, prodcat <chr>

# Get number of rows of tbl_teradata object.
td_nrow(df)
#> integer64
#> [1] 7

# Get number of rows of tibble object.
td_nrow(df1)
#> [1] 7

# Get number of rows of R dataframe.
td_nrow(df2)
#> [1] 7

Notes

  • The nrow function of R works for dataframes and tibble objects but not for tbl_teradata objects. So, to get the number of rows in a tbl_teradata object, you should use the function td_nrow.
  • Please refer to dplyr function tally which is used to get number of the rows in R data.frames.

td_sample

The function td_sample reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in either of the three ways mentioned below:

  • Specifying list of numbers (number of rows in each sample)
  • Specifying list of fractions (proportion of the total number of rows in each sample)
  • Specifying list of numbers/fractions based on conditions (stratified sampling)

For more information about the function and its arguments, one can use the command help(td_sample) from R console.

# Reducing the number of columns.
df <- df %>% select(rowids, orderid, priority, quantity)

# Example 1: Get two samples of 3 rows and 2 rows each. Note that the column 'sampleid' takes 2 values. Each sampleid value represents one sample.
df1 <- td_sample(df = df, n = c(3, 2))

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1     49     293 high                49        1
#> 2     50     293 high                27        2
#> 3     80     483 high                30        2
#> 4     85     515 not specified       19        1
#> 5     97     613 high                12        1

# Example 2: Get a sample of 3 rows. Note that all the rows have sampleid = 1.
df1 <- td_sample(df = df, n = 3)

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1     50     293 high           27        1
#> 2     80     483 high           30        1
#> 3     97     613 high           12        1

# Example 3: Get 50% of total rows.
df1 <- td_sample(df = df, n = 0.5)

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        1
#> 2     50     293 high                27        1
#> 3     86     515 not specified       21        1
#> 4     97     613 high                12        1

# Example 4: Get two samples each containing 30% and 50% of rows.
df1 <- td_sample(df = df, n = c(0.3, 0.5))

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        2
#> 2     50     293 high                27        1
#> 3     80     483 high                30        2
#> 4     85     515 not specified       19        2
#> 5     86     515 not specified       21        2
#> 6     97     613 high                12        1

# Example 5: Get 10 rows from a tbl_teradata object of 7 rows without setting with.replacement. Default value of the argument 'with.replacement' is FALSE. Note that only 7 rows are returned as the table doesn't contain 10 rows.
df1 <- td_sample(df = df, n = 10)

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        1
#> 2     49     293 high                49        1
#> 3     50     293 high                27        1
#> 4     80     483 high                30        1
#> 5     85     515 not specified       19        1
#> 6     86     515 not specified       21        1
#> # ... with more rows

# Check the number of rows of 'df1'.
td_nrow(df1)
#> integer64
#> [1] 7

# Example 6: Get 10 rows from a tbl_teradata object of 7 rows setting with.replacement to TRUE. 'randomize = TRUE'ensures sampling is done across AMPs in large datasets. Note that 10 rows are returned and there are duplicates for some rows.
df1 <- td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1     49     293 high           49        1
#> 2     49     293 high           49        1
#> 3     49     293 high           49        1
#> 4     80     483 high           30        1
#> 5     80     483 high           30        1
#> 6     97     613 high           12        1
#> # ... with more rows

# Check the number of rows of 'df1'.
td_nrow(df1)
#> integer64
#> [1] 10

# Example 7 : Get 5 rows which satisfy the condition 'orderid < 300' from a tbl_teradata object. Here, only three rows are returned as the total number of rows which satisfy this condition is 3.
df1 <- td_sample(df, when_then = list("orderid < 300" = 5))

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1      1       3 low             6        1
#> 2     49     293 high           49        1
#> 3     50     293 high           27        1

# Check the number of rows of 'df1'.
td_nrow(df1)
#> integer64
#> [1] 3

# If with.replacement is set to TRUE, then 5 rows will be returned.
df1 <- td_sample(df, when_then = list("orderid < 300" = 5), with.replacement = TRUE)

# Check the number of rows of 'df1'.
td_nrow(df1)
#> integer64
#> [1] 5

# Example 8: Get 4 rows (1 row in first sample and 3 rows in second sample) which satisfy the condition 'orderid < 300' from a tbl_teradata object. Here, only 2 rows have sampleid = 2 as the total number of rows which satisfy this condition is 3.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3)))

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority quantity sampleid
#>    <int>   <int> <chr>       <int>    <int>
#> 1      1       3 low             6        2
#> 2     49     293 high           49        1
#> 3     50     293 high           27        2

# Check the number of rows of 'df1'.
td_nrow(df1)
#> integer64
#> [1] 3

# If with.replacement is set to TRUE, then 3 rows having sampleid = 2 will be returned.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3)), with.replacement = TRUE)

# Check the number of rows of 'df1'.
td_nrow(df1)
#> integer64
#> [1] 4

# Example 9: Using stratified sampling with multiple conditions : 4 rows (1 row in first sample and 3 rows in second sample) when orderid < 300 and 2 rows when priority != "high". Note that only 3 rows are returned when orderid < 300.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3), "priority <> 'high'" = 2))

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        2
#> 2     49     293 high                49        1
#> 3     50     293 high                27        2
#> 4     85     515 not specified       19        3
#> 5     86     515 not specified       21        3

# Example 10: Using 'case_else' argument for stratified sampling : 2 rows when orderid < 300 and 3 rows from the remaining rows (rows which doesn't satisfy orderid < 300).
df1 <- td_sample(df, when_then = list("orderid < 300" = 2), case_else = 3)

# Print the results.
df1 %>% arrange(rowids)
#> # Source:     SQL [?? x 5]
#> # Database:   Teradata
#> # Ordered by: rowids
#>   rowids orderid priority      quantity sampleid
#>    <int>   <int> <chr>            <int>    <int>
#> 1      1       3 low                  6        1
#> 2     50     293 high                27        1
#> 3     80     483 high                30        2
#> 4     86     515 not specified       21        2
#> 5     97     613 high                12        2

Notes

  • No more than 16 samples can be requested per count or fraction list i.e. the arguments cannot take a list of elements with more than 16 samples.
  • Sum of elements in the list containing fraction values should not be greater than 1 and each value should be greater than 0.
  • If the list contains a float value greater than 1, e.g. c(3, 2.4), then the floor value is considered for sampling i.e. first sample contains 3 rows and second sample contains 2 rows.
  • A new column ‘sampleid’ is added to the sampled data to determine the sample set each row belongs to in the sample. If the parent tbl object already has the column ‘sampleid’, this column is removed in the sampled data. This case occurs when multiple or consecutive sample operations are performed. To retain the ‘sampleid’ columns across multiple td_sample operations, the column must be renamed using mutate function.
  • If the number of samples requested exceed the number of rows available, the sample size is reduced to the number of remaining rows when the argument with.replacement is set to FALSE.