Miscellaneous tdplyr functions
Introduction to tdplyr functions
This section introduces some useful tdplyr functions and their usage. Each function has a section in this vignette. Each section can be read in isolation, except where references are made to other sections or other vignettes. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using these tdplyr specific functions with Teradata Vantage.
Once the connection is established using td_create_context, you can load the example table antiselect_input
and create tbl_teradata using the following statements:
loadExampleData("antiselect_example", "antiselect_input")
# Create tbl_teradata object.
df <- tbl(con, "antiselect_input")
# Create a tibble from tbl_teradata object.
df1 <- as_tibble(df)
# Create a R dataframe from tbl_teradata object.
df2 <- as.data.frame(df)
td_nrow
The function td_nrow
retrieves number of rows in a tibble or data.frame object or tbl_teradata object.
# Print the tbl_teradata object.
df
#> # Source: table<antiselect_input> [?? x 13]
#> # Database: Teradata
#> rowids orderid orderdate priority quantity sales discount shipmode
#> <int> <int> <chr> <chr> <int> <dbl> <dbl> <chr>
#> 1 1 3 2010-10-~ low 6 2.62e2 0.04 regular~
#> 2 85 515 2010-08-~ not spe~ 19 3.94e2 0.08 regular~
#> 3 86 515 2010-08-~ not spe~ 21 1.47e2 0.05 regular~
#> 4 49 293 2012-10-~ high 49 1.01e4 0.07 deliver~
#> 5 97 613 2011-06-~ high 12 9.35e1 0.03 regular~
#> 6 50 293 2012-10-~ high 27 2.45e2 0.01 regular~
#> # ... with more rows, and 5 more variables: custname <chr>,
#> # province <chr>, region <chr>, custsegment <chr>, prodcat <chr>
# Get number of rows of tbl_teradata object.
td_nrow(df)
#> [1] 7
# Get number of rows of tibble object.
td_nrow(df1)
#> [1] 7
# Get number of rows of R dataframe.
td_nrow(df2)
#> [1] 7
Notes
- The
nrow
function of R works for dataframes and tibble objects but not for tbl_teradata objects. So, to get the number of rows in a tbl_teradata object, you should use the functiontd_nrow
. - Please refer to dplyr function
tally
which is used to get number of the rows in R data.frames.
td_sample
The function td_sample
reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in either of the three ways mentioned below:
- Specifying list of numbers (number of rows in each sample)
- Specifying list of fractions (proportion of the total number of rows in each sample)
- Specifying list of numbers/fractions based on conditions (stratified sampling)
For more information about the function and its arguments, one can use the command help(td_sample)
from R console.
# Reducing the number of columns.
df <- df %>% select(rowids, orderid, priority, quantity)
# Example 1: Get two samples of 3 rows and 2 rows each. Note that the column 'sampleid' takes 2 values. Each sampleid value represents one sample.
td_sample(df = df, n = c(3, 2))
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 97 613 high 12 2
#> 2 86 515 not specified 21 1
#> 3 85 515 not specified 19 1
#> 4 50 293 high 27 1
#> 5 80 483 high 30 2
# Example 2: Get a sample of 3 rows. Note that all the rows have sampleid = 1.
td_sample(df = df, n = 3)
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 85 515 not specified 19 1
#> 2 97 613 high 12 1
#> 3 50 293 high 27 1
# Example 3: Get 50% of total rows.
td_sample(df = df, n = 0.5)
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 97 613 high 12 1
#> 2 86 515 not specified 21 1
#> 3 1 3 low 6 1
#> 4 80 483 high 30 1
# Example 4: Get two samples each containing 30% and 50% of rows.
td_sample(df = df, n = c(0.3, 0.5))
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 1 3 low 6 2
#> 2 85 515 not specified 19 1
#> 3 86 515 not specified 21 2
#> 4 97 613 high 12 2
#> 5 50 293 high 27 2
#> 6 80 483 high 30 1
# Example 5: Get 10 rows from a tbl_teradata object of 7 rows without setting with.replacement. Default value of the argument 'with.replacement' is FALSE. Note that only 7 rows are returned as the table doesn't contain 10 rows.
df1 <- td_sample(df = df, n = 10)
# Print tbl_teradata object.
df1
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 1 3 low 6 1
#> 2 85 515 not specified 19 1
#> 3 86 515 not specified 21 1
#> 4 49 293 high 49 1
#> 5 97 613 high 12 1
#> 6 50 293 high 27 1
#> # ... with more rows
# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 7
# Example 6: Get 10 rows from a tbl_teradata object of 7 rows setting with.replacement to TRUE. 'randomize = TRUE'ensures sampling is done across AMPs in large datasets. Note that 10 rows are returned and there are duplicates for some rows.
df1 <- td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)
# Print tbl_teradata object.
df1
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 50 293 high 27 1
#> 2 97 613 high 12 1
#> 3 97 613 high 12 1
#> 4 97 613 high 12 1
#> 5 97 613 high 12 1
#> 6 80 483 high 30 1
#> # ... with more rows
# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 10
# Example 7 : Get 5 rows which satisfy the condition 'orderid < 300' from a tbl_teradata object. Here, only three rows are returned as the total number of rows which satisfy this condition is 3.
df1 <- td_sample(df, when_then = list("orderid < 300" = 5))
# Print tbl_teradata object.
df1
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 49 293 high 49 1
#> 2 1 3 low 6 1
#> 3 50 293 high 27 1
# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 3
# If with.replacement is set to TRUE, then 5 rows will be returned.
df1 <- td_sample(df, when_then = list("orderid < 300" = 5), with.replacement = TRUE)
# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 5
# Example 8: Get 4 rows (1 row in first sample and 3 rows in second sample) which satisfy the condition 'orderid < 300' from a tbl_teradata object. Here, only 2 rows have sampleid = 2 as the total number of rows which satisfy this condition is 3.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3)))
# Print tbl_teradata object.
df1
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 49 293 high 49 2
#> 2 1 3 low 6 1
#> 3 50 293 high 27 2
# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 3
# If with.replacement is set to TRUE, then 3 rows having sampleid = 2 will be returned.
df1 <- td_sample(df, when_then = list("orderid < 300" = c(1, 3)), with.replacement = TRUE)
# Check the number of rows of 'df1'.
td_nrow(df1)
#> [1] 4
# Example 9: Using stratified sampling with multiple conditions : 4 rows (1 row in first sample and 3 rows in second sample) when orderid < 300 and 2 rows when priority != "high". Note that only 3 rows are returned when orderid < 300.
td_sample(df, when_then = list("orderid < 300" = c(1, 3), "priority <> 'high'" = 2))
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 85 515 not specified 19 3
#> 2 49 293 high 49 2
#> 3 86 515 not specified 21 3
#> 4 1 3 low 6 2
#> 5 50 293 high 27 1
# Example 10: Using 'case_else' argument for stratified sampling : 2 rows when orderid < 300 and 3 rows from the remaining rows (rows which doesn't satisfy orderid < 300).
td_sample(df, when_then = list("orderid < 300" = 2), case_else = 3)
#> # Source: SQL [?? x 5]
#> # Database: Teradata
#> rowids orderid priority quantity sampleid
#> <int> <int> <chr> <int> <int>
#> 1 85 515 not specified 19 2
#> 2 49 293 high 49 1
#> 3 86 515 not specified 21 2
#> 4 1 3 low 6 1
#> 5 80 483 high 30 2
Notes
- No more than 16 samples can be requested per count or fraction list i.e. the arguments cannot take a list of elements with more than 16 samples.
- Sum of elements in the list containing fraction values should not be greater than 1 and each value should be greater than 0.
- If the list contains a float value greater than 1, e.g.
c(3, 2.4)
, then the floor value is considered for sampling i.e. first sample contains 3 rows and second sample contains 2 rows. - A new column ‘sampleid’ is added to the sampled data to determine the sample set each row belongs to in the sample. If the parent tbl object already has the column ‘sampleid’, this column is removed in the sampled data. This case occurs when multiple or consecutive sample operations are performed. To retain the ‘sampleid’ columns across multiple
td_sample
operations, the column must be renamed usingmutate
function. - If the number of samples requested exceed the number of rows available, the sample size is reduced to the number of remaining rows when the argument
with.replacement
is set to FALSE.