Description
This function reduces the number of rows to be considered for further processing by returning one or more samples of rows. Sampling can be done in any of the three ways listed below:
Specifying list of numbers (number of rows in each sample)
Specifying list of fractions (proportion of the total number of rows in each sample)
Specifying list of numbers/fractions based on conditions (stratified sampling)
Stratified random sampling is a sampling method that divides a heterogeneous population of
interest into homogeneous subgroups, or strata, and then takes a random sample from each of
those subgroups. The arguments 'when_then' and 'case_else' help in stratified sampling.
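The idea of stratified sampling can be sketched locally in base R. This is only an illustration on an in-memory data frame, not how td_sample is implemented; the data frame `orders` and the helper `sample_stratum` are hypothetical names introduced for this sketch.

```r
# Local illustration only: base R on an in-memory data frame, not tdplyr.
set.seed(42)
orders <- data.frame(
  orderid  = c(100, 150, 250, 300, 350, 400, 450),
  priority = c("high", "low", "high", "low", "medium", "high", "low")
)

# Hypothetical helper: draw up to `size` rows from the rows where `keep` is TRUE.
sample_stratum <- function(data, keep, size) {
  rows <- which(keep)
  take <- min(size, length(rows))  # a stratum cannot yield more rows than it has
  data[rows[sample.int(length(rows), take)], , drop = FALSE]
}

low_ids <- orders$orderid < 300
stratified <- rbind(
  sample_stratum(orders, low_ids, 2),   # plays the role of a 'when_then' entry
  sample_stratum(orders, !low_ids, 3)   # plays the role of 'case_else'
)
print(stratified)
```

Each stratum is sampled independently, so the result holds 2 rows with orderid below 300 and 3 rows from the rest, regardless of how the full table is distributed.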
Usage notes for the arguments 'n', 'case_else' and each 'then' element in the argument
'when_then':
No more than 16 samples can be requested per count or fraction list, i.e. none of these arguments accepts a list with more than 16 elements.
The sum of the elements in a list of fraction values must not be greater than 1, and each value must be greater than 0.
If the list contains a float value greater than 1, e.g. c(3, 2.4), then the floor value is considered for sampling, i.e. the first sample contains 3 rows and the second sample contains 2 rows.
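The usage notes above can be expressed as a small local check. This is a sketch, not part of tdplyr: the helper `normalize_sample_spec` is a hypothetical name, and it assumes that a list whose values are all below 1 is a fraction list and that lists mixing fractions and counts are rejected, which the notes do not specify.

```r
# Hypothetical helper, not part of tdplyr: mirrors the usage notes locally.
normalize_sample_spec <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 1, all(x > 0))
  if (length(x) > 16) stop("at most 16 samples can be requested per list")
  if (all(x < 1)) {
    # Assumption: all values below 1 means the list holds fractions.
    if (sum(x) > 1) stop("fractions must not sum to more than 1")
    return(list(type = "fraction", values = x))
  }
  # Assumption: mixed lists are outside the scope of this sketch.
  if (any(x < 1)) stop("mixing fractions and counts is not handled in this sketch")
  # Counts: non-integer values are floored, as described in the notes.
  list(type = "count", values = floor(x))
}

spec <- normalize_sample_spec(c(3, 2.4))
print(spec$values)
```

For c(3, 2.4) the helper returns the counts 3 and 2, matching the floor rule described above.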
Note:
A new column 'sampleid' is added to the sampled data to identify the sample each row belongs to. If the parent tbl object already has a column 'sampleid', that column is removed from the sampled data. This case occurs when multiple or consecutive sample operations are performed. To retain the 'sampleid' columns across multiple td_sample operations, the column must be renamed using the mutate function.
If the number of rows requested exceeds the number of rows available, the sample size is reduced to the number of remaining rows when the argument 'with.replacement' is set to FALSE.
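The two behaviours in this note can be mimicked locally in base R. This is a rough sketch under stated assumptions, not tdplyr code: `local_sample` is a hypothetical stand-in that drops an existing 'sampleid' column, assigns a new one, and caps each sample at the rows still available when sampling without replacement. Renaming through names() stands in here for the mutate-based rename that the note describes for tbl objects.

```r
# Hypothetical local stand-in for the documented behaviour; not tdplyr code.
local_sample <- function(data, n, with.replacement = FALSE) {
  # An existing 'sampleid' column is dropped before sampling.
  data <- data[, setdiff(names(data), "sampleid"), drop = FALSE]
  pool <- seq_len(nrow(data))
  pieces <- list()
  for (i in seq_along(n)) {
    # Without replacement, a sample cannot exceed the rows still in the pool.
    size <- if (with.replacement) n[i] else min(n[i], length(pool))
    if (size == 0) next
    picked <- pool[sample.int(length(pool), size, replace = with.replacement)]
    if (!with.replacement) pool <- setdiff(pool, picked)
    piece <- data[picked, , drop = FALSE]
    piece$sampleid <- i
    pieces[[length(pieces) + 1]] <- piece
  }
  do.call(rbind, pieces)
}

set.seed(1)
input <- data.frame(orderid = 1:7)
first <- local_sample(input, c(3, 2))

# Rename before sampling again so the first assignment survives.
names(first)[names(first) == "sampleid"] <- "sampleid_1"

# 10 rows are requested but only 5 are available, so 5 are returned.
second <- local_sample(first, 10)
print(second)
```

Without the rename step, the second call would discard the first 'sampleid' values and only the latest assignment would remain.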
Usage
td_sample(df = NULL, n = NULL, with.replacement = FALSE, randomize = FALSE, when_then = NULL, case_else = c())
Arguments
df | Required Argument.
n | Optional Argument.
with.replacement | Optional Argument.
randomize | Optional Argument.
when_then | Optional Argument.
case_else | Optional Argument.
Value
A 'tbl' object containing the sampled data.
See Also
sample, td_sampling
Examples
# Get remote data source connection.
con <- td_get_context()$connection

# Creates the table "antiselect_input" if it is not present already.
loadExampleData("antiselect_example", "antiselect_input")

# Creates a teradata_tbl object.
df <- tbl(con, "antiselect_input")

# The table contains 7 rows in total.

# Example 1: Get two samples of 3 rows and 2 rows each.
td_sample(df = df, n = c(3,2))

# Example 2: Get a sample of 3 rows. Note that all the rows have sampleid = 1.
td_sample(df = df, n = 3)

# Example 3: Get 50% of total rows. Here, it is 50% of 7 rows.
td_sample(df = df, n = 0.5)

# Example 4: Get 10 rows from a tbl object of 7 rows using with.replacement = TRUE.
# 'randomize = TRUE' will ensure sampling is done across AMPs in large datasets.
td_sample(df = df, n = 10, with.replacement = TRUE, randomize = TRUE)

# Example 5: Get 5 rows which satisfy the condition 'orderid < 300' from a tbl object.
# Here, only three rows are returned as the total number of rows which satisfy this
# condition is 3. If with.replacement = TRUE is specified, then 5 rows will be
# returned.
td_sample(df, when_then = list("orderid < 300" = 5))

# Example 6: Get 4 rows (1 row in first sample and 3 rows in second sample) which satisfy the
# condition 'orderid < 300' from a tbl object.
# Here, only 2 rows have sampleid = 2 as the total number of rows which satisfy
# this condition is 3. If with.replacement = TRUE is specified, then 3 rows having
# sampleid = 2 will be returned.
td_sample(df, when_then = list("orderid < 300" = c(1,3)))

# Example 7: Using stratified sampling with multiple conditions: 4 rows (1 row in first sample
# and 3 rows in second sample) when orderid < 300 and 2 rows when priority != "high".
td_sample(df, when_then = list("orderid < 300" = c(1,3), "priority <> 'high'" = 2))

# Example 8: Using 'case_else' argument for stratified sampling: 2 rows when orderid < 300 and
# 3 rows from the remaining rows (rows which don't satisfy orderid < 300).
td_sample(df, when_then = list("orderid < 300" = 2), case_else = 3)