Description
The Sample (td_sampling_mle) function draws rows randomly from the input table.
Usage
td_sampling_mle (
data = NULL,
data.partition.column = "ANY",
data.order.column = NULL,
summary.data = NULL,
summary.data.order.column = NULL,
stratum.column = NULL,
strata = NULL,
sample.fraction = NULL,
approx.sample.size = NULL,
seed = NULL,
data.sequence.column = NULL,
summary.data.sequence.column = NULL
)
Arguments
data |
Required Argument.
Specifies the tbl_teradata containing the data to be sampled.
|
data.partition.column |
Optional Argument.
Specifies the Partition By columns for data.
Provide the values as a vector if multiple columns are used for
partitioning.
Default Value: "ANY"
Types: character OR vector of Strings (character)
|
data.order.column |
Optional Argument.
Specifies the Order By columns for data.
Provide the values as a vector if multiple columns are used for
ordering.
Types: character OR vector of Strings (character)
|
summary.data |
Optional Argument.
Specifies the summary tbl_teradata containing the stratum count information.
|
summary.data.order.column |
Optional Argument.
Specifies the Order By columns for summary.data.
Provide the values as a vector if multiple columns are used for
ordering.
Types: character OR vector of Strings (character)
|
stratum.column |
Optional Argument.
Specifies the name of the column that contains the sample conditions.
If the function has only one input tbl_teradata, then the sampling
condition column is in the 'data' table. If the function has two input
tbl_teradata objects, 'data' and 'summary.data', then the sampling
condition column is in the 'summary.data' table.
|
strata |
Optional Argument.
Specifies the sample conditions that appear in the sampling condition
column specified by 'stratum.column'. If this argument specifies a condition
that does not appear in the sampling condition column, then the function
issues an error message.
|
sample.fraction |
Optional Argument.
Specifies one or more fractions to use in sampling the data. (Syntax
options that do not use 'sample.fraction' require 'approx.sample.size'.)
If you specify only one fraction, then the function uses that fraction
for all strata defined by the sample conditions. If you specify more
than one fraction, then the function uses each fraction to sample the
corresponding stratum defined by the sample conditions.
Note: For conditional sampling with variable sample sizes, specify one fraction
for each condition that you specify with the strata argument.
|
approx.sample.size |
Optional Argument.
Specifies one or more approximate sample sizes to use in sampling the
data. (Syntax options that do not use 'approx.sample.size' require
'sample.fraction'.) Each sample size is approximate because the
function maps the size to the sample fractions and then generates the
sample data. If you specify only one size, then it represents the
total sample size for the entire population. If you also specify the
strata argument, then the function proportionally generates sample
units for each stratum. If you specify more than one size, then each
size corresponds to a stratum, and the function uses each size to
generate sample units for the corresponding stratum.
Note: For conditional sampling with argument "approx.sample.size", specify
one size for each condition that you specify with the strata argument.
|
seed |
Optional Argument.
Specifies the random seed used to initialize the algorithm.
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
|
summary.data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "summary.data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
|
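The interaction of 'strata', 'sample.fraction' and 'seed' described above can be pictured with a small local sketch in base R. This is only an illustration under stated assumptions: the data frame, the fractions, and the helper sample_by_stratum are hypothetical, the sketch draws a fixed number of rows per stratum, and it does not claim to reproduce how td_sampling_mle selects rows in the database.

```r
# Local, base-R illustration only; not the td_sampling_mle implementation.
# Hypothetical data: 20 rows spread over three strata.
scores <- data.frame(
  id = 1:20,
  stratum = rep(c("fair", "very good", "excellent"), times = c(10, 6, 4))
)

# One fraction per stratum, in the same order as the strata vector.
strata <- c("fair", "very good", "excellent")
fractions <- c(0.2, 0.5, 0.5)

# Hypothetical helper: draw round(fraction * stratum size) rows per stratum.
sample_by_stratum <- function(df, strata, fractions, seed) {
  set.seed(seed)  # fixing the seed makes the draw repeatable
  pieces <- lapply(seq_along(strata), function(i) {
    rows <- df[df$stratum == strata[i], , drop = FALSE]
    n_take <- round(nrow(rows) * fractions[i])
    rows[sample.int(nrow(rows), n_take), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

local_sample <- sample_by_stratum(scores, strata, fractions, seed = 2)

# Number of sampled rows per stratum.
table(local_sample$stratum)
```

Calling the helper again with the same seed returns the same rows, which is the behavior the 'seed' argument is meant to provide.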
Value
The function returns an object of class "td_sampling_mle", which is a
named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result.
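As a minimal sketch of the "$" access described above, the following uses a stand-in object so that it runs without a database connection. The object sampling_out and its contents are hypothetical; with a live connection, the result member would be a Teradata tbl rather than a local data frame.

```r
# Stand-in for a returned object: a named list with one member, "result".
# With a live connection, "result" would be a Teradata tbl, not a data.frame.
sampling_out <- structure(
  list(result = data.frame(id = c(3L, 7L), stratum = c("fair", "excellent"))),
  class = "td_sampling_mle"
)

# Pull the sampled rows out of the returned object.
sampled_rows <- sampling_out$result

# List the members available on the object.
names(sampling_out)
```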
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("sampling_example", "students", "score_category")
# Create remote tibble objects.
students <- tbl(con, "students")
score_category <- tbl(con, "score_category")
# Example 1 - This example selects a sample of approximately 20%
# of the rows in the students table.
td_sampling_out1 <- td_sampling_mle(data = students,
                                    sample.fraction = 0.2,
                                    seed = 2
                                    )
# Example 2 - This example applies sampling rates 20%, 30%, and 40%
# to categories fair, very good, and excellent, respectively, and rounds
# the number sampled to the nearest integer.
td_sampling_out2 <- td_sampling_mle(data = score_category,
                                    data.partition.column = "stratum",
                                    stratum.column = "stratum",
                                    strata = c("fair", "very good", "excellent"),
                                    sample.fraction = c(0.2, 0.3, 0.4),
                                    seed = 2
                                    )
# Example 3 - This example demonstrates conditional sampling with an
# approximate sample size.
# score_summary groups the score_category table by the stratum column
# and holds the row count of each stratum.
score_summary <- score_category %>% select(stratum) %>% count(stratum) %>%
  mutate(stratum_count = as.integer(n)) %>% select(-n)
td_sampling_out3 <- td_sampling_mle(data = score_category,
                                    summary.data = score_summary,
                                    stratum.column = "stratum",
                                    strata = c("excellent", "fair", "very good"),
                                    approx.sample.size = c(5, 10, 5),
                                    seed = 2
                                    )
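The 'approx.sample.size' description says that sizes are mapped to sample fractions before sampling. The arithmetic below is one plausible reading of that mapping, shown locally in base R. The stratum counts are hypothetical (they are not the counts in the score_category example data), and the exact mapping used inside the function is an assumption.

```r
# Hypothetical stratum counts, shaped like the score_summary table above.
summary_counts <- data.frame(
  stratum = c("excellent", "fair", "very good"),
  stratum_count = c(20L, 50L, 25L)
)

# Case 1: one requested size per stratum, in the same order as the strata.
sizes <- c(5, 10, 5)
per_stratum_fraction <- sizes / summary_counts$stratum_count  # 0.25, 0.20, 0.20

# Case 2: a single total size, spread proportionally across the strata.
total_size <- 19
overall_fraction <- total_size / sum(summary_counts$stratum_count)
proportional_sizes <- overall_fraction * summary_counts$stratum_count
```

Because the sizes are converted to fractions first, the number of rows actually returned can differ slightly from the requested size, which is why the argument is described as approximate.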