When connected to a Teradata database, the Sample analysis function randomly selects rows from a table or view, producing one or more samples based on a specified number of rows or a fraction of the total number of rows. The sampled rows may be stored in a single table, in a separate table for each sample, or in a single table with a view created for each sample. Options are provided for sampling with or without replacement of rows, randomized allocation or proportional allocation by AMP, and stratified or simple random sampling. These options are described more fully below.
Sampling is performed without replacement by default. This means that each row sampled in a request is unique and, once sampled, is not returned to the sampling pool for that request. It is therefore not possible to sample more rows than exist in the sampled table, and if multiple samples are requested, they are mutually exclusive. When sampling with replacement is requested, each sampled row is immediately returned to the sampling pool and may therefore be selected multiple times. If multiple samples are requested with replacement, the samples are not necessarily mutually exclusive.
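The difference between the two modes can be illustrated outside the database with a minimal Python sketch. It is not the Teradata implementation; the function name draw_samples and its parameters are hypothetical, and the table is modeled as an in-memory list of rows.

```python
import random

def draw_samples(rows, sizes, with_replacement=False, seed=None):
    """Draw one sample per entry in `sizes` from `rows` (illustrative only)."""
    rng = random.Random(seed)
    if with_replacement:
        # Every pick goes straight back into the pool, so a row may repeat
        # within a sample and may also appear in more than one sample.
        return [rng.choices(rows, k=n) for n in sizes]
    total = sum(sizes)
    if total > len(rows):
        raise ValueError("cannot sample more rows than the table holds")
    # Draw all requested rows once, without repeats, then cut the result
    # into consecutive pieces so that no two samples share a row.
    picked = rng.sample(rows, total)
    samples, start = [], 0
    for n in sizes:
        samples.append(picked[start:start + n])
        start += n
    return samples
```

Without replacement, a request for 10 and 20 rows from a 100-row table returns 30 distinct rows split into two disjoint samples, and a request for more rows than the table holds is rejected. With replacement, the same request always succeeds, but rows may repeat.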
The default row allocation method is proportional: the requested rows are allocated across the Teradata AMPs in proportion to the number of rows on each AMP. This is technically not a simple random sample because not every possible sample set can be selected. It is, however, much faster than randomized allocation, especially for large sample sizes, and should have sufficient randomness for most applications. When randomized allocation is requested, row selections are allocated across the AMPs by simulating simple random sampling, a process that can be comparatively slow.
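The idea behind proportional allocation can be sketched in Python as follows. This is a simplified model, not the database's actual algorithm: the function names are hypothetical, each AMP is modeled as a list of rows, and the largest-remainder rounding used to make the per-AMP counts add up to the request is an assumption.

```python
import random

def proportional_allocation(amp_row_counts, n):
    """Split a request for n rows across AMPs in proportion to their row counts."""
    total = sum(amp_row_counts)
    if n > total:
        raise ValueError("cannot sample more rows than the table holds")
    # Integer arithmetic: whole share per AMP plus the remainder left over.
    shares = [divmod(n * count, total) for count in amp_row_counts]
    alloc = [whole for whole, _ in shares]
    leftover = n - sum(alloc)
    # Assumed rounding rule: hand the leftover rows to the AMPs with the
    # largest remainders, one row each.
    order = sorted(range(len(shares)), key=lambda i: shares[i][1], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

def sample_by_amp(amps, n, seed=None):
    """Sample n rows without replacement, with a fixed quota per AMP."""
    rng = random.Random(seed)
    alloc = proportional_allocation([len(amp) for amp in amps], n)
    return [row for amp, k in zip(amps, alloc) for row in rng.sample(amp, k)]
```

With AMPs holding 100, 300, and 600 rows, a request for 10 rows is split 1, 3, and 6. Because the quota per AMP is fixed in advance, a sample consisting of, say, 10 rows from the first AMP can never occur, which is why the result is not strictly a simple random sample.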
By default, the Sample analysis function performs simple random sampling. This means that each possible set of the requested size has an equal probability of being selected (subject to the limitations of proportional allocation noted above). An option is, however, provided for stratified random sampling, in which the available rows are divided into groups, or strata, based on stated conditions before samples of the requested size or sizes are taken.
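Stratified sampling can be pictured with the short Python sketch below. It is illustrative only: the function name stratified_sample is hypothetical, conditions are modeled as Python predicates, and three behaviors are assumptions not stated here (a row joins the first stratum whose condition it satisfies, rows matching no condition are left out, and a request larger than a stratum is capped at the stratum size).

```python
import random

def stratified_sample(rows, strata, seed=None):
    """Sample from each stratum; `strata` is a list of (condition, size) pairs."""
    rng = random.Random(seed)
    groups = [[] for _ in strata]
    for row in rows:
        # Assumption: first matching condition wins; unmatched rows are skipped.
        for i, (condition, _) in enumerate(strata):
            if condition(row):
                groups[i].append(row)
                break
    # Assumption: never ask for more rows than a stratum contains.
    return [
        rng.sample(group, min(size, len(group)))
        for group, (_, size) in zip(groups, strata)
    ]
```

For example, with rows carrying an age value, the strata list [(lambda r: r["age"] < 30, 5), (lambda r: r["age"] >= 30, 10)] yields 5 rows drawn only from the under-30 group and 10 drawn only from the remaining rows.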
The Sample analysis is parameterized by specifying the table and column(s) to analyze, the options unique to Sample analysis, the desired results, and the SQL or Expert Options.