Rules and Restrictions for Using the SAMPLE Clause
- If a fraction_description causes no rows to be returned, you get a warning.
- If a count_description cannot be completely satisfied, you get a warning and the sample size is reduced to the number of remaining rows.
- No more than 16 samples can be requested per fraction description or count description.
- A sampling request cannot be repeated. The identical sampling query run twice against the same data reports different rows in the result.
- Sampling can be used in a derived table, view, or INSERT SELECT to reduce the number of rows to be considered for further computation.
- If an INSERT SELECT statement specifies a SAMPLE clause that selects a set of rows from a source MULTISET table, but inserts those rows into a target SET table, and the sampled row set contains duplicates, the number of rows inserted into the target SET table may be fewer than the number requested in the SAMPLE clause.
The condition occurs because SET tables reject attempts to insert duplicate rows into them. The result is that the INSERT portion of the INSERT SELECT statement inserts only distinct rows into SET target tables. Therefore, the number of rows inserted into the target table can be fewer than the number specified in the SAMPLE clause.
For example, if an INSERT SELECT statement SAMPLE clause requests a sample size of 10 rows, and the collected sample contains duplicate rows from the MULTISET source table, the system rejects the duplicate instances when trying to insert the sampled rows into the SET table and inserts only the distinct rows from the sample set. That is, you can request a sample of 10 rows, but the actual number of rows inserted into the target table may be fewer than 10 if there are duplicate rows in the sampled row set.
The system does not return any warning or information message when this condition occurs.
- You cannot specify a SAMPLE clause in a WHERE, HAVING, QUALIFY, or ON clause of a subquery.
- You cannot specify the SAMPLE clause in a SELECT statement that uses the set operators UNION, INTERSECT, or MINUS.
- You can specify a scalar UDF for search_condition only if the UDF is invoked within an expression and returns a value expression.
- You cannot specify a SAMPLE clause in a query that specifies the GROUP BY clause and any of the following extended grouping options.
- CUBE
- ROLLUP
- GROUPING SETS
An exception is a query where the GROUP BY clause and extended grouping option appear in a derived table or view and the SAMPLE clause appears in the outer query.
- A SELECT statement that includes the TOP n operator cannot also specify the SAMPLE clause.
SAMPLE
- SAMPLE operates on the evaluated output of the table expression, which can include a WHERE clause and GROUP BY, HAVING, or QUALIFY clauses, sampling the result according to user specification.
- You can specify sampling either with or without replacement.
- You can specify sample allocation as either randomized or proportional.
- You can use the SAMPLEID expression to identify the samples to which the rows belong. See SAMPLEID Expression.
Simple Random Sampling
Simple random sampling is a procedure in which every possible set of the requested size has an equal probability of being selected.
Stratified Random Sampling
Stratified random sampling (also called proportional or quota random sampling) is a sampling method that divides a heterogeneous population of interest into homogeneous subgroups, or strata, and then takes a random sample from each of those subgroups.
The result of this homogeneous stratification of the population is that stratified random sampling represents not only the overall population, but also key subgroups. For example, a retail application may divide a customer population into subgroups composed of customers who pay for their purchases with cash, those who pay by check, and those who buy on credit.
Sampling with or without Replacement
The WITH REPLACEMENT option specifies that sampling is to be done with replacement. The default is sampling without replacement. Sampling without replacement is assumed implicitly if you do not specify WITH REPLACEMENT explicitly.
When sampling with replacement, a sampled row returns to the sampling pool. Therefore, a row may be sampled multiple times. More rows may be sampled than the number of rows in the input. When multiple samples are requested, all the samples are from the whole population and therefore not mutually exclusive.
Sampling without replacement is analogous to selecting rows from a SET table. Each sampled row is unique. Sampled rows do not return to the sampling pool. Requesting more samples than the cardinality of the table returns an error or warning. Multiple samples are mutually exclusive.
The magnitude of the difference of the results of these two methods varies with the size of the sample relative to the population. The smaller the sample relative to the population, the less the difference in the results of sampling with or without replacement.
Randomized and Proportional Row Allocation
Randomized allocation means that the requested rows are allocated across the AMPs by simulating random sampling. This process is slow, especially for large sample sizes, but provides a simple random sample for the system as a whole.
The default row allocation method is proportional. Therefore, the requested rows are allocated across the AMPs as a function of the number of rows on each AMP. This method is much faster than randomized allocation. Because proportional allocation does not include all possible sample sets, the resulting sample is not a simple random sample, but has sufficient randomness to suit the needs of most applications.
Simple random sampling, meaning that each element in the population has an equal and independent probability of being sampled, is used for each AMP in the system regardless of the specified allocation method.
One way to decide on the appropriate allocation method for your application is to determine whether stratifying the sampling input across the AMPs to achieve the corresponding performance gain is acceptable, or whether you need to consider the table as a whole.
The SAMPLEID value is simply 1, 2, 3, … n across n specified samples regardless of stratification. See SAMPLEID Expression.