SAMPLE Clause
Purpose
Reduces the number of rows to be considered for further processing by returning one or more samples of rows specified either as a list of fractions of the total number of rows or as a list of numbers of rows from the SELECT query.
Syntax
where:
Syntax element … |
Specifies … |
WITH REPLACEMENT |
whether sampling is done by returning each sampled row to the table for possible redundant sampling or by withholding sampled rows from resampling. If you specify WITH REPLACEMENT, then it is possible to request more samples than there are rows in the table. Sampling without replacement is the default. You select it implicitly by not specifying WITH REPLACEMENT. |
RANDOMIZED ALLOCATION |
whether rows are sampled randomly across AMPS (RANDOMIZED ALLOCATION) or proportionate to the number of qualified rows per AMP (proportional allocation). The proportional allocation option does not provide a simple random sample of the entire population. It provides a random sample stratified by AMPs, but it is much faster, especially for very large samples. Proportional is the default. You select it implicitly by not specifying RANDOMIZED ALLOCATION. |
fraction_description |
any set of unsigned floating point constant numbers in the closed interval (0,1) that specifies the percentage of rows to be sampled for a true search condition. This is a comma-separated list of fractions, the sum of which must not exceed 1. The value set specifies the percentage of the homogeneous subgroup defined by search_condition to be sampled for the report. No more than 16 samples can be requested per fraction description. |
count_description |
a set of positive integer constants that specifies the number of rows to be sampled for a true search condition. A warning is returned if there are not enough rows in the result to satisfy the sampling request completely. No more than 16 samples can be requested per count description. |
WHEN |
to test a set of conditions for truth. |
search_condition |
an evaluation predicate that defines each homogeneous subgroup in the sample set. You can only specify a scalar UDF for search_condition if it is invoked within an expression and returns a value expression. You cannot specify expressions that contain LOBs in a search condition unless you first cast them to another type or pass them to a function whose result is not a LOB (for example, casting a BLOB to BYTE or VARBYTE or casting a CLOB to CHARACTER or VARCHAR). |
THEN |
to apply the specified sampling fraction description or count description to the sample. |
ELSE |
to apply the specified sampling fraction description or count description to the sample if none of the WHEN condition predicates evaluates to true. |
END |
the termination of the WHEN … THEN … ELSE clause. |
ANSI Compliance
The SAMPLE clause is a Teradata extension to the ANSI SQL:2011 standard.
Rules and Restrictions for Using the SAMPLE Clause
The rules and restrictions are:
The condition occurs because SET tables reject attempts to insert duplicate rows into them. The result is that the INSERT portion of the INSERT … SELECT statement inserts only distinct rows into SET target tables. As a result, the number of rows inserted into the target table can be fewer than the number specified in the SAMPLE clause.
For example, if an INSERT … SELECT statement SAMPLE clause requests a sample size of 10 rows, and there are duplicate rows from the MULTISET source table in the collected sample, the system rejects the duplicate instances when it attempts to insert the sampled rows into the SET table and inserts only the distinct rows from the sample set. In other words, you could request a sample of 10 rows, but the actual number of rows inserted into the target table could be fewer than 10 if there are duplicate rows in the sampled row set.
The system does not return any warning or information message when this condition occurs.
An exception is a query where the GROUP BY clause and extended grouping option appear in a derived table or view and the SAMPLE clause appears in the outer query.
About SAMPLE
Simple Random Sampling
Simple random sampling is a procedure in which every possible set of the requested size has an equal probability of being selected.
Stratified Random Sampling
Stratified random sampling, sometimes called proportional or quota random sampling, is a sampling method that divides a heterogeneous population of interest into homogeneous subgroups, or strata, and then takes a random sample from each of those subgroups.
The result of this homogeneous stratification of the population is that stratified random sampling represents not only the overall population, but also key subgroups. For example, a retail application might divide a customer population into subgroups composed of customers who pay for their purchases with cash, those who pay by check, and those who buy on credit.
Sampling With or Without Replacement
The WITH REPLACEMENT option specifies that sampling is to be done with replacement. The default is sampling without replacement. Sampling without replacement is assumed implicitly if you do not specify WITH REPLACEMENT explicitly.
When sampling with replacement, a sampled row, once sampled, is returned to the sampling pool. As a result, a row might be sampled multiple times. Because of this, it is possible to sample more rows than the number of rows in the input. When multiple samples are requested, all the samples are from the whole population and therefore not mutually exclusive.
Sampling without replacement is analogous to selecting rows from a SET table in that each row sampled is unique, and once a row is sampled, is not returned to the sampling pool. As a result, requesting a number of samples greater than the cardinality of the table returns an error or warning. Whenever multiple samples are requested, they are mutually exclusive.
The magnitude of the difference of the results obtained by these two methods varies with the size of the sample relative to the population. The smaller the sample relative to the population, the less the difference in the results of sampling with or without replacement.
Randomized and Proportional Row Allocation
Randomized allocation means that the requested rows are allocated across the AMPs by simulating random sampling. This is a slow process, especially for large sample sizes, but it provides a simple random sample for the system as a whole.
The default row allocation method is proportional. This means that the requested rows are allocated across the AMPs as a function of the number of rows on each AMP. This method is much faster than randomized allocation, especially for large sample sizes. Because proportional allocation does not include all possible sample sets, the resulting sample is not a simple random sample, but it has sufficient randomness to suit the needs of most applications.
Note that simple random sampling, meaning that each element in the population has an equal and independent probability of being sampled, is employed for each AMP in the system regardless of the specified allocation method.
One way to decide on the appropriate allocation method for your application is to determine whether it is acceptable to stratify the sampling input across the AMPs to achieve the corresponding performance gain, or whether you need to consider the table as a whole.
The SAMPLEID value is simply 1, 2, 3, … n across n specified samples regardless of stratification. That is, for the following SAMPLE clause,
|
SAMPLE WHEN state = 'CA' THEN
|
0.3,
|
0.2
|
ELSE
|
0.5,
|
0.2
|
the SAMPLEID correspondence would be: |
1 |
2 |
|
3 |
4 |
Example 1: Using fraction_description
Suppose you want to generate three mutually exclusive sample sets of a customer table for a neural net analysis. The desired percentages are as follows:
Note that the sum does not exceed 100%.
A SELECT statement to generate the desired result looks like the following.
SELECT customer_id, age, income, marital_status, SAMPLEID
FROM customer_table
SAMPLE 0.6, 0.25, 0.15;
The result might look something like the following table.
customer_id
-----------
|
age
---
|
income
------
|
marital_status
--------------
|
SAMPLEID
--------
|
1362549
|
17
|
0
|
1
|
1
|
1362650
|
21
|
17,804
|
2
|
1
|
1362605
|
34
|
16,957
|
2
|
1
|
1362672
|
50
|
16,319
|
2
|
3
|
1362486
|
76
|
10,701
|
3
|
1
|
1362500
|
40
|
56,708
|
1
|
3
|
1362489
|
35
|
55,888
|
3
|
2
|
1362498
|
60
|
9,849
|
2
|
1
|
1362551
|
27
|
23,085
|
1
|
1
|
1362503
|
18
|
5,787
|
1
|
2
|
Sample 1 is the training group, Sample 2 is the test group, and Sample 3 is the validation group.
Example 2: Using count_description
Check if your customers are in at least 100 cities:
SELECT COUNT (DISTINCT city)
FROM (SELECT city
FROM customer_table
SAMPLE 1000) TEMP;
If customer_table is large, the SAMPLE 1000 clause would not require a full scan of the table and the sort for DISTINCT would only handle 1,000 rows.
A 1,000 row sample is more than 95 percent accurate for estimating if the number of distinct values is greater than 100.
If you were to make a similar query without including the SAMPLE clause, the query would first have to sort the large customer_table before performing the DISTINCT. For example:
SELECT COUNT (DISTINCT city)
FROM customer_table;
Examples Using Stratified Sampling
This table provides the data used in the examples that follow.
SELECT *
FROM stores;
storeid city state
----------- --------------- -----
2 Green Bay WI
7 San Diego CA
5 San Jose CA
8 Los Angeles CA
3 Madison WI
1 Racine WI
6 San Francisco CA
4 Milwaukee WI
Example 3: Stratified Sampling and Proportional Allocation Without Replacement
The following query uses proportional allocation by default to sample, without replacement, 25 percent of the rows for WI and 50 percent of the rows for CA:
SELECT city, state, SAMPLEID
FROM stores
SAMPLE WHEN state = 'WI' THEN 0.25
WHEN state = 'CA' THEN 0.5
END
ORDER BY 3;
city state SAMPLEID
--------------- ----- -----------
Milwaukee WI 1
San Diego CA 2
San Jose CA 2
Example 4: Stratified Sampling and Proportional Allocation With Replacement
The following query uses proportional allocation by default with replacement to sample two samples of 3 rows and 1 row, respectively, which are not mutually exclusive, for WI and two samples of 2 rows each for CA, which are not mutually exclusive:
SELECT city, state, SAMPLEID
FROM stores
SAMPLE WITH REPLACEMENT
WHEN state = 'WI' THEN 3, 1
WHEN state = 'CA' THEN 2, 2
END
ORDER BY 3;
city state SAMPLEID
--------------- ----- -----------
Green Bay WI 1
Madison WI 1
Madison WI 1
Racine WI 2
San Diego CA 3
San Jose CA 3
San Diego CA 4
San Jose CA 4
Example 5: Stratified Sampling With Randomized Allocation Without Replacement
The following query uses randomized allocation without replacement to sample two mutually exclusive samples of 25 percent and 50 percent, respectively, of the rows from WI and two mutually exclusive samples of 25 percent each for CA:
SELECT city, state, SAMPLEID
FROM stores
SAMPLE RANDOMIZED ALLOCATION
WHEN state = 'WI' THEN 0.25, 0.5
WHEN state = 'CA' THEN 0.25, 0.25
END
ORDER BY 3;
city state SAMPLEID
--------------- ----- -----------
Green Bay WI 1
Milwaukee WI 2
Madison WI 2
San Diego CA 3
San Francisco CA 4
Example 6: Stratified Sampling With Randomized Allocation With Replacement
The following query uses randomized allocation with replacement to sample three rows for WI, and two nonspecific, non‑WI rows:
SELECT city, state, SAMPLEID
FROM stores
SAMPLE WITH REPLACEMENT RANDOMIZED ALLOCATION
WHEN state = 'WI' THEN 3
ELSE 2
END
ORDER BY 3;
city state SAMPLEID
--------------- ----- -----------
Racine WI 1
Racine WI 1
Madison WI 1
San Diego CA 2
San Diego CA 2
Example 7: Stratified Sampling With Randomized Allocation With Replacement
The following query samples three rows with replacement using randomized allocation:
SELECT city, state
FROM stores
SAMPLE WITH REPLACEMENT RANDOMIZED ALLOCATION 3;
city state
--------------- -----
San Diego CA
Madison WI
San Jose CA
Example 8: SAMPLE and the PERIOD Value Expression
The following example shows how you can specify a SAMPLE clause on a PERIOD value expression, where period_of_stay is the PERIOD value expression.
SELECT emp_no, period_of_stay, SAMPLEID
FROM employee
SAMPLE 0.5,0.5;