OTF statistics are used by the Optimizer for cardinality estimation during query planning. These stats can come from two sources:
- DBC.StatsTbl
- Metadata JSON file
If both the sources are available, the Optimizer prioritizes DBC.StatsTbl.
Following are the four primary usage scenarios:
- Key Spool Cardinality Estimation (KSCE)
- Single Table Cardinality Estimation (STCE)
- Join Cardinality Estimation (JCE)
- Aggregation Cardinality Estimation (ACE)
The cardinality estimation and confidence levels associated with STCE, JCE, and ACE on OTF tables follow established patterns, ensuring consistency with existing methodologies. For KSCE, the following mechanisms are used:
| Confidence | Estimated Cardinality | ||
|---|---|---|---|
| RET from Manifest List | NO confidence, regardless of whether statistics are collected or single-table conditions are specified, because there is no information provided about the number of manifest files available to the Optimizer | MIN of
|
|
| RET from Manifest Files | No single-table conditions are specified on partitioning columns | HIGH confidence | The number of data files |
| Single-Table conditions on partitioning columns with OTF statistics | LOW confidence | The number of data files * data row selectivity | |
| Single-Table conditions on partitioning columns without OTF statistics | NO confidence | The number of data files * 33% | |
Examples: STCE and ACE
- STCE
- Actual cardinality: 6
- Without stats: RET step shows NO confidence, estimated rows = 3
- With stats: RET step shows HIGH confidence, estimated rows = 6
- ACE
- Actual cardinality: 6
- Without stats: SUM step shows NO confidence, estimated rows = 18
- With stats: SUM step shows LOW confidence, estimated rows = 6
These examples highlight how OTF statistics improve estimation accuracy and confidence during query planning.
| With Statistics for OTF Tables | Without Statistics for OTF Tables | |
|---|---|---|
| STCE | explain
sel city from datalake_iceberg_glue.vim.t1_part
where city = 'portland';
Explanation
---------------------------------------------------------------------------
...
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from a single column
partition of datalake_iceberg_glue .vim.t1_part
by way of external metadata in Spool 3 (Last Use) with a condition
of ("datalake_iceberg_glue .vim.t1_part.city = 'portland'") into
Spool 1 (group_amps), which is built locally on the AMPs. The
size of Spool 1 is estimated with high confidence to be 6 rows (
96,126 bytes). The estimated time for this step is 0.02 seconds.
...
|
explain
sel city from datalake_iceberg_glue.vim.t1_part
where city = 'portland';
Explanation
---------------------------------------------------------------------------
...
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from a single column
partition of datalake_iceberg_glue .vim.t1_part
by way of external metadata in Spool 3 (Last Use) with a condition
of ("datalake_iceberg_glue .vim.t1_part.city = 'portland'") into
Spool 1 (group_amps), which is built locally on the AMPs. The
size of Spool 1 is estimated with no confidence to be 3 rows (
48,063 bytes). The estimated time for this step is 0.02 seconds.
...
|
| ACE | explain
sel city, count(*)
from datalake_iceberg_glue.vim.t1_part
group by city;
Explanation
---------------------------------------------------------------------------
...
5) We do an all-AMPs SUM step in TD_MAP1 to aggregate from Spool 2
(Last Use) by way of an all-rows scan, grouping by field1 (
datalake_iceberg_glue .vim.t1_part.city). Aggregate intermediate
results are computed globally, then placed in Spool 6 in TD_MAP1.
The size of Spool 6 is estimated with low confidence to be 6 rows
(128,142 bytes). The estimated time for this step is 0.03 seconds.
...
|
explain
sel city, count(*)
from datalake_iceberg_glue.vim.t1_part
group by city;
Explanation
---------------------------------------------------------------------------
...
5) We do an all-AMPs SUM step in TD_MAP1 to aggregate from Spool 2
(Last Use) by way of an all-rows scan, grouping by field1 (
datalake_iceberg_glue .vim.t1_part.city). Aggregate intermediate
results are computed globally, then placed in Spool 6 in TD_MAP1.
The size of Spool 6 is estimated with no confidence to be 18 rows
(384,426 bytes). The estimated time for this step is 0.03 seconds.
...
|