In real-world use cases, some partitions contain significantly more data than others. This is especially true in time series manipulation—a single partition in the input table can contain a time series with billions of data points.
Sometimes the input table cannot be further partitioned with the PARTITION BY clause. The most common reasons are:
- The table has no column or combination of columns that can be used to further partition the data.
- The table contains an ordered data set, and to analyze one row, a function must consider adjacent rows. Simply slicing the table makes analysis of boundary data impossible. The boundary of each subpartition must include duplicate rows from the neighboring partition.
One vworker must process an entire partition. Therefore, severe imbalance in the partitions causes severe load imbalance across vworkers.