The primary purpose of the RowDistribution function is to examine how data is distributed across vworkers. Skewed data distribution can decrease the performance (calculation speed) of some analytic functions. For more information, and for instructions on interpreting the results of this function, see Determining if Data Skew Might Impact Performance.
Another use of the RowDistribution function is to get the configuration (number of worker nodes and vworkers) of the ML Engine cluster.
RowDistribution Syntax, Version 1.6
SELECT * FROM RowDistribution (
ON { table | view | (query) }
) AS alias;
RowDistribution Input
The input table can have any schema.
RowDistribution Output
Column | Data Type | Description |
---|---|---|
task_index | INTEGER | Identifier of vworker. |
ip_address | CLOB | IP address of worker node containing vworker. |
row_count | BIGINT | Number of input table rows stored on vworker. |
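The cluster configuration can be read off the output columns by counting rows (vworkers) and distinct ip_address values (worker nodes). The following Python sketch illustrates that bookkeeping on rows already fetched by a client. The sample rows, the tuple layout, and the helper name cluster_configuration are hypothetical, and the sketch assumes the function returns exactly one row per vworker.

```python
from collections import Counter

# Hypothetical output rows as (task_index, ip_address, row_count) tuples.
# The addresses and counts are made up for illustration.
rows = [
    (0, "192.0.2.10", 100),
    (1, "192.0.2.10", 120),
    (2, "192.0.2.11", 110),
    (3, "192.0.2.11", 90),
]

def cluster_configuration(rows):
    """Summarize worker nodes and vworkers, assuming one output row per vworker."""
    # Count how many vworkers report each worker-node IP address.
    vworkers_per_node = Counter(ip for _task, ip, _count in rows)
    return {
        "worker_nodes": len(vworkers_per_node),
        "vworkers": len(rows),
        "vworkers_per_node": dict(vworkers_per_node),
    }

config = cluster_configuration(rows)
print(config)
```

Grouping by ip_address rather than by task_index is what separates the node count from the vworker count.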
RowDistribution Example
Input: housing_train, as in GLM Example: Gaussian Distribution Analysis
SQL Call:
SELECT * FROM RowDistribution ( ON housing_train ) AS dt;
Output:
 task_index |  ip_address  | row_count
------------+--------------+-----------
          0 | 10.25.17.133 |       162
          2 | 10.25.17.131 |       163
          1 | 10.25.17.129 |       167
(3 rows)
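One rough way to quantify how even this distribution is would be to compare the largest per-vworker row count with the mean. The Python sketch below applies that to the three rows of the example output. The max-over-mean ratio and the helper name skew_summary are illustrative assumptions, not the method described in Determining if Data Skew Might Impact Performance.

```python
# Rows copied from the example output: (task_index, ip_address, row_count).
example_rows = [
    (0, "10.25.17.133", 162),
    (2, "10.25.17.131", 163),
    (1, "10.25.17.129", 167),
]

def skew_summary(rows):
    """Return total rows, mean rows per vworker, and the max/mean ratio.

    The ratio is an illustrative skew indicator: 1.0 means every vworker
    holds the same number of rows, and larger values mean more skew.
    """
    counts = [count for _task, _ip, count in rows]
    total = sum(counts)
    mean = total / len(counts)
    return {
        "total_rows": total,
        "mean_rows": mean,
        "max_over_mean": max(counts) / mean,
    }

summary = skew_summary(example_rows)
print(summary["total_rows"])               # 492
print(summary["mean_rows"])                # 164.0
print(round(summary["max_over_mean"], 3))  # 1.018
```

For housing_train the ratio stays close to 1, which is consistent with the nearly equal row counts shown in the output.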