To address the issue of nondeterministic data transfer and its impact on the repeatability and performance of the its functions, ML Engine provides the UniqueID syntax element, which is available for every function except nPath.
When used with a Seed syntax element, the UniqueID syntax element guarantees repeatable results.
When nonuniform distribution of rows across workers affects function performance, the UniqueID syntax element improves performance by guaranteeing uniform distribution. (See Determining if Data Skew Might Impact Performance.)
UniqueID Syntax Element Syntax
[ UniqueID ('[input_table_alias:]column' [,...]) ]
- input_table_alias
- [Required if function has multiple input tables, optional otherwise.] Alias of an input table for the function.
- column [,...]
- One or more columns that form a unique row identifier in the input table whose alias is input_table_alias. The data type of a column can be any DISTRIBUTIONTYPE; that is, any of the following:
- SMALLINT
- INTEGER
- BIGINT
- NUMERIC
- VARCHAR
Examples: Single Input Table
Omit optional input_table_alias:
SELECT * FROM Sampling (
ON students AS InputTable PARTITION BY ANY
USING
UniqueID ('id')
SampleFraction('0.2')
Seed('2')
) as dt;
Specify optional input_table_alias:
SELECT * FROM Canopy ( ON computers_train1 AS InputTable USING UniqueID ('InputTable:id') LooseDistance ('1000') TightDistance ('500') ) AS dt;
Examples: Multiple Input Tables
Columns from multiple input tables:
SELECT * FROM KNN ( ON computers_train1_clustered AS TrainingData ON computers_test1 AS TestData OUT TABLE OutputTable(knn_output_uid) USING K (50) ResponseColumn ('computer_category') DistanceFeatures ('price','speed','hd','ram','screen') VotingWeight (1) IDColumn ('id') UniqueId ('TrainingData:id','TestData:id') ) AS dt;
Column from only one of multiple input tables:
SELECT * FROM GMM ( ON (SELECT 1) AS init_params PARTITION BY 1 ON gmm_iris_train AS InputTable OUT TABLE OutputTable (gmm_output_ex1_uid) USING UniqueID ('InputTable:id') ClusterNum (3) CovarianceType ('spherical') MaxIterNum (10) PackOutput (1) ) AS dt;
Example: Nested Functions
SELECT * FROM NaiveBayesReduce ( ON NaiveBayesMap ( ON nb_iris_input_train AS input1 USING UniqueID ('input1:id') ResponseColumn ('species') NumericInputs ('[1:4]') ) PARTITION BY class_nb ) AS dt;
Example: Input Table without Unique Row Identifier
Input table iris_attribute_train does not have a unique row identifier. The following two statements create a new table, iris_attr_uniqueid, which is identical to iris_attribute_train but has an additional unique id column named global_id.
CREATE MULTISET TABLE iris_attr_uniqueid (
global_id INTEGER GENERATED ALWAYS AS IDENTITY(START WITH 1 NO CYCLE),
pid INTEGER,
attribute VARCHAR(30),
attrvalue FLOAT
);
INSERT INTO iris_attr_uniqueid
SELECT null, pid, attribute, attrvalue FROM iris_attribute_train;
Use iris_attr_uniqueid in the function call:
SELECT * FROM DecisionTree ( ON iris_attr_uniqueid AS AttributeTable ON iris_response_train AS ResponseTable OUT TABLE OutputTable (iris_attribute_output_uid) OUT TABLE IntermediateSplitsTable (splits_small) USING UniqueID ('responsetable:pid','attributetable:global_id') NumSplits ('3') SplitMeasure ('gini') MaxDepth ('10') IdColumns ('pid') AttributeNameColumns ('attribute') AttributeValueColumn ('attrvalue') ResponseColumn ('response') MinNodeSize ('10') ApproxSplits ('false') ) AS dt;
For more examples of creating unique row identifiers, see Ways to Create Unique Identifier Column.