1.1 - 8.10 - UniqueID Syntax Element - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
1.1
8.10
Release Date
October 2019
Content Type
Programming Reference
Publication ID
B700-4003-079K
Language
English (United States)

To address the issue of nondeterministic data transfer and its impact on the repeatability and performance of the its functions, ML Engine provides the UniqueID syntax element, which is available for every function except nPath.

When used with a Seed syntax element, the UniqueID syntax element guarantees repeatable results.

When nonuniform distribution of rows across workers affects function performance, the UniqueID syntax element improves performance by guaranteeing uniform distribution. (See Determining if Data Skew Might Impact Performance.)

UniqueID Syntax Element Syntax

[ UniqueID ('[input_table_alias:]column' [,...]) ]
input_table_alias
[Required if function has multiple input tables, optional otherwise.] Alias of an input table for the function.
column [,...]
One or more columns that form a unique row identifier in the input table whose alias is input_table_alias. The data type of a column can be any DISTRIBUTIONTYPE; that is, any of the following:
  • SMALLINT
  • INTEGER
  • BIGINT
  • NUMERIC
  • VARCHAR
If the table whose alias is input_table_alias does not have a unique row identifier, you must create one. See Ways to Create Unique Identifier Column.

Examples: Single Input Table

Omit optional input_table_alias:

SELECT * FROM Sampling (
  ON students AS InputTable PARTITION BY ANY
  USING
  UniqueID ('id')
  SampleFraction('0.2')
  Seed('2')
) as dt;

Specify optional input_table_alias:

SELECT * FROM Canopy (
  ON computers_train1 AS InputTable
  USING
  UniqueID ('InputTable:id')
  LooseDistance ('1000')
  TightDistance ('500')
) AS dt;

Examples: Multiple Input Tables

Columns from multiple input tables:

SELECT * FROM KNN (
  ON computers_train1_clustered AS TrainingData
  ON computers_test1 AS TestData
  OUT TABLE OutputTable(knn_output_uid)
  USING
  K (50)
  ResponseColumn ('computer_category')
  DistanceFeatures ('price','speed','hd','ram','screen')
  VotingWeight (1)
  IDColumn ('id')
  UniqueId ('TrainingData:id','TestData:id')
) AS dt;

Column from only one of multiple input tables:

SELECT * FROM GMM (
  ON (SELECT 1) AS init_params PARTITION BY 1
  ON gmm_iris_train AS InputTable
  OUT TABLE OutputTable (gmm_output_ex1_uid)
  USING
  UniqueID ('InputTable:id')
  ClusterNum (3)
  CovarianceType ('spherical')
  MaxIterNum (10)
  PackOutput (1)
) AS dt;

Example: Nested Functions

SELECT * FROM NaiveBayesReduce (
  ON NaiveBayesMap (
    ON nb_iris_input_train AS input1
    USING
    UniqueID ('input1:id')
    ResponseColumn ('species')
    NumericInputs ('[1:4]')
  ) PARTITION BY class_nb
) AS dt;

Example: Input Table without Unique Row Identifier

Input table iris_attribute_train does not have a unique row identifier. The following two statements create a new table, iris_attr_uniqueid, which is identical to iris_attribute_train but has an additional unique id column named global_id.

CREATE MULTISET TABLE iris_attr_uniqueid (
  global_id INTEGER GENERATED ALWAYS AS IDENTITY(START WITH 1 NO CYCLE),
  pid INTEGER,
  attribute VARCHAR(30),
  attrvalue FLOAT
);

INSERT INTO iris_attr_uniqueid
  SELECT null, pid, attribute, attrvalue FROM iris_attribute_train;

Use iris_attr_uniqueid in the function call:

SELECT * FROM DecisionTree (
  ON iris_attr_uniqueid AS AttributeTable
  ON iris_response_train AS ResponseTable
  OUT TABLE OutputTable (iris_attribute_output_uid)
  OUT TABLE IntermediateSplitsTable (splits_small)
  USING
  UniqueID ('responsetable:pid','attributetable:global_id')

  NumSplits ('3')
  SplitMeasure ('gini')
  MaxDepth ('10')
  IdColumns ('pid')
  AttributeNameColumns ('attribute')
  AttributeValueColumn ('attrvalue')
  ResponseColumn ('response')
  MinNodeSize ('10')
  ApproxSplits ('false')
) AS dt;

For more examples of creating unique row identifiers, see Ways to Create Unique Identifier Column.