UniqueID Syntax Element - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.10

1.1

Published

October 2019

Language

English (United States)

Last Update

2019-12-31

dita:mapPath

ima1540829771750.ditamap

dita:ditavalPath

jsj1481748799576.ditaval

dita:id

B700-4003

lifecycle

Product Category

Teradata Vantage™

To address the issue of nondeterministic data transfer and its impact on the repeatability and performance of the its functions, ML Engine provides the UniqueID syntax element, which is available for every function except nPath.

When used with a Seed syntax element, the UniqueID syntax element guarantees repeatable results.

When nonuniform distribution of rows across workers affects function performance, the UniqueID syntax element improves performance by guaranteeing uniform distribution. (See Determining if Data Skew Might Impact Performance.)

UniqueID Syntax Element Syntax

[ UniqueID ('[input_table_alias:]column' [,...]) ]

input_table_alias

[Required if function has multiple input tables, optional otherwise.] Alias of an input table for the function.

column [,...]

One or more columns that form a unique row identifier in the input table whose alias is input_table_alias. The data type of a column can be any DISTRIBUTIONTYPE; that is, any of the following:

SMALLINT
INTEGER
BIGINT
NUMERIC
VARCHAR

If the table whose alias is input_table_alias does not have a unique row identifier, you must create one. See Ways to Create Unique Identifier Column.

Examples: Single Input Table

Omit optional input_table_alias:

SELECT * FROM Sampling (
  ON students AS InputTable PARTITION BY ANY
  USING
  UniqueID ('id')
  SampleFraction('0.2')
  Seed('2')
) as dt;

Specify optional input_table_alias:

SELECT * FROM Canopy (
  ON computers_train1 AS InputTable
  USING
  UniqueID ('InputTable:id')
  LooseDistance ('1000')
  TightDistance ('500')
) AS dt;

Examples: Multiple Input Tables

Columns from multiple input tables:

SELECT * FROM KNN (
  ON computers_train1_clustered AS TrainingData
  ON computers_test1 AS TestData
  OUT TABLE OutputTable(knn_output_uid)
  USING
  K (50)
  ResponseColumn ('computer_category')
  DistanceFeatures ('price','speed','hd','ram','screen')
  VotingWeight (1)
  IDColumn ('id')
  UniqueId ('TrainingData:id','TestData:id')
) AS dt;

Column from only one of multiple input tables:

SELECT * FROM GMM (
  ON (SELECT 1) AS init_params PARTITION BY 1
  ON gmm_iris_train AS InputTable
  OUT TABLE OutputTable (gmm_output_ex1_uid)
  USING
  UniqueID ('InputTable:id')
  ClusterNum (3)
  CovarianceType ('spherical')
  MaxIterNum (10)
  PackOutput (1)
) AS dt;

Example: Nested Functions

SELECT * FROM NaiveBayesReduce (
  ON NaiveBayesMap (
    ON nb_iris_input_train AS input1
    USING
    UniqueID ('input1:id')
    ResponseColumn ('species')
    NumericInputs ('[1:4]')
  ) PARTITION BY class_nb
) AS dt;

Example: Input Table without Unique Row Identifier

Input table iris_attribute_train does not have a unique row identifier. The following two statements create a new table, iris_attr_uniqueid, which is identical to iris_attribute_train but has an additional unique id column named global_id.

CREATE MULTISET TABLE iris_attr_uniqueid (
  global_id INTEGER GENERATED ALWAYS AS IDENTITY(START WITH 1 NO CYCLE),
  pid INTEGER,
  attribute VARCHAR(30),
  attrvalue FLOAT
);

INSERT INTO iris_attr_uniqueid
  SELECT null, pid, attribute, attrvalue FROM iris_attribute_train;

Use iris_attr_uniqueid in the function call:

SELECT * FROM DecisionTree (
  ON iris_attr_uniqueid AS AttributeTable
  ON iris_response_train AS ResponseTable
  OUT TABLE OutputTable (iris_attribute_output_uid)
  OUT TABLE IntermediateSplitsTable (splits_small)
  USING
  UniqueID ('responsetable:pid','attributetable:global_id')

  NumSplits ('3')
  SplitMeasure ('gini')
  MaxDepth ('10')
  IdColumns ('pid')
  AttributeNameColumns ('attribute')
  AttributeValueColumn ('attrvalue')
  ResponseColumn ('response')
  MinNodeSize ('10')
  ApproxSplits ('false')
) AS dt;

For more examples of creating unique row identifiers, see Ways to Create Unique Identifier Column.