KModes Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™
OutputTable
Specify the name of the table in which to output the centroids of the clusters.
NumClusters
[Required if you omit InitialSeedTable, disallowed otherwise.] Specify the number of clusters. If you specify a single value, the function trains a single model with the specified number of clusters. If you specify multiple values, the function trains a model for each value.
ModelIDColumn
[Optional] Specify the name of the InitialSeedTable column that contains seed values for multiple models.
TargetColumns
Specify the input table columns to use for clustering.
StopThreshold
[Optional] Specify the convergence threshold. The algorithm has converged when the centroids move by less than this threshold. The threshold must be a nonnegative DOUBLE value.
Default: 0.0395
MaxIterNum
[Optional] Specify the maximum number of iterations that the algorithm runs before stopping if the convergence threshold is not met. The value must be a positive INTEGER.
Default: 10
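The interaction between StopThreshold and MaxIterNum can be sketched as a loop with two exit conditions. This is an illustrative Python sketch, not the function's implementation: the names run_until_converged and update are hypothetical, and measuring movement as the largest Euclidean displacement of any centroid is an assumption, because this reference does not state how movement is computed.

```python
import math
from typing import Callable, List, Tuple

Centroids = List[List[float]]

def run_until_converged(
    centroids: Centroids,
    update: Callable[[Centroids], Centroids],
    stop_threshold: float = 0.0395,   # documented StopThreshold default
    max_iter_num: int = 10,           # documented MaxIterNum default
) -> Tuple[Centroids, int, bool]:
    """Return (centroids, iterations run, whether the threshold was met)."""
    if stop_threshold < 0:
        raise ValueError("StopThreshold must be nonnegative")
    if max_iter_num < 1:
        raise ValueError("MaxIterNum must be a positive integer")

    for iteration in range(1, max_iter_num + 1):
        new_centroids = update(centroids)
        # Assumption: movement is the largest distance any centroid moved.
        movement = max(
            math.dist(old, new) for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if movement < stop_threshold:
            return centroids, iteration, True
    # The iteration limit was reached before the threshold was met.
    return centroids, max_iter_num, False
```

With a toy update step that halves every coordinate, a centroid starting at 1.0 moves by 0.5, 0.25, 0.125, 0.0625, and then 0.03125, so the loop stops on the fifth iteration under the default threshold.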
NumericDistanceMethod
[Optional] Specify the distance metric for numeric dimensions.
Default: 'euclidean'
CategoricalDistanceMethod
[Optional] Specify the distance metric for categorical dimensions:
Option     Description
overlap    (Default) Distance is 0 if two points are in the same category, 1 otherwise.
hamming    For categories that are strings of equal length. Distance is the percentage of characters that differ.
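The two categorical metrics can be illustrated with a short Python sketch. The function names are hypothetical, and returning the hamming value as a fraction between 0 and 1 (rather than a 0 to 100 percentage) is an assumption, since this reference does not state the scale.

```python
def overlap_distance(a: str, b: str) -> float:
    """0 if both values are the same category, 1 otherwise."""
    return 0.0 if a == b else 1.0

def hamming_distance(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming requires strings of equal length")
    if not a:
        return 0.0  # two empty strings are identical
    differing = sum(ca != cb for ca, cb in zip(a, b))
    return differing / len(a)
```

For example, overlap treats 'abcd' and 'abcf' as simply different (distance 1), whereas hamming reports that one character in four differs (distance 0.25).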
CategoryWeights
[Optional] Specify the weight of each category in the KModes distance. Each weight must be a DOUBLE value.
Default behavior: All categories have equal weight.
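One plausible reading of category weights is a weighted sum of per-column distances. The Python sketch below assumes that reading and uses the overlap metric; the function name weighted_overlap is hypothetical, and the exact way the function combines weights is not specified in this reference.

```python
from typing import Optional, Sequence

def weighted_overlap(
    point: Sequence[str],
    centroid: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum of per-column overlap distances, each scaled by its weight."""
    if weights is None:
        # Default behavior: all categories have equal weight.
        weights = [1.0] * len(point)
    if not (len(point) == len(centroid) == len(weights)):
        raise ValueError("point, centroid, and weights must have equal length")
    # Assumption: weights combine linearly with per-column distances.
    return sum(
        w * (0.0 if p == c else 1.0)
        for p, c, w in zip(point, centroid, weights)
    )
```

Under this assumption, a mismatch in a column weighted 2.0 contributes four times as much to the distance as a mismatch in a column weighted 0.5.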
NumericAsCategorical
[Optional] Specify the input table columns that contain numeric variables to interpret as categorical variables. These columns must have numeric SQL data types.
Default behavior: No numeric variables are treated as categorical variables.
Seed
[Optional] Specify the random seed the algorithm uses for repeatable results. The seed must be a LONG value.
If you specify Seed:
  • You must also specify SeedColumn.
  • You must specify NumClusters, not InitialSeedTable.
For repeatable results, use both the Seed and UniqueID syntax elements. For more information, see Nondeterministic Results and UniqueID Syntax Element.
SeedColumn
[Optional] Specify the names of the InputTable columns by which to partition the input. Function calls that use the same input data, Seed value, and SeedColumn columns output the same result. If you specify SeedColumn, you must also specify Seed.
Ideally, the number of distinct values in the SeedColumn columns equals the number of workers in the cluster. A very large number of distinct values in these columns degrades function performance.
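The dependencies among NumClusters, InitialSeedTable, Seed, and SeedColumn described above can be summarized as a pre-flight check. This Python sketch is only an illustration of those rules; check_kmodes_arguments is a hypothetical helper, not part of the product, and the message texts are invented for the example.

```python
from typing import List, Optional, Sequence

def check_kmodes_arguments(
    num_clusters: Optional[Sequence[int]] = None,
    initial_seed_table: Optional[str] = None,
    seed: Optional[int] = None,
    seed_column: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    # NumClusters is required without InitialSeedTable and disallowed with it.
    if (num_clusters is None) == (initial_seed_table is None):
        problems.append("Specify exactly one of NumClusters and InitialSeedTable.")
    # Seed requires SeedColumn, and SeedColumn requires Seed.
    if (seed is None) != (seed_column is None):
        problems.append("Seed and SeedColumn must be specified together.")
    # Seed is only valid with NumClusters, not InitialSeedTable.
    if seed is not None and initial_seed_table is not None:
        problems.append("Seed requires NumClusters, not InitialSeedTable.")
    return problems
```

For instance, passing num_clusters=[3], seed=42, and seed_column=['id'] yields no violations, while combining a seed with an initial seed table is reported.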