KModes Arguments - Aster Analytics

Teradata Aster® Analytics Foundation User Guide Update 2

Product

Aster Analytics

Release Number

7.00.02

Published

September 2017

Language

English (United States)

Last Update

2018-04-17

Product Category

Software

InputTable

Specifies the name of the input table that contains the list of features by which to cluster the data.

OutputTable

Specifies the name of the table in which to output the centroids of the clusters.

InitialSeedTable

[Required if NumClusters is omitted, otherwise not allowed.] Specifies the name of the input table that contains the points that serve as initial cluster centers.

Initial seeds are specified by performing KMeans|| sampling using the FixedSample function.

NumClusters

[Required if InitialSeedTable is omitted, otherwise not allowed.] Specifies the number of clusters. If you specify a single value, the function trains a single model with the specified number of clusters. If you specify multiple values, the function trains one model for each value.

ModelIDColumn

[Optional] Specifies the name of the initial_seed_table column that contains seed values for multiple models.

InputColumns

Specifies the input table columns to use for clustering.

Threshold

[Optional] Specifies the convergence threshold. When the centroids move by less than the threshold, the algorithm has converged. The threshold must be a nonnegative DOUBLE value. Default: 0.0395.

MaxIterNum

[Optional] Specifies the maximum number of iterations that the algorithm runs before quitting if the convergence threshold is not met. The max_iterations must be a positive INTEGER. Default: 10.
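The interaction between Threshold and MaxIterNum can be sketched in Python. This is an illustration only, not the Aster implementation: the function name run_until_converged, the update_step callback, and the choice to measure movement as the largest Euclidean shift of any centroid are assumptions made for the example, and only numeric dimensions are considered.

```python
import math


def run_until_converged(centroids, update_step, threshold=0.0395, max_iter_num=10):
    """Apply update_step until centroids move less than threshold.

    Hypothetical helper: returns (centroids, iterations_run, converged).
    Movement is assumed to be the largest Euclidean shift of any centroid.
    """
    if threshold < 0:
        raise ValueError("threshold must be nonnegative")
    if max_iter_num <= 0:
        raise ValueError("max_iter_num must be a positive integer")

    for iteration in range(1, max_iter_num + 1):
        new_centroids = update_step(centroids)
        movement = max(
            math.dist(old, new) for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if movement < threshold:
            # Converged before the iteration limit was reached.
            return centroids, iteration, True

    # Iteration limit reached without meeting the threshold.
    return centroids, max_iter_num, False
```

In this sketch, a run that stops because of the iteration limit still returns the most recent centroids, but flags them as not converged.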

Distance

[Optional] Specifies the distance metric for numeric dimensions. Default: 'euclidean'.

CategoricalDistance

[Optional] Specifies the distance metric for categorical dimensions:

overlap (Default): The distance is 0 if two points are in the same category and 1 if they are in different categories.
hamming: Used for categories that are strings of equal length. The distance is the percentage of characters that differ between the two strings.
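The two categorical metrics can be sketched in Python as follows. The function names are hypothetical, and the hamming result is returned as a fraction between 0 and 1; whether the function itself scales the value differently is not stated here, so treat that as an assumption.

```python
def overlap_distance(a, b):
    """Return 0.0 if both values are the same category, otherwise 1.0."""
    return 0.0 if a == b else 1.0


def hamming_distance(a, b):
    """Return the fraction of character positions at which a and b differ.

    Both values must be strings of equal length.
    """
    if len(a) != len(b):
        raise ValueError("hamming distance requires strings of equal length")
    if not a:
        return 0.0  # two empty strings are identical
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)
```

For example, "karolin" and "kathrin" differ in 3 of 7 positions, so their hamming distance in this sketch is 3/7, while their overlap distance is 1 because the strings are not identical.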

CategoryWeights

[Optional] Specifies the weight of each category in the KModes distance. Each weight must be a DOUBLE value. Default behavior: All categories have equal weight.
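One plausible reading of category weights is a weighted sum of per-column categorical distances. The Python sketch below assumes that each weight applies to one categorical input column and that the overlap metric is used; the function name and the way the terms are combined are assumptions for illustration, not a description of the function's internals.

```python
def weighted_categorical_distance(point_a, point_b, weights=None):
    """Weighted sum of overlap distances across categorical columns.

    Hypothetical helper: with weights=None every column has weight 1.0,
    mirroring the default of equal weights.
    """
    if len(point_a) != len(point_b):
        raise ValueError("points must have the same number of columns")
    if weights is None:
        weights = [1.0] * len(point_a)
    if len(weights) != len(point_a):
        raise ValueError("one weight is required per categorical column")

    # Overlap distance per column: 0 for a match, 1 for a mismatch.
    return sum(
        w * (0.0 if a == b else 1.0)
        for w, a, b in zip(weights, point_a, point_b)
    )
```

With equal weights, two points that disagree in two columns are at distance 2.0. Raising the weight of one column makes a mismatch in that column count for more.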

AsCategories

[Optional] Specifies the input table columns that contain numeric variables to interpret as categorical variables. These columns must have numeric SQL data types. Default behavior: No numeric variables are treated as categorical variables.

Seed

[Optional] Specifies the random seed with which to initialize the algorithm (a LONG value).

If you specify Seed:

You must also specify SeedColumn.
You must specify NumClusters, not InitialSeedTable.

SeedColumn

[Optional] Specifies the names of the input_table columns by which to partition the input. Function calls that use the same input data, seed, and seed_column output the same result. If you specify SeedColumn, you must also specify Seed.

Ideally, the number of distinct values in the seed_column is the same as the number of workers in the cluster. A very large number of distinct values in the seed_column degrades function performance.
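The dependencies among these arguments can be checked before a call is issued. The Python sketch below encodes only the rules stated on this page. The function validate_kmodes_args and the dictionary representation of the arguments are hypothetical, and the sketch does not check the required table and column arguments or data types beyond Threshold and MaxIterNum.

```python
def validate_kmodes_args(args):
    """Return a list of problems found in a dict of argument names to values."""
    problems = []

    has_seed_table = "InitialSeedTable" in args
    has_num_clusters = "NumClusters" in args
    # Each of the two is required if the other is omitted, otherwise not allowed.
    if has_seed_table == has_num_clusters:
        problems.append("Specify exactly one of InitialSeedTable and NumClusters.")

    has_seed = "Seed" in args
    has_seed_column = "SeedColumn" in args
    if has_seed != has_seed_column:
        problems.append("Seed and SeedColumn must be specified together.")
    if has_seed and has_seed_table:
        problems.append("Seed requires NumClusters, not InitialSeedTable.")

    if args.get("Threshold", 0.0395) < 0:
        problems.append("Threshold must be nonnegative.")

    max_iter_num = args.get("MaxIterNum", 10)
    if not isinstance(max_iter_num, int) or max_iter_num <= 0:
        problems.append("MaxIterNum must be a positive integer.")

    return problems
```

An empty list means that none of the rules above is violated. For example, passing NumClusters together with both Seed and SeedColumn produces no problems, while passing Seed without SeedColumn produces one.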