- InputTable
- Specifies the name of the input table that contains the list of features by which to cluster the data.
- OutputTable
- Specifies the name of the table in which to output the centroids of the clusters.
- InitialSeedTable
- [Required if NumClusters is omitted, otherwise not allowed.] Specifies the name of the input table that contains the points that serve as initial cluster centers.
Initial seeds are specified by performing KMeans|| sampling using the FixedSample function.
- NumClusters
- [Required if InitialSeedTable is omitted, otherwise not allowed.] Specifies the number of clusters. If you specify a single value, the function trains a single model with the specified number of clusters. If you specify multiple values, the function trains a model for each value.
- ModelIDColumn
- [Optional] Specifies the name of the column in the InitialSeedTable table that contains seed values for multiple models.
- InputColumns
- Specifies the input table columns to use for clustering.
- Threshold
- [Optional] Specifies the convergence threshold. When the centroids move by less than the threshold, the algorithm has converged. The value must be a nonnegative DOUBLE. Default: 0.0395.
- MaxIterNum
- [Optional] Specifies the maximum number of iterations that the algorithm runs before quitting if the convergence threshold is not met. The value must be a positive INTEGER. Default: 10.
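The interaction between Threshold and MaxIterNum can be sketched as a stopping rule. The Python below is only an illustration, not the function's implementation: the helper names `has_converged` and `run_until_converged` are hypothetical, and measuring centroid movement as the Euclidean distance of each centroid is an assumption, since the text above does not say how movement is measured.

```python
import math

def has_converged(old_centroids, new_centroids, threshold=0.0395):
    """Return True when every centroid moved by less than `threshold`.

    Assumption: movement is the Euclidean distance between a centroid's
    old and new position.
    """
    return all(
        math.dist(old, new) < threshold
        for old, new in zip(old_centroids, new_centroids)
    )

def run_until_converged(update_step, centroids, threshold=0.0395, max_iter_num=10):
    """Apply `update_step` until convergence or until `max_iter_num` iterations.

    Returns (centroids, iterations_run, converged_flag).
    """
    for iteration in range(1, max_iter_num + 1):
        new_centroids = update_step(centroids)
        if has_converged(centroids, new_centroids, threshold):
            return new_centroids, iteration, True
        centroids = new_centroids
    # Iteration limit reached without meeting the threshold.
    return centroids, max_iter_num, False

# Toy update step (hypothetical): each call halves every coordinate.
halve = lambda cs: [tuple(x / 2 for x in c) for c in cs]
result = run_until_converged(halve, [(1.0, 0.0)])
```

With the defaults, a run ends as soon as one of the two conditions is met, so a small Threshold combined with a small MaxIterNum can return centroids that have not converged.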
- Distance
- [Optional] Specifies the distance metric for numeric dimensions. Default: 'euclidean'.
- CategoricalDistance
- [Optional] Specifies the distance metric for categorical dimensions:
- overlap (Default): Distance is 0 if two points are in the same category and 1 if they are in different categories.
- hamming: Used for categories that are strings of equal length. Distance is the percentage of characters that are different.
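The two categorical metrics can be illustrated with a minimal Python sketch. The function names are hypothetical, and returning the hamming distance as a fraction between 0 and 1 (rather than a value scaled to 100) is an assumption about how the percentage is expressed.

```python
def overlap_distance(a, b):
    """Overlap metric: 0 when the categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0

def hamming_distance(a, b):
    """Hamming metric: share of character positions at which a and b differ.

    Assumption: the result is a fraction in [0, 1]. The inputs must be
    strings of equal length, as the metric's description requires.
    """
    if len(a) != len(b):
        raise ValueError("hamming distance requires strings of equal length")
    if not a:
        return 0.0  # two empty strings are identical
    return sum(x != y for x, y in zip(a, b)) / len(a)

same = overlap_distance("red", "red")            # categories match
different = overlap_distance("red", "blue")      # categories differ
partial = hamming_distance("karolin", "kathrin") # 3 of 7 positions differ
```

Overlap treats any mismatch as maximally distant, while hamming distinguishes near matches from complete mismatches, which is only meaningful when category values are fixed-length codes.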
- CategoryWeights
- [Optional] Specifies the weight of each category in the KModes distance. Each weight must be a DOUBLE value. Default behavior: All categories have equal weight.
- AsCategories
- [Optional] Specifies the input table columns that contain numeric variables to interpret as categorical variables. These columns must have numeric SQL data types. Default behavior: No numeric variables are treated as categorical variables.
- Seed
- [Optional] Specifies the random seed with which to initialize the algorithm (a LONG value). If you specify Seed:
- You must also specify SeedColumn.
- You must specify NumClusters, not InitialSeedTable.
- SeedColumn
- [Optional] Specifies the names of the InputTable columns by which to partition the input. Function calls that use the same input data, Seed, and SeedColumn output the same result. If you specify SeedColumn, you must also specify Seed. Ideally, the number of distinct values in the SeedColumn columns is the same as the number of workers in the cluster. A very large number of distinct values in these columns degrades function performance.
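The dependencies among NumClusters, InitialSeedTable, Seed, and SeedColumn described above can be checked before a call is issued. The following Python sketch is a hypothetical client-side helper (`validate_kmeans_args` is not part of the function) that assumes the arguments are collected in a dictionary keyed by argument name.

```python
def validate_kmeans_args(args):
    """Return a list of violations of the argument rules described above.

    `args` maps argument names to values; an absent key means the
    argument was omitted. An empty list means the combination is allowed.
    """
    errors = []
    has_num_clusters = "NumClusters" in args
    has_seed_table = "InitialSeedTable" in args

    # NumClusters and InitialSeedTable are mutually exclusive, and one is required.
    if has_num_clusters == has_seed_table:
        errors.append("Specify exactly one of NumClusters or InitialSeedTable.")

    # Seed and SeedColumn must be specified together.
    if ("Seed" in args) != ("SeedColumn" in args):
        errors.append("Seed and SeedColumn must be specified together.")

    # Seed is only allowed with NumClusters.
    if "Seed" in args and has_seed_table:
        errors.append("Seed requires NumClusters, not InitialSeedTable.")

    return errors

ok = validate_kmeans_args({"NumClusters": 3, "Seed": 10, "SeedColumn": "id"})
bad = validate_kmeans_args({"InitialSeedTable": "seeds", "Seed": 10})
```

Here `ok` is empty, while `bad` reports both the missing SeedColumn and the use of Seed with InitialSeedTable.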