The function has one required input table and one optional input table. The required input table contains the data points to be clustered with one dimension in each column.
| Column Name | Data Type | Description |
|---|---|---|
| dimension_i | Any | Data for dimension i. The table has columns dimension_1 through dimension_n, where n is the number of dimensions. Each dimension is a feature by which to cluster the data. You need not specify n; the function determines it automatically. |
The optional input table is the initial seed table. You must provide it if you do not provide a value in the NumClusters argument. The seed table has the same schema as the required input table.
The function allows you to try different combinations of seeds to generate multiple models simultaneously. You can then compare the model metrics to find the best model. There are two ways to generate multiple models:
- You can specify multiple values in the NumClusters argument. For example, NumClusters('3', '3', '4') fits three models: two with 3 clusters and one with 4 clusters. Repeating a value gives you several initializations for the same number of clusters, which is good practice when fitting KModes.
- You can use the function RandomSample to select multiple sets of rows from the input data table, and use these randomly selected samples as seeds. To do this, follow these steps:
  1. Run RandomSample. Assign the argument NumSample a set of values x_1, x_2, …, x_n, where n is the number of sets of rows to generate (and therefore the number of models that KModes later creates) and x_i is the number of seed rows to select for set i (and therefore the number of clusters in model i).
  2. Save the output of the RandomSample run to a table. This table has a column, set_id, that identifies each set of points.
  3. In the KModes function call, set InitialSeedTable to the name of the table you generated, and specify ModelIdColumn('set_id').
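
The idea behind seed sets can be illustrated outside the database. The following is a minimal plain-Python sketch, not the function's implementation, and it does not call RandomSample or KModes. The helper names (`make_seed_sets`, `mismatches`, `kmodes`) and the sample data are hypothetical. It assumes the textbook k-modes procedure (simple matching dissimilarity, per-column mode update), which may differ in detail from what the function does. It draws three seed sets of sizes 3, 3, and 4, tags each seed row with a `set_id`, fits one model per set, and compares the total mismatch cost.

```python
import random
from collections import Counter

def make_seed_sets(rows, sizes, rng):
    """Build a seed table: one (set_id, row) record per sampled seed row."""
    seed_table = []
    for set_id, size in enumerate(sizes):
        for row in rng.sample(rows, size):
            seed_table.append((set_id, row))
    return seed_table

def mismatches(a, b):
    """Simple matching dissimilarity: number of columns where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def kmodes(rows, seeds, max_iter=20):
    """Minimal k-modes: assign each row to its nearest mode, then update modes."""
    modes = list(seeds)
    for _ in range(max_iter):
        labels = [
            min(range(len(modes)), key=lambda k: mismatches(row, modes[k]))
            for row in rows
        ]
        new_modes = []
        for k, old_mode in enumerate(modes):
            members = [row for row, label in zip(rows, labels) if label == k]
            if not members:
                new_modes.append(old_mode)  # keep the mode of an empty cluster
                continue
            # New mode: the most frequent value in each column of the cluster.
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_modes == modes:
            break
        modes = new_modes
    cost = sum(min(mismatches(row, m) for m in modes) for row in rows)
    return modes, cost

# Hypothetical categorical data: one dimension per tuple position.
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "medium", "oval"),
    ("green", "medium", "square"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]

sizes = [3, 3, 4]          # x_1, x_2, x_3: seed rows (clusters) per model
rng = random.Random(0)     # fixed seed so the sketch is repeatable
seed_table = make_seed_sets(rows, sizes, rng)

results = {}
for set_id in range(len(sizes)):
    seeds = [row for sid, row in seed_table if sid == set_id]
    modes, cost = kmodes(rows, seeds)
    results[set_id] = (len(modes), cost)

best = min(results, key=lambda sid: results[sid][1])
print(results, "lowest-cost model:", best)
```

A lower cost always favors models with more clusters, so cost is most useful for comparing models that have the same number of clusters, such as the two 3-cluster models here.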