- Specify or randomly select k initial cluster centroids.
- Assign each data point to the cluster that has the closest centroid.
- Recalculate the position of each of the k centroids as the mean of the data points assigned to its cluster.
- Repeat the assignment and recalculation steps until the centroids no longer move.
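The steps above can be sketched in plain Python. This is a minimal illustration of the general k-means loop, not the TD_KMeans implementation; the function name `kmeans`, its parameters, the `max_iter` guard, and the sample data are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: randomly select k input points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid (a choice
                # made for this sketch only).
                new_centroids.append(centroids[c])
        # Step 4: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical sample data: two visually separate groups of points.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Because step 1 is random, a different `seed` can lead to a different final set of centroids.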
The algorithm does not necessarily find the optimal configuration, because the result depends significantly on the randomly selected initial cluster centroids. You can run TD_KMeans multiple times to reduce the effect of this limitation.
You can also select the initial centroids using the 'KMeans++' algorithm to mitigate this limitation. 'KMeans++' is a smarter way of choosing initial centroids for the KMeans clustering algorithm: it selects initial centroids that are far away from each other, which reduces the chance that several initial centroids are chosen from the same cluster. 'KMeans++' improves the overall quality of clustering and, in some cases, can also speed up the convergence of the KMeans algorithm.
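The idea of picking initial centroids far apart can be sketched as follows. This uses the commonly described k-means++ rule (each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen). Whether TD_KMeans follows exactly this rule is not stated here, so treat it as an assumption. The function name `kmeans_pp_init` and the sample points are hypothetical, and the sketch assumes the input holds at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids with k-means++-style seeding."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to d2, so
        # distant points are favored and points that coincide with an
        # existing centroid have weight zero.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


# Hypothetical sample data: two tight groups far from each other.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (9.0, 9.0), (9.1, 8.8)]
print(kmeans_pp_init(points, k=2, seed=0))
```

With data like this, the second centroid is far more likely to come from the group that does not contain the first one, which is the behavior the paragraph above describes.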
InputTable Usage Considerations
- This function ignores InputTable and InitialCentroidsTable rows that have a NULL value in any of the specified TargetColumns.
- The function can produce deterministic output across different machine configurations if you provide the InitialCentroidsTable in the query.
- If you do not provide the InitialCentroidsTable in the query, the function randomly samples the initial centroids from the InputTable. In this case, you can use the Seed argument to make the function output deterministic on a machine with a given configuration. However, using the Seed argument does not guarantee deterministic output across machines with different configurations.
- This function requires the UTF8 client character set for UNICODE data.
- This function does not support Pass Through Characters (PTCs).
For information about PTCs, see UNICODE PASS THROUGH.
- This function does not support KanjiSJIS or Graphic data types.
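The NULL-handling and Seed considerations above can be illustrated with a small Python sketch. It is only an analogy for the documented behavior, not how TD_KMeans works internally: the row layout, column names, and helper names are hypothetical, and Python's seeded generator does not model the cross-machine caveat described for the Seed argument.

```python
import random

# Hypothetical input rows; None stands in for SQL NULL.
rows = [
    {"id": 1, "x": 1.0, "y": 2.0},
    {"id": 2, "x": None, "y": 0.5},  # NULL in a target column
    {"id": 3, "x": 3.0, "y": 4.0},
    {"id": 4, "x": 5.0, "y": None},  # NULL in a target column
    {"id": 5, "x": 6.0, "y": 1.5},
]
target_columns = ["x", "y"]


def usable_rows(rows, target_columns):
    """Keep only rows with no NULL (None) in any target column."""
    return [r for r in rows if all(r[c] is not None for c in target_columns)]


def sample_initial_centroids(rows, k, seed):
    """Sample k rows as initial centroids, reproducibly for a given seed."""
    return random.Random(seed).sample(rows, k)


clean = usable_rows(rows, target_columns)
first = sample_initial_centroids(clean, k=2, seed=42)
second = sample_initial_centroids(clean, k=2, seed=42)

print([r["id"] for r in clean])  # ids of rows that take part in clustering
print(first == second)           # same seed, same sampled centroids
```

Rows 2 and 4 are dropped before any sampling happens, and repeating the sampling with the same seed on the same machine selects the same initial centroids.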