TD_KMeans Function | Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last edition: 2024-12-11
The k-means algorithm groups a set of observations into k clusters, in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm seeks to minimize an objective function: the total squared Euclidean distance of all data points from the center of their assigned cluster. It proceeds as follows:
  1. Specify or randomly select k initial cluster centroids.
  2. Assign each data point to the cluster that has the closest centroid.
  3. Recalculate the positions of the k centroids.
  4. Repeat steps 2 and 3 until the centroids no longer move.
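
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the generic k-means iteration, not the TD_KMeans implementation; the function name `kmeans`, its parameters, and the sample points are hypothetical and chosen only for this example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative k-means on a list of numeric tuples (not TD_KMeans itself)."""
    rng = random.Random(seed)
    # Step 1: randomly select k distinct points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The `max_iter` cap is a safeguard added for the sketch so that the loop always terminates, even if the centroids keep shifting by small amounts.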

The algorithm does not necessarily find the optimal configuration, as it depends significantly on the initial randomly selected cluster centers. You can run TD_KMeans multiple times to reduce the effect of this limitation.
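
The dependence on the starting centroids can be seen on a small example. The sketch below runs the generic k-means iteration from two explicit starting positions and keeps the run with the lower within-cluster sum of squares. The helper `lloyd` and the four sample points are hypothetical, and the code only illustrates the idea of repeated runs; it does not call TD_KMeans.

```python
import math

def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (centroids, wcss)."""
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda c: math.dist(p, centroids[c])
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (unchanged if empty).
        updated = [
            tuple(sum(x) / len(m) for x in zip(*m)) if m else c
            for m, c in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squared distances for the final centroids.
    wcss = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, wcss

# Four corners of a wide rectangle (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
starts = [
    [(2.0, 0.0), (2.0, 1.0)],  # poor start: splits the data top/bottom
    [(0.0, 0.5), (4.0, 0.5)],  # good start: splits the data left/right
]
results = [lloyd(points, start) for start in starts]
best_centroids, best_wcss = min(results, key=lambda r: r[1])
print([w for _, w in results], best_centroids)
```

Both runs converge immediately, but the first one settles on a much worse configuration than the second, which is why comparing several runs is worthwhile.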

You can also select the initial centroids using the 'KMeans++' algorithm to reduce the effect of this limitation. 'KMeans++' chooses initial centroids that are far away from each other, which reduces the possibility of several initial centroids being chosen from the same cluster. 'KMeans++' improves the overall quality of clustering and, in some cases, can also speed up the convergence of the KMeans algorithm.
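
As a rough illustration of the idea, the sketch below follows the commonly described k-means++ seeding scheme: the first centroid is drawn uniformly at random, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. This is an assumption about the general technique, not a description of how TD_KMeans implements 'KMeans++' internally; the function name `kmeans_pp_init` and the sample data are hypothetical.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids that tend to be far apart (illustrative only)."""
    rng = random.Random(seed)
    # The first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Far-away points get proportionally more weight; points already
        # chosen have weight 0 and cannot be picked again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = kmeans_pp_init(points, k=2)
print(initial)
```

The sketch assumes the input contains at least k distinct points; if every remaining point coincides with a chosen centroid, all weights are zero and `random.choices` raises an error.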

InputTable Usage Considerations

TD_KMeans also returns the within-cluster sum of squares, which you can use with the Elbow method to determine a suitable number of clusters.
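
The sketch below shows how a within-cluster sum of squares is computed and how it behaves as k grows. The centroid sets are hand-picked group means that stand in for the results of separate clustering runs, and the names `wcss` and `mean` and the sample points are hypothetical. The Elbow method looks for the k after which the reduction becomes small.

```python
import math

def wcss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

def mean(group):
    """Coordinate-wise mean of a list of tuples."""
    return tuple(sum(x) / len(group) for x in zip(*group))

# Hypothetical sample data: two well-separated groups of three points.
low = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5)]
high = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
points = low + high

# Stand-ins for the centroids that clustering runs with k = 1, 2, 3 might return.
centroids_by_k = {
    1: [mean(points)],
    2: [mean(low), mean(high)],
    3: [mean(low), high[0], mean(high[1:])],
}
scores = {k: wcss(points, c) for k, c in centroids_by_k.items()}

# Reduction in the score gained by each additional cluster.
drops = {k: scores[k - 1] - scores[k] for k in scores if k - 1 in scores}
print(scores, drops)
```

On this data the drop from k = 1 to k = 2 is far larger than the drop from k = 2 to k = 3, so the elbow sits at k = 2.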
  • This function ignores rows in the InputTable and InitialCentroidsTable that have a NULL value in the specified TargetColumns.
  • The function can produce deterministic output across different machine configurations if you provide the InitialCentroidsTable in the query.
  • If you do not provide the InitialCentroidsTable in the query, the function randomly samples the initial centroids from the InputTable. In this case, you can use the Seed argument to make the function output deterministic on a machine with a given configuration. However, using the Seed argument does not guarantee deterministic output across machines with different configurations.
  • This function requires the UTF8 client character set for UNICODE data.
  • This function does not support Pass Through Characters (PTCs).

    For information about PTCs, see UNICODE PASS THROUGH.

  • This function does not support KanjiSJIS or Graphic data types.
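
Two of the considerations above, skipping rows with NULL values and seeded sampling of initial centroids, can be mimicked in plain Python. The sketch is only an analogy: the row layout, the column names, and the helper `sample_initial_centroids` are hypothetical, and Python's `random` module is not the generator TD_KMeans uses.

```python
import random

# Hypothetical input rows; None plays the role of SQL NULL.
rows = [
    {"id": 1, "x": 1.0, "y": 1.0},
    {"id": 2, "x": None, "y": 2.0},  # NULL in a target column: skipped
    {"id": 3, "x": 8.0, "y": 8.0},
    {"id": 4, "x": 9.0, "y": 8.5},
]
target_columns = ["x", "y"]

# Keep only rows with a value in every target column.
usable = [r for r in rows if all(r[c] is not None for c in target_columns)]

def sample_initial_centroids(candidates, k, seed):
    """Draw k rows as initial centroids using a seeded generator."""
    rng = random.Random(seed)
    return rng.sample(candidates, k)

# The same seed in the same environment yields the same initial centroids.
first = sample_initial_centroids(usable, 2, seed=42)
second = sample_initial_centroids(usable, 2, seed=42)
print([r["id"] for r in usable], first == second)
```

Supplying the initial centroids explicitly, as with the InitialCentroidsTable, removes the dependence on the random generator altogether.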