TD_KMeans Function | kmeans | Teradata Vantage - TD_KMeans - Analytics Database

Database Analytic Functions

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database
Release Number: 17.20
Published: June 2022
Language: English (United States)
Last Update: 2024-04-06
Product Category: Teradata Vantage™
The k-means algorithm groups a set of observations into k clusters, in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes an objective function, the total squared Euclidean distance of each data point from the center of its cluster, by iterating as follows:
  1. Specify or randomly select k initial cluster centroids.
  2. Assign each data point to the cluster that has the closest centroid.
  3. Recalculate the positions of the k centroids.
  4. Repeat steps 2 and 3 until the centroids no longer move.
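
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TD_KMeans implementation: the function name `kmeans`, its parameters, and the `max_iter` safety cap are hypothetical, points are tuples of floats, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative k-means loop (hypothetical helper); returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: randomly select k initial centroids from the data.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments (and therefore centroids) no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On this small, well-separated sample the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the seed.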

The algorithm does not necessarily find the optimal configuration, because the result depends heavily on the randomly selected initial cluster centers. To reduce the effect of this limitation, you can run the function multiple times.

You can also mitigate this limitation by selecting the initial centroids with the 'KMeans++' algorithm, which chooses initial centroids that are far away from each other. This reduces the chance that several initial centroids are chosen from the same cluster. 'KMeans++' improves the overall quality of clustering and, in some cases, can also speed up the convergence of the KMeans algorithm.
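
The idea of spreading out the initial centroids can be sketched as follows. This follows the commonly published k-means++ seeding scheme (sample each new centroid with probability proportional to its squared distance from the nearest centroid already chosen); whether TD_KMeans uses exactly this weighting is an assumption. The name `kmeans_pp_init` is hypothetical, and the sketch assumes the data contains at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Illustrative k-means++ style seeding (hypothetical helper)."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to d2:
        # distant points are favored, and already chosen points have zero weight.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
print(kmeans_pp_init(points, k=2))
```

Because a point close to an existing centroid has a small weight, the second centroid is much more likely to be drawn from the other group of points than from the same one.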

The function also returns the within-cluster-squared-sum (the sum of squared distances from each point to its cluster centroid), which you can use with the Elbow method to determine an optimal number of clusters.
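
As a rough sketch of the Elbow method: compute the within-cluster sum of squares for several values of k and look for the k after which the curve flattens. The helpers `wcss` and `elbow_k` below are hypothetical, the "largest second difference" rule is only one simple heuristic for locating the bend, and the sample values are invented for illustration (they are not TD_KMeans output).

```python
import math


def wcss(points, centroids, labels):
    """Within-cluster sum of squared Euclidean distances."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def elbow_k(wcss_by_k):
    """Return the k where the WCSS curve bends most sharply.

    wcss_by_k maps consecutive k values to their WCSS; the bend at k is
    measured as (drop going into k) minus (drop going out of k).
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (wcss_by_k[prev] - wcss_by_k[cur]) - (wcss_by_k[cur] - wcss_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k


# Invented WCSS values for k = 1..5, for illustration only.
print(elbow_k({1: 100.0, 2: 55.0, 3: 12.0, 4: 10.0, 5: 9.0}))  # 3
```

In practice the curve is usually also inspected visually, since a purely numeric rule can be misled by noisy or gradually decreasing values.
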
  • This function ignores rows of the InputTable and InitialCentroidsTable that have a NULL value in any of the specified TargetColumns.
  • The function can produce deterministic output across different machine configurations if you provide the InitialCentroidsTable in the query.
  • If you do not provide the InitialCentroidsTable in the query, the function randomly samples the initial centroids from the InputTable. In this case, you can use the Seed argument to make the function output deterministic on a machine with a given configuration. However, using the Seed argument does not guarantee deterministic output across machines with different configurations.
  • This function requires the UTF8 client character set for UNICODE data.
  • This function does not support Pass Through Characters (PTCs).

    For information about PTCs, see International Character Set Support, B035-1125.

  • This function does not support KanjiSJIS or Graphic data types.
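
The NULL handling and seeding behavior described in the notes above can be mimicked outside the database with a short Python sketch. This only illustrates the concepts; it does not reproduce how the function samples rows internally, and the helper names `drop_null_rows` and `sample_initial_centroids`, as well as the sample rows, are hypothetical.

```python
import random


def drop_null_rows(rows, target_columns):
    """Keep only rows with no None (NULL) in any target column."""
    return [r for r in rows if all(r[c] is not None for c in target_columns)]


def sample_initial_centroids(rows, k, seed):
    """Seeded random sample: the same seed gives the same rows in the same environment."""
    return random.Random(seed).sample(rows, k)


rows = [
    {"id": 1, "x": 1.0, "y": 2.0},
    {"id": 2, "x": None, "y": 3.0},  # NULL in a target column: ignored
    {"id": 3, "x": 4.0, "y": 0.5},
    {"id": 4, "x": 2.5, "y": None},  # NULL in a target column: ignored
    {"id": 5, "x": 3.0, "y": 3.0},
]

clean = drop_null_rows(rows, ["x", "y"])
print([r["id"] for r in clean])  # [1, 3, 5]

# Repeating the call with the same seed returns the same initial centroids.
first = sample_initial_centroids(clean, k=2, seed=42)
second = sample_initial_centroids(clean, k=2, seed=42)
print(first == second)  # True
```

Supplying the initial centroids explicitly, rather than sampling them, removes the dependence on the random generator altogether, which is why the InitialCentroidsTable gives reproducible results across configurations.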